How to tune Solr performance on CDH5

2025-07-08 update. Source: SLTechnology News&Howtos. Shulou (Shulou.com), 05/31 report.

This article explains how to tune Solr performance on CDH5.

Solr performance tuning

Solr performance tuning is a complex process. This article describes the points to consider when optimizing Solr performance.

Tuning after installation is complete

Some settings are best changed immediately after installation, because changing them later forces you to rebuild the index.

Configure a required Lucene version

Set the Lucene version to the latest one supported by the Solr release you installed, so that you get the newest features and the fixes for known bugs. The version is set through luceneMatchVersion in the solrconfig.xml file.

<luceneMatchVersion>4.4</luceneMatchVersion>

The version of Lucene used by Solr in CDH5.3.2 is 4.4, and it is recommended that you do not modify this.

Schema design

When creating a schema, use the correct data type for each field, for example:

Use the tdate data type for dates, rather than storing dates as strings.

For free text, use a text type suited to your locale instead of the string type. A text field is tokenized, so it can match on part of the stored value: a query for 'John' can return 'John Smith'. A string field only returns exact matches.

For ID fields, use the string type.
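The difference between string and text matching can be sketched with a toy model. This is only an illustration and not Solr's analysis chain: the function names are hypothetical, and the whitespace tokenization and lowercasing are assumptions standing in for whatever analyzers the text field type is configured with.

```python
# Toy illustration only: real Solr analysis is configured per field type in the schema.

def string_match(stored: str, query: str) -> bool:
    """Mimic a string field: the whole stored value must equal the query."""
    return stored == query


def text_match(stored: str, query: str) -> bool:
    """Mimic a tokenized text field: every query token must occur in the stored value."""
    stored_tokens = set(stored.lower().split())
    return all(token in stored_tokens for token in query.lower().split())


print(string_match("John Smith", "John"))  # False
print(text_match("John Smith", "John"))    # True
```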

General tuning

1. For faceting queries, set facet.threads so that the facet fields are processed by multiple threads in parallel, for example:

http://localhost:8983/solr/collection1/select?q=*:*&facet=true&fl=id&facet.field=f0_ws&facet.threads=100

This request allows up to 100 threads to work concurrently. For more information on faceting, see: https://cwiki.apache.org/confluence/display/solr/Faceting
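The same request can be assembled programmatically. The following is a minimal sketch that only builds the URL and does not contact Solr; the helper name facet_url is hypothetical, and the host, collection, and field are the ones from the example above.

```python
from urllib.parse import urlencode


def facet_url(base: str, field: str, threads: int) -> str:
    """Build a faceting request URL with facet.threads set."""
    params = [
        ("q", "*:*"),
        ("facet", "true"),
        ("fl", "id"),
        ("facet.field", field),
        ("facet.threads", str(threads)),
    ]
    # Keep '*' and ':' unescaped so the match-all query stays readable.
    return base + "?" + urlencode(params, safe="*:")


print(facet_url("http://localhost:8983/solr/collection1/select", "f0_ws", 100))
```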

2. Configure the number of HDFS block cache slabs through the solr.hdfs.blockcache.slab.count parameter. By default, one block cache slab is 128 MB. It is recommended to give the block cache 10% to 20% of physical memory. For example, on a machine with 50 GB of memory that means 5 GB to 10 GB, so the slab count should be between 5 * 1024 / 128 = 40 and 10 * 1024 / 128 = 80. This parameter is referenced in the solrconfig.xml file, as shown below:

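<bool name="solr.hdfs.blockcache.enabled">${solr.hdfs.blockcache.enabled:true}</bool>
<int name="solr.hdfs.blockcache.slab.count">${solr.hdfs.blockcache.slab.count:1}</int>
<bool name="solr.hdfs.blockcache.direct.memory.allocation">${solr.hdfs.blockcache.direct.memory.allocation:true}</bool>
<int name="solr.hdfs.blockcache.blocksperbank">${solr.hdfs.blockcache.blocksperbank:16384}</int>
<bool name="solr.hdfs.blockcache.read.enabled">${solr.hdfs.blockcache.read.enabled:true}</bool>
<bool name="solr.hdfs.blockcache.write.enabled">${solr.hdfs.blockcache.write.enabled:true}</bool>
<bool name="solr.hdfs.nrtcachingdirectory.enable">${solr.hdfs.nrtcachingdirectory.enable:true}</bool>
<int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">${solr.hdfs.nrtcachingdirectory.maxmergesizemb:16}</int>
<int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">${solr.hdfs.nrtcachingdirectory.maxcachedmb:192}</int>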

Here, ${solr.hdfs.blockcache.slab.count:1} reads the solr.hdfs.blockcache.slab.count system property and falls back to 1 if the property is not set. In Cloudera Manager, this parameter is adjusted under Solr -> Configuration -> Solr Server Default Group -> Resource Management.
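The sizing rule in this item is plain arithmetic and can be sketched as follows. The function name is hypothetical; the 128 MB slab size and the 10% to 20% range are the figures given above.

```python
SLAB_MB = 128  # size of one block cache slab, as stated in the text


def slab_count_range(physical_gb: int, low_pct: int = 10, high_pct: int = 20) -> tuple:
    """Return the (min, max) slab count for the given amount of physical memory."""
    total_mb = physical_gb * 1024
    low_mb = total_mb * low_pct // 100
    high_mb = total_mb * high_pct // 100
    return low_mb // SLAB_MB, high_mb // SLAB_MB


print(slab_count_range(50))  # (40, 80)
```

Any value inside the returned range follows the recommendation; integer arithmetic is used so the result does not depend on floating-point rounding.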

3. After enlarging the HDFS block cache, increase the JVM memory as well to avoid OOM exceptions. For a manual installation, add the following configuration to /etc/default/solr (or /opt/cloudera/parcels/CDH-*/etc/default/solr if installed from parcels):

CATALINA_OPTS="-Xmx10g -XX:MaxDirectMemorySize=15g -XX:+UseLargePages -Dsolr.hdfs.blockcache.slab.count=60"

If you are using Cloudera Manager, go to Solr -> Configuration -> Solr Server Default Group -> Resource Management and set the Java heap size (in bytes) and the Java direct memory size (in bytes) of the Solr Server.

The values above are based on 50 GB of physical memory: about 20% of physical memory is recommended for Xmx and about 30% for MaxDirectMemorySize.
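The 20% and 30% rules can be turned into a small calculation that reproduces the example line for a 50 GB machine. This is a sketch under assumptions: the function name is hypothetical, and the check that the block cache slabs (128 MB each) fit into direct memory is an added sanity check, not something the original text prescribes.

```python
def catalina_opts(physical_gb: int, slab_count: int,
                  heap_pct: int = 20, direct_pct: int = 30) -> str:
    """Build a CATALINA_OPTS line from physical memory and the slab count."""
    heap_gb = physical_gb * heap_pct // 100
    direct_gb = physical_gb * direct_pct // 100
    # Assumed sanity check: the block cache slabs must fit into direct memory.
    if slab_count * 128 > direct_gb * 1024:
        raise ValueError("block cache does not fit into MaxDirectMemorySize")
    return (
        f'CATALINA_OPTS="-Xmx{heap_gb}g -XX:MaxDirectMemorySize={direct_gb}g '
        f'-XX:+UseLargePages -Dsolr.hdfs.blockcache.slab.count={slab_count}"'
    )


print(catalina_opts(50, 60))
```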

4. To improve performance, Cloudera recommends reducing the use of swap space on Linux, as follows:

# minimize swappiness
sudo sysctl vm.swappiness=10
sudo bash -c 'echo "vm.swappiness=10" >> /etc/sysctl.conf'

# disable swap space until next reboot:
sudo /sbin/swapoff -a
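To confirm that the setting took effect, the current value can be read back from /proc/sys/vm/swappiness on Linux. A minimal sketch; the helper name is hypothetical, and the threshold of 10 is the value used in the commands above.

```python
from pathlib import Path


def read_swappiness(path: str = "/proc/sys/vm/swappiness"):
    """Return the current vm.swappiness value, or None if the file does not exist."""
    p = Path(path)
    if not p.exists():
        return None
    return int(p.read_text().strip())


value = read_swappiness()
if value is None:
    print("vm.swappiness is not available on this system")
elif value <= 10:
    print(f"vm.swappiness = {value} (within the recommended value)")
else:
    print(f"vm.swappiness = {value} (higher than the recommended 10)")
```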

5. Choosing the GC mechanism that fits your environment can improve Solr performance. There are two GC mechanisms to choose from:

Concurrent low-pause collector (CMS): suited to applications where response time matters more than throughput, that can afford to share processor resources between garbage collection threads and application threads, and that hold many long-lived objects. It mainly collects the old generation. Its goal is to minimize application pause time and reduce the likelihood of a full GC, and it marks and sweeps the old generation with garbage collection threads that run concurrently with the application threads. Enable CMS with: -XX:+UseConcMarkSweepGC

Throughput collector: a garbage collection mechanism designed for maximum throughput, which mainly uses parallel collection to collect the young generation. If throughput matters more to your Solr deployment than user experience, you can use this mechanism, but its longer pauses often hurt the user experience through connection timeouts. Enable it with: -XX:+UseParallelGC

CDH5 uses CMS by default. You can change this under Solr -> Configuration -> Solr Server Default Group -> Advanced -> Java Configuration Options for Solr Server.
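The choice between the two collectors comes down to which goal has priority. A minimal sketch of that decision, using only the two JVM flags named above; the function name and the "latency"/"throughput" labels are hypothetical.

```python
GC_FLAGS = {
    "latency": "-XX:+UseConcMarkSweepGC",  # CMS: short pauses, response time first
    "throughput": "-XX:+UseParallelGC",    # parallel collector: throughput first
}


def gc_flag(priority: str) -> str:
    """Return the JVM flag for the given tuning priority."""
    if priority not in GC_FLAGS:
        raise ValueError(f"unknown priority: {priority!r}")
    return GC_FLAGS[priority]


print(gc_flag("latency"))     # -XX:+UseConcMarkSweepGC
print(gc_flag("throughput"))  # -XX:+UseParallelGC
```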

6. If you have spare hardware resources, you can improve query throughput by adding replicas. Adding replicas slightly reduces the write performance of the first replica, but this negative impact should be minimal.

7. The ramBufferSizeMB parameter in the solrconfig.xml file controls how much memory Solr uses to buffer added or deleted documents, which avoids frequent index updates. When the buffered data grows larger than this value, it is flushed to the index. A larger value consumes more memory, so make sure it is lower than the JVM memory size. Bigger is not always better: a larger buffer also makes garbage collection harder. Since the index is written to HDFS in CDH, ramBufferSizeMB should match the block cache size set through solr.hdfs.blockcache.slab.count above. If solr.hdfs.blockcache.slab.count is set to 4, set this value to 4 * 128 = 512 (128 MB being the default block size). There is also a related maxBufferedDocs parameter, which flushes to the index once the number of buffered documents exceeds the configured value. Because the size of each document is not known in advance, setting maxBufferedDocs may make ramBufferSizeMB ineffective, so enabling it is not recommended.
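The sizing rule in this item is the slab count multiplied by 128 MB, kept below the JVM heap. A minimal sketch, assuming the 128 MB figure from item 2 and a 10 GB heap as in the earlier example; the function names are hypothetical.

```python
SLAB_MB = 128  # block size used in the text


def ram_buffer_size_mb(slab_count: int, slab_mb: int = SLAB_MB) -> int:
    """ramBufferSizeMB derived from solr.hdfs.blockcache.slab.count."""
    return slab_count * slab_mb


def fits_in_heap(ram_buffer_mb: int, heap_gb: int) -> bool:
    """The buffer must stay below the JVM heap size."""
    return ram_buffer_mb < heap_gb * 1024


size = ram_buffer_size_mb(4)
print(size)                    # 512
print(fits_in_heap(size, 10))  # True
```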

8. The maxIndexingThreads parameter in the solrconfig.xml file sets the maximum number of threads that can index concurrently. If more threads than this try to index at the same time, the extra threads wait. The right value depends on the processing power of the CPU. The default value is 8.

9. The filterCache parameter in the solrconfig.xml file caches the document sets produced by filter queries (the fq query parameter). There are two query parameters, q and fq. If fq is present, Solr first evaluates fq, then q, and finally takes the intersection of the two results. When a query has several conditions, putting all of them into q gives a very low cache hit rate and takes up more memory. Moving the reusable conditions into fq and intersecting the two result sets improves performance considerably.

filterCache is enabled through the <filterCache> element, which takes the following attributes. class is the cache implementation, based on the LRU algorithm: if the cache sees more inserts than lookups, use solr.LRUCache; if it sees more lookups and fewer inserts, use solr.FastLRUCache. size is the maximum number of entries kept in the cache, and initialSize is the initial size of the cache. autowarmCount controls warming of the new SolrIndexSearcher when the SolrIndexSearcher is switched: it specifies how many entries are taken from the old SolrIndexSearcher and re-referenced in the new one. For near-real-time search it is not recommended to enable it; 0 means it is disabled.
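The interplay of size and autowarmCount can be illustrated with a toy LRU cache. This is a sketch of the general technique and not Solr's implementation: the class, its methods, and the regenerate callback are hypothetical, and the filter strings are made-up examples.

```python
from collections import OrderedDict


class ToyLRUCache:
    """Minimal LRU cache: evicts the least recently used entry once size is exceeded."""

    def __init__(self, size: int):
        self.size = size
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.size:
            self._data.popitem(last=False)  # drop the least recently used entry

    def keys(self):
        return list(self._data)

    def warm_from(self, old: "ToyLRUCache", autowarm_count: int, regenerate):
        """Re-create the autowarm_count most recently used entries of the old cache."""
        if autowarm_count <= 0:
            return  # 0 means autowarming is disabled
        for key in old.keys()[-autowarm_count:]:
            self.put(key, regenerate(key))


old = ToyLRUCache(size=3)
for fq in ["type:book", "lang:en", "year:2015", "inStock:true"]:
    old.put(fq, f"docset for {fq}")

new = ToyLRUCache(size=3)
new.warm_from(old, 2, regenerate=lambda fq: f"fresh docset for {fq}")
print(old.keys())  # ['lang:en', 'year:2015', 'inStock:true']
print(new.keys())  # ['year:2015', 'inStock:true']
```

The regenerate callback stands in for re-running the filter against the new searcher, which is why a large autowarmCount makes searcher switches slower.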

10. The useCompoundFile parameter in the solrconfig.xml file merges the multiple files of a segment into a single file. Enabling this feature adds roughly 7% to 33% to indexing time. Before version 3.6 it defaulted to true; since then it defaults to false. After setting it to false, check whether Linux limits the number of files a process may open. If there is a limit, you can raise it through ulimit.
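The open-file limit of the current process can be inspected from Python on Unix-like systems with the standard resource module. A minimal sketch; the helper names and the figure of 65536 required files are hypothetical and only illustrate the check.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limit on open files for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def can_open(required: int, soft_limit: int) -> bool:
    """True if the soft limit allows at least `required` open files."""
    return soft_limit == resource.RLIM_INFINITY or soft_limit >= required


soft, hard = open_file_limits()
print("soft limit:", soft, "hard limit:", hard)
if not can_open(65536, soft):  # 65536 is an assumed requirement
    print("consider raising the limit with ulimit -n")
```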

11. Enable local shard preference by adding preferLocalShards=true to the request. When this feature is enabled, data stored in the local shard is preferred, which reduces data transfer over the network.

12. Note that SolrCloud separates reads from writes: when a write request is sent to a replica, the replica automatically forwards the request to the leader, and the leader then distributes it to the other replicas.
