Example Analysis of the Open Source Big Data Index Project hive-solr

2025-04-28 Update From: SLTechnology News&Howtos shulou

Shulou(Shulou.com)05/31 Report--

This article presents an example analysis of hive-solr, an open source big data indexing project. It covers the latest updates, a set of test results, and the JVM and Solr settings used for the test.

Latest updates:

(1) Added support for SolrCloud clusters.

(2) Fixed a bug in the handling of empty columns and null values in Hive during deserialization.

(3) Optimized index building so that empty and null values are ignored.

Some tests:

Data volume: about 12 million records with 8 fields, one of which is a large text field and two of which are tokenized (word segmentation) fields. The data is about 20 GB before indexing.

Total indexing time: about 15 minutes, which works out to roughly 13,000 records per second.

Index size after indexing: about 6 GB per shard, about 18 GB in total.

Hive: the maximum number of concurrent map tasks is limited to 30 to avoid affecting the HBase service. Note that after indexing through Hive you need to issue one manual commit so that the in-memory index is flushed to disk.

Batch processing: each map task submits its data to Solr in batches of 100,000 records. This value is a batch size, not a commit interval, and should be set according to the situation: if it is too large, SolrCloud is prone to losing data, and if it is too small, indexing speed suffers.

SolrCloud cluster: version 5.1 on 3 machines, one shard per machine, no replicas, and 10 GB of memory for Jetty.

CPU: 24 cores. Note that tokenizing the large text field consumes a lot of CPU.
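The batching and manual-commit behaviour described in the Hive and batch processing notes above can be sketched as follows. This is a minimal illustration, not hive-solr's actual code: `send_batch` and `commit` are hypothetical callables standing in for whatever Solr client is used, and a single function plays the role of both the map task and the final manual commit.

```python
from typing import Callable, Iterable, Iterator, List

Doc = dict  # a document is modelled as a plain dict in this sketch


def batches(docs: Iterable[Doc], batch_size: int) -> Iterator[List[Doc]]:
    """Yield lists of at most batch_size documents, preserving order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch: List[Doc] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller batch
        yield batch


def index_all(
    docs: Iterable[Doc],
    send_batch: Callable[[List[Doc]], None],  # hypothetical: adds docs, no commit
    commit: Callable[[], None],               # hypothetical: one explicit commit
    batch_size: int = 100_000,                # batch size quoted in the article
) -> int:
    """Send documents in batches without committing, then commit once.

    Returns the number of batches sent.
    """
    sent = 0
    for batch in batches(docs, batch_size):
        send_batch(batch)
        sent += 1
    commit()  # the single manual commit after all batches are in
    return sent


if __name__ == "__main__":
    calls = []
    n = index_all(
        ({"id": i} for i in range(250)),
        send_batch=lambda b: calls.append(len(b)),
        commit=lambda: calls.append("commit"),
        batch_size=100,
    )
    print(n, calls)  # 3 [100, 100, 50, 'commit']
```

Keeping the batch size as a parameter makes it easy to try smaller values if data loss is observed, or larger ones if throughput is the bottleneck.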

JVM parameter tuning for Solr:

(1) Increase SurvivorRatio, which reduces the memory given to the survivor spaces.

(2) Decrease NewRatio, which increases the memory given to the young generation.

(3) Increase the permanent generation limit, MaxPermSize, to 256m.

(4) Set MaxTenuringThreshold=0 so that large objects are promoted to the old generation quickly instead of being copied back and forth between the eden and survivor spaces over several young GCs (YGC).

All other parameters keep their default values.
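To see why raising SurvivorRatio shrinks the survivor spaces while lowering NewRatio grows the young generation, the sketch below applies the usual HotSpot sizing rules (old/young = NewRatio, eden/one survivor space = SurvivorRatio). The article does not give the actual ratio values, so the numbers used here (3 and 6 as a baseline, 1 and 18 as tuned values) are assumptions chosen only for illustration, and the result is approximate because the JVM may adjust sizes adaptively.

```python
from typing import Dict, List


def generation_sizes_mb(heap_mb: float, new_ratio: int, survivor_ratio: int) -> Dict[str, float]:
    """Approximate generation sizes from the HotSpot ratio flags.

    NewRatio      = old generation / young generation
    SurvivorRatio = eden / one survivor space (there are two survivor spaces)
    """
    young = heap_mb / (new_ratio + 1)
    survivor = young / (survivor_ratio + 2)
    eden = young - 2 * survivor
    return {
        "young": young,
        "old": heap_mb - young,
        "eden": eden,
        "survivor_each": survivor,
    }


def jvm_flags(new_ratio: int, survivor_ratio: int) -> List[str]:
    """Flags matching items (1)-(4); the two ratio values are placeholders."""
    return [
        f"-XX:NewRatio={new_ratio}",
        f"-XX:SurvivorRatio={survivor_ratio}",
        "-XX:MaxPermSize=256m",  # only meaningful on JVMs that still have PermGen
        "-XX:MaxTenuringThreshold=0",
    ]


if __name__ == "__main__":
    heap_mb = 10 * 1024  # the 10 GB quoted for Jetty in the test setup
    print(generation_sizes_mb(heap_mb, new_ratio=3, survivor_ratio=6))
    # {'young': 2560.0, 'old': 7680.0, 'eden': 1920.0, 'survivor_each': 320.0}
    print(generation_sizes_mb(heap_mb, new_ratio=1, survivor_ratio=18))
    # {'young': 5120.0, 'old': 5120.0, 'eden': 4608.0, 'survivor_each': 256.0}
    print(" ".join(jvm_flags(1, 18)))
```

With these assumed values the young generation doubles while each survivor space gets smaller, which is the combined effect items (1) and (2) describe.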

Solr server configuration:

(1) Disable automatic commit.

(2) Set ramBufferSizeMB to 1000, roughly 1 GB.

(3) Set maxBufferedDocs to -1, which disables flushing based on maxBufferedDocs.

(4) Set mergeFactor to 100.
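As a rough illustration of items (2) to (4), the sketch below renders the three values as an `indexConfig` element of the kind found in `solrconfig.xml`. It assumes these settings live under `indexConfig`, as they do in Solr 5.x configurations; the helper name `build_index_config` is made up for this example, and item (1) is not covered because automatic commit is configured separately, in the update handler section.

```python
import xml.etree.ElementTree as ET


def build_index_config(
    ram_buffer_mb: int = 1000,    # item (2)
    max_buffered_docs: int = -1,  # item (3): -1 disables the document-count trigger
    merge_factor: int = 100,      # item (4)
) -> str:
    """Return an <indexConfig> fragment carrying the three index settings."""
    root = ET.Element("indexConfig")
    settings = (
        ("ramBufferSizeMB", ram_buffer_mb),
        ("maxBufferedDocs", max_buffered_docs),
        ("mergeFactor", merge_factor),
    )
    for tag, value in settings:
        ET.SubElement(root, tag).text = str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_index_config())
```

Generating the fragment from named parameters keeps the three values in one place if they need to be adjusted for a different data set.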
