Import data from Hive to ES

2025-04-05 Update From: SLTechnology News&Howtos

Big data is booming, and Hive has become the de facto standard for big data in the industry. As a result, loading Hive data into Elasticsearch (ES) is a common application scenario.

ES's official es-hadoop library can be used to import data from Hive into ES. The procedure below was tested and works.

Hive version: hive-1.1.0-cdh6.9.0

The specific steps are as follows:

Step 1: Add elasticsearch-hadoop-hive-<version>.jar to Hive

wget https://artifacts.elastic.co/downloads/elasticsearch-hadoop/elasticsearch-hadoop-6.3.0.zip
unzip elasticsearch-hadoop-6.3.0.zip
hdfs dfs -mkdir /user/test/es_hadoop/
hdfs dfs -put elasticsearch-hadoop-hive-6.3.0.jar /user/test/es_hadoop/
ADD JAR hdfs://test/user/test/es_hadoop/elasticsearch-hadoop-hive-6.3.0.jar;

Step 2: Create the Hive table

CREATE EXTERNAL TABLE elastic_table (
  uuid string,
  key1 int,
  key2 int,
  day string
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.resource' = 'index/type',
  'es.nodes' = 'serverIP:port',
  'es.index.auto.create' = 'TRUE',
  'es.mapping.id' = 'uuid'
);

Step 3: Add data

INSERT OVERWRITE TABLE elastic_table
SELECT uuid, key1, key2, day
FROM source s;

To avoid client version compatibility problems, es-hadoop imports data through ES's RESTful interface, which uses the HTTP protocol.
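To make the HTTP path concrete, here is a minimal Python sketch of the kind of newline-delimited JSON payload that the ES `_bulk` REST endpoint accepts. It is an illustration only, not es-hadoop's actual implementation: the helper names `build_bulk_body` and `send_bulk`, the sample rows, and the base URL are assumptions; the index, type, and id field mirror the table properties from Step 2.

```python
import json
import urllib.request


def build_bulk_body(index, doc_type, rows, id_field="uuid"):
    """Render rows as an NDJSON payload for the Elasticsearch _bulk endpoint."""
    lines = []
    for row in rows:
        # Action line: index the document under the value of id_field,
        # mirroring the 'es.mapping.id' = 'uuid' table property.
        action = {"index": {"_index": index, "_type": doc_type, "_id": row[id_field]}}
        lines.append(json.dumps(action))
        # Source line: the document itself.
        lines.append(json.dumps(row))
    # The bulk API expects a newline after the last line as well.
    return "\n".join(lines) + "\n"


def send_bulk(base_url, body):
    """POST the payload over plain HTTP (hypothetical helper, not called here)."""
    request = urllib.request.Request(
        base_url + "/_bulk",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


# Sample rows are made up for illustration.
rows = [
    {"uuid": "a1", "key1": 1, "key2": 2, "day": "2018-06-01"},
    {"uuid": "b2", "key1": 3, "key2": 4, "day": "2018-06-01"},
]
body = build_bulk_body("index", "type", rows)
print(body)
```

Because only HTTP and JSON are involved, the client does not need to match the cluster's internal transport protocol version.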

When using ES, the first problem is usually how to import huge amounts of data quickly. Because ES must build an inverted index over the data, the bottleneck of an import is often ES itself.

This article documents how to import data from a Hive table into ES; the power of the MapReduce cluster does the work behind the scenes. The blog series below covers how to maximize ES indexing performance, approaching the problem from the ES side.

https://qbox.io/blog/series/how-to-maximize-elasticsearch-indexing-performance

The author summarizes three points:

Create a mapping that fits the application scenario, and remove unnecessary fields such as _all and _source.

The idea is to start from the application scenario and avoid storing unnecessary information, which improves indexing performance.
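As a rough sketch of this first point, the Python snippet below builds an index-creation body with explicit field types and `_source` optionally disabled. The field types and the helper name `build_index_body` are assumptions chosen to match the Hive columns above, not settings taken from the original article. Note that disabling `_source` also gives up features that depend on it (such as reindexing from the stored documents), so it only suits scenarios that never need the original document back.

```python
import json


def build_index_body(doc_type, store_source=True):
    """Build a create-index request body with an explicit mapping."""
    mapping = {
        "properties": {
            # Assumed types for the columns of the Hive table above.
            "uuid": {"type": "keyword"},
            "key1": {"type": "integer"},
            "key2": {"type": "integer"},
            "day": {"type": "keyword"},
        }
    }
    if not store_source:
        # Do not store the original JSON document alongside the index.
        mapping["_source"] = {"enabled": False}
    return {"mappings": {doc_type: mapping}}


index_body = build_index_body("type", store_source=False)
print(json.dumps(index_body, indent=2))
```

The resulting body would be sent with a PUT request to the index URL before any data is loaded, so that ES does not fall back to dynamically guessed field types.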

Modify the ES/Lucene default settings, such as

refresh_interval

index.number_of_replicas

index.merge.scheduler.max_thread_count

index.translog.interval

indices.memory.index_buffer_size

index.index_concurrency

and other parameters. This is tuning from the cluster's perspective, and it is typically applied when importing data into ES in bulk.
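A minimal sketch of how two of these settings might be toggled around a bulk load is shown below. It only builds the requests for the index `_settings` endpoint and does not send them. The helper name `settings_request`, the base URL, and the restore values (`1s` and one replica) are assumptions; some of the other parameters listed above are node-level or version-specific, so check them against the ES version in use.

```python
import json

# Settings applied before the bulk load: no periodic refresh, no replicas.
BULK_LOAD_SETTINGS = {"index": {"refresh_interval": "-1", "number_of_replicas": 0}}

# Settings restored afterwards (assumed values; use whatever the index needs).
AFTER_LOAD_SETTINGS = {"index": {"refresh_interval": "1s", "number_of_replicas": 1}}


def settings_request(base_url, index, settings):
    """Return (method, url, body) for an index settings update."""
    return "PUT", f"{base_url}/{index}/_settings", json.dumps(settings)


for label, settings in (("before load", BULK_LOAD_SETTINGS), ("after load", AFTER_LOAD_SETTINGS)):
    method, url, body = settings_request("http://localhost:9200", "index", settings)
    print(label, method, url, body)
```

Restoring the settings after the load matters: with refresh disabled, newly indexed documents do not become searchable, and with zero replicas the index has no redundancy.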

If the first two measures still do not solve the problem, scale out the cluster, for example by increasing the number of shards.

When the cluster is large, each node's role should be simplified and focused.

For example, a data node handles only data-related tasks:

node.master: false
node.data: true
node.ingest: false

The batch size for the bulk API needs to be tested to find the best value. A suggested bulk size is 5 MB to 10 MB.
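One way to experiment with batch sizes is to group documents by serialized byte size rather than by document count. The Python sketch below does this for a hand-rolled bulk client; the helper name `chunk_by_bytes` and the sample documents are assumptions for illustration.

```python
import json


def chunk_by_bytes(docs, max_bytes):
    """Yield lists of serialized docs whose combined size stays within max_bytes.

    A single document larger than max_bytes is emitted as a batch of its own.
    """
    batch, batch_bytes = [], 0
    for doc in docs:
        line = json.dumps(doc)
        size = len(line.encode("utf-8")) + 1  # +1 for the newline separator
        if batch and batch_bytes + size > max_bytes:
            yield batch
            batch, batch_bytes = [], 0
        batch.append(line)
        batch_bytes += size
    if batch:
        yield batch


# Sample documents are made up for illustration.
docs = [{"uuid": str(i), "key1": i, "key2": i * 2, "day": "2018-06-01"} for i in range(1000)]
for target_mb in (5, 10):
    batches = list(chunk_by_bytes(docs, target_mb * 1024 * 1024))
    print(target_mb, "MB ->", len(batches), "batch(es)")
```

When loading through es-hadoop rather than a custom client, the batching is done by the connector itself; its `es.batch.size.bytes` and `es.batch.size.entries` settings are the knobs to try when testing different bulk sizes.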

Use SSD drives. While indexing data, set the number of replicas to 0.

Reference:

http://note4code.com/2016/06/17/hive-%E5%90%91-elasticsearch-%E5%AF%BC%E5%87%BA%E6%95%B0%E6%8D%AE/
