"Collected into Hive, processed by ETL, and exported to a database" is the typical data flow of a big data product. Within it, Sqoop (offline) and Kafka (real-time) are almost the standard data bus components.
But some pipelines are non-standard, such as importing Hive data into ES. For that, the official component is elasticsearch-hadoop, whose usage was covered in a previous post. So what is the implementation principle? How exactly does es-hadoop move the data from a Hive table into ES? To figure this out, we first need a local source environment.
S1: download the elasticsearch-hadoop source code.
git clone https://github.com/elastic/elasticsearch-hadoop.git
S2: compile the source code. You can compile master directly.
./gradlew distZip
S3: after the build succeeds, import the project into IntelliJ. Note that you import the build.gradle file here, just as you would import the pom file for a Maven project.
S4: compile the project once in IntelliJ.
S5: launch an ES instance locally; the default port (9200) is fine. A quick sanity check is sketched below.
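A minimal way to confirm the node is answering, using only the JDK (the URL assumes the default port):

import java.net.HttpURLConnection;
import java.net.URL;

public class EsPing {
    public static void main(String[] args) throws Exception {
        // Hit the ES root endpoint; an HTTP 200 means the node is up.
        HttpURLConnection conn =
                (HttpURLConnection) new URL("http://localhost:9200/").openConnection();
        System.out.println("ES responded with HTTP " + conn.getResponseCode());
    }
}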
S6: run the test case AbstractHiveSaveTest.testBasicSave(). Running it directly will report an error; you need to modify the code slightly and add a property to the class:
@ClassRule
public static ExternalResource hive = HiveSuite.hive;
If you are in a Windows environment, you also need to create a new package org.apache.hadoop.io.nativeio, create a NativeIO.java class under that package, and modify the code as follows:
// old
public static boolean access(String path, AccessRight desiredAccess) throws IOException {
    return access0(path, desiredAccess.accessRight());
}

// new: skip the native Windows permission check
public static boolean access(String path, AccessRight desiredAccess) throws IOException {
    return true;
}
With this, you have Hive-to-ES code running locally and can debug it to understand the detailed flow.
elasticsearch-hadoop is a relatively large project, and modifying its code in place is cumbersome, so you can create a separate project, hive-shgy, and port just enough of the test class to get testBasicSave() running.
Not being familiar with Gradle, I set up a Maven project instead. Its dependencies are as follows:
<repositories>
  <repository>
    <id>spring-libs</id>
    <url>http://repo.spring.io/libs-milestone/</url>
  </repository>
</repositories>

<dependencies>
  <dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-1.2-api</artifactId>
    <version>2.6.2</version>
    <scope>test</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-slf4j-impl</artifactId>
    <version>2.6.2</version>
    <scope>test</scope>
  </dependency>
  <dependency>
    <groupId>com.lmax</groupId>
    <artifactId>disruptor</artifactId>
    <version>3.3.6</version>
    <scope>test</scope>
  </dependency>
  <dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.11</version>
    <scope>test</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-cli</artifactId>
    <version>1.2.1</version>
    <scope>provided</scope>
    <exclusions>
      <exclusion>
        <groupId>org.apache.logging.log4j</groupId>
        <artifactId>log4j-slf4j-impl</artifactId>
      </exclusion>
      <exclusion>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
      </exclusion>
    </exclusions>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.2.0</version>
    <scope>provided</scope>
    <exclusions>
      <exclusion>
        <groupId>org.apache.logging.log4j</groupId>
        <artifactId>log4j-slf4j-impl</artifactId>
      </exclusion>
      <exclusion>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
      </exclusion>
    </exclusions>
  </dependency>
  <dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch-hadoop</artifactId>
    <version>6.3.0</version>
    <scope>test</scope>
  </dependency>
</dependencies>
Log4j2 is used here, so the logging dependencies come first.
Next, migrate the test code. The migration principle: don't add a class unless it is necessary, and if only one method of a class is used, migrate only that method. The test code migration here is really built around HiveEmbeddedServer2. What feels clever to me is that an embedded Hive instance is launched through HiveEmbeddedServer2: being able to execute Hive SQL inside a single JVM is great for studying how Hive works.
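For illustration, here is a minimal sketch of that idea. It assumes HiveEmbeddedServer2 has been copied into the project from the es-hadoop test sources, with the Properties-based constructor and the start()/execute()/stop() methods those tests use:

import java.util.List;
import java.util.Properties;

public class EmbeddedHiveDemo {
    public static void main(String[] args) throws Exception {
        // Boot a Hive instance inside this JVM (constructor signature
        // follows the es-hadoop test sources; adjust to your copy).
        HiveEmbeddedServer2 server = new HiveEmbeddedServer2(new Properties());
        server.start();
        // Hive SQL now runs locally, with no external cluster.
        server.execute("CREATE TABLE demo_src (id BIGINT, name STRING)");
        List<String> tables = server.execute("SHOW TABLES");
        System.out.println(tables);
        server.stop();
    }
}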
After the basic environment is set up, you can study the source code of elasticsearch-hadoop. First, look at the structure of the source code:
elasticsearch-hadoop/hive/src/main/java/org/elasticsearch/hadoop/hive$ tree .
.
├── EsHiveInputFormat.java
├── EsHiveOutputFormat.java
├── EsSerDe.java
├── EsStorageHandler.java
├── HiveBytesArrayWritable.java
├── HiveBytesConverter.java
├── HiveConstants.java
├── HiveFieldExtractor.java
├── HiveType.java
├── HiveUtils.java
├── HiveValueReader.java
├── HiveValueWriter.java
├── HiveWritableValueWriter.java
└── package-info.java

0 directories, 14 files
Here is a brief description of how elasticsearch-hadoop synchronizes Hive data to ES. Hive exposes a StorageHandler extension point; through a StorageHandler, you can write data to ES with SQL and read ES data back with SQL. The entry class of the whole es-hive module is therefore EsStorageHandler, which forms the skeleton of the feature. Once you know EsStorageHandler, the next important class is EsSerDe, the serialization/deserialization component that bridges ES data types and Hive data types. These are the core classes; a sketch of how they come into play follows.
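To make this concrete, here is a hedged sketch that continues the embedded-server example above. The table layout, the index name demo/doc, and the localhost node are my own assumptions; the DDL shape follows the es-hadoop documentation:

// Declaring the table with EsStorageHandler is what wires ES in.
server.execute(
    "CREATE EXTERNAL TABLE demo_es (id BIGINT, name STRING) "
  + "STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler' "
  + "TBLPROPERTIES('es.resource' = 'demo/doc', 'es.nodes' = 'localhost:9200')");
// The INSERT exercises the whole chain: EsStorageHandler supplies
// EsHiveOutputFormat, and EsSerDe turns each Hive row into a JSON document.
server.execute("INSERT OVERWRITE TABLE demo_es SELECT id, name FROM demo_src");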
Once you understand the principle and the structure of the code, you can imitate it to implement Hive-to-MongoDB or Hive-to-Redis synchronization. The advantage is that such a component is independent of any particular business: developed once, used many times, and easy to manage and maintain. A hypothetical skeleton is sketched below.
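As an illustration of the pattern, here is a hypothetical skeleton for a Redis target, mirroring how EsStorageHandler extends Hive's DefaultStorageHandler. RedisInputFormat, RedisOutputFormat and RedisSerDe do not exist; they are placeholders you would implement, for example by adapting their Es* counterparts:

import org.apache.hadoop.hive.ql.metadata.DefaultStorageHandler;
import org.apache.hadoop.hive.serde2.SerDe;
import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.OutputFormat;

public class RedisStorageHandler extends DefaultStorageHandler {
    @Override
    public Class<? extends InputFormat> getInputFormatClass() {
        return RedisInputFormat.class;   // hypothetical: reads Redis keys as Hive rows
    }

    @Override
    public Class<? extends OutputFormat> getOutputFormatClass() {
        return RedisOutputFormat.class;  // hypothetical: writes Hive rows to Redis
    }

    @Override
    public Class<? extends SerDe> getSerDeClass() {
        return RedisSerDe.class;         // hypothetical: converts Hive types <-> Redis values
    }
}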
In conclusion, this post does not give the answer directly; it records the process of finding it. Learning, through this process, how to synchronize Hive data to other NoSQL stores matters more than understanding the source code itself.