How to use Hive to integrate Solr?

2025-02-24 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/03 Report--

(1) Introduction to Hive+Solr

As an offline data warehouse in the Hadoop ecosystem, Hive makes it easy to use SQL to analyze massive historical data offline and act on the results, for example by producing statistical reports.

As a high-performance search server, Solr provides fast and powerful full-text retrieval.

(2) Why integrate Hive with Solr?

Sometimes we need to store the results of Hive analysis in Solr to provide full-text search services. For example, we once had a business requirement to analyze the search logs of our e-commerce website in Hive and store them in Solr for report queries: the search-keyword field needed to be queryable both with and without word segmentation, and through segmented queries you can see a trend chart for related products over a certain period. At other times, we need to load the data in Solr into Hive and use SQL to perform join analysis. The strengths of the two complement each other and better fit our business needs. There are some open-source hive-solr integration projects on the Internet, but their versions are relatively old and they cannot run on new releases; the modified and patched version described here runs on the latest versions.

(3) How can Hive be integrated with Solr?

The so-called integration actually means rewriting some components of Hadoop's MR programming interface. As we all know, the MR programming interface is flexible and highly abstract: MR can load data not only from HDFS but also from any non-HDFS system, provided that we customize the following components:

InputFormat

OutputFormat

RecordReader

RecordWriter

InputSplit

Although this is slightly cumbersome, it makes it possible to load data from almost anywhere, including MySQL, SQL Server, Oracle, MongoDB, Solr, Elasticsearch, Redis, and so on.
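As an illustration of the contract these components follow, here is a Python sketch (not the project's actual Java code; all names are hypothetical): an InputFormat-style function carves the source into splits, and a RecordReader iterates over each split's records as key/value pairs.

```python
class InputSplit:
    """Describes one slice of the source, e.g. a page of query results."""
    def __init__(self, start, length):
        self.start = start
        self.length = length

class RecordReader:
    """Yields (key, value) records for a single split."""
    def __init__(self, split, source):
        self.records = source[split.start:split.start + split.length]

    def read(self):
        for offset, value in enumerate(self.records):
            yield offset, value

def get_splits(source, split_size):
    """The InputFormat's job: carve the whole source into splits."""
    return [InputSplit(i, min(split_size, len(source) - i))
            for i in range(0, len(source), split_size)]

source = [f"doc-{i}" for i in range(10)]
splits = get_splits(source, 4)   # three splits covering 4 + 4 + 2 records
rows = [v for s in splits for _, v in RecordReader(s, source).read()]
```

The real Java interfaces add job configuration, progress reporting, and serialization, but the division of labor is the same: the InputFormat plans the splits, the RecordReader streams one split.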

What is described above is customizing Hadoop's MR programming interface. In Hive, in addition to some of the components above, you also need to define a SerDe component and assemble a StorageHandler. In Hive, SerDe stands for Serializer and Deserializer, that is, serialization and deserialization; Hive uses the SerDe together with the FileFormat to read and write rows of data in a Hive table.

The process of reading:

HDFS files / any source --> InputFileFormat --> <key, value> --> Deserializer --> Row object

The process of writing:

Row object --> Serializer --> <key, value> --> OutputFileFormat --> HDFS files / any source
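The two paths above form a round trip, sketched below in Python (an illustration with hypothetical column names matching the later example table, not Hive's actual Java SerDe interface): the deserializer turns a raw delimited record into a row object, and the serializer turns a row object back into a raw record.

```python
FIELDS = ["rowkey", "sname"]  # hypothetical column names for illustration

def deserialize(raw_line, sep=","):
    """Read path: raw record -> row object (here, a dict keyed by column name)."""
    values = raw_line.rstrip("\n").split(sep)
    return dict(zip(FIELDS, values))

def serialize(row, sep=","):
    """Write path: row object -> raw record handed to the OutputFormat."""
    return sep.join(str(row[f]) for f in FIELDS)

row = deserialize("001,apple")        # {"rowkey": "001", "sname": "apple"}
line = serialize(row)                 # "001,apple"
```

A real SerDe works against Hive's ObjectInspector machinery rather than dicts and strings, but the responsibility split is the same.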

(4) What can Hive do after integrating Solr?

(1) Read Solr data and use the SQL syntax supported by Hive to perform all kinds of aggregation, statistics, analysis, joins, and so on.

(2) Generate a Solr index: with a single SQL statement, you can build an index for large-scale data by means of MR.

(5) How to install, deploy, and use it?

The source code is not pasted here; it has been uploaded to GitHub. Those who need it can run git clone https://github.com/qindongliang/hive-solr, modify a few pom files, and execute

mvn clean package

to build the jar package, then copy the jar to Hive's lib directory.

Examples are as follows:

(1) Hive reads Solr data

Create the table:

SQL code

-- delete the table if it exists
drop table if exists solr;
-- create an external table
create external table solr (
  -- define fields; the names need to match those in Solr
  rowkey string,
  sname string
)
-- define the storage handler
stored by "com.easy.hive.store.SolrStorageHandler"
-- configure the solr properties
tblproperties ('solr.url' = 'http://192.168.1.28:8983/solr/a',
'solr.query' = '*:*',
'solr.cursor.batch.size' = '10000',
'solr.primary_key' = 'rowkey'
);
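The 'solr.cursor.batch.size' property suggests that the handler pages through results in fixed-size batches, in the style of Solr's cursorMark deep paging (an assumption; the article does not show the handler's internals). A minimal Python sketch of that pattern against an in-memory stand-in for Solr:

```python
def fetch_page(docs, cursor, batch_size):
    """Stand-in for one Solr request: returns a batch plus the next cursor.
    A real request would send cursorMark=<cursor>&rows=<batch_size> with a
    stable sort on the unique key."""
    page = docs[cursor:cursor + batch_size]
    return page, cursor + len(page)

def read_all(docs, batch_size):
    """Keep requesting until the cursor stops advancing (Solr's end
    condition is nextCursorMark == cursorMark)."""
    out, cursor = [], 0
    while True:
        page, next_cursor = fetch_page(docs, cursor, batch_size)
        out.extend(page)
        if next_cursor == cursor:
            return out
        cursor = next_cursor

docs = list(range(25))
rows = read_all(docs, batch_size=10)   # three requests: 10 + 10 + 5
```

The point of cursor paging over plain start/rows offsets is that each request is cheap regardless of how deep into the result set it reaches, which matters when an MR job scans an entire collection.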

Execute the bin/hive command to start Hive's command-line terminal:

-- query all data
select * from solr limit 5;
-- query specified fields
select rowkey from solr;
-- aggregate solr data by means of MR
select sname, count(*) as c from solr group by sname order by c desc;

(2) An example of using Hive to build a Solr index

First, build the data source table:

SQL code

-- delete the table if it exists
drop table if exists index_source;
-- create the data source table
create table index_source (id string, yname string, sname string)
row format delimited fields terminated by ',' stored as textfile;
-- import local data into the data source
load data local inpath '/ROOT/server/hive/test_solr' into table index_source;

Second, build the table associated with Solr:

-- delete the table if it exists
drop table if exists index_solr;
-- create an external table associated with solr
create external table index_solr (
  id string,
  yname string,
  sname string
)
-- define the storage engine
stored by "com.easy.hive.store.SolrStorageHandler"
-- set solr service properties
tblproperties ('solr.url' = 'http://192.168.1.28:8983/solr/b',
'solr.query' = '*:*',
'solr.cursor.batch.size' = '10000',
'solr.primary_key' = 'id'
);

Finally, execute the following SQL command to build the Solr index for the data in the source table:

SQL code

-- register the hive-solr jar package; otherwise jobs running in MR mode will not start normally
add jar /ROOT/server/hive/lib/hive-solr.jar;
-- execute the insert command
insert overwrite table index_solr select * from index_source;

After the execution succeeds, you can view the data in Solr's admin interface, or run the following query against Solr from Hive:

select * from index_solr limit 10;
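On the write path of an insert like the one above, a record writer that talks to a search server typically buffers rows and sends them in batches rather than issuing one request per row (a common pattern; whether this project does exactly this is an assumption). A Python sketch with a hypothetical send callable:

```python
class BufferedSolrWriter:
    """Illustrative sketch: buffer rows and flush them in batches.
    The batch size and the send callable are hypothetical, not taken
    from the hive-solr project."""
    def __init__(self, send, batch_size=1000):
        self.send = send          # callable that posts a list of docs to Solr
        self.batch_size = batch_size
        self.buffer = []

    def write(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.send(self.buffer)
            self.buffer = []

    def close(self):
        """Called when the MR task finishes: flush any remaining rows."""
        self.flush()

sent = []
w = BufferedSolrWriter(send=sent.append, batch_size=3)
for i in range(7):
    w.write({"id": i})
w.close()
# sent now holds three batches, of sizes 3, 3, and 1
```

Batching amortizes HTTP and indexing overhead across many documents, which is what makes building a large index "with a single SQL statement" practical.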

(6) Can they be integrated with other frameworks?

Of course. As open-source, independent frameworks, they can be combined in many ways: Hive can also be integrated with Elasticsearch or MongoDB, and Solr can also be integrated with Spark or Pig. The related components need to be customized, but the approach is roughly the same as in this project.

(7) Basic environment in which the tests passed

Apache Hadoop 2.7.1

Apache Hive 1.2.1

Apache Solr 5.1.0

(8) Thanks and reference materials:

https://github.com/mongodb/mongo-hadoop/tree/master/hive/src/main/java/com/mongodb/hadoop/hive

https://github.com/lucidworks/hive-solr

https://github.com/chimpler/hive-solr

https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide#DeveloperGuide-HowtoWriteYourOwnSerDe
