This article explains how to prevent automatic word segmentation (analysis) when importing data from Hive into Elasticsearch with elasticsearch-hadoop. The content is simple and clear; follow the steps below.
Background
In our company's Elasticsearch use cases, word segmentation is not needed. By default the ES string type is analyzed automatically, so fields such as province and region get split into tokens, which breaks exact matching.
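You can see the default behavior with the _analyze API. A minimal sketch (assuming a local ES 1.x/2.x node, matching the elasticsearch-hadoop 2.2.0 used later; the sample text is illustrative):

curl -XGET 'localhost:9200/_analyze?analyzer=standard&text=guang+dong+province'
# The standard analyzer returns three separate tokens: "guang", "dong", "province".
# With "index": "not_analyzed" the whole value is stored as a single term,
# so exact-match queries on province/region work as expected.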
Usage
Create an Elasticsearch template (_template) with the following command (the # comments are explanations only and must be removed before running, since JSON does not allow comments):

curl -XPUT localhost:9200/_template/dmp_down_result -d '
{
    "template": "dmp_down_*",            # indexes whose names start with dmp_down_ will use this template
    "settings": {
        "number_of_shards": 14,          # number of primary shards
        "number_of_replicas": 1,         # number of replicas
        "index.refresh_interval": "30s"  # refresh interval (optional)
    },
    "aliases": {
        "dmp_down_result": {}            # alias
    },
    "mappings": {
        "dmp_es_result1": {              # type name; must match the type of the index being created
            "properties": {              # field mapping settings
                "user_id": {             # must correspond to a field in the Hive data
                    "type": "multi_field",
                    "fields": {
                        "user_id": {"type": "string", "index": "not_analyzed"}  # not_analyzed disables word segmentation; the default, analyzed, enables it
                    }
                },
                "phone": {
                    "type": "multi_field",
                    "fields": {
                        "imei": {"type": "string", "index": "not_analyzed"}
                    }
                },
                "address": {
                    "type": "multi_field",
                    "fields": {
                        "idfa": {"type": "string", "index": "not_analyzed"}
                    }
                }
            }
        }
    }
}'
After the template is created, you can check whether it has taken effect via http://localhost:9200/_template.
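For example, from the command line (?pretty is a standard ES query parameter for readable output):

curl -XGET 'localhost:9200/_template/dmp_down_result?pretty'
# Returns the template JSON if it was registered successfully.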
This completes the template creation.
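If you need to adjust the template later, delete it and re-create it (a sketch; DELETE on _template is a standard ES API):

curl -XDELETE 'localhost:9200/_template/dmp_down_result'
# Existing indexes keep their old settings; only indexes created afterwards
# pick up the re-created template.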
During import you may get an error like "maybe it contains illegal characters?". The real cause cannot be determined from the import itself; you can surface the specific error by manually creating the index/type, as shown below. For example:

{"error": {"root_cause": [{"type": "remote_transport_exception", "reason": "[dmp_es-16][10.8.1.16:9300][indices:admin/create]"}], "type": "illegal_state_exception", "reason": "index and alias names need to be unique, but alias [dmp_keyword_result] and index [dmp_keyword_result] have the same name"}, "status": 500}

This makes the problem much easier to see: the index name duplicates an alias, so simply pick a different index name.
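A minimal sketch of the manual check (the index name matches the failing one from the error above):

curl -XPUT 'localhost:9200/dmp_keyword_result?pretty'
# If a template defines an alias with the same name as the new index,
# ES rejects the creation with the illegal_state_exception shown above.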
Appendix: how to synchronize Hive data to ES with elasticsearch-hadoop:
Download the elasticsearch-hadoop jar, whose version needs to match the current Elasticsearch version
Upload the elasticsearch-hadoop jar to the cluster
Register it from the Hive command line, as shown in the session below: add jar /home/hdroot/elasticsearch-hadoop-2.2.0.jar
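A sketch of the Hive CLI session (list jars is a standard Hive command to confirm the registration):

hive> add jar /home/hdroot/elasticsearch-hadoop-2.2.0.jar;
hive> list jars;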
Table creation command:
CREATE EXTERNAL TABLE dmp_es_result2 (
    user_id string,
    imei string,
    idfa string,
    email string,
    type_id array<string>,   -- element type assumed; the original did not specify it
    province string,
    region string,
    dt string,
    terminal_brand string,
    system string)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
    'es.resource' = '<index name>/<type name>',
    'es.index.auto.create' = 'true',
    'es.nodes' = 'localhost',
    'es.port' = '9200',
    'es.field.read.empty.as.null' = 'true');
es.resource: the index name/type name to synchronize to in ES
es.index.auto.create: whether to create the index automatically; if you do not want ES to generate document ids itself, you can specify es.mapping.id=<primary key field>
es.nodes: ES cluster node addresses; any node works, and multiple nodes are separated by commas, e.g. 192.168.1.1:9200,192.168.1.2:9200
es.port: ES cluster port; if es.nodes already specifies ports, this can be omitted
es.field.read.empty.as.null: whether to read empty fields as null; adding this parameter makes the import handle empty values more gracefully (the original is a little uncertain on this point)
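For reference, a sketch of the same kind of table pointing at a two-node cluster and using user_id as the document id (the table name, index/type names, and node addresses here are illustrative, not from the original):

CREATE EXTERNAL TABLE dmp_es_result3 (
    user_id string,
    province string)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
    'es.resource' = 'dmp_down_result/dmp_es_result1',
    'es.index.auto.create' = 'true',
    'es.mapping.id' = 'user_id',   -- use the Hive user_id column as the ES document id
    'es.nodes' = '192.168.1.1:9200,192.168.1.2:9200',
    'es.field.read.empty.as.null' = 'true');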
Hive statement to import the data:
INSERT OVERWRITE TABLE dmp_es_result2 select user_id,imei,idfa,email,type_id,province,region,dt,terminal_brand,system from temp_zy_game_result01
If the statement succeeds, the data has been synchronized to ES, which you can verify as shown below.
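A quick verification sketch (_count is a standard ES API; the index name comes from the alias defined in the template above):

curl -XGET 'localhost:9200/dmp_down_result/_count?pretty'
# Compare the returned count with the number of rows selected in Hive.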
Thank you for reading. That covers how to avoid automatic word segmentation when importing data from Hive into Elasticsearch with elasticsearch-hadoop. You should now have a deeper understanding of the problem, though specific usage still needs to be verified in practice.