
Use ES to segment Chinese articles and sort words by frequency


First of all, there is a requirement: given an article of roughly 10,000 Chinese characters, count which words appear most frequently. The harder part is how to segment a passage into words at all. For example, in the sentence "Beijing is the capital of the People's Republic of China" (北京是中华人民共和国的首都), the words that should be cut out are "Beijing" (北京), "People's Republic of China" (中华人民共和国), "Zhonghua" (中华), "Huaren" (华人), "people" (人民), "republic" (共和国) and "capital" (首都), while fragments such as "Beijing is" (北京是) are not meaningful words and should not be produced. Writing such segmentation rules by hand is very tedious, but it is easy with the open-source IK analyzer, and the granularity of the segmentation can be chosen through the analyzer mode.

ik_max_word: splits the text at the finest granularity. For example, "中华人民共和国国歌" (the national anthem of the People's Republic of China) is split into "中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 人, 民, 共和国, 共和, 和, 国国, 国歌", exhausting every possible combination.

ik_smart: does the coarsest-grained split. For example, "中华人民共和国国歌" is split into just "中华人民共和国, 国歌".
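
Once ES with the IK plugin is running (installed in the steps below), the difference between the two modes can be seen directly through the _analyze API. A minimal sketch, assuming the ik_max_word and ik_smart analyzers are registered as configured later in elasticsearch.yml:

$ curl 'http://localhost:9200/_analyze?analyzer=ik_max_word&pretty=true' -d '
{"text": "中华人民共和国国歌"}'

$ curl 'http://localhost:9200/_analyze?analyzer=ik_smart&pretty=true' -d '
{"text": "中华人民共和国国歌"}'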

First, prepare the environment.

If you already have an ES environment you can skip the first two steps; here I assume you only have a freshly installed CentOS 6.x system, so it is easy to follow the whole process from scratch.

(1) install jdk.

$ wget http://download.oracle.com/otn-pub/java/jdk/8u111-b14/jdk-8u111-linux-x64.rpm
$ rpm -ivh jdk-8u111-linux-x64.rpm
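
A quick sanity check that the JDK is in place (the exact version string depends on the RPM you installed):

$ java -version
java version "1.8.0_111"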

(2) install ES

$ wget https://download.elastic.co/elasticsearch/release/org/elasticsearch/distribution/rpm/elasticsearch/2.4.2/elasticsearch-2.4.2.rpm
$ rpm -iv elasticsearch-2.4.2.rpm

(3) install IK word splitter

Download version 1.10.2 of the IK analyzer plugin from GitHub. Note: the ES version here is 2.4.2, and the IK version compatible with it is 1.10.2.

$ mkdir /usr/share/elasticsearch/plugins/ik
$ wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v1.10.2/elasticsearch-analysis-ik-1.10.2.zip
$ unzip elasticsearch-analysis-ik-1.10.2.zip -d /usr/share/elasticsearch/plugins/ik

(4) configure ES

$ vim /etc/elasticsearch/elasticsearch.yml

# Cluster #
cluster.name: test
# Node #
node.name: test-10.10.10.10
node.master: true
node.data: true
# Index #
index.number_of_shards: 5
index.number_of_replicas: 0
# Path #
path.data: /data/elk/es
path.logs: /var/log/elasticsearch
path.plugins: /usr/share/elasticsearch/plugins
# Refresh #
refresh_interval: 5s
# Memory #
bootstrap.mlockall: true
# Network #
network.publish_host: 10.10.10.10
network.bind_host: 0.0.0.0
transport.tcp.port: 9300
# Http #
http.enabled: true
http.port: 9200
# IK #
index.analysis.analyzer.ik.alias: [ik_analyzer]
index.analysis.analyzer.ik.type: ik
index.analysis.analyzer.ik_max_word.type: ik
index.analysis.analyzer.ik_max_word.use_smart: false
index.analysis.analyzer.ik_smart.type: ik
index.analysis.analyzer.ik_smart.use_smart: true
index.analysis.analyzer.default.type: ik

(5) start ES

$ /etc/init.d/elasticsearch start

(6) check the status of es nodes

$ curl localhost:9200/_cat/nodes?v    # a normal node can be seen
host        ip          heap.percent ram.percent load node.role master name
10.10.10.10 10.10.10.10 16           52          0.00 d         *      test-10.10.10.10

$ curl localhost:9200/_cat/health?v   # the cluster status is green
epoch      timestamp cluster status node.total node.data shards pri relo init
1483672233 11:10:33  test    green  1          1         0      0   0    0
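
It is also worth confirming that the IK plugin was actually loaded at startup. A quick check via the _cat API (the exact column layout and reported plugin name can differ slightly between versions):

$ curl localhost:9200/_cat/plugins?v
name             component   version
test-10.10.10.10 analysis-ik 1.10.2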

Second, test the word-segmentation function.

(1) create a test index

$ curl -XPUT http://localhost:9200/test

(2) create mapping

$ curl -XPOST http://localhost:9200/test/fulltext/_mapping -d '
{
  "fulltext": {
    "_all": {
      "analyzer": "ik"
    },
    "properties": {
      "content": {
        "type": "string",
        "boost": 8.0,
        "term_vector": "with_positions_offsets",
        "analyzer": "ik",
        "include_in_all": true
      }
    }
  }
}'

(3) Test data

$ curl 'http://localhost:9200/index/_analyze?analyzer=ik&pretty=true' -d '
{
  "text": "美国留给伊拉克的是个烂摊子吗"
}'

Returned content (the sample sentence means "Is what the United States left behind in Iraq a mess?"):

{"tokens": [{"token": "USA", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 0}, {"token": "reserved", "start_offset": 2, "end_offset": 4, "type": "CN_WORD" "position": 1}, {"token": "Iraq", "start_offset": 4, "end_offset": 7, "type": "CN_WORD", "position": 2}, {"token": "start_offset": 4, "end_offset": 5, "type": "CN_WORD" "position": 3}, {"token": "pull", "start_offset": 5, "end_offset": 6, "type": "CN_CHAR", "position": 4}, {"token": "gram", "start_offset": 6, "end_offset": 7, "type": "CN_WORD" "position": 5}, {"token": "a", "start_offset": 9, "end_offset": 10, "type": "CN_CHAR", "position": 6}, {"token": "mess", "start_offset": 10, "end_offset": 13, "type": "CN_WORD" "position": 7}, {"token": "stall", "start_offset": 11, "end_offset": 13, "type": "CN_WORD", "position": 8}, {"token": "stall", "start_offset": 11, "end_offset": 12, "type": "CN_WORD" "position": 9}, {"token": "son", "start_offset": 12, "end_offset": 13, "type": "CN_CHAR", "position": 10}, {"token": "Mo", "start_offset": 13, "end_offset": 14, "type": "CN_CHAR" "position": 11}]}

Third, import the real data.

(1) upload the Chinese text file to linux.

$ cat /tmp/zhongwen.txt
Beijing-Tianjin-Hebei heavy-pollution weather: continuous supervision finds some enterprises maliciously producing
"Narcissism" accused of "stingy acting"; producer: it was the special effects
Obama insists on moving prisoners out of Guantanamo despite Trump's opposition
South Korean media: Japan calls off the South Korea-Japan currency-swap negotiations; the Korean Ministry of Finance expresses regret
Tax elites with an annual salary of one million yuan pay more than 400,000 in tax and have no choice but to go abroad for development
...

Note: make sure the text file is encoded as UTF-8, otherwise the text stored in ES will be garbled later.

$ vim /tmp/zhongwen.txt

In command mode, enter ":set fileencoding" and you should see fileencoding=utf-8.

If it shows fileencoding=utf-16le, convert it by entering ":set fileencoding=utf-8".
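
If you prefer to check or convert the encoding outside vim, the standard file and iconv utilities can do the same job (a minimal sketch; adjust the source encoding to whatever file reports):

$ file -i /tmp/zhongwen.txt                                               # shows e.g. charset=utf-8
$ iconv -f UTF-16LE -t UTF-8 /tmp/zhongwen.txt -o /tmp/zhongwen.utf8.txt  # convert if needed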

(2) create index and mapping

Create an index

$ curl -XPUT http://localhost:9200/index

Create the mapping; set the analyzer and fielddata for the message field, which is the field to be segmented.

$ curl -XPOST http://localhost:9200/index/logs/_mapping -d '
{
  "logs": {
    "_all": {
      "analyzer": "ik"
    },
    "properties": {
      "path": { "type": "string" },
      "@timestamp": {
        "format": "strict_date_optional_time||epoch_millis",
        "type": "date"
      },
      "@version": { "type": "string" },
      "host": { "type": "string" },
      "message": {
        "include_in_all": true,
        "analyzer": "ik",
        "term_vector": "with_positions_offsets",
        "boost": 8,
        "type": "string",
        "fielddata": { "format": "true" }
      },
      "tags": { "type": "string" }
    }
  }
}'
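
To confirm the mapping was applied as intended, it can be read back with the standard mapping API:

$ curl -XGET 'http://localhost:9200/index/_mapping?pretty'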

(3) use logstash to write the text file to es

Install logstash

$ wget https://download.elasticsearch.org/elasticsearch/release/org/elasticsearch/distribution/rpm/elasticsearch/2.1.1/elasticsearch-2.1.1.rpm
$ rpm -ivh logstash-2.1.1.rpm

Configure logstash

$ vim /etc/logstash/conf.d/logstash.conf

input {
  file {
    codec => 'json'
    path => "/tmp/zhongwen.txt"
    start_position => "beginning"
  }
}
output {
  elasticsearch {
    hosts => "10.10.10.10:9200"
    index => "index"
    flush_size => 3000
    idle_flush_time => 2
    workers => 4
  }
  stdout {
    codec => rubydebug
  }
}
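
Before starting the service, the config file can be validated first. A small sketch, assuming the RPM installed logstash under /opt/logstash (the default location for the 2.x packages):

$ /opt/logstash/bin/logstash -f /etc/logstash/conf.d/logstash.conf --configtest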

Start

$ /etc/init.d/logstash start

By watching the stdout output, you can check whether the data is being written into ES.

$ tail -f /var/log/logstash.stdout

(4) check whether there is any data in the index.

$ curl 'localhost:9200/_cat/indices/index?v'    # 6007 documents can be seen
health status index pri rep docs.count docs.deleted store.size pri.store.size
green  open   index 5   0   6007       0            2.5mb      2.5mb

$ curl -XPOST "http://localhost:9200/index/_search?pretty"
{
  "took": 1,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 5227,
    "max_score": 1.0,
    "hits": [
      {
        "_index": "index",
        "_type": "logs",
        "_id": "AVluC7Dpbw7ZlXPmUTSG",
        "_score": 1.0,
        "_source": {
          "message": "more than 400,000 tax elites have no choice but to go abroad for development",
          "tags": ["_jsonparsefailure"],
          "@version": "1",
          "@timestamp": "2017-01-05T09:52:56.150Z",
          "host": "0.0.0.0",
          "path": "/tmp/333.log"
        }
      },
      {
        "_index": "index",
        "_type": "logs",
        "_id": "AVluC7Dpbw7ZlXPmUTSN",
        "_score": 1.0,
        "_source": {
          "message": "Obama insists on moving prisoners out of Guantanamo despite Trump's opposition",
          "tags": ["_jsonparsefailure"],
          "@version": "1",
          "@timestamp": "2017-01-05T09:52:56.222Z",
          "host": "0.0.0.0",
          "path": "/tmp/333.log"
        }
      },
      ...
    ]
  }
}

Fourth, compute the word frequencies and sort them.

(1) query the top 10 most frequent words

$curl-XGET "http://localhost:9200/index/_search?pretty"-d' {" size ": 0," aggs ": {" messages ": {" terms ": {" size ": 10," field ":" message "}'

Return the result

{"took": 3, "timed_out": false, "_ shards": {"total": 5, "successful": 5, "failed": 0}, "hits": {"total": 6007, "max_score": 6007, "hits": []} "aggregations": {"messages": {"doc_count_error_upper_bound": 154," sum_other_doc_count ": 94992," buckets ": [{" key ":" one "," doc_count ": 1582}, {" key ":" after "," doc_count ": 1582} {"key": "people", "doc_count": 541}, {"key": "Home", "doc_count": 538}, {"key": "out", "doc_count": 489}, {"key": "doc_count": 451} {"key": "doc_count": 440}, {"key": "state", "doc_count": 421}, {"key": "year", "doc_count": 405}, {"key": "son" "doc_count": 402}]}}

(2) query the top 10 most frequent two-character words

$curl-XGET "http://localhost:9200/index/_search?pretty"-d' {" size ": 0," aggs ": {" messages ": {" terms ": {" size ": 10," field ":" message " "include": "[\ u4E00 -\ u9FA5] [\ u4E00 -\ u9FA5]"}, "highlight": {"fields": {"message": {}'

Return

{"took": 22, "timed_out": false, "_ shards": {"total": 5, "successful": 5, "failed": 0}, "hits": {"total": 6007, "max_score": 6007, "hits": []} "aggregations": {"messages": {"doc_count_error_upper_bound": 73, "sum_other_doc_count": 42415, "buckets": [{"key": "woman", "doc_count": 291}, {"key": "man", "doc_count": 264} {"key": "unexpectedly", "doc_count": 257}, {"key": "Shanghai", "doc_count": 255}, {"key": "this", "doc_count": 238}, {"key": "Girl" "doc_count": 174}, {"key": "these", "doc_count": 167}, {"key": "one", "doc_count": 159}, {"key": "attention", "doc_count": 143} {"key": "so", "doc_count": 142}]}

(3) query the top 10 most frequent two-character words, excluding words that begin with "woman" (女).

Curl-XGET "http://localhost:9200/index/_search?pretty"-d' {" size ": 0," aggs ": {" messages ": {" terms ": {" size ": 10," field ":" message " "include": "[\ u4E00 -\ u9FA5] [\ u4E00 -\ u9FA5]", "exclude": "female. *"}, "highlight": {"fields": {"message": {}'

Return

{"took": 19, "timed_out": false, "_ shards": {"total": 5, "successful": 5, "failed": 0}, "hits": {"total": 5227, "max_score": 5227, "hits": []} "aggregations": {"messages": {"doc_count_error_upper_bound": 71, "sum_other_doc_count": 41773, "buckets": [{"key": "Man", "doc_count": 264}, {"key": "unexpectedly", "doc_count": 41773} {"key": "Shanghai", "doc_count": 255}, {"key": "this", "doc_count": 238}, {"key": "these", "doc_count": 167}, {"key": "one" "doc_count": 159}, {"key": "attention", "doc_count": 143}, {"key": "like this", "doc_count": 142}, {"key": "Chongqing", "doc_count": 142} {"key": "result", "doc_count": 137}]}

There are more strategies you can layer on top of word segmentation, such as synonyms (for example, the two Chinese words for tomato, 西红柿 and 番茄, can be declared synonyms so that searching for one also returns documents containing the other) and pinyin analysis (so that searching for "zhonghua" can also find "中华"), and so on.
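
As a rough illustration of the synonym idea only: the sketch below creates a throwaway index (the name syn_test and the filter/analyzer names are made up for this example) whose custom analyzer chains an IK tokenizer with ES's built-in synonym token filter. The tokenizer name registered by the IK plugin can vary between plugin versions, so treat this as a starting point rather than a drop-in config.

$ curl -XPUT 'http://localhost:9200/syn_test' -d '
{
  "settings": {
    "analysis": {
      "filter": {
        "tomato_synonyms": {
          "type": "synonym",
          "synonyms": ["西红柿, 番茄"]
        }
      },
      "analyzer": {
        "ik_with_synonyms": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": ["tomato_synonyms"]
        }
      }
    }
  }
}'

Pinyin search, by contrast, is normally handled by the separate elasticsearch-analysis-pinyin plugin rather than by IK itself.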
