

What are the knowledge points of Elasticsearch?




This article explains the main knowledge points of Elasticsearch. The approach introduced here is simple, fast, and practical, so let's work through it together.

The main contents of this article are as follows:

Preface

In a project we often use the Kibana interface to search the logs of the test or production environment for abnormal information. Kibana is the K in what we commonly call ELK.

The Kibana interface is shown in the following figure:

Kibana interface

But what is the principle of these log retrieval? This is where our Elasticsearch search engine comes in.

I. A brief introduction to Elasticsearch

1.1 What is Elasticsearch?

Elasticsearch is a distributed open source search and analysis engine for all types of data, including text, numbers, geospatial, structured and unstructured data. To put it simply, ES can do anything related to search and analysis.

1.2 What is the purpose of Elasticsearch?

Elasticsearch performs well in terms of speed and scalability, and has the ability to index many types of content, which means it can be used in a variety of use cases:

For example, an online store where you can allow customers to search for the products you sell. In this case, you can use Elasticsearch to store the entire product catalog and inventory and provide them with search and auto-completion suggestions.

Search for mobile phones
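As a rough sketch of that idea (a hypothetical products index and title field, not from the original article), a search-as-you-type query in Kibana Dev Tools might look like this:

GET products/_search
{
  "query": {
    "match_phrase_prefix": {
      "title": "iph"
    }
  }
}

match_phrase_prefix matches documents whose title continues the typed prefix, which is one simple way to back auto-completion suggestions.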

For example, collect log or transaction data, and analyze and mine this data to find trends, statistics, summaries, or anomalies. In this case, you can use Logstash (part of the Elasticsearch / Logstash / Kibana stack) to collect, aggregate, and parse the data, and then have Logstash provide the data to Elasticsearch. After the data is put into Elasticsearch, you can run searches and aggregations to mine any information you are interested in.

1.3 How does Elasticsearch work?

ELK schematic diagram

Elasticsearch is built on top of Lucene, and ES adds many enhancements on top of Lucene.

Lucene is a subproject of the Apache Software Foundation's Jakarta project. It is an open-source full-text search toolkit: not a complete full-text search engine, but a full-text search engine architecture that provides a complete query engine and indexing engine, plus some text analysis engines (for the two western languages English and German). Lucene's purpose is to give software developers a simple, easy-to-use toolkit for conveniently implementing full-text retrieval in a target system, or for building a complete full-text retrieval engine on top of it. (From Baidu Baike)

Where does the raw data of Elasticsearch come from?

Raw data is entered into Elasticsearch from multiple sources, including logs, system metrics, and network applications.

How is the data ingested into Elasticsearch?

Data ingestion refers to parsing, normalizing, and enriching the raw data before it is indexed in Elasticsearch. Once the data is indexed, users can run complex queries against it and use aggregations to retrieve complex summaries of their data. Logstash is used here; it is described later.

How to visually view the data you want to retrieve?

Kibana is used here: with it, users can search their own data, build views of it, and so on.

1.4 What is an Elasticsearch index?

An Elasticsearch index is a collection of documents that are related to each other. Elasticsearch stores data as JSON documents. Each document establishes a relationship between a set of keys (field or property names) and their corresponding values (strings, numbers, Booleans, dates, arrays of values, geographic locations, or other types of data).

Elasticsearch uses a data structure called inverted index, which is designed to allow full-text search to be done very quickly. The inverted index lists each unique word that appears in all documents, and all documents that contain each word can be found.

During the indexing process, Elasticsearch stores the document and builds an inverted index so that users can search the document data in near real time. Indexing is initiated through the index API, which lets you add JSON documents to a specific index or change existing ones.

1.5 What is the purpose of Logstash?

Logstash is the L of ELK.

Logstash is one of the core products of Elastic Stack, which can be used to aggregate and process data and send it to Elasticsearch. Logstash is an open source server-side data processing pipeline that allows you to collect, enrich and transform data from multiple sources at the same time before indexing it to Elasticsearch.

1.6 What is the purpose of Kibana?

Kibana is a data visualization and management tool for Elasticsearch that provides real-time histograms, line charts, and so on.

1.7 Why use Elasticsearch?

ES is a fast, near real-time search platform.

ES is distributed in nature.

ES includes a wide range of functions, such as data aggregation and index lifecycle management.

Official document: https://www.elastic.co/cn/what-is/elasticsearch

II. Basic concepts of ES

2.1 Index (Index)

Verb: to index a document, equivalent to INSERT in MySQL

Noun: an index, equivalent to a database in MySQL

Comparison with MySQL

No. | MySQL | Elasticsearch
1 | MySQL service | ES cluster service
2 | Database | Index
3 | Table | Type
4 | Record (row) | Document (JSON format)

2.2 Inverted index

If the database has the following movie records:

1-A Chinese Odyssey

2-rumors of A Chinese Odyssey

3-Analysis of A Chinese Odyssey

4-Journey to the West to conquer the devil

5-exclusive Analysis of Fantasy Journey to the West

Word segmentation (tokenization): splitting a whole sentence into individual words

Each token is stored in ES together with the record numbers of the movies that contain it:

Token | Record numbers
A  Journey to the West | 1, 2, 3, 4, 5
B  A Chinese Odyssey | 1, 2, 3
C  anecdote | 2
D  analysis | 3, 5
E  conquer the devil | 4
F  fantasy | 5
G  exclusive | 5

Search: "exclusive A Chinese Odyssey"

The query is split into the tokens "exclusive", "A Chinese Odyssey", and "Journey to the West".

These three tokens correspond to entries A, B, and G in the inverted index, so records 1 through 5 are all hit.

Record 1 hits 2 tokens (A and B) and contains 2 words in total, so its relevance score is 2 hits / 2 words = 1.

Record 2 hits 2 tokens (A and B) and contains 3 words, so its relevance score is 2 / 3 = 0.67.

Record 3 hits 2 tokens (A and B) and contains 3 words, so its relevance score is 2 / 3 = 0.67.

Record 4 hits 1 token (A) and contains 3 words, so its relevance score is 1 / 3 = 0.33.

Record 5 hits 2 tokens (A and G) and contains 4 words, so its relevance score is 2 / 4 = 0.5.

So the order of the retrieved records is as follows:

1 - A Chinese Odyssey (relevance score: 1)

2 - Rumors of A Chinese Odyssey (relevance score: 0.67)

3 - Analysis of A Chinese Odyssey (relevance score: 0.67)

5 - Exclusive analysis of Fantasy Journey to the West (relevance score: 0.5)

4 - Journey to the West to conquer the devil (relevance score: 0.33)

III. Building the environment with Docker

3.1 Build the Elasticsearch environment

1) Download the image file

docker pull elasticsearch:7.4.2

2) Create an instance

Map the configuration folders:

# Create the config mapping folder
mkdir -p /mydata/elasticsearch/config
# Create the data mapping folder
mkdir -p /mydata/elasticsearch/data
# Make the folders readable and writable by any user
chmod 777 /mydata/elasticsearch -R
# Configure http.host
echo "http.host: 0.0.0.0" >> /mydata/elasticsearch/config/elasticsearch.yml

Start the elasticsearch container

docker run --name elasticsearch -p 9200:9200 -p 9300:9300 \
  -e "discovery.type=single-node" \
  -e ES_JAVA_OPTS="-Xms64m -Xmx128m" \
  -v /mydata/elasticsearch/config/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml \
  -v /mydata/elasticsearch/data:/usr/share/elasticsearch/data \
  -v /mydata/elasticsearch/plugins:/usr/share/elasticsearch/plugins \
  -d elasticsearch:7.4.2

Access the elasticsearch service

Visit: http://192.168.56.10:9200

Returned response:

{"name": "8448ec5f3312", "cluster_name": "elasticsearch", "cluster_uuid": "xC72O3nKSjWavYZ-EPt9Gw", "version": {"number": "7.4.2", "build_flavor": "default", "build_type": "docker", "build_hash": "2f90bbf7b93631e52bafb59b3b049cb44ec25e96" "build_date": "2019-10-28T20:40:44.881551Z", "build_snapshot": false, "lucene_version": "8.2.0", "minimum_wire_compatibility_version": "6.8.0", "minimum_index_compatibility_version": "6.0.0-beta1"}, "tagline": "You Know, for Search"}

Visit http://192.168.56.10:9200/_cat/nodes to view node information:

127.0.0.1 62 90 0 0.06 0.10 0.05 dilm * 8448ec5f3312

3.2. Build Kibana environment

docker pull kibana:7.4.2
docker run --name kibana -e ELASTICSEARCH_HOSTS=http://192.168.56.10:9200 -p 5601:5601 -d kibana:7.4.2

Visit kibana: http://192.168.56.10:5601/

IV. Basic retrieval methods

4.1 _cat usage

GET /_cat/nodes: view all nodes
GET /_cat/health: view ES health status
GET /_cat/master: view the master node
GET /_cat/indices: view all indices

Summary of all _cat queries:
/_cat/allocation
/_cat/shards
/_cat/shards/{index}
/_cat/master
/_cat/nodes
/_cat/tasks
/_cat/indices
/_cat/indices/{index}
/_cat/segments
/_cat/segments/{index}
/_cat/count
/_cat/count/{index}
/_cat/recovery
/_cat/recovery/{index}
/_cat/health
/_cat/pending_tasks
/_cat/aliases
/_cat/aliases/{alias}
/_cat/thread_pool
/_cat/thread_pool/{thread_pools}
/_cat/plugins
/_cat/fielddata
/_cat/fielddata/{fields}
/_cat/nodeattrs
/_cat/repositories
/_cat/snapshots/{repository}
/_cat/templates

4.2 Index a document (save)

Example: save a record with id 1 under the external type of the member index.

Use Kibana's Dev Tools to create

PUT member/external/1
{
  "name": "jay huang"
}

Response:

{"_ index": "member", / / in which index "_ type": "external", / / in that type "_ id": "2", / / record id "_ version": 7successful / version number "result": "updated", / / Operation type "_ shards": {"total": 2, "successful": 1 "failed": 0}, "_ seq_no": 9, "_ primary_term": 1}

You can also send a request through the Postman tool to create a record.

Create a record

Note:

Both PUT and POST can create records.

POST: if id is not specified, id is generated automatically. If you specify id, modify the record and add a new version number.

PUT: id must be specified. If there is no record, add it, and if so, update it.
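For instance, a POST without an id (a hypothetical request, not shown in the original) lets ES generate the id automatically:

POST member/external
{
  "name": "jay huang"
}

ES returns a generated _id and "result": "created"; sending the document again with an explicit id instead bumps the version and returns "result": "updated".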

4.3 Query documents

GET request: http://192.168.56.10:9200/member/external/2

Response:
{
  "_index": "member",     // which index
  "_type": "external",    // which type
  "_id": "2",             // record id
  "_version": 7,          // version number
  "_seq_no": 9,           // concurrency control field, incremented on every update; used as an optimistic lock
  "_primary_term": 1,     // similar to above; reallocated when the primary shard changes, e.g. after a restart
  "found": true,
  "_source": {            // the actual content
    "name": "jay huang"
  }
}

_seq_no as an optimistic lock

Each time the data is updated, _seq_no is incremented by 1, so it can be used for concurrency control.

When updating a record, if _seq_no does not match the expected value, the record has already been updated at least once in the meantime, and this update is rejected.

The usage is as follows:

Request to update record 2: http://192.168.56.10:9200/member/external/2?if_seq_no=9&&if_primary_term=1

Returned result:
{
  "_index": "member",
  "_type": "external",
  "_id": "2",
  "_version": 9,
  "result": "updated",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 11,
  "_primary_term": 1
}

The data is updated only when _seq_no equals the specified value and _primary_term = 1. After the request has been executed once, the document's _seq_no changes, so running the same request again reports an error: version conflict.

{"error": {"root_cause": [{"type": "version_conflict_engine_exception", "reason": "[2]: version conflict, required seqNo [10], primary term [1]. Current document has seqNo [11] and primary term [1] "," index_uuid ":" CX6uwPBKRByWpuym9rMuxQ "," shard ":" 0 "," index ":" member "}]," type ":" version_conflict_engine_exception "," reason ":" [2]: version conflict, required seqNo [10], primary term [1]. Current document has seqNo [11] and primary term [1] "," index_uuid ":" CX6uwPBKRByWpuym9rMuxQ "," shard ":" 0 "," index ":" member "}," status ": 409}

4.4 Update documents

Usage

Use POST with _update for the update operation. If the original data has not changed, the result in the response is noop (no operation) and the version does not change.

The request data needs to be wrapped in doc in the request body.

POST request: http://192.168.56.10:9200/member/external/2/_update
{
  "doc": {
    "name": "jay huang"
  }
}

Response:
{
  "_index": "member",
  "_type": "external",
  "_id": "2",
  "_version": 12,
  "result": "noop",
  "_shards": {
    "total": 0,
    "successful": 0,
    "failed": 0
  },
  "_seq_no": 14,
  "_primary_term": 1
}

Usage scenario: for high-concurrency updates it is recommended not to use _update; for scenarios with many concurrent queries and only occasional updates, _update works well.

Add attributes when updating

Request to add an age attribute

POST request: http://192.168.56.10:9200/member/external/2/_update
{
  "doc": {
    "name": "jay huang",
    "age": 18
  }
}

Response:
{
  "_index": "member",
  "_type": "external",
  "_id": "2",
  "_version": 13,
  "result": "updated",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 15,
  "_primary_term": 1
}

4.5 Delete documents and indexes

Delete document

DELETE request: http://192.168.56.10:9200/member/external/2

Response:
{
  "_index": "member",
  "_type": "external",
  "_id": "2",
  "_version": 2,
  "result": "deleted",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 1,
  "_primary_term": 1
}

Delete index

DELETE request: http://192.168.56.10:9200/member

Response:
{
  "acknowledged": true
}

There is no API for deleting a type.

4.6 Import data in bulk

Using Kibana's Dev Tools, enter the following statement:

POST /member/external/_bulk
{"index": {"_id": "1"}}
{"name": "Jay Huang"}
{"index": {"_id": "2"}}
{"name": "Jackson Huang"}

The execution result is shown in the following figure:

Copy official sample data

https://raw.githubusercontent.com/elastic/elasticsearch/master/docs/src/test/resources/accounts.json

Execute the script in Kibana

POST /bank/account/_bulk
{"index": {"_id": "1"}}
{"account_number": 1, "balance": 39225, "firstname": "Amber", "lastname": "Duke", "age": 32, "gender": "M", "address": "880 Holmes Lane", "employer": "Pyrami", "email": "amberduke@pyrami.com", "city": "Brogan", "state": "IL"}
{"index": {"_id": "6"}}
...

Execution results of batch insertion of sample data

View all indexes

View all indexes

You can see from the returned results that the bank index has 1000 pieces of data, which takes up 440.2kb storage space.

V. Advanced retrieval methods

5.1 Two query methods

5.1.1 Parameters appended to the URL

GET bank/_search?q=*&sort=account_number:asc

This queries all the data: 1000 records in total, taking 1 ms, but only 10 records are shown (ES paginates the results).

URL followed by parameters

Attribute value description:

took - how long (in milliseconds) ES took to perform the search
timed_out - whether the search timed out
_shards - how many shards were searched, and counts of successful / failed / skipped shards
max_score - highest score
hits.total.value - number of records hit
hits.sort - sort key of the results (if absent, results are sorted by _score)
hits._score - relevance score

Reference document: https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started-search.html

5.1.2 URL plus request body for retrieval (QueryDSL)

Write query conditions in the request body

Syntax:

GET bank/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "account_number": "asc"
    }
  ]
}

Example: query everything, sorting first by account_number in ascending order and then by balance in descending order (a sketch of this query follows the figure below).

URL plus request body for retrieval
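The query in the screenshot is not reproduced here; against the same bank sample data it would look roughly like this:

GET bank/_search
{
  "query": { "match_all": {} },
  "sort": [
    { "account_number": "asc" },
    { "balance": "desc" }
  ]
}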

5.2 QueryDSL queries in detail

DSL: Domain Specific Language

5.2.1 Match all (match_all)

Example: query all records, sort by balance descending order, return only 11 records to 20 records, and display only the balance and firstname fields.

GET bank/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "balance": {
        "order": "desc"
      }
    }
  ],
  "from": 10,
  "size": 10,
  "_source": ["balance", "firstname"]
}

5.2.2 Match query (match)

Basic type (non-string), exact match

GET bank/_search
{
  "query": {
    "match": {
      "account_number": "30"
    }
  }
}

String, full-text search

GET bank/_search
{
  "query": {
    "match": {
      "address": "mill road"
    }
  }
}

String full-text retrieval

Full-text search results are sorted by relevance score, and the search terms are tokenized before matching.

This query finds all records whose address contains mill or road or mill road, and gives each a relevance score.

32 records were found. The highest-scoring record has address "990 Mill Road" with a score of 8.926605. The record with address "198 Mill Lane" scores 5.4032025 because it matches only the word Mill.

5.2.3 Phrase matching (match_phrase)

Retrieve the values that need to be matched as a whole word (no word segmentation)

GET bank/_search
{
  "query": {
    "match_phrase": {
      "address": "mill road"
    }
  }
}

This finds all records whose address contains the phrase mill road, and gives each a relevance score.

5.2.4 Multi-field matching multi_match

GET bank/_search
{
  "query": {
    "multi_match": {
      "query": "mill land",
      "fields": ["state", "address"]
    }
  }
}

The query text in multi_match is also tokenized.

Query records where state contains mill or land or address contains mill or land.

5.2.5 Compound query (bool)

Compound statements can merge any other query statements, including compound statements. Compound statements can be nested with each other to express complex logic.

Use must, must_not, and should in combination.

must: the specified conditions must be met. (Affects the relevance score.)

must_not: the specified conditions must not be met. (Does not affect the relevance score.)

should: records that meet the condition score higher, but records that do not meet it are still returned. (Affects the relevance score.)

Example: query a record whose address contains mill and whose gender is M and whose age is not equal to 28, and give priority to showing the record where firstname contains Winnie.

GET bank/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "address": "mill" } },
        { "match": { "gender": "M" } }
      ],
      "must_not": [
        { "match": { "age": "28" } }
      ],
      "should": [
        { "match": { "firstname": "Winnie" } }
      ]
    }
  }
}

5.2.6 Filter

A filter does not affect the relevance score; it simply restricts results to records that meet the filter conditions.

It is used inside a bool query.

GET bank/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "age": {
              "gte": 18,
              "lte": 40
            }
          }
        }
      ]
    }
  }
}

5.2.7 term query

Matches the value of an attribute.

Full-text fields are searched with match; non-text fields are matched with term.

keyword: exact matching of the whole text value

match_phrase: phrase matching within a text field

Non-text fields are matched exactly:

GET bank/_search
{
  "query": {
    "term": {
      "age": "20"
    }
  }
}
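For comparison, exact whole-value matching on a text field goes through its keyword sub-field; a sketch assuming the bank sample data's default dynamic mapping (which creates address.keyword):

GET bank/_search
{
  "query": {
    "match": {
      "address.keyword": "990 Mill Road"
    }
  }
}

This only hits documents whose address is exactly "990 Mill Road", unlike the tokenized match query above.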

5.2.8 Aggregations

Aggregations group the data and extract statistics from it, similar to SQL GROUP BY and SQL aggregate functions.

Elasticsearch can return hit results and multiple aggregate results at the same time.

Aggregation syntax:

"aggregations": {"": {"": {} [, "metadata": {[]}]? [, "aggregations": {[] +}]?} ["aggregate name 2 >": {...}] *}

Example 1: for everyone whose address contains mill, return the age distribution (top 10 buckets), the average age, and the average salary.

GET bank/_search
{
  "query": {
    "match": {
      "address": "mill"
    }
  },
  "aggs": {
    "ageAggr": {
      "terms": {
        "field": "age",
        "size": 10
      }
    },
    "ageAvg": {
      "avg": {
        "field": "age"
      }
    },
    "balanceAvg": {
      "avg": {
        "field": "balance"
      }
    }
  }
}

The search results are as follows:

The hits records were returned, along with the three aggregation results: an average age of 34 and an average salary of 25208.0. The age distribution: two people aged 38, one aged 28, and one aged 32.

Example 1

If you do not want the hits results returned, set "size": 0 at the end of the request.

GET bank/_search
{
  "query": {
    "match": {
      "address": "mill"
    }
  },
  "aggs": {
    "ageAggr": {
      "terms": {
        "field": "age",
        "size": 10
      }
    }
  },
  "size": 0
}

Example 2: aggregate by age and query the average salary for those age groups

From the results, we can see that there are 61 31-year-olds, with an average salary of 28312.9, and the aggregate results of other ages are similar.

Example 2
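The query for Example 2 only appears as a screenshot in the original; a sketch that matches the description might be:

GET bank/_search
{
  "query": { "match_all": {} },
  "aggs": {
    "ageAggr": {
      "terms": { "field": "age", "size": 100 },
      "aggs": {
        "balanceAvg": {
          "avg": { "field": "balance" }
        }
      }
    }
  },
  "size": 0
}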

Example 3: group by age, then group each age bucket by gender, and then compute the average salary within those groups.

GET bank/_search
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "ageAggr": {
      "terms": {
        "field": "age",
        "size": 10
      },
      "aggs": {
        "genderAggr": {
          "terms": {
            "field": "gender.keyword",
            "size": 10
          },
          "aggs": {
            "balanceAvg": {
              "avg": {
                "field": "balance"
              }
            }
          }
        }
      }
    }
  },
  "size": 0
}

From the results, we can see that there are 61 31-year-olds. Among them, 35 were of gender M, with an average salary of 29565.6, and 26 of gender F, with an average salary of 26626.6. The aggregation results of other ages are similar.

Aggregate result

5.2.9 Mapping

Mapping is used to define a document (document) and how the attributes it contains (field) are stored and indexed.

Define which string properties should be treated as full-text properties (full text fields)

Define which properties contain numbers, dates, or geographic locations

Defines whether all attributes in the document can be indexed (the _all configuration)

The format of date fields

Custom mapping rules to control dynamically added attributes

Elasticsearch 7 removes the concept of type:

In a relational database, two tables are independent even if they have columns with the same name, but that is not the case in ES. Elasticsearch is a search engine built on Lucene, and fields with the same name under different types are ultimately stored the same way in Lucene.

To distinguish fields with the same name under different types, Lucene would have to handle the conflict, which degrades retrieval efficiency.

ES 7.x: the type parameter in the URL is optional.

ES 8.x: the type parameter in the URL is no longer supported.
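For reference, the 7.x typeless style uses _doc in place of a custom type (a sketch; the earlier examples in this article keep the older member/external/{id} form):

PUT member/_doc/1
{
  "name": "jay huang"
}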

For all types, please refer to the documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-types.html

Query an index's mapping

For example, query the mapping of my-index index

GET /my-index/_mapping

Returned result:
{
  "my-index": {
    "mappings": {
      "properties": {
        "age": {
          "type": "integer"
        },
        "email": {
          "type": "keyword"
        },
        "employee-id": {
          "type": "keyword",
          "index": false
        },
        "name": {
          "type": "text"
        }
      }
    }
  }
}

Create an index and specify a mapping

For example, create a my-index index with three fields, age, email, and name, whose types are integer, keyword, and text respectively.

PUT /my-index
{
  "mappings": {
    "properties": {
      "age": {
        "type": "integer"
      },
      "email": {
        "type": "keyword"
      },
      "name": {
        "type": "text"
      }
    }
  }
}

Returned result:
{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "my-index"
}

Add a new field mapping

For example, add an employee-id field to the my-index index and specify its type as keyword:

PUT /my-index/_mapping
{
  "properties": {
    "employee-id": {
      "type": "keyword",
      "index": false
    }
  }
}

Update Mapping

We cannot update existing mapping fields; instead we must create a new index and migrate the data.

Data migration

POST _reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter"
  }
}

VI. Chinese word segmentation

ES has many built-in tokenizers, but they are not friendly to Chinese, so we need a third-party Chinese word segmentation plugin.

6.1 The principle of word segmentation in ES

6.1.1 The concept of a tokenizer in ES

An ES tokenizer receives a character stream, splits it into individual tokens, and then outputs the token stream.

ES provides many built-in tokenizers, which can also be used to build custom analyzers.

6.1.2 How the standard tokenizer works

For example, the standard tokenizer splits text on spaces. A tokenizer is also responsible for recording the order or position of each term (used for phrase and word-proximity queries) and the character offsets of each word (used for highlighting search results).

6.1.3 Examples of English and punctuation tokenization

An example of a query is as follows:

POST _analyze
{
  "analyzer": "standard",
  "text": "Do you know why I want to study ELK? 2 3 33..."
}

Query results:

do, you, know, why, i, want, to, study, elk, 2, 3, 33

As can be seen from the query results:

(1) Punctuation marks are not turned into tokens.

(2) Numbers are tokenized.

English sentence segmentation

6.1.4 Examples of Chinese word segmentation

However, this tokenizer is not friendly to Chinese: it splits Chinese text into individual characters. For example, the phrase "Wukong chat architecture" below is split into the single characters Wu, Kong, chat, frame, structure, while the expected tokens are Wukong, chat, and architecture.

POST _analyze
{
  "analyzer": "standard",
  "text": "Wukong chat Architecture"
}

Chinese word Segmentation Wukong chat structure

We can install the ik tokenizer to support Chinese word segmentation much better.

6.2 Install the ik tokenizer

6.2.1 ik tokenizer download address

ik tokenizer address:

https://github.com/medcl/elasticsearch-analysis-ik/releases

Check the ES version first. The version installed here is 7.4.2, so we also choose 7.4.2 when installing the ik tokenizer.

http://192.168.56.10:9200/

{
  "name": "8448ec5f3312",
  "cluster_name": "elasticsearch",
  "cluster_uuid": "xC72O3nKSjWavYZ-EPt9Gw",
  "version": {
    "number": "7.4.2",
    "build_flavor": "default",
    "build_type": "docker",
    "build_hash": "2f90bbf7b93631e52bafb59b3b049cb44ec25e96",
    "build_date": "2019-10-28T20:40:44.881551Z",
    "build_snapshot": false,
    "lucene_version": "8.2.0",
    "minimum_wire_compatibility_version": "6.8.0",
    "minimum_index_compatibility_version": "6.0.0-beta1"
  },
  "tagline": "You Know, for Search"
}

6.2.2 How to install the ik tokenizer

6.2.2.1 Method 1: install the ik tokenizer inside the container

Enter the plugins directory inside the es container

docker exec -it elasticsearch /bin/bash

Download the ik tokenizer package

wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.4.2/elasticsearch-analysis-ik-7.4.2.zip

Extract the ik package

unzip elasticsearch-analysis-ik-7.4.2.zip

Delete the downloaded package

rm -rf *.zip

6.2.2.2 Method 2: install the ik tokenizer through the mapped folder

Go to the mapped plugins folder

cd /mydata/elasticsearch/plugins

Download the installation package

wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.4.2/elasticsearch-analysis-ik-7.4.2.zip

Extract the ik package

unzip elasticsearch-analysis-ik-7.4.2.zip

Delete the downloaded package

rm -rf *.zip

6.2.2.3 Method 3: upload the package to the mapped directory with Xftp

First use XShell to connect to the virtual machine (you can refer to the earlier article "02. Quickly build Linux environment - essential for operation and maintenance", http://www.jayh.club/#/05. Installation and deployment / 01.), and then use Xftp to copy the downloaded package to the virtual machine.

6.3 Unzip the ik tokenizer in the container

If the unzip decompression tool is not installed, install the unzip decompression tool.

apt install unzip

Extract the ik word splitter to the ik folder in the current directory.

Command format: unzip <package> -d <target directory>

Example:

unzip ELK-IKv7.4.2.zip -d ./ik

Decompress the ik word splitter

Modify folder permissions to be readable and writable.

chmod -R 777 ik/

Delete the ik tokenizer package

rm ELK-IKv7.4.2.zip

6.4 Check the installation of the ik tokenizer

Enter into the container

docker exec -it elasticsearch /bin/bash

List the installed Elasticsearch plugins

elasticsearch-plugin list

The result is as follows, which shows that the ik tokenizer is installed. Easy, isn't it?

ik

Then exit the Elasticsearch container and restart the Elasticsearch container

exit
docker restart elasticsearch

6.5 Use the ik Chinese tokenizer

The ik tokenizer has two modes:

Smart word segmentation mode (ik_smart)

Maximum-combination word segmentation mode (ik_max_word)

Let's look at the smart segmentation mode first. For example, segmenting the phrase "a little star" with ik_smart yields two tokens: "a" and "little star".

We enter the following query in Dev Tools Console

POST _analyze
{
  "analyzer": "ik_smart",
  "text": "a little star"
}

The following result is obtained: the text is segmented into "a" and "little star".

The result of word segmentation with a little star

Now let's look at the maximum-combination segmentation mode. Enter the following query statement.

POST _analyze
{
  "analyzer": "ik_max_word",
  "text": "a little star"
}

"A little star" is split into six tokens under ik_max_word, because this mode produces every possible word combination (for example both "little star" as a whole and its component words).

The result of word segmentation with a little star

Let's look at another Chinese phrase. For example, segment "Brother Wukong chats architecture", expecting three tokens: "Brother Wukong", "chats", and "architecture".

The actual result is four tokens: "Wu", "Kong brother", "chats", and "architecture". The ik tokenizer does not know "Brother Wukong" and instead treats "Kong brother" as a word. So we need to tell the ik tokenizer that "Brother Wukong" is a single word that should not be split. How do we do that?

Brother Wukong talks about structural participle

6.6 Custom word segmentation dictionary

6.6.1 Scheme for a custom dictionary

Scheme

Create a new dictionary file, then specify the path to that file in the ik tokenizer's configuration file. The path can be a local path or a remote server file path. Here we use the remote-server-file scheme because it supports hot updates: when the file on the server is updated, the ik tokenizer reloads the dictionary.

Modify the configuration file

The path of the configuration file of the ik word splitter in the container:

/usr/share/elasticsearch/plugins/ik/config/IKAnalyzer.cfg.xml

Because the plugins directory is mapped, you can also modify the file through the mapped path:

/mydata/elasticsearch/plugins/ik/config/IKAnalyzer.cfg.xml

Edit the configuration file:

vim /mydata/elasticsearch/plugins/ik/config/IKAnalyzer.cfg.xml

The contents of the configuration file are as follows:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Local extension dictionaries -->
    <entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic</entry>
    <!-- Local extension stop-word dictionary -->
    <entry key="ext_stopwords">custom/ext_stopword.dic</entry>
    <!-- Remote extension dictionary -->
    <entry key="remote_ext_dict">location</entry>
    <!-- Remote extension stop-word dictionary -->
    <entry key="remote_ext_stopwords">http://xxx.com/xxx.dic</entry>
</properties>

Modify the property value of the configuration remote_ext_dict to specify the path to a remote Web site file, such as http://www.xxx.com/ikwords.text.

Here we can build our own nginx environment, and then put ikwords.text into the nginx root directory.

6.6.2 Build an nginx environment

Solution: first pull the nginx image and start an nginx container, then copy the nginx configuration files out to the host, delete the original nginx container, and finally start a new nginx container with the folders mapped.

Install the nginx environment through the docker container.

docker run -p 80:80 --name nginx -d nginx:1.10

Copy the configuration file of the nginx container to the conf folder of the mydata directory

cd /mydata
docker container cp nginx:/etc/nginx ./conf

Create a nginx directory in the mydata directory

mkdir nginx

Move the conf folder to the nginx mapping folder

mv conf nginx/

Terminate and delete the original nginx container

docker stop nginx
docker rm nginx

Start a new container

docker run -p 80:80 --name nginx \
  -v /mydata/nginx/html:/usr/share/nginx/html \
  -v /mydata/nginx/logs:/var/log/nginx \
  -v /mydata/nginx/conf:/etc/nginx \
  -d nginx:1.10

Access the nginx service

192.168.56.10

It returns 403 Forbidden with nginx/1.10.3, which shows that the nginx service started normally. The error appears because there is no page file under the nginx html directory yet.

Create a new html file in the nginx directory

cd /mydata/nginx/html
vim index.html

File content:
hello passjava

Access the nginx service again

The browser prints hello passjava, which shows that pages served by nginx can be accessed without problems.

Create the ik dictionary file

cd /mydata/nginx/html
mkdir ik
cd ik
vim ik.txt

Type "Brother Wukong" into the file and save it.

Access the dictionary file

http://192.168.56.10/ik/ik.txt

The browser outputs some garbled characters (an encoding issue that can be ignored for now), which shows that the dictionary file is accessible.

Modify the ik tokenizer configuration

cd /mydata/elasticsearch/plugins/ik/config
vim IKAnalyzer.cfg.xml

Modify ik word splitter configuration
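The modified file only appears as a screenshot in the original; the key change is the remote_ext_dict entry, which, assuming the nginx address above, would look roughly like this:

<entry key="remote_ext_dict">http://192.168.56.10/ik/ik.txt</entry>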

Restart the elasticsearch container and set the elasticsearch container to start each time the machine is rebooted.

docker restart elasticsearch
docker update elasticsearch --restart=always

Query the result of word segmentation again
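The check itself only appears as a screenshot in the original; it is the same kind of _analyze request used earlier, for example (shown with the article's translated wording of the Chinese phrase):

POST _analyze
{
  "analyzer": "ik_max_word",
  "text": "Brother Wukong chats architecture"
}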

We can see that "Brother Wukong chats architecture" is now segmented into three tokens: "Brother Wukong", "chats", and "architecture", which shows that the custom dictionary entry for "Brother Wukong" takes effect.

The result of word segmentation after a custom thesaurus

At this point, you should have a deeper understanding of the knowledge points of Elasticsearch covered here; the best next step is to try them out in practice.



