Implementation of docking between Elasticsearch and Python

2025-02-14 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

This article mainly explains how to connect Elasticsearch with Python. The explanation is simple, clear, and easy to follow. Please read along to learn how the Elasticsearch-Python integration works.

What is Elasticsearch?

Whenever you want to look up data, you need search, and search relies on a search engine. Baidu and Google are very large and complex search engines that index almost all the open pages and data on the Internet. For our own business data, however, such complex technology is unnecessary. If we want our own search capability for convenient storage and retrieval, Elasticsearch is an excellent choice: it is a full-text search engine that can quickly store, search, and analyze huge amounts of data.

Why use Elasticsearch?

Elasticsearch is an open source search engine based on Apache Lucene ™, a full-text search engine library.

And what is Lucene? Lucene may be the most advanced, high-performance, and full-featured search engine library in existence, open source or proprietary, but it is only a library. To use Lucene directly, we have to write Java code that references the Lucene packages, and we need some understanding of information retrieval to see how Lucene works, so it is not that easy to use.

Elasticsearch was born to solve this problem. Elasticsearch is also written in Java and uses Lucene internally for indexing and searching, but its goal is to make full-text retrieval simple. It is, in effect, a layer of encapsulation over Lucene that provides a simple, consistent set of RESTful APIs for storage and retrieval.

So is Elasticsearch just a simplified wrapper around Lucene? Far from it. Elasticsearch is not just Lucene, and it is not just a full-text search engine. It can be accurately described as follows:

A distributed real-time document storage where each field can be indexed and searched

A distributed real-time analysis search engine

Capable of scaling to hundreds of service nodes and supporting structured or unstructured data at the PB level

In short, it is a very powerful search engine; Wikipedia, Stack Overflow, and GitHub all use it for search.

Installation of Elasticsearch

We can download Elasticsearch from its official website at https://www.elastic.co/downloads/elasticsearch, which also contains installation instructions.

First download and extract the installation package, then run bin/elasticsearch (Mac or Linux) or bin\elasticsearch.bat (Windows) to start Elasticsearch.

On a Mac, I personally recommend installing with Homebrew:

brew install elasticsearch

Elasticsearch runs on port 9200 by default. If we open http://localhost:9200/ in a browser, we can see something like this:

{
  "name": "atntrTf",
  "cluster_name": "elasticsearch",
  "cluster_uuid": "e64hkjGtTp6_G2h2Xxdv5g",
  "version": {
    "number": "6.2.4",
    "build_hash": "ccec39f",
    "build_date": "2018-04-12T20:37:28.497551Z",
    "build_snapshot": false,
    "lucene_version": "7.2.1",
    "minimum_wire_compatibility_version": "5.6.0",
    "minimum_index_compatibility_version": "5.0.0"
  },
  "tagline": "You Know, for Search"
}

If you see this content, Elasticsearch has been installed and started successfully. It shows that my Elasticsearch version is 6.2.4. The version matters: some plug-ins installed later must match the Elasticsearch version exactly.

Next, let's take a look at the basic concepts of Elasticsearch and its docking with Python.

Related concepts of Elasticsearch

There are several basic concepts in Elasticsearch, such as nodes, indexes, documents, and so on. Understanding these concepts is very helpful to familiarize yourself with Elasticsearch.

Node and Cluster

Elasticsearch is essentially a distributed database that allows multiple servers to work together, and each server can run multiple Elasticsearch instances.

A single Elasticsearch instance is called a Node. A group of nodes forms a Cluster.

Index

Elasticsearch indexes all fields, writing an inverted index (Inverted Index) after processing the data. When querying, it looks up this index directly.

Therefore, the top-level unit of Elasticsearch data management is called Index (index), which is actually equivalent to the concept of database in MySQL, MongoDB and so on. It is also worth noting that the name of each Index (that is, the database) must be lowercase.
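The inverted index idea mentioned above can be illustrated with a few lines of pure Python. This is only a toy sketch of the concept, not how Lucene actually stores its data:

```python
from collections import defaultdict

# Toy inverted index: map each term to the set of document ids containing it.
docs = {
    1: "elasticsearch makes full text search simple",
    2: "lucene is a full featured search library",
    3: "python talks to elasticsearch over http",
}

inverted = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        inverted[term].add(doc_id)

# Looking up a term is now a direct dictionary access instead of a scan.
print(sorted(inverted["search"]))                     # -> [1, 2]
print(sorted(inverted["full"] & inverted["search"]))  # AND query -> [1, 2]
```

A real engine also stores term positions and frequencies so it can rank results, but the lookup structure is the same in spirit.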

Document

A single record in Index is called Document (document). Many Document make up an Index.

A Document is represented in JSON format.
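For example, a news Document like the ones inserted later in this article can be built as a plain Python dict and serialized to JSON (the field names match the examples below):

```python
import json

# A Document is just a JSON object: field names mapped to values.
document = {
    'title': 'Ministry of Public Security: school buses will enjoy the highest right of way',
    'url': 'http://www.chinanews.com/gn/2011/12-16/3536077.shtml',
    'date': '2011-12-16'
}
print(json.dumps(document, indent=2, ensure_ascii=False))
```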

Documents in the same Index are not required to have the same structure (Schema), but keeping it consistent helps improve search efficiency.

Type

Document can be grouped, for example, in the Index of weather, it can be grouped by city (Beijing and Shanghai) or by climate (sunny and rainy). This kind of grouping is called Type, which is a virtual logical grouping used to filter Document, similar to the data table in MySQL and the Collection in MongoDB.

Different Type should have similar structures (Schema). For example, the id field cannot be a string in this group and a numeric value in another group. This is a difference from tables in a relational database. Data that is completely different in nature (such as products and logs) should be stored as two Index, rather than two Type in the same Index (although it can be done).

According to Elastic's plan, version 6.x allows only one Type per Index, and version 7.x will remove Type completely.

Fields

Fields are just that: fields. Each Document is similar to a JSON structure containing many fields, each with its corresponding value, and multiple fields make up a Document. They can be compared to the columns of a MySQL table.

In Elasticsearch, documents belong to a type (Type), and these types exist in the Index (index), so we can draw some simple comparison diagrams to compare traditional relational databases:

Relational DB-> Databases-> Tables-> Rows-> Columns

Elasticsearch-> Indices-> Types-> Documents-> Fields

These are some of the basic concepts in Elasticsearch, which are more helpful to understand by comparing with relational databases.

Python docking Elasticsearch

Elasticsearch actually provides a series of RESTful APIs for access and query operations, and we can drive them with commands such as curl. The command line is not so convenient, however, so here we introduce the relevant methods for interfacing with Elasticsearch from Python.

Interfacing with Elasticsearch from Python uses a library of the same name, and installation is very simple:

pip3 install elasticsearch

The official documentation is at https://elasticsearch-py.readthedocs.io/, where all usages can be found; the rest of this article is also based on the official documentation.

Create Index

Let's first look at how to create an index (Index). Here we create an index called news:

from elasticsearch import Elasticsearch

es = Elasticsearch()
result = es.indices.create(index='news', ignore=400)
print(result)

If the creation is successful, the following results are returned:

{'acknowledged': True, 'shards_acknowledged': True, 'index': 'news'}

The returned result is in JSON format, where the acknowledged field indicates that the creation operation was executed successfully.

But at this point, if we execute the code again, we will return the following result:

{'error': {'root_cause': [{'type': 'resource_already_exists_exception', 'reason': 'index [news/QM6yz2W8QE-bflKhc5oThw] already exists', 'index_uuid': 'QM6yz2W8QE-bflKhc5oThw', 'index': 'news'}], 'type': 'resource_already_exists_exception', 'reason': 'index [news/QM6yz2W8QE-bflKhc5oThw] already exists', 'index_uuid': 'QM6yz2W8QE-bflKhc5oThw', 'index': 'news'}, 'status': 400}

This indicates that the creation failed with a status code of 400, because the Index already exists.

Note that our code passes ignore=400, which means that if the return status is 400, the error will be ignored and the program will not throw an exception.

If we don't add the parameter ignore:

es = Elasticsearch()
result = es.indices.create(index='news')
print(result)

If you execute it again, you will get an error:

raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
elasticsearch.exceptions.RequestError: TransportError(400, 'resource_already_exists_exception', 'index [news/QM6yz2W8QE-bflKhc5oThw] already exists')

This would interrupt the program, so we need to make good use of the ignore parameter to handle such expected situations and keep the program running without interruption.
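To make the behaviour concrete, the effect of ignore can be emulated in plain Python. This is only a conceptual sketch: TransportError, call_with_ignore, and create_existing_index below are stand-ins I made up for illustration, not part of the elasticsearch client.

```python
class TransportError(Exception):
    # Stand-in for the client's transport error: carries status and body.
    def __init__(self, status, body):
        super().__init__(status)
        self.status, self.body = status, body

def call_with_ignore(func, ignore=()):
    # Mimic the ignore= option: return the error body instead of raising
    # when the response status code is in the ignore list.
    try:
        return func()
    except TransportError as exc:
        if exc.status in ignore:
            return exc.body
        raise

def create_existing_index():
    # Hypothetical call that always fails with 400 (index already exists).
    raise TransportError(400, {'status': 400,
                               'error': 'resource_already_exists_exception'})

result = call_with_ignore(create_existing_index, ignore=(400,))
print(result['status'])  # 400, but no exception was raised
```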

Delete Index

Deleting an Index is similar, with the following code:

from elasticsearch import Elasticsearch

es = Elasticsearch()
result = es.indices.delete(index='news', ignore=[400, 404])
print(result)

The ignore parameter is used here as well, so that if the Index does not exist, the failed deletion does not interrupt the program.

If the deletion is successful, the following result is output:

{'acknowledged': True}

If the Index has been deleted, performing the deletion will output the following result:

{'error': {'root_cause': [{'type': 'index_not_found_exception', 'reason': 'no such index', 'resource.type': 'index_or_alias', 'resource.id': 'news', 'index_uuid': '_na_', 'index': 'news'}], 'type': 'index_not_found_exception', 'reason': 'no such index', 'resource.type': 'index_or_alias', 'resource.id': 'news', 'index_uuid': '_na_', 'index': 'news'}, 'status': 404}

This result shows that the current Index does not exist and the deletion failed. The returned result is also JSON, with a status code of 404. However, because the ignore parameter covers the 404 status code, the program normally outputs the JSON result instead of throwing an exception.

Insert data

Like MongoDB, Elasticsearch can insert structured dictionary data directly. The create() method is used to insert data. For example, here we insert a piece of news data:

from elasticsearch import Elasticsearch

es = Elasticsearch()
es.indices.create(index='news', ignore=400)

data = {'title': 'Is it a mess left by the United States to Iraq?', 'url': 'http://view.news.qq.com/zt2011/usa_iraq/index.htm'}
result = es.create(index='news', doc_type='politics', id=1, body=data)
print(result)

Here we first declare a piece of news data, including the title and link, and then insert this data by calling the create () method. When we call the create () method, we pass in four parameters, the index parameter represents the index name, doc_type represents the document type, body represents the specific content of the document, and id is the unique identification ID of the data.

The running results are as follows:

{'_index': 'news', '_type': 'politics', '_id': '1', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 0, '_primary_term': 1}

The result field in the result is created, which means that the data has been inserted successfully.

In addition, we can also use the index() method to insert data. Unlike create(), which requires us to specify an id to uniquely identify the data, index() does not; if no id is specified, one is generated automatically. The index() method is used as follows:

es.index(index='news', doc_type='politics', body=data)

In fact, create() calls index() internally; it is a wrapper around the index() method.

Update data

Updating data is also very simple. We specify the id of the data and the new content, and call the update() method as follows:

from elasticsearch import Elasticsearch

es = Elasticsearch()
data = {
    'title': 'Is it a mess left by the United States to Iraq?',
    'url': 'http://view.news.qq.com/zt2011/usa_iraq/index.htm',
    'date': '2011-12-16'
}
# Note: depending on your client and server version, the update body may
# need to be wrapped as {'doc': data}.
result = es.update(index='news', doc_type='politics', body=data, id=1)
print(result)

Here we add a date field to the data, and then call the update () method, and the result is as follows:

{'_index': 'news', '_type': 'politics', '_id': '1', '_version': 2, 'result': 'updated', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 1, '_primary_term': 1}

In the returned result, the result field is updated, which means the update succeeded. Notice also the _version field, which represents the version number after the update. The data was inserted once before as version 1 (visible in the earlier running result), so this update brings the version number to 2, and every subsequent update will increase it by 1.

In addition, the update operation can also be done using the index () method, which is written as follows:

es.index(index='news', doc_type='politics', body=data, id=1)

As you can see, the index() method handles both operations for us: if the data does not exist it performs an insert, and if it already exists it performs an update, which is very convenient.
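The insert-or-update behaviour can be pictured with a plain dictionary. This is a conceptual sketch only; index_doc and store are made-up stand-ins, not the client API:

```python
store = {}  # toy stand-in for an index: document id -> document body

def index_doc(store, doc_id, body):
    # Mimic index(): insert when the id is new, overwrite when it exists,
    # and report which of the two happened.
    result = 'updated' if doc_id in store else 'created'
    store[doc_id] = dict(body)
    return {'_id': doc_id, 'result': result}

print(index_doc(store, 1, {'title': 'first'})['result'])   # created
print(index_doc(store, 1, {'title': 'second'})['result'])  # updated
```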

Delete data

If you want to delete a piece of data, you can call the delete () method and specify the id of the data to be deleted, as follows:

from elasticsearch import Elasticsearch

es = Elasticsearch()
result = es.delete(index='news', doc_type='politics', id=1)
print(result)

The running results are as follows:

{'_index': 'news', '_type': 'politics', '_id': '1', '_version': 3, 'result': 'deleted', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 2, '_primary_term': 1}

The result field in the output is deleted, which means the deletion succeeded, and _version has again increased by 1, to 3.

Query data

The operations above are all simple ones that an ordinary database such as MongoDB can also perform, so they may not seem like a big deal. What makes Elasticsearch special is its extremely powerful retrieval capability.

For Chinese text, we need to install a word segmentation plug-in. Here we use elasticsearch-analysis-ik (GitHub: https://github.com/medcl/elasticsearch-analysis-ik) and install it with elasticsearch-plugin, another command line tool shipped with Elasticsearch. The version installed here is 6.2.4; make sure it matches your Elasticsearch version. The command is as follows:

elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.2.4/elasticsearch-analysis-ik-6.2.4.zip

Please replace the version number here with your Elasticsearch version number.

Just restart Elasticsearch after installation, and it will automatically load the installed plug-ins.
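To see why a dedicated Chinese analyzer matters, compare naive per-character splitting with word-level segmentation. The toy forward-maximum-match segmenter below hard-codes a tiny word list for illustration; real segmenters like ik use large dictionaries and more sophisticated matching:

```python
# Hypothetical tiny dictionary; ik ships real ones with many entries.
DICTIONARY = {"中国", "领事馆", "中国领事馆"}

def segment(text, max_len=5):
    # Forward maximum matching: take the longest dictionary word at each
    # position, falling back to a single character when nothing matches.
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + length]
            if length == 1 or word in DICTIONARY:
                tokens.append(word)
                i += length
                break
    return tokens

text = "中国领事馆"
print(list(text))     # per-character: ['中', '国', '领', '事', '馆']
print(segment(text))  # word-level:    ['中国领事馆']
```

Per-character tokens match almost any query containing those characters, producing noisy results; word-level tokens keep matches meaningful.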

First, we create a new index and specify the fields for which word segmentation is required. The code is as follows:

from elasticsearch import Elasticsearch

es = Elasticsearch()
mapping = {
    'properties': {
        'title': {
            'type': 'text',
            'analyzer': 'ik_max_word',
            'search_analyzer': 'ik_max_word'
        }
    }
}
es.indices.delete(index='news', ignore=[400, 404])
es.indices.create(index='news', ignore=400)
result = es.indices.put_mapping(index='news', doc_type='politics', body=mapping)
print(result)

Here we first delete the previous index, create a new one, and then update its mapping information. The mapping specifies the field to be segmented, sets its field type to text, and sets both the analyzer and the search_analyzer to ik_max_word, namely the Chinese word segmentation plug-in we just installed. If not specified, the default English-oriented analyzer is used.

Next, we insert a few new pieces of data:

datas = [
    {
        'title': 'Is it a mess left by the United States to Iraq?',
        'url': 'http://view.news.qq.com/zt2011/usa_iraq/index.htm',
        'date': '2011-12-16'
    },
    {
        'title': 'Ministry of Public Security: school buses will enjoy the highest right of way',
        'url': 'http://www.chinanews.com/gn/2011/12-16/3536077.shtml',
        'date': '2011-12-16'
    },
    {
        'title': 'Investigation into the China-South Korea fishing police conflict: South Korean police detain an average of one Chinese fishing boat a day',
        'url': 'https://news.qq.com/a/20111216/001044.htm',
        'date': '2011-12-17'
    },
    {
        'title': 'The Chinese consulate in Los Angeles was shot at by an Asian man; the suspect has turned himself in',
        'url': 'http://news.ifeng.com/world/detail_2011_12/16/11372558_0.shtml',
        'date': '2011-12-18'
    }
]

for data in datas:
    es.index(index='news', doc_type='politics', body=data)

Here we specify four pieces of data, all with title, url, and date fields, and then insert them into the Elasticsearch through the index () method, with the index name news and type politics.

Next, let's query the relevant content according to the keywords:

result = es.search(index='news', doc_type='politics')
print(result)

You can see that the query produced all four inserted pieces of data:

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 1.0,
    "hits": [
      {
        "_index": "news",
        "_type": "politics",
        "_id": "c05G9mQBD9BuE5fdHOUT",
        "_score": 1.0,
        "_source": {
          "title": "Is it a mess left by the United States to Iraq?",
          "url": "http://view.news.qq.com/zt2011/usa_iraq/index.htm",
          "date": "2011-12-16"
        }
      },
      {
        "_index": "news",
        "_type": "politics",
        "_id": "dk5G9mQBD9BuE5fdHOUm",
        "_score": 1.0,
        "_source": {
          "title": "The Chinese consulate in Los Angeles was shot at by an Asian man; the suspect has turned himself in",
          "url": "http://news.ifeng.com/world/detail_2011_12/16/11372558_0.shtml",
          "date": "2011-12-18"
        }
      },
      {
        "_index": "news",
        "_type": "politics",
        "_id": "dU5G9mQBD9BuE5fdHOUj",
        "_score": 1.0,
        "_source": {
          "title": "Investigation into the China-South Korea fishing police conflict: South Korean police detain an average of one Chinese fishing boat a day",
          "url": "https://news.qq.com/a/20111216/001044.htm",
          "date": "2011-12-17"
        }
      },
      {
        "_index": "news",
        "_type": "politics",
        "_id": "dE5G9mQBD9BuE5fdHOUf",
        "_score": 1.0,
        "_source": {
          "title": "Ministry of Public Security: school buses will enjoy the highest right of way",
          "url": "http://www.chinanews.com/gn/2011/12-16/3536077.shtml",
          "date": "2011-12-16"
        }
      }
    ]
  }
}

You can see that the returned result appears in the hits field, where the total field indicates the number of result entries of the query, and max_score represents the maximum matching score.
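In practice you usually only care about the hits list. A small helper can flatten a response into (score, title) pairs; here it is run against a trimmed-down sample dict shaped like a search response, so it works without a running server:

```python
def extract_hits(response):
    """Pull (score, title) pairs out of an Elasticsearch search response."""
    return [(hit['_score'], hit['_source']['title'])
            for hit in response['hits']['hits']]

# Trimmed-down sample shaped like a real search response.
sample = {
    'hits': {
        'total': 2,
        'max_score': 1.0,
        'hits': [
            {'_score': 1.0, '_source': {'title': 'first title'}},
            {'_score': 0.5, '_source': {'title': 'second title'}},
        ],
    }
}
for score, title in extract_hits(sample):
    print(score, title)
```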

In addition, we can also conduct full-text search, which is where the characteristics of Elasticsearch search engine are reflected:

import json

from elasticsearch import Elasticsearch

dsl = {
    'query': {
        'match': {
            'title': 'Chinese consulate'
        }
    }
}

es = Elasticsearch()
result = es.search(index='news', doc_type='politics', body=dsl)
print(json.dumps(result, indent=2, ensure_ascii=False))

Here, we use the DSL statement supported by Elasticsearch to query, and use match to specify full-text search. The field to be searched is title, and the content is "Chinese Consulate". The search results are as follows:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 2.546152,
    "hits": [
      {
        "_index": "news",
        "_type": "politics",
        "_id": "dk5G9mQBD9BuE5fdHOUm",
        "_score": 2.546152,
        "_source": {
          "title": "The Chinese consulate in Los Angeles was shot at by an Asian man; the suspect has turned himself in",
          "url": "http://news.ifeng.com/world/detail_2011_12/16/11372558_0.shtml",
          "date": "2011-12-18"
        }
      },
      {
        "_index": "news",
        "_type": "politics",
        "_id": "dU5G9mQBD9BuE5fdHOUj",
        "_score": 0.2876821,
        "_source": {
          "title": "Investigation into the China-South Korea fishing police conflict: South Korean police detain an average of one Chinese fishing boat a day",
          "url": "https://news.qq.com/a/20111216/001044.htm",
          "date": "2011-12-17"
        }
      }
    ]
  }
}

Here there are two matching results: the first scores 2.546152 and the second 0.2876821. This is because the first matching document contains both "China" and "consulate", while the second contains "China" but not "consulate", so it is still retrieved but with a much lower score.

Therefore, retrieval performs a full-text search over the specified fields and sorts the results by their relevance to the search keywords, which is the basic prototype of a search engine.
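The ranking behaviour can be mimicked crudely by counting how many query terms each document contains. Real scoring uses TF-IDF/BM25, so this is only a toy illustration of why the first hit outranks the second:

```python
def toy_score(query_terms, doc_terms):
    # Crude stand-in for BM25: one point per query term found in the document.
    return sum(1 for term in query_terms if term in set(doc_terms))

query = ["中国", "领事馆"]                       # "China", "consulate"
doc_a = ["洛杉矶", "中国", "领事馆", "遭", "枪击"]  # contains both terms
doc_b = ["中国", "渔船", "被", "韩国", "扣押"]      # contains one term

print(toy_score(query, doc_a))  # 2 -> ranked first
print(toy_score(query, doc_b))  # 1 -> ranked second, lower score
```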

Thank you for reading. That concludes "the implementation of docking between Elasticsearch and Python". After studying this article, I believe you have a deeper understanding of integrating Elasticsearch with Python; the specific usage still needs to be verified in practice.
