What are the methods of manipulating Elasticsearch documents


2025-07-04 Update. From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article explains the methods of operating on Elasticsearch documents. The methods introduced here are simple, fast, and practical.

# Chapter 1

# Interacting with Elasticsearch

Node client

The node client joins the cluster as a non-data node. In other words, it does not store any data itself, but it knows where the data lives in the cluster and can forward requests directly to the corresponding node.

Transport client

The lighter transport client sends requests to a remote cluster. It does not join the cluster itself; it simply forwards requests to the nodes in the cluster.

RESTful API over HTTP, with JSON as the data exchange format

All other programming languages can communicate with Elasticsearch through the RESTful API on port 9200.
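As a rough illustration of talking to the REST API from another language, the following Python sketch builds an HTTP request with the standard library only. The helper name `build_request` and the `http://localhost:9200` default host are assumptions made for this example; nothing is sent until the request is passed to `urllib.request.urlopen`.

```python
import json
import urllib.request


def build_request(method, path, body=None, host="http://localhost:9200"):
    """Build (but do not send) an HTTP request for the Elasticsearch REST API.

    `host` is an assumed local address; adjust it for a real cluster.
    """
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(host + path, data=data, headers=headers, method=method)


# Build a GET request for one document; urllib.request.urlopen(req) would send it.
req = build_request("GET", "/website/blog/123")
print(req.get_method(), req.full_url)
```

Keeping request construction separate from sending makes the sketch easy to inspect without a running cluster.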

# Comparison

Relational DB -> Databases -> Tables -> Rows -> Columns

Elasticsearch -> Indices -> Types -> Documents -> Fields

# Chapter 2

# Concepts

Index: a place where related data is stored. In fact, an index is just a "logical namespace" that points to one or more shards.

A shard is a low-level "worker unit" that holds only a portion of all the data in the index. A shard is an instance of Lucene and is a complete search engine in its own right. Our documents are stored and indexed in shards, but our applications do not talk to shards directly; they talk to the index.

The maximum capacity of a shard depends entirely on your use case: the hardware, the size and complexity of your documents, how you index and query them, and your expected response times. For a real project, determine it by stress testing. The number of primary shards is fixed when the index is created, and it defines the maximum amount of data the index can hold. However, both primary and replica shards can serve read requests (searches and document retrievals), so the more copies of the data you have, the more search throughput you can handle.

If a restarted machine still holds copies of old shards, it will try to reuse them, copying from the primary shard only the data that changed while the machine was down.

# Chapter 3

Whatever program we write, the intention is the same: to organize data in a way that serves our purposes. But data does not consist of just random bits and bytes. We build relationships between data elements in order to represent entities, or things that exist in the real world. A name and an email address have more meaning if we know that they belong to the same person.

# What is a document?

A document is a JSON object made up of key-value pairs. The key is the name of a field or property. The value can be a string, a number, a Boolean, another object, an array of values, or some other specialized type, such as a string representing a date or an object representing a geographic location.

In Elasticsearch, the term document has a specific meaning. It refers to the top-level structure or root object, serialized to JSON, identified by a unique ID, and stored in Elasticsearch.

A document contains more than just data. It also has metadata: _index, _type, and _id.

Retrieve part of a document (the response contains the metadata and only the requested _source fields):

GET /website/blog/123?_source=title,text

If you want only the _source field without any metadata, request:

GET /website/blog/123/_source
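The two request forms above differ only in how the path is assembled. A minimal Python sketch of that assembly follows; the function name `get_path` and its parameters are hypothetical helpers for this example, not part of any Elasticsearch client.

```python
from urllib.parse import quote


def get_path(index, doc_type, doc_id, source_fields=None, source_only=False):
    """Build the path for retrieving a document, optionally limiting _source."""
    path = "/" + "/".join(quote(part, safe="") for part in (index, doc_type, doc_id))
    if source_only:
        # Only the _source field, without metadata.
        return path + "/_source"
    if source_fields:
        # Metadata plus the listed _source fields.
        return path + "?_source=" + ",".join(quote(f, safe="") for f in source_fields)
    return path


partial_path = get_path("website", "blog", "123", source_fields=["title", "text"])
source_path = get_path("website", "blog", "123", source_only=True)
print(partial_path)
print(source_path)
```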

Check whether a document exists:

curl -i -XHEAD http://localhost:9200/website/blog/123

Elasticsearch returns a 200 OK status if the document exists and 404 Not Found if it does not. Of course, this only tells you that the document did not exist at the moment of the query; another process may create it a few milliseconds later.
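The same existence check can be sketched in Python with the standard library. The function names and the `http://localhost:9200` host are assumptions for this example; the `opener` parameter exists only so the logic can be exercised without a running cluster.

```python
import urllib.error
import urllib.request


def head_request(path, host="http://localhost:9200"):
    """Build a HEAD request for a document path (host is an assumed local address)."""
    return urllib.request.Request(host + path, method="HEAD")


def document_exists(path, opener=urllib.request.urlopen):
    """Return True on 200 OK and False on 404 Not Found; re-raise other HTTP errors."""
    try:
        with opener(head_request(path)) as response:
            return response.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise
```

Because the answer can be stale a few milliseconds later, a caller that needs "create only if absent" semantics should not rely on this check alone.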

When we supply our own _id, we must tell Elasticsearch to accept the request only if no document with the same _index, _type, and _id already exists. There are two ways to do this:

PUT /website/blog/123?op_type=create

PUT /website/blog/123/_create

If the request succeeds, Elasticsearch returns the usual metadata and a 201 Created status code. If a document with the same _index, _type, and _id already exists, Elasticsearch returns a 409 Conflict status code. The conflict is caused by the create parameter: without it, the request simply updates the document, and the only difference in the response is that created is false. Internally, Elasticsearch marks the old document as deleted and adds an entirely new document. The old version does not disappear immediately, but you can no longer access it. Elasticsearch cleans up deleted documents as you continue to index more data.
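The difference between a plain index request and a create request can be modeled with a toy in-memory store. This is only a sketch of the status codes described above; `FakeIndex` is a hypothetical class, not an Elasticsearch client, and its response bodies are simplified.

```python
class FakeIndex:
    """Toy in-memory stand-in for one index; models status codes only."""

    def __init__(self):
        self.docs = {}  # _id -> (version, source)

    def index(self, doc_id, source, op_type="index"):
        exists = doc_id in self.docs
        if op_type == "create" and exists:
            # create refuses to overwrite an existing document
            return 409, {"error": "document already exists"}
        version = self.docs[doc_id][0] + 1 if exists else 1
        self.docs[doc_id] = (version, source)
        status = 200 if exists else 201
        return status, {"_id": doc_id, "_version": version, "created": not exists}


idx = FakeIndex()
first = idx.index("123", {"title": "My first blog entry"}, op_type="create")
second = idx.index("123", {"title": "My first blog entry"}, op_type="create")
overwrite = idx.index("123", {"title": "Edited"})
print(first[0], second[0], overwrite[0], overwrite[1]["created"])
```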

Delete a document

DELETE /website/blog/123

If the document is found, Elasticsearch returns a 200 OK status code, and the _version in the response body has been incremented. If the document is not found, we get a 404 Not Found status code and "found" is false in the response. Even though the document does not exist, _version is still incremented. This is part of the internal bookkeeping that ensures changes are applied in the correct order across multiple nodes.

Version control

Pessimistic concurrency control

Widely used by relational databases, this approach assumes that conflicting changes are likely to happen, so it blocks access to a resource to prevent conflicts. A typical example is locking a row before reading it, which ensures that only the thread that placed the lock can modify that row.

Optimistic concurrency control

Used by Elasticsearch, this approach assumes that conflicts are unlikely and does not block operations. However, if the underlying data has been modified between reading and writing, the update fails. It is then up to the application to decide how to resolve the conflict: it can retry the update with fresh data, or report the situation to the user.

We can use _version to ensure that conflicting changes do not cause data to be lost.

For example, let's create a new blog post:

PUT /website/blog/1/_create
{"title": "My first blog entry", "text": "Just trying this out..."}

The response tells us that this is a newly created document and that its _version is 1. Now suppose we want to edit the document: we load its data into a web form, make our changes, and save the new version.

First, we retrieve the document:

GET /website/blog/1

Now, when we save our changes by reindexing the document, we specify the version that our changes apply to:

PUT /website/blog/1?version=1
{"title": "My first blog entry", "text": "Starting to get the hang of this..."}

We want this update to succeed only if the current _version of the document is 1.
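The version check can be illustrated with a small in-memory model. Everything in this sketch is hypothetical: `VersionedStore`, `VersionConflict`, and `update_with_retry` are stand-ins written for this example, with `VersionConflict` playing the role of a rejected stale write. The retry helper shows one of the conflict-handling options mentioned earlier: re-read the document and try again.

```python
class VersionConflict(Exception):
    """Stands in for a write rejected because the version was stale."""


class VersionedStore:
    """Toy in-memory document store; not an Elasticsearch client."""

    def __init__(self):
        self._docs = {}  # _id -> (version, source)

    def get(self, doc_id):
        version, source = self._docs[doc_id]
        return version, dict(source)

    def put(self, doc_id, source, version=None):
        current = self._docs[doc_id][0] if doc_id in self._docs else 0
        # Reject the write if the caller's version is stale.
        if version is not None and version != current:
            raise VersionConflict(f"expected version {version}, found {current}")
        self._docs[doc_id] = (current + 1, dict(source))
        return current + 1


def update_with_retry(store, doc_id, change, retries=3):
    """Re-read the document and retry when another writer got there first."""
    for _ in range(retries + 1):
        version, source = store.get(doc_id)
        source.update(change)
        try:
            return store.put(doc_id, source, version=version)
        except VersionConflict:
            continue
    raise VersionConflict(f"gave up after {retries} retries")


store = VersionedStore()
store.put("1", {"title": "My first blog entry", "text": "Just trying this out..."})
version, post = store.get("1")
post["text"] = "Starting to get the hang of this..."
new_version = store.put("1", post, version=version)  # version matches, write accepted

stale_write_rejected = False
try:
    store.put("1", post, version=version)  # the version we read is now stale
except VersionConflict:
    stale_write_rejected = True

retried_version = update_with_retry(store, "1", {"views": 1})
```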

Partial update of a document

POST /website/blog/1/_update
{"doc": {"tags": ["testing"], "views": 0}}
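Conceptually, the doc body is merged into the existing _source. The article does not spell out the merge rules, so the sketch below assumes that nested objects are merged recursively while scalars and arrays are overwritten; `merge_doc` is a hypothetical helper, not an Elasticsearch function.

```python
def merge_doc(source, partial):
    """Return a copy of `source` with `partial` merged in.

    Assumption: nested objects merge recursively; scalars and arrays are replaced.
    """
    merged = dict(source)
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_doc(merged[key], value)
        else:
            merged[key] = value
    return merged


original = {"title": "My first blog entry", "text": "Starting to get the hang of this..."}
updated = merge_doc(original, {"tags": ["testing"], "views": 0})
print(updated)
```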

Retrieve multiple documents

Use the mget (multi-get) API.
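The article gives no mget example, so here is a hedged sketch of how a request body for /_mget might be assembled in Python. It assumes the body is a JSON object with a docs array whose entries name _index, _type, and _id; the helper name and the sample document IDs are made up for illustration.

```python
import json


def build_mget_body(docs):
    """docs: iterable of (index, type, id) tuples -> JSON body for an mget request."""
    return json.dumps(
        {"docs": [{"_index": i, "_type": t, "_id": d} for i, t, d in docs]}
    )


# Hypothetical IDs, reusing the index and type names from this article.
mget_body = build_mget_body([("website", "blog", "1"), ("website", "blog", "123")])
print(mget_body)
```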

More time-efficient batch operations: the bulk API

POST /_bulk
{"delete": {"_index": "website", "_type": "blog", "_id": "123"}}
{"create": {"_index": "website", "_type": "blog", "_id": "123"}}
{"title": "My first blog post"}
{"index": {"_index": "website", "_type": "blog"}}
{"title": "My second blog post"}
{"update": {"_index": "website", "_type": "blog", "_id": "123", "_retry_on_conflict": 3}}
{"doc": {"title": "My updated blog post"}}
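A bulk body is newline-delimited JSON: one action line, optionally followed by one payload line, with a newline after the last line as well. The following Python sketch serializes the request above; `build_bulk_body` is a hypothetical helper name.

```python
import json


def build_bulk_body(operations):
    """operations: list of (action, metadata, payload_or_None) tuples."""
    lines = []
    for action, metadata, payload in operations:
        lines.append(json.dumps({action: metadata}))
        if payload is not None:  # delete has no payload line
            lines.append(json.dumps(payload))
    # Every line, including the last one, must end with a newline.
    return "\n".join(lines) + "\n"


meta = {"_index": "website", "_type": "blog", "_id": "123"}
bulk_body = build_bulk_body([
    ("delete", meta, None),
    ("create", meta, {"title": "My first blog post"}),
    ("index", {"_index": "website", "_type": "blog"}, {"title": "My second blog post"}),
    ("update", dict(meta, _retry_on_conflict=3), {"doc": {"title": "My updated blog post"}}),
])
print(bulk_body, end="")
```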

How big is too big?

The entire bulk request needs to be loaded into the memory of the node that receives it, so the larger the request, the less memory is available for other requests. There is an optimal bulk request size. Beyond that size, performance no longer improves and may even degrade.

The optimal size is not a fixed number. It depends on your hardware, the size and complexity of your documents, and your indexing and search load. Fortunately, this sweet spot is easy to find:

Try indexing typical documents in batches of increasing size. When performance starts to drop off, the batch size is too big. A good starting point is between 1,000 and 5,000 documents per batch, or fewer if your documents are very large.

It is often useful to keep an eye on the physical size of your bulk requests. A thousand 1 kB documents is very different from a thousand 1 MB documents. A good bulk size to aim for is in the 5-15 MB range.
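One way to respect both limits is to cut batches by document count and by serialized size at the same time. The generator below is a minimal sketch; the default thresholds (10 MB, 5,000 documents) are illustrative picks from within the ranges mentioned above, not recommended settings.

```python
import json


def batches_by_size(docs, max_bytes=10 * 1024 * 1024, max_docs=5000):
    """Yield lists of documents that stay under both limits.

    A single document larger than max_bytes still forms its own batch.
    """
    batch, batch_bytes = [], 0
    for doc in docs:
        size = len(json.dumps(doc).encode("utf-8")) + 1  # +1 for the newline
        if batch and (batch_bytes + size > max_bytes or len(batch) >= max_docs):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(doc)
        batch_bytes += size
    if batch:
        yield batch


sample_docs = [{"n": i} for i in range(10)]
by_count = [len(b) for b in batches_by_size(sample_docs, max_docs=4)]
by_bytes = [len(b) for b in batches_by_size(sample_docs, max_bytes=20)]
print(by_count, by_bytes)
```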

# Chapter 4

Routing

shard = hash(routing) % number_of_primary_shards

The routing value is an arbitrary string that defaults to the document's _id but can be customized. The hash function turns this string into a number, which is divided by the number of primary shards. The remainder, always in the range 0 to number_of_primary_shards - 1, is the number of the shard that holds the document. This also explains why the number of primary shards can be set only when the index is created and cannot be changed afterwards: if it changed, all previous routing values would become invalid and documents could no longer be found.

All document APIs (get, index, delete, bulk, update, and mget) accept a routing parameter that customizes the document-to-shard mapping. A custom routing value ensures that all related documents, such as those belonging to the same person, are stored on the same shard.
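The routing formula can be tried out in a few lines of Python. Note the assumption: `zlib.crc32` is used here only as a convenient deterministic stand-in for the hash function, which the article does not name, so the shard numbers will not match a real cluster.

```python
import zlib


def shard_for(routing, number_of_primary_shards):
    """shard = hash(routing) % number_of_primary_shards, with crc32 as a stand-in hash."""
    digest = zlib.crc32(routing.encode("utf-8"))
    return digest % number_of_primary_shards


doc_ids = [f"doc-{i}" for i in range(100)]

# With a fixed shard count, the same routing value always maps to the same shard.
placement_5 = {doc_id: shard_for(doc_id, 5) for doc_id in doc_ids}

# Changing the shard count moves documents, which is why it cannot be changed later.
moved = sum(1 for doc_id in doc_ids if shard_for(doc_id, 6) != placement_5[doc_id])
print(f"{moved} of {len(doc_ids)} documents would map to a different shard")
```

Passing a shared value such as a user ID as the routing string, instead of the document ID, places all of that user's documents on one shard.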

Create, index, and delete documents

These are the steps required, in order, to successfully create, index, or delete a document on the primary shard and its replica shards:

1. The client sends a create, index, or delete request to one of the nodes, which becomes the requesting node.

2. The node uses the document's _id to determine that the document belongs to shard 0. It forwards the request to Node 3, where the primary copy of shard 0 is located.

3. Node 3 executes the request on the primary shard and, if it succeeds, forwards the request to the replica shards on the other nodes. When all replica shards report success, Node 3 reports success to the requesting node, which reports it to the client.

By the time the client receives a successful response, the change has been applied to the primary shard and to all replica shards. Your change is safe.

The default value of the replication parameter is sync.

The allowed values for consistency are one (just the primary shard), all (the primary and all replica shards), or the default, quorum: a majority of shard copies, calculated as int((primary + number_of_replicas) / 2) + 1.

What happens when there are not enough shard copies? Elasticsearch waits for more shards to appear. By default it waits one minute; this can be changed with the timeout parameter.
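The quorum formula is easy to check with a few numbers. The function below is a direct transcription of the formula given above; only the function name and the default of one primary per shard are choices made for this example.

```python
def quorum(number_of_replicas, primary=1):
    """Number of shard copies that make up a quorum for one shard."""
    return int((primary + number_of_replicas) / 2) + 1


for replicas in (1, 2, 3):
    print(f"number_of_replicas={replicas} -> quorum={quorum(replicas)}")
```

With three replicas, for instance, there are four copies in total and the formula gives int(4 / 2) + 1 = 3.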

These are the steps required, in order, to retrieve a document from either a primary or a replica shard:

1. The client sends a get request to Node 1.

2. The node uses the document's _id to determine that the document belongs to shard 0 (the shard is calculated according to the routing rule). Copies of shard 0 exist on all three nodes. This time, it forwards the request to Node 2.

3. Node 2 returns the document to Node 1, which returns it to the client.

For read requests, the requesting node balances the load by choosing a different shard copy for each request: it round-robins through all shard copies. It is possible that a document has been indexed on the primary shard but has not yet been copied to the replica shards. In that case, a replica might report that the document does not exist, while the primary returns it successfully. Once the index request has returned success to the user, the document is available on the primary shard and on all replica shards.
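The round-robin selection of shard copies can be sketched with `itertools.cycle`. The class name and node labels are illustrative only; this models the selection idea, not how Elasticsearch implements it.

```python
import itertools


class ShardCopySelector:
    """Hand out the nodes holding copies of a shard in round-robin order."""

    def __init__(self, nodes):
        self._cycle = itertools.cycle(nodes)

    def next_node(self):
        return next(self._cycle)


selector = ShardCopySelector(["Node 1", "Node 2", "Node 3"])
picks = [selector.next_node() for _ in range(4)]
print(picks)
```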

These are the steps required, in order, to perform a partial update:

1. The client sends an update request to Node 1.

2. It forwards the request to Node 3, where the primary shard is located.

3. Node 3 retrieves the document from the primary shard, changes the JSON in the _source field, and reindexes the document on the primary shard. If another process has modified the document in the meantime, it retries step 3 up to retry_on_conflict times before giving up.

4. If Node 3 updates the document successfully, it forwards the new version of the document to the replica shards on Node 1 and Node 2 to be reindexed. When all replica shards report success, Node 3 reports success to the requesting node, which reports it to the client.

The update API also accepts the routing, replication, consistency, and timeout parameters described in the create, index, and delete section.

The multi-document mget and bulk APIs follow a pattern similar to the single-document APIs. The difference is that the requesting node knows which shard each document lives in. It splits the multi-document request into one request per shard and forwards these to each participating node.

Once it has received a response from every node, it collates the responses into a single response, which it returns to the client.
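The split-and-collate pattern for multi-document requests can be modeled as follows. This is a sketch under stated assumptions: `zlib.crc32` stands in for the routing hash, and the per-shard responses are faked as lists of dictionaries rather than fetched from real nodes.

```python
import zlib
from collections import defaultdict


def shard_for(doc_id, number_of_primary_shards):
    """Stand-in routing: crc32 replaces the hash function, which the article does not name."""
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_primary_shards


def split_by_shard(doc_ids, number_of_primary_shards):
    """Group document IDs into one sub-request per shard."""
    per_shard = defaultdict(list)
    for doc_id in doc_ids:
        per_shard[shard_for(doc_id, number_of_primary_shards)].append(doc_id)
    return dict(per_shard)


def collate(responses, doc_ids):
    """Combine per-shard responses into one list in the original request order."""
    by_id = {doc["_id"]: doc for shard_docs in responses for doc in shard_docs}
    return [by_id[doc_id] for doc_id in doc_ids]


requested = ["1", "2", "3", "4"]
groups = split_by_shard(requested, 2)
# Fake the per-shard replies; a real node would return the stored documents.
shard_responses = [[{"_id": doc_id} for doc_id in ids] for ids in groups.values()]
combined = collate(shard_responses, requested)
print(groups, combined)
```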

These are the steps required, in order, to retrieve multiple documents with a single mget request:

1. The client sends an mget request to Node 1.

2. Node 1 builds a multi-get request per shard and forwards these requests to the nodes hosting each required primary or replica shard. Once all replies have been received, Node 1 builds the response and returns it to the client.

A routing parameter can be set for each document in the docs array.

These are the steps required, in order, to execute multiple create, index, delete, and update requests within a single bulk request:

1. The client sends a bulk request to Node 1.

2. Node 1 builds a bulk request per shard and forwards these requests to the nodes hosting each involved primary shard.

3. The primary shard executes the actions serially, one after another. As each action succeeds, the primary forwards the new document (or the deletion) to its replica shards and then moves on to the next action. Once all replica shards report success for all actions, the node reports success to the requesting node, which collates the responses and returns them to the client.

The bulk API also accepts the replication and consistency parameters at the top level for the whole request, and the routing parameter in the metadata of each action.

# Chapter 5

A search can:

Use a structured query on fields like gender or age, sorted by a field like join_date, similar to a structured query in SQL.

Perform a full-text search that matches keywords against all fields and then sorts the results by relevance.

Or combine the above two.

| Concept | Explanation |
| --- | --- |
| Mapping | How the data in each field is interpreted |
| Analysis | How full text is processed to make it searchable |
| Query DSL (domain-specific language) | The flexible, powerful query language used by Elasticsearch |

# Chapter 6

The mapping mechanism is used for field type validation: it matches each field to a defined data type (string, number, boolean, date, and so on).

The analysis mechanism tokenizes full text to build an inverted index for search.
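To make the idea of an inverted index concrete, here is a deliberately simplified Python sketch. The `analyze` function only lowercases and splits on non-alphanumeric characters, which is far cruder than a real analyzer; both function names and the sample documents are chosen for this example.

```python
import re
from collections import defaultdict


def analyze(text):
    """Very simplified analysis: lowercase, then split on non-alphanumerics."""
    return [token for token in re.split(r"[^a-z0-9]+", text.lower()) if token]


def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


docs = {1: "My first blog entry", 2: "My second blog post"}
inverted = build_inverted_index(docs)
print(sorted(inverted["blog"]), sorted(inverted["first"]))
```

Searching for a term then becomes a dictionary lookup instead of a scan over every document.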
