I. Description of the problem
Key value query is a very common query scenario. Once an index is built on a data table, querying a single record by key is very fast even if the table holds a huge number of records (hundreds of millions or even billions of rows): the complexity of an index lookup is only log2 N, so 1 billion rows take only about 30 comparisons (10^9 ≈ 2^30), which costs mere tens of milliseconds on a modern computer.
However, when many keys must be queried, say thousands or even tens of thousands, looking each one up independently accumulates tens of thousands or even hundreds of thousands of reads and comparisons, and the latency climbs to tens of minutes or even hours. At that point, relying naively on the database index alone is intolerable for the user experience.
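To see why the latency blows up, a back-of-envelope model helps (the figures below are illustrative assumptions, not measurements from the original text): if each of k independent lookups walks a log2 N level index and each level may cost one random disk read, then

$$
T \approx k \cdot \log_2 N \cdot t_{\text{read}}, \qquad k = 10^4,\; N = 10^9,\; t_{\text{read}} \approx 5\,\text{ms} \;\Rightarrow\; T \approx 1.5 \times 10^6\,\text{ms} \approx 25\,\text{min},
$$

which lands squarely in the tens-of-minutes range described above. Batch lookup amortizes these random reads by sorting the keys and walking the index once, front to back.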
II. An example scenario
The aggregator's group table feature introduced below, built on high-performance indexing and batch key lookup, can handle this scenario effectively. We will go through it step by step in the following order:
1) Single-field key
2) Multi-field key
3) Multithreaded query
4) Handling appended data
Note that this article only discusses the single-machine case; a follow-up article will cover cluster-based solutions in depth.
2.1 Single-field key
Take the typical data structure in the following table as an example:
Field name   Type     Primary key   Description
id           long     Yes           Auto-increment starting from 1
data         string                 The data to be obtained
2.1.1 Create a group table
First, let's create a group table and import the source data into it:
  A
1 =file("single.ctx")
2 =A1.create(#id,data)
3 =file("single_source.txt")
4 =A3.cursor@t()
5 =A2.append(A4)
A1: create/open the specified file object; single.ctx is the group table file to be created, with extension ctx. For the detailed definition and usage of group tables, refer to the aggregator tutorial.
A2: create the group table with structure (id,data). A field prefixed with # is called a dimension field; the group table assumes the data is ordered by it, i.e. A2 requires the data to be ordered by the key field id. Group tables use columnar storage by default.
A3: here the source data is assumed to be stored as text, and A3 opens the data file. The header and first few lines of the txt file are shown in the figure below. Of course, the source data could also come from other data sources such as a database.
A4: generate a cursor over the source data. The @t option indicates that the first line of the file contains the field names.
A5: append the records from cursor A4 to the group table.
The script above assumes the primary key id is already ordered in the source data. If it is not, the data must be sorted by the primary key before building the group table; cs.sortx() can be used for this. See the function reference for details.
In the aggregator designer, three lines of code let you visually inspect the first ten rows; the code and screenshot are as follows:
  A
1 =file("single.ctx")
2 =A1.create()
3 =A2.cursor().fetch(10)
A1: open the group table file object with the specified file name.
A2: f.create() with no parameters simply opens the group table.
A3: use a cursor to fetch the first ten rows of A2, as shown in the figure below.
2.1.2 Create an index
Next, we index the group table files to improve retrieval performance:
  A
1 =file("single.ctx")
2 =A1.create()
3 =A2.index(id_idx;id;data)
A1: open the group table file object with the specified file name.
A2: open the group table with a parameterless create.
A3: build the index. In T.index(), the parameter id_idx is the index name, id is the dimension, and data is the data column. Normally an index does not need to include data columns; lookups then go back to the original table to fetch the data. In this example, however, the data column is included in the index, so lookups no longer touch the original table: it takes more space, but searching is faster.
When the index is built on dimension fields, it takes advantage of the group table's ordering and does not sort the data again. If a field was not declared with # when the group table was created, the data will be re-sorted while the index is built.
2.1.3 Query
The query is done with a main script that calls a subroutine:
Query subroutine script search.dfx:
  A
1 =file("single.ctx")
2 =A1.create()
3 =keys
4 =A2.icursor(;A3.contain(id),id_idx)
5 >file("result.txt").export@t(A4)
A3: keys is a parameter passed in by the main script below when it calls this subroutine.
A4: in the group table's icursor() function, use the index id_idx to filter the group table with the condition A3.contain(id). The aggregator automatically recognizes that the A3.contain(id) condition can use the index, sorts the contents of A3, and searches them from front to back.
A5: export the query result of A4 to result.txt. The @t option writes the field names on export.
Main program script:
  A
1 =file("keys.txt").import@i()
2 =call("search.dfx",A1)
A1: read the sequence of query key values from keys.txt. Since there is only one column, the @i option returns the result as a sequence:
This sequence is the set of random key values to be queried. The example pre-stores the random key values in keys.txt; in practice they may come from other data sources.
A2: call the subroutine search.dfx, passing the key value set from A1 as the parameter.
Here are some of the contents of the result file result.txt:
In addition, the aggregator can be embedded into Java applications to give them flexible and simple data query capabilities. When embedded, an aggregator script is accessed much like a database through JDBC. See the "called by JAVA" section of the tutorial for how to write this.
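For illustration, here is a minimal Java sketch of such an embedded call. The driver class, URL, and the "call search(...)" form follow the aggregator's JDBC tutorial as I understand it; treat the exact names, and the idea of a search variant that returns its result set instead of exporting to result.txt, as assumptions to verify against the "called by JAVA" section.

import java.sql.*;

public class AggregatorCall {
    public static void main(String[] args) throws Exception {
        // Driver class and URL as shown in the aggregator's JDBC docs (assumption)
        Class.forName("com.esproc.jdbc.InternalDriver");
        try (Connection con = DriverManager.getConnection("jdbc:esproc:local://");
             // invoke the subroutine script, passing the key set as its parameter
             CallableStatement st = con.prepareCall("call search(?)")) {
            st.setObject(1, new long[]{8L, 32L, 533L}); // hypothetical key values
            try (ResultSet rs = st.executeQuery()) {    // assumes search returns (id,data) rows
                while (rs.next())
                    System.out.println(rs.getLong(1) + "\t" + rs.getString(2));
            }
        }
    }
}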
The single-field key example above has a simple data structure: the key field is id and a single column of data is fetched. If there are more data columns, for example:
Field name   Type     Primary key   Description
id           long     Yes           Auto-increment starting from 1
data1        string                 Data 1 to be obtained
data2        int                    Data 2 to be obtained
…            …                      …
dataN        …                      Data N to be obtained
then all of the data columns should be included when building the index, with the data column parameters written as follows:
  A
1 =file("single.ctx")
2 =A1.create()
3 =A2.index(id_idx;id;data1,data2,…,dataN)
For the multi-field keys discussed next, multiple index fields must be specified when building the index; the parameters are written similarly: index(id_idx;id1,id2,…,idN;data1,data2,…,dataN).
2.2 Multi-field key
A multi-field key refers to a joint (composite) primary key, for example:
Field name   Type     Primary key   Description
type         string   Yes           Enumerable
id           long     Yes           Incremented from 1 within each type value
data         string                 The data to be obtained
The type and id fields together form the joint primary key that identifies a record.
2.2.1 Method 1 (general method)
2.2.1.1 Create a group table
  A
1 =file("multi.ctx")
2 =A1.create(#type,#id,data)
3 =file("multi_source.txt")
4 =A3.cursor@t()
5 =A2.append(A4)
In this example, A2 specifies two dimensions, type and id; the rest of the code is the same as for the single-field key.
2.2.1.2 Create an index
  A
1 =file("multi.ctx")
2 =A1.create()
3 =A2.index(type_id_idx;type,id;data)
Because this example has two dimensions, both type and id must be included when building the index, as shown in A3.
2.2.1.3 Query
  A
1 =file("multi.ctx")
2 =A1.create()
3 =[["type_a",55],["type_b",66]]
4 =A2.icursor(;A3.contain([type,id]),type_id_idx)
5 >file("result.txt").export@t(A4)
A3 prepares two records, each a two-member tuple of type and id, as the set of values to look up. Its structure is shown in the figure below:
A4: A3.contain([type,id]) filters the data against the sequence of tuples, so the argument of the contain function must also be written as a tuple.
The exported result file looks like this:
2.2.2 Method 2 (merge the primary keys)
Although multi-field keys can be used directly, storing and comparing sets is slower. For high performance, a more common approach is to merge the multiple key fields into a single-field key.
Looking at this example's data structure: although type is a string, its values are enumerable, so type can be digitized and merged with id into a new primary key field. The maximum value of the long type is 2^63-1, easily enough to hold the digitized combination of type and id. We call the new primary key nid, and decide how many of its digits represent type and id according to the size of the data.
For example, suppose id needs at most 9 digits and the type enumeration fits in 3 digits, so nid needs 13 digits (the first digit is set to 1 so the number never starts with untidy leading zeros). Of the 12 digits after the leading 1, the first 3 hold the digitized type and the last 9 hold the original id. This turns the joint primary key into a unique single-field primary key.
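To make the digit layout concrete, here is a small illustrative sketch (Java, not part of the article's SPL scripts) that composes and decomposes such an nid. The constants mirror the 1 + 3 + 9 digit scheme just described; the tid and id values are hypothetical.

public class NidCodec {
    static final long BASE = 1_000_000_000_000L; // leading 1 followed by 12 digits
    static final long ID_SPAN = 1_000_000_000L;  // last 9 digits reserved for id

    // nid = 1 ttt iiiiiiiii : 3 digits of digitized type, 9 digits of original id
    static long encode(long tid, long id) { return BASE + tid * ID_SPAN + id; }
    static long tidOf(long nid) { return (nid - BASE) / ID_SPAN; }
    static long idOf(long nid)  { return (nid - BASE) % ID_SPAN; }

    public static void main(String[] args) {
        long nid = encode(2, 123456789L); // hypothetical: tid=2, id=123456789
        System.out.println(nid);          // 1002123456789 (13 digits, leading 1)
        System.out.println(tidOf(nid));   // 2
        System.out.println(idOf(nid));    // 123456789
    }
}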
The SPL code is as follows:
  A
1 =["type_a",…,"type_z","type_1",…,"type_9","type_0"]
2 =A1.new(#:tid,~:type)
3 =file("multi_source.txt")
4 =A3.cursor@t()
5 =A4.switch(type,A2:type)
6 =A4.new(1000000000000+type.tid*long(1000000000)+id:nid,data)
7 =A4.skip(99999995)
8 =A4.fetch(10)
A1: the sequence of type enumeration values. In practice, the enumeration list may come from a file or a database source.
A2: assign each type in the enumeration sequence a tid, in preparation for the numeric primary key merge.
A3~A6: read the data from multi_source.txt, replace the type column's enumeration string with its number according to the mapping in A2, then merge type and id into the new primary key nid.
A7~A8: spot-check the merged data: skip the first 99,999,995 records of cursor A4, then fetch 10. The result is as follows:
This yields a new single-field-key data structure:
Field name   Type     Primary key   Description
nid          long     Yes           Unique primary key combining type and id
data         string                 The data to be obtained
From here, everything in the "single-field key" section applies; of course, nid must be kept ordered.
2.3 Multithreaded query
On top of the methods above, multithreaded parallelism can further improve performance.
Multithreaded parallelism here means splitting the data into N parts and querying with N threads. But splitting the data at random will probably not help: since the set of key values to be queried is unknown, there is no guarantee the searched data will spread evenly across the group table files. A better approach is to study the characteristics of the key value set first and split the data as evenly as possible.
For example, continuing with the single-field key produced from the multi-field key above: take the merged primary key nid modulo 4 and store records with equal remainders in the same group table, so all the data ends up in four group table files. This kind of split spreads the queried data relatively evenly.
If the keys have obvious business characteristics, fields such as date or department can drive the split according to the actual scenario. For example, 1,000 records belonging to department A can be divided evenly into 10 files, 100 records each; when querying department A's records with multiple threads, each thread fetches its 100 records from its own file.
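As a sketch of the build-side split (illustrative Java, with hypothetical nid values and plain-text files standing in for the ctx group tables): records with equal remainders modulo N land in the same file, which mirrors what the nid%N channel script in 2.3.1 below does.

import java.io.*;
import java.util.*;

public class SplitByRemainder {
    public static void main(String[] args) throws IOException {
        int n = 4;                                   // N=4 split files, as in the article
        long[] nids = {1001000000001L, 1002000000002L, 1003000000007L}; // hypothetical keys
        List<PrintWriter> out = new ArrayList<>();
        for (int r = 0; r < n; r++)                  // one target file per remainder
            out.add(new PrintWriter(new FileWriter("nid_" + r + "_T.txt")));
        for (long nid : nids)                        // same remainder -> same file
            out.get((int) (nid % n)).println(nid + "\tdata");
        for (PrintWriter w : out) w.close();
    }
}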
Let's look at a practical example.
2.3.1 Create the group tables
  A
1 =["type_a",…,"type_z","type_1",…,"type_9","type_0"]
2 =A1.new(#:tid,~:type)
3 =file("multi_source.txt")
4 =A3.cursor@t()
5 =A4.switch(type,A2:type)
6 =A4.new(1000000000000+type.tid*long(1000000000)+id:nid,data)
7 =N.(file("nid_"+string(~-1)+"_T.ctx").create(#nid,data))
8 =N.(eval("channel(A4).select(nid%N=="+string(~-1)+").attach(A7("+string(~)+").append(~.cursor()))"))
9 for A6,500000
A1~A6: the same as method 2 of the multi-field key section.
A7: use a loop function to create N group table files, named "key name_remainder of key modulo N_T.ctx", each with the structure (#nid,data).
A8: use a loop function to append the cursor data to the N group tables through channels. For the first loop iteration (~=1), the eval string expands to: channel(A4).select(nid%4==0).attach(A7(1).append(~.cursor())). It creates a channel on cursor A4, takes nid modulo 4 inside the channel, and keeps the records whose remainder is 0. attach is a further operation on the channel: it takes the group table corresponding to that remainder and appends the filtered records, as a cursor, to A7(1), the first group table.
A9: loop over cursor A6, fetching 500,000 records at a time, until the cursor is exhausted.
After execution, N separate group table files are produced (N=4 in this example):
2.3.2 Create the indexes
   A                                B
1  fork directory@p("nid*T.ctx")    =file(A1).create().index(nid_idx;nid;data)
A1: list the file names matching nid*T.ctx (where * is a wildcard); the @p option returns the file names with full path information. When using fork for multithreading, check that the parallel limit in the environment is set appropriately. Four threads are used here; the corresponding designer settings are as follows:
B1: each thread builds the index file for its own group table. The final result is as follows:
2.3.3 Query
   A                                     B
1  =file("keys.txt").import@i()
2  =A1.group(~%N)
3  fork N.(~-1),A2                       =A3(2)
4                                        =file("nid_"/A3(1)/"_T.ctx").create().icursor(;B3.contain(nid),nid_idx)
5                                        return B4
6  =A3.conjx()
7  =file("result_nid.txt").export@t(A6)
A1: read the sequence of query key values from keys.txt. Since there is only one column, the @i option returns a sequence:
A2: group the key values of A1 by their remainder modulo 4:
A3, B3~B5: use fork to query the group tables in parallel, one thread per remainder group. fork takes two parameters here: the loop function N.(~-1) and A2. Inside the fork body, A3(1) and A3(2) retrieve those parameters in order: B3 takes the key group with A3(2), B4 filters the corresponding group table file by the key set in B3, and B5 returns the cursor. Because A3 is the sequence of cursors returned by the threads, A6 uses conjx to concatenate them vertically.
A6~A7: after the threads' cursors are concatenated vertically, export the records to a text file; the first few lines are as follows.
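Before moving on, here is an illustrative Java analogue of this grouping-plus-fork pipeline (searchFile is a hypothetical stand-in for the icursor lookup over one group table file; it is not the aggregator's API):

import java.util.*;
import java.util.concurrent.*;

public class ParallelLookup {
    // hypothetical stand-in for icursor(;keys.contain(nid),nid_idx) over nid_r_T.ctx
    static List<String> searchFile(int r, List<Long> keys) {
        List<String> rows = new ArrayList<>();
        for (long k : keys) rows.add("nid_" + r + "_T.ctx\t" + k);
        return rows;
    }

    public static void main(String[] args) throws Exception {
        int n = 4;
        long[] keys = {1002123456789L, 1001000000007L, 1003000000002L, 1001000000004L};
        Map<Integer, List<Long>> groups = new HashMap<>();      // like A2: group(~%N)
        for (long k : keys)
            groups.computeIfAbsent((int) (k % n), r -> new ArrayList<>()).add(k);

        ExecutorService pool = Executors.newFixedThreadPool(n); // like fork: one task per group
        List<Future<List<String>>> parts = new ArrayList<>();
        for (Map.Entry<Integer, List<Long>> e : groups.entrySet())
            parts.add(pool.submit(() -> searchFile(e.getKey(), e.getValue())));
        for (Future<List<String>> f : parts)                    // like conjx: concatenate results
            f.get().forEach(System.out::println);
        pool.shutdown();
    }
}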
2.4 Data appending
So far we have solved batch random key value queries over big data, but we cannot assume the data never changes; especially with big data, new records keep arriving. For appending new data to the existing group table files, three cases must be discussed: ordered key value append, unordered key value append, and appending when the data volume is large.
2.4.1 Ordered key value append
For a single file, if the key values are ordered, the append code is as follows:
  A
1 =file("single.ctx")
2 =A1.create()
3 =file("single_add.txt")
4 =A3.cursor@t()
5 =A2.append(A4)
A1: single.ctx is an existing group table file with structure (#id,data), where id is an auto-increment key.
A3~A5: the new data file has the same structure as the existing file, and after appending, the id values remain ordered over the whole data set. In this case the new data can be appended directly to the group table, and the group table updates its index automatically.
If the data was split into multiple files for multithreading, the code is as follows:
  A
1 =file("single_add.txt")
2 =A1.cursor@t()
3 =directory@p("id*T.ctx").(file(~).create())
4 =N.(eval("channel(A2).select(id%N=="+string(~-1)+").attach(A3("+string(~)+").append(~.cursor()))"))
5 for A2,500000
A1, A2: read the new data through a cursor.
A3: list and open the existing N group table files matching the wildcard "id*T.ctx", returning them as a sequence.
A4, A5: the same as the corresponding code in the previous method.
2.4.2 Unordered key value append
Again, let's start with the append method for a single file, taking the single-field key as the example. The code is as follows:
  A
1 =file("single.ctx")
2 =A1.create().cursor()
3 =file("single_add.txt")
4 =A3.cursor@t()
5 =file("single.ctx_temp").create(#id,data)
6 =A5.append([A2,A4].mergex(id))
A2: open the existing group table as a cursor.
A4: read the new data through a cursor.
A5: create a new group table file.
A6: write the merge of the existing group table data and the new data into the new group table. Note that merging with cs.mergex(x) requires every cursor in cs to be ordered by x, i.e. the group table file of A1 and the new data file of A3 must each be ordered by id. If they are not, the program will not raise an error, but the merged result will be out of order.
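The ordered-merge requirement is easy to see in a minimal sketch: the classic two-way merge below produces sorted output only if both inputs are already sorted, which is exactly the precondition cs.mergex(x) relies on (illustrative Java, not the aggregator's implementation):

import java.util.Arrays;

public class OrderedMerge {
    // two-way merge: correct only when a and b are each sorted ascending
    static long[] merge(long[] a, long[] b) {
        long[] out = new long[a.length + b.length];
        int i = 0, j = 0, k = 0;
        while (i < a.length && j < b.length)
            out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
        while (i < a.length) out[k++] = a[i++];
        while (j < b.length) out[k++] = b[j++];
        return out;
    }

    public static void main(String[] args) {
        long[] existing = {1, 3, 5};  // ordered, like the old group table
        long[] added    = {2, 4, 6};  // ordered, like the new data
        System.out.println(Arrays.toString(merge(existing, added))); // [1, 2, 3, 4, 5, 6]
        // If either input were unsorted, merge() would silently emit unsorted output,
        // mirroring the "no error, but disordered result" behavior described above.
    }
}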
After this code executes, you still need to remove the old group table and its index, rename the new group table, and rebuild the index:
  A
1 =movefile(file("single.ctx"))
2 =movefile(file("single.ctx__id_idx"))
3 =movefile(file("single.ctx_temp"),"single.ctx")
4 =file("single.ctx").create().index(id_idx;id;data)
The first three lines are file operations; see the function reference entry movefile for details. A4 rebuilds the index on the group table, which needs no further explanation.
Now let's look at appending to multiple files, taking the data structure (nid,data) obtained by converting the multi-field key into a single-field key as the example. The code is as follows:
   A
1  =["type_a",…,"type_z","type_1",…,"type_9","type_0"]
2  =A1.new(#:tid,~:type)
3  =file("multi_source_add.txt")
4  =A3.cursor@t()
5  =A4.switch(type,A2:type)
6  =A4.new(1000000000000+type.tid*long(1000000000)+id:nid,data)
7  =directory@p("nid*T.ctx").(file(~).create().cursor())
8  =directory@p("nid*T.ctx").(file(~+"_temp").create(#nid,data))
9  =N.(eval("channel(A4).select(nid%N=="+string(~-1)+").attach(A8("+string(~)+").append([~.cursor(),A7("+string(~)+")].mergex(nid)))"))
10 for A4,500000
A3: multi_source_add.txt is the new data source.
A7: assuming the original group tables already exist, list their file names, open a cursor on each in turn, and return the cursors as a sequence.
A8: create the new group table files that will hold the old data merged with the new, appending "_temp" to each original file name to distinguish them.
A9: push the new data through channels: in each of the N channels, merge the channel's records with the corresponding old group table cursor and append the result to the corresponding new group table. This was explained in detail above.
After this code executes, you likewise need to remove the old group tables and indexes, rename the new group tables, and rebuild their indexes, as follows:
  A
1 =directory@p("*T.ctx_temp")
2 =A1.(left(~,len(~)-5))
3 =A2.(movefile(file(~)))
4 =A1.(left(~,len(~)-5)+"__nid_idx")
5 =A4.(movefile(file(~)))
6 =A1.(movefile(file(~),left(~,len(~)-5)))
7 =A2.(file(~).create().index(nid_idx;nid;data))
The code is mostly loop functions and simple file operations; see the function reference entry "file" for details. The last line rebuilds the index, which has been covered several times already.
2.4.3 Appending when the data volume is large
As new data keeps arriving, merging each new batch with the full historical data becomes more and more expensive. At that point, each group table file should be split in two: a new file holding the data appended in the recent period, and an old file holding the earlier history. Whenever data arrives, we still follow the approach of 2.4.2, but only against the new group table file. Only when the new file exceeds a certain size threshold (say 100G) is it merged with the old data. This both reduces the time cost of merging and reduces wear on the disks.
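The trigger condition itself is simple; here is a hedged sketch of the idea (Java; the ~100G threshold and the file names are the article's example values, and the actual check in the SPL script below uses the group table file's size directly):

import java.io.File;

public class MergeTrigger {
    static final long THRESHOLD = 100L << 30; // ~100 GB, example threshold from the text

    public static void main(String[] args) {
        int n = 4; // number of split files, N=4 as in the article
        for (int r = 0; r < n; r++) {
            File t = new File("nid_" + r + "_T.ctx"); // recent (new) group table
            if (t.length() > THRESHOLD)
                // merge nid_r_T.ctx into nid_r_H.ctx, then recreate an empty T file
                System.out.println("merge " + t.getName() + " into nid_" + r + "_H.ctx");
        }
    }
}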
The data structure keeps (nid,data) as the example; this time let's go through the complete code from scratch.
First, define the new and old group table files. The naming rules are:
new group table: key name_remainder of key modulo N_T.ctx; old group table: key name_remainder of key modulo N_H.ctx.
1. Create the new and old group tables; N=4 here means 4 group tables of each kind are built:
  A
1 =N.(file("nid_"+string(~-1)+"_H.ctx").create(#nid,data))
2 =N.(file("nid_"+string(~-1)+"_T.ctx").create(#nid,data))
With N=4, the history (H) and new (T) group table files generated are as follows:
2. Append new data to the new group table.
   A
1  =["type_a",…,"type_z","type_1",…,"type_9","type_0"]
2  =A1.new(#:tid,~:type)
3  =file("multi_source_.txt")
4  =A3.cursor@t()
5  =A4.switch(type,A2:type)
6  =A4.new(1000000000000+type.tid*long(1000000000)+id:nid,data)
7  =directory@p("nid*T.ctx").(file(~).create().cursor())
8  =directory@p("nid*T.ctx").(file(~+"_temp").create(#nid,data))
9  =N.(eval("channel(A4).select(nid%N=="+string(~-1)+").attach(A8("+string(~)+").append([~.cursor(),A7("+string(~)+")].mergex(nid)))"))
10 for A4,500000
3. After the new data is merged into the new group tables, clean up the original new group tables and indexes, then rebuild the new group tables' indexes.
  A
1 =directory@p("*T.ctx_temp")
2 =A1.(left(~,len(~)-5))
3 =A2.(movefile(file(~)))
4 =A1.(left(~,len(~)-5)+"__nid_idx")
5 =A4.(movefile(file(~)))
6 =A1.(movefile(file(~),left(~,len(~)-5)))
7 =A2.(file(~).create().index(nid_idx;nid;data))
4. Check each new group table's size; if it exceeds the parameter B (in bytes), merge it with the corresponding old group table.
   A                                B                   C
1  fork directory@p("nid*T.ctx")    =file(A1)
2                                   if B1.size()>B      =left(A1,len(A1)-5)+"H.ctx"
3                                                       =B1.create().cursor()
4                                                       =file(C2).create().cursor()
5                                                       =left(A1,len(A1)-5)+"H.ctx_temp"
6                                                       =file(C5).create(#nid,data).append([C3,C4].mergex(nid))
5. After an old group table is merged with its new group table, clean up the old group table and index and rebuild the old group table's index; then remove the merged new group table and recreate an empty one.
   A
1  =directory@p("*H.ctx_temp")
2  =A1.(left(~,len(~)-5))
3  =A2.(movefile(file(~)))
4  =A1.(left(~,len(~)-5)+"__nid_idx")
5  =A4.(movefile(file(~)))
6  =A1.(movefile(file(~),left(~,len(~)-5)))
7  =A2.(file(~).create().index(nid_idx;nid;data))
8  =A1.(left(~,len(~)-10)+"T.ctx")
9  =A8.(movefile(file(~)))
10 =A1.(left(~,len(~)-10)+"T.ctx__nid_idx")
11 =A10.(movefile(file(~)))
12 =A8.(file(~).create(#nid,data))
6. Query the new and old group table files in parallel with multithreading.
   A                                                      B
1  =file("keys.txt").import@i()
2  =A1.group(~%N)
3  fork directory@p("*H.ctx"),directory@p("*T.ctx"),A2    =A3(3)
4                                                         =file(A3(1)).create().icursor(;B3.contain(nid),nid_idx)
5                                                         =file(A3(2)).create().icursor(;B3.contain(nid),nid_idx)
6                                                         return B4|B5
7  =A3.conj()
8  =file("result.txt").export@t(A7)
Note the handling in A7: because B6 returns B4|B5, A3 is a sequence whose members are sequences of cursors, so when concatenating A3 you must use the sequence function conj rather than the cursor function conjx.
At this point, with the six aggregator script files in this article, driven by a third-party job scheduler, we can both append large volumes of data and serve batch random key value queries on a single machine.