Performance optimization tips-find 04/23 Update SLTechnology News&Howtos

Performance optimization tips-find

2025-04-23 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >

Shulou(Shulou.com)06/03 Report--

In our daily life, we will encounter all kinds of data, from the company address book to the behavior analysis of Internet users. In the process of data analysis and processing, query is an essential link, how to query data more efficiently.

SPL provides users with a powerful index mechanism and query functions for objects in different scenarios, which can significantly improve query performance.

1 key value lookup 1.1 order table

We first create a copy of the "call record" simulation data, through this data, to compare the impact of different query functions on the query performance of the ordinal table. The code to create the simulation data is as follows:

A1=5000000.new (13800000000000000001length.rands ("TF", 1): isLocal) 2=file ("btx/voiceBill.btx") .export @ b (A1)

Code 1.1.1

Some of the data are as follows:

Figure 1.1.1

When querying the ordinal table, we usually think of using the A.select () function. Let's look at the effect of using this function:

AB1=file ("btx/voiceBill.btx") .import@b () / reads the data in the file as the ordinal table 2=now () / current time 3=A1.select (Subscriber==13800263524) / uses the A.select () function to query 4=interval@ms (A2 now ()) / the query takes time

Code 1.1.2

The query takes 80 milliseconds.

When querying the key value of the ordinal table, you can use the A.find () function to query. The sample code is as follows:

AB1=file ("btx/voiceBill.btx") .import@b () / reads the data in the file as sequence Table 2 > A1.keys (Subscriber) / sets Subscriber as the primary key 3 > A1.index () / indexing 4=now () / current time 5=A1.find (13800263524) / use the A.find () function to find 6=interval@ms (A4 now ()) / time spent on the query

Code 1.1.3

The query takes 1 millisecond.

This is because in the order table of the aggregator, one or more fields can be specified as primary keys, and lookups based on primary keys can use special functions. For example, the find function of A5 in Code 1.1.3 can not only simplify writing, but also effectively improve computing performance.

When there are many key values, we use the function A.find@k () to do batch key lookup. The sample code is as follows:

AB1=file ("btx/voiceBill.btx") .import@b () / read the data in the file As sequence table 2 > A1.keys (Subscriber) / set Subscriber as primary key 3 > A1.index () / index 4=A1 (50. (rand (A1.len ()) + 1)). (Subscriber) / take 50 values at random from Subscriber 5=now () / current time 6=A1.find@k (A4) / use the A.find@k () function to search A4 for batch key search 7=interval@ms (A5 now ()) / time spent on query

Code 1.1.4

It is important to note that when using the A.find () function, you need to create the primary key in advance, otherwise an error of "missing primary key" will be reported.

The use of primary key value lookup function can effectively improve computational performance because the index table is established for the primary key in the ordinal table. In Code 1.1.4, the average query time is about 1400 milliseconds without indexing; after indexing, the average query time is less than 1 millisecond.

The larger the amount of data in the sequence table, the more times you need to search, the more obvious the improvement in efficiency.

When the query condition corresponds to multiple keys, the sample code is as follows:

AB1=file ("btx/voiceBill.btx") .import@b () / read the data in the file, as sequence table 2 > A1.keys (Subscriber,isLocal) / set Subscriber,isLocal as primary key 3 > A1.index () / index 4 = [[13800000002, "F"], [13802568478, "F"] / there are multiple keys So it takes time to make up the sequence 5=now () / current time 6=A1.find@k (A4) / use the A.find@k () function to do a batch key search for 7=interval@ms (A5MagneNow ()) / query for A4.

Code 1.1.5

The switch/join function also needs to find records in the ordered table according to the primary key value, and automatically indexes the dimension table when it is used. If the corresponding dimension table is not indexed before the multithreaded fork function, an index will be automatically established for the dimension table in each thread, which will consume more memory during execution, which may cause memory overflow, such as figure 1.1.1.2, which should be avoided. For better treatment, please refer to figure 1.1.3.

Figure 1.1.2 each thread in fork is automatically indexed, resulting in a memory overflow

Figure 1.1.3 Index the dimension table before fork is executed

1.2 set files

To find ordered set files, you can use the f.iselect () function to achieve binary lookup, which also supports batch lookup. Here is an example of batch lookup using f.iselect () based on set files:

A1=5000000.new (13800000000000000001length.rands ("TF", 1): isLocal) 2=file ("btx/voiceBill@z.btx") .export @ z (A1)

Code 1.2.1

Code 1.2.1, set up the set file voiceBill@z.btx. Obviously, Subscriber is orderly.

AB1=file ("btx/voiceBill@z.btx")

2 id 50. (13800000000+rand (5000000) + 1). Id () / randomly find 50 out of 5 million Subscriber, remove and sort 3=now () / current time 4=A1.iselect@b (A2 minute subscriber) / a pair of set files use f.iselect () binary search function for batch search 5=A4.fetch ()

Time spent on 6=interval@ms (A3MaginNow ()) / query

Code 1.2.2

Code 1.2.2, because f.iselect () is a binary lookup function, it is important to note that A2 in the code needs to be ordered as a query sequence, just like the number of the set file. Also note that the option @ b here does not mean dichotomy, but instead reads the set file exported through the f.export () function. When exporting the set file, note that you need to use the option @ z, otherwise an error will be reported when querying the set file using f.iselect ().

Assuming that the total amount of data is N and the time complexity of searching using dichotomy is logN (with 2 as the base), the larger the amount of data is, the more obvious the performance improvement is.

1.3 Group tables

Group tables also have T.find () and T.find@k () functions similar to ordinal tables, which can efficiently implement key value lookups. It is suitable for scenarios where a small number of records are found in a large dimensional table. Let's look at an example like this:

AB1=file ("ctx/voiceBill.ctx") .create (# Subscriber,Time,Length,isLocal)

2for 5000=to ((A2-1) * 100000000) .new (13800000000000000003000000): Time,rand (1547123785410+rand (864000000)): Time,rand (3600) + 1 TF (1): isLocal) 3

= A1.append (B2.cursor ())

Code 1.3.1

Code 1.3.1, create a group table file voiceBill.ctx, where Subscriber is the dimension of the group table.

AB1=file ("ctx/voiceBill.ctx") .create () / Open a subscription value 3=now () / current time 4=A1.cursor (). Fetch () / query the group table using cs.select () to query the group table 5=interval@ms (A3 now ()) / time spent on the query

Code 1.3.2

Code 1.3.2, query the group table using the cs.select () function, which takes 13855 milliseconds.

AB1=file ("ctx/voiceBill.ctx") .create () / Open a subscriber value 3=now () / current time 4=A1.find (A2) / one of the group table data 13801701672 / use T.find () to query the group table 5=interval@ms (A3 now ()) / time spent on the query

Code 1.3.3

Code 1.3.3, using the T.find () function to query the group table, takes 77 milliseconds.

Comparison shows that: for dimensional group tables, we can use the T.find () function similar to the ordinal table to query single or batch key values, and the query efficiency is much higher than that of fetching numbers from filtered cursors.

2 Index lookup

There are three indexes that can be created on a group table, each for a different situation, which are:

1. Hash index, suitable for single-valued lookup, such as enumerated types

2. Sort index, which is suitable for interval search, such as number, date and time type

3. Full-text index for fuzzy queries, such as string types.

Let's create a group table so that its data type covers the above three indexes, as follows:

AB1=file ("ctx/voiceBillDetail.ctx") .create (# Subscriber,Time,Length,isLocal,City,Company)

2=file ("info/city_en.txt"). Import@i () = A2.len () 3for 5000=to ((A3-1) * 100000000). New (138000000000000000000000000000000000000000000001mm) Subscriberdatetime (1547123785410+rand (864000000)): Time,rand (3600) + 1VV LengthRands ("TF", 1): isLocal,A2 (rand (B2) + 1): City,rands ("ABCDEFGHIJKLMNOPQRSTUVWXYZ", 14) + "Co. Ltd": Company) 4

= A1.append (B3.cursor ())

Code 2.1

For the group table established in Code 2.1, the first ten records are as follows:

Figure 2.1

AB1=file ("ctx/voiceBillDetail.ctx") .create () / Open group table 2=A1.index (subscriber_idx;Subscriber) / user mobile phone number, value with sort index 3=A1.index (time_idx;Time) / call start time, value with sort index 4=A1.index (length_idx;Length) / call duration, value with sort index 5=A1.index (city_idx:1;City) / city, enumeration uses hash index 6=A1.index@w (company_idx) Company) / company, full-text index for strings

Code 2.2

Code 2.2, according to the characteristics of each column data type, establish a different type of index. The established index and group table files are shown in figure 2.2:

Figure 2.2

The aggregator can automatically identify the condition and find the appropriate index, both equivalent and interval, and like ("A*") also supports it. Let's take a look at the effect:

Equivalent search

AB1=file ("ctx/voiceBillDetail.ctx") .create () / Open group table 2=now () / current time 3=A1.icursor (; Subscriber==13834750766,subscriber_idx) .fetch () / icursor query 4=interval@ms (A2 now ()) / query time

Code 2.3

AB1=file ("ctx/voiceBillDetail.ctx") .create () / Open group table 2=now () / current time 3=A1.icursor (; Subscriber==13834750766) .fetch () / icursor query 4=interval@ms (A2 now ()) / query time

Code 2.4

Code 2.3 is written without omitting the index name, and code 2.4 is written with the index name omitted. The time consumption of both is basically the same, about 100 milliseconds.

AB1=file ("ctx/voiceBillDetail.ctx") .create () / Open group table 2=now () / current time 3=A1.cursor (). Select (Subscriber==13834750766) .fetch () / ordinary cursor query 4=interval@ms (A2 Magi now ()) / query time-consuming

Code 2.5

Code 2.5 uses a normal cursor to query the same record, which takes about 40 seconds.

Interval search

AB1=file ("ctx/voiceBillDetail.ctx") .create () / Open group table 2=now () / current time 3=A1.icursor (; Subscriber > = 13834750766 & & Subscriber=13834750766 & & Subscriber208 & & weight

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.