In addition to Weibo, there is also WeChat
Please pay attention
WeChat public account
Shulou
2025-02-24 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >
Share
Shulou(Shulou.com)05/31 Report--
This article mainly explains "how to search spatio-temporal behavior data in database". The content of the explanation in this article is simple and clear, and it is easy to learn and understand. let's study and learn "how to search spatio-temporal behavior data in database".
Data structure
Spatio-temporal behavior data contains three attributes: time, space and object.
Unstructured index:
Create table test (id int8, crt_time timestamp,-Time pos geometry,-- Location obj jsonb-- Object description)
In addition to being applied to JSON, structured data can also be used for object description. For example:
Create table test (id int8, crt_time timestamp,-- Time pos geometry,-- Location C1 int,-- Some property examples c2 int, c3 text, c4 float8, c5 int, c6 date, c7 text, c8 int, c9 int, c10 int)
SQL query instance of Spatio-temporal behavior data
Select * from test where pos?
< ? and crt_time between ? and ? and ( (c1 = ? and c2 between ? and ?) or c10=?) ... ; 优化方法 考虑运用以下知识: 时间序列BRIN索引 crt_time字段是一个时间序列字段,表示生成数据的时间。在PostgreSQL堆存储中,存储和该字段的值具有很强的线性相关性。 因此,BRIN索引很合适。 使用BRIN索引来代替分区表进行TPC-H测试。大范围搜索的性能甚至优于使用分区表时的功能。 create index idx_test_1 on test using brin(crt_time); 空间索引 显然,空间检索需要空间索引。PostgreSQL中可以使用三种方法实现空间检索。 1. 几何类型的GIST索引 create index idx_test_2 on test using gist(pos); 该索引支持空间KNN搜索和空间位置确定等功能。 2. 几何类型的主索引 create index idx_test_2 on test using spgist(pos); 该索引支持空间KNN搜索和空间位置确定等功能。 3. Geohash和B-tree索引(将经度和纬度转换为Geohash并为hash值创建B-tree索引)。只需使用表达式索引。 create index idx_test_3 on test using btree( ST_GeoHash(pos,15) ); 此索引支持前缀搜索(其能落实编码地理信息网格中包含的关系)。它属于有损索引,需要二次过滤。 GiST和SPGiST空间索引能够找到准确的地理位置信息,优于GEOHASH索引。但是,查询信息时需要特别注意。 GIN 索引 此索引类型的目标是对象属性字段JSONB或多个结构化对象属性字段。只需使用GIN索引。 例如: create extension btree_gin; 非结构化索引: create index idx_test_4 on test using gin( obj ); 结构化索引: create index idx_test_4 on test using gin( c1,c2,c3,c4,c5,c6,c7,c8,c9 ); BitmapAnd和BitmapOr 但是,可以同时使用这些索引吗? PostgreSQL为多个索引提供bitmapAnd及bitmapOr接口。它们可以组合多个索引,减少需要扫描的数据库数量。 Heap, one square = one page: +---------------------------------------------+ |c____u_____X___u___X_________u___cXcc______u_| +---------------------------------------------+ Rows marked c match customers pkey condition. Rows marked u match username condition. Rows marked X match both conditions. Bitmap scan from customers_pkey: +---------------------------------------------+ |100000000001000000010000000000000111100000000| bitmap 1 +---------------------------------------------+ One bit per heap page, in the same order as the heap Bits 1 when condition matches, 0 if not Bitmap scan from ix_cust_username: +---------------------------------------------+ |000001000001000100010000000001000010000000010| bitmap 2 +---------------------------------------------+ Once the bitmaps are created a bitwise AND is performed on them: +---------------------------------------------+ |100000000001000000010000000000000111100000000| bitmap 1 |000001000001000100010000000001000010000000010| bitmap 2 &&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& |000000000001000000010000000000000010000000000| Combined bitmap +-----------+-------+--------------+----------+ | | | v v v Used to scan the heap only for matching pages: +---------------------------------------------+ |___________X_______X______________X__________| +---------------------------------------------+ The bitmap heap scan then seeks to the start of each page and reads the page: +---------------------------------------------+ |___________X_______X______________X__________| +---------------------------------------------+ seek------->^ seek-- > ^ seek- > ^ |-only these pages read
For example:
Select * from test where C1... And crt_time between? And? And test- > > C1 in.
Automatically use the appropriate index based on statistics. If necessary, bitmapAnd and bitmapOr will automatically perform merge scans on multiple indexes. Skip pages that do not need to be scanned and re-check the hit pages.
Heap table storage grading and partitioning
Storage can be divided into first-level or multi-level partitions:
1. Single partition
For example, by time.
Create table test (id int8, crt_time timestamp,-- Time pos geometry,-- Location obj jsonb-- Object description) PARTITION BY range (crt_time); create table test_201701 PARTITION OF test for values FROM (2017-01-01) TO (2017-02-01);
two。 Multi-tier partition
For example, partition by time and then by Geohash.
Create table test_201701 PARTITION OF test for values FROM (2017-01-01) TO (2017-02-01) partition by range (st_geohash (pos,15)); Create table test_201701_prefix1 PARTITION OF test for values FROM (xxxx1) TO (xxxx2)-- Generate BOX (GRID) on a map, find corresponding boundaries and use-- boundaries as partitioning conditions
When using partitions, if the query criteria include partition keys (such as time and space ranges), the corresponding partitions are automatically located, which is the amount of data to be scanned.
Create an GIN index of object-oriented attributes to achieve efficient queries.
Index grading and partitioning
Like data, indexes support partitioning logic without using partitioned tables.
Spatial index + time partition
Create index idx_20170101 on tbl using gist (pos) where crt_time between 2017-01-01 and 2017-01-02; Create index idx_20170102 on tbl using gist (pos) where crt_time between 2017-01-02 and 2017-01-03;
By using the aforementioned partition index, the target data can be quickly located and the spatial search can be performed after entering the time range.
Select * from tbl where crt_time between 2017-01-01 and 2017-01-02-Time and (pos?)
< ? -- Distance to a point to be searched for and ? -- Other conditions order by pos ? -- Sort by distance limit ?; -- Number of results to be returned 可以使用更多的索引分区,比如用作搜索条件和商店类型的维度(对象属性)(假设它是可枚举的或在范围相对较小的情况下)。 create index idx_20170101_mod0 on tbl using gist (pos) where crt_time between 2017-01-01 and 2017-01-02 and dtype=0; ... create index idx_20170101_mod1 on tbl using gist (pos) where crt_time between 2017-01-01 and 2017-01-02 and dtype=1; ... 通过使用前面的分区索引,在输入时间范围或特定条件以执行空间搜索后,可以快速定位目标数据。 select * from tbl where crt_time between 2017-01-01 and 2017-01-02 -- Time and (pos ?) < ? -- Distance to a point to be searched for and dtype=0 -- Object condition and ? -- Other conditions order by pos ? -- Sort by distance limit ?; -- Number of results to be returned 请注意,前面的SQL查询可以实现最佳性能优化。 索引组织形式(或索引结构)可以由逻辑分区重新构造,可以用上述类似的索引创建方法覆盖所有条件。 CTID相交阵列连接扫描 如前所述,BitmapAnd和BitmapOr合并扫描是在多个索引或GIN索引中自动执行的。事实上,这种扫描也可以在SQL中显式执行。 每个条件渗透对应的CTID。 使用Intersect或Union生成满足总体需求的CTID。(Intersect对应于"and"条件;union对应于"or"条件。) 生成一个ctid数组。 示例 1. 创建对象提要数据表 postgres=# create table tbl (id int, info text, crt_time timestamp, pos point, c1 int , c2 int, c3 int ); CREATE TABLE 2. 将5000万条测试数据写入表中 postgres=# insert into tbl select generate_series(1,50000000), md5(random()::text), clock_timestamp(), point(180-random()*180, 90-random()*90), random()*10000, random()*5000, random()*1000; INSERT 0 50000000 3. 创建对象索引 postgres=# create index idx_tbl_1 on tbl using gin (info, c1, c2, c3); CREATE INDEX 4. 创建时间索引 postgres=# create index idx_tbl_2 on tbl using btree (crt_time); CREATE INDEX 5. 创建空间索引 postgres=# create index idx_tbl_3 on tbl using gist (pos); CREATE INDEX 6. 生成数据布局以方便后续查询 postgres=# select min(crt_time),max(crt_time),count(*) from tbl; min | max | count ----------------------------+----------------------------+---------- 2017-07-22 17:59:34.136497 | 2017-07-22 18:01:27.233688 | 50000000 (1 row) 7. 创建一个极限KNN查询函数 create or replace function ff(point, float8, int) returns setof tid as $ declare v_rec record; v_limit int := $3; begin set local enable_seqscan=off; -- Force index that exits when scanned rows reach a specific number for v_rec in select *, (pos $1) as dist, ctid from tbl order by pos $1 loop if v_limit $2 then -- raise notice "All matching points returned" return; else return next v_rec.ctid; end if; v_limit := v_limit -1; end loop; end; $ language plpgsql strict volatile; postgres=# select * from ff(point (100,100) ,100,100) ; ff ------------- (407383,11) (640740,9) (26073,51) (642750,34) ... (100 rows) Time: 1.061 ms 8. CTID合并检索 显示符合以下条件的记录 ( c1 in (1,2,3,4,100,200,99,88,77,66,55) or c2 < 10 ) and pos point (0,0) < 5 and crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40 ; 首先,分别查看每个条件,找匹配一个条件的记录数量,以及在索引扫描上所花时长。 1. 54,907条记录 postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where c1 in (1,2,3,4,100,200,99,88,77,66,55); QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on postgres.tbl (cost=820.07..65393.94 rows=54151 width=73) (actual time=23.842..91.911 rows=54907 loops=1) Output: id, info, crt_time, pos, c1, c2, c3 Recheck Cond: (tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[])) Heap Blocks: exact=52778 Buffers: shared hit=52866 ->Bitmap Index Scan on idx_tbl_1 (cost=0.00..806.54 rows=54151 width=0) (actual time=14.264..14.264 rows=54907 loops=1) Index Cond: (tbl.c1 = ANY ({1 ms Execution time 2, 3, 4, 100, 200, 99, 88, 7, 55}:: Intel []) Buffers: shared hit=88 Planning time: 0.105 ms Execution time: 94.606 ms (10 rows)
2. 95147 records
Postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where C2 Bitmap Index Scan on idx_tbl_1 (cost=0.00..810.79 rows=99785 width=0) (actual time=53.612..53.612 rows=95147 loops=1) Index Cond: (tbl.c2
< 10) Buffers: shared hit=53 Planning time: 0.094 ms Execution time: 186.201 ms (10 rows) 3. 149930条记录(为快速获得结果,PostgreSQL使用位图进行合并扫描) postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where c1 in (1,2,3,4,100,200,99,88,77,66,55) or c2 BitmapOr (cost=1694.23..1694.23 rows=153936 width=0) (actual time=73.763..73.763 rows=0 loops=1) Buffers: shared hit=141 ->Bitmap Index Scan on idx_tbl_1 (cost=0.00..806.54 rows=54151 width=0) (actual time=16.733..16.733 rows=54907 loops=1) Index Cond: (tbl.c1 = ANY): (tbl.c1 = ANY ({1 cost=0.00..810.79 rows=99785 width=0 2) 3 actual time=57.029..57.029 rows=95147 loops=1 4 100 100 200 99 88 7 55}:: Integer []) Buffers: shared hit=88-> Bitmap Index Scan on idx_tbl_1 (cost=0.00..810.79 rows=99785 width=0) (actual time=57.029..57.029 rows=95147 loops=1) Index Cond: (tbl.c2
< 10) Buffers: shared hit=53 Planning time: 0.149 ms Execution time: 274.548 ms (15 rows) 4. 60,687条记录(即使运用出色的KNN性能优化,仍然需要耗费195毫秒)。 postgres=# explain (analyze,verbose,timing,costs,buffers) select * from ff(point (0,0) ,5,1000000); QUERY PLAN ---------------------------------------------------------------------------------------------------------------------- Function Scan on postgres.ff (cost=0.25..10.25 rows=1000 width=6) (actual time=188.563..192.114 rows=60687 loops=1) Output: ff Function Call: ff( (0,0) ::point, 5 ::double precision, 1000000) Buffers: shared hit=61296 Planning time: 0.029 ms Execution time: 195.097 ms (6 rows) 让我们看看不使用KNN优化需要多长时间。 结果非常令人惊讶——极限优化性能提高了一个数量级。 5. 2,640,751条记录 使用所有索引逐个扫描数据条件,得到ctid并执行ctid扫描。 现在,让我们来分解这个过程: 首先,让我们看看时间和对象属性的合并查询,成果非常惊人。使用位图BitmapOr时,查询可以跳过大多数数据块,并且扫描时间比单索引扫描要短。 注意,在此步骤中记录的数量减少到7,847条。 postgres=# explain (analyze,verbose,timing,costs,buffers) select ctid from tbl where crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40 and ( c1 in (1,2,3,4,100,200,99,88,77,66,55) or c2 < 10 ); QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on postgres.tbl (cost=35025.85..44822.94 rows=7576 width=6) (actual time=205.577..214.821 rows=7847 loops=1) Output: ctid Recheck Cond: (((tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[])) OR (tbl.c2 < 10)) AND (tbl.crt_time >= 2017-07-22 17:59:34:: timestamp without time zone) AND (tbl.crt_time BitmapAnd (cost=35025.85..35025.85 rows=7581 width=0) (actual time=204.048..204.048 rows=0 loops=1) Buffers: shared hit=7360-> BitmapOr (cost=1621.11..1621.11 rows=153936 width=0) (actual time=70.279..70.279 rows=0 loops=1) Buffers: shared hit=141-> Bitmap Index Scan on idx_tbl_1 (cost=0.00..806.54 rows=54151 width=0) (actual time=15 .860.. 15.860 rows=54907 loops=1) Index Cond: (tbl.c1 = ANY ({1recover2) 3 Buffers 4 cost=0.00..810.79 rows=99785 width=0 100 100 cost=0.00..810.79 rows=99785 width=0 200 99 99 77 6 55:: integer []) Buffers: shared hit=88-> Bitmap Index Scan on idx_tbl_1 (cost=0.00..810.79 rows=99785 width=0) (actual time=54.418..54.418 rows=95147 loops=1) Index Cond: (tbl.c2
< 10) Buffers: shared hit=53 ->Bitmap Index Scan on idx_tbl_2 (cost=0.00..33402.60 rows=2462443 width=0) (actual time=127.101..127.101 rows=2640751 loops=1) Index Cond: (tbl.crt_time > = 2017-07-22 17:59:34:: timestamp without time zone) AND (tbl.crt_time = 2017-07-22 17:59:34:: timestamp without time zone) AND (tbl.crt_time BitmapAnd (cost=35022.06..35022.06 rows=7581 width=0) (actual time=203.620..203. 620 rows=0 loops=1) Buffers: shared hit=7360-> BitmapOr (cost=1618.58..1618.58 rows=153936 width=0) (actual time=71.660..71.660 rows=0 loops=1) Buffers: shared hit=141-> Bitmap Index Scan on idx_tbl_1 (cost=0.00..806.54 rows=54151 width=0) (actual time=14.861..14.861 rows=54907 loops=1) Index Cond: (tbl.c1 = ANY ({1mer2) 3 Buffers 4 cost=0.00..810.79 rows=99785 width=0 100 100 cost=0.00..810.79 rows=99785 width=0 200 99 99 77 6 55:: integer []) Buffers: shared hit=88-> Bitmap Index Scan on idx_tbl_1 (cost=0.00..810.79 rows=99785 width=0) (actual time=56.797..56.797 rows=95147 loops=1) Index Cond: (tbl.c2
< 10) Buffers: shared hit=53 ->Bitmap Index Scan on idx_tbl_2 (cost=0.00..33402.60 rows=2462443 width=0) (actual time=125.255..125.255 rows=2640751 loops=1) Index Cond: (tbl.crt_time > = 2017-07-22 17:59:34:: timestamp without time zone) AND (tbl.crt_time
Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.
Views: 0
*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.
Continue with the installation of the previous hadoop.First, install zookooper1. Decompress zookoope
"Every 5-10 years, there's a rare product, a really special, very unusual product that's the most un
© 2024 shulou.com SLNews company. All rights reserved.