How to use arrays in PostgreSQL

2025-04-05 Update From: SLTechnology News&Howtos

Shulou (Shulou.com), 05/31

This article works through a performance problem with arrays in PostgreSQL, covering both the analysis and the fix, in the hope of helping others who run into the same problem find a simpler approach.

This happened at Heap a few weeks ago. We maintain an array of events for each tracked user, and each event is represented by an hstore datum. An import pipeline appends new events to the corresponding array. To make this pipeline idempotent, we give each event an event_id and run our event arrays through a function that collapses duplicates. To update the properties attached to an event, we just push a new event with the same event_id into the pipeline.
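To make the data shape concrete, here is a minimal sketch of such an event array. It assumes the hstore extension can be installed; the event values and the time and data keys are made up for illustration.

```sql
-- Assumes sufficient privileges to install the hstore extension.
CREATE EXTENSION IF NOT EXISTS hstore;

-- A hypothetical event array for one user. The second and third elements
-- share event_id 2; the later one carries the updated properties.
SELECT ARRAY[
  'event_id=>1, time=>100, data=>foo',
  'event_id=>2, time=>200, data=>bar',
  'event_id=>2, time=>200, data=>baz'
]::hstore[] AS events;
```

For this input, deduplication should keep the first and third elements and drop the second.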

So we need a function that processes an array of hstores and, when two events have the same event_id, keeps the one that appears later in the array. My first attempt at this function read like this:

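-- This is slow, and you don't want to use it!
--
-- Filter an array of events such that there is only one event with each event_id.
-- When more than one event with the same event_id is present, take the latest one.
CREATE OR REPLACE FUNCTION dedupe_events_1(events HSTORE[]) RETURNS HSTORE[] AS $$
  SELECT array_agg(event)
  FROM (
    -- Filter for rank = 1, i.e. select the latest event for any collisions on event_id.
    SELECT event
    FROM (
      -- Rank elements with the same event_id by position in the array, descending.
      SELECT events[sub] AS event, sub,
             rank() OVER (PARTITION BY (events[sub] -> 'event_id')::BIGINT ORDER BY sub DESC)
      FROM generate_subscripts(events, 1) AS sub
    ) deduped_events
    WHERE rank = 1
    ORDER BY sub ASC
  ) to_agg;
$$ LANGUAGE SQL IMMUTABLE;

This works, but performance degrades badly on large inputs. It is quadratic in the array length, and it takes about 40 seconds when the input array has 100K elements!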

The timings were measured on a MacBook Pro with a 2.4 GHz i7 CPU and 16 GB of RAM, using this script: https://gist.github.com/drob/9180760.

What on earth happened here? The key is that PostgreSQL stores an array of hstores as an array of values, not as an array of pointers to values. An array containing three hstores looks like

{"event_id=>1, data=>foo", "event_id=>2, data=>bar", "event_id=>3, data=>baz"}

rather than

{[pointer], [pointer], [pointer]}

For types whose values vary in length, such as hstores, json blobs, varchars, or text fields, PostgreSQL has to scan the array to find where each element ends. To evaluate events[2], PostgreSQL parses events from the left until it has read the second element. Then, for events[3], it scans again from the first element until it has read the third! So evaluating events[sub] is O(sub), and evaluating events[sub] for every index in the array is O(N^2), where N is the length of the array.
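The quadratic bound comes from summing the per-lookup cost. Taking the cost of events[sub] to be proportional to sub (an idealization that ignores constant factors), the total over all subscripts is:

```latex
\sum_{\mathrm{sub}=1}^{N} \mathrm{sub} \;=\; \frac{N(N+1)}{2} \;=\; O(N^2)
```

For N = 100,000 that is roughly 5 x 10^9 element visits, compared with 10^5 for a single pass over the array.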

PostgreSQL could handle this more cleverly and parse the array only once in a case like this. The real answer is for variable-length elements to be stored as pointers within the array value, so that events[i] can always be evaluated in constant time.

Even so, we shouldn't rely on PostgreSQL to handle this for us, because it is not an idiomatic query. Instead of generate_subscripts, we can use unnest, which parses an array and returns a set of its elements. That way, we never need to index into the array explicitly.
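As a small illustration of the two approaches (the three-element text array is made up), both queries below return the same elements. Only the first one indexes into the array.

```sql
-- generate_subscripts yields the subscripts 1..N; each arr[sub] is a separate lookup.
SELECT sub, arr[sub] AS elem
FROM (SELECT ARRAY['a', 'bb', 'ccc'] AS arr) AS t,
     generate_subscripts(arr, 1) AS sub;

-- unnest expands the array into one row per element, with no subscripting.
SELECT elem
FROM unnest(ARRAY['a', 'bb', 'ccc']) AS elem;
```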

-- Filter an array of events such that there is only one event with each event_id.
-- When more than one event with the same event_id is present, take the latest one.
CREATE OR REPLACE FUNCTION dedupe_events_2(events HSTORE[]) RETURNS HSTORE[] AS $$
  SELECT array_agg(event)
  FROM (
    -- Filter for rank = 1, i.e. select the latest event for any collisions on event_id.
    SELECT event
    FROM (
      -- Rank elements with the same event_id by position in the array, descending.
      SELECT event, row_number AS index,
             rank() OVER (PARTITION BY (event -> 'event_id')::BIGINT ORDER BY row_number DESC)
      FROM (
        -- Use unnest instead of generate_subscripts to turn an array into a set.
        SELECT event, row_number() OVER (ORDER BY event -> 'time')
        FROM unnest(events) AS event
      ) unnested_data
    ) deduped_events
    WHERE rank = 1
    ORDER BY index ASC
  ) to_agg;
$$ LANGUAGE SQL IMMUTABLE;

The result is correct, and the running time is linear in the size of the input array. It takes about half a second on a 100K-element input, compared with 40 seconds for the previous implementation.
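The difference between the two access patterns can also be observed without hstore. The sketch below is not the benchmark script linked earlier; the array contents and size are assumptions, and actual timings will depend on the PostgreSQL version and hardware.

```sql
-- In psql, "\timing on" prints the elapsed time of each statement.

-- Pattern 1: one arr[sub] lookup per subscript (the dedupe_events_1 approach).
WITH input AS (
  SELECT array_agg('payload ' || i::text ORDER BY i) AS arr
  FROM generate_series(1, 100000) AS i
)
SELECT count(arr[sub]) AS n
FROM input, generate_subscripts(arr, 1) AS sub;

-- Pattern 2: expand the array once with unnest (the dedupe_events_2 approach).
WITH input AS (
  SELECT array_agg('payload ' || i::text ORDER BY i) AS arr
  FROM generate_series(1, 100000) AS i
)
SELECT count(elem) AS n
FROM input, unnest(arr) AS elem;
```

Both statements count the same 100,000 elements, so any gap in elapsed time comes from how the array is read.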

The new function does everything we need:

Parse the array once, using unnest.

Partition by event_id.

Take the latest occurrence for each event_id.

Sort by input index.
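One detail worth noting: dedupe_events_2 numbers its rows by each event's time value rather than by array position. On PostgreSQL 9.4 or later, unnest ... WITH ORDINALITY exposes each element's position directly. The variant below is a sketch under that assumption. The name dedupe_events_3 and the sample events are hypothetical and do not come from the original benchmark script.

```sql
-- Hypothetical variant; assumes PostgreSQL 9.4+ and the hstore extension.
CREATE EXTENSION IF NOT EXISTS hstore;

CREATE OR REPLACE FUNCTION dedupe_events_3(events HSTORE[]) RETURNS HSTORE[] AS $$
  SELECT array_agg(event ORDER BY idx)
  FROM (
    -- idx is the element's 1-based position in the input array.
    SELECT event, idx,
           row_number() OVER (PARTITION BY (event -> 'event_id')::BIGINT
                              ORDER BY idx DESC) AS rn
    FROM unnest(events) WITH ORDINALITY AS u(event, idx)
  ) ranked
  -- rn = 1 marks the last occurrence of each event_id.
  WHERE rn = 1;
$$ LANGUAGE SQL IMMUTABLE;

-- Usage with made-up events: event_id 1 appears twice, so its later copy wins.
SELECT dedupe_events_3(ARRAY[
  'event_id=>1, data=>old',
  'event_id=>2, data=>bar',
  'event_id=>1, data=>new'
]::hstore[]);
```

The call returns a two-element array: the event_id 2 event, followed by the later of the two event_id 1 events.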

Lesson: if you find yourself accessing specific positions in a PostgreSQL array, consider using unnest instead.

The first version, dedupe_events_1, used exactly that access pattern. Its innermost query indexed the array once per subscript, which is what made it quadratic on large inputs:

SELECT events[sub] AS event, sub,
       rank() OVER (PARTITION BY (events[sub] -> 'event_id')::BIGINT ORDER BY sub DESC)
FROM generate_subscripts(events, 1) AS sub
