2025-01-18 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
This article explains how to handle SQL problems that involve iterative data. The method introduced here is simple, fast and practical.
1. Original description of the production problem
When tagging users in a real environment, the source data for tags covers almost the whole department, or even the whole company. Some sources are detail tables, some are raw ODS logs, some are dimension tables, and some are list tables. The demand side needs tags to go online and iterate quickly, so different processing methods are often chosen depending on the requirement, the data warehouse layering, or how troublesome the processing is. On the application side, someone also has to look up tag values, check them, analyze assets, and so on. The example in this article is a list table. A list table is the result of an audience selection, that is, a set of ids that meet some condition. Tags normally exist to support audience selection, but here the raw data is itself the result of a selection. To make the data more convenient for the business side, the list is processed into tags. So the requirement is fixed: build tags from the list.
The original table is as follows:
id  dt
1   10
2   10
2   11
3   11
Explanation: for simplicity, dates are represented by numbers. Each day has a set of de-duplicated ids, and the sets of two consecutive days overlap like two intersecting circles: some ids are added, some are dropped, and some appear on both days.
The requirement is now:
I want to filter users over any period of time by whether, within that period, they became valid for the first time, became invalid, or stayed valid (stock valid). These correspond to the ids that are added, dropped, and overlapping each day.
As for the definition of validity: if the list does not contain an id on the previous day but contains it on the next day, the id is valid for the first time. Because this kind of data has both valid and invalid ids, each daily partition of the tag contains yesterday's ids plus today's ids.
2. Analysis and breakdown of the production problem
Given the analysis above, the valid or invalid state cannot simply be the most recent state or the first one, because there are intervals: within some period of history, an id may become valid for the first time and then overlap for many days. To handle this, I compute each day's state relative to the previous day, encoded as 0, 1 and 2: invalid, valid for the first time, and stock valid, respectively. Each interval then carries its own information.
So the problem becomes: first find the valid state, then compute the earliest and latest time on top of that state. For easier filtering, I also derive two redundant sub-attribute fields: whether the id is valid for the first time, and whether it is currently valid or invalid. One option is a window function that finds the previous date of the same id: if that date is exactly the day before, the state is 2 (stock valid); otherwise the id is newly valid. But this method cannot produce the invalid rows, so I later settled on a full join, which returns all the data from both the previous day and the current day.
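The day-over-day comparison described above can be tried on the sample table. The following is a minimal sketch, not the author's production code: it runs on SQLite through Python's `sqlite3` module, and it emulates the full join with two left joins plus `UNION ALL` as an assumption made for portability. The table name `t1` and the cutoff that skips the day after the last loaded partition are also assumptions.

```python
import sqlite3

# Sample list table from the article: one row per (id, dt), dates as integers.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (id INTEGER, dt INTEGER)")
conn.executemany("INSERT INTO t1 VALUES (?, ?)",
                 [(1, 10), (2, 10), (2, 11), (3, 11)])

# Full join emulated with two LEFT JOINs (assumption, for portability).
# a = current day, b = previous day.
STATE_SQL = """
SELECT id, dt, is_valid FROM (
    -- ids present today: first-time valid (1) or stock valid (2)
    SELECT a.id AS id, a.dt AS dt,
           CASE WHEN b.id IS NULL THEN 1 ELSE 2 END AS is_valid
    FROM t1 a
    LEFT JOIN t1 b ON a.id = b.id AND a.dt = b.dt + 1
    UNION ALL
    -- ids present yesterday but missing today: invalid (0)
    SELECT b.id, b.dt + 1, 0
    FROM t1 b
    LEFT JOIN t1 a ON a.id = b.id AND a.dt = b.dt + 1
    WHERE a.id IS NULL
      AND b.dt + 1 <= (SELECT MAX(dt) FROM t1)  -- skip the day not loaded yet
)
ORDER BY dt, id
"""

rows = conn.execute(STATE_SQL).fetchall()
for row in rows:
    print(row)
```

On day 11 this yields id 1 as invalid (0), id 2 as stock valid (2) and id 3 as valid for the first time (1). Every id on day 10 counts as first-time valid only because the sample has no day 9 to compare against.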
3. Problem solving
-- Valid / invalid state of each day relative to the previous day
with T3 as (
  select coalesce(a.id, b.id) as id,
         coalesce(a.dt, b.dt + 1) as dt,
         CASE WHEN a.id IS NOT NULL AND b.id IS NULL THEN 1      -- valid for the first time
              WHEN a.id IS NULL AND b.id IS NOT NULL THEN 0      -- invalid
              WHEN a.id IS NOT NULL AND b.id IS NOT NULL THEN 2  -- stock valid
         END as is_valid
  from T1 a
  full join T1 b
    on a.id = b.id and a.dt = b.dt + 1   -- a: current day, b: previous day
)
-- Earliest and latest date of each valid run, for example earliest 12 and latest 14
select id, min(dt) as min_dt, max(dt) as max_dt, gid
from (
  select id, dt, is_valid,
         (dt - row_number() over (partition by id order by dt)) gid
  from T3
  where is_valid != 0
) tmp
group by id, gid
4. Summarize the routine
4.1. First of all, think about the date functions you can use.
datediff, date_sub/date_add
4.2. Consecutive dates
Continuity problems use a ranking function, but a ranking function returns a number, so grouping is only easy once dates and numbers are mapped onto the same continuous scale. Dates can be mapped to consecutive numbers, or numbers can be mapped to consecutive dates. Both mappings are done with datediff and date_sub. The principle: subtracting one date from another (datediff) gives consecutive integers, and subtracting an integer from a date (date_sub) gives consecutive dates. With date_sub, a reverse-order ranking can also be used to obtain consecutive dates.
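The two mappings can be seen side by side in a small experiment. This is a sketch under stated assumptions: it runs on SQLite rather than Hive, so `julianday()` stands in for datediff and `date(dt, '-N days')` stands in for date_sub. The table `visits`, its rows, and the base date `2024-01-01` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (id TEXT, dt TEXT)")  # hypothetical table
conn.executemany("INSERT INTO visits VALUES (?, ?)", [
    ("a", "2024-01-01"), ("a", "2024-01-02"), ("a", "2024-01-03"),
    ("a", "2024-01-05"), ("a", "2024-01-06"),   # 2024-01-04 is missing
])

MAPPING_SQL = """
SELECT dt, rn,
       -- date -> integer (datediff analogue), then subtract the rank
       CAST(julianday(dt) - julianday('2024-01-01') AS INTEGER) - rn AS gid_int,
       -- date minus rank in days (date_sub analogue)
       date(dt, '-' || rn || ' days') AS gid_date
FROM (
    SELECT id, dt,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY dt) AS rn
    FROM visits
)
ORDER BY dt
"""

mapping = conn.execute(MAPPING_SQL).fetchall()
for row in mapping:
    print(row)
```

Rows that belong to the same run of consecutive dates share the same `gid_int` and the same `gid_date`, so either column can serve as the grouping key.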
4.3. Subtract the rank from a consecutive date or id sequence, then group by the resulting gid
This solves problems of the following kind:
finding the start and end time of each continuous state.
4.4. Special problems, such as needing the data of the previous and next partitions
There are two existing approaches: the lag and lead window functions, and a self join. The window functions should be more efficient. In this case, though, the list table is special: when the full set of ids is tagged, the result partition differs from the original partition, because it also contains the users who became invalid relative to the previous day. That is why a full self join is used.
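As a sketch of the grouping step, the snippet below applies the rank subtraction to a state table shaped like `T3` in section 3 and returns the start and end of each valid run. It runs on SQLite through Python's `sqlite3` module; the rows for id 7 are hypothetical data chosen to contain two separate runs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t3 (id INTEGER, dt INTEGER, is_valid INTEGER)")
# Hypothetical states for one id: valid on 10-12, invalid on 13,
# valid again on 15-16, invalid on 17.
conn.executemany("INSERT INTO t3 VALUES (?, ?, ?)", [
    (7, 10, 1), (7, 11, 2), (7, 12, 2), (7, 13, 0),
    (7, 15, 1), (7, 16, 2), (7, 17, 0),
])

RUNS_SQL = """
SELECT id, MIN(dt) AS min_dt, MAX(dt) AS max_dt
FROM (
    -- dt minus its rank is constant inside a run of consecutive days
    SELECT id, dt,
           dt - ROW_NUMBER() OVER (PARTITION BY id ORDER BY dt) AS gid
    FROM t3
    WHERE is_valid != 0
) tmp
GROUP BY id, gid
ORDER BY id, min_dt
"""

runs = conn.execute(RUNS_SQL).fetchall()
print(runs)
```

The invalid rows are filtered out before ranking, so the gap they leave changes the value of `dt - rank` and starts a new group.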
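To illustrate the trade-off, here is a sketch of the window-function variant on the sample table, again on SQLite through Python's `sqlite3` module (the table name `t1` is an assumption). It marks first-time valid and stock valid rows with `LAG`, and shows that no invalid row can appear, because the window only sees rows that exist in the source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (id INTEGER, dt INTEGER)")
conn.executemany("INSERT INTO t1 VALUES (?, ?)",
                 [(1, 10), (2, 10), (2, 11), (3, 11)])

LAG_SQL = """
SELECT id, dt,
       CASE WHEN dt - LAG(dt) OVER (PARTITION BY id ORDER BY dt) = 1
            THEN 2   -- seen the day before: stock valid
            ELSE 1   -- no previous day, or a gap: valid for the first time
       END AS is_valid
FROM t1
ORDER BY dt, id
"""

lag_rows = conn.execute(LAG_SQL).fetchall()
for row in lag_rows:
    print(row)
```

Id 1 disappears on day 11, but the result has no row for it on that day, which is the gap the full self join fills.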
At this point, you should have a deeper understanding of how to handle SQL problems that involve iterative data. It is worth trying the approach out in practice.
© 2024 shulou.com SLNews company. All rights reserved.