In addition to Weibo, there is also WeChat
Please pay attention
WeChat public account
Shulou
2025-01-18 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >
Share
Shulou(Shulou.com)06/01 Report--
This article mainly introduces "how to iterate or roll back the zipper table in the development of big data". In the daily operation, I believe that many people have doubts about how to iterate or roll back the zipper table in the development of big data. I have consulted all kinds of data and sorted out simple and easy-to-use operation methods. I hope it will be helpful to answer the doubts of "how to iterate or roll back the zipper table in the development of big data!" Next, please follow the editor to study!
1. Background
What is a zipper table? when the data warehouse is established, an important table data processing method can be used to compare the data structure to the algorithm and to the zipper table to the data warehouse, in order to solve the SCD requirements in the data warehouse establishment. Then what is SCD, that is, slowly changing the dimension? with the passage of time, the data changes slowly relative to the fact table.
There are several common ways to deal with SCD:
Retain the original value
Direct coverage
Add a new property column
Snapshot table
Zipper list
This article mainly explains the zipper list to deal with the problem of SCD, its characteristics are summarized below, there are several scenarios, you can use zipper list.
1. The table has a large amount of data, and using a full table will take up a lot of storage.
two。 Table data will be modified, with increment table, it is difficult to deal with repetition and modify data
3. There is a need for backtracking, and you need to know the full amount of data at a certain point in history.
4. The data has been modified, but the frequency and quantity are not very large, for example, only 1/1000000 have been modified.
two。 Zipper table processing theory
First of all, the zipper table is a full table and not a partition table. In order to achieve the various effects described above, it is necessary to have an intermediate table as an intermediate springboard. This intermediate springboard table is a partition table, the data is incremental data, and the incremental content includes modifications and additions, that is, create_time or update_time often falls on the current day. For the zipper table, you need to add two fields that have nothing to do with the original data to identify the data start time and valid expiration time. In the example, these two dates are start_date and end_date, respectively. The zipper table can be processed in the following three ways: initialization, daily data update, and data rollback.
2.1 initialization and new data
Its daily scrolling is shown in the following figure:
The initialization part is the start time of the full zipper scale, and also establishes the earliest time when the rollback can be rolled back. If the logic is updated every day as shown in the figure above, the new data will be divided into two parts. For the same or unchanged data in the partition on the same day, you can update the data by modifying the corresponding start_date and end_date respectively.
2.1 data rollback
For the above update logic, let's consider how to roll back the data, that is, back to a point in history, which is the full scale for the zipper table, so there is only one rollback. The rollback strategy can be based on the start_date and end_date generated by the rollback time and data, and how to roll back. Let's take a look at the following diagram:
The data in end_date < rollback_date should be retained, and the end_date should be set to 9999-12-31 for processing end_date ≥ rollback_date ≥ start_date. For the results of rollback, generally, in order to maintain the integrity of the data, the rolled-back data can be placed in a new zipper temporary table.
3. Zipper table processing case
For the common hierarchical DIM of data warehouses, that is, the dimension layer is a common scenario for zipper tables, here is an example of how to add and roll back zipper tables.
Use the zipper table to realize the DIM layer merchant dimension table in the core transaction analysis, and realize the rollback of the zipper table.
3.1 create tables and import data
The structure of the merchant dimension table is as follows:
-- create a merchant information table (increment table partition table) drop table if exists ods.ods_trade_shops Create table ods.ods_trade_shops (`shopid`int COMMENT 'shop ID', `userid`int COMMENT' shop owner', `areaid`int COMMENT 'area ID', `shopname`shop name', `shoplevel` int COMMENT 'shop rank', `status`shop status', `createtime`string COMMENT 'creation date', `modifytime`string COMMENT 'modification date') COMMENT 'merchant information table' PARTITIONED BY (`dt` string) row format delimited fields terminated by',' -- create the merchant information dimension table drop table if exists dim.dim_trade_shops Create table dim.dim_trade_shops (`shopid`int COMMENT 'shop ID', `userid`int COMMENT' shop owner', `areaid`int COMMENT 'area ID', `shopname`shop name', `shoplevel` int COMMENT 'shop level', `status`shop status', `createtime`string COMMENT 'creation date', `modifytime`string COMMENT 'modification date', `startdate` string COMMENT 'effective date' `enddate`string COMMENT 'expiration date') COMMENT 'merchant information table'
Import the following test data:
/ root/data/shop-2020-11-20.dat100050 precinct 1pene 100225, WSxxx Camp supermarket, 1meme 12020-06-28 pyrrine 2020-06-28 pyrrine 2020-11-20 1313 purl 22100052meme 100236, fresh xxx flagship store, 1meme 2020-06-28 pill 22100053Mae 3PM100011, Huawei xxx flagship store, 1ml2020-06-28ml2020-20 13ml22100054Mae 100159, Xiaomi xxx flagship store, 1Jie 1Jie 2020-06-28je 2020-28je 2011-28je 2011-28je 2020-20 13purl 2210005Mae 5Mae 100211, Apple xxx flagship store 2020-06-28 min2020-11-20 13:22:22 / root/data/shop-2020-11-21.dat100057 7pr 100311, snacks for three xxx rats, 1meme 1jr 20628 pill 22100058 8pas 100329, Ryoko xxx shop delicacy, 1Mi 22100054Mi 100159, Xiaomi xxx flagship store, 2JI 1JI 2006-28M 2011-21 1313 xxx 2210005jr 5jue 100211, Apple xxx flagship store, 2QI 1. 2020-06-28 xxx 2020-11-21 13:22:22 / root/data/shop-2020-11-22.dat100059 9pju 100225, Leju xxx Commodity, 1meme 2020-06-28 pint 22100060 1010211, colleague xxx Great Health, 1meme 22100052 22100052 22100236, fresh xxx flagship store 2020-06-28 13:22:22load data local inpath'/ root/data/shop-2020-11-20.dat' overwrite into table ods.ods_trade_shops partition (dt='2020-11-20') Load data local inpath'/ root/data/shop-2020-11-21.dat' overwrite into table ods.ods_trade_shops partition (dt='2020-11-21'); load data local inpath'/ root/data/shop-2020-11-22.dat' overwrite into table ods.ods_trade_shops partition (dt='2020-11-22'); 3.2 zipper list initialization
Suppose you take the first day's data as all the historical data.
INSERT OVERWRITE TABLE dim.dim_trade_shopsSELECT shopid, userid, areaid, shopname, shoplevel, status, createtime, modifytime, CASE WHEN modifytime IS NOT NULL THEN substr (modifytime, 0,10) ELSE substr (createtime, 0,10) END AS startdate, '9999-12-31' AS enddateFROM ods.ods_trade_shopsWHERE dt = '2020-11-20 updated zipper list
For the increment table, the general logic is that the interception of create_time or modifytime as the current day partition dt,modifytime is greater than or equal to create_time, here take the first two
INSERT OVERWRITE TABLE dim.dim_trade_shopsSELECT shopid, userid, areaid, shopname, shoplevel, status, createtime, modifytime, CASE WHEN modifytime IS NOT NULL THEN substr (modifytime, 0,10) ELSE substr (createtime, 0,10) END AS startdate, '9999-12-31' AS enddateFROM ods.ods_trade_shopsWHERE dt = '2020-11-21'UNION ALLSELECT b.shopid B.userid, b.areaid, b.shopname, b.shoplevel, b.status, b.createtime, b.modifytime, b.startdate, CASE WHEN a.shopid IS NOT NULL AND b.enddate = '9999-12-31' THEN date_add ('2020-11-21' -1) ELSE b.enddate END AS enddateFROM (SELECT * FROM ods.ods_trade_shops WHERE dt='2020-11-21') aRIGHT JOIN dim.dim_trade_shops b ON a.shopid = b.shopid
The script to load the zipper table is as follows:
Dim_load_shops.sh
#! / bin/bash source / etc/profileif [- n "$1"] then do_date=$1else do_date= `date-d "- 1 day" +% F`fi sql= "INSERT OVERWRITE TABLE dim.dim_trade_shopsSELECT shopid, userid, areaid, shopname, shoplevel, status, createtime, modifytime, CASE WHEN modifytime IS NOT NULL THEN substr (modifytime, 0,10) ELSE substr (createtime, 0,10) END AS startdate '9999-12-31' AS enddateFROM ods.ods_trade_shopsWHERE dt ='$do_date'UNION ALLSELECT b.shopid, b.userid, b.areaid, b.shopname, b.shoplevel, b.status, b.createtime, b.modifytime, b.startdate CASE WHEN a.shopid IS NOT NULL AND b.enddate = '9999-12-31' THEN date_add ('$do_date',-1) ELSE b.enddate END AS enddateFROM (SELECT * FROM ods.ods_trade_shops WHERE dt='$do_date') aRIGHT JOIN dim.dim_trade_shops b ON a.shopid = b.shopid "hive-e" $sql "
You can execute this script to load 2020-12-22 data, sh dim_load_shops.sh 2020-12-22
3.4 Roll back the zipper table to a certain point in time
First create a temporary table that tmp.shops_tmp uses to hold the rolled-back data
DROP TABLE IF EXISTS tmp.shops_tmp CREATE TABLE IF NOT EXISTS tmp.tmp_shops ASSELECT shopid, userid, areaid, shopname, shoplevel, status, createtime, modifytime, startdate, enddateFROM dim.dim_trade_shopsWHERE enddate < '2020-11-21'UNION ALLSELECT shopid, userid, areaid, shopname, shoplevel, status, createtime, modifytime, startdate '9999-12-31' AS enddateFROM dim.dim_trade_shopsWHERE startdate = '2020-11-21' INSERT OVERWRITE TABLE dim.dim_trade_shopsSELECT * FROM tmp.tmp_shops
The rollback script is similar to the update script, as long as the sql in it is updated, and it will not be repeated here.
At this point, the study on "how to iterate or roll back the zipper table in the development of big data" is over. I hope to be able to solve your doubts. The collocation of theory and practice can better help you learn, go and try it! If you want to continue to learn more related knowledge, please continue to follow the website, the editor will continue to work hard to bring you more practical articles!
Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.
Views: 0
*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.
Continue with the installation of the previous hadoop.First, install zookooper1. Decompress zookoope
"Every 5-10 years, there's a rare product, a really special, very unusual product that's the most un
© 2024 shulou.com SLNews company. All rights reserved.