2025-01-16 Update From: SLTechnology News&Howtos
Hive storage
This section explains how to design the corresponding table structures when using Hive as the data warehouse for a user-profile application.
Hive data warehouse
To build user profiles, we first need a data warehouse in which to store user tag data. Hive is a data warehouse tool built on Hadoop: it stores its data in HDFS and provides an SQL-like language (HiveQL) for querying that data. In development, Hive is generally used as the warehouse for tags, user features, and other related data.
W. H. Inmon, the father of the data warehouse, defines a data warehouse in his book Building the Data Warehouse as "a subject-oriented, integrated, nonvolatile, and time-variant collection of data in support of management's decisions."
Subject-oriented: data in business databases is organized for transaction processing, with the business systems separated from one another, whereas data in the data warehouse is organized around subjects.
Integrated: the data stored in the warehouse is extracted from business databases, but it is not a simple copy of the original data; it passes through extraction, cleaning, and transformation (ETL). Business databases record transaction-level ledgers that are not suited to analytical processing, so the data must go through a series of computations before entering the warehouse, and data irrelevant to analysis is discarded.
Nonvolatile: business databases generally keep only short-term data, so their contents are unstable, recording transient changes in the system. Most data in the warehouse represents the state at some point in the past and is used mainly for query and analysis; unlike a business-system database, it is rarely modified. Once the warehouse is built, it is mainly accessed, not updated or deleted.
Time-variant: the data warehouse focuses on historical data, periodically loading new data from business databases and log databases in chronological order, and its records carry time attributes.
The process of data extraction to the data warehouse is shown in the following figure.
Data warehouse modeling mainly involves designing and developing fact tables and dimension tables.
The fact table is designed around business processes. By application scenario, fact tables mainly fall into three types: transaction fact tables, periodic snapshot fact tables, and cumulative snapshot fact tables.
Transaction fact table: describes business processes, and can be further divided into single-transaction and multi-transaction fact tables depending on whether it covers one business process or several. A single-transaction fact table records each business process separately; for example, order placement is recorded in an order fact table and payment in a payment fact table. A multi-transaction fact table holds different business processes in the same table; for example, ordering, payment, and sign-off are all recorded in one table, with an added field identifying which business process each row belongs to. When the different business processes are similar, consider putting them into a multi-transaction fact table.
Periodic snapshot fact table: measures the state of the business over a fixed interval, for example a user's payment amount over the past year, number of purchases over the past year, or number of login days in the past 30 days.
Cumulative snapshot fact table: used to examine the time between different events, for example the user's time from purchase to payment, or from order placement to order completion. It generally suits business processes with a clear time span.
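As a concrete sketch of a periodic snapshot measure, the past-30-days login-days metric mentioned above could be computed along the following lines. The table and column names here (dw.user_login_30d_snapshot, dw.user_login_log, login_date) and the ${start_date} variable are illustrative assumptions, not from the source:

```sql
-- Illustrative periodic snapshot job: distinct login days per user over
-- the past 30 days. Table/column names are assumed; ${start_date} is
-- expected to be computed by the scheduler as data_date minus 30 days.
INSERT OVERWRITE TABLE dw.user_login_30d_snapshot
PARTITION (data_date = '${data_date}')
SELECT userid,
       COUNT(DISTINCT login_date) AS login_days_30d
FROM dw.user_login_log
WHERE login_date BETWEEN '${start_date}' AND '${data_date}'
GROUP BY userid;
```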
The dimension table describes the attributes surrounding the facts; for example, a commodity dimension includes price, discount, brand, original manufacturer, model, and so on. Dimension table development frequently runs into slowly changing dimensions. Common ways to handle slow changes are: ① overwrite the dimension value, losing the historical data; ② keep multiple records, distinguishing them with an added dimension column; ③ build a date-partitioned table whose daily partitions record that day's dimension attributes; ④ build a zipper (chain) table that stores the full history of changes over time. In the profile system, Hive serves as the data warehouse, and the corresponding dimension tables and fact tables are developed to store tags, population data, and other data used by the service layer.
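Approach ② above (keeping multiple records and distinguishing them with an added column) can be sketched as follows; the table and the is_current / effect_date columns are hypothetical:

```sql
-- Hypothetical dimension table in which each attribute change adds a
-- new row instead of overwriting the old one.
CREATE TABLE dim_commodity (
  commodity_id string COMMENT 'commodity id',
  price        double COMMENT 'price in this version of the record',
  is_current   string COMMENT '1 = current record, 0 = historical',
  effect_date  string COMMENT 'date this version took effect'
);
```

When a price changes, the old row's is_current is set to 0 and a new row with is_current = 1 is inserted, so history is preserved.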
Partition storage
If user tags are developed as one large wide table holding dozens of tag types, the daily ETL job for that wide table will take a long time, and adding new tag types to it is inconvenient.
To solve the problem that ETL takes a long time, you can start from the following aspects:
Partition the data and execute the jobs separately;
Performance-tune the tag scripts;
Develop intermediate tables for data sources shared by several tags.
A solution using per-dimension tag tables with partitioned storage is introduced below.
Based on the dimensions of the tag taxonomy (demographic attributes, behavioral attributes, user consumption, risk control, social attributes, and so on), a tag table is created for each dimension to store the corresponding tag data, as shown in the figure below.
Demographic attribute table: dw.userprofile_attritube_all
Behavior attribute table: dw.userprofile_action_all
User consumption table: dw.userprofile_consume_all
Risk control table: dw.userprofile_riskmanage_all
Social attribute table: dw.userprofile_social_all
For example, create a wide table of demographic attributes for a user:
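The DDL itself is not reproduced in this text, so the following is a hedged reconstruction based on the table name listed above and the partition-per-label layout described below; the exact columns and partition scheme are assumptions:

```sql
-- Assumed layout: one row per user per label, partitioned by date and
-- label id so each label's ETL job writes only its own partition.
CREATE TABLE `dw.userprofile_attritube_all` (
  `userid`      string COMMENT 'user id',
  `labelweight` string COMMENT 'label weight'
)
COMMENT 'demographic attribute label wide table'
PARTITIONED BY (
  `data_date` string COMMENT 'data date',
  `labelid`   string COMMENT 'label id'
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
```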
Similarly, users can also use the table structure in the above case for label data storage for other id dimensions (such as cookieid, deviceid, registerid, etc.).
In the design above, user tags are developed in a wide table for the demographic attribute dimension. To improve insert and query efficiency, Hive partitioned tables can be used to store the data in separate directories. A SELECT over an unpartitioned table generally scans the whole table, wasting time on data the query does not need. To scan only the data of interest, partitioning is declared when the table is created; at query time, Hive's partition mechanism then controls how much data each query traverses.
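As an illustration of partition pruning (using the partition columns assumed for dw.userprofile_attritube_all above), a query that filters on partition columns reads only the matching directories:

```sql
-- Hive prunes to the single matching partition directory instead of
-- scanning the whole table. Partition names are assumptions, and the
-- labelid value is a made-up example of a label code.
SELECT userid, labelweight
FROM dw.userprofile_attritube_all
WHERE data_date = '20190801'
  AND labelid   = 'ATTRITUBE_U_01_001';
```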
Label aggregation
In the design of the previous section, each user tag is inserted under its own partition, so for any one user, all the tags attached to them are stored across different partitions. To make analysis and querying convenient, a user's tags need to be aggregated together.
After the tags are aggregated, all the tags on each user are aggregated into a field. The table structure is designed as follows:
CREATE TABLE `dw.userprofile_userlabel_map_all` (
  `userid`     string COMMENT 'userid',
  `userlabels` map<string,string> COMMENT 'tagsmap'
)
COMMENT 'userid user tag aggregation'
PARTITIONED BY (`data_date` string COMMENT 'data date');
Develop a UDF, cast_to_json, that aggregates a user's tags into a JSON string, then run the following statement to aggregate the tags stored across the partitions:
INSERT OVERWRITE TABLE dw.userprofile_userlabel_map_all
PARTITION (data_date = '${data_date}')
SELECT userid,
       cast_to_json(concat_ws(',', collect_set(concat(labelid, ':', labelweight)))) AS userlabels
FROM "each dimension's user label table"  -- placeholder from the source, not a real table name
WHERE data_date = '${data_date}'
GROUP BY userid;
The storage format of the aggregated user label is shown in the figure.
Aggregating a user's tags makes querying and computation easy. For example, in a profile product, entering a user id lets you query this table directly, parse out the tag ids and their weights, and display the user's information on the front end.
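For example, fetching one user's full tag set becomes a single-partition, single-key lookup (the userid and date values here are illustrative):

```sql
-- One row per user: userlabels holds all tag ids and their weights.
SELECT userlabels
FROM dw.userprofile_userlabel_map_all
WHERE data_date = '20190801'
  AND userid    = '10001';
```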
ID-Mapping
A very important part of user tag development is ID mapping (ID-Mapping): identifying, by data means, that a user's identities from different sources belong to the same subject. A user's attribute and behavior data are scattered across different data sources; ID-Mapping links the user's behavior in different scenarios together and eliminates data silos. The figure below shows the many-to-many relationship between users and devices.
The following figure shows the behavior of the same user on different platforms.
For example, before logging in to an App, a user's visits and searches within the App are recorded against the device id (the cookieid). After login, visits, favorites, orders, and other behaviors are recorded against the account id (the userid). Although it is the same user, there is no link between the behavior data recorded before and after login on that device. By establishing the correspondence between userid and cookieid through ID-Mapping, the user's behavior trajectory can be captured both when logged in and when not.
The following example shows how to complete the ID-Mapping ETL work in Hive.
Slowly changing dimensions are a common concern in dimension table design. Dimensions are not constant; they change slowly over time. For example, a user's phone number or email address may change as the user's state changes, and a product's listed price adjusts over time. Slowly changing dimensions are therefore considered when designing dimension tables for users, goods, and so on. Likewise, when designing an ID-Mapping table, because one user can log in on multiple devices and one device can be used by multiple users, consider a slowly changing dimension table to record these state changes at different points in time (figure 3-9).
A zipper table is a design for slowly changing dimension tables that records every state change of an entity from its beginning to its current state.
In the figure above, each time a userid is associated with a different cookieid, the change is recorded in the zipper table. For example, the user with userid 44463729 logs in on one device on 20190101 and on another device on 20190106. start_date is the date a record takes effect and end_date the date it expires; when end_date is 99991231, the record is still valid.
First, obtain records containing both cookieid and userid from the event-tracking log table and the access log table. In the example below, ods.page_event_log is the event-tracking log table and ods.page_view_log is the access log table; the userid and cookieid pairs obtained are inserted into a cookieid-userid relation table, ods.cookie_user_signin. The code is as follows:
INSERT OVERWRITE TABLE ods.cookie_user_signin
PARTITION (data_date = '${data_date}')
SELECT t.*
FROM (
  SELECT userid, cookieid,
         from_unixtime(eventtime, 'yyyyMMdd') AS signdate
  FROM ods.page_event_log        -- event-tracking log table
  WHERE data_date = '${data_date}'
  UNION ALL
  SELECT userid, cookieid,
         from_unixtime(viewtime, 'yyyyMMdd') AS signdate
  FROM ods.page_view_log         -- access log table
  WHERE data_date = '${data_date}'
) t;
Create the ID-Mapping zipper table. Each day, compare the rows newly added to ods.cookie_user_signin with the zipper table's historical data, and update the zipper table where records have changed or been added.
CREATE TABLE `dw.cookie_user_zippertable` (
  `userid`     string COMMENT 'account ID',
  `cookieid`   string COMMENT 'device ID',
  `start_date` string COMMENT 'start_date',
  `end_date`   string COMMENT 'end_date'
)
COMMENT 'id-map zipper table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
After creation, a scheduled ETL job updates the ID-Mapping zipper table daily. The task runs as follows.
INSERT OVERWRITE TABLE dw.cookie_user_zippertable
SELECT t.*
FROM (
  -- Close any still-open record for a userid that appears again today.
  SELECT t1.userid,
         t1.cookieid,
         t1.start_date,
         CASE
           WHEN t1.end_date = '99991231' AND t2.userid IS NOT NULL
             THEN '${data_date}'
           ELSE t1.end_date
         END AS end_date
  FROM dw.cookie_user_zippertable t1
  LEFT JOIN (
    SELECT *
    FROM ods.cookie_user_signin
    WHERE data_date = '${data_date}'
  ) t2
    ON t1.userid = t2.userid
  UNION
  -- Open a new record for each pair seen today.
  SELECT userid,
         cookieid,
         '${data_date}' AS start_date,
         '99991231'     AS end_date
  FROM ods.cookie_user_signin
  WHERE data_date = '${data_date}'
) t;
The data is written to the table, as shown in the figure above.
From this zipper table you can view the snapshot for a given day, such as 20190801. A day's snapshot consists of the records valid on that day, i.e. those whose validity interval covers it:
SELECT * FROM dw.cookie_user_zippertable
WHERE start_date <= '20190801' AND end_date >= '20190801';
For example, suppose a table records the userid-cookieid relationship, but as many-to-many records (one userid corresponds to multiple cookieid records, and one cookieid to multiple userid records). The zipper table's dates let you look up which cookieid a userid corresponded to at a point in time. To view the device ids associated with a user (e.g. 32101029) on a given day (e.g. 20190801):
SELECT cookieid
FROM dw.cookie_user_zippertable
WHERE userid = '32101029'
  AND start_date <= '20190801' AND end_date >= '20190801';
The figure above shows that user '32101029' has logged in on three devices historically; restricting the validity period recovers the device the user was on at a given time.
In development, watch out for the many-to-many association between userid and cookieid: joining the two without restrictive conditions is likely to cause row explosion.
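One way to keep such a join bounded, sketched here under the table layout above, is to restrict the zipper table to the records valid on the processing date before joining, so each cookieid matches only the mappings in effect that day rather than its entire history. The duration column in the view log is an assumed name:

```sql
-- Join only against mappings valid on the target date; an unrestricted
-- join across the full zipper history would multiply rows.
-- g.duration is an assumed column name, not from the source.
SELECT z.userid,
       SUM(g.duration) AS login_duration
FROM ods.page_view_log g
JOIN dw.cookie_user_zippertable z
  ON g.cookieid = z.cookieid
WHERE g.data_date   = '${data_date}'
  AND z.start_date <= '${data_date}'
  AND z.end_date   >= '${data_date}'
GROUP BY z.userid;
```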
In practice, many situations require associating userid with cookieid. For example, you may need userid-dimension tags for a user's purchase count, purchase amount, login duration, and login days over the past 30 days. The first two are easy to compute from the corresponding business tables, while login duration and login days live in log data, and the log tables record the many-to-many userid-cookieid relationship. So when developing tags to business requirements, the tag's definition and scope must be pinned down first.