How do you design a statistical system in big data development?

2025-03-29 Update From: SLTechnology News&Howtos

Shulou (Shulou.com) 06/01

This article explains how to design a statistical system in big data development. The method introduced here is simple, fast, and practical, so interested readers may want to take a look.

1. Background

In a big data production environment, the demand is usually "the faster, the better". In real-time system development, the requirement is usually a state value, such as how many times something happened or how many items there are. Offline data development is not real-time, so it can serve a wide variety of complex requirements. Systems built on the Lambda or Kappa architecture often include a real-time statistical system, which was discussed in the previous article on Lambda architecture design. This article covers another, more complex application scenario: the transaction records produced in a Lambda architecture must answer range queries as quickly as possible.

What does that mean? In the Lambda architecture it is already possible to run range statistics over the transaction records, but to make this as fast as possible, the offline results for count and count(distinct) can be split at a finer granularity and pre-aggregated, for example into Sets. The query layer can then merge the data faster, at the cost of a more complex and less general architecture. Likewise, the real-time range query in this article is a fairly complex, requirement-driven design that answers one question: given a range, how do we return the count as fast as possible?
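The pre-aggregation idea mentioned above can be sketched in a few lines. This is only an illustration under assumptions: the per-day granularity, the `daily_ids` data, and the `count_distinct` helper are hypothetical and do not come from the original design.

```python
from datetime import date

# Hypothetical offline output: one pre-aggregated set of ids per day.
daily_ids = {
    date(2025, 3, 1): {"a", "b"},
    date(2025, 3, 2): {"b", "c"},
    date(2025, 3, 3): {"c", "d", "e"},
}


def count_distinct(daily_sets, start, end):
    """Union the daily sets whose day falls in [start, end] and count the result."""
    merged = set()
    for day, ids in daily_sets.items():
        if start <= day <= end:
            merged |= ids
    return len(merged)
```

The query layer only merges a handful of small sets instead of scanning raw records, which is the speed-up described above, but every new kind of query needs its own pre-aggregated shape.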

## The original data table T1 holds transaction records with the following columns
transactionId, id1, id2, money, create_time

## Purpose
## Given id1 and a range of create_time, get count(money)

2. Design

The background above raises a question: how do we design a system that quickly computes a count over a time range and works as a production-grade real-time query system?

(1) To speed up the count, storing only the raw transaction records is not enough, because counting over a large range is time-consuming. The design therefore precomputes the count: for each id, the cumulative count(money) up to each point in time.

(2) At query time, you only need to find the nearest record before the start time and the nearest record at or before the end time, and then subtract the former's cumulative value from the latter's.

(3) To look up records quickly by time and id, both columns need indexes.

(4) If the requirement only covers queries over the last three months, the starting time node can be placed accordingly: the cumulative value starts at 0 there and then accumulates continuously.

Note: for (4), depending on the application requirements, the origin can be pinned at three months ago, so that the cumulative count at that point in time is 0.
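The lookup-and-subtract idea in (1) and (2) can be sketched as follows. This is a minimal sketch, not the author's implementation: it assumes integer timestamps, a single id, and in-memory sorted lists standing in for an indexed table, and the names `times`, `running`, and `range_total` are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Hypothetical precomputed rows for one id1, sorted by create_time.
# running[i] is the cumulative money up to and including times[i].
times = [10, 20, 30, 40]
running = [5, 12, 20, 21]


def range_total(times, running, start, end):
    """Total money with start <= create_time <= end, using two lookups."""

    def total_up_to(pos):
        # pos is the number of rows included; zero rows means a total of 0.
        return running[pos - 1] if pos > 0 else 0

    left = bisect_left(times, start)   # number of rows strictly before start
    right = bisect_right(times, end)   # number of rows at or before end
    return total_up_to(right) - total_up_to(left)
```

Each query costs two binary searches and one subtraction, regardless of how many records fall inside the range.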

The schematic diagram is as follows:

3. Implementation

The scheme below follows (3), the more general variant, which means more data will be stored.

Assuming the original full data is stored in a MySQL table, create a new table T2 with the following fields:

id1, create_time, money_sum

(1) The initial calculation is as follows

## Running total of money per id1, ordered by create_time (window functions need MySQL 8.0+)
SELECT id1, create_time,
       SUM(money) OVER (PARTITION BY id1 ORDER BY create_time) AS money_sum
FROM T1
WHERE create_time < NOW()
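To make the result of the initial calculation concrete, here is a small sketch of the same running total outside the database. It is an illustration only: the rows are made-up `(id1, create_time, money)` tuples with integer timestamps, and `build_t2` is a hypothetical helper name.

```python
from collections import defaultdict


def build_t2(t1_rows):
    """Return (id1, create_time, money_sum) rows with a running total per id1."""
    totals = defaultdict(int)
    t2_rows = []
    # Sort by id1, then create_time, so each id1 accumulates in time order.
    for id1, create_time, money in sorted(t1_rows):
        totals[id1] += money
        t2_rows.append((id1, create_time, totals[id1]))
    return t2_rows


t1 = [("u1", 20, 7), ("u2", 15, 4), ("u1", 10, 5), ("u1", 30, 8)]
t2 = build_t2(t1)
```

Each id1 gets its own accumulator, which is the role the partition plays in the query.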

(2) Incremental calculation

## Read one incremental record from T1
SELECT id1, create_time, money FROM T1

## Query the most recent T2 record for that id1
## (xx is the incremental record's id1, MONEY is its money value)
SELECT id1, create_time, money_sum + MONEY AS money_sum
FROM (
    SELECT id1, create_time, money_sum,
           ROW_NUMBER() OVER (PARTITION BY id1 ORDER BY create_time DESC) r
    FROM T2
    WHERE id1 = xx
) t
WHERE r = 1

## Insert the result into T2, using the incremental record's create_time
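The incremental step can be sketched in the same spirit. This is a hedged illustration, not the production path: it assumes records arrive in create_time order (a late record would require updating every later row for that id1), it scans a plain list instead of using an index, and `append_increment` is a hypothetical helper name.

```python
def append_increment(t2_rows, id1, create_time, money):
    """Append one T2 row whose money_sum extends the latest row for id1."""
    # Find the most recent existing row for this id1, if any.
    previous = [row for row in t2_rows if row[0] == id1]
    last_sum = max(previous, key=lambda row: row[1])[2] if previous else 0
    new_row = (id1, create_time, last_sum + money)
    t2_rows.append(new_row)
    return new_row


t2_rows = [("u1", 10, 5), ("u1", 20, 12), ("u2", 15, 4)]
row = append_increment(t2_rows, "u1", 30, 8)
```

An id1 that has no rows yet simply starts from a cumulative value of 0.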
