

How do you handle data at the ten-million-row scale in a SQL database?




This article explains how to process data at the ten-million-row scale in a SQL database. The content is straightforward and easy to follow; please read along with the editor and work through it step by step.

1. Too much data will not fit in a single table.

Take a monthly billing table as an example: 10 million rows a month is 120 million a year, and it certainly will not work if the data keeps piling up in one place. So keep one table per cycle of data, and if necessary split even a single cycle into several sub-tables; how far to split depends on the actual data volume. When you create a new table and need to bulk-load data into it, do not build the indexes first: either drop the indexes before the load, or create the table, import the data, and only then build the indexes.
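As a minimal sketch of this idea (assuming a MySQL-style dialect; the table and column names are hypothetical, not from the original text), a per-month table is created without indexes, bulk-loaded, and only then indexed:

```sql
-- Hypothetical monthly billing table; names are illustrative only.
CREATE TABLE bill_200901 (
    bill_id      BIGINT        NOT NULL,
    customer_id  INT           NOT NULL,
    segment_id   INT           NOT NULL,
    fee          DECIMAL(12,0) NOT NULL,
    created_at   DATETIME      NOT NULL
);

-- Bulk-load the month's data first, while the table has no indexes.
INSERT INTO bill_200901 (bill_id, customer_id, segment_id, fee, created_at)
SELECT bill_id, customer_id, segment_id, fee, created_at
FROM   bill_staging
WHERE  created_at >= '2009-01-01' AND created_at < '2009-02-01';

-- Build the index only after the load, which is much faster than
-- maintaining it row by row during the import.
CREATE INDEX idx_bill_200901_customer ON bill_200901 (customer_id);
```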

Once the data has been processed and, where necessary, summarized, back it up to tape or other media and get rid of it.

From the point of view of the problem domain, the data within a single cycle is the most closely related: a customer's total bill for a given billing period, the increase over the previous month, customers with zero phone charges, and so on. For all of these, the reference data is no more than the current cycle or two, at most a quarter or half a year (for example, three consecutive months of zero phone charges, three months of unpaid bills; statements of balances and the like may need a year of data). Reports of that kind appear mostly in data mining or senior-management reporting; the interfaces used by ordinary business departments will not contain such statistics.

Therefore the data can be split by table, or even by database, which makes it much easier to manage.

We have to let go of the ingrained idea that all of this data must be kept around like routine data (customer profiles and the like) that is used frequently over a long period. It is more like a sanitation worker handling garbage: a multi-step, largely manual process. So change the way of thinking: process the data when it is needed and clean it up when it is not. In other words, a table can be split into 100 or even 1,000 sub-tables, as long as that makes it convenient to run the statistics and pull out the data you need.

A view only makes SELECT statements easier to write; it gives no improvement in speed at all.

What really matters is designing the sub-tables so that queries touch less data; that is what improves speed. For example, if you split the data into 10 sub-tables and the statistics for the segment id=1 happen to live entirely in the first sub-table, then the query only needs to read that one sub-table, which makes the statistics much faster. If the statistics need data from every sub-table, processing is still just as slow. See the sketch below.
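As an illustration (the split-by-segment scheme and the table names below are assumptions, not from the original text), a query that targets one segment only has to read the sub-table holding that segment:

```sql
-- Hypothetical scheme: bills are split into sub-tables by segment id,
-- e.g. bill_seg_1 holds segment_id = 1, bill_seg_2 holds segment_id = 2, ...
-- A statistic for segment 1 reads only one small table:
SELECT customer_id, SUM(fee) AS total_fee
FROM   bill_seg_1
GROUP  BY customer_id;

-- A statistic that needs every segment still has to read all sub-tables,
-- so the split does not help it:
SELECT customer_id, SUM(fee) AS total_fee
FROM (
    SELECT customer_id, fee FROM bill_seg_1
    UNION ALL
    SELECT customer_id, fee FROM bill_seg_2
    -- ... remaining sub-tables ...
) AS all_segments
GROUP BY customer_id;
```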

2. Assuming each sub-table holds a few hundred thousand rows, the statistics hit no bottleneck; a regular database handles that without any problem.

3. The necessity of preprocessing.

Some people ask: I have 10 million rows to aggregate; how long will it take, and can it be made faster? Imagine how long it would take you to add up the savings of everyone in China. At this scale, no matter how sophisticated the DBMS is, it cannot escape the basic process: find the qualifying rows and sum them up one by one (leave the WHERE conditions aside for the moment). Preprocessing is necessary because processing data at this scale is inherently time-consuming, so we should compute the results into one or more tables in advance and simply display them when the user queries. For example, split the 10 million rows into 10 segments and look at the growth of receivables in each segment: if we pre-aggregate the data into a per-segment fee table, the client report renders very quickly. If every summary had to be computed from the raw data on demand, it would not be realistic. So we can set up raw-data tables, intermediate-result tables, result tables, summary tables, month-end tables, interval tables and so on, and roll the statistics up step by step.
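A minimal sketch of that pre-aggregation step (table and column names are hypothetical):

```sql
-- Pre-aggregate raw bills into a per-segment fee table once,
-- so client reports read the small summary table instead of 10 million rows.
CREATE TABLE segment_fee_summary (
    period      CHAR(6)       NOT NULL,   -- e.g. '200901'
    segment_id  INT           NOT NULL,
    total_fee   DECIMAL(14,0) NOT NULL,
    PRIMARY KEY (period, segment_id)
);

INSERT INTO segment_fee_summary (period, segment_id, total_fee)
SELECT '200901', segment_id, SUM(fee)
FROM   bill_200901
GROUP  BY segment_id;

-- The client report is then a cheap lookup:
SELECT segment_id, total_fee
FROM   segment_fee_summary
WHERE  period = '200901';
```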

One more thing worth stressing: a job like this is bound to be very time-consuming. But if the data is processed periodically by a stored procedure on the server, it is processed only once, and every client generates its reports from the result table. Without this approach, every client report would be computed from the raw data; that is possible in theory, but the same ten-million-row aggregation would be done N times, and the time simply is not there.

In addition, such statistical jobs are best kept in a separate database, with commonly used data such as customer profiles copied into that new database for processing. That way the batch work does not interfere with normal use.

You can run this job at night, on another database or even another server. When it finishes, write a flag back to the main database so the clients know they can query these reports.
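One way to implement that "write a flag" handshake (the flag table below is an assumption for illustration, not something the original specifies; MySQL-style syntax):

```sql
-- Hypothetical flag table in the main database: one row per report batch.
CREATE TABLE report_batch_flag (
    period       CHAR(6)  NOT NULL PRIMARY KEY,  -- e.g. '200901'
    is_ready     TINYINT  NOT NULL DEFAULT 0,    -- 0 = running, 1 = done
    finished_at  DATETIME NULL
);

-- The nightly job sets the flag once the result tables are complete.
UPDATE report_batch_flag
SET    is_ready = 1, finished_at = NOW()
WHERE  period = '200901';

-- Clients check the flag before showing the report.
SELECT is_ready FROM report_batch_flag WHERE period = '200901';
```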

4. Add calculated fields to individual rows. For example, say a record was generated at hour 12 on 2009-01-01, and your statistics happen to be grouped by time period: it is best to add a field, such as an hour column, fill it in beforehand with a batch command, and then run the statistics against that column.
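A small sketch of that idea, assuming a MySQL-style dialect and the hypothetical created_at column from earlier:

```sql
-- Add a precomputed hour column so statistics can group on it directly
-- instead of applying a function to created_at at query time.
ALTER TABLE bill_200901 ADD COLUMN created_hour TINYINT NULL;

-- Fill it in once with a batch command (MySQL's HOUR() shown here).
UPDATE bill_200901
SET    created_hour = HOUR(created_at);

-- Statistics then group on the plain column.
SELECT created_hour, COUNT(*) AS bill_count
FROM   bill_200901
GROUP  BY created_hour;
```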

5. It is taboo to apply functions to columns in the conditions of your SELECT statements. A function on a column prevents the query from using the index and forces it to traverse all the data instead, so even a lookup of a single row ends up scanning everything. Isn't that a pity?
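For example (a sketch using the hypothetical table from above; the range rewrite is the standard way to keep a predicate index-friendly):

```sql
-- Bad: the function on the column defeats the index on created_at,
-- so every row is scanned.
SELECT COUNT(*) FROM bill_200901
WHERE  DATE(created_at) = '2009-01-15';

-- Better: an equivalent range condition on the bare column can use the index.
SELECT COUNT(*) FROM bill_200901
WHERE  created_at >= '2009-01-15'
AND    created_at <  '2009-01-16';
```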

6. Make the conditions numeric wherever possible, that is, use ids for everything: branch, town, business type, access type, customer address and so on should all be encoded as foreign keys, with only the numeric id stored in the main table. Remember: numeric ids. Integers are the fastest data type to compute with. For very large amounts you can use DECIMAL with a scale of 0. The VARCHAR type is inefficient; SQL does seem to offer an MD5 function, and that might be worth trying for string comparison, but I have not tried it yet.
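A sketch of this coding style (names are hypothetical): dimension values live once in small lookup tables, and the big main table stores only integer ids:

```sql
-- Small lookup table holding the text values once.
CREATE TABLE branch (
    branch_id   INT         NOT NULL PRIMARY KEY,
    branch_name VARCHAR(50) NOT NULL
);

-- The large main table stores only numeric ids, never the text itself.
CREATE TABLE bill_200901_coded (
    bill_id      BIGINT        NOT NULL,
    customer_id  INT           NOT NULL,
    branch_id    INT           NOT NULL,   -- FK to branch
    biz_type_id  INT           NOT NULL,   -- FK to a business-type lookup
    fee          DECIMAL(14,0) NOT NULL    -- scale 0 for very large amounts
);

-- Statistics filter and group on the fast integer columns.
SELECT branch_id, SUM(fee)
FROM   bill_200901_coded
GROUP  BY branch_id;
```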

7. Indexes: this is the first problem to solve when querying massive data.

Without an index, the query is a full scan; and if the query's conditions are not covered by an index, it is still a full scan.
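A minimal example of the difference (hypothetical names again; EXPLAIN is the usual way to check which case you are in):

```sql
-- Assuming bill_200901 has an index on customer_id (as built earlier):
-- this condition can use the index, so only matching rows are touched.
SELECT SUM(fee) FROM bill_200901 WHERE customer_id = 12345;

-- No index covers this condition, so the whole table is scanned.
SELECT SUM(fee) FROM bill_200901 WHERE fee > 100;

-- MySQL-style: inspect the plan to confirm whether the index is used.
EXPLAIN SELECT SUM(fee) FROM bill_200901 WHERE customer_id = 12345;
```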

8. For complex statistics, process the data step by step in memory and then produce the result; this is much simpler and clearer than trying to do everything in a single SELECT statement.

It also spends much less time holding the tables. Very complex statistics may require conditional logic, loops and so on, which a single SELECT statement cannot express anyway; deeply nested WHERE clauses are likewise inefficient and tend to tie up the table and block writes. A sketch of the step-by-step style follows.
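Here is one way the step-by-step approach can look inside a stored procedure (assumed MySQL-style syntax; the intermediate and result tables are hypothetical):

```sql
-- Step-by-step statistics: stage intermediate results instead of
-- writing one giant SELECT with deeply nested conditions.
DELIMITER $$
CREATE PROCEDURE calc_growth_200901()
BEGIN
    -- Step 1: per-customer totals for the current period.
    CREATE TEMPORARY TABLE tmp_cur AS
    SELECT customer_id, SUM(fee) AS cur_fee
    FROM   bill_200901
    GROUP  BY customer_id;

    -- Step 2: compare against last period's summary and store the result.
    INSERT INTO customer_growth (period, customer_id, growth)
    SELECT '200901', c.customer_id, c.cur_fee - p.total_fee
    FROM   tmp_cur c
    JOIN   customer_fee_200812 p ON p.customer_id = c.customer_id;

    DROP TEMPORARY TABLE tmp_cur;
END $$
DELIMITER ;
```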

In principle, the problems I am discussing here are not small cases of website content management but enterprise applications. For example, a "stock-customer growth report" is not as simple as directly comparing two months' total phone charges; you also have to find out how much the customer paid earlier and how much of that should be included in the statistics. So my view is: complex problems belong in stored procedures. It took a few real projects for me to understand that writing SQL statements is more common than writing program code; the real program is, in fact, the SQL.

Finally, even for an experienced developer, it is normal for the finished statistical job to take minutes or even hours to run. Beginners should understand that the amount of data is proportional to the processing time: just because a few rows feel fast does not mean that, when the volume suddenly grows by several orders of magnitude, the time can still be optimized down to a few seconds.

An MRP run in an ERP system taking a few hours is entirely normal (mainly because there are so many materials, so many BOM levels, and so many calculation steps).

9. One more point. If the data volume goes beyond the tens of millions in our title, even to the billions, that is still fine, but the idea remains divide and conquer: run the data in parallel on multiple servers. Like donating money for disaster relief, one person's strength is not enough; many hands make light work. A job such as sorting the data only needs the raw data, some basic information and a few billing policies, so it can be distributed across several servers at the same time. How to split it is decided mainly by your data volume, the processing speed of a single machine, and the total processing time you require. Some people ask: do SELECT statements need to be distributed too? Only if it is really necessary. For example, to return all the abnormal phone-bill records, you can retrieve them from each server and then merge them together; that works fine, as sketched below.
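A sketch of the merge step (the per-server result tables abnormal_srv1, abnormal_srv2, ... are hypothetical stand-ins for rows pulled back from each server):

```sql
-- Each server returns its own abnormal-bill rows; the coordinator merges them.
SELECT customer_id, bill_id, fee FROM abnormal_srv1
UNION ALL
SELECT customer_id, bill_id, fee FROM abnormal_srv2
UNION ALL
SELECT customer_id, bill_id, fee FROM abnormal_srv3
ORDER  BY customer_id;
```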

All in all:

1. Design the table structure sensibly so that statistical summaries are as efficient as possible (FK design, numeric ids instead of VARCHAR, index design, calculated fields).

2. Split the tables sensibly so that each single table holds an appropriate amount of data.

3. Use memory to process complex statistics in multiple steps.

4. Preprocess the data.

5. Distribute the work across multiple servers and process it in parallel.

In short: divide and conquer, plus preprocessing.

Thank you for reading. That covers handling data at the ten-million-row scale in a SQL database; after working through this article you should have a better grasp of the topic, and the specifics still need to be verified in practice.
