Compared with transaction processing applications, big data services are analytical processing applications. Because their data processing characteristics differ, their capacity estimation methods also differ in some respects.
A big data service usually goes through the stages of data ETL, data storage, data analysis, data presentation, and data opening, so the estimation of its computing capacity, storage capacity, and network capacity has its own characteristics.
The infrastructure requirements of a big data service at different stages are shown in Figure 3-2-19.
Figure 3-2-19: Infrastructure requirements at different stages of a big data service
As Figure 3-2-19 shows, an ordinary big data project usually goes through three phases: data collection (step 1), data storage and data conversion (steps 2.1-2.2 and 3.1-3.4), and data presentation (steps 4.1 and 4.2). The processing flow is as follows:
Step 1: collect data from various data sources
Data sources are divided into internal and external. An internal data source is the enterprise's own data; for example, a telecom operator's online data consists of the business usage records obtained from its switches.
An external data source is data the enterprise obtains from outside; for example, mobile terminal configuration data may be obtained from a third-party company's database. The way data is collected is likewise divided into active and passive.
In the active approach, the enterprise fetches data from the data source itself, for example by crawling the major websites with a web crawler. In the passive approach, the enterprise designates a storage location for the data source and lets the data provider deposit data there according to an agreed time policy.
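As a minimal sketch of the active approach, the Python snippet below fetches one page and collects its links using the requests and BeautifulSoup libraries; the target URL and the link-extraction logic are illustrative assumptions, not details from the original text.

```python
# Minimal sketch of active collection with a web crawler.
# The target URL and extraction logic are illustrative placeholders.
import requests
from bs4 import BeautifulSoup

def crawl(url: str) -> list[str]:
    """Fetch one page and return the hyperlinks found on it."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # A real crawler would also deduplicate links, respect robots.txt,
    # and queue the discovered links for further visits.
    return [a["href"] for a in soup.find_all("a", href=True)]

if __name__ == "__main__":
    for link in crawl("https://example.com"):
        print(link)
```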
Step 2: data storage and data conversion
Enterprises can adopt different data storage strategies for data with different characteristics. If the current or expected data scale is large, a traditional relational database cannot meet the requirement for fast processing, and a distributed database such as Hadoop/HBase should be considered.
Distributed databases such as Hadoop/HBase are characterized by good scalability: if storage space runs out, additional storage servers are simply added. Their shortcoming is that HBase suits only single-table access or simple multi-table relationships; applications that need complex data manipulation or multi-table joins still have to be implemented on a relational database.
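For concreteness, here is a small sketch of the key-value style of access that HBase offers, using the third-party happybase client; the Thrift host, table name, column family, and row key are hypothetical.

```python
# Sketch of key-value access to HBase via the happybase client.
# Host, table name, column family, and row key are hypothetical.
import happybase

connection = happybase.Connection("hbase-thrift-host")  # HBase Thrift gateway (assumed)
table = connection.table("usage_records")

# Write one record: HBase stores raw bytes addressed by row key and column.
table.put(b"user123-20240601", {b"cf:bytes_used": b"1048576"})

# Read it back by row key -- fast, but there are no joins or ad hoc
# multi-table queries, which is the limitation described above.
row = table.row(b"user123-20240601")
print(row[b"cf:bytes_used"])
```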
The advantage of a relational database is that it can aggregate and summarize the data so that users can view analysis results along multiple dimensions. However, because relational databases are architected around a single node, their ability to scale out is limited, even though cluster deployment is supported.
It can be seen that multi-table join queries place higher demands on the database management system than key-value access does, but their scalability is not as good as that of key-value storage.
Therefore, when storing big data, the application requirements and the storage characteristics of each database must be weighed together: use a distributed database for data that is large in scale, grows quickly, and is accessed mainly through simple queries, and use a relational database for query and statistics functions that require multi-table joins.
Once the raw data is stored in the database, it must be extracted, transformed, and loaded to meet data quality and application requirements. The data usually goes through a preliminary ETL pass and is stored in the data warehouse; it is then transformed again into data marts oriented to different subjects, so that statistical results can be viewed along multiple dimensions.
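A toy sketch of this two-stage flow, using pandas purely for illustration; the file names, column names, and the region/day mart dimensions are assumptions rather than details from the text.

```python
# Toy two-stage ETL: raw records -> cleaned warehouse table -> subject-oriented mart.
# File names, column names, and aggregation dimensions are illustrative.
import pandas as pd

# Stage 1: preliminary ETL -- extract raw records, clean them, load the warehouse.
raw = pd.read_csv("raw_usage_records.csv")             # extract
raw = raw.dropna(subset=["user_id", "bytes_used"])     # transform: drop incomplete rows
raw["day"] = pd.to_datetime(raw["timestamp"]).dt.date
raw.to_parquet("warehouse/usage_fact.parquet")         # load into the warehouse

# Stage 2: second ETL pass -- build a subject-oriented data mart so that
# traffic can be viewed by region and by day.
fact = pd.read_parquet("warehouse/usage_fact.parquet")
traffic_mart = fact.groupby(["region", "day"], as_index=False)["bytes_used"].sum()
traffic_mart.to_parquet("mart/traffic_by_region_day.parquet")
```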
Step 3: data presentation phase
Although much effort goes into data extraction, transformation, enrichment, and related work, the data ultimately exists to be seen by people: the better it is presented, the easier it is for users to recognize the facts and patterns hidden behind it.
For example, to show the amount of data traffic in each region, a telecom operator can overlay an electronic map and mark different traffic ranges with different colors, so that the traffic volume of each province can be seen at a glance.
(1) The storage capacity estimation method for a big data analysis and processing system
Capacity estimation for a big data analysis and processing system can be done in two ways: the theoretical estimation method and the experimental estimation method.
The inputs to the theoretical estimation method are the number of files, the number of records in a single file, the size of a single record, and the data collection cycle (for example, one hour, one day, or one month). From these, the total amount of data over a given period can be calculated; multiplying by a disk redundancy coefficient then gives the total demand for disk space. The theoretical estimation method is suitable when no sample data is available.
The formula for the theoretical estimation method is: storage space = number of files × records per file × size of a single record × length of time × redundancy factor.
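A small worked sketch of this formula in Python; all numeric values are made-up example inputs, not figures from the text.

```python
# Theoretical estimation:
# storage = files x records per file x record size x number of cycles x redundancy factor.
# All values below are illustrative assumptions.
files_per_cycle = 200          # files collected in each cycle
records_per_file = 1_000_000   # records in a single file
record_size_bytes = 500        # size of one record, in bytes
cycles = 365                   # e.g. a daily cycle kept for one year
redundancy_factor = 1.3        # spare disk space coefficient

storage_bytes = (files_per_cycle * records_per_file * record_size_bytes
                 * cycles * redundancy_factor)
print(f"Estimated storage: {storage_bytes / 1024**4:.1f} TiB")
```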
The experimental estimation method is based on sample data from a certain period of time. Users can check the sample file sizes with the operating system's built-in commands (such as du or ls -l). If the data entering the data warehouse is continuous over time, the storage requirement of the big data analysis and processing system can be estimated by multiplying the measured sample size by the length of time.
The formula for the experimental estimation method is: storage space of the big data analysis and processing system = sample data size × length of time × redundancy coefficient.
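A matching sketch of the experimental method: measure one sample period on disk (here with os.path.getsize as a stand-in for the OS commands mentioned above) and extrapolate over the planning horizon; the sample path, horizon, and redundancy value are assumptions.

```python
# Experimental estimation: storage = measured sample size x length of time x redundancy.
# The sample directory, planning horizon, and redundancy factor are illustrative.
import os

def directory_size_bytes(path: str) -> int:
    """Total size of all files under path, similar in spirit to `du -sb path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

sample_bytes = directory_size_bytes("/data/sample_one_day")  # one day of sample data
days = 365                                                   # planning horizon
redundancy_factor = 1.3

storage_bytes = sample_bytes * days * redundancy_factor
print(f"Estimated storage: {storage_bytes / 1024**3:.1f} GiB")
```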
(2) The computing capacity estimation method for a big data analysis and processing system
The traditional data processing and storage architecture is the "host + disk array" cluster model. The host can be a minicomputer, a PC server, or a blade server; the disk array can be NAS, SAN, and so on; and the protocols used can be FC, IP, and so on.
This traditional architecture solves the problem of sharing storage and computing resources. A cluster of multiple servers manages computing resources uniformly, and the load balancer that receives a request forwards it to a server with sufficient computing resources based on each server's load.
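As a toy illustration of that dispatch logic (not any particular load balancer product), the sketch below always routes a request to the server currently reporting the lowest load; the server names and load figures are invented.

```python
# Toy dispatch: send each request to the server with the most spare capacity.
# Real load balancers also use health checks, weights, and strategies such as
# round robin or least connections; names and loads here are illustrative.
servers = {"app-01": 0.72, "app-02": 0.35, "app-03": 0.58}  # current load (0..1)

def pick_server(load_by_server: dict[str, float]) -> str:
    return min(load_by_server, key=load_by_server.get)

print(f"Routing request to {pick_server(servers)}")  # -> app-02
```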
The sharing of disk arrays is easier to picture: multiple disks are placed in one chassis, and the disks in the chassis can be expanded and hot-swapped, which makes it easier to grow disk space.
The "host + disk array" architecture separates computing from storage, improves parallel processing through computing clusters and storage clusters, and meets the requirements of highly concurrent transaction processing applications. However, it also introduces a new problem: the horizontal scalability of computing and storage resources is limited.
A big data service is characterized by a large volume of data that keeps growing over time, which requires computing and storage resources with almost unlimited ability to expand.
To cope with the ever-growing volume of data, Google proposed a distributed computing architecture based on MapReduce and GFS. Unlike the "host + disk array" architecture, it uses software to connect large numbers of cheap machines with differing capabilities, which lowers the procurement cost of IT infrastructure and improves its scalability. Inspired by Google's GFS/MapReduce architecture, Apache later developed the Hadoop distributed computing framework.
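As a rough single-machine illustration of the MapReduce idea (not Hadoop's actual API), the word-count sketch below shows the map step emitting key-value pairs, a shuffle grouping them by key, and the reduce step aggregating each group; in a real cluster these phases run in parallel across many cheap machines.

```python
# Word count in the MapReduce style, run in a single process for illustration.
from collections import defaultdict

documents = ["big data needs big storage", "big data needs scalable computing"]

# Map: emit (key, value) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group values by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: aggregate each group.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)  # {'big': 3, 'data': 2, 'needs': 2, 'storage': 1, ...}
```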
It can be seen that this new distributed computing architecture for big data is completely different from the "host + disk array" architecture, so the method for estimating big data computing capacity is also different.