I. Background analysis
Traditional relational databases have long held a dominant position in the enterprise market, and many people are not even aware that other types of databases exist. Relational databases are very good at transactional workloads such as updates, but they struggle with bulk operations over very large data volumes. DB2, the relational database management system developed by IBM, is widely used in large data warehouse projects, especially in the mobile telecom industry: since the business analysis system was established, the BI core warehouse has essentially been built on DB2 to support data analysis, internal management decision-making, marketing and customer service.
As big data brings new business models and business growth, it poses greater challenges to IT support and places new requirements on the existing technical architecture, including the capacity of the data warehouse. With business requirements changing rapidly, traditional database software is increasingly stretched and shows several shortcomings:
1. Performance can no longer meet business needs
As the business grows, the accumulated data volume increases geometrically, and business complexity places high demands on database performance. Because of its row-oriented storage and centralized execution, a traditional relational database puts heavy pressure on I/O, CPU and other resources when supporting complex SQL, and it is difficult to meet business requirements.
2. Lack of scalability
Traditional databases can only scale up (vertical expansion) by adding processors, high-end storage and other resources, and larger, more powerful servers are expensive. After an expansion, data rebalancing or even a database shutdown is required, which directly affects production.
3. Poor product availability
A traditional database such as DB2 keeps only a single copy of its metadata (configuration files and logical node definitions), which by default resides in a fixed directory that cannot be changed or separately backed up, so its resilience is very low.
4. High investment cost
DB2 hosts require IBM minicomputers, Oracle requires Exadata, and storage requires devices costing millions, such as EMC arrays. Licenses for this software are also very expensive, and in terms of staffing, administrators for these databases do not come cheap.
II. Scheme selection
Distributed databases support deployment on x86 servers, replacing proprietary hosts with x86 has become a trend, and the approach offers low cost and good scalability, whereas the traditional path of scaling a database by upgrading the host has reached its limit. It was therefore decided to introduce an MPP (massively parallel processing) distributed database to replace the existing DB2 database. An MPP database is a shared-nothing cluster in which each node has its own disk storage and memory; business data is divided across the nodes according to the database model and application characteristics, and the nodes, connected by a private or commodity network, cooperate to provide database services as a whole. A shared-nothing database cluster offers full scalability, high availability, high performance, an excellent price/performance ratio and resource sharing. After data modeling is completed in the core data warehouse, the data is synchronized via MPP to the data analysis mart, which serves as the big data mart carrying all applications developed against standard SQL.
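As a rough illustration of the shared-nothing idea described above, the sketch below hash-distributes rows across independent nodes so that each node stores and scans only its own slice, and a query is answered by merging the per-node partial results; the node count, sample rows and segmentation column are hypothetical and purely illustrative.

```python
# Minimal sketch of shared-nothing data distribution (illustrative only;
# the node count and the customer_id segmentation column are assumptions).
import hashlib

NODES = 4  # hypothetical cluster size

def node_for(key: str) -> int:
    """Map a segmentation key to one node, the way an MPP engine hashes rows to segments."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NODES

rows = [{"customer_id": f"C{i:04d}", "amount": i * 10} for i in range(12)]

# Each node keeps only its own slice; a query runs on every slice in parallel
# and the coordinator merges the partial results.
slices = {n: [] for n in range(NODES)}
for row in rows:
    slices[node_for(row["customer_id"])].append(row)

partial_sums = {n: sum(r["amount"] for r in rs) for n, rs in slices.items()}
print("per-node partial sums:", partial_sums)
print("merged total amount  :", sum(partial_sums.values()))
```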
The comparison of the products is as follows:

1. Comparison with IBM DB2

Item | Vertica | DB2
Hardware architecture | True shared-nothing MPP architecture; no special nodes | Has a master node and a management module
Software architecture | Pure column-store database | Traditional row-store database
Compression | More than 12 compression algorithms, including Lempel-Ziv; a different algorithm can be specified for each column of a table; compression ratio above 10x | Standard Lempel-Ziv; table-level compression only; compression ratio of 1-3x
High availability | High availability through built-in K-safety; all nodes are active; no HA configuration required; simple to configure | Hot-standby mode; at least one idle server is required; takeover after a failure takes 1-5 minutes; complex configuration requires additional HA software
Daily management | No concept of tablespaces, indexes, MDC, etc.; automatic database design | Three different levels of database parameters (db2set, dbm, db); tablespaces, indexes, partitions, MDC, etc. must be managed
Performance | Supports real-time loading and real-time query at the same time; 50-1000x performance improvement over traditional databases | Cannot perform real-time loading and real-time query at the same time
Scalability | Nodes can be added and removed online; node expansion completes within minutes to hours | Adding a node requires a restart and usually takes hours or even days
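As a toy illustration of the compression row in the table above (why a column store can compress far better than a row store), the sketch below compresses the same records row-wise and column-wise with a single generic codec; the data set and the use of zlib are assumptions for illustration only, whereas a real column store such as Vertica picks a dedicated encoding per column.

```python
# Toy comparison of row-wise vs column-wise compression (illustrative only;
# real column stores choose a dedicated encoding per column, e.g. RLE or delta).
import json
import random
import zlib

random.seed(0)
cities = ["SHANGHAI", "BEIJING", "GUANGZHOU"]
records = [{"id": i, "city": random.choice(cities), "balance": random.randint(0, 10_000)}
           for i in range(10_000)]

# Row layout: whole records serialized together, mixing heterogeneous values.
row_bytes = json.dumps(records).encode()

# Column layout: one homogeneous stream per column, compressed independently.
columns = {key: [r[key] for r in records] for key in records[0]}
col_bytes = [json.dumps(values).encode() for values in columns.values()]

row_compressed = len(zlib.compress(row_bytes))
col_compressed = sum(len(zlib.compress(b)) for b in col_bytes)

print(f"row layout   : {len(row_bytes)} -> {row_compressed} bytes")
print(f"column layout: {sum(len(b) for b in col_bytes)} -> {col_compressed} bytes")
# The homogeneous column streams typically compress better than the mixed rows.
```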
2. Comparison with Oracle Exadata

Hardware configuration comparison

Item | Exadata 1/2 Rack | Vertica | Description
Server configuration | 11 servers in total: 4 database servers and 7 storage servers | 11 x DL380p Gen8 | Same number of servers as Exadata
CPU cores | Database nodes: 64 cores (2.9GHz Xeon E5-2690); storage nodes: 84 cores (2.0GHz Xeon E5-2630L) | 176 cores, 2.6GHz Xeon E5-2670 | About 20% more CPU cores
CPU processing capacity (SpecInt2006 rate) | 5559 | 7084 | Overall CPU processing capacity increased by about 28%
Memory | 1024GB | 1408GB | About 40% more memory
Hard disk model | 600GB 15000rpm SAS (3.5") | 900GB 10000rpm SAS (2.5") | Performance is basically the same
Number of hard disks | 84 | 275 | About 3.3x more disks
Usable capacity | 22.5TB | 900GB x 22 x 11 ≈ 218TB; 218TB x 70% (RAID 5 loss) x 50% (K-safe=1) ≈ 76TB | About 3.4x more usable capacity
Data loading speed | 8TB/hour (theoretical maximum) | 200MB/s per node, about 8TB/hour (measured average) | Basically the same
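The usable-capacity row above follows from raw disk capacity minus RAID 5 overhead and the K-safe=1 second copy; a quick check of that arithmetic, using the figures from the table, is sketched below.

```python
# Recompute the usable-capacity row of the hardware table above.
raw_tb = 0.9 * 22 * 11           # 900GB disks x 22 data disks per server x 11 servers ≈ 218 TB
after_raid5 = raw_tb * 0.70      # about 30% lost to RAID 5 parity and spares
usable_tb = after_raid5 * 0.50   # K-safe=1 keeps a second copy of every segment

print(f"raw {raw_tb:.0f} TB -> usable {usable_tb:.0f} TB")   # ~76 TB
print(f"vs Exadata 22.5 TB: {usable_tb / 22.5:.1f}x")        # ~3.4x
```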
Software feature comparison

Item | Exadata | Vertica | Description
Data storage | Row storage plus hybrid columnar compression | Pure column storage | Exadata can compress by column only for data loaded in direct-path mode
Compression | 6 compression algorithms (2 row, 4 column) | 12 compression methods | Exadata can only specify compression at the table level; Vertica can specify a different compression algorithm for each column of a table
Simultaneous loading and real-time query | Direct-path loading usually requires disabling indexes, so real-time queries cannot run at the same time | Loading and querying can run at the same time | Vertica supports highly concurrent queries while data is being loaded
Deployment architecture | Shared-everything architecture | Shared-nothing MPP architecture | A shared-everything architecture cannot scale to many nodes, while a shared-nothing MPP architecture is more scalable and better suited to parallel processing of large data volumes
Database management | Complex; requires very experienced DBAs and dedicated OEM tools | Simple and largely automatic, with little manual intervention |
Analysis functions | A few simple analysis functions | Many built-in analysis functions and flexible analytic queries |
Hadoop interface | Not supported | Supported | Vertica has a built-in Hadoop interface and supports both structured and unstructured analysis
Cost comparison

Item | Exadata 1/2 Rack | Vertica | Description
Hardware price (transaction price, RMB) | 8-10 million | 2 million | DL380 Gen8 quoted at about 100,000 per unit, plus some peripherals and services
Software price (transaction price) | 8.64 million | 3 million | The Vertica price is an estimate; the Oracle price typically still carries a vendor margin
3-year service charge | (8 million x 8% + 8.64 million x 22%) x 2 years ≈ 5.08 million | (2 million x 8% + 3 million x 21%) x 2 years = 1.58 million | Hardware support at 8% per year; software support at 22% (Oracle) and 21% (Vertica)
Total investment over 3 years | 21.72 million | 6.58 million | The Vertica solution costs about 30% of the Exadata investment
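The three-year totals in the cost table roll up as hardware plus software plus two further years of support at the quoted percentages (using the lower bound of the Exadata hardware range); a small check of that arithmetic is sketched below.

```python
# Recompute the 3-year totals from the cost table (figures in millions of RMB).
def three_year_total(hw, sw, hw_rate, sw_rate, support_years=2):
    """Hardware + software purchase price plus support fees for the later years."""
    support = (hw * hw_rate + sw * sw_rate) * support_years
    return hw + sw + support, support

exadata_total, exadata_support = three_year_total(8.0, 8.64, 0.08, 0.22)
vertica_total, vertica_support = three_year_total(2.0, 3.0, 0.08, 0.21)

print(f"Exadata: support {exadata_support:.2f}M, total {exadata_total:.2f}M")  # ~5.08 / 21.72
print(f"Vertica: support {vertica_support:.2f}M, total {vertica_total:.2f}M")  # ~1.58 / 6.58
print(f"Vertica / Exadata = {vertica_total / exadata_total:.0%}")              # ~30%
```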
III. Construction plan

1. Environmental deployment
For disaster recovery, primary and standby database clusters at different sites in the same city can be synchronized incrementally to achieve off-site data disaster recovery. For backup, the product provides a dedicated backup and restore tool, VBR, which can easily perform full or incremental backups of the entire cluster's data at table or database granularity. It supports parallel backup and parallel restore; backup files can be distributed across different backup storage areas, and a different number of backup and restore nodes can be configured.
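As a rough sketch of how the VBR backups mentioned above might be scheduled from a script, the example below simply invokes the tool with a prepared configuration file; the file path, its contents and even the exact vbr options are assumptions to be verified against the documentation of the Vertica release actually deployed.

```python
# Minimal sketch of driving a VBR backup from a script (the config path and
# the vbr options shown are assumptions; check them against the deployed
# Vertica version's documentation before use).
import subprocess

CONFIG = "/home/dbadmin/backup.ini"  # hypothetical vbr configuration file

def run_vbr(task: str) -> None:
    """Run one vbr task, e.g. 'init' once and then 'backup' on a schedule."""
    subprocess.run(["vbr", "--task", task, "--config-file", CONFIG], check=True)

if __name__ == "__main__":
    run_vbr("backup")  # full vs incremental behaviour is governed by the config file
```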
The general design of the platform architecture is as follows:
1.1. Functional architecture
The MPP resource pool cluster is built as two independent MPP databases: the core data warehouse and the data mart.
1.2. Physical deployment
The cluster can be deployed with 3 or more nodes, and the K-safety value can be set to n, meaning that each piece of data has n redundant copies, each deployed in a different rack.
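The rack-aware redundancy described above can be pictured as giving every data segment n extra copies placed on nodes in other racks; the sketch below illustrates that placement rule with hypothetical node and rack names (Vertica itself handles this internally through K-safety).

```python
# Illustrative placement of redundant copies on nodes in different racks
# (node and rack names are hypothetical; a real cluster does this through
# the database's own K-safety mechanism).
K = 1  # one redundant copy per segment, i.e. K-safe = 1
NODES = {"node01": "rack-A", "node02": "rack-A", "node03": "rack-B", "node04": "rack-B"}

def replica_nodes(primary: str, k: int = K) -> list[str]:
    """Pick k replica nodes whose racks differ from the primary node's rack."""
    primary_rack = NODES[primary]
    candidates = [n for n, rack in NODES.items() if rack != primary_rack]
    return candidates[:k]

for segment, primary in enumerate(NODES):
    print(f"segment {segment}: primary={primary} ({NODES[primary]}), replicas={replica_nodes(primary)}")
```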
In order to facilitate the future expansion of this project, we recommend a hierarchical network, with two 10 Gigabit access switches per cabinet, which are then cascaded to the core switches.
Switch:
Configure at least 2 core switches to form a high availability cluster.
Each cabinet is equipped with two 10 Gigabit switches forming a high-availability pair, and each top-of-rack switch is connected to both core switches through two 40G uplink ports.
A VLAN is set up for the MPP database's internal cluster communication network to isolate it from the external service network.
Server connectivity: each server's two 10GE network cards are connected by optical fiber to the 10G optical ports of the two top-of-rack switches and bonded into one logical interface on the server side; two sets of IP addresses are configured for the internal and external networks, and the internal cluster communication network is placed in its own VLAN to isolate it from the external service network, forming a reliable internal communication network and avoiding network failures.
For each compute node, the operating system disks are two disks in RAID 1 for redundancy; the data disks are arranged as RAID 50, with every seven disks forming one RAID 5 group and three RAID 5 groups combined with RAID 0 into a single data volume, improving storage and read/write performance while preserving data safety.
A 10-gigabit service network, a gigabit management network and the bonded 20-gigabit links are used to eliminate potential network bottlenecks in batch data processing.
2. Application migration
To reduce the interaction between the provincial and group-level business analysis systems and decouple them, all existing DB2 applications, including the warehouse model, localized applications and mobile business analysis, are migrated to the MPP platform.
This migration adopts a strategy of gradual, application-by-application implementation and follows these steps:
Port the program code executed on DB2 directly to MPP, adjust the syntax according to the errors reported, and through batch adjustment ensure that all programs execute normally.
After importing the source data, run the programs and compare the result data on MPP with that on DB2; locate and resolve any differences so that the data on both sides is consistent (a minimal row-count comparison is sketched after these steps).
Optimize the programs (for example, use replicated tables for small tables and choose reasonable partition keys for large tables) and ensure that data is generated no later than it was on DB2.
After the programs have run stably for half a month, switch the external interface data sources, finally achieving a smooth migration.
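For the data comparison step above, a minimal approach is to compare row counts table by table between DB2 and the MPP side before digging into detailed differences; the connection details and table names below are hypothetical, and the ibm_db and vertica_python client libraries are assumed to be installed.

```python
# Minimal row-count comparison between DB2 and the MPP database for the
# migration verification step (connection strings and table names are
# hypothetical; extend with column-level aggregates where needed).
import ibm_db            # IBM DB2 client driver
import vertica_python    # Vertica client driver

TABLES = ["dw.user_day", "dw.cdr_sum"]  # hypothetical tables to verify

db2 = ibm_db.connect(
    "DATABASE=bidw;HOSTNAME=db2host;PORT=50000;PROTOCOL=TCPIP;UID=etl;PWD=secret", "", "")
vtc = vertica_python.connect(host="mpphost", port=5433, user="etl",
                             password="secret", database="bidw")

for table in TABLES:
    sql = f"SELECT COUNT(*) FROM {table}"

    stmt = ibm_db.exec_immediate(db2, sql)
    db2_count = int(ibm_db.fetch_tuple(stmt)[0])

    cur = vtc.cursor()
    cur.execute(sql)
    mpp_count = int(cur.fetchone()[0])

    status = "OK" if db2_count == mpp_count else "MISMATCH"
    print(f"{table}: DB2={db2_count}, MPP={mpp_count} -> {status}")
```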
3. System optimization
As provincial business applications were gradually migrated to Vertica, the volume of business data grew faster than expected, and performance degradation and capacity shortfalls appeared. These problems fall broadly into two categories.
(1) Loading problems, including: unstable loader network, low loading efficiency and poor stability, loading timeouts, data quality problems in the loaded data, and mismatches between load-file fields and table structures.
(2) Performance problems, including: database performance degrading after a period of operation, slow execution of query and insert statements, system resources exhausted by poorly written SQL, unreasonable Projection design, and large numbers of invalid views and expired tables not being cleaned up in time.
These problems exist at both the platform level and the business level (SQL quality and development standards), so the system was optimized at both the technical level and the business level.
3.1. Technical level optimization
Network optimization: upgrade the NIC driver, change the NIC mode, adjust the switch hash algorithm to balance data traffic, tune the TCP-related kernel parameters, and enlarge the kernel socket receive buffer. After these improvements the NIC traffic runs steadily at 20Gb, NIC efficiency and stability improve significantly, and packet loss and retransmission essentially disappear.
Database optimization: tune the control file parameters and loading parallelism, increase the read-ahead cache, tune the number of CPU threads, lengthen the communication timeout between the load server and the client, and tune the load timeout parameters, reducing the load failure rate from about 30% to under 0.5%. Optimizing Projection configuration greatly reduces the execution time of the same SQL, for example the large-table join between the customer table and the account table; overall Vertica performance improves by about 60% compared with the traditional database, and tuning the database memory parameters improves system stability.
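One concrete form of the Projection tuning mentioned above is to add a query-specific projection sorted and segmented on the join key, so that the customer/account join is co-located on each node; the sketch below is an assumption-laden example (table, column and projection names are hypothetical), not the project's actual DDL.

```python
# Hedged sketch: create a projection segmented on the join key so that the
# large customer/account join can be executed locally on each node
# (names are hypothetical; adjust to the real schema before use).
import vertica_python

DDL = """
CREATE PROJECTION dw.cust_acct_by_cust
AS SELECT customer_id, account_id, balance
   FROM dw.cust_acct
   ORDER BY customer_id
   SEGMENTED BY HASH(customer_id) ALL NODES KSAFE 1
"""

with vertica_python.connect(host="mpphost", port=5433, user="dbadmin",
                            password="secret", database="bidw") as conn:
    cur = conn.cursor()
    cur.execute(DDL)
    cur.execute("SELECT START_REFRESH()")  # populate the new projection in the background
```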
3.2. Business layer optimization
Because the platform's application developers were not familiar with the new product, some SQL was of poor quality (for example reading data directly from partitions, unreasonable choice of table hash keys, or unreasonable choice between replicated and distributed table attributes) and some performed poorly (redundant subqueries, Cartesian products, unreasonable join fields); these problems were adjusted and optimized one by one.
At the same time, we provided product training and guidance and issued, and continue to update, the MPP database development specification to help application developers make better use of the MPP platform. As a result, at least 99% of SQL now runs normally; a few time-consuming statements and newly developed SQL still need targeted optimization.
IV. Domestic mobile usage

1. Distribution of domestic mobile usage
Data Analysis system widely used in domestic Telecom Industry
2. Current application scope of Vertica in the mobile industry

BSS
Business domains: business analysis (BASS), call detail record (CDR) analysis, traffic operation analysis, precision marketing analysis, history library and reporting, partner and electronic channel analysis, user classification and revenue analysis, and more
Challenges: large data volumes; high demands on analysis and processing capability
HP Vertica advantages: ultra-fast analysis and virtually unlimited scalability; standard SQL and openness

OSS
Business domains: signaling analysis, network error and performance optimization, network analysis, comprehensive network analysis, comprehensive monitoring and analysis, network capacity management, customer experience management, roaming analysis, and more
Challenges: huge data volumes; high real-time requirements; high performance requirements for analysis
HP Vertica advantages: real-time analysis and aggregation with virtually unlimited scalability; standard SQL and openness

MSS
Business domains: ERP data warehouse, financial reporting system, centralized data analysis, and more
Challenges: huge data volumes; high demands on analysis and processing capability
HP Vertica advantages: ultra-fast analysis; standard SQL and openness

Other
Business domains: value-added service analysis platform, big data analysis platform, Internet analysis, user behavior analysis, user location analysis, clickstream analysis, and more
Challenges: rich requirements for analysis algorithms
HP Vertica advantages: openness of Vertica and integration with Hadoop/R; ultra-fast analysis and virtually unlimited scalability; unstructured analysis
3. Project cases

3.1. Signaling analysis

Quasi-real-time analysis, implemented on big data
60 nodes
HP BL460c G7 blades (VC Flex-10) with D2200sb storage (12 x 1TB 7200rpm SATA)
Real-time data ingestion exceeds 200MB/s, covering Gb, IuPS, Gn, DPI and other interfaces
Detail tables are loaded with minute-level latency
Summary tables are loaded every 15 minutes
Data is kept for 1 month, totaling more than 500TB
Summaries at multiple granularities: 5 minutes, hour, day
3.2. Network optimization

Quasi-real-time analysis to improve network service quality
Processing cloud: a Hadoop cluster performs data analysis and summarization
Storage cloud: an HP Vertica cluster provides big data storage
A 6-node cluster holding more than 30TB of data
The main table exceeds 20TB, with minute-level quasi-real-time loading
3.3. Business analysis

High-performance big data analysis supports precision marketing and differentiated competitiveness
32 nodes, 360TB of data
Greatly improved performance:
-- data compression ratio as high as 9x
-- second-level response for analytic queries over tens of billions of rows, roughly a 50x performance improvement
-- supports self-service analysis by hundreds of users
Data self-service: business users analyze data on their own
Marketing big data analysis:
-- big data analysis of various marketing modes and promotion methods
-- association analysis of customer attribute data, including customer profile queries, tag customer-base refresh and tag analysis data refresh
The main database is being migrated to another 32-node cluster
3.4. Mobile big data analysis platform

Integrates BSS/OSS data: 280TB across 40 nodes
O-domain HTTP/socket data first enters the Hadoop cluster for cleaning
A-interface signaling data, 1-5TB per day, first enters the Hadoop cluster for cleaning
B-domain data, about 300GB per day, is extracted directly from DB2
Comprehensive big data
Analysis of user behavior and user profiling
Self-service analysis
Marketing analysis
Data retention periods:
Detail data: 7-15 days
Summary data: 6 months to 1 year
Highly summarized data: 1-3 years
Vendors:
Basic platform: Huawei
ETL/BI: Huawei created me
Precision Marketing: Asian Union
3.5. Mobile network-wide intelligent monitoring platform

An intelligent analysis platform for in-depth analysis of network-level monitoring data, providing data analysis services to the group, to internal departments and to provincial companies.
Requirements: the ability to expand storage at low cost as data volumes grow rapidly; the ability to handle more unstructured data, such as web logs and Internet-of-Things application logs; and the ability to perform deep data mining over large-scale data, going beyond traditional random-sampling analysis to analysis over the full data set.