What are the classifications of MPP processing architecture

2025-02-24 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/03 Report--

This article explains the classifications of the MPP processing architecture. The methods introduced here are simple, fast, and practical; interested readers are invited to follow along.

I. MPP architecture

MPP is a server classification method from the perspective of system architecture.

At present, there are three general categories of commercial servers:

SMP (Symmetric Multi-Processing)

NUMA (Non-Uniform Memory Access)

MPP (Massively Parallel Processing)

Our protagonist today is MPP, because as distributed and parallel technologies have matured, MPP engines have shown strong high-throughput, low-latency computing power; many engines built on the MPP architecture can return results over hundreds of millions of rows within seconds.

Let's take a look at these three structures:

1. SMP

Symmetric Multi-Processing means that a server's multiple CPUs work symmetrically, with no master/slave or subordinate relationship among them. The defining feature of an SMP server is sharing: all resources in the system (CPU, memory, I/O, etc.) are shared. It is precisely this feature that causes SMP's main problem: its scalability is very limited.

2. NUMA

That is, Non-Uniform Memory Access. This structure was designed to solve SMP's limited scalability: with NUMA, dozens of CPUs can be combined in a single server. The basic feature of NUMA is that it has multiple CPU modules (nodes) that connect and exchange information through an interconnect, so every CPU can access the memory of the whole system (an important difference from MPP). Access speeds differ, however: a CPU accesses its local memory much faster than memory on other nodes, which is where the name "non-uniform memory access" comes from.

This structure also has a defect: because remote memory access has much higher latency than local access, system performance does not scale linearly as the number of CPUs increases.

3. MPP

That is, Massively Parallel Processing. MPP scales differently from NUMA: an MPP system is made up of multiple SMP servers connected through a node interconnect, working together to complete the same task; from the user's point of view it is a single server system. Each node accesses only its own local resources, so MPP is a completely share-nothing structure.

MPP has the strongest scalability of the three; in theory it can be expanded without limit. Because an MPP system is built from multiple SMP servers, the CPUs of one node cannot access another node's memory, so there is no remote-access problem.

[Figure: MPP architecture]

The CPUs in one node cannot access another node's memory; information exchange between nodes is carried out over the node interconnect, a process called data redistribution.
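Data redistribution can be sketched as hash partitioning on a key: every row with the same key is routed to the same node. This is a minimal illustrative model (node numbering and the partition function are assumptions, not any specific engine's implementation):

```python
# Sketch: hash-based data redistribution between MPP nodes.

def redistribute(rows, key, num_nodes):
    """Assign each row to a target node by hashing its join/group key."""
    partitions = {n: [] for n in range(num_nodes)}
    for row in rows:
        target = hash(row[key]) % num_nodes
        partitions[target].append(row)
    return partitions

rows = [{"user_id": i, "amount": i * 10} for i in range(6)]
parts = redistribute(rows, "user_id", num_nodes=3)
# Rows with the same user_id always land on the same node, so a join
# or aggregation on user_id can then run locally on each node.
```

After redistribution, each node holds exactly the rows it needs, which is what makes the share-nothing execution model possible.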

However, an MPP server needs a complex mechanism to schedule tasks, balance the load across nodes, and coordinate parallel processing. Servers based on MPP technology often hide this complexity behind system-level software such as a database. For example, Teradata is relational database software built on MPP technology (the earliest database to use the MPP architecture). When developing applications on such a database, developers face a single database system no matter how many nodes make up the back-end servers, without having to think about how to schedule the load on individual nodes.

MPP architecture characteristics:

Task execution in parallel

Distributed data storage (localization)

Distributed computing

High concurrency, with more than 300 users per node

Scale out to support the expansion of cluster nodes

Share-nothing architecture (no resources shared between nodes).

The difference between NUMA and MPP:

The two have many similarities. First, both NUMA and MPP are composed of multiple nodes; second, each node has its own CPU, memory, I/O, and so on; and both can exchange information through a node interconnect mechanism.

What is the difference between them? First, the node interconnect mechanism differs: NUMA's node interconnect is implemented inside a single physical server, while MPP's node interconnect is implemented through I/O between separate SMP servers.

Secondly, the memory access mechanism is different. Within the NUMA server, any CPU can access the memory of the whole system, but the performance of remote memory access is much lower than that of local memory access. Therefore, remote memory access should be avoided as far as possible when developing applications. In the MPP server, each node only accesses local memory, and there is no remote memory access problem.

II. Batch architecture and MPP architecture

What are the similarities and differences between batch architecture (such as MapReduce) and MPP architecture, and what are their respective advantages and disadvantages?

Similarities:

Both the batch architecture and the MPP architecture are forms of distributed parallel processing: tasks are distributed across multiple servers and nodes to run in parallel, and after each node finishes its computation, the partial results are gathered together to produce the final result.
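This shared scatter/gather pattern can be sketched in a few lines: split the input, compute partial results in parallel, then merge them. The worker count and use of threads are illustrative assumptions:

```python
# Sketch: scatter/gather — split work across workers, compute partial
# results in parallel, then merge them into the final answer.
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    return sum(chunk)

def distributed_sum(data, workers=4):
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, chunks))  # scatter + compute
    return sum(partials)                                # gather: merge partials

print(distributed_sum(list(range(100))))  # → 4950
```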

Differences:

The difference between the batch architecture and the MPP architecture can be illustrated with an example. Suppose we execute a task that is first split into multiple subtasks. In MapReduce, these subtasks are assigned to whichever Executors happen to be idle; in an MPP engine, each subtask that processes a data slice is bound to the specific Executor that holds that slice.
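The two scheduling styles can be contrasted in a toy model (the executor names and the data-placement map are assumptions for illustration only):

```python
# Sketch contrasting batch-style vs MPP-style task scheduling.
import random

# MPP assumption: each data slice lives on exactly one executor.
data_placement = {"slice0": "exec-a", "slice1": "exec-b", "slice2": "exec-c"}
executors = list(data_placement.values())

def schedule_batch(tasks):
    # MapReduce-style: any idle executor may take any task.
    return {t: random.choice(executors) for t in tasks}

def schedule_mpp(tasks):
    # MPP-style: each task is bound to the executor holding its data slice.
    return {t: data_placement[t] for t in tasks}

tasks = list(data_placement)
print(schedule_mpp(tasks))  # always {'slice0': 'exec-a', 'slice1': 'exec-b', 'slice2': 'exec-c'}
```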

It is because of the above differences that the two architectures have their own advantages and disadvantages:

Advantages of batch processing:

In a batch architecture, if an Executor runs too slowly, it is gradually assigned fewer tasks. Batch engines have a speculative-execution strategy: when an Executor is inferred to be too slow or faulty, it is given fewer tasks, or none at all, so cluster performance is not capped by a single problem node.
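The idea of steering tasks away from stragglers can be sketched as follows; the slowness threshold and round-robin assignment are assumptions, not any particular scheduler's policy:

```python
# Sketch: allocate new tasks only to executors not flagged as stragglers.

def allocate(tasks, avg_runtime_by_executor, slow_factor=1.5):
    mean = sum(avg_runtime_by_executor.values()) / len(avg_runtime_by_executor)
    # An executor whose average runtime far exceeds the mean is a straggler.
    healthy = [e for e, t in avg_runtime_by_executor.items()
               if t <= mean * slow_factor]
    # Round-robin new tasks across the healthy executors.
    return {task: healthy[i % len(healthy)] for i, task in enumerate(tasks)}

runtimes = {"exec-a": 1.0, "exec-b": 1.1, "exec-c": 9.0}  # exec-c is slow
plan = allocate(["t1", "t2", "t3", "t4"], runtimes)
# exec-c receives no new tasks, so one slow node does not gate the job.
```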

Defects in batch processing:

Everything comes at a price: the same design that gives batch processing its advantage also causes its drawback. Intermediate results are written to disk, which severely limits data-processing performance.

Advantages of MPP:

The MPP architecture does not need to write intermediate data to disk, because a single Executor only handles a single task, so you can simply stream the data to the next execution phase. This process, called pipelining, provides a significant performance boost.
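The streaming behavior described above can be sketched with Python generators, where each stage pulls rows from the previous one without materializing an intermediate result (a toy model, not any engine's actual operators):

```python
# Sketch: pipelined execution — each stage streams rows to the next
# instead of writing an intermediate result to disk.

def scan(rows):
    for row in rows:          # stage 1: read source rows
        yield row

def filter_stage(rows):
    for row in rows:          # stage 2: filter, one row at a time
        if row % 2 == 0:
            yield row

def project(rows):
    for row in rows:          # stage 3: transform each surviving row
        yield row * 10

pipeline = project(filter_stage(scan(range(10))))
print(list(pipeline))  # → [0, 20, 40, 60, 80]
```

No stage runs to completion before the next begins: rows flow through all three stages one at a time, which is exactly the pipelining property the text describes.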

Defects in MPP:

In an MPP architecture, tasks and Executors are bound together, so if one Executor runs too slowly or fails, the performance of the entire cluster is limited by the speed of that node (the so-called bucket, or short-board, effect); this is MPP's biggest defect. Moreover, the more nodes a cluster has, the higher the probability that some node develops a problem, and once one does, the whole cluster is throttled. In actual production, therefore, MPP clusters should not have too many nodes.
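The scaling trade-off has a simple probabilistic form: if each node is independently slow or faulty with probability p, the chance that at least one node in the cluster is affected grows quickly with cluster size. The value of p here is an illustrative assumption:

```python
# Sketch: probability that at least one of n independent nodes is slow,
# assuming each node is slow with probability p.

def prob_any_node_slow(p, nodes):
    return 1 - (1 - p) ** nodes

for n in (10, 50, 200):
    print(n, round(prob_any_node_slow(0.01, n), 3))
# 10 nodes → ~0.096, 50 nodes → ~0.395, 200 nodes → ~0.866
```

With p = 1%, a 200-node MPP cluster is almost certain to contain a straggler at any given moment, which is why the text advises against very large MPP clusters.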

An example of how the two architectures hit disk: to join two large tables, a batch engine such as Spark writes to disk three times (first write: shuffle table 1 by the join key; second write: shuffle table 2 by the join key; third write: the hash table is written to disk), while MPP writes only once (the hash-table write). This is because MPP runs mappers and reducers at the same time, whereas MapReduce splits them into dependent stages (a DAG) that execute asynchronously, so data dependencies must be resolved by writing intermediate data to shared storage.
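The hash join an MPP engine can keep entirely in memory can be sketched as follows: build a hash table on one input, then stream the other input past it row by row. Table contents are illustrative:

```python
# Sketch: single-pass in-memory hash join — build on one table,
# stream (probe) the other, no intermediate spill to disk.

def hash_join(build_rows, probe_rows, key):
    table = {}
    for row in build_rows:               # build phase
        table.setdefault(row[key], []).append(row)
    for row in probe_rows:               # probe phase streams row by row
        for match in table.get(row[key], []):
            yield {**match, **row}

users = [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}]
orders = [{"id": 1, "total": 30}, {"id": 1, "total": 5}]
print(list(hash_join(users, orders, "id")))
```

Because the probe side is a generator, its output can feed the next pipeline stage directly, which is the one-write behavior the text attributes to MPP.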

Batch architecture and MPP schema convergence:

The advantages and disadvantages of the two architectures are obvious and complementary. If the two could be combined, could each play to its strengths? Batch processing and MPP are in fact gradually converging, and some design solutions already exist; once the technology matures, it may become popular in the big-data field. Let's wait and see!

III. OLAP engines with MPP architecture

There are many OLAP engines using MPP architecture. Here are only a few common engines to compare, which can provide a reference for the company's technology selection.

OLAP engines based on MPP architecture are divided into two categories, one is the engine that does not store data and is only responsible for computing, and the other is the engine that stores data and is also responsible for computing.

1) Engines responsible only for computing, not storage

1. Impala

Apache Impala is a query engine based on the MPP architecture. It does not store any data itself and performs computation directly in memory; it serves data-warehouse workloads and has advantages in real-time querying, batch processing, high concurrency, and so on.

It provides SQL-like (HiveQL-compatible) syntax and maintains high response speed and throughput even in multi-user scenarios. It is implemented in Java and C++: Java provides the query-interaction interface, while C++ implements the query engine.

Impala can share the Hive Metastore, but instead of running slow Hive+MapReduce batches, it uses a distributed query engine similar to those in commercial parallel relational databases (consisting of a Query Planner, Query Coordinator, and Query Exec Engine), so it can query data directly from HDFS or HBase with SELECT, JOIN, and aggregate functions, greatly reducing latency.

Impala is often deployed together with the Kudu storage engine; the biggest advantages of doing so are faster queries and support for Update and Delete operations on the data.

2. Presto

Presto is a distributed query engine based on the MPP architecture. It does not store data itself, but it can access a variety of data sources and supports federated queries across them. Presto is an OLAP tool that excels at complex analysis over large volumes of data, but it is not suited to OLTP scenarios, so do not use Presto as a transactional database.

Presto is a low-latency, high-concurrency in-memory compute engine. When analysis needs data from other systems, it can connect to a variety of data sources, including Hive, relational databases (MySQL, Oracle, TiDB, etc.), Kafka, MongoDB, Redis, and more.

2) Engines responsible for both computing and storage

1. ClickHouse

ClickHouse is an open source database that has attracted much attention in recent years, which is mainly used in the field of data analysis (OLAP).

It includes its own storage and compute capabilities, implements high availability entirely on its own, and supports complete SQL syntax including JOIN, giving it clear technical advantages. Compared with the Hadoop ecosystem, doing big-data processing the database way is easier to use, cheaper to learn, and more flexible. The community is still developing rapidly, it is very active in China, and various large companies have adopted it at scale.

ClickHouse has done very meticulous work in the compute layer, squeezing every bit of capacity out of the hardware to improve query speed. It implements many important techniques, such as multi-core parallelism on a single machine, distributed computing, vectorized execution with SIMD instructions, code generation, and so on.

Driven by the requirements of OLAP scenarios, ClickHouse developed a new, efficient columnar storage engine and implemented rich features such as ordered data storage, primary-key indexes, sparse indexes, data sharding, data partitioning, TTL, and primary/replica replication. Together, these features lay the foundation for ClickHouse's extremely fast analytical performance.
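The sparse-index idea can be sketched simply: keep one index entry ("mark") per block of sorted rows, binary-search the marks, and scan only the candidate block. The block size (granularity) here is an illustrative assumption, not ClickHouse's default:

```python
# Sketch: sparse primary index over a sorted column — one mark per block,
# so the index stays tiny while lookups scan only one block.
import bisect

def build_sparse_index(sorted_keys, granularity=4):
    # One mark per granule: the first key of each block.
    return [sorted_keys[i] for i in range(0, len(sorted_keys), granularity)]

def lookup(sorted_keys, marks, target, granularity=4):
    block = max(bisect.bisect_right(marks, target) - 1, 0)
    start = block * granularity
    return [k for k in sorted_keys[start:start + granularity] if k == target]

keys = list(range(0, 40, 2))        # sorted column: 0, 2, ..., 38 (20 rows)
marks = build_sparse_index(keys)    # only 5 marks cover all 20 rows
print(lookup(keys, marks, 10))      # → [10]
```

The trade-off is that a lookup may scan up to one full granule of rows, but the index itself is small enough to stay in memory even for billions of rows.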

2. Doris

Doris, led by Baidu, is a big-data analysis engine rewritten from the Google Mesa paper and the Impala project. It is a massive distributed KV storage system whose design goal is to support medium-sized, highly available, and scalable KV storage clusters.

Doris achieves mass storage, linear scaling, smooth expansion, automatic fault tolerance and failover, high concurrency, and low operation-and-maintenance cost. For deployment scale, 4 to 100+ servers are recommended.

The main architecture of Doris3: the DT (Data Transfer) module handles data import, the DS (Data Searcher) module handles queries, and the DM (Data Master) module manages cluster metadata; data is stored in the Armor distributed key-value engine. Doris3 stores metadata in ZooKeeper, so the other modules can remain stateless and the system as a whole avoids any single point of failure.

3. Druid

Druid is an open source, distributed, column-oriented real-time analysis data storage system.

The key features of Druid are as follows:

Sub-second OLAP query analysis: achieved through key techniques such as columnar storage, inverted indexes, and bitmap indexes

Complete the filtering, aggregation and multi-dimensional analysis of massive data at the subsecond level

Real-time streaming data analysis: Druid provides real-time streaming data analysis and efficient real-time writing

Visualization of real-time data in subseconds

Rich data analysis functions: Druid provides a friendly visual interface

SQL query language

High availability and high scalability:

Druid worker nodes each have a single responsibility and do not depend on one another

Druid clusters are easy to manage, fault-tolerant, disaster-ready, and easy to scale
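The bitmap-index filtering behind Druid's sub-second queries can be sketched as follows: build one bitmap per dimension value, then evaluate a multi-dimension filter as a bitwise AND. The column data here are illustrative:

```python
# Sketch: bitmap indexes — one bitmap per dimension value; a filter over
# several dimensions becomes a cheap bitwise AND of the bitmaps.

def build_bitmaps(column):
    bitmaps = {}
    for i, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << i)  # set bit i
    return bitmaps

city = ["sf", "ny", "sf", "la", "ny", "sf"]
device = ["ios", "ios", "android", "ios", "android", "ios"]

# Filter: city = 'sf' AND device = 'ios'
matches = build_bitmaps(city)["sf"] & build_bitmaps(device)["ios"]
rows = [i for i in range(len(city)) if matches >> i & 1]
print(rows)  # → [0, 5]
```

Real systems compress these bitmaps (e.g. with roaring bitmaps), but the AND-of-bitmaps principle is the same.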

4. TiDB

TiDB is an open source distributed relational database designed and developed by PingCAP Company. It is an integrated distributed database product that supports both OLTP and OLAP.

TiDB is compatible with the MySQL 5.7 protocol and important features of the MySQL ecosystem. Its goal is to provide users with a one-stop OLTP, OLAP, and HTAP solution. TiDB suits application scenarios that require high availability, strong consistency, and large data scale.

5. Greenplum

Greenplum is a very powerful relational distributed database with an MPP architecture built on open-source PostgreSQL. To integrate with the Hadoop ecosystem, HAWQ was introduced: it retains Greenplum's high-performance analysis engine but replaces local-disk storage with HDFS, avoiding the reliability problems of local disks while fitting into the Hadoop ecosystem.

3) Comparison of commonly used engines

[Figure: comparison of common OLAP engines]

At this point, you should have a deeper understanding of "what are the classifications of MPP processing architecture". Try putting it into practice, and keep learning!
