

[big data practical Information] big data platform implementation based on Hadoop-- overall architecture design

2025-02-21 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/03 Report--


Big data's popularity keeps climbing; it has become the next star after cloud computing. Let's set aside whether big data actually suits your company or organization; on the Internet, at least, it has been touted as an all-powerful super battleship. It seems we jumped overnight from the Internet era into the big data era! As for what big data actually is, to be honest, it still feels to me much like cloud computing did, like watching the movie "Cloud Atlas": a head-in-the-clouds feeling. The companies promoting big data products to you will paint a beautiful utopian picture, but you should at least keep a clear head and ask yourself honestly: does our company really need big data?

As a third-party payment company, data is indeed our most important core asset. Because the company was founded only recently and the business has grown rapidly, transaction data has increased geometrically, and the burden on our systems has grown with it. Business departments, managers, and even the group's executives clamor all day for reports, analysis, and competitive insight. All the R&D department can do is run some unimaginably complex SQL statement, after which the system goes on strike: memory overflow, downtime. It's a nightmare. OMG, please release me!

In fact, the pressure on the data department is hard to overstate: aggregating all the scattered data into a valuable report can take weeks or longer, which is plainly incompatible with the rapid response the business units demand. As the saying goes, to do good work one must first sharpen one's tools. It is time for us to trade the shotgun for a cannon.

There are plenty of articles online describing big data's benefits, and a large crowd eager to share their big data experiences. But I would ask: how many people and how many organizations are really doing big data? What are the actual results? Does it really bring value to the company, and can that value be quantified? I have seen few answers to these questions. Perhaps big data is still so new (the underlying concepts are not new at all; it is old wine in a new bottle) that people remain immersed in all kinds of wonderful fantasies.

As serious technicians, after a brief period of blind worship we should move quickly into researching real, deployable applications; that is the essential difference between architects in the clouds and architects on bicycles. What I want to express is simple: do not be dazzled by new things, do not worship anything blindly, and certainly do not just repeat what others say. For us practitioners, that is absolutely unacceptable.

Having said all that, it's time to get down to business. The company's senior management has decided to formally implement a big data platform across the group (some community experts have even been invited; we are looking forward to it). For a third-party payment company, implementing a big data platform makes obvious sense, so I am actively participating in the project. My research into the OSGi enterprise framework had just wrapped up, so I want to use CSDN to record the platform's implementation process; I think it can serve as a useful reference for other individuals or companies with similar ideas.

The first part is the overall architecture design of big data platform.

Software architecture design

The big data platform's architecture follows a layered design: the services the platform provides are divided into module layers by function, and each layer interacts only with the layer directly above or below it (through interfaces at the layer boundary), avoiding cross-layer interaction. The benefit of this design is that each functional module is highly cohesive internally while the modules remain loosely coupled to one another, which helps the platform achieve high reliability, high scalability, and easy maintainability. For example, when we need to expand the Hadoop cluster, we only add a new Hadoop node server at the infrastructure layer; no other module layer changes, and the expansion is completely transparent to users.

According to its functions, the whole big data platform is divided into five module levels, from bottom to top:

Runtime environment layer:

The runtime environment layer provides the runtime environment for the infrastructure layer. It consists of two parts: the operating system and the runtime software.

(1) For the operating system, we recommend RHEL 5.0 (64-bit) or above. In addition, to improve disk IO throughput, avoid RAID; instead, spread the distributed file system's data directories across different disk partitions to improve disk IO performance.

(2) The specific requirements of the runtime environment are as follows:

- JDK 1.6 or above: Hadoop requires a Java runtime environment, so the JDK must be installed. Required.
- gcc/g++ 3.x or above: a gcc compiler is required when running MapReduce tasks with Hadoop Pipes. Optional.
- Python 2.x or above: a Python runtime is required when running MapReduce tasks with Hadoop Streaming. Optional.

Infrastructure layer:

The infrastructure layer consists of two parts: the ZooKeeper cluster and the Hadoop cluster. It provides infrastructure services, such as naming services, the distributed file system, and MapReduce, to the basic platform layer.

(1) The ZooKeeper cluster provides name mapping. Serving as the naming server of the Hadoop cluster, it lets the basic platform layer's task scheduling console locate the NameNode of the Hadoop cluster, and it supports failover.

(2) The Hadoop cluster is the core of the big data platform and the infrastructure of the basic platform layer. It provides services such as HDFS, MapReduce, JobTracker, and TaskTracker. We currently adopt a dual-master-node mode to avoid a single point of failure in the Hadoop cluster.
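The failover behavior described in (1) can be sketched in miniature. The snippet below is a hypothetical, in-memory stand-in for the naming service, not the real ZooKeeper client API; the node addresses (`master1:8021` and so on) are invented for illustration.

```python
# Conceptual sketch of name lookup with failover (hypothetical, in-memory;
# a real deployment would use a ZooKeeper client library instead).

class NamingService:
    """Maps a logical service name to an ordered list of candidate nodes."""

    def __init__(self):
        self._registry = {}   # service name -> ordered list of addresses
        self._down = set()    # addresses currently marked as failed

    def register(self, service, addresses):
        self._registry[service] = list(addresses)

    def mark_down(self, address):
        self._down.add(address)

    def lookup(self, service):
        # Fail over to the first registered address not known to be down.
        for addr in self._registry.get(service, []):
            if addr not in self._down:
                return addr
        raise LookupError("no live node for service %r" % service)


ns = NamingService()
ns.register("hadoop/jobtracker", ["master1:8021", "master2:8021"])
print(ns.lookup("hadoop/jobtracker"))   # master1:8021
ns.mark_down("master1:8021")            # the primary master fails
print(ns.lookup("hadoop/jobtracker"))   # master2:8021
```

The scheduling console would perform a lookup like this before each job submission, so a master failure is absorbed without any client-side reconfiguration.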

Basic platform layer:

The basic platform layer consists of three parts: the task scheduling console, HBase, and Hive. It provides the basic service invocation interfaces for the user gateway layer.

(1) The task scheduling console is the dispatch center for MapReduce tasks; it assigns the order and priority of the various tasks. Users submit job tasks through the scheduling console and retrieve the results through the Hadoop client in the user gateway layer. The concrete steps are:

1. After receiving a job submitted by a user, the task scheduling console matches it against its scheduling algorithm;

2. It requests from ZooKeeper the address of an available JobTracker node in the Hadoop cluster;

3. It submits the MapReduce job task;

4. It polls to check whether the job task has completed;

5. If the job has finished, it sends a notification and invokes the callback function;

6. It proceeds to the next job task.

To make the Hadoop cluster implementation complete, we intend to develop the task scheduling console ourselves, which gives us much stronger flexibility and control.
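The steps above can be sketched as a simple control loop. This is a hypothetical, in-memory simulation of the console's flow, not real Hadoop client code; the job names, priorities, and callback shapes are invented for illustration.

```python
# Hypothetical sketch of the scheduling console's job loop: sort by priority,
# resolve a JobTracker via the naming service, submit, poll, fire callback.
import collections

def run_jobs(jobs, lookup_jobtracker, poll, on_done):
    """Process queued jobs in priority order (lower number = higher priority)."""
    queue = collections.deque(sorted(jobs, key=lambda j: j["priority"]))
    results = []
    while queue:
        job = queue.popleft()
        tracker = lookup_jobtracker()      # ask the naming service for a live JobTracker
        handle = (tracker, job["name"])    # stand-in for a submitted job handle
        while not poll(handle):            # poll until the job reports completion
            pass
        results.append(on_done(job))       # notify the caller via the callback
    return results

done = run_jobs(
    jobs=[{"name": "settlement", "priority": 2}, {"name": "report", "priority": 1}],
    lookup_jobtracker=lambda: "master1:8021",
    poll=lambda handle: True,              # pretend every job finishes immediately
    on_done=lambda job: job["name"],
)
print(done)  # ['report', 'settlement']
```

A real console would replace the lambdas with ZooKeeper lookups and Hadoop job-status RPCs, but the submit/poll/callback structure is the same.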

(2) HBase is a column-oriented database built on Hadoop that provides users with table-based data access services.

(3) Hive is a query service on top of Hadoop. Users submit SQL-like (HQL) query requests through the Hive client in the user gateway layer and view the query results in the client's UI. This interface can provide quasi-real-time data query and statistics services for the data department.

User Gateway layer:

The user gateway layer provides personalized calling interfaces and user identity authentication for end users; it is the only entry point to the big data platform that users can see. End users can interact with the platform only through the interfaces the gateway layer provides. At present, the gateway layer offers three personalized calling interfaces:

(1) The Hadoop client is the entry point for users to submit MapReduce jobs; the returned results can be viewed in its UI.

(2) The Hive client is the entry point for users to submit HQL query requests; the query results can be viewed in its UI.

(3) Sqoop is the bridge between relational databases and HBase or Hive. Data can be imported from a relational database into HBase or Hive as required, giving users the ability to query it via HQL; conversely, data in HBase, Hive, or HDFS can be exported back to the relational database for further analysis by other systems.
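Conceptually, the Sqoop round trip maps relational rows to HBase-style row-key/column maps and back. The sketch below is an in-memory analogue with made-up field names (`txn_id`, `amount`), not the Sqoop tool itself, which runs MapReduce jobs over a JDBC connection.

```python
# In-memory analogue of a Sqoop import/export round trip (hypothetical data).

def import_to_hbase(rows, key_field):
    """Relational rows -> HBase-style {row key: {column: value}} map."""
    return {row[key_field]: {k: v for k, v in row.items() if k != key_field}
            for row in rows}

def export_to_rdbms(table, key_field):
    """HBase-style map -> relational rows; the inverse of the import."""
    return [dict({key_field: key}, **cols) for key, cols in table.items()]

rows = [{"txn_id": "T1", "amount": 100}, {"txn_id": "T2", "amount": 250}]
hbase_table = import_to_hbase(rows, key_field="txn_id")
assert export_to_rdbms(hbase_table, key_field="txn_id") == rows
```

The key design decision Sqoop forces on you is the same one shown here: which relational column becomes the HBase row key.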

The user gateway layer can be extended as needed to meet the requirements of different users.

Customer Application layer:

Customer application layer is a variety of terminal applications, including: a variety of relational databases, reports, transaction behavior analysis, billing, settlement and so on.

At present, the applications we can envision landing on the big data platform are:

1. Behavior analysis: transaction data is imported from the relational database into the Hadoop cluster; MapReduce job tasks are written according to the data-mining algorithms and submitted to the JobTracker for distributed computation, with the results stored in Hive. End users then query the statistical analysis results via HQL through the Hive client.

2. Statements: transaction data is imported from the relational database into the Hadoop cluster; MapReduce job tasks are written according to the business rules and submitted to the JobTracker for distributed computation. End users retrieve the statement result files through the Hadoop client (HDFS is a distributed file system, so the usual file access capabilities are available).

3. Clearing and settlement: the UnionPay files are imported into HDFS and matched by MapReduce against the POSP transaction data imported from the relational database (the reconciliation step); the result feeds another MapReduce job that computes the fees and distribution (the settlement step); finally, the results are imported back into the relational database, where the user triggers the merchant transfers (the transfer step).
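The map/shuffle/reduce pattern behind all three jobs can be illustrated in miniature. The following is an in-memory sketch with hypothetical field names (`merchant`, `amount`); a real job would be written against the Hadoop MapReduce API and run on the cluster.

```python
# In-memory sketch of the MapReduce model: map emits key/value pairs,
# the framework shuffles and sorts by key, reduce aggregates per key.
from itertools import groupby
from operator import itemgetter

def map_phase(txn):
    # Emit (merchant, amount) pairs, like a Mapper's map() calls.
    yield (txn["merchant"], txn["amount"])

def reduce_phase(merchant, amounts):
    # Sum per key, like a Reducer aggregating the shuffled values.
    return (merchant, sum(amounts))

def run_mapreduce(records):
    pairs = sorted(kv for r in records for kv in map_phase(r))  # map + shuffle/sort
    return dict(reduce_phase(k, [v for _, v in group])
                for k, group in groupby(pairs, key=itemgetter(0)))

txns = [{"merchant": "A", "amount": 10},
        {"merchant": "B", "amount": 5},
        {"merchant": "A", "amount": 7}]
print(run_mapreduce(txns))  # {'A': 17, 'B': 5}
```

Swap the map and reduce functions and the same skeleton covers per-merchant statements or the reconciliation matching described above; only the emitted keys and the aggregation change.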

Deployment architecture design

Key points description:

1. At present, the entire Hadoop cluster is housed in the UnionPay data center.

2. The Hadoop cluster has 2 Master nodes and 5 Slave nodes; the two Master nodes back each other up, with failover implemented through ZooKeeper. Both Master nodes share all Slave nodes, ensuring that the distributed file system's replicas exist across all DataNode nodes. All hosts in the Hadoop cluster must be on the same network segment and in the same rack to guarantee the cluster's IO performance.

3. The ZooKeeper cluster is configured with at least 2 hosts to avoid a single point of failure in the naming service. With ZooKeeper, we no longer need F5 for load balancing; the task scheduling console can balance access to the Hadoop name nodes directly through ZK.

4. All servers must be configured for passwordless SSH access.

5. External and internal users must go through the gateway to access the Hadoop cluster, and the gateway provides service only after identity authentication, ensuring the cluster's access security.
