2025-03-31 Update — From: SLTechnology News & Howtos
Shulou (Shulou.com) 06/01 Report
This article is about how to solve the manageability problems of big data distributed systems. The approaches described here are practical, and I hope you get something out of reading it.
Today we continue our look at the basic means of making distributed systems manageable.
Basic means of making a distributed system manageable
Directory Service (ZooKeeper)
A distributed system is a whole composed of many processes, and each member of that whole has some state: the modules it is responsible for, its current load, the data it holds, and so on. This state, which concerns other processes as well, becomes very important during fault recovery and capacity expansion or reduction.
Simple distributed systems can record this data in static configuration files: the connection relationships between processes, their IP addresses and ports, and so on. However, a highly automated distributed system inevitably requires that this state data be stored dynamically; only then can the program perform disaster recovery and load balancing on its own.
Some programmers write their own DIR service (directory service) to record the running state of the processes in the cluster. The processes in the cluster automatically register with this DIR service, so that during disaster recovery, expansion, and load balancing, requests can be redirected automatically according to the data in the directory, bypassing failed machines or connecting to newly added servers.
However, if we use just one process to do this job, that process becomes the "single point" of the cluster: if it fails, the entire cluster may stop working. Therefore, the directory service that stores cluster state must itself be distributed. Fortunately, we have ZooKeeper, an excellent piece of open source software that is exactly such a distributed directory service.
ZooKeeper is simple to run: start an odd number of ZooKeeper processes and they form a small directory service cluster. This cluster provides every other process with the ability to read and write a shared "configuration tree". The data is not stored in just one ZooKeeper process but replicated across several of them by a carefully designed consensus protocol, which makes ZooKeeper an excellent distributed data store.
Because ZooKeeper's data storage structure is a tree resembling a file system, we often bind each process to one of its "branches", and then forward server requests by inspecting those branches, which simply solves the request-routing problem (deciding who handles what). In addition, the load status of each process can be recorded on its branch, so load balancing becomes easy as well.
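The routing pattern above can be sketched in plain Python with an in-memory stand-in for the directory tree; nothing here is a real ZooKeeper API, and all names (paths, worker IDs, the "load" field) are invented for illustration:

```python
# A minimal sketch of directory-based routing: each worker registers on a
# "branch" of a directory tree along with its current load, and a router
# forwards each request to the least-loaded live worker.

class DirectoryService:
    """In-memory stand-in for a ZooKeeper-style configuration tree."""
    def __init__(self):
        self.tree = {}          # path -> metadata, e.g. "/workers/w1" -> {"load": 3}

    def register(self, path, meta):
        self.tree[path] = meta

    def unregister(self, path):
        self.tree.pop(path, None)

    def children(self, prefix):
        return {p: m for p, m in self.tree.items() if p.startswith(prefix + "/")}

def route_request(directory, prefix="/workers"):
    """Pick the registered worker with the lowest recorded load."""
    workers = directory.children(prefix)
    if not workers:
        raise RuntimeError("no live workers registered")
    return min(workers, key=lambda p: workers[p]["load"])

d = DirectoryService()
d.register("/workers/w1", {"load": 5})
d.register("/workers/w2", {"load": 2})
print(route_request(d))        # w2 has the lowest load
d.unregister("/workers/w2")    # simulate w2 crashing (its branch vanishes)
print(route_request(d))        # traffic now bypasses the failed machine
```

In real ZooKeeper the crash case is handled by ephemeral znodes, which disappear automatically when the owning process's session ends.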
The directory service is one of the most critical components of a distributed system, and ZooKeeper is a good open source project that fits the task well.
Message queue services (ActiveMQ, ZeroMQ, JGroups)
If we want two processes on different machines to communicate, we almost always use TCP or UDP. But writing cross-process communication directly against the raw network API is troublesome: besides a lot of low-level socket code, we have to solve a series of problems, such as how to find the process to exchange data with, how to keep packets intact and undropped, and what to do if the other process dies or needs to restart. These problems map onto requirements such as disaster recovery, capacity expansion, and load balancing.
To solve inter-process communication in distributed systems, people have distilled an effective model: the "message queue". It abstracts the interactions between processes as the handling of individual messages, and provides "queues", i.e. pipes, to hold messages temporarily. Each process can access one or more queues, from which it reads (consumes) messages or into which it writes (produces) them. Because there is a buffering pipe in between, we can safely change the state of a process: when a process restarts, it simply resumes consuming messages. How a message is routed is determined by which queue it is stored in, which turns the complex routing problem into the problem of managing static queues.
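The decoupling described above can be sketched with Python's standard thread-safe queue standing in for the pipe between two processes (the message names and the sentinel convention are illustrative choices, not part of any particular queue product):

```python
# A minimal sketch of the message-queue model: a producer writes messages
# into a buffering pipe, and a consumer reads them independently.
import queue
import threading

pipe = queue.Queue()            # the buffering pipe between the two sides

def producer():
    for i in range(3):
        pipe.put(f"msg-{i}")    # write (produce) messages into the queue
    pipe.put(None)              # sentinel meaning "no more messages"

consumed = []

def consumer():
    while True:
        msg = pipe.get()        # read (consume) messages from the queue
        if msg is None:
            break
        consumed.append(msg)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(consumed)                 # ['msg-0', 'msg-1', 'msg-2']
```

Neither side needs to know when the other is running; the queue absorbs the timing difference, which is exactly what makes restarting a process safe.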
Message queue services generally provide simple "deliver" and "collect" interfaces, but managing the queues themselves is more complex, and there are broadly two approaches. Some message queue services advocate point-to-point queue management: a separate message queue between each pair of communicating nodes. The advantage is that messages from different sources cannot affect each other, and a flood of messages in one queue cannot crowd out the buffer space of other queues. The consuming program can also define its own processing priorities, reading more often from one queue and less often from others.
However, point-to-point queues multiply rapidly as the cluster grows, which is a burden for both memory footprint and operations. Therefore, more advanced message queue services let different queues share memory space, and automate the registration, creation, and deletion of queues along with their address information. This automation usually relies on the "directory service" described above to register, for each queue ID, the corresponding physical IP and port. For example, many developers use ZooKeeper as the central registry of their message queue service, while software such as JGroups maintains cluster membership itself, tracking the history and current state of each node.
The other type of message queue works like a public mailbox: the message queue service is a standalone process, and any user can deliver messages to it or collect messages from it. This makes queues easier to use and easier to operate. However, in this model any message takes at least two inter-process hops from sending to processing, so latency is relatively high. And because there is no agreed delivery and collection contract, it is also relatively easy to introduce bugs.
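A minimal sketch of this "public mailbox" (broker) style, with invented queue names and message contents; the two method calls correspond to the two inter-process hops mentioned above:

```python
# One broker process holds named queues; any client can deliver to or
# collect from them. Sender -> broker is hop one, broker -> receiver hop two.
from collections import defaultdict, deque

class Broker:
    def __init__(self):
        self.queues = defaultdict(deque)   # queue name -> pending messages

    def deliver(self, queue_name, message):     # first hop: sender -> broker
        self.queues[queue_name].append(message)

    def collect(self, queue_name):              # second hop: broker -> receiver
        q = self.queues[queue_name]
        return q.popleft() if q else None

broker = Broker()
broker.deliver("billing", {"order": 42})
broker.deliver("billing", {"order": 43})
print(broker.collect("billing"))   # {'order': 42}
print(broker.collect("mail"))      # None: nothing pending on that queue
```

Note how `collect` simply returns `None` when a queue is empty: without a delivery contract, the receiver cannot tell "no messages yet" from "sender is dead", which is one source of the bugs mentioned above.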
Whichever kind of message queue service is used, inter-process communication is a problem every distributed server-side system must solve. So when server-side programmers write distributed system code, the most common pattern is message-queue-driven code, which is precisely what led EJB 3.0 to add the "message-driven bean" to its specification.
Transaction system
In distributed systems, transactions are among the hardest technical problems to solve. Because a transaction may be distributed across different processes, any of those processes may fail, and that failure requires a rollback, which mostly involves multiple other processes. It is a spreading, multi-process communication problem. To solve transactions in a distributed system, two core tools are required: a stable state storage system, and a convenient, reliable broadcast system.
The state of every step of a transaction must be visible to the whole cluster and be disaster-tolerant. This requirement is generally borne by the cluster's "directory service": if our directory service is robust enough, we can synchronize each transaction's processing state into it. Here, once again, ZooKeeper plays an important role.
If a transaction is interrupted and needs to roll back, the rollback involves multiple steps that have already been executed. Perhaps the rollback only needs to happen at the entry point (which holds the data needed for the rollback), or it may need to happen on every processing node. In the latter case, the node that hit the exception must broadcast a message like "roll back! transaction ID XXXX" to all other relevant nodes. The transport for this broadcast is usually a message queue service, while software such as JGroups provides broadcast directly.
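The rollback broadcast can be sketched as follows; the broadcast here is an in-memory loop standing in for a real channel such as a message queue or JGroups, and the transaction IDs and step data are invented:

```python
# Each node remembers which transaction steps it has already executed;
# when one node fails, a rollback for that transaction is broadcast and
# every relevant node undoes its step.

class Node:
    def __init__(self, name):
        self.name = name
        self.applied = {}                    # txn_id -> step data already executed

    def execute_step(self, txn_id, data):
        self.applied[txn_id] = data

    def on_rollback(self, txn_id):
        self.applied.pop(txn_id, None)       # undo this transaction's step

def broadcast_rollback(nodes, txn_id):
    for node in nodes:                       # stand-in for a broadcast channel
        node.on_rollback(txn_id)

nodes = [Node("a"), Node("b"), Node("c")]
for n in nodes:
    n.execute_step("tx-1", "partial work")
broadcast_rollback(nodes, "tx-1")            # one node failed: everyone rolls back
print(all("tx-1" not in n.applied for n in nodes))   # True
```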
Although we are talking about transaction systems here, the "distributed lock" functionality that distributed systems often need can be provided by the same machinery. The so-called "distributed lock" is a constraint that lets every node check first and then execute. If we have an efficient directory service with atomic single operations, the lock state is really just the state record of a "single-step transaction", and the rollback operation defaults to "back off and retry later". This "lock" approach is simpler than full transaction processing and therefore more reliable, so more and more developers now prefer to use a "lock" service rather than implement a "transaction system".
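The "check first, then execute" lock can be sketched like this; a `threading.Lock` plus a dict stands in for the directory service's atomic single operation, and the lock paths and owner names are invented:

```python
# A distributed lock as a "single-step transaction": atomically check the
# lock's branch in the directory, and write it only if it is free. A failed
# check means "pause the operation and try again later".
import threading

class LockService:
    def __init__(self):
        self._atomic = threading.Lock()   # stands in for the directory's atomicity
        self._owners = {}                 # lock name -> current owner

    def try_acquire(self, name, owner):
        with self._atomic:
            if name in self._owners:      # check...
                return False              # held: retry later
            self._owners[name] = owner    # ...then execute
            return True

    def release(self, name, owner):
        with self._atomic:
            if self._owners.get(name) == owner:
                del self._owners[name]

svc = LockService()
print(svc.try_acquire("/locks/job-7", "node-a"))   # True: first check succeeds
print(svc.try_acquire("/locks/job-7", "node-b"))   # False: must retry later
svc.release("/locks/job-7", "node-a")
print(svc.try_acquire("/locks/job-7", "node-b"))   # True: lock is free again
```

Because failure is handled by simply retrying, there is nothing to roll back, which is why this pattern is so much easier to get right than a multi-step transaction.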
Automated deployment tool (Docker)
Distributed systems need to change service capacity at run time: expansion or reduction (which may require service interruption). And when some nodes in a distributed system fail, new nodes are needed to take over their work. If all this is still handled by old-fashioned server management, filling out forms, filing requests, entering the machine room, racking servers, and installing software, efficiency will certainly be poor.
In a distributed environment, we generally manage services as a "pool". We requisition a batch of machines in advance, run service software on some of them, and keep the others as spares. Obviously, such a batch of servers should not serve one business alone; it will carry multiple different businesses, and the spare servers become a backup "pool" shared by all of them. As business requirements change, some servers may "leave" service A and "join" service B.
Such frequent service changes rely on highly automated deployment software. Our operators should carry out these operations with deployment tools provided by the developers, not with thick manuals. More experienced development teams unify all their underlying business frameworks in the hope that most deployment and configuration can be managed by one common system. In the open source world there have been similar attempts; none is better known than the RPM package format. However, RPM packaging is still too complex to meet the deployment needs of server-side programs, so programmable general-purpose deployment systems, represented by Chef, appeared.
After virtual machine technology emerged, PaaS platforms provided strong support for automated deployment: if we write an application to the specification of a given PaaS platform, we can hand the program entirely to the platform to deploy, and its capacity calculation and deployment planning are done automatically. The leader in this area is Google's App Engine: we can develop a Web application locally in Eclipse, upload it to App Engine, and the entire deployment is done. App Engine automatically scales out, scales in, and recovers from failures according to the application's traffic.
The truly revolutionary tool, however, is Docker. Virtual machines and sandboxing are not new technologies, but using them as deployment tools is a fairly recent development. Linux's efficient, lightweight container technology makes deployment enormously convenient: we can package our application together with its libraries and collaborating software into an image, and then deploy it on any Linux system at will.
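As a concrete illustration of packaging an application with its environment, here is a minimal hypothetical Dockerfile; the base image version, file names, and port are all invented for the example:

```dockerfile
# A hypothetical service image: base OS + runtime + our application.
FROM ubuntu:22.04

# Install the runtime the service depends on.
RUN apt-get update && apt-get install -y python3 && rm -rf /var/lib/apt/lists/*

# Copy the application into the image; server.py is an invented example name.
COPY server.py /opt/app/server.py

# The port the service listens on (illustrative).
EXPOSE 8080

# Start the service when the container runs.
CMD ["python3", "/opt/app/server.py"]
```

Once built with `docker build`, the same image runs identically on any Linux host via `docker run`, which is what makes the "pool" style of server management practical.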
To manage a large number of distributed server-side processes, we really do need to invest a lot of work in optimizing their deployment and management. Unifying the running specification of server-side processes is the basic precondition for automated deployment management. We can take the "operating system" as the unit of specification and use Docker, take the "Web application" as the unit and adopt some PaaS platform technology, or define more specific specifications of our own and develop a complete distributed computing platform.
Log service (log4j)
Server-side logging has always been important yet easy to neglect. Many teams at first treat logs merely as an aid for development debugging and bug removal, but they soon find that once the service is running, logs are almost the only effective way to understand what a server-side system is doing at run time.
Although we have all kinds of profiling tools, most of them are unsuitable for services in production because they seriously degrade performance. So more often we must analyze from logs. Although a log is essentially just lines of text, its flexibility makes it highly valued by developers and operators alike.
The log itself is conceptually a vague thing: you could open any file and write some information into it. But modern server systems generally impose some standards on logs:
- Logs must be written line by line, which makes later statistical analysis easier; every log line should carry a uniform header, of which date and time are the minimum requirement.
- Log output should be leveled, e.g. fatal/error/warning/info/debug/trace, and the program should be able to adjust the output level at run time, so as to save the cost of log printing.
- The log header generally carries fields such as the user's ID or IP address, used to quickly find, locate, and filter a batch of related records; there may be other fields for filtering and narrowing the view, a capability sometimes called log "coloring". Log files also need "rotation": keeping a fixed number of fixed-size files, so that long-running services do not fill up the disk.
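These standards can be sketched with Python's standard logging module; the logger name, format string, user field, and file sizes are illustrative choices:

```python
# A sketch of the logging standards above: a uniform header (timestamp,
# level), a run-time-adjustable level, a user ID field in the header, and
# a size-based rotating file handler.
import logging
import logging.handlers
import os
import tempfile

log_path = os.path.join(tempfile.mkdtemp(), "service.log")

logger = logging.getLogger("service")
logger.setLevel(logging.INFO)                  # adjustable at run time

handler = logging.handlers.RotatingFileHandler(
    log_path, maxBytes=1_000_000, backupCount=5)   # rotation: 5 files max
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s user=%(user)s %(message)s"))
logger.addHandler(handler)

# The user ID travels in the header so a batch of records can be filtered.
logger.info("payment accepted", extra={"user": "u-1001"})
logger.debug("cache probe", extra={"user": "u-1001"})   # suppressed: below INFO

handler.flush()
with open(log_path) as f:
    lines = f.read().splitlines()
print(len(lines))    # 1: only the INFO line was written
```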
Because of these requirements, the open source community provides many ready-made logging component libraries, such as the famous log4j and the large log4X family of libraries, all widely used and well-regarded tools.
Compared with log printing, however, log collection and statistics are easily overlooked. A distributed-system programmer certainly wants to collect and aggregate the whole cluster's logs at a centralized node, and may even want certain log statistics recomputed repeatedly at short intervals to monitor the health of the entire cluster.
To do this, there must be a distributed file system to store the continuously arriving logs (often shipped over UDP). On top of that file system, we need a statistics system with a MapReduce-like architecture, so that we can quickly aggregate and alert on massive volumes of log data. Some developers use the Hadoop system directly, while others use Kafka as the log storage layer and build their own statistics programs on top.
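A tiny MapReduce-flavored sketch of such centralized log statistics, counting errors per host so a monitor could alert on unhealthy machines; the log line format and host names are invented:

```python
# Map each log line to (host, 1) when it is an ERROR, then reduce by
# summing per host: the shape of a MapReduce-style log statistics job.
from collections import Counter

log_lines = [
    "2025-03-31T10:00:01 host-a ERROR disk full",
    "2025-03-31T10:00:02 host-b INFO request ok",
    "2025-03-31T10:00:03 host-a ERROR disk full",
    "2025-03-31T10:00:04 host-c WARN slow query",
]

def map_phase(line):
    """Emit (host, 1) for every error line, nothing otherwise."""
    _, host, level, *_ = line.split()
    return [(host, 1)] if level == "ERROR" else []

def reduce_phase(pairs):
    """Sum the emitted counts per host."""
    totals = Counter()
    for host, n in pairs:
        totals[host] += n
    return dict(totals)

errors = reduce_phase(p for line in log_lines for p in map_phase(line))
print(errors)    # {'host-a': 2}
```

In a real deployment the map phase would run near the stored log shards and only the per-host partial counts would travel to the reducer, which is what makes the approach scale.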
The log service is the dashboard and periscope of distributed operations. Without a reliable log service, the health of the entire system may spin out of control. So no matter how many nodes your distributed system has, you must devote real effort and dedicated development time to building a system for automatic statistical analysis of logs.
That covers how to solve the manageability problems of big data distributed systems. Some of these points are likely to come up in daily work, and I hope this article has taught you something new.