2025-04-04 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/02 Report--
1. Introduction to ZooKeeper concepts
Before introducing ZooKeeper, let's first look at distributed coordination technology. Distributed coordination is mainly used to handle synchronization among multiple processes in a distributed environment, so that they access shared resources in an orderly manner and avoid the consequences of resource contention (split-brain).
First, what is a distributed system? A distributed system is an application system composed of multiple servers distributed across different regions that together provide a service to users. In a distributed system, the most important problem is process coordination. Suppose there is an application system made up of servers in three regions, and a resource is mounted on the first machine. All three geographically distributed application processes compete for this resource, but we do not want multiple processes to access it at the same time, so we need a coordinator to let them access the resource in an orderly manner. This coordinator is the "lock" that is often mentioned in distributed systems. For example, when process 1 wants to use the resource, it first acquires the lock; once acquired, process 1 has exclusive access to the resource and no other process can touch it. After process 1 finishes with the resource, it releases the lock so that another process can acquire it. Through this lock mechanism, multiple processes in a distributed system can be guaranteed to access shared resources in an orderly manner. A lock in such a distributed environment is called a distributed lock, and distributed locks are the core of distributed coordination technology.
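The lock discipline described above can be illustrated in miniature with an in-process lock. This is only an analogy: a plain `threading.Lock` stands in for the distributed lock, whereas a real distributed lock must live outside any single process (which is exactly what ZooKeeper provides).

```python
import threading

counter = 0                    # the shared resource
lock = threading.Lock()        # stands in for the distributed lock

def worker(iterations):
    global counter
    for _ in range(iterations):
        with lock:             # acquire the lock before touching the resource
            counter += 1       # exclusive access while the lock is held

# three "processes" competing for one resource, as in the example above
threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 3000: every increment happened under mutual exclusion
```

Without the `with lock:` block, the three workers could interleave their read-modify-write steps and lose updates; with it, access to the shared counter is strictly ordered.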
At present, Google's Chubby and Apache's ZooKeeper are the leading distributed coordination technologies; both are implementations of distributed locks. The lock service provided by ZooKeeper has been tested over a long period in the distributed field, and its reliability and availability have been verified in both theory and practice.
ZooKeeper is a highly available, high-performance, open-source coordination service designed for distributed applications. It provides one fundamental service, distributed locking, and on top of it offers data maintenance and management mechanisms such as a unified naming service, state synchronization, cluster management, distributed message queues, and distributed configuration management.
2. ZooKeeper application examples
1) What is the single point of failure problem?
In a master/slave distributed system, the master node is responsible for task scheduling and distribution while the slave nodes handle the tasks; when the master node fails, the whole application system is paralyzed. This kind of fault is called a single point of failure. The solution is to elect a new master for the cluster, thereby eliminating the distributed system's single point of failure.
2) How does the traditional approach handle a single point of failure, and what are its shortcomings?
The traditional approach uses a standby node that periodically sends ping packets to the primary node. After receiving a ping, the primary node replies with an ACK message. As long as the standby node receives the reply, it considers the primary node healthy and lets it continue to provide service. When the primary node fails, the standby node stops receiving replies, concludes that the primary is down, and takes over as the new primary node to continue providing service.
Although this traditional method solves the single point of failure to some extent, it has a hidden danger: the network. The following can happen: the primary node has not failed, but the network fails while it is replying with the ACK. The standby node, receiving no reply, assumes the primary has failed, takes over its service, and becomes a new primary node. Now there are two primary nodes (dual masters) in the distributed system. Dual masters cause service confusion, and the entire distributed system becomes unavailable. To prevent this, ZooKeeper is introduced.
3) How does ZooKeeper work?
This is explained through three scenarios:
(1) Master startup
After ZooKeeper is introduced into the distributed system, multiple master nodes can be configured. Take two master nodes as an example, master A and master B. When both start, they register their node information with ZooKeeper. Suppose master A registers as master00001 and master B registers as master00002. After registration, an election is held. There are many election algorithms; here the lowest-numbered node wins. The lowest-numbered node obtains the lock and becomes the master, so master A is elected master and master B is blocked as the standby node. In this way, ZooKeeper completes the scheduling of the two master processes and the assignment and cooperation of the primary and standby roles.
(2) Master failure
If master A fails, the node information it registered with ZooKeeper is automatically deleted, and ZooKeeper senses the change. Having detected master A's failure, it launches a new election. This time master B wins and replaces master A as the new master, completing the failover between primary and standby.
(3) Master recovery
If master A recovers, it registers its node information with ZooKeeper again, but the registered information is now master00003 rather than the original. ZooKeeper senses the change and launches another election; master B wins again and remains the master, while master A becomes the standby node.
Through this recurring coordination and scheduling mechanism, ZooKeeper manages and synchronizes the state of the cluster.
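The three scenarios above can be sketched with a hypothetical in-memory registry standing in for ZooKeeper's ephemeral sequential znodes. The names follow the master0000x numbering from the text; a real implementation would use a ZooKeeper client and watches instead of a plain dict.

```python
registry = {}   # znode name -> owner; stands in for ZooKeeper's znode tree
seq = 0

def register(owner):
    """Register a node, receiving the next sequential znode name."""
    global seq
    seq += 1
    znode = f"master{seq:05d}"   # e.g. master00001
    registry[znode] = owner
    return znode

def elect():
    """Lowest-numbered znode wins the election and holds the lock."""
    return registry[min(registry)]

# (1) startup: A and B register, A gets the lower number and wins
a = register("A")              # master00001
b = register("B")              # master00002
print(elect())                 # A is elected master; B waits as standby

# (2) failure: A's znode disappears, a new election is triggered
del registry[a]
print(elect())                 # B takes over as the new master

# (3) recovery: A re-registers, but as master00003, so B keeps the lead
register("A")
print(elect())                 # still B; A becomes the standby node
```

The key property mirrored here is that a recovered node gets a fresh, higher sequence number, so recovery never disrupts the current master.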
4) ZooKeeper cluster architecture
ZooKeeper generally provides services through a cluster architecture; the figure below shows its basic architecture.
The main roles in a ZooKeeper cluster are server and client, and servers are further divided into leader, follower, and observer. Each role's meaning is as follows:
Leader: the leader, mainly responsible for initiating and deciding votes and for updating system state.
Follower: a follower, which receives client requests, returns results to the client, and votes during elections.
Observer: an observer, which receives client requests, forwards write requests to the leader, and synchronizes the leader's state, but does not vote. The purpose of the observer is to scale out the system and improve read scalability.
Client: the client, which initiates requests to ZooKeeper.
Each server in a ZooKeeper cluster keeps a copy of the data in memory. When ZooKeeper starts, one server instance is elected leader, and the leader handles operations such as data updates. A data modification is considered successful if and only if a majority of servers successfully modify the data in memory.
The ZooKeeper write process is as follows: the client first communicates with a server or observer to initiate a write request; that server forwards the write request to the leader, and the leader then forwards it to the other servers. After receiving the write request, each server writes the data and responds to the leader; once the leader has received a majority of successful write responses, it responds to the client, completing the write operation.
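The majority-write rule above can be sketched as follows. This is a deliberately simplified model (no leader forwarding, no retries, servers are plain dicts), meant only to show the strict-majority commit condition.

```python
def quorum_write(servers, value):
    """Apply a write to every reachable server; commit only on a strict majority."""
    acks = 0
    for srv in servers:
        if srv.get("alive", True):   # dead servers never acknowledge
            srv["data"] = value
            acks += 1
    return acks > len(servers) // 2  # strict majority of ALL servers required

cluster = [{"alive": True}, {"alive": True}, {"alive": False}]
print(quorum_write(cluster, "x"))   # True: 2 of 3 servers acknowledged
cluster[1]["alive"] = False
print(quorum_write(cluster, "y"))   # False: only 1 of 3 acknowledged
```

The majority is computed over the full cluster size, not over the reachable servers; this is what lets a ZooKeeper ensemble of 2n+1 nodes tolerate n failures while still refusing writes during a partition that isolates the minority.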
3. Fundamentals and introduction to Kafka
1) Basic concepts of Kafka
Kafka is a high-throughput, distributed publish/subscribe messaging system. That is the official definition, and it can be hard to grasp, so here is a simple example. In the big-data era, all kinds of business, social, search, and browsing activity generate large amounts of data. How to collect this data quickly and analyze it in real time is a problem that must be solved. This yields a business demand model: producers produce various kinds of data, and consumers consume (analyze and process) that data. Facing these demands, how can the production and consumption of data be completed efficiently and stably? This requires a communication bridge between producers and consumers, which is the message system. At the micro level, this business requirement can also be understood as how messages are transmitted between different systems.
Kafka is an open-source system under the Apache organization. Its biggest feature is that it can process large amounts of data in real time to meet a variety of demand scenarios, such as Hadoop-based data analysis, low-latency real-time systems, and Storm/Spark streaming engines. Kafka is now used by many large companies as a data pipeline and messaging system for various kinds of data.
2) Kafka role terminology
Some core concepts and roles of Kafka:
Broker: a Kafka cluster consists of one or more servers, each of which is called a broker.
Topic: every message published to the Kafka cluster has a category called a topic.
Producer: the producer of messages, responsible for publishing messages to a Kafka broker.
Consumer: the consumer of messages, which pulls data from a Kafka broker and consumes the published messages.
Partition: a physical concept. Each topic contains one or more partitions, and each partition is an ordered queue; each message in a partition is assigned an ordered id, its offset.
Consumer Group: a consumer group. Each consumer can be assigned to a consumer group; if none is specified, it belongs to the default group.
Message: the basic unit of communication; each producer can publish messages to a topic.
3) Kafka topology architecture
A typical Kafka cluster contains several producers, several brokers, several consumer groups, and one ZooKeeper cluster. Kafka uses ZooKeeper to manage cluster configuration, elect a leader, and rebalance when a consumer group changes. Producers publish messages to brokers using the push pattern, and consumers subscribe to and consume messages from brokers using the pull pattern. A typical architecture is shown in the following figure:
As the figure shows, a typical message system consists of producers, a storage system (brokers), and consumers. As a distributed message system, Kafka supports multiple producers and consumers. Producers can distribute messages to different partitions on different nodes in the cluster, and consumers can consume multiple partitions on multiple nodes. When writing, multiple producers are allowed to write to the same partition; when reading, however, a partition may be consumed by only one consumer within a consumer group, while one consumer can consume multiple partitions. In other words, partitions are mutually exclusive among consumers within the same consumer group, but shared across different consumer groups.
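The exclusivity rule above, each partition owned by exactly one consumer within a group, can be sketched with a simple round-robin assignment. This is a hypothetical simplification of Kafka's real partition assignors (range, round-robin, sticky), intended only to show the invariant.

```python
def assign(partitions, consumers):
    """Assign each partition to exactly one consumer in the group,
    round-robin; a consumer may own several partitions."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

plan = assign([0, 1, 2, 3], ["c1", "c2"])
print(plan)  # {'c1': [0, 2], 'c2': [1, 3]}
```

Note that within the group no partition appears under two consumers; a second, independent group would run `assign` over the same partitions and get its own full copy of the stream.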
Kafka supports persistent storage of messages; the persisted data is stored in Kafka's log files. After a producer produces a message, Kafka does not pass it directly to consumers; it first stores it on the broker. To reduce disk writes, the broker temporarily caches messages and writes them to disk when the number or size of messages reaches a certain threshold. This improves Kafka's efficiency and reduces the number of disk I/O calls.
Every message in Kafka is written to its partition sequentially. This matters because on a mechanical disk random writes are very slow while sequential writes are very fast; this sequential-write mechanism is an important guarantee of Kafka's high throughput.
4) Topic and partition
Topics in Kafka are stored as partitions, and each topic can set its own number of partitions, which determines the number of logs that make up the topic. The recommended number of partitions should be greater than the number of consumers running at the same time. It is also recommended that the number of partitions be less than or equal to the number of brokers in the cluster, so that message data can be evenly distributed across brokers.
So why does a topic set up multiple partitions? Because Kafka is file-based, spreading message content over multiple brokers prevents a file from reaching the disk limit of a single machine. At the same time, splitting a topic into partitions keeps both message storage and message consumption efficient: more partitions can host more consumers, effectively improving Kafka's throughput. The advantage of splitting a topic into multiple partitions is therefore that large volumes of messages can be divided into batches and written to different nodes simultaneously, spreading the write load across the cluster nodes.
In terms of storage structure, each partition physically corresponds to a folder that stores all of that partition's messages and index files. The partition naming convention is the topic name plus a sequence number; the first partition's sequence number is 0 and the largest is the number of partitions minus 1.
Each partition (folder) contains multiple equally sized segment data files. Each segment is the same size, but individual messages may differ in size, so the number of messages per segment may not be equal. A segment consists of two parts, an index file and a data file, which correspond one-to-one and appear in pairs; the suffixes ".index" and ".log" denote the segment index file and data file respectively.
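The on-disk layout just described can be sketched as two small naming helpers. The partition-folder convention follows the text; for segment files, Kafka names each .index/.log pair after the base offset of the segment, zero-padded (the broker uses 20 digits), which is assumed here.

```python
def partition_dirs(topic, num_partitions):
    """Partition folders are named '<topic>-<seq>', sequence starting at 0."""
    return [f"{topic}-{i}" for i in range(num_partitions)]

def segment_files(base_offset):
    """Each segment is a .index/.log pair named after its base offset,
    zero-padded to 20 digits."""
    name = f"{base_offset:020d}"
    return (f"{name}.index", f"{name}.log")

print(partition_dirs("orders", 3))   # ['orders-0', 'orders-1', 'orders-2']
print(segment_files(170410))
# ('00000000000000170410.index', '00000000000000170410.log')
```

Naming segments by base offset lets the broker locate the segment containing any offset with a binary search over file names, without opening the files.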
5) Producer production mechanism
The producer creates messages and data, and when it sends a message to a broker it chooses which partition to store it in according to the partition mechanism. If the partition mechanism is set up properly, all messages can be evenly distributed across partitions, achieving load balancing. If a topic corresponded to a single file, the machine holding that file would become the topic's performance bottleneck; with partitions, different messages can be written in parallel to different partitions on different brokers, greatly improving throughput.
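A common partition mechanism in the spirit described above is hashing the message key, so that the same key always lands in the same partition while distinct keys spread across partitions. This is a hypothetical stand-in (Kafka's real default partitioner uses murmur2; a stable md5-based hash is used here for illustration):

```python
import hashlib

def choose_partition(key, num_partitions):
    """Map a message key to a partition deterministically and roughly evenly."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

p = choose_partition("user-42", 4)
assert p == choose_partition("user-42", 4)  # same key, same partition
assert 0 <= p < 4
```

Keeping one key in one partition also preserves per-key ordering, since each partition is an ordered queue.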
6) Consumer consumption mechanism
Kafka publishes messages in two modes: queue mode (queuing) and publish/subscribe mode (publish-subscribe). In queue mode there is only one consumer group with multiple consumers, and a message can be consumed by only one consumer in that group. In publish/subscribe mode there can be multiple consumer groups, each with one consumer, and the same message can be consumed by multiple consumer groups.
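The two modes can be sketched with a toy delivery model: within a group, each message goes to exactly one consumer (queue mode); across groups, every group sees every message (publish/subscribe mode). The round-robin pick of a consumer inside each group is an illustrative assumption, not Kafka's actual dispatch logic.

```python
from collections import defaultdict

def deliver(messages, groups):
    """groups maps group name -> list of member consumers.
    Returns (group, consumer) -> messages that consumer received."""
    seen = defaultdict(list)
    for i, msg in enumerate(messages):
        for gname, members in groups.items():
            consumer = members[i % len(members)]  # one consumer per group gets it
            seen[(gname, consumer)].append(msg)
    return dict(seen)

out = deliver(["m1", "m2"], {"g1": ["a", "b"], "g2": ["c"]})
print(out)
# g1 behaves like a queue: a gets m1, b gets m2.
# g2 behaves like a subscriber: its sole consumer c gets both messages.
```

Deploying all consumers in one group thus gives queue semantics, while one consumer per group gives broadcast semantics, exactly the two modes described above.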
Kafka's producer and consumer adopt push and pull modes respectively: the producer pushes messages to the broker, the consumer pulls messages from the broker, and production and consumption proceed asynchronously. One advantage of the pull mode is that the consumer can independently control the rate at which it consumes messages, as well as how it consumes them, whether pulling data from the broker in bulk or consuming it one message at a time.