What are the concepts and related terms of Apache BookKeeper

Updated 2025-07-12. From: SLTechnology News&Howtos (shulou)

This article introduces the concepts behind Apache BookKeeper and the terms associated with it. The editor finds the material very practical and hopes you will get something out of reading it.

Apache BookKeeper is an enterprise-grade storage system designed to provide strong durability, consistency, and low latency. It was originally developed at Yahoo! Research to provide high availability for the Hadoop Distributed File System (HDFS) NameNode. Before that, the NameNode had no high-availability features and was a single point of failure.

Background

BookKeeper's developers, Benjamin Reed, Flavio Junqueira and Ivan Kelly, drew on their experience building ZooKeeper to design a flexible system that can support a variety of workloads.

Originally, BookKeeper was a write-ahead log (WAL) mechanism for distributed systems. It has since grown into a basic building block that underpins multiple enterprise systems, such as Twitter's EventBus and Apache Pulsar (which originated at Yahoo).

What is BookKeeper?

BookKeeper is a storage service optimized for real-time workloads that is scalable, highly fault tolerant, and low latency. Based on our years of experience, an enterprise-grade real-time storage platform should meet the following requirements:

Read and write streams of entries with very low latency (less than 5 milliseconds)

Ability to store data in a persistent, consistent, fault-tolerant manner

Ability to stream or tail data as it is being written

Effectively store and access historical data and real-time data

BookKeeper is designed to meet these requirements and is widely used in a variety of use cases: providing high availability or replication for distributed systems (such as the HDFS NameNode and Twitter's Manhattan key-value store); cross-machine replication within a single cluster or across multiple clusters (multiple data centers); serving as the storage service for publish/subscribe (pub-sub) messaging systems such as Twitter's EventBus and Apache Pulsar; and storing immutable objects (for example, snapshots of checkpoint data) for stream-processing jobs.

BookKeeper concepts and terminology

BookKeeper replicates and persists log streams. A log stream is a well-ordered sequence of records.

✏️ record

Data is written to an Apache BookKeeper log as a sequence of indivisible records rather than as individual bytes. A record is the smallest I/O unit in BookKeeper and is also the unit of addressing. Each record carries a sequence number (such as a monotonically increasing long) that is associated with or assigned to it.

A client always either starts reading from a specific record or tails the sequence. Tailing means the client listens on the sequence for the next record to be appended to the log. The client can receive one record at a time or a block of data containing multiple records. Sequence numbers can also be used for random access to records.
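The read patterns described above can be illustrated with a minimal in-memory sketch. This is a toy model, not the BookKeeper client API: the names `Record`, `ToyLog`, `append`, `read_from` and `get` are all hypothetical, and sequence numbers are assumed to start at 0.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    seq: int        # sequence number assigned when the record is appended
    payload: bytes  # opaque body, written as one indivisible unit


class ToyLog:
    """In-memory stand-in for a log (hypothetical, not BookKeeper's API)."""

    def __init__(self) -> None:
        self._records: list[Record] = []

    def append(self, payload: bytes) -> int:
        # Assign the next monotonically increasing sequence number.
        seq = len(self._records)
        self._records.append(Record(seq, payload))
        return seq

    def read_from(self, start_seq: int, max_records: int = 1) -> list[Record]:
        # Return up to max_records starting at start_seq. An empty list
        # means the reader has caught up with the tail of the log.
        return self._records[start_seq:start_seq + max_records]

    def get(self, seq: int) -> Record:
        # Random access by sequence number.
        if not 0 <= seq < len(self._records):
            raise KeyError(seq)
        return self._records[seq]
```

A tailing reader in this model would remember the last sequence number it saw and call `read_from(last_seq + 1, ...)` repeatedly; a real client would be notified of new records instead of polling.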

✏️ log

BookKeeper provides two storage primitives for logs: the ledger (also called a log segment) and the stream (also known as a log stream).

A ledger stores a sequence of data records (log entries). A ledger is terminated either when the client explicitly closes it or when the client acting as its writer crashes. Records still being written at that moment may be lost, while data already stored in the ledger is retained. Once a ledger is closed it is immutable, that is, no further data records can be appended to it.

BookKeeper ledger: a bounded sequence of data entries
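The append-then-seal life cycle of a ledger can be sketched as follows. This is a simplified in-memory model under stated assumptions, not the real Ledger API: `ToyLedger`, `LedgerClosedError`, `add_entry`, `close` and `read` are hypothetical names, and entry ids are assumed to start at 0.

```python
class LedgerClosedError(Exception):
    """Raised when appending to a ledger that has already been closed."""


class ToyLedger:
    """Bounded, append-only sequence of entries (hypothetical model)."""

    def __init__(self, ledger_id: int) -> None:
        self.ledger_id = ledger_id
        self._entries: list[bytes] = []
        self._closed = False

    def add_entry(self, data: bytes) -> int:
        # Appends are only allowed while the ledger is still open.
        if self._closed:
            raise LedgerClosedError(f"ledger {self.ledger_id} is closed")
        self._entries.append(data)
        return len(self._entries) - 1  # id of the entry just written

    def close(self) -> int:
        # Seal the ledger; returns the last entry id (-1 if empty).
        self._closed = True
        return len(self._entries) - 1

    def read(self, first: int, last: int) -> list[bytes]:
        # Reading works on both open and closed ledgers.
        return self._entries[first:last + 1]
```

The point of the sketch is the asymmetry: closing never removes stored entries, it only forbids further appends.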

A stream (also known as a log stream) is an unbounded, infinite sequence of data records. By default, a stream is never terminated. A stream differs from a ledger: a ledger can be opened for appending records only once, while a stream can be opened for appending multiple times.

A stream consists of multiple ledgers; ledgers are rolled over according to a time- or space-based rolling policy. A stream may exist for a relatively long time (days, months, or even years) before it is deleted. The primary data retention mechanism for a stream is truncation, which deletes the oldest ledgers according to a time- or space-based retention policy.

BookKeeper stream: an unbounded stream of data records
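How rolling and truncation fit together can be shown with a small sketch. It is a toy model only: `ToyStream` and its parameters are hypothetical, each ledger is represented as a plain list, and a record count stands in for the time- or space-based thresholds a real policy would use.

```python
class ToyStream:
    """Unbounded stream built from bounded ledgers (hypothetical model)."""

    def __init__(self, max_records_per_ledger: int, max_ledgers_retained: int) -> None:
        if max_records_per_ledger < 1 or max_ledgers_retained < 1:
            raise ValueError("both limits must be at least 1")
        self.max_records_per_ledger = max_records_per_ledger
        self.max_ledgers_retained = max_ledgers_retained
        # Oldest ledger first; the last list is the ledger open for writes.
        self.ledgers: list[list[bytes]] = [[]]

    def append(self, record: bytes) -> None:
        # Rolling policy: once the open ledger is full, seal it and
        # start a new one before writing the record.
        if len(self.ledgers[-1]) >= self.max_records_per_ledger:
            self.ledgers.append([])
        self.ledgers[-1].append(record)

    def truncate(self) -> int:
        # Retention policy: drop the oldest ledgers beyond the limit.
        # The open ledger is never dropped because the limit is >= 1.
        excess = max(0, len(self.ledgers) - self.max_ledgers_retained)
        del self.ledgers[:excess]
        return excess
```

Truncation in this model removes whole ledgers rather than individual records, which matches the description of deleting the oldest ledgers.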

Ledgers and streams provide a unified storage abstraction for historical and real-time data. As data is written, a log can be streamed or tailed to deliver real-time records. Once stored in ledgers, real-time data becomes historical data. The amount of data a stream can accumulate is not limited by the capacity of a single machine.

✏️ Namespace

Typically, users group and manage log streams under a namespace. A namespace is the mechanism a tenant uses to create streams, and it is also a unit of deployment or administration. Users can configure data placement policies at the namespace level.

All streams in the same namespace share the namespace's settings and store their records on storage nodes chosen according to the data placement policy. This provides strong support for managing many streams at the same time.

✏️ Bookies

Bookies are the storage servers. A bookie is an individual BookKeeper storage server that stores data records. BookKeeper replicates and stores data entries across bookies. For performance reasons, each bookie stores fragments of ledgers rather than entire ledgers.

Each bookie therefore acts as part of an ensemble. For any given ledger L, the ensemble is the set of bookies that stores the entries in L. When entries are written to a ledger, they are striped across the ensemble (each entry is written to a subgroup of bookies rather than to all of them).
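One simple way to picture striping is a round-robin assignment of entries to bookies. The sketch below is an illustration under assumptions, not a statement of how BookKeeper itself chooses bookies: the function name `write_set`, the `write_quorum` parameter and the bookie names are hypothetical.

```python
def write_set(entry_id: int, ensemble: list[str], write_quorum: int) -> list[str]:
    """Pick the subgroup of bookies that stores one entry.

    Assumes round-robin striping: the subgroup for each successive
    entry starts one position further along the ensemble.
    """
    if not 1 <= write_quorum <= len(ensemble):
        raise ValueError("write_quorum must be between 1 and the ensemble size")
    start = entry_id % len(ensemble)
    return [ensemble[(start + i) % len(ensemble)] for i in range(write_quorum)]


# Hypothetical three-bookie ensemble, each entry stored on two bookies.
ensemble = ["bookie-0", "bookie-1", "bookie-2"]
for entry_id in range(4):
    print(entry_id, write_set(entry_id, ensemble, write_quorum=2))
```

With this scheme every bookie holds only a fragment of the ledger, yet each entry is still replicated on more than one machine.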

✏️ metadata

BookKeeper requires a metadata storage service to hold information about ledgers and available bookies. Currently, BookKeeper uses ZooKeeper for this, and also relies on it for tasks such as coordination and configuration management.
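The two kinds of information kept in the metadata store can be sketched with a toy in-memory class. This is not ZooKeeper and not BookKeeper's metadata layout: `ToyMetadataStore`, its methods, the dictionary fields and the example addresses are all hypothetical, and picking the first bookies in sorted order is an arbitrary simplification.

```python
class ToyMetadataStore:
    """Tracks available bookies and per-ledger metadata (hypothetical)."""

    def __init__(self) -> None:
        self.available_bookies: set[str] = set()
        self.ledgers: dict[int, dict] = {}
        self._next_ledger_id = 0

    def register_bookie(self, address: str) -> None:
        # A bookie advertises itself so that clients can discover it.
        self.available_bookies.add(address)

    def unregister_bookie(self, address: str) -> None:
        self.available_bookies.discard(address)

    def create_ledger(self, ensemble_size: int) -> int:
        # Record which bookies will hold the new ledger's entries.
        candidates = sorted(self.available_bookies)
        if len(candidates) < ensemble_size:
            raise RuntimeError("not enough bookies available")
        ledger_id = self._next_ledger_id
        self._next_ledger_id += 1
        self.ledgers[ledger_id] = {
            "ensemble": candidates[:ensemble_size],
            "closed": False,
        }
        return ledger_id
```

A client that later opens the ledger for reading would look up its ensemble here first and then contact those bookies for the entries.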

Interacting with BookKeeper

When interacting with bookies, a BookKeeper application plays one of two main roles: it either creates a ledger or stream to write data, or opens a ledger or stream to read data. BookKeeper provides two APIs for working with these two storage primitives.

Ledger API: a lower-level API that lets users interact with ledgers directly. It offers great flexibility, and users can interact with bookies in whatever way they need.

Stream API: a higher-level, stream-oriented API, implemented through Apache DistributedLog. Users can interact with streams without having to manage the complexity of interacting with ledgers.

Which API to use depends on how fine-grained the control over ledger semantics needs to be. Users can also use both APIs in a single application.

Putting it all together

The following figure is an example of a typical installation of BookKeeper.

A few points to note about the figure above:

A typical BookKeeper installation includes a metadata store (such as ZooKeeper), a cluster of bookies, and multiple clients that interact with the bookies through the provided client libraries.

So that clients can discover it, each bookie advertises itself to the metadata store.

Bookies interact with the metadata store to garbage-collect deleted data.

Applications interact with BookKeeper through the provided client libraries (using either the ledger API or the DistributedLog Stream API).

Application 1 needs fine-grained control over ledgers, so it uses the ledger API directly.

Application 2 does not need lower-level ledger control, so it uses the simpler log stream API.

These are the concepts and related terms of Apache BookKeeper. The editor believes they cover points we may see or use in our daily work, and hopes you can learn more from this article.
