This article introduces the main functions of Schemaless triggers. Through a practical billing example, it shows how the trigger framework is used and how it handles the failure scenarios you are likely to run into, so read on for the details.
Schemaless triggers are a scalable, fault-tolerant, and lossless way to listen for changes to a Schemaless instance. They act as the engine of the trip flow, from the moment the driver taps "end trip" and the fare is submitted to the system, until the corresponding data lands in the database for analysis. In this final installment of the Schemaless series, we take an in-depth look at what Schemaless triggers do and how we built this scalable, fault-tolerant system.
Put simply, the basic unit of Schemaless data is called a cell. A cell is immutable: once written, it cannot be overwritten (in special cases, old records can be deleted). A cell is referenced by a row key, a column name, and a ref key; its content is updated by writing a new version with a higher ref key while the row key and column name stay the same. Schemaless imposes no structure on the data stored in it (hence the name schemaless); from Schemaless's point of view, it simply stores JSON objects.
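To make the cell model concrete, the sketch below models cells in a plain Python dict. It only illustrates the addressing and versioning rules described above, using hypothetical helper names (put_cell, get_cell_latest); it is not the real Schemaless API.

# A minimal sketch of the cell model described above, using a plain
# in-memory dict. The helper names and structures here are illustrative,
# not the real Schemaless API.
import json

store = {}  # (row_key, column_name) -> list of versions, ordered by ref key

def put_cell(row_key, column_name, ref_key, body):
    # Cells are immutable: we only ever append a new version with a
    # higher ref key; existing versions are never overwritten.
    versions = store.setdefault((row_key, column_name), [])
    if versions and ref_key <= versions[-1][0]:
        raise ValueError("ref_key must be higher than the latest version")
    versions.append((ref_key, json.dumps(body)))  # the content is just a JSON object

def get_cell_latest(row_key, column_name):
    # The latest version is the one with the highest ref key
    ref_key, body = store[(row_key, column_name)][-1]
    return json.loads(body)

# A trip cell addressed by (row key = trip UUID, column name, ref key):
put_cell("trip-uuid-1234", "BASE", 1, {"rider": "r-42", "fare": 12.50})
put_cell("trip-uuid-1234", "BASE", 2, {"rider": "r-42", "fare": 13.00})
print(get_cell_latest("trip-uuid-1234", "BASE"))  # the newest version wins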
A Schemaless trigger example
Let's look at how Schemaless triggers work in practice. The Python code below is a simplified version of our asynchronous billing flow (Schemaless column names are capitalized):
# We instantiate a client to communicate with the Schemaless instance
schemaless_client = SchemalessClient(datastore='mezzanine')

# Register bill_rider as a trigger function for the BASE column
@trigger(column='BASE')
def bill_rider(row_key):
    # row_key is the UUID of the trip
    status = schemaless_client.get_cell_latest(row_key, 'STATUS')
    if status.is_completed:
        # The rider has already been billed
        return

    # Otherwise, try to bill the rider.
    # Get the basic trip information from the BASE column
    trip_info = schemaless_client.get_cell_latest(row_key, 'BASE')

    # Bill the rider
    result = call_to_credit_card_processor_for_billing_trip(trip_info)

    if result != 'SUCCESS':
        # Billing failed; raise so Schemaless triggers retry later
        raise CouldNotBillRider()

    # The rider was billed successfully; write the result back to Mezzanine
    schemaless_client.put(row_key, 'STATUS', body={'is_completed': True, 'result': result})
We define a trigger against a Schemaless instance and specify the column by adding the @trigger decorator to a function. When a new cell appears in that column, the Schemaless trigger framework calls the function, in this case bill_rider. Here, a new cell in BASE indicates that the trip has ended. The trigger fires and the function receives only the row key, in this case the trip UUID. If the function needs more data, it must fetch it from the Schemaless instance itself, in this case from the trip store, Mezzanine.
The information flow of the bill_rider trigger function is shown below for the case where the rider is billed. The direction of each arrow indicates the caller and the callee, and the number next to it indicates the order of the steps:
First, the trip is written to Mezzanine, and the Schemaless trigger framework invokes bill_rider. When called, the function requests the latest STATUS cell from the trip store. In this case the is_completed field does not exist, which means the rider has not been billed yet. The function then fetches the trip information from the BASE column and calls the credit card provider to bill the rider. In this example the credit card charge succeeds, so we write the result back to Mezzanine by setting is_completed to True in the STATUS column.
The trigger framework guarantees that bill_rider is called at least once for every cell in each Schemaless instance. Usually a trigger function fires exactly once, but it may be called multiple times if errors occur, whether in the trigger function itself or elsewhere. This means trigger functions must be idempotent; in this case, the function checks whether the cell has already been processed and, if so, simply returns.
Keep this example in mind as we look at how Schemaless supports this flow. We will explain how Schemaless can be treated as a change log, describe the related APIs, and share the techniques that make the flow scalable and fault-tolerant.
Treat Schemaless as a log
Schemaless stores every cell, which means it keeps all versions for a given row key and column key pair. Because it retains this full history, Schemaless can serve as a change log in addition to being a random-access key-value store. In fact, it is a partitioned log: each shard is its own log, as shown below:
Each cell is written to a specific shard based on its row key (the UUID). Within a shard, every cell has a unique identifier called the added ID, an auto-incrementing field that records the order in which cells were inserted (the newer the cell, the larger its added ID). In addition to the added ID, each cell has a write timestamp (datetime). The added ID of a cell is the same across all replicas of the shard, which is important for failover.
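The sketch below illustrates this sharded-log layout: the row key determines the shard, and each shard assigns monotonically increasing added IDs. The hash-based shard assignment and field names are assumptions for illustration, not Schemaless internals.

# Illustration of a sharded log: each shard is an append-only list that
# assigns its own auto-incrementing added ID. The hash-based shard
# assignment and field names are assumptions, not Schemaless internals.
import hashlib
from datetime import datetime, timezone

NUM_SHARDS = 16  # a real instance typically uses many more (e.g. 4096)
shards = [[] for _ in range(NUM_SHARDS)]

def shard_for(row_key):
    # All cells for the same row key land in the same shard
    return int(hashlib.md5(row_key.encode()).hexdigest(), 16) % NUM_SHARDS

def append_cell(row_key, column_name, ref_key, body):
    shard = shards[shard_for(row_key)]
    added_id = len(shard) + 1  # newer cells get larger added IDs
    shard.append({
        'added_id': added_id,
        'created_at': datetime.now(timezone.utc),
        'row_key': row_key,
        'column_name': column_name,
        'ref_key': ref_key,
        'body': body,
    })
    return added_id

append_cell('trip-uuid-1234', 'BASE', 1, {'fare': 12.50})
append_cell('trip-uuid-1234', 'STATUS', 1, {'is_completed': True})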
The Schemaless API supports both random access and log-style access. The random access API operates on individual cells, each identified by row_key, column_key, and ref_key:
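The original listing of these endpoints did not survive in the source, so the sketch below reconstructs their shape from the calls used in bill_rider above. The exact names and parameters are assumptions rather than the authoritative Schemaless API.

# Sketch of the random-access endpoints, consistent with the calls used
# in bill_rider above. Exact names and parameters are assumptions, not
# the authoritative Schemaless API; batch variants are omitted.
class SchemalessClient:
    def __init__(self, datastore):
        self.datastore = datastore  # e.g. 'mezzanine'

    def get_cell(self, row_key, column_key, ref_key):
        """Fetch the exact cell identified by (row_key, column_key, ref_key)."""

    def get_cell_latest(self, row_key, column_key):
        """Fetch the version with the highest ref key for this row/column."""

    def put(self, row_key, column_key, body):
        """Write a new cell version (a JSON object) for this row/column."""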
Schemaless also provides batch versions of these API endpoints, which we omit here. The bill_rider trigger function shown earlier uses these calls to fetch and update individual cells.
For the log-style access API, we are interested in the shard number and in a cell's timestamp or added ID (collectively referred to as the location):
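The listing of the log-style endpoint also appears to have been dropped from the source; the sketch below is inferred from the description above and from the get_cells_for_shard call discussed next. The parameters and return shape are assumptions.

# Sketch of the log-style endpoint inferred from the surrounding text; the
# name matches get_cells_for_shard below, but the exact parameters and
# return shape are assumptions, and the real API exposes more knobs.
def get_cells_for_shard(shard_number, location, limit=10):
    """Return up to `limit` cells from the given shard, starting at `location`
    (either a timestamp or an added ID), together with the next location to
    poll from. For example, requesting 10 cells at added ID 1000 returns a
    next location of 1010."""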
Like the random access API, the log access API has more knobs for fetching cells from multiple shards at a time, but the endpoint above is the most important one. The location can be either a timestamp or an added_id. A call to get_cells_for_shard returns, in addition to the cells, the next added ID to poll from. For example, calling get_cells_for_shard at location 1000 and requesting 10 cells returns a next location of 1010.
Tailing the log
The log-style access API lets you tail a Schemaless instance the same way you would tail a file in the filesystem (as with tail -f) or poll an event queue for recent changes (as with Kafka). The client keeps track of an offset and supplies it on each poll. To bootstrap the tailer, you can start from the first entry (location 0), from any point in time, or from a saved offset.
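As a rough illustration of that polling pattern (not the trigger framework itself), a client-side tailing loop for a single shard could look like the sketch below, using the get_cells_for_shard shape assumed above.

import time

# A rough client-side tailing loop for one shard, illustrating the polling
# pattern described above. It assumes the get_cells_for_shard sketch from
# earlier; in practice the Schemaless trigger framework does this for you.
def tail_shard(client, shard_number, callback, start_location=0, batch_size=100):
    location = start_location
    while True:
        cells, next_location = client.get_cells_for_shard(
            shard_number, location, batch_size)
        for cell in cells:
            callback(cell)        # e.g. invoke a registered trigger function
        if next_location == location:
            time.sleep(1)         # caught up; wait before polling again
        location = next_location  # remember the offset for the next poll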
Schemaless triggers do exactly this kind of tailing through the log-style API and maintain the offsets themselves. The benefit of using the framework rather than polling the API directly is that it makes the flow scalable and fault-tolerant. A client program hooks into the Schemaless trigger framework by configuring which Schemaless instance and which columns to poll. The registered functions (callbacks) are tied to this data flow and are invoked by the framework whenever a new cell is inserted into the instance. The framework discovers the worker processes running the client program on its host cluster, divides the work among the available processes, and handles failed processes by reassigning their work to the remaining ones. This means the programmer only needs to write a handler (the trigger function) and make sure it is idempotent; Schemaless triggers handle the rest.
Architecture
In this section, we discuss how Schemaless triggers scale and how they minimize the impact of failures. The figure below shows the architecture at a high level, using the billing service from the earlier example:
The billing service uses Schemaless triggers running on three different hosts, and for simplicity we assume each host runs a single worker process. The Schemaless trigger framework partitions the shards among the worker processes, so each worker is responsible for a specific set of shards. Note that process 1 pulls data from shard 1, process 2 pulls from shards 2 and 5, and process 3 pulls from shards 3 and 4. A worker process handles only the cells in its assigned shards: it fetches new cells and calls the registered callback functions for them. One worker process is the designated leader, responsible for assigning shards to workers. If a process dies, the leader reassigns that process's shards to the other processes.
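As a rough illustration of that assignment (not the actual leader election or coordination code), shard-to-worker partitioning with reassignment on failure could look like this:

# A rough sketch of how a leader might partition shards among workers and
# reassign them when a worker fails. This illustrates the idea only; it is
# not the actual Schemaless trigger coordination code.
def assign_shards(shards, workers):
    # Round-robin the shards over the live workers
    assignment = {worker: [] for worker in workers}
    for i, shard in enumerate(shards):
        assignment[workers[i % len(workers)]].append(shard)
    return assignment

workers = ['process-1', 'process-2', 'process-3']
shards = [1, 2, 3, 4, 5]
print(assign_shards(shards, workers))
# {'process-1': [1, 4], 'process-2': [2, 5], 'process-3': [3]}

# If process-3 dies, the leader recomputes the assignment over the
# remaining workers, so every shard keeps being processed.
workers.remove('process-3')
print(assign_shards(shards, workers))
# {'process-1': [1, 3, 5], 'process-2': [2, 4]}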
Within a shard, cells are triggered in write order. This means that if the trigger for a particular cell keeps failing due to a program error, it blocks the processing of later cells in that shard. To avoid such delays, Schemaless triggers can be configured to flag cells that have failed repeatedly and put them in a separate queue, after which the framework moves on to the next cell. If the number of flagged cells exceeds a certain threshold, triggering stops; this usually indicates a systemic error that needs to be fixed manually.
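A simplified model of that flag-and-continue behavior might look like the following. The retry limit and flagged-cell threshold are made-up parameters for illustration; the real framework's policy and configuration knobs may differ.

# A simplified model of the flag-and-continue behavior described above.
# The retry limit and flagged-cell threshold are made-up parameters; the
# real framework's policy and configuration may differ.
MAX_RETRIES_PER_CELL = 5
MAX_FLAGGED_CELLS = 100

flagged_queue = []  # cells set aside for manual inspection

def process_shard_cells(cells, trigger_fn):
    for cell in cells:
        for attempt in range(MAX_RETRIES_PER_CELL):
            try:
                trigger_fn(cell)
                break  # success; move on to the next cell in write order
            except Exception:
                continue
        else:
            # Too many failures: flag the cell and keep the shard moving
            flagged_queue.append(cell)
            if len(flagged_queue) > MAX_FLAGGED_CELLS:
                # Likely a systemic error; stop and require manual repair
                raise RuntimeError("too many flagged cells; stopping triggers")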
Schemaless triggers track their progress by storing, for each shard, the added ID of the last successfully triggered cell. The framework saves these offsets to shared storage, such as ZooKeeper or the Schemaless instance itself, so that if the program restarts, triggering resumes from the stored offsets. The shared storage is also used for meta-information, such as coordinating leader election and detecting when worker processes are added or removed.
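Conceptually, checkpointing the offsets amounts to something like the sketch below. The simple key-value interface shown is an assumption standing in for shared storage such as ZooKeeper, not how the framework actually persists its state.

# Conceptual sketch of offset checkpointing. This simple key-value
# interface stands in for shared storage such as ZooKeeper; it is not how
# the framework actually talks to its coordination store.
class OffsetStore:
    def __init__(self):
        self._offsets = {}

    def save(self, instance, column, shard, added_id):
        self._offsets[(instance, column, shard)] = added_id

    def load(self, instance, column, shard):
        # Start from the beginning if no offset was ever stored
        return self._offsets.get((instance, column, shard), 0)

offsets = OffsetStore()
offsets.save('mezzanine', 'BASE', shard=2, added_id=1010)
# After a restart, the worker resumes from the stored offset:
print(offsets.load('mezzanine', 'BASE', shard=2))  # 1010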
Scalability and fault tolerance
Schemaless triggers are designed for scalability. For any client program tailing a Schemaless instance, we can add worker processes up to the number of shards (typically 4096). In addition, workers can be added or removed on the fly to handle changing load, independently of the instance's other trigger clients. Because progress is tracked by the framework on the client side, we can attach as many client programs as we like to a Schemaless instance; there is no server-side logic for tracking clients or pushing state to them.
Schemaless triggers are also fault-tolerant: no single process failure brings down the system.
If a client-side worker process fails, the leader redistributes its work so that all shards continue to be processed.
If the leader node of the Schemaless trigger framework fails, a new node is elected leader. While the election is under way, cells continue to be processed, but work cannot be reassigned and processes cannot be added or removed.
If the shared storage (such as ZooKeeper) fails, cell processing continues. However, just as during a leader election, work cannot be reassigned and the set of processes cannot change while the shared storage is down.
Finally, failures within the tracked Schemaless instance do not bring down the Schemaless trigger framework: if a database node fails, the triggers can read cells from a replica instead.
That concludes our overview of the main functions of Schemaless triggers. Thank you for reading.