What is the state machine of BlueStore transactions

2025-07-19 Update. From: SLTechnology News&Howtos (shulou)

This article explains the state machine that BlueStore transactions pass through.

Preface

BlueStore can be thought of as a local, journaling file system that supports ACID semantics. All reads and writes are carried out inside a Transaction. Because overwrites are supported, the write path is relatively complex and involves a series of state transitions. This article focuses on the state machine, its latency metrics, and how BlueStore preserves IO ordering while allowing concurrency.

State machine

queue_transactions

queue_transactions is the unified entry point of the ObjectStore layer; KVStore, MemStore, FileStore and BlueStore each implement this interface. The state_t state variable records which state a transaction is in at any given moment. The state is initialized to STATE_PREPARE when the TransContext is created, and _txc_add_transaction then handles each operation according to its opcode. At the same time, the OpSequencer of the PG is obtained (each PG has one OpSequencer) to guarantee that IO on that PG executes serially. For a deferred write, the data is written to RocksDB as a WAL record.

From this point on we are inside the BlueStore state machine. The sections below walk through each state, following the write path.

STATE_PREPARE

The transaction enters the state machine at STATE_PREPARE. This phase calls _txc_add_transaction to convert the OSD-level transaction into a BlueStore-level transaction, and then checks whether there is any pending AIO. If there is, the state is set to STATE_AIO_WAIT, _txc_aio_submit is called to submit the IO, and the state machine is exited; when the AIO completes, the callback txc_aio_finish re-enters the state machine. If there is no pending AIO, processing continues directly with the next state.

_txc_aio_submit call stack:

bdev->aio_submit -> KernelDevice::aio_submit -> io_submit, which submits the AIO to the kernel's Libaio queue.

Main work: preparation, such as initializing the TransContext and deferred_txn and allocating disk space.

Latency metric: l_bluestore_state_prepare_lat, measured from entering the state machine to the end of the prepare phase. The average is about 0.2 ms.

STATE_AIO_WAIT

In this state, _txc_finish_io is called to handle tasks such as preserving IO order for SimpleWrite. The state is then set to STATE_IO_DONE and _txc_state_proc is called to move on to the next state.

Main work: preserve IO order and wait for the AIO to complete.

Latency metric: l_bluestore_state_aio_wait_lat, measured from the end of the prepare phase to AIO completion. The average is bounded by the device; on SSD it is about 0.03 ms.

STATE_IO_DONE

The AIO has completed and the transaction moves on to STATE_KV_QUEUED. What happens next depends on bluestore_sync_submit_transaction, a Boolean option that defaults to false.

If it is true, the state is set to STATE_KV_SUBMITTED and the kv transaction is submitted to RocksDB from the current thread with submit_transaction (submitted, but not synced), followed by _txc_applied_kv.

If it is false, that step is skipped.

In both cases, the transaction is finally put into kv_queue, and kv_sync_thread is notified through kv_cond so that it can sync the IO and the metadata.

Main work: put the transaction into kv_queue and notify kv_sync_thread. Preserving IO order within the osr may cause blocking here.

Latency metric: l_bluestore_state_io_done_lat. The average is about 0.004 ms, which is usually very small; most of it is spent on the order-preserving handling of SimpleWrite.

STATE_KV_QUEUED

In this state, kv_sync_thread syncs the IO and the metadata and sets the state to STATE_KV_SUBMITTED. kv_sync_thread itself is analyzed in the section on threads and queues below.

Main work: kv_sync_thread takes the transaction out of kv_queue.

Latency metric: l_bluestore_state_kv_queued_lat, measured from the moment the transaction enters the queue until it is taken out. The average is about 0.08 ms. Because the queue is drained sequentially by a single thread, this latency depends on how fast kv_sync_thread processes transactions.

STATE_KV_SUBMITTED

The transaction waits for kv_sync_thread to finish syncing the KV metadata and IO data. The state is then set to STATE_KV_DONE and the finisher thread is called back.

Main work: wait for the sync of KV metadata and IO data to complete, then call back the finisher thread.

Latency metric: l_bluestore_state_kv_committing_lat, measured from leaving the queue to the completion of the kv sync. The average is about 1.0 ms, which leaves a lot of room for optimization.

STATE_KV_DONE

For a SimpleWrite, the state is set directly to STATE_FINISHING. For a DeferredWrite, the state is set to STATE_DEFERRED_QUEUED and the transaction is put into deferred_queue.

Main work: as above.

Latency metric: l_bluestore_state_kv_done_lat. The average is about 0.0002 ms, which is negligible.

STATE_DEFERRED_QUEUED

Main work: put the deferred IO into deferred_queue and wait for submission.

Latency metric: l_bluestore_state_deferred_queued_lat. It is usually not small; no figure is given here because no data was collected.

STATE_DEFERRED_CLEANUP

Main work: clean up the WAL records of the deferred IO in RocksDB.

Latency metric: l_bluestore_state_deferred_cleanup_lat. It is usually not small; no figure is given here because no data was collected.

STATE_FINISHING

Main work: set the state to STATE_DONE and, if any DeferredIO is pending, submit it.

Latency metric: l_bluestore_state_finishing_lat. The average is about 0.001 ms.

STATE_DONE

Main work: mark the entire IO as complete.

Latency metric: l_bluestore_state_done_lat.
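The order of states described above can be summarized in a small sketch. This is a simplified model, not BlueStore's code: the enum values mirror the state names used in this article, while `TxcState`, `next_state`, `walk` and the `deferred` flag are hypothetical names introduced for illustration. The real _txc_state_proc also performs the work of each state and may return early while it waits for a callback.

```cpp
#include <cassert>
#include <vector>

// State names follow the article; the names of the types and helpers are assumptions.
enum class TxcState {
  PREPARE, AIO_WAIT, IO_DONE, KV_QUEUED, KV_SUBMITTED, KV_DONE,
  DEFERRED_QUEUED, DEFERRED_CLEANUP, FINISHING, DONE
};

// Hypothetical helper: returns the state that follows `s`.
// `deferred` says whether the transaction carries deferred IO.
TxcState next_state(TxcState s, bool deferred) {
  switch (s) {
    case TxcState::PREPARE:          return TxcState::AIO_WAIT;
    case TxcState::AIO_WAIT:         return TxcState::IO_DONE;
    case TxcState::IO_DONE:          return TxcState::KV_QUEUED;
    case TxcState::KV_QUEUED:        return TxcState::KV_SUBMITTED;
    case TxcState::KV_SUBMITTED:     return TxcState::KV_DONE;
    case TxcState::KV_DONE:
      // SimpleWrite finishes directly; DeferredWrite takes the deferred path.
      return deferred ? TxcState::DEFERRED_QUEUED : TxcState::FINISHING;
    case TxcState::DEFERRED_QUEUED:  return TxcState::DEFERRED_CLEANUP;
    case TxcState::DEFERRED_CLEANUP: return TxcState::FINISHING;
    case TxcState::FINISHING:        return TxcState::DONE;
    case TxcState::DONE:             return TxcState::DONE;
  }
  return TxcState::DONE;  // not reached; keeps compilers quiet
}

// Walks a transaction from PREPARE to DONE and records every state visited.
std::vector<TxcState> walk(bool deferred) {
  std::vector<TxcState> path{TxcState::PREPARE};
  while (path.back() != TxcState::DONE) {
    path.push_back(next_state(path.back(), deferred));
  }
  return path;
}
```

Calling `walk(false)` yields the path of a SimpleWrite, and `walk(true)` adds the two deferred states between KV_DONE and FINISHING.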

Latency analysis

BlueStore defines latency metrics for the states of the state machine. They are collected through PerfCounters and registered in BlueStore::_init_logger().

You can run ceph daemon osd.0 perf dump or ceph daemonperf osd.0 to see the corresponding latencies.

In addition to the delay for each state, we usually focus on the following two latency metrics:

    b.add_time_avg(l_bluestore_kv_lat, "kv_lat",
                   "Average kv_thread sync latency", "k_l");
    b.add_time_avg(l_bluestore_commit_lat, "commit_lat",
                   "Average commit latency", "c_l");

BlueStore latency is mainly spent in l_bluestore_state_kv_committing_lat, and therefore shows up in c_l; it is probably around 1 ms.

BlueStore measures the latency of each state of the state machine as follows:

    // latency of this state = time from the end of the previous state
    // to the end of this state
    void log_state_latency(PerfCounters *logger, int state) {
      utime_t lat, now = ceph_clock_now();
      lat = now - last_stamp;
      logger->tinc(state, lat);
      last_stamp = now;
    }
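The same bookkeeping can be tried outside Ceph with a minimal sketch. It assumes nothing from the Ceph code base: `StateTimer` and `avg_us` are hypothetical names, std::chrono stands in for utime_t, and two maps stand in for PerfCounters. The timestamp is passed in explicitly so that the behaviour is deterministic.

```cpp
#include <cassert>
#include <chrono>
#include <map>

// Hypothetical stand-in for PerfCounters: accumulates time per state id.
struct StateTimer {
  using Clock = std::chrono::steady_clock;

  Clock::time_point last_stamp;            // end of the previous state
  std::map<int, Clock::duration> total;    // summed latency per state
  std::map<int, long> count;               // number of samples per state

  explicit StateTimer(Clock::time_point start) : last_stamp(start) {}

  // Charge the time since the previous call to `state`, then restart the clock.
  void log_state_latency(int state, Clock::time_point now) {
    total[state] += now - last_stamp;
    count[state] += 1;
    last_stamp = now;
  }

  // Average latency of `state` in microseconds (0 if never recorded).
  double avg_us(int state) const {
    auto t = total.find(state);
    auto c = count.find(state);
    if (t == total.end() || c == count.end() || c->second == 0) return 0.0;
    return std::chrono::duration<double, std::micro>(t->second).count() / c->second;
  }
};
```

Because every call resets last_stamp, each interval is charged to exactly one state, so the per-state latencies of one transaction add up to its total time in the state machine.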

In block storage scenarios, users issue not only concurrent IO but also serial IO, for example with dd. Serial IO is bounded by the absolute latency of each read and write, so scale-out measures such as adding capacity or increasing the number of threads do not help. We therefore need to look at two kinds of latency: concurrent IO latency and serial IO latency.

Concurrent IO latency optimization: make kv_sync_thread and kv_finalize_thread multithreaded; use a custom WAL; use async read.

Serial IO latency optimization: submit metadata and data in parallel; run the sync operation in parallel with other states.

IO order preservation

Guaranteeing both ordering and concurrency of IO is an unavoidable problem in distributed storage. Because BlueStore uses asynchronous IO, an IO submitted later may complete before one submitted earlier, so preserving IO order is essential to prevent data corruption. A client may issue a continuous stream of read and write requests to an object in a PG, and each request corresponds to one Transaction. At the OSD level, concurrent requests are serialized per PG by the PG lock and then submitted to the ObjectStore layer in order. The ObjectStore layer in turn uses the PG's OpSequencer to guarantee that requests are processed in order.

BlueStore has two types of writes, SimpleWrite and DeferredWrite. The following sections look at how order is preserved for each.

SimpleWrite

Because Libaio is used in the STATE_AIO_WAIT phase, AIOs may complete out of order. BlueStore must therefore ensure that the txcs in the PG's OpSequencer enter kv_queue in the order in which they were queued, and are processed by kv_sync_thread in that order. In other words, the order of txcs in the OpSequencer must match their order in kv_queue.
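The ordering rule can be modelled in isolation before reading the real function. The sketch below is a simplified illustration under stated assumptions: `MiniSequencer`, `finish_io` and the three-value `State` enum are hypothetical, a vector index stands in for a txc, and locking is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum State { AIO_WAIT = 0, IO_DONE = 1, KV_QUEUED = 2 };

// Hypothetical, simplified sequencer: `q` holds the state of each transaction
// in submission order; `kv_queue` records indices in the order they are queued.
struct MiniSequencer {
  std::vector<State> q;
  std::vector<std::size_t> kv_queue;

  // Called when the AIO of transaction `i` completes.
  void finish_io(std::size_t i) {
    q[i] = IO_DONE;
    std::size_t p = i;
    while (p > 0) {
      --p;
      if (q[p] < IO_DONE) return;           // an earlier txc is still waiting: stop
      if (q[p] > IO_DONE) { ++p; break; }   // earlier txc is already queued
    }
    // p now points at the earliest txc in IO_DONE; release the contiguous run.
    while (p < q.size() && q[p] == IO_DONE) {
      q[p] = KV_QUEUED;
      kv_queue.push_back(p);
      ++p;
    }
  }
};
```

With three transactions whose AIOs complete in the order 2, 0, 1, nothing is queued after the first completion, transaction 0 is queued after the second, and transactions 1 and 2 are queued together after the third, so kv_queue ends up in submission order.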

    void BlueStore::_txc_finish_io(TransContext *txc)
    {
      // get the OpSequencer that txc belongs to and lock it,
      // so that access to the osr is mutually exclusive
      OpSequencer *osr = txc->osr.get();
      std::lock_guard l(osr->qlock);

      // set the state to STATE_IO_DONE
      txc->state = TransContext::STATE_IO_DONE;

      // clear the aios that were running for this txc
      txc->ioc.running_aios.clear();

      // locate the current txc in the osr
      OpSequencer::q_list_t::iterator p = osr->q.iterator_to(*txc);
      while (p != osr->q.begin()) {
        --p;
        // If an earlier txc has not finished its IO, stop processing the
        // current txc and wait for the earlier one to complete.
        // This ensures that all IO before this txc has completed.
        if (p->state < TransContext::STATE_IO_DONE) {
          return;
        }
        // The earlier txc is already in STATE_KV_QUEUED or later,
        // so increment p and leave the loop.
        // The goal is to find the earliest txc in the osr that is in STATE_IO_DONE.
        if (p->state > TransContext::STATE_IO_DONE) {
          ++p;
          break;
        }
      }

      // process the txcs in STATE_IO_DONE in order:
      // put each txc into kv_sync_thread's kv_queue and kv_queue_unsubmitted queues
      do {
        _txc_state_proc(&*p++);
      } while (p != osr->q.end() && p->state == TransContext::STATE_IO_DONE);
      ...
    }

DeferredWrite

A DeferredWrite also writes its data by submitting it to the kernel's Libaio queue, so the order of IO must be guaranteed here as well.

The corresponding data structures are as follows:

    class BlueStore {
      typedef boost::intrusive::list<
        OpSequencer,
        boost::intrusive::member_hook<
          OpSequencer,
          boost::intrusive::list_member_hook<>,
          &OpSequencer::deferred_osr_queue_item> > deferred_osr_queue_t;

      // osr's with deferred io pending
      deferred_osr_queue_t deferred_queue;
    };

    class OpSequencer {
      DeferredBatch *deferred_running = nullptr;
      DeferredBatch *deferred_pending = nullptr;
    };

    struct DeferredBatch {
      OpSequencer *osr;

      // txcs in this batch
      deferred_queue_t txcs;
    };

BlueStore has a member variable deferred_queue, which holds the OpSequencers that have DeferredIO to execute. Each OpSequencer has two DeferredBatch pointers, deferred_running and deferred_pending, and each DeferredBatch holds a list of txcs.

When a PG has a write request, the txc is appended to deferred_pending in the PG's OpSequencer. At the appropriate time, all txcs in the batch are submitted to Libaio at once, and the next batch is not submitted until the current one has completed, so DeferredIO cannot go out of order.
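The pending/running hand-off can be illustrated with a small model. It is a sketch under assumptions rather than BlueStore's implementation: `MiniDeferred`, `try_submit` and `on_batch_done` are hypothetical names, integers stand in for txcs, and a vector of batches stands in for the submissions to Libaio.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical, simplified model of the deferred batches of one OpSequencer.
struct MiniDeferred {
  std::unique_ptr<std::vector<int>> pending;  // txc ids waiting to be submitted
  std::unique_ptr<std::vector<int>> running;  // batch currently in flight
  std::vector<std::vector<int>> submitted;    // record of batches, in submit order

  // Append a txc to the pending batch, creating the batch on demand.
  void queue(int txc_id) {
    if (!pending) pending = std::make_unique<std::vector<int>>();
    pending->push_back(txc_id);
  }

  // Submit the pending batch, but only if no batch is in flight.
  bool try_submit() {
    if (running || !pending) return false;
    running = std::move(pending);     // pending is now empty (null)
    submitted.push_back(*running);    // stands in for the single AIO submission
    return true;
  }

  // Completion callback of the in-flight batch.
  void on_batch_done() { running.reset(); }
};
```

Because at most one batch per sequencer is in flight, a txc queued while a batch is running simply waits in the next pending batch, which is what keeps deferred writes of one PG in order.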

    void BlueStore::_deferred_queue(TransContext *txc)
    {
      deferred_lock.lock();
      // queue the osr
      if (!txc->osr->deferred_pending && !txc->osr->deferred_running) {
        deferred_queue.push_back(*txc->osr);
      }
      // append txc to deferred_pending
      txc->osr->deferred_pending->txcs.push_back(*txc);
      _deferred_submit_unlock(txc->osr.get());
    }

    void BlueStore::_deferred_submit_unlock(OpSequencer *osr)
    {
      ...
      // switch the pointers so that the next batch is not submitted
      // until the current one has completed
      osr->deferred_running = osr->deferred_pending;
      osr->deferred_pending = nullptr;
      ...
      while (true) {
        ...
        // stage the data of every txc in the write buffer
        int r = bdev->aio_write(start, bl, &b->ioc, false);
      }
      ...
      // submit all txcs at once
      bdev->aio_submit(&b->ioc);
    }

Threads and queues

Threads plus queues are the basis for implementing asynchronous operations. As a single IO moves through the BlueStore state machine, it passes through several queues, is processed by different threads, and is finally called back. Threads and queues are therefore an important part of the state machine. BlueStore has roughly seven kinds of threads:

mempool_thread: no queue. Monitors memory usage in the background and trims when the limit is exceeded.

aio_thread: queue is the kernel Libaio queue. Reaps completed AIO events.

discard_thread: queue is discard_queued. Trims extents on SSDs.

kv_sync_thread: queues are kv_queue, deferred_done_queue and deferred_stable_queue. Syncs metadata and data.

kv_finalize_thread: queues are kv_committing_to_finalize and deferred_stable_to_finalize. Performs cleanup.

deferred_finisher: invokes the callback that submits DeferredIO requests.

finishers: multiple Finisher callback threads that notify the user that a request has completed.

We mainly analyze aio_thread, kv_sync_thread and kv_finalize_thread.

aio_thread

aio_thread is relatively simple and belongs to the KernelDevice module. Its main job is to reap completed AIO events and trigger their callbacks.
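The "callback fires once the last AIO of a context has completed" idea can be sketched as follows. This is a toy model under assumptions: `MiniIOContext`, `submit` and `complete_one` are hypothetical names, and a std::function stands in for the callback registered with the block device.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-in for an IO context that groups several AIOs.
struct MiniIOContext {
  int num_running = 0;                 // AIOs submitted but not yet completed
  std::function<void()> on_all_done;   // fired when the last AIO completes

  // Record that `n` more AIOs were submitted for this context.
  void submit(int n) { num_running += n; }

  // Called by the reaper for every completed AIO of this context.
  void complete_one() {
    assert(num_running > 0);
    if (--num_running == 0 && on_all_done) {
      on_all_done();
    }
  }
};
```

In this model, a transaction that issued three AIOs gets exactly one callback, after the third completion, regardless of the order in which the individual AIOs finish.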

    void KernelDevice::_aio_thread()
    {
      while (!aio_stop) {
        ...
        // get the completed aios
        int r = aio_queue.get_next_completed(cct->_conf->bdev_aio_poll_ms, aio, max);
        // set the flush flag to true
        io_since_flush.store(true);
        // get the return value of the aio
        long r = aio[i]->get_return_value();
        ...
        // call the aio completion callback
        if (ioc->priv) {
          if (--ioc->num_running == 0) {
            aio_callback(aio_callback_priv, ioc->priv);
          }
        }
      }
    }

Latency metrics involved: state_aio_wait_lat, state_io_done_lat.

kv_sync_thread

When an IO completes, either the txc or the dbh is put into a queue. The queues differ, but all subsequent processing is done by kv_sync_thread.

For a SimpleWrite, the data is written to a newly allocated disk block. (Copy-on-write also writes a new block; the kv operations in the transaction additionally release the old block.) The block is therefore written first, through aio_thread, and the metadata is synced afterwards by kv_sync_thread. No matter when a crash occurs, the data is not corrupted.

For a DeferredWrite, the data to be deferred is written during the prepare phase of the transaction as a kv pair into the db_transaction, a wrapper around RocksDB. This record is the WAL. At that point it exists only in memory; the first commit performed by kv_sync_thread persists the WAL in the kv store, and only then do the subsequent operations proceed. After an abnormal shutdown the WAL can be replayed, so no data is corrupted.

The main work of kv_sync_thread is as follows. After Libaio has written the data, kv_sync_thread must update the kv metadata, which mainly consists of the object's Onode, its extended attributes, and the free-space information of FreelistManager. These updates must be applied in order.

The queues involved are as follows:

kv_queue: the queue of txcs that need to be committed. The txcs in kv_queue are moved into kv_committing and submitted to RocksDB by calling db->submit_transaction. Their state is set to STATE_KV_SUBMITTED, and the txcs in kv_committing are then moved into kv_committing_to_finalize to wait for kv_finalize_thread.

deferred_done_queue: the queue of dbhs whose DeferredIO has completed but has not yet been synced to disk. A dbh in this queue is handled in one of two ways: 1) if no flush is performed in this iteration, it is placed in deferred_stable_queue to be processed in the next iteration; 2) if a flush is performed, the data is already on disk, that is, stable, and the dbh is inserted directly into the stable queue and processed in the same iteration. Here "stable" means that the data has been synced to disk, so the WAL record previously written to RocksDB is no longer needed and can be deleted.

deferred_stable_queue: the queue of dbhs whose DeferredIO is already on disk and whose WAL in RocksDB is waiting to be cleaned up. The txcs in each dbh are processed in turn and their WAL records are deleted from RocksDB; the dbh is then put into deferred_stable_to_finalize to wait for kv_finalize_thread.
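How the three queues interact in one loop iteration can be modelled without RocksDB or a block device. The sketch below is an illustration under assumptions: `MiniKvSync` and `run_once` are hypothetical names, integers stand in for txcs and dbhs, and a string log stands in for the flush, submit and sync calls. It follows the description above, including moving unflushed batches to the stable queue for the next iteration.

```cpp
#include <cassert>
#include <deque>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical, simplified model of the queues handled by kv_sync_thread.
struct MiniKvSync {
  std::mutex lock;
  std::deque<int> kv_queue;                    // txcs waiting for commit
  std::deque<int> deferred_done_queue;         // batches written, not yet flushed
  std::deque<int> deferred_stable_queue;       // batches on disk, WAL removable
  std::deque<int> kv_committing_to_finalize;   // output for the finalize thread
  std::deque<int> deferred_stable_to_finalize; // output for the finalize thread
  std::vector<std::string> log;                // what a real loop would have done

  void run_once(bool force_flush) {
    std::deque<int> kv_committing, deferred_done, deferred_stable;
    {
      // Swap under the lock so producers are blocked only briefly.
      std::lock_guard<std::mutex> l(lock);
      kv_committing.swap(kv_queue);
      deferred_done.swap(deferred_done_queue);
      deferred_stable.swap(deferred_stable_queue);
    }
    if (force_flush) {
      log.push_back("flush");
      // After a flush, done batches are stable right away.
      deferred_stable.insert(deferred_stable.end(),
                             deferred_done.begin(), deferred_done.end());
      deferred_done.clear();
    }
    for (int txc : kv_committing) log.push_back("submit " + std::to_string(txc));
    for (int b : deferred_stable) log.push_back("rm wal " + std::to_string(b));
    log.push_back("sync");

    std::lock_guard<std::mutex> l(lock);
    // Batches that were not flushed wait in the stable queue for the next iteration.
    deferred_stable_queue.insert(deferred_stable_queue.end(),
                                 deferred_done.begin(), deferred_done.end());
    kv_committing_to_finalize.insert(kv_committing_to_finalize.end(),
                                     kv_committing.begin(), kv_committing.end());
    deferred_stable_to_finalize.insert(deferred_stable_to_finalize.end(),
                                       deferred_stable.begin(), deferred_stable.end());
  }
};
```

Swapping the shared queues into local ones is the design choice worth noting: the lock is held only for the swap, so threads that enqueue new work are not blocked while the batch is being committed.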

    void BlueStore::_kv_sync_thread()
    {
      while (true) {
        // swap the queue pointers
        kv_committing.swap(kv_queue);
        kv_submitting.swap(kv_queue_unsubmitted);
        deferred_done.swap(deferred_done_queue);
        deferred_stable.swap(deferred_stable_queue);

        // process deferred_done_queue
        if (force_flush) {
          // flush/barrier on block device
          bdev->flush();
          // if we flush then deferred done are now deferred stable
          deferred_stable.insert(deferred_stable.end(),
                                 deferred_done.begin(), deferred_done.end());
          deferred_done.clear();
        }

        // process kv_queue
        for (auto txc : kv_committing) {
          int r = cct->_conf->bluestore_debug_omit_kv_commit
                    ? 0 : db->submit_transaction(txc->t);
          _txc_applied_kv(txc);
        }

        // process deferred_stable_queue
        for (auto b : deferred_stable) {
          for (auto& txc : b->txcs) {
            get_deferred_key(wt.seq, &key);
            synct->rm_single_key(PREFIX_DEFERRED, key);
          }
        }

        // submit synct synchronously (block and wait for it to commit)
        // the synchronous kv commit covers two operations:
        // setting bluefs_extents and deleting the wal
        int r = cct->_conf->bluestore_debug_omit_kv_commit
                  ? 0 : db->submit_transaction_sync(synct);

        // hand the results to the finalize thread and notify it
        std::unique_lock m(kv_finalize_lock);
        kv_committing_to_finalize.swap(kv_committing);
        deferred_stable_to_finalize.swap(deferred_stable);
        kv_finalize_cond.notify_one();
      }
    }

Latency metrics involved: state_kv_queued_lat, state_kv_committing_lat, kv_lat.

kv_finalize_thread

The cleanup thread. It handles two queues:

kv_committing_to_finalize: _txc_state_proc is called again to re-enter the state machine, the state is set to STATE_KV_DONE, and a callback is executed to tell the user that the IO is complete.

deferred_stable_to_finalize: for each dbh in deferred_stable, _txc_state_proc is called to re-enter the state machine and the state is set to STATE_FINISHING. _txc_finish is then called and the state is set to STATE_DONE. The state machine ends and the transaction is complete.

    void BlueStore::_kv_finalize_thread()
    {
      while (true) {
        // swap the queue pointers
        kv_committed.swap(kv_committing_to_finalize);
        deferred_stable.swap(deferred_stable_to_finalize);

        // process the kv_committing_to_finalize queue
        while (!kv_committed.empty()) {
          TransContext *txc = kv_committed.front();
          _txc_state_proc(txc);
          kv_committed.pop_front();
        }

        // process deferred_stable_to_finalize
        for (auto b : deferred_stable) {
          auto p = b->txcs.begin();
          while (p != b->txcs.end()) {
            TransContext *txc = &*p;
            p = b->txcs.erase(p);  // unlink here because
            _txc_state_proc(txc);  // this may destroy txc
          }
          delete b;
        }
        deferred_stable.clear();
      }
    }

Latency metrics involved: state_deferred_cleanup_lat, state_finishing_lat.

IO states

IO falls into three main cases: SimpleWrite, DeferredWrite, and SimpleWrite+DeferredWrite.
