Design Data Intensive Applicat 05/12 Update SLTechnology News&Howtos

Design Data Intensive Applicat

2025-05-12 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >

Shulou(Shulou.com)06/01 Report--

What is SHARDING:

Divide large data sets into blocks and store them on different servers

Purpose:

Scalability: Different shards can be placed on different servers to distribute read requests

Complex queries can be executed in parallel on different shards

Write requests are distributed across servers

Question 1: How to divide?

Keep data uniform on each server to avoid data skew

Randomization:

Advantages: Uniform data

Disadvantages: Unable to know which node the data is in. Each fragment holds a primary key within a range of consecutive key values (partition by key range)

Advantages: Easy to figure out which node the primary key is in. Primary key can be stored in order, convenient range search

Disadvantages: Each fragment data may not be uniform, need to adjust the fragment boundary--> manual or automatic. Primary key prefix solves distribution problem Fragmentation by primary key HASH value: (riak, couplbase, voldemort)

Pros: Theoretically uniform data, depending on HASH algorithm. It's easy to figure out where the data is.

Disadvantages: Difficult to perform range search Mixed mode: union primary key, first by the first attribute HASH of the primary key, then by other attributes in order (cassandra)

Suitable for processing one-to-many data

Handle data tilt and hot keys read and write:

Need to apply layer solution: such as increasing random prefix and suffix to key values. Disadvantages: The data of the same key value is scattered in different shards, increasing the reading complexity. Question 2: How to query the data?

Fragmentation solves the problem of write and primary key queries, but how do you solve other queries? How to create a secondary index in case of data fragmentation?

Local index: Each shard maintains a dictionary mapping of secondary query criteria to the primary key list separately

Pros: Easy to update index while writing data

Disadvantages: queries must be found in the secondary index of each fragment, and then merge the results Global index: an independent index structure covers all fragments, and the index itself is also fragmented, according to the query conditions corresponding to the index (term partitioned)

Advantages: query index falls into a single fragment, high efficiency, if RANGE fragmentation also supports range query

Disadvantages: writing data is complex, write operations affect multiple shards (data shards and index shards may not be on the same node), distributed transaction support is required, or asynchronous mode is adopted, sacrificing consistency, and newly written data may not be immediately visible in the index. Question 3: How to deal with cluster expansion or node fragmentation data with downtime?

Fragmented data needs to be migrated from one node to another (partition balancing)

Data rebalancing requirements: After migration, the load must be kept uniform (cluster expansion) Cluster must be available during migration, read and write have no impact Migration must minimize unnecessary data movement and reduce cluster IO overhead Data rebalancing policy:hash module will lead to changes in the nodes where a large number of fragments are located after expansion, which does not meet the above requirements. 3 Do not directly map keys to nodes, but first map keys to partitions, and then map partitions to nodes. The number of partitions is much larger than the number of nodes, so that new nodes get some partition data while keeping the key-to-partition mapping unchanged (riak, elasticsearch, couplbase, voldemort)

Benefits: Minimize data movement during scale-up

Disadvantages: The number of partitions is always fixed and cannot be increased or decreased. It is difficult to determine the number of partitions. If the data volume of each partition is too large or too small, it will bring extra overhead. Dynamic primary key range fragmentation: Data fragmentation is sorted according to the primary key. When the fragmentation exceeds the configured size, it will automatically split into two fragments. When the fragmentation is too small due to data deletion, it will merge with adjacent fragments. (hbase, rethinkDB)

Advantages: Fragment size automatically adapts to cluster data volume

Disadvantages: When the database is initialized, there is only one fragment, and the read and write load cannot be effectively distributed. Solution: Configure pre-fragmentation.

Dynamic fragmentation can also be applied to HASH. The number of fragments is proportional to the number of nodes: i.e., the number of fragments per node is fixed. When a new node is added, a certain number of fragments are randomly selected to be equally divided, and half of the data is moved to the new node. (cassandra, ketama)

Disadvantages: Only HASH fragments are supported. Random selection may result in uneven data manual or automatic balancing:

automatic rebalancing

Advantages: No manual intervention required

Disadvantages: Fragmented data movement is an expensive operation that has an unknowable impact on cluster performance and is prone to avalanche effects.

artificial rebalancing

Advantages: strong controllability

Disadvantages: Slow response

Request routing:

After rebalancing, the client needs to know which node to connect to

The client can connect to any node. If the partition exists, the request is processed. Otherwise, the node is responsible for sending the request to the node where the fragment is located.

Advantages: Clients do not need to store fragments METADATA,

Disadvantages: request roundtrip time may be long Separate routing layer is responsible for receiving client requests and forwarding, routing layer needs to know fragment storage METADATA

Advantages: Clients do not need to store fragments METADATA,

Cons: request roundtrip may take longer

Client stores shard METADATA and routes directly to new node

Advantages: Direct routing, fast speed

Disadvantages: Clients need to be aware of fragmentation topology changes

Client-aware routing change is a challenging problem. (network latency/partitioning, etc.), requires distributed consistency protocols, or uses centralized routing METADATA storage such as zookeeper, etc.

Parallel QUERY execution:

Analytical databases need to decompose complex QUERY into multiple fragments and phases that can be executed concurrently to form a directed acyclic graph.

Other:

SHARDING and REPLICATION are usually used together, and a shard is stored on multiple servers.

Consistency HASH: It mainly solves the problem of random selection of fragment boundaries in CDN networks without a centralized consistency protocol, which is generally not suitable for database use.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.