This article discusses how the architecture of a large-scale Internet system is designed, walking through the major building blocks and the trade-offs between them.
Next, we will look at some high-level trade-offs:
Performance and scalability
Latency and Throughput
Availability and consistency
Remember that everything involves a trade-off.
Then, we'll dig into more specific topics, such as DNS, CDN, and load balancers.
1. Performance and scalability
If the increase in performance is proportional to the increase in resources, the service is scalable. Generally, increasing performance means serving more units of work, but it can also mean handling larger units of work, such as when datasets grow.
Look at performance and scalability from another perspective:
If your system has a performance problem, it is slow for a single user.
If your system has a scalability problem, it is fast for a single user but slow under heavy load.
2. Latency and throughput
Latency is the time required to perform an action or to produce a result.
Throughput is the number of such actions or results produced per unit of time.
In general, you should aim to maximize throughput with acceptable latency.
3. Availability and consistency: the CAP theorem
In a distributed computing system, you can only satisfy two of the following three guarantees at the same time:
Consistency ─ every access receives the most recent data or an error
Availability ─ every access receives a (non-error) response, without a guarantee that it contains the most recent data
Partition tolerance ─ the system continues to operate despite arbitrary partitioning due to network failures
Networks are not reliable, so you must support partition tolerance and make a software trade-off between availability and consistency.
CP ─ consistency and partition fault tolerance
Waiting for a response from the partitioned node might result in a timeout error. CP is a good choice if your business requirements need atomic reads and writes.
AP ─ availability and partition tolerance
Responses return the most readily available version of the data on a node, which might not be the latest. Writes might take some time to propagate once the partition is resolved.
AP is a good choice if the business requirements allow for eventual consistency, or if the system needs to continue operating despite external failures.
4. Consistency mode
With multiple copies of the same data, we face the problem of how to synchronize them so that clients see the data consistently. Recall the definition of consistency from the CAP theorem ─ every access receives the most recent data or an error.
Weak consistency
After a write, reads may or may not see it. A best-effort approach is taken to serve the latest data.
This approach can be seen in systems such as memcached. Weak consistency works well in real-time use cases such as VoIP, video chat, and real-time multiplayer games. For example, if you lose signal for a few seconds during a phone call, then when you regain the connection you do not hear what was spoken during the time connectivity was lost.
Eventual consistency
After a write, reads will eventually see it (typically within milliseconds). Data is replicated asynchronously.
This approach is used by systems such as DNS and email. Eventual consistency works well in highly available systems.
Strong consistency
After a write, reads will see it immediately. Data is replicated synchronously.
This approach is used by file systems and relational databases (RDBMSs). Strong consistency works well in systems that need transactions.
5. Availability mode
There are two modes that support high availability: failover (fail-over) and replication (replication).
Active-passive failover
With active-passive failover, heartbeats are sent between the active server and the passive server on standby. If the heartbeat is interrupted, the passive server takes over the active server's IP address and resumes service.
The length of downtime depends on whether the passive server is already running in "hot" standby or whether it needs to start up from "cold" standby. Only the active server handles traffic.
Active-passive failover is also referred to as master-slave failover.
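As an illustration, here is a minimal sketch of the heartbeat logic described above, to be run on the passive server. It is not a complete failover system; send_heartbeat and takeover_ip are hypothetical hooks for probing the active server and claiming its IP address.

import time

HEARTBEAT_INTERVAL = 1.0   # seconds between heartbeat probes
FAILOVER_THRESHOLD = 3     # missed heartbeats before failing over

def monitor_active(send_heartbeat, takeover_ip):
    """Probe the active server; take over its IP once heartbeats stop."""
    missed = 0
    while True:
        missed = 0 if send_heartbeat() else missed + 1
        if missed >= FAILOVER_THRESHOLD:
            takeover_ip()   # e.g. announce the shared IP via gratuitous ARP
            return          # this server is now the active one
        time.sleep(HEARTBEAT_INTERVAL)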
Active-active failover
In active-active failover, both servers manage traffic, spreading the load between them.
If the servers are public-facing, DNS needs to know the public IPs of both servers. If the servers are internal-facing, application logic needs to know about both servers.
Active-active failover is also referred to as master-master failover.
Disadvantages: failover
Failover requires additional hardware and complexity.
There is a potential for data loss if the active system fails before any newly written data can be replicated to the passive system.
Replication: master-slave and master-master
This topic is further discussed in the Database section:
Master-slave replication
Master-master replication
6. Domain name system
A Domain Name System (DNS) translates a domain name such as www.example.com into an IP address.
DNS is hierarchical, with a few authoritative servers at the top level. Your router or ISP provides information about which DNS server(s) to contact when doing a lookup. Lower-level DNS servers cache mappings, which can become stale due to DNS propagation delays. DNS results can also be cached by your browser or operating system for a certain period of time, determined by the time to live (TTL).
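As a concrete illustration, the sketch below resolves a name using Python's standard library; the operating system's resolver performs the lookup and handles the caching described above. www.example.com is just a placeholder domain.

import socket

# Resolve a domain name to its IP addresses (its A/AAAA records).
infos = socket.getaddrinfo("www.example.com", 443, proto=socket.IPPROTO_TCP)
for family, _, _, _, sockaddr in infos:
    print(family.name, sockaddr[0])   # e.g. AF_INET 93.184.216.34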
NS record (name server) ─ specifies the DNS servers for your domain or subdomain.
MX record (mail exchange) ─ specifies the mail servers for accepting messages.
A record (address) ─ points a domain name to an IP address.
CNAME (canonical name) ─ points a domain name to another domain name or CNAME record (example.com to www.example.com), or to an A record.
Platforms such as CloudFlare and Route 53 provide managed DNS services. Some DNS services route traffic in various ways:
Weighted round robin scheduling
Prevent traffic from entering the server under maintenance
Load balancing among clusters of different sizes
A/B testing
Latency-based routing
Location-based routing
Disadvantages: DNS
Accessing a DNS server introduces a slight delay, although this is mitigated by the caching described above.
DNS server management can be complex; it is generally handled by governments, ISPs, and large companies.
DNS services have recently come under DDoS attack, preventing users who did not know Twitter's IP address from accessing Twitter.
7. Content delivery Network (CDN)
A content delivery network (CDN) is a globally distributed network of proxy servers, serving content from locations closer to the user. Generally, static content such as HTML/CSS/JS, photos, and videos is served from a CDN, although some CDNs such as Amazon CloudFront support dynamic content as well. The site's DNS resolution tells clients which server to contact.
Serving content from CDNs can significantly improve performance in two ways:
Provide resources from a data center close to the user
Your servers do not have to handle requests that the CDN fulfills.
Push CDNs
Push CDNs receive new content whenever changes occur on your server. You take full responsibility for providing content, uploading it directly to the CDN and rewriting URLs to point to the CDN. You can configure when content expires and when it is updated. Content is uploaded only when it is new or changed, minimizing traffic but maximizing storage.
Pull CDNs
Pull CDNs grab content from your server when the first user requests it. You leave the content on your server and rewrite URLs to point to the CDN. This results in slower requests until the content is cached on the CDN.
A time to live (TTL) determines how long content is cached. Pull CDNs minimize storage space on the CDN but can create redundant traffic if files expire and are pulled before they have actually changed.
Sites with heavy traffic work well with pull CDNs, as traffic is spread out more evenly, with only recently requested content remaining on the CDN.
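The URL rewriting that both push and pull CDNs rely on can be as simple as the following sketch; cdn.example.com stands in for your CDN's hostname.

from urllib.parse import urlsplit, urlunsplit

CDN_HOST = "cdn.example.com"   # hypothetical CDN domain for your content

def to_cdn_url(url):
    """Rewrite a static-asset URL so clients fetch it from the CDN."""
    scheme, _, path, query, fragment = urlsplit(url)
    return urlunsplit((scheme or "https", CDN_HOST, path, query, fragment))

print(to_cdn_url("https://www.example.com/static/logo.png"))
# https://cdn.example.com/static/logo.png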
Disadvantages: CDN
CDN costs can be significant depending on traffic, although they should be weighed against the costs you would incur by not using a CDN.
Content might be stale if it is updated before the TTL expires.
CDNs require changing the URLs of static content to point to the CDN.
8. Load balancer
Load balancers distribute incoming client requests to computing resources such as application servers and databases. In each case, the load balancer returns the response from the computing resource to the appropriate client. Load balancers are effective at:
Prevent requests from entering bad servers
Prevent resource overload
Help eliminate a single point of failure
Load balancers can be implemented with hardware (expensive) or with software such as HAProxy. Additional benefits include:
SSL termination ─ decrypt incoming requests and encrypt server responses so that backend servers do not have to perform these potentially expensive operations.
This removes the need to install X.509 certificates on each server.
Session persistence ─ issue cookies and route a specific client's requests to the same instance if the web apps do not keep track of sessions.
To protect against failures, it is common to set up multiple load balancers, in either active-passive or active-active mode.
Load balancers can route traffic based on various metrics (a minimal round-robin sketch follows this list):
Random
Minimum load
Session/cookie
Round robin or weighted round robin
Layer 4
Layer 7
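The sketch below illustrates two of these strategies: plain round robin and a weighted random variant (a common approximation of weighted round robin). The server addresses and weights are made up for the example.

import itertools
import random

SERVERS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]           # hypothetical backends
WEIGHTS = {"10.0.0.1": 5, "10.0.0.2": 3, "10.0.0.3": 1}  # relative capacity

round_robin = itertools.cycle(SERVERS)  # hand out servers in a fixed rotation

def pick_round_robin():
    return next(round_robin)

def pick_weighted():
    # Higher-weight servers receive proportionally more requests.
    return random.choices(SERVERS, weights=[WEIGHTS[s] for s in SERVERS])[0]

for _ in range(4):
    print(pick_round_robin(), pick_weighted())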
Layer 4 load balancing
Layer 4 load balancers look at information at the transport layer to decide how to distribute requests. Generally, this involves the source and destination IP addresses and ports in the header, but not the contents of the packet. Layer 4 load balancers forward network packets to and from the upstream server, performing Network Address Translation (NAT).
Layer 7 load balancing
Layer 7 load balancers look at the application layer to decide how to distribute requests. This can involve the contents of the header, the message, and cookies. A layer 7 load balancer terminates network traffic, reads the message, makes a load-balancing decision, then opens a connection to the selected server. For example, a layer 7 load balancer can direct video traffic to servers that host videos while directing more sensitive user-billing traffic to security-hardened servers.
At the cost of flexibility, layer 4 load balancing requires less time and computing resources than layer 7, although the performance impact can be minimal on modern commodity hardware.
Horizontal scaling
Load balancers can also help with horizontal scaling, improving performance and availability. Scaling out using commodity machines is more cost-efficient and results in higher availability than scaling up a single server on more expensive hardware (vertical scaling). It is also easier to hire for talent working on commodity hardware than on specialized enterprise systems.
Disadvantages: horizontal scaling
Scaling horizontally introduces complexity and involves cloning servers:
Servers should be stateless: they should not contain any user-related data such as sessions or profile pictures.
Sessions can be stored in a centralized data store such as a database or a persistent cache (Redis, Memcached).
Downstream servers such as caches and databases need to scale with the upstream servers to handle more concurrent connections.
Disadvantages: load balancer
The load balancer can become a performance bottleneck if it does not have enough resources or is misconfigured.
Introducing a load balancer to help eliminate a single point of failure results in increased complexity.
A single load balancer is itself a single point of failure; configuring multiple load balancers further increases complexity.
9. Reverse proxy (web server)
A reverse proxy is a web server that centralizes internal services and provides a unified interface to the public. Requests from clients are forwarded by the proxy to a server that can fulfill them, and the proxy then returns the server's response to the client.
The benefits include:
Increased security ─ hide information about backend servers, block blacklisted IPs, and limit the number of connections per client.
Increased scalability and flexibility ─ clients only see the reverse proxy's IP, allowing you to add or remove servers or change their configuration.
SSL termination ─ decrypt incoming requests and encrypt server responses so that backend servers do not have to perform these potentially expensive operations.
Eliminates the need to install X.509 certificates on each server
Compression ─ compress server responses
Caching ─ directly return the response for cached requests
Static content ─ serve static content directly:
HTML/CSS/JS
Images
Videos
Etc.
Load balancer and reverse proxy
Deploying load balancers is useful when you have multiple servers. Typically, load balancers route traffic to a set of servers with the same function.
Reverse proxies can be useful even with just one web server or application server, opening up the benefits described in the previous section.
Solutions such as NGINX and HAProxy can support layer 7 reverse proxy and load balancing at the same time.
Disadvantages: reverse proxy
Introducing a reverse proxy adds complexity to the system.
A single reverse proxy is a single point of failure; configuring multiple reverse proxies (i.e., a failover) further increases complexity.
10. Application layer
By separating the web layer from the application layer (also known as the platform layer), you can scale and configure both layers independently. Adding a new API only requires adding application servers, not necessarily additional web servers.
The single responsibility principle advocates for small, autonomous services that work together. Small teams with small services can plan more aggressively for rapid growth.
Workers in the application layer also help enable asynchronous workflows.
Microservices
Related to this discussion are microservices, which can be described as a suite of independently deployable, small, modular services. Each service runs a unique process and communicates through a well-defined, lightweight mechanism to serve a business goal.
For example, Pinterest may have these microservices: user profiles, followers, Feed streams, search, photo uploads, and so on.
Service discovery
Systems such as Consul, Etcd, and Zookeeper can help services find each other by keeping track of registered names, addresses, and ports. Health checks help verify service integrity and are often done with an HTTP endpoint. Both Consul and Etcd have a built-in key-value store that can be useful for storing configuration values and other shared data.
Disadvantages: application layer
Adding an application layer made up of multiple loosely coupled services requires a very different approach from an architectural, operational, and process viewpoint compared with a monolithic system.
Microservices increase the complexity of deployment and operation.
11. Database
Relational database management system (RDBMS)
A relational database like SQL is a collection of data items organized in tables.
Proofreading note: by "SQL" the author may mean MySQL.
ACID is a set of properties that describe relational database transactions:
Atomicity-all operations within each transaction are either completed or not completed.
Consistency-any transaction transitions the database from one valid state to another.
Isolation-the result of concurrent transaction execution is the same as that of sequential transaction execution.
Durability - once a transaction has been committed, its effects on the system are permanent.
There are many techniques for scaling a relational database: master-slave replication, master-master replication, federation, sharding, denormalization, and SQL tuning.
Master-slave replication
The master serves both reads and writes, replicating writes to one or more slaves, which serve only reads. Slaves can also replicate to additional slaves in a tree-like fashion. If the master goes offline, the system can continue to operate in read-only mode until a slave is promoted to master or a new master is provisioned.
Disadvantages: master-slave replication
Additional logic is needed to promote a slave to a master.
See Disadvantages: replication for issues common to both master-slave and master-master replication.
Master-master replication
Both masters serve reads and writes and coordinate with each other on writes. If either master goes down, the system can continue to operate with both reads and writes.
Disadvantages: master-master replication
You need to add a load balancer or make changes in the application logic to determine which database to write to.
Most master-master systems are either loosely consistent (violating ACID) or have increased write latency due to synchronization.
Conflict resolution matters more as more write nodes are added and as latency increases.
See Disadvantages: replication for issues common to both master-slave and master-master replication.
Disadvantages: replication
There is a potential for data loss if the master fails before newly written data has been replicated to other nodes.
Writes are replayed to the read replicas. If there are many writes, the read replicas can get bogged down replaying writes and cannot serve as many reads.
The more read slaves there are, the more data must be replicated, leading to greater replication lag.
On some systems, writing to the master can spawn multiple threads to write in parallel, whereas read replicas only support writing sequentially with a single thread.
Replication means more hardware and additional complexity.
Federation
Federation (or functional partitioning) splits up databases by function. For example, instead of a single monolithic database, you could have three databases: forums, users, and products, reducing read and write traffic to each database and therefore reducing replication lag. Smaller databases also mean more data fits in memory, which in turn means a higher cache hit rate. With no single central master serializing writes, you can write in parallel, increasing throughput.
Disadvantages: federation
Federation is not effective if your schema requires huge functions or tables.
You need to update your application logic to determine which database to read from and write to.
Joining data from two databases is more complex with a server link.
Federation adds more hardware and additional complexity.
Sharding
Sharding distributes data across different databases such that each database manages only a subset of the data. Taking a users database as an example, as the number of users increases, more shards are added to the cluster.
Similar to the advantages of federation, sharding results in less read and write traffic, less replication, and more cache hits. Index size is also reduced, which generally means faster queries and better performance. If one shard goes down, the others still operate, although you will want some form of redundancy to avoid data loss. Like federation, there is no single central master serializing writes, allowing you to write in parallel with increased throughput.
Common ways to shard a table of users are either by the user's last-name initial or by the user's geographic location.
Disadvantages: sharding
You will need to update your application logic to work with shards, which can result in complex SQL queries.
Unreasonable sharding can result in unbalanced data load. For example, a set of frequently accessed user data could cause increased load on one shard compared with the others.
Rebalancing adds additional complexity. A sharding function based on consistent hashing can reduce the amount of data transferred (a minimal sketch follows this list).
Joining data across multiple shards is more complex.
Sharding requires more hardware and additional complexity.
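A minimal consistent-hashing sketch, for illustration only: each shard is hashed onto a ring (several times, as virtual nodes), and a key is assigned to the first shard clockwise from the key's hash, so adding or removing a shard only remaps the keys adjacent to it.

import bisect
import hashlib

def _hash(value):
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, shards, replicas=100):
        # Several virtual nodes per shard even out the distribution.
        self._ring = sorted(
            (_hash("{0}:{1}".format(shard, i)), shard)
            for shard in shards
            for i in range(replicas)
        )
        self._hashes = [h for h, _ in self._ring]

    def shard_for(self, key):
        # First virtual node clockwise from the key's hash.
        index = bisect.bisect(self._hashes, _hash(key)) % len(self._ring)
        return self._ring[index][1]

ring = ConsistentHashRing(["users-db-0", "users-db-1", "users-db-2"])
print(ring.shard_for("user:1234"))   # this user always routes to the same shard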
Denormalization
Denormalization attempts to improve read performance at the expense of write performance. Redundant copies of the data are written to multiple tables to avoid expensive joins. Some RDBMSs such as PostgreSQL and Oracle support materialized views, which handle the work of storing redundant information and keeping the redundant copies consistent.
Once data is distributed with techniques such as federation and sharding, managing joins across data centers further increases complexity. Denormalization can circumvent the need for such complex joins.
In most systems, reads can heavily outnumber writes, at a ratio of 100:1 or even 1000:1. A read resulting in a complex database join can be very expensive, spending a significant amount of time on disk operations.
Disadvantages: denormalization
Data is duplicated.
Constraints can help redundant copies of information stay in sync, but they increase the complexity of the database design.
A denormalized database under heavy write load might perform worse than its normalized counterpart.
SQL tuning
SQL tuning is a broad topic, and many books are available as references.
It is important to use benchmarking and performance analysis to simulate and identify system bottlenecks.
Benchmark-use tools such as ab to simulate high load situations.
Performance analysis-helps track performance issues by enabling tools such as slow query logs.
Benchmarking and performance analysis may lead you to the following optimizations.
Tighten up the schema
For fast access, MySQL stores data in contiguous blocks on disk.
Use the CHAR type to store fixed-length fields, not VARCHAR.
CHAR is efficient for fast, random access, whereas with VARCHAR you must find the end of the current string before moving on to the next one.
Use the TEXT type to store large chunks of text, such as blog text. TEXT also allows Boolean search. Using the TEXT field requires storing a pointer on disk to locate a block of text.
Use INT for larger numbers up to 2^32 (about 4 billion).
Use the DECIMAL type to store currencies to avoid floating point representation errors.
Avoid storing large BLOBS; store the location of where to get the object instead.
VARCHAR(255) is the largest number of characters that can be counted in an 8-bit number, often maximizing the use of a byte in some RDBMSs.
Set the NOT NULL constraint where applicable to improve search performance.
Use the correct index
Columns that you are querying (SELECT, GROUP BY, ORDER BY, JOIN) can be faster with indices (a small demo follows this list).
Indices are usually represented as self-balancing B-trees, which keep data sorted and allow searches, sequential access, insertions, and deletions in logarithmic time.
Placing an index can keep the data in memory, requiring more space.
Writes can also be slower, since the index needs to be updated.
When loading large amounts of data, it might be faster to disable indices, load the data, then rebuild the indices.
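As a small, self-contained demonstration of an index changing a query plan, using the SQLite module from Python's standard library (the table and column names are invented for the example):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, last_name TEXT NOT NULL)")
conn.executemany("INSERT INTO users (last_name) VALUES (?)",
                 [("smith",), ("jones",), ("garcia",)])

# Without the index this query scans the whole table; with it, SQLite
# seeks directly into the index's B-tree.
conn.execute("CREATE INDEX idx_users_last_name ON users (last_name)")
plan = conn.execute("EXPLAIN QUERY PLAN SELECT * FROM users WHERE last_name = ?",
                    ("smith",)).fetchall()
print(plan)   # reports: SEARCH users USING INDEX idx_users_last_name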
Avoid high-cost join operations
If there is a performance need, it can be de-normalized.
Partition tables
Splitting hot-spot data into a separate table can help keep it cached.
Tuning query cache
In some cases, query caching can cause performance problems.
NoSQL
NoSQL is a collective term for key-value stores, document stores, wide-column stores, and graph databases. Data is denormalized, and joins are generally done in application code. Most NoSQL stores lack true ACID transactions and favor eventual consistency.
BASE is often used to describe the properties of NoSQL databases. In comparison with the CAP theorem, BASE chooses availability over consistency.
Basically available - the system guarantees availability.
Soft state - the state of the system may change over time, even without input.
Eventual consistency - the data will become consistent after a period of time, assuming the system receives no new input during that period.
In addition to choosing between SQL or NoSQL, it is also helpful to know which type of NoSQL database is best suited for your use case. We'll take a quick look at key-value storage, document storage, column storage, and graph storage databases in the next section.
Key-value storage
Abstract model: hash table
A key-value store generally allows O(1) reads and writes and is often backed by memory or SSD. Data stores can maintain keys in lexicographic order, allowing efficient retrieval of key ranges. Key-value stores can allow for storing of metadata with a value.
Key-value stores provide high performance and are often used for simple data models or for rapidly changing data, such as an in-memory cache layer. Since they offer only a limited set of operations, complexity is shifted to the application layer if additional operations are needed.
Key-value storage is the basis for more complex storage systems such as document storage and, in some cases, even graph storage.
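A minimal usage sketch with Redis, a popular key-value store; it assumes the redis Python client is installed and a server is running on localhost.

import redis

r = redis.Redis(host="localhost", port=6379)

# O(1) write with a 60-second time to live, then an O(1) read.
r.set("user:1234:name", "alice", ex=60)
print(r.get("user:1234:name"))   # b"alice"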
Document store
Abstract model: key-value store with documents stored as values
A document store is centered around documents (XML, JSON, binary, etc.), where a document stores all the information for a given object. Document stores provide APIs or a query language to query based on the internal structure of the document itself. Note that many key-value stores include features for working with a value's metadata, blurring the line between these two storage types.
Based on the underlying implementation, documents are organized by collections, tags, metadata, or directories. Although documents can be organized or grouped together, they may have fields that are completely different from each other.
Some document stores such as MongoDB and CouchDB also provide a SQL-like language to perform complex queries. DynamoDB supports both key-values and documents.
Document stores provide high flexibility and are often used for working with occasionally changing data.
Wide-column store
Abstract model: nested map ColumnFamily<RowKey, Columns<ColKey, Value, Timestamp>>
A wide-column store's basic unit of data is a column (a name/value pair). A column can be grouped into column families (analogous to a SQL table). Super column families further group column families. You can access each column independently with a row key, and columns with the same row key form a row. Each value contains a timestamp for versioning and for conflict resolution.
Google introduced Bigtable as the first wide-column store, which influenced the open-source HBase, often used in the Hadoop ecosystem, and Cassandra from Facebook. Stores such as Bigtable, HBase, and Cassandra maintain keys in lexicographic order, allowing efficient retrieval of selective key ranges.
Wide-column stores offer high availability and high scalability. They are often used for very large data sets.
Graph database
Abstract model: graph
In a graph database, each node is a record and each arc is a relationship between two nodes. Graph databases are optimized to represent complex relationships with many foreign keys or many-to-many relationships.
Graph databases offer high performance for data models with complex relationships, such as a social network. They are relatively new and not yet widely used; it can be more difficult to find development tools and resources. Many graph databases can only be accessed with REST APIs.
SQL or NoSQL?
The reason for choosing SQL:
Structured data
A strict model
Relational data
Complex join operations are required
Transactions
A clear expansion pattern
Existing resources are richer: developers, communities, code bases, tools, etc.
Querying through an index is very fast
The reason for choosing NoSQL:
Semi-structured data
Dynamic or flexible mode
Non-relational data
No complex join operations are required
Store data at the TB (or even PB) level
Highly data-intensive workload
Very high throughput for IOPS
Sample data suitable for NoSQL:
Clickstream and log data
Leaderboard or scoring data
Temporary data, such as shopping carts
Frequently accessed ("hot") tables
Metadata / Lookup Table
12. Caching
Caching improves page load times and can reduce the load on your servers and databases. In this model, the dispatcher first looks to see if the request has been made before and tries to find the previous result to return, in order to save the actual execution.
Databases perform best when reads are distributed evenly across their partitions. Popular ("hot") items can skew the distribution and cause bottlenecks. Putting a cache in front of a database can help absorb uneven loads and spikes in traffic.
Client cache
The cache can be located on the client (operating system or browser), on the server, or on a different cache layer.
CDN caching
CDN is also considered a cache.
Web server cache
Reverse proxies and caches, such as Varnish, can provide both static and dynamic content directly. The Web server can also cache requests and return results without having to connect to the application server.
Database caching
The default configuration of the database usually includes the cache level, which is optimized for general use cases. Adjusting the configuration and using different modes in different situations can further improve performance.
Application caching
In-memory caches such as Memcached and Redis are key-value stores between your application and your data storage. Since the data is held in RAM, it is much faster than typical databases where data is stored on disk. RAM is more limited than disk, so cache invalidation algorithms such as least recently used (LRU) can help keep "hot" data in RAM while invalidating "colder" entries.
Redis has the following additional features:
Persistence option
Built-in data structures such as ordered sets and lists
There are several cache levels, divided into two main categories: database queries and objects:
Row level
Query level
Complete serializable object
Fully rendered HTML
In general, you should try to avoid file-based caching because it makes copying and automatic scaling difficult.
Database query-level caching
Whenever you query the database, hash the query as a key and store the result in the cache (a minimal sketch follows this list). This approach suffers from expiration issues:
It is hard to delete a cached result for a complex query.
If one piece of data changes, such as a table cell, you need to delete all cached queries that might include the changed cell.
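A minimal sketch of this pattern, hashing the SQL text itself as the cache key; the cache and db objects are assumed, following the same conventions as the code examples later in this section.

import hashlib
import json

def cached_query(sql):
    """Return the cached result for this exact SQL text, querying on a miss."""
    key = "query." + hashlib.sha256(sql.encode()).hexdigest()
    result = cache.get(key)
    if result is None:
        result = db.query(sql)
        cache.set(key, json.dumps(result))
    return result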
Object-level caching
Treat your data as objects, just like your application code. Let the application combine data from the database into a class instance or data structure:
If the underlying data of the object has changed, delete the object from the cache.
Allows asynchronous processing: workers assemble objects by consuming the latest cached object.
Recommended cached content:
User session
Fully rendered Web page
Activity streams
User graph data
When to update the cache
Since you can only store limited data in the cache, you need to choose a cache update strategy that applies to your use case.
Cache-aside
The application is responsible for reading and writing from storage; the cache does not interact with storage directly. The application does the following:
Look for the entry in the cache, resulting in a cache miss
Load the entry from the database
Add the entry to the cache
Return the entry
def get_user(self, user_id):
    user = cache.get("user.{0}", user_id)
    if user is None:
        user = db.query("SELECT * FROM users WHERE user_id = {0}", user_id)
        if user is not None:
            key = "user.{0}".format(user_id)
            cache.set(key, json.dumps(user))
    return user
Memcached is usually used in this way.
Subsequent reads of data added to the cache are fast. Cache-aside is also referred to as lazy loading. Only requested data is cached, which avoids filling up the cache with data that is never requested.
Disadvantages of cache-aside:
Each cache miss results in three trips, which can cause a noticeable delay.
Data can become stale if it is updated in the database. This issue is mitigated by setting a time-to-live (TTL) that forces an update of the cache entry, or by using write-through.
When a node fails, it is replaced by a new, empty node, increasing latency.
Write-through
The application uses the cache as the main data store, reading and writing data to it, while the cache is responsible for reading and writing to the database:
The application adds/updates the entry in the cache
The cache synchronously writes the entry to the data store
Return the entry
Application Code:
set_user(12345, {"foo": "bar"})
Cache code:
def set_user(user_id, values):
    user = db.query("UPDATE Users WHERE id = {0}", user_id, values)
    cache.set(user_id, user)
Write-through is a slow overall operation because of the synchronous write, but subsequent reads of just-written data are fast. Users are generally more tolerant of latency when updating data than when reading it. Data in the cache is never stale.
Disadvantages of write-through:
When a new node is created due to failure or scaling, it will not cache entries until they are updated in the database. Cache-aside in conjunction with write-through can mitigate this issue.
Most data written might never be read; this can be minimized with a TTL.
Write-behind (write-back)
In write-behind, the application does the following (a minimal sketch follows this list):
Add or update the entry in the cache
Asynchronously write the entry to the data store, improving write performance
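A minimal write-behind sketch using a background thread and a queue to drain writes asynchronously; the cache and db objects are assumed, as in the earlier examples.

import queue
import threading

write_queue = queue.Queue()

def writer_worker():
    # Drain queued writes to the data store in the background.
    while True:
        user_id, values = write_queue.get()
        db.query("UPDATE Users WHERE id = {0}", user_id, values)
        write_queue.task_done()

threading.Thread(target=writer_worker, daemon=True).start()

def set_user(user_id, values):
    cache.set(user_id, values)          # update the cache immediately
    write_queue.put((user_id, values))  # persist asynchronously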
Disadvantages of write-behind:
There could be data loss if the cache goes down before its contents hit the data store.
Write-behind is more complex to implement than cache-aside or write-through.
Refresh-ahead
You can configure the cache to automatically refresh recently accessed entries before they expire.
Refresh-ahead can reduce latency if the cache can accurately predict which items are likely to be needed in the future.
Disadvantages of refresh-ahead:
Not accurately predicting which items will be needed can result in worse performance than without refresh-ahead.
Disadvantages of caching:
You need to maintain consistency between the cache and the source of truth, such as the database, through cache invalidation.
You need to make application changes, such as adding Redis or Memcached.
Cache invalidation is a difficult problem; deciding when to update the cache adds complexity.
13. Async
Asynchronous workflows help reduce request times for expensive operations that would otherwise be performed in-line. They can also help by doing time-consuming work in advance, such as periodic aggregation of data.
Message queue
Message queues receive, hold, and deliver messages. If an operation is too slow to perform inline, you can use a message queue with the following workflow:
The application publishes the job to the queue and then notifies the user of the job status
A worker takes the job out of the queue, processes it, and then shows that the job is complete
The user is not blocked and the job is processed in the background. During this time, the client might optionally do a small amount of processing to make it seem like the task has completed. For example, when posting a tweet, the tweet can be instantly posted to your timeline, but it can take some time before it is actually delivered to all of your followers. A minimal sketch using a message broker follows the list below.
Redis is useful as a simple message broker, but messages can be lost.
RabbitMQ is popular but requires you to adapt to the AMQP protocol and manage your own nodes.
Amazon SQS is hosted but can have high latency and may deliver messages twice.
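A minimal sketch of the publish/worker flow above, using Redis as the broker; it assumes the redis Python client, and the queue and job fields are invented for the example.

import json
import redis

r = redis.Redis()

def publish_job(job):
    # The application enqueues the job and returns immediately.
    r.rpush("jobs", json.dumps(job))

def worker():
    while True:
        _, raw = r.blpop("jobs")   # blocks until a job is available
        job = json.loads(raw)
        print("processing", job)   # the actual work happens here

publish_job({"tweet_id": 123, "action": "fan_out_to_followers"})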
Task queue
Task queues receive tasks and their related data, run them, then deliver their results. They can support scheduling and can be used to run computationally intensive jobs in the background.
Celery supports scheduling, mainly developed in Python.
Back pressure
If queues start to grow significantly, the queue size can exceed memory, resulting in cache misses, disk reads, and even slower performance. Back pressure can help by limiting the queue size, thereby maintaining a high throughput rate and good response times for jobs already in the queue. Once the queue fills up, clients get a server-busy or HTTP 503 status code and should try again later, perhaps with exponential backoff (a minimal sketch follows).
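A minimal sketch of back pressure with a bounded in-process queue, plus client-side retries with exponential backoff; the HTTP framing is omitted and publish stands in for the enqueue endpoint.

import queue
import random
import time

jobs = queue.Queue(maxsize=1000)   # the bound is what creates back pressure

def publish(job):
    try:
        jobs.put_nowait(job)
        return 202   # accepted
    except queue.Full:
        return 503   # server busy: ask the client to retry later

def publish_with_backoff(job, retries=5):
    for attempt in range(retries):
        if publish(job) != 503:
            return True
        time.sleep(2 ** attempt + random.random())  # 1s, 2s, 4s... plus jitter
    return False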
Disadvantages of async:
Use cases such as simple computing and real-time workflows may be more suitable for synchronous operations because the introduction of queues can increase latency and complexity.
14. Communication
Hypertext transfer Protocol (HTTP)
HTTP is a method for encoding and transporting data between a client and a server. It is a request/response protocol: clients issue requests and servers issue responses with the relevant content and completion-status information. HTTP is self-contained, allowing requests and responses to flow through many intermediate routers and servers that perform load balancing, caching, encryption, and compression.
A basic HTTP request consists of a verb (method) and a resource (endpoint). The following are common HTTP verbs:
Verb     Description                                        Idempotent*  Safe  Cacheable
GET      Reads a resource                                   Yes          Yes   Yes
POST     Creates a resource or triggers a process
         that handles data                                  No           No    Yes, if response contains freshness info
PUT      Creates or replaces a resource                     Yes          No    No
PATCH    Partially updates a resource                       No           No    Yes, if response contains freshness info
DELETE   Deletes a resource                                 Yes          No    No

*Idempotent: can be called many times without producing different outcomes.
HTTP is an application layer protocol that relies on lower-level protocols such as TCP and UDP.
Transmission Control Protocol (TCP)
TCP is a connection-oriented protocol over an IP network. Connections are established and terminated using a handshake. All packets sent are guaranteed to reach the destination in the original order and without corruption, through:
Sequence numbers and checksum fields for each packet
Acknowledgement packets and automatic retransmission
If the sender does not receive a correct acknowledgement, it resends the packets. If there are multiple timeouts, the connection is dropped. TCP also implements flow control and congestion control. These guarantees cause delays and generally result in less efficient transmission than UDP.
To ensure high throughput, web servers can keep a large number of TCP connections open, resulting in high memory usage. It can be expensive to have a large number of open connections between web server threads and, say, a memcached server. Connection pooling can help, in addition to switching to UDP where applicable.
TCP is useful for applications that require high reliability but are less time-critical. Examples include web servers, database info, SMTP, FTP, and SSH.
Use TCP instead of UDP in the following situations:
You need all of the data to arrive intact
You want to automatically make a best estimate of the network throughput
User Datagram Protocol (UDP)
UDP is connectionless. Datagrams (analogous to packets) are guaranteed only at the datagram level. Datagrams might reach their destination out of order or not at all. UDP does not support congestion control. Without the guarantees that TCP provides, UDP is generally more efficient.
UDP can broadcast datagrams to all devices on the subnet. This is useful with DHCP because devices on the subnet have not yet been allocated an IP address, which TCP requires.
UDP is less reliable but works well in real-time use cases such as VoIP, video chat, streaming media, and real-time multiplayer games.
Use UDP instead of TCP in the following situations:
You need low latency.
Data arriving late is worse than data being lost
You want to implement your own error correction method
Remote procedure call Protocol (RPC)
In an RPC, a client causes a procedure to execute in a different address space, usually a remote server. The procedure is coded as if it were a local procedure call, abstracting away the details of how the client communicates with the server. Remote calls are usually slower and less reliable than local calls, so it is helpful to distinguish the two. Popular RPC frameworks include Protobuf, Thrift, and Avro.
RPC is a request-response protocol:
The client program ── calls the client stub procedure. The parameters are pushed onto the stack, as in a local procedure call.
The client stub procedure ── marshals (packs) the procedure id and arguments into a request message.
The client communication module ── sends the message from the client to the server.
The server communication module ── passes the incoming packets to the server stub procedure.
The server stub procedure ── unpacks the message, then calls the server procedure matching the procedure id, passing the arguments.
Example of RPC call:
GET /someoperation?data=anId

POST /anotheroperation
{
  "data": "anId",
  "anotherdata": "another value"
}
RPC focuses on exposing behaviors. RPCs are often used for performance reasons in internal communications, since you can hand-craft native calls to better fit your use cases.
Choose a native library (i.e., an SDK) when:
You know your target platform.
You want to control how your "logic" is accessed.
You want to control the errors that occur in your library.
Performance and end-user experience are the things you care about most.
HTTP APIs following REST tend to be a better fit for public APIs.
Disadvantages: RPC
RPC clients become tightly coupled to the service implementation.
A new API must be defined for every new operation or use case.
RPC can be difficult to debug.
It might not be easy to modify the underlying technology. For example, ensuring RPC is properly cached on caching servers such as Squid might require additional effort.
Representational state transfer (REST)
REST is an architectural style enforcing a client/server model, where the client acts on a set of resources managed by the server. The server provides an interface for modifying or retrieving resources. All communication must be stateless and cacheable.
There are four qualities of a RESTful interface:
Identify resources (URI in HTTP) ── use the same URI regardless of the operation.
Change with representations (verbs in HTTP) ── use verbs, headers, and body.
Self-descriptive error messages (status codes in HTTP) ── use status codes; do not reinvent the wheel.
HATEOAS (HTML interface for HTTP) ── your web service should be fully accessible in a browser.
An example of a REST request:
GET /someresources/anId

PUT /someresources/anId
{"anotherdata": "another value"}

REST focuses on exposing data. It minimizes the coupling between client and server and is often used for public HTTP APIs. REST uses a more generic and uniform method of exposing resources through URIs, representation through headers, and actions through verbs such as GET, POST, PUT, DELETE, and PATCH. Being stateless, REST is great for horizontal scaling and partitioning.
Disadvantages: REST
Since REST focuses on exposing data, it might not be a good fit when resources are not naturally organized or accessed in a simple hierarchy. For example, returning all updated records from the past hour matching a particular set of events is not easily expressed as a path. With REST, it is likely to be implemented with a combination of URI path, query parameters, and possibly the request body.
REST typically relies on a few verbs (GET, POST, PUT, DELETE, and PATCH), which sometimes does not fit your use case. For example, moving expired documents to an archive folder might not cleanly fit within these verbs.
Fetching complicated resources with nested hierarchies requires multiple round trips between the client and server to render a single view, e.g., fetching the content of a blog entry and the comments on that entry. For mobile applications operating in variable network conditions, these multiple round trips are highly undesirable.
Over time, more fields might be added to an API response, and older clients will receive all new data fields, even those they do not need; as a result, payload sizes bloat and latencies increase.
The following compares example RPC and REST calls:

Operation: Sign up
  RPC:  POST /signup
  REST: POST /persons

Operation: Resign
  RPC:  POST /resign
        {"personid": "1234"}
  REST: DELETE /persons/1234

Operation: Read a person
  RPC:  GET /readPerson?personid=1234
  REST: GET /persons/1234

Operation: Read a person's items list
  RPC:  GET /readUsersItemsList?personid=1234
  REST: GET /persons/1234/items

Operation: Add an item to a person's items
  RPC:  POST /addItemToUsersItemsList
        {"personid": "1234", "itemid": "456"}
  REST: POST /persons/1234/items
        {"itemid": "456"}

Operation: Update an item
  RPC:  POST /modifyItem
        {"itemid": "456", "key": "value"}
  REST: PUT /items/456
        {"key": "value"}

Operation: Delete an item
  RPC:  POST /removeItem
        {"itemid": "456"}
  REST: DELETE /items/456

15. Security
More content is needed in this section; consider contributing!
Security is a broad topic. Unless you have considerable experience, a security background, or are applying for a position that requires knowledge of security, you likely will not need to know more than the basics:
Encrypt data in transit and at rest.
Sanitize all user inputs and any input parameters exposed to users to prevent XSS and SQL injection.
Use parameterized queries to prevent SQL injection.
Use the principle of least privilege.
16. Appendix
Sometimes you will be asked to do back-of-the-envelope estimates. For example, you might need to determine how long it will take to generate 100 image thumbnails from disk, or how much memory a data structure will take. The powers of two table and the latency numbers every programmer should know are handy references.
Powers of two table:

Power           Exact Value         Approx Value        Bytes
---------------------------------------------------------------
7                             128
8                             256
10                          1,024   1 thousand           1 KB
16                         65,536                       64 KB
20                      1,048,576   1 million            1 MB
30                  1,073,741,824   1 billion            1 GB
32                  4,294,967,296                        4 GB
40              1,099,511,627,776   1 trillion           1 TB

Latency numbers every programmer should know:

Latency Comparison Numbers
--------------------------
L1 cache reference                           0.5 ns
Branch mispredict                            5   ns
L2 cache reference                           7   ns         14x L1 cache
Mutex lock/unlock                          100   ns
Main memory reference                      100   ns         20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy            10,000   ns   10 us
Send 1 KB bytes over 1 Gbps network     10,000   ns   10 us
Read 4 KB randomly from SSD            150,000   ns  150 us  ~1 GB/sec SSD
Read 1 MB sequentially from memory     250,000   ns  250 us
Round trip within same datacenter      500,000   ns  500 us
Read 1 MB sequentially from SSD      1,000,000   ns    1 ms  ~1 GB/sec SSD, 4x memory
Disk seek                           10,000,000   ns   10 ms  20x datacenter roundtrip
Read 1 MB sequentially from 1 Gbps  10,000,000   ns   10 ms  40x memory, 10x SSD
Read 1 MB sequentially from disk    30,000,000   ns   30 ms  120x memory, 30x SSD
Send packet CA->Netherlands->CA    150,000,000   ns  150 ms

Notes:
1 ns = 10^-9 seconds
1 us = 10^-6 seconds = 1,000 ns
1 ms = 10^-3 seconds = 1,000 us = 1,000,000 ns
Handy metrics based on the numbers above (a worked example follows this list):
Read sequentially from disk at 30 MB/s
Read sequentially from 1 Gbps Ethernet at 100 MB/s
Read sequentially from SSD at 1 GB/s
Read sequentially from main memory at 4 GB/s
6-7 world-wide round trips per second
2,000 round trips per second within a data center
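As a worked example, here is the thumbnail estimate from the start of this appendix, done with the numbers above; the image count and size are assumptions for illustration.

# Back-of-the-envelope: read 100 images of ~1 MB each sequentially from disk.
images = 100
mb_per_image = 1
disk_seek_ms = 10                  # one seek per image
disk_read_ms_per_mb = 1000 / 30    # sequential disk read at 30 MB/s

total_ms = images * (disk_seek_ms + mb_per_image * disk_read_ms_per_mb)
print(round(total_ms / 1000, 1), "seconds")   # ~4.3 seconds of disk time, before resizing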
Latency numbers visualized