In addition to Weibo, there is also WeChat
Please pay attention
WeChat public account
Shulou
2025-02-25 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >
Share
Shulou(Shulou.com)06/01 Report--
Preface
These principles are summed up through actual combat.
There is a bloody lesson behind every principle.
These principles are mainly aimed at database developers, so we must pay attention to them in the development process.
I. Core principles
1. Try not to do operations in the database.
As the saying goes: don't let your toes think about things, that's the job of the brain.
As a database developer, we should let the database do more of what she is good at:
Try not to do complex operations in the database, move to the program side CPU, and apply MYSQL as simply as possible.
For example:
Try not to use such operation functions as md5 (), Order by Rand (), etc., in mysql.
two。 Try to control the amount of data in a single table
We all know that too much data in a single table will affect the efficiency of data query, and in serious cases, it will cause the whole library to get stuck.
In general, according to the estimated amount of data in a single table within one year:
Pure INT does not exceed 1000W and contains CHAR not more than 500W
At the same time, we should try our best to make a reasonable sub-table so that the amount of data in a single table is not overloaded. The common sub-table strategies are:
Sub-table through USERID (according to ID interval sub-table): more applications in the financial industry, large number of users, user characteristics obviously according to DATE sub-table (by day, week, month sub-table): there are many applications in the telecommunications industry, such as user online record table, user short message table, phone list, etc., according to AREA sub-table (province, city, distinction table) other
The main applicable scenarios for partition tables are:
The ① table is so large that it cannot all be stored in memory, or there is only hot data at the end of the table. The rest is historical data.
The data of the ② partition table is easier to maintain and can operate independently on independent partitions.
The data of ③ partition table can be distributed on different machines, thus using resources efficiently.
④ can use partitioned tables to avoid some special bottlenecks
⑤ can back up and restore separate partitions.
However, there are also some limitations in using partitioned tables, which should be noted when using them:
① a table can only have a maximum of 1024 partitions
In ② version 5.1, the partition table expression must be an integer, and column partitioning can be used in 5.5.The partition table expression must be an integer
If there is a primary key and unique index column in the ③ partition field, both the primary key column and the unique column must be included
Foreign key constraints cannot be used in ④ partitioned tables
⑤ needs to modify the structure of the existing table
All partitions of ⑥ must use the same storage engine
There are some restrictions on the functions and expressions that can be used in ⑦ partition functions
Some ⑧ storage engines do not support partitioning
⑨ cannot use load index into cache for MyISAM partitioned tables
⑩ for MyISAM tables, more file descriptors need to be opened when using partition tables.
3. Try to control the number of table fields
The number of fields in a single table cannot be too large. Optimize and adjust the table according to the business scenario, and adjust the number of table fields as few as possible. This has the following advantages:
IO efficient full table traversal table repair faster improve concurrent alter table faster
So how many fields are appropriate for a single table?
Evaluated according to the volume of 1G in a single table and the amount of data in 500W rows:
Sequential reading of 1G files requires no more than 50 pure INT fields and no more than 20 CHAR (10) fields in a single 200Byte single table in N seconds.
It is suggested that the upper limit of the number of fields in a single table should be controlled at 20,50.
4. Balanced Paradigm and redundancy
The design of database table structure is also balanced. In the past, we often said that we should strictly follow the three paradigms, so let's first talk about what the paradigm is:
The first paradigm: individual fields can no longer be divided. Uniqueness.
The second paradigm: there is no non-primary attribute that depends only on part of the primary key. Eliminate incomplete dependence.
The third paradigm: eliminate transitive dependence.
Summarize paradigms and redundancy in one sentence:
Redundancy is the exchange of storage for performance
The paradigm is performance in exchange for storage.
Therefore, redundancy is generally more popular in practical work.
When designing the model, the specific tradeoff between these two aspects should first be based on the computing power and storage resources provided by the enterprise.
Secondly, in the general Internet industry, data warehouses are implemented according to the Kimball pattern, and modeling is also task-driven, so the tradeoff between redundancy and paradigm meets the needs of the task.
For example, a piece of indicator data must be processed before 8: 00 in the morning, but the time window for calculation is very small. In order to minimize the time-consuming calculation of indicators, it is necessary to minimize the correlation of multiple tables in the process of calculation. More redundancy is needed in model design.
5. Reject 3B
The concurrency of database, like urban traffic, is growing nonlinearly.
This requires us to pay attention to the bottleneck under high concurrency when doing database development to prevent database paralysis caused by high concurrency.
The rejection 3B here refers to:
Large SQL (BIG SQL): to reduce large transactions (BIG Transaction) and large quantities (BIG Batch)
II. Principles of field category
1. Make good use of numeric field types
Three types of numeric values:
Integers: TINYINT (1Byte), TINYINT (1Byte), SMALLINT (2B), MEDIUMINT (3B), INT (4B), BIGINT (8B) floating-point types: FLOAT (4B), DOUBLE (8B) DECIMAL (MPerry D)
Use a few common examples to illustrate:
1) INT (1) VS INT (11)
Many people can not tell the difference between INT (1) and INT (11), you must be very curious, in fact, 1 and 11 only show the length of the difference, that is, no matter what the value of int (x) x is, the value range of the stored number or the value range of the data type of int itself, x is just the length of the data display.
2) BIGINT AUTO_INCREMENT
As we all know, signed int can support up to about 2.2 billion, which is much larger than our requirements and the performance limit that a single MySQL table can support. For OLTP applications, the size of a single table should generally be kept at the level of 10 million, and will not reach the upper limit of 2.2 billion. If you want to increase the reservation, you can change the primary key to unsigned int, the upper limit is 4.2 billion, this reservation is already very sufficient.
Using bigint will take up more disk and memory space, after all, memory space is limited, invalid occupation will lead to more data swapping in and out, additional pressure on IO, which is disadvantageous to performance.
Therefore, it is recommended to use the int unsigned type for self-increment primary keys, but bigint is not recommended.
3) DECIMAL (NJO)
When using the DECIMAL data type, the general number of decimal places will not be 0. If the decimal place is set to 0, it is recommended to use the INT type.
two。 Convert characters to numbers
Numeric VS string indexes have more advantages:
Query more efficiently, faster and take up less space
Example: use unsigned INT to store IP instead of CHAR (15)
INT UNSIGNED
You can use INET_ATON () and INET_NTOA () to convert between IP strings and numeric values
3. Give priority to ENUM or SET
For some enumerated data, we recommend using ENUM or SET first. Such scenarios are suitable:
1) string type
2) the possible value is known and limited
Storage:
1) ENUM occupies 1 byte and is converted to numerical operation
2) SET depends on the node and occupies a maximum of 8 bytes
3) when comparing, you need to add 'single quotation marks (even if it is numeric).
For example:
`sex` enum ('Flying Magazine M') COMMENT' gender'
`c1`enum ('0,1', 2', 3') COMMENT 'review'
4. Avoid using NULL fields
Why do we try to add NOT NULL DEFAULT''when designing database table fields? we have to talk about the disadvantages of using NULL fields:
It is difficult to optimize the query
NULL columns are indexed and extra space is required
Invalid composite index with NULL
For example:
1) `a`char (32) DEFAULT NULL [not recommended]
2) `b` int (10) NOT NULL [not recommended]
3) `c`int (10) NOT NULL DEFAULT 0 [recommended]
5. Use less and split TEXT/BLOB
TEXT type processing performance is much lower than that of VARCHAR.
Force the generation of hard disk temporary tables to waste more space VARCHAR (65535) = > 64K (pay attention to UTF-8)
Try not to use TEXT/BLOB data types
If business needs must be used, it is recommended to split it into separate tables
For example:
CREATE TABLE T1 (id INT NOT NULL AUTO_INCREMENT, data TEXT NOT NULL, PRIMARY KEY (id)) ENGINE=InnoDB
6. Do not store pictures in the database
First, the picture above:
It can be seen that if all the pictures are stored in the database, the size of the database will become larger and the reading and writing speed will be slower.
Disadvantages of storing pictures in database:
The speed of reading / writing to the database can never catch up with the speed of file system processing. Database backups become huge, and more and more time-consuming access to files needs to cross your application layer and database layer.
★ recommended treatment: save the image path in the database
Generate the path according to the year, month and day. Specifically according to the year, month, day or year to generate the path, according to their own needs (not necessarily according to the date to generate).
It is key to understand why it is scattered into multiple folders, and it is clear when it comes to one principle:
The operating system has a limit on the number of files in a single directory. When there are a large number of files. Getting files from the directory becomes slower and slower. Therefore, in order to maintain speed, it is necessary to disperse into multiple directories according to fixed rules.
The picture is scattered to the disk path. What is saved in the database field is "images/2012/09/25/ 1343287394783.jpg" similar to this.
The original uploaded image file name will be renamed and saved, such as generated according to the timestamp, 1343287394783. Jpg . This is to avoid duplicate file names, which will appear when multiple people upload pictures to the same directory.
Anyway, with what kind of rules to name the picture, as long as the picture name is unique.
For example, the concurrent visits to the website are large, and the generation of the directory is as detailed as possible. For example, accurate to an hour, an hour can be a folder. At the same time, two users are uploading pictures at the same time in 0.001 seconds (because then they will be saved in the same hour folder). Because the timestamp is accurate to seconds. In order to achieve the uniqueness of the picture name without overwriting, the generation can continue to add milliseconds after the timestamp, and so on. To sum up, the greater the number of concurrent visits. The more accurate it will be.
Digression:
1) Why the saved disk path is "images/2012/09/25/1343287394783.jpg" instead of "/ images/2012/09/25/1343287394783.jpg" (with a slash at the front)
When you need to take out the image path on the page to display the picture, if it is a relative path, you can use ". /" + "images/2012/09/25/1343287394783.jpg" to assemble.
If you need a separate domain name (such as when doing cdn acceleration), a domain name such as img1.xxx.com,img2.xxx.com
Direct assembly "http://img1.xxx.com/"+"images/2012/09/25/1343287394783.jpg""
2) Why the saved disk path is "images/2012/09/25/1343287394783.jpg" instead of "http://www.xxx.com/images/2012/09/25/1343287394783.jpg""
Here actually involves the knowledge of CDN, the specific knowledge of CDN is not much expanded here, in short:
Cdn services: ideal for static content. So like merchandise pictures, as the traffic increases, you just need to upload the pictures to their server by renting the cdn service.
Example: Beijing visits Changsha server, the distance is too far. I can put the product pictures on the cloud service in Beijing (I think the cloud storage provided to the website now is actually cdn, which provides streaming and nearby access to the website). In this way, when Beijing users visit, the pictures are actually obtained nearby. There is no need for long distance transmission.
Use a domain name, img.xxx.com, to load pictures. This domain name is resolved to the cloud service in Beijing.
Practice: "images/2012/09/25/1343287394783.jpg" is saved in the database
These images are not actually stored on the web server. Upload to the cdn server in Beijing.
I took it out of the database and directly "img.xxx.com/" + "images/2012/09/25/1343287394783.jpg"
For example, if there are more than one, name img1.xx.com or img2.xx.com.
Whatever you want. So if you save the domain name directly. It looks like a lot of trouble. The trouble of migration.
III. Principles of indexing
1. Add indexes cautiously and reasonably
The purpose of adding indexes is to improve the query. Adding indexes will slow down updating indexes. The more indexes the better, the less indexes can be added (comprehensive evaluation of data density and data distribution, preferably no more than 20% of the number of fields) combined with core SQL limited consideration of overwriting indexes.
Example: do not create an index on the gender column
Theoretical articles will tell you that fields with high repetition rates are not suitable for indexing. Not to mention that the gender field has only two values, netizens test that there are 26 possibilities for a field to use the first letter of Pinyin as a value. with the addition of the index, the speed of using the index is slower than that of not using the index.
Why is gender not suitable for indexing? Because you need to pay extra IO overhead to access the index, all you get from the index is the address. To actually access the data, you have to IO the table. If you want to fetch a few data from 1 million rows of data in the table, then using the index to locate quickly, the IO cost of accessing the index is very worthwhile. But if you are fetching 500000 rows of data from 1 million rows of data, such as the gender field, you need to access the index 500000 times and the table 500000 times, which adds up to no less overhead than a full scan of the table directly.
two。 Character fields must be indexed with a prefix
Degree of differentiation:
Single letter differentiation: 26
4 letter distinction: 26 "26" 26 "26 = 456976
5 letter distinction: 26 "26" 26 "26" 26 = 11881376
6-letter distinction: 26 "26" 26 "26" 26 = 308915776
Character fields must be indexed with a prefix, for example:
`pinyin` varchar, DEFAULT NULL COMMENT 'community pinyin`, KEY `idx_ pinyin` (`pinyin` (8)),) ENGINE=InnoDB
3. Do not do operations on index columns
There are two reasons:
1) will result in the inability to use the index
2) it will cause a full table scan
For example:
BAD SAMPLE:
Select * from table WHERE to_days (current_date)-to_days (date_col) = DATE_SUB ('2011-10-22 minute 10 DAY)
4. Self-incrementing column or global ID as INNODB primary key
Cluster index on primary key secondary index store primary key value primary key should not be updated insert value in self-increasing order avoid using string as primary key clustering index split recommend using business-independent AUTO_INCREMENT column or global ID generator as proxy primary key if the primary key is not specified, InnoDB will replace it with a unique and non-null index
5. Try not to use foreign keys
Try not to use foreign keys in online OLTP systems: foreign keys can save the amount of development and have additional overhead to "reach" other tables line by line, which means that locks are easy to deadlock when high concurrency
It is suggested that the constraint should be guaranteed by the program.
For example, we used to build a table sentence like this:
CREATE TABLE `user` (`user_ id` int (11) NOT NULL AUTO_INCREMENT COMMENT 'primary key', `user_ name` varchar (50) NOT NULL DEFAULT''COMMENT' username', PRIMARY KEY (`user_ id`) ENGINE=InnoDB DEFAULT CHARSET=utf8 CREATE TABLE `order` (`id`int (11) NOT NULL AUTO_INCREMENT COMMENT 'key', `total_ price` decimal (10piag2) NOT NULL DEFAULT '0.005), `user_ id` int (11) NOT NULL DEFAULT' 0keys, PRIMARY KEY (`id`), KEY `for_indx_user_ id` (`user_ id`), CONSTRAINT `for_indx_user_ id` FOREIGN KEY (`user_ id`) REFERENCES `user` (`user_ id`) ENGINE=InnoDB DEFAULT CHARSET=utf8
After not using foreign key constraints:
CREATE TABLE `user` (`user_ id` int (11) NOT NULL AUTO_INCREMENT COMMENT 'primary key', `user_ name` varchar (50) NOT NULL DEFAULT''COMMENT' username', PRIMARY KEY (`user_ id`)) ENGINE=InnoDB DEFAULT CHARSET=utf8; CREATE TABLE `order` (`id`int (11) NOT NULL AUTO_INCREMENT COMMENT 'primary key', `total_ price` decimal (10pp2) NOT NULL DEFAULT '0.00key, `user_ id` int (11) NOT NULL DEFAULT' 0mm, PRIMARY KEY (`id`) ENGINE=InnoDB DEFAULT CHARSET=utf8
When foreign key constraints are not applied, we usually add an index to fields that do not establish foreign key constraints in order to speed up the query.
CREATE TABLE `order` (`id` int (11) NOT NULL AUTO_INCREMENT COMMENT 'primary key', `total_ price` decimal (10piag2) NOT NULL DEFAULT '0.005, `user_ id` int (11) NOT NULL DEFAULT' 0keys, PRIMARY KEY (`id`), KEY `idx_user_ id` (`user_ id`),) ENGINE=InnoDB DEFAULT CHARSET=utf8
In actual development, foreign key constraints are generally not established.
IV. SQL principles
The 1.SQL statement is as simple as possible
In the development process, we try to keep the SQL statement simple. Let's compare the large SQL with several simple SQL.
The traditional design idea BUG MySQL NOT a SQL can only be in a high concurrency of 5000 + QPS of a CPU operation. A SQL of 1 second means? Maybe one big SQL will block the whole database.
Reject big SQL and disassemble it into several simple SQL
Simple SQL cache has a higher hit rate and reduces table locking time, especially when MyISAM uses multiple CPU
two。 Keep transactions (connections) short
Transaction / connection use principle: open and use, turn off and turn off transaction-independent operations are put outside the transaction, reduce the occupation of lock resources without breaking consistency, use multiple short transactions instead of long transactions
For example:
1) waiting for uploading pictures when posting
2) A large number of sleep connections
3. Avoid using SP/TRIG/FUNC whenever possible
In online OLTP systems, we should:
Use as few stored procedures as possible and use as few triggers as possible to reduce the use of MySQL functions to process the results
Leave all these things to the client program.
4. Try not to use SELECT *
When using SELECT *, more CPU, memory, IO and network bandwidth will be consumed
When writing query statements, we should try not to use SELECT * and only take the data columns we need:
More secure design: reducing the impact of table changes makes it possible to use covering index to reduce hard disk temporary table generation, especially with TEXT/BLOB
For example:
Not recommended:
SELECT * FROM tagWHERE id = 999148
Recommended:
SELECT keyword FROM tagWHERE id = 999148
5. Rewrite OR to IN ()
Same field, rewrite or to in ()
OR efficiency: O (n)
IN efficiency: O (Log n)
When n is very large, OR will be much slower
Pay attention to control the number of IN. It is recommended that n be less than 200.
For example:
Not recommended:
Select * from opp WHERE phone='12347856' or phone='42242233'
Recommended:
Select * from opp WHERE phone in ('12347856', '42242233')
6. Rewrite OR to UNION
Different fields, change or to union
Reduce the number of "or" queries on different fields. Merge index is often mentally retarded if you are confident enough: set global optimizer_switch='index_merge=off'
For example:
Not recommended:
Select * from opp WHERE phone='010-88886666'or cellPhone='13800138000'
Recommended:
Select * from opp WHERE phone='010-88886666 'union Select * from opp WHERE cellPhone='13800138000'
7. Avoid negative queries and% prefix fuzzy queries
In actual development, we should try our best to avoid negative queries. What is negative queries? there are mainly the following:
NOT,! =,!, NOT EXISTS, NOT IN, NOT LIKE, etc.
At the same time, we also need to avoid% prefix fuzzy queries, because this will use B + Tree, at the same time, will not be able to use the index, and will lead to full table scanning, performance and efficiency can be imagined
For example:
8. Reduce COUNT (*)
We often use COUNT (*) in development, but we don't realize that this usage will cause a lot of waste of resources, because COUNT (*) is expensive, so we don't have to use it as little as possible.
For counting class statistics, we recommend:
Real-time statistics: use memcache, two-way update, early morning running benchmark non-real-time statistics: try to use separate statistical tables, recalculate regularly
Compare COUNT (*) with several other COUNT:
`id` int (10) NOT NULL AUTO_INCREMENT COMMENT 'id', `sale_ id` int (10) unsigned DEFAULT NULL
Conclusion:
COUNT (*) = COUNT (1) COUNT (0) = COUNT (1) COUNT (1) = COUNT (100) COUNT (*)! = COUNT (col)
9.LIMIT efficient paging
Traditional pagination:
Select * from table limit 10000 no. 10
LIMIT principle:
The larger the 10 offset of Limit 10000, the slower it is.
Recommended pagination:
Select * from table WHERE id > = 23423 limit 11; # 10 limit 1 (10 entries per page) select * from table WHERE id > = 23434 limit 11
Paging method 2:
Select * from table WHERE id > = (select id from table limit 10000 limit 1) limit 10
Paging method 3:
SELECT * FROM table INNER JOIN (SELECT id FROM table LIMIT 10000 Magazine 10) USING (id)
Paging mode 4:
# first use the program to obtain ID:select id from table limit 10000 from table WHERE id in # and then use in to obtain the corresponding record Select * from table WHERE id in (123456 …)
It is necessary to analyze and reorganize the index according to the actual scenario.
Example:
10. Use UNION ALL instead of UNION
If there is no need to deduplicate the results, but only to jointly query and display multiple tables, use UNION ALL, because UNION has deduplication overhead
For example:
MySQL > SELECT * FROM detail20091128 UNION ALL SELECT * FROM detail20110427 UNION ALL SELECT * FROM detail20110426 UNION ALL SELECT * FROM detail20110425 UNION ALL SELECT * FROM detail20110424 UNION ALL SELECT * FROM detail20110423
11. Decompose the connection to ensure high concurrency
High concurrency DB does not recommend JOIN with more than two tables
Properly decompose the join to ensure high concurrency:
Cacheable large amount of early data uses multiple MyISAM tables for small ID IN () joins of large tables to reference the same table multiple times
For example:
Original SQL:
MySQL > Select * from tag JOIN tag_post on tag_post.tag_id=tag.id JOIN post on tag_post.post_id=post.id WHERE tag.tag=' used Toys'
Decompose SQL:
MySQL > Select * from tag WHERE tag=' used Toys'; MySQL > Select * from tag_post WHERE tag_id=1321; MySQL > Select * from post WHERE post.id in (123456314141)
12.GROUP BY removal sort
Grouping and automatic sorting can be achieved by using GROUP BY
No sorting required: Order by NULL
Specific sort: Group by DESC/ASC
For example:
13. Compare column values of the same data type
Principle: number to number, character to character
Comparison of numeric columns and character types: conversion to double precision at the same time
Comparison of character columns with numeric types: character columns convert entire columns to numeric values, without using index queries
For example:
Field: `remark` varchar (50) NOT NULL COMMENT 'remarks, default is empty'
MySQL > SELECT `id`, `gift_ code` FROM gift WHERE `deal_ id` = 640 AND remark=115127; 1 row in set (0.14 sec) MySQL > SELECT `id`, `gift_ code`FROM pool_gift WHERE `deal_ id` = 640 AND remark='115127'; 1 row in set (0.005 sec)
14.Load data derived data
Fast import of bulk data:
Batch loading is faster than single-line loading. There is no need to load Insert values,values faster than index loading every time the cache is refreshed. Values reduces index refresh Load data is about 20 times faster than insert.
Try not to use INSERT... SELECT, one is that there is delay, and the other is that synchronization errors will occur.
15. Break up mass updates
Operate in the early hours of the morning as far as possible to avoid the peak in the early hours of the morning. The default is 100 messages / second (special reconsideration).
For example:
Update post set tag=1 WHERE id in; sleep 0.01; update post set tag=1 WHERE id in; sleep 0.01;
16.Know Every SQL
As DBA and even database developers, we must have a good understanding of every SQL in the database. Common commands are:
SHOW PROFILEMYSQLslaMySQLdumpslowexplainShow Slow LogShow ProcesslistSHOW QUERY_RESPONSE_TIME (Percona)
V. principles of the agreed category
1. Isolate online and offline
Build the ecological environment of the database to ensure the development of wireless access to the database.
Principle: online connection, offline connection
Pro library for production data pre library for pre-production environment test library for testing and testing dev library for development
two。 Prohibit subqueries that have not been confirmed by DBA
In most cases, the optimization is poor. In particular, subqueries using IN id in WHERE can generally be rewritten with JOIN.
For example:
MySQL > select * from table1 where id in (select id from table2); MySQL > insert into table1 (select * from table2); / / may cause replication exception
3. Never explicitly lock on the program side
It is extremely difficult to debug and troubleshoot when the external lock is not controllable to the database.
For consistency issues such as concurrent deductions, we use transactions to deal with them, and check conflicts before Commit.
4. The unified character set is UTF8
5. Unified naming convention
1) Database tables and other names are uniformly lowercase
2) Index naming defaults to "idx_ field name"
3) the name of the library should be abbreviated to 2-7 letters as far as possible
DataSharing = = > ds
4) be careful to avoid naming with reserved words
It is recommended that database developers keep all of the above holes in mind. I hope it will be helpful to your study, and I also hope you will support it.
Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.
Views: 0
*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.
Continue with the installation of the previous hadoop.First, install zookooper1. Decompress zookoope
"Every 5-10 years, there's a rare product, a really special, very unusual product that's the most un
© 2024 shulou.com SLNews company. All rights reserved.