
The MySQL Optimization Principles I Have to Tell You About, Part 2


Readers who finished the previous article on MySQL may remember the two very open questions at its end; if you were interested, you may already have thought them over. This article tries to answer both, and hopefully gives you some useful reference. Start by considering one question: what if you have chosen field types appropriate to your business, carefully designed the tables and indexes, and reviewed all the SQL to confirm there is no problem, yet performance still does not meet your requirements? Are there other optimization strategies? The answer is yes. Let's continue by discussing some of MySQL's common advanced features and how they work.

Partition tables

Proper use of indexes can greatly improve MySQL's query performance, but once the amount of data in a single table reaches a certain level, indexes alone stop helping: unless the index covers the query, looking rows back up in the table generates a large amount of random I/O, and the database's response time may become unacceptable. Meanwhile, the cost of index maintenance (disk space, I/O operations) can also be very high.

Therefore, when the data volume of a single table reaches a certain level (in the MySQL 4.x era, the commonly cited performance inflection point for the MyISAM storage engine was about 5 million rows; in the MySQL 5.x era it is roughly 10 to 20 million rows, though this must be tested against your actual workload), the most common way to improve performance is to split the table. The split can be vertical (for example, putting orders with different order states into different tables) or horizontal (for example, splitting orders into different tables by month). Generally speaking, splitting tables solves the large-data-volume problem from the business point of view: it improves performance to some degree, but it also greatly increases coding complexity, as anyone who has been through it can attest.

Splitting tables in the business layer greatly increases coding complexity, and the database-handling code ends up scattered throughout the application, making it hard to maintain. Can the table-splitting logic be abstracted and handled uniformly, so that the business layer does not care whether the underlying table is split and only needs to focus on the business? The answer, of course, is yes: there are many database middleware products that hide the details of the split, letting the business layer query split data as if it were a single table. And if you push this abstraction down into the database layer itself, you get the partition table we are going to talk about today.

Partitioning can be seen as a way to solve the large-data-volume problem at the technical level; put simply, you can think of it as MySQL splitting the table for us at the bottom layer. A partitioned table is a single logical table backed by multiple physical subtables. The storage engine manages the underlying tables of a partition the same way it manages ordinary tables (all underlying tables must use the same storage engine), and an index on the partitioned table simply adds an identical index to each underlying table. From the storage engine's point of view, the underlying tables are no different from ordinary tables, and the engine does not even need to know whether it is handling an ordinary table or part of a partitioned one. When executing a query, the optimizer filters out the partitions that cannot contain the data we need according to the partition definition, so the query does not scan all partitions, only those that may contain the needed data.

To better understand partitioned tables, let's start with an example: an order table holding about 10 TB of data. How should it be designed for performance?

First of all, given the huge amount of data, a full table scan is certainly out of the question. And if you rely on an ordinary index, you will find that the data is not clustered the way you want: queries produce a large amount of fragmentation, ultimately generating thousands of random I/O operations per query, and the application will simply freeze. So you need a coarser-grained, cheaper way to retrieve data: first locate a large chunk of data by the index, then scan that chunk sequentially.

This is exactly what partitions do. You can think of a partition as a very coarse first level of the index: it locates the "area" you need at very low cost. Within that area you can scan sequentially, build indexes, or cache the data in memory. Partitions are cheap because they need no extra data structure to record what each partition holds; a simple expression is enough to describe which data each partition stores.

To partition a table, use a statement like the following when creating it:

CREATE TABLE sales (
    order_date DATETIME NOT NULL
    -- other columns omitted
) ENGINE=InnoDB
PARTITION BY RANGE (YEAR(order_date)) (
    PARTITION p_2014 VALUES LESS THAN (2014),
    PARTITION p_2015 VALUES LESS THAN (2015),
    PARTITION p_2016 VALUES LESS THAN (2016),
    PARTITION p_2017 VALUES LESS THAN (2017),
    PARTITION p_catchall VALUES LESS THAN MAXVALUE
);

Various functions can be used in the partitioning clause, but the expression must return a deterministic integer and must not be a constant. MySQL also supports other partition types, such as KEY, HASH, and LIST partitioning, but they are rarely seen in production. Since MySQL 5.5 you can also use RANGE COLUMNS partitioning, so that even partitioning by time no longer requires converting the value to an integer.
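For example, here is a minimal sketch of the same table defined with RANGE COLUMNS, partitioning directly on the date column without YEAR() (the table name sales_rc is hypothetical):

CREATE TABLE sales_rc (
    order_date DATETIME NOT NULL
    -- other columns omitted
) ENGINE=InnoDB
PARTITION BY RANGE COLUMNS (order_date) (
    PARTITION p_2014 VALUES LESS THAN ('2014-01-01'),
    PARTITION p_2015 VALUES LESS THAN ('2015-01-01'),
    PARTITION p_catchall VALUES LESS THAN (MAXVALUE)
);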

Next, let's take a brief look at how the main operations behave on a partitioned table:

SELECT: when a partitioned table is queried, the partition layer first opens and locks all the underlying tables; the optimizer then determines whether some partitions can be filtered out, and finally the corresponding storage engine interface is called to access the data in each remaining partition.

INSERT: when a record is inserted, the partition layer first opens and locks all the underlying tables, determines which partition should receive the record, and writes the record to the corresponding underlying table. DELETE works similarly.

UPDATE: when a record is updated, the partition layer first opens and locks all the underlying tables, determines which partition holds the record, fetches and updates it, determines which partition the updated record belongs in, writes it to that underlying table, and finally deletes the record from the underlying table where the original data lived.

Some operations support condition-based filtering. For example, when deleting a record MySQL must first find it, and if the WHERE condition happens to match the partition expression, all partitions that cannot contain the record can be filtered out; the same applies to UPDATE statements. An INSERT hits exactly one partition, and all the others are filtered out.

Although every operation "first opens and locks all the underlying tables", this does not mean the partitioned table holds a table lock for the whole operation. If the storage engine implements its own row-level locks, as InnoDB does, the corresponding table locks are released at the partition layer. This lock-and-release process is similar to a query against an ordinary InnoDB table.

When using partitioned tables, there are generally two strategies for keeping large data volumes scalable:

Scan all the data, without indexes. As long as the WHERE conditions limit the data to be queried to a few partitions, scanning those partitions is efficient enough.

Index the data and isolate hotspots. If the data has obvious "hot spots" and the rest of it is rarely accessed, you can store the hot data in a separate partition so it has a chance to stay cached in memory. Queries then touch only a small partitioned table, can use its indexes, and can make effective use of the cache.

The advantage of a partitioned table is that the optimizer can filter out partitions based on the partition function, so it is very important to include the partition column in the WHERE condition, even when it looks redundant, so the optimizer can skip partitions that do not need to be accessed. Without such conditions, MySQL makes the storage engine access every partition of the table, which can be very slow if the table is large.
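You can verify that pruning happens with EXPLAIN PARTITIONS; a minimal sketch against the sales table above (EXPLAIN PARTITIONS is the 5.1-5.6 syntax; from MySQL 5.7 on, plain EXPLAIN prints the partitions column by default):

EXPLAIN PARTITIONS
SELECT * FROM sales
WHERE order_date BETWEEN '2016-01-01' AND '2016-12-31';
-- the partitions column should list only p_2017, the partition that
-- holds rows with YEAR(order_date) = 2016 (VALUES LESS THAN (2017))

If the WHERE clause omits order_date, the same column lists every partition of the table.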

The two partitioning strategies above rest on two crucial assumptions: that the query can filter out many partitions, and that partitioning itself does not add much extra cost. Both assumptions break down in some scenarios, for example:

1. NULL values defeat partition filtering

Suppose a table is partitioned with PARTITION BY RANGE (YEAR(order_date)). A record whose order_date is NULL or an illegal value is stored in the first partition, because YEAR() returns NULL when it receives an invalid value. So a query such as WHERE order_date BETWEEN '2017-05-01' AND '2017-05-31' checks two partitions, not just the 2017 partition we expect: the first partition is checked as well. If the first partition holds a lot of data and a full-scan strategy is used, the cost can be severe. To avoid this, we can create a dummy partition, such as PARTITION p_null VALUES LESS THAN (0); if all inserted data is valid, the first partition stays empty.

This trick is unnecessary from MySQL 5.5 on, because you can partition on the column itself instead of a function of the column: PARTITION BY RANGE COLUMNS (order_date). Using this syntax avoids the problem entirely.

2. The partition column and the index column do not match

When partition columns and index columns do not match, queries may not be able to prune partitions unless every query condition includes the partition column. Suppose an index is defined on column a and the table is partitioned on column b. Because each partition has its own independent index, scanning the index on column a requires scanning it inside every partition. That is not hopelessly slow, but it is certainly better to be able to skip mismatched partitions. The problem looks easy to avoid, but one situation deserves attention: joins. If the partitioned table is the second table in the join order, and the index used by the join does not match the partition criterion, then every qualifying row from the first table triggers an access to, and a search of, all partitions of the second table (for how joins execute, refer to the previous article).

3. The cost of selecting a partition can be high

There are many partition types, implemented in different ways, so their performance differs, and RANGE partitions deserve special mention: to determine which partition a row belongs to, the server scans the list of partition definitions linearly, which is not efficient, so the cost grows as the number of partitions grows. Bulk inserts suffer in particular, because every record must be matched to its partition before being inserted; with too many partitions, insert performance drops sharply. So it is necessary to limit the number of partitions, though there is no need to worry too much: for most systems, around 100 partitions is fine.

4. The cost of opening and locking all the underlying tables can sometimes be high

As mentioned earlier, opening and locking all the underlying tables is usually not a big performance hit, but in some cases, such as a simple primary-key lookup, the cost of opening and locking can be noticeable relative to the cost of the lookup itself.

5. The cost of maintaining partitions can be high

Adding and dropping partitions is fast, but reorganizing a partition causes data to be copied: much as ALTER TABLE works, a new temporary partition is created, the data is copied into it, and the old partition is deleted. Therefore it is a very good habit, when designing the database, to create partitioned tables sensibly with business growth in mind. From MySQL 5.6 on you can use ALTER TABLE ... EXCHANGE PARTITION to swap a partition with a table, which improves this kind of maintenance dramatically.
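A minimal sketch of the 5.6+ syntax, assuming a hypothetical archive table sales_2014 that has exactly the same structure as sales and is not itself partitioned:

-- swap the rows of partition p_2014 with the rows of the standalone table
ALTER TABLE sales EXCHANGE PARTITION p_2014 WITH TABLE sales_2014;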

Partitioned tables have other limitations too: for example, all the underlying tables must use the same storage engine, and some storage engines do not support partitioning at all. Partitioning also lives on a single server, and a server's physical resources are finite; once data grows past that limit, performance may be poor even with partitioning, and splitting across databases (sharding) becomes necessary. But whether it is partitioning, splitting across databases, or splitting tables, the underlying idea is the same, and it is worth internalizing.

Views

For complex queries spanning several tables, a view can sometimes greatly simplify the problem, so views show up in many situations. But is a view really as simple as we think? How does it differ from a SQL statement that uses JOIN directly? How much do you know about what happens behind a view?

A view is a virtual table that stores no data itself; the result set of a query against a view is produced from other tables. MySQL implements views with two algorithms: the temporary table algorithm (TEMPTABLE) and the merge algorithm (MERGE). The temporary table algorithm stores the result of the view's SELECT statement in a temporary table and accesses that temporary table whenever the view is accessed. The merge algorithm rewrites the query that references the view, merging the SQL that defines the view directly into the query's SQL. Two simple examples show the difference. First, create the following view:

-- the view returns unpaid orders
CREATE VIEW unpay_order AS
SELECT * FROM sales WHERE status = 'new'
WITH CHECK OPTION;  -- WITH CHECK OPTION is discussed below

To find the unpaid orders whose buyer is csc, you can use the following query:

-- query unpaid orders whose buyer is csc
SELECT order_id, order_amount, buyer FROM unpay_order WHERE buyer = 'csc';

Use temporary tables to simulate the view:

CREATE TEMPORARY TABLE tmp_order_unpay AS
SELECT * FROM sales WHERE status = 'new';
SELECT order_id, order_amount, buyer FROM tmp_order_unpay WHERE buyer = 'csc';

Use the merge algorithm to merge the view-defined SQL into the query SQL:

SELECT order_id, order_amount, buyer FROM sales WHERE status = 'new' AND buyer = 'csc';

MySQL allows views to be nested, that is, one view can be defined on top of another. You can run SHOW WARNINGS after EXPLAIN EXTENDED to see the rewritten form of a query that uses a view. A view implemented with the temporary table algorithm shows up as a derived table (DERIVED) in EXPLAIN. Note that in this case EXPLAIN must actually execute and materialize the temporary table, so it may be slow.
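A minimal sketch of that inspection, using the unpay_order view from above (EXPLAIN EXTENDED is the pre-5.7 syntax; newer versions carry the same information in plain EXPLAIN):

EXPLAIN EXTENDED
SELECT order_id, order_amount, buyer FROM unpay_order WHERE buyer = 'csc';
SHOW WARNINGS;  -- for a MERGE view, the rewritten SQL references sales directly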

Obviously, a temporary table has no indexes on it, and the optimizer can do little to optimize queries against it, so using the merge algorithm wherever possible gives better performance. Which raises the question: if the merge algorithm (essentially a direct query) performs better, why use views at all?

First, views simplify the application layer above them, letting the application focus on the data it cares about. Second, views can protect sensitive data: defining different views for different users keeps sensitive data off the views of users who should not see it, and views can implement column-level access control without actually creating column privileges in the database. In addition, views ease operations and maintenance, for example during schema refactoring: a view over the new structure lets application code keep running, without errors, while the underlying tables are being changed.

So the case for views rests on business needs and maintenance cost; by themselves they do little for performance (note: this applies to MySQL; views in other relational databases can perform better, for example both Oracle and MS SQL Server support materialized views, which outperform MySQL views). Worse, a view implemented with the temporary table algorithm can perform very badly in some cases, such as:

-- the view computes daily expenditure; note DATE('2017-06-15 12:00:23') = '2017-06-15'
CREATE VIEW cost_per_day AS
SELECT DATE(create_time) AS date, SUM(cost) AS cost
FROM costs
GROUP BY date;

To report daily income against expenditure, assuming a similar sale_per_day view over the income table, you can use the following SQL:

SELECT c.date, c.cost, s.amount
FROM cost_per_day AS c
JOIN sale_per_day AS s USING (date)
WHERE date BETWEEN '2017-06-01' AND '2017-06-30';

For this query, MySQL first executes the view's SQL to materialize a temporary table, and then joins sale_per_day against that temporary table. The BETWEEN condition in the WHERE clause cannot be pushed down into the view, so when the temporary table is built it contains all the data, not just one month's worth, and the temporary table has no indexes.

Of course, the temporary table in this example will not be too large, since the number of distinct dates is limited, but you still need to consider the cost of generating it (if the costs table is very large, the GROUP BY may be slow). Indexes are not a problem here either: as we know from the previous article, if MySQL puts the temporary table first in the join order, the index on sale_per_day can still be used. However, if two such views are joined, the optimizer has no index to use at all, so you must test rigorously whether performance meets the application's requirements.

We rarely update views in real business scenarios, which leaves the impression that views cannot be updated. In fact, under some conditions views are updatable. An updatable view is one through which the underlying tables can be modified: with the right conditions you can UPDATE, DELETE, and even INSERT through the view. From the discussion above it is not hard to infer that updating a view really means updating the table behind it, with the WHERE clause that defines the view becoming part of the WHERE clause of the UPDATE statement. Only views using the merge algorithm can be updated, and the updated columns must come from a single table. Look back at the view definition above and the clause WITH CHECK OPTION: it requires that any row updated through the view must still satisfy the view's own WHERE condition, and you cannot update columns outside those the view defines; otherwise a "CHECK OPTION failed" error is thrown.
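A minimal sketch of the effect, assuming the unpay_order view defined earlier and an existing order with order_id = 1:

-- allowed: the row still satisfies status = 'new' after the update
UPDATE unpay_order SET order_amount = 100 WHERE order_id = 1;

-- rejected: the updated row would no longer match the view's WHERE condition,
-- so MySQL raises a CHECK OPTION failed error
UPDATE unpay_order SET status = 'paid' WHERE order_id = 1;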

Another common misconception about views: "for simple queries the view uses the merge algorithm, and for more complex queries it uses the temporary table algorithm." In fact, the algorithm is a property of the view itself and has nothing to do with the SQL that happens to run against the view. So when does a view use the temporary table algorithm rather than merge? Broadly, MySQL uses the temporary table algorithm whenever the rows of the view cannot be mapped one-to-one onto rows of the original table. For example, a view whose defining SQL contains GROUP BY, DISTINCT, UNION, aggregate functions, or subqueries uses the temporary table algorithm (these rules may change in future versions; consult the official manual for details).

Compared with views in other relational databases, MySQL views are much weaker. Oracle and MS SQL Server, for example, both support materialized views: the view's result set is stored in a queryable table that is periodically refreshed from the original tables. Like an ordinary physical table, such a table can have indexes, primary key constraints, and so on, and performance improves qualitatively compared with a temporary table. Unfortunately, MySQL currently supports neither materialized views nor indexes on views.

Stored procedures and triggers

Coming back to the second question: many people, when sharing experience, offer this opinion: avoid stored procedures, they are hard to maintain, they increase the cost of use, and business logic belongs in the application. If the client can do these things, why use stored procedures at all?

If you look deeply into stored procedures, you will find they are not as bad as described. I have worked on products that used stored procedures heavily. How heavily? Put it this way: the upper application handled little more than interaction logic and visual effects, while all the business logic, even parameter validation, lived in stored procedures. At one point there was an enormous stored procedure whose file size reached an astonishing 80 KB; you can imagine how complex its business logic was. To most people this architecture looks somewhat unreasonable, yet the product was in fact very successful.

Its success was due in part to the advantages of stored procedures. Because the application layer contained no business-specific code, bugs could be fixed and new features developed very quickly without touching the front-end presentation. The product had to be deployed in customers' private environments, where responding quickly to customer needs is especially important; thanks to this architecture, whenever a customer hit a problem or raised a new requirement we could respond rapidly, in extreme cases fixing the customer's problem within an hour. That rapid-response capability won us a large number of customers.

Stored procedures have other advantages too: the procedure code is easy to encrypt, so you need not worry about source leakage when deploying to a private environment; stored procedures can be debugged like other programs; permissions can be set on them to keep data secure; and so on. All of this sounds wonderful, but our product was built on MS SQL Server, where complex business logic is easy to express in T-SQL. Think of T-SQL as a programming language with all of SQL's capabilities plus flow control, batch processing, scheduled tasks, and more; you can even parse XML with it. For more about T-SQL, see MSDN. Among mainstream relational databases only MS SQL Server supports T-SQL, so MySQL lacks some of the capabilities described above; for example, debugging MySQL stored procedures is quite inconvenient (though paid tools offer decent support).

In addition, MySQL stored procedures have some other limitations:

The optimizer cannot evaluate the execution cost of the stored procedure

Each connection has its own stored procedure execution plan cache; if many connections call the same procedure, cache space is wasted on duplicate plans.

Therefore, leaning on stored procedures is not a good strategy in MySQL, especially in big-data, high-concurrency scenarios, where keeping complex logic in the upper application makes it easy to scale out on existing resources and gain computing power. A familiar programming language is also more readable and more flexible than a stored procedure. Still, in some scenarios, if a stored procedure is much faster than the alternatives and the operation is small, it is reasonable to use one.
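For illustration, a minimal sketch of such a small operation over the sales table used throughout this article (the procedure name and parameter are hypothetical):

DELIMITER //
CREATE PROCEDURE get_unpaid_orders(IN p_buyer VARCHAR(32))
BEGIN
    -- a small, self-contained lookup: a buyer's unpaid orders
    SELECT order_no, order_amount
    FROM sales
    WHERE status = 'new' AND buyer = p_buyer;
END //
DELIMITER ;

CALL get_unpaid_orders('csc');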

Alongside stored procedures there are triggers, which let you run specific actions when an INSERT, UPDATE, or DELETE executes; in MySQL you choose whether the trigger fires BEFORE or AFTER the row change. Triggers are typically used to enforce constraints that would complicate business code if implemented in the application, and they can reduce round trips between client and server. MySQL's trigger implementation is very simple and correspondingly limited in functionality, so if you already depend heavily on triggers in another database product, be careful: MySQL triggers may not behave or perform the way you expect.

First, at most one trigger can be defined per event per table, and only row-based triggers are supported, meaning a trigger always fires per record, not once per SQL statement; a bulk update can therefore be very inefficient. Second, triggers can obscure what the server is really doing: a simple-looking SQL statement may hide a great deal of invisible work. They are also hard to debug when something goes wrong. Finally, triggers do not necessarily guarantee atomicity: under the MyISAM engine a failed trigger cannot be rolled back, whereas triggers on InnoDB tables execute within the same transaction as the original operation, so the two fail or succeed together.

Despite all these limitations, triggers have their uses; for example, they are very convenient when you need to log changes to MySQL data.
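A minimal audit-log sketch, assuming a hypothetical sales_audit table created in advance for this purpose:

CREATE TRIGGER sales_after_update
AFTER UPDATE ON sales
FOR EACH ROW
    -- record the old and new amounts every time an order row changes
    INSERT INTO sales_audit (order_no, old_amount, new_amount, changed_at)
    VALUES (OLD.order_no, OLD.order_amount, NEW.order_amount, NOW());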

Foreign key constraints

In most Internet projects today, especially in big-data scenarios, foreign keys are no longer recommended, mainly because of what they cost:

Foreign keys usually force a lookup in another table every time data is modified. The InnoDB storage engine requires an index for a foreign key, but with large data volumes the overhead of foreign key checks still cannot be ignored, especially when the foreign key has low selectivity, which produces a very large, low-selectivity index.

When you insert a record into a child table, the foreign key constraint makes InnoDB check the corresponding record in the parent table, which requires locking that parent record to ensure it is not deleted before the transaction completes. This can cause extra lock waits and even deadlocks.

In high-concurrency scenarios the database easily becomes the bottleneck, so the natural wish is to scale the database horizontally. That pushes data consistency control up into the application layer, where application servers can absorb the load, and foreign keys at the database level can no longer be used.

Therefore, when database performance is not a pressing concern, for example in internal projects or traditional-industry projects (with a limited number of users and generally modest data volumes), foreign keys are a good choice: if you want related tables to always hold consistent data, a foreign key is far easier than checking consistency in the application, and deleting or updating related data through foreign keys is also more efficient than doing it in the application.
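A minimal sketch, assuming a hypothetical order_items child table referencing the sales table from the earlier examples:

CREATE TABLE order_items (
    item_id  INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    order_no VARCHAR(32) NOT NULL,
    quantity INT NOT NULL DEFAULT 1,
    -- deleting an order automatically deletes its items
    FOREIGN KEY (order_no) REFERENCES sales (order_no)
        ON DELETE CASCADE
) ENGINE=InnoDB;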

Bind variables

The term "bind variable" may look unfamiliar; you probably know it by another name: prepared statement. SQL with bind variables uses question marks as placeholders for parameters, and when the query is actually executed, concrete values take the place of the question marks. For example:

SELECT order_no, order_amount FROM sales WHERE order_status = ? AND buyer = ?;

Why use bind variables? The well-known reason is that precompilation reduces the risk of SQL injection. What else?

When a bind-variable SQL statement is created, the client sends the prototype of the statement to the server. The server parses this statement skeleton, stores part of its execution plan, and returns a statement handle to the client. From then on, the client executes a specific query by sending the handle together with a value for each question mark. This makes executing the same statement many times much more efficient, because:

The server only needs to parse the SQL statement once

Some of the optimizer's work happens only once, because MySQL caches part of the execution plan

Only the parameters, not the whole statement, are sent over the wire, so network overhead is lower; and sending parameters and the handle in binary is more efficient than sending ASCII text.

Note that MySQL does not always cache the execution plan: if part of the plan must be computed from the actual parameter values, that part cannot be cached. For example:

-- pretend there is an example here; you can think of one yourself.

The biggest trap with bind variables is that you may know how they are supposed to work without knowing how they are actually implemented. It is sometimes hard to tell the following three flavors of bind variables apart:

Client-side emulated bind variables: the client driver receives SQL with placeholders, substitutes the parameter values itself, and sends the complete query text to the server.

Server-side bind variables: the client uses a special binary protocol to send the parameterized SQL statement to the server, then sends the concrete parameter values, also over the binary protocol, for execution.

SQL-interface bind variables: the client first sends a parameterized statement to the server in the form of a PREPARE statement, then sends the parameter values with SET, and finally issues EXECUTE to run it, all over the ordinary text protocol (a sketch follows below).
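A minimal sketch of that SQL interface, reusing the query from the start of this section:

PREPARE stmt FROM
    'SELECT order_no, order_amount FROM sales WHERE order_status = ? AND buyer = ?';
SET @status = 'new', @buyer = 'csc';
EXECUTE stmt USING @status, @buyer;
DEALLOCATE PREPARE stmt;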

For example, some JDBC drivers that do not support precompilation do not send the SQL to the database for preprocessing when connection.prepareStatement(sql) is called; instead they wait until executeQuery is invoked and send the whole statement to the server, which is the first flavor above. So when using bind variables in a program, make sure you understand how your driver implements them. By extension, for any framework or open-source tool you use, do not stop at the surface: take the time to understand its principles and implementation, or you may be misled without ever knowing it.

User-defined functions

MySQL ships many built-in functions, such as SUM, COUNT, and AVG, but real applications often need more. In most cases the extra power is implemented at the application layer, but MySQL also offers a way to extend the server itself: the user-defined function (UDF). Note how a UDF differs from a stored procedure or a function created with SQL: a stored procedure can only be written in SQL, while a UDF has no such restriction and can be written in any programming language that supports C calling conventions.

A UDF must be compiled in advance and dynamically linked into the server. This platform dependence is also what makes UDFs powerful: they are very fast and have access to a large number of operating-system facilities and library functions. If you need a statistical aggregate that MySQL does not support and that cannot be implemented as a stored procedure, and you want to call it from different languages, a UDF is a good choice; at the very least you avoid reimplementing the same logic in every language.

With great power comes great responsibility: a bug in a UDF can crash the server outright, or even corrupt the server's memory and data, so weigh the risks before using one. Be careful when upgrading MySQL as well, since UDFs may need to be recompiled or modified to work with the new version.

Here is a simple example of creating a UDF that converts a result set to JSON. For details, refer to lib_mysqludf_json.

-- 1. Implement the function in C
-- 2. Compile it into a .so file (steps 1 and 2 are omitted here)
-- 3. Create the function with SQL
DROP FUNCTION json_array;
CREATE FUNCTION json_array RETURNS STRING SONAME 'lib_mysqludf_json.so';
-- 4. Use the function
SELECT json_array(customer_id, first_name, last_name, last_update) AS customer
FROM customer
WHERE customer_id = 1;
-- 5. The result looks like this:
-- +------------------------------------------+
-- | customer                                 |
-- +------------------------------------------+
-- | [1,"MARY","SMITH","2006-02-15 04:57:20"] |
-- +------------------------------------------+

The overall flow: implement the logic in C -> compile to a .so file -> create the function -> use the function. You may rarely use a UDF in practice, but as developers, knowing such a powerful tool exists gives us more options when thorny problems come along.

Character sets

Finally, let's talk about character sets.

Most people's first impression of character sets is the advice that a database should use UTF8 wherever possible, since UTF8 converts most smoothly to and from other character sets, avoids garbled text to the greatest extent, and eases future data migration. But why?

A character set is a mapping from binary encodings to a class of character symbols; think of how a single byte can represent the English alphabet. A collation is a set of sorting rules for a character set, that is, the rules by which characters of that set are ordered. In MySQL, every encoded character column has both a character set and a collation. MySQL's support for character sets is thorough, but that support brings complexity, and in some scenarios even a performance penalty.

A character set may have several collations, one of which is the default. So how are character sets actually used in MySQL? They are set in two ways: as defaults when objects are created, and explicitly when the client communicates with the server.

MySQL sets character set defaults in a cascading, "ladder" fashion: each database and each table has its own default, inherited layer by layer, and ultimately the lowest applicable default takes effect for the object you create. For example, when a database is created, its default character set comes from the server's character_set_server; likewise, the character set of a table in that database is taken from the database's character set. Whether for databases, tables, or columns, the default only applies when no character set is specified explicitly.
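A minimal sketch of that inheritance (the database and table names are hypothetical):

CREATE DATABASE shop CHARACTER SET utf8;                  -- explicit database default
USE shop;
CREATE TABLE t1 (name VARCHAR(20));                       -- inherits utf8 from the database
CREATE TABLE t2 (name VARCHAR(20)) CHARACTER SET latin1;  -- explicit override
SHOW CREATE TABLE t1;                                     -- shows DEFAULT CHARSET=utf8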

Client and server may use different character sets when they communicate, and the server performs the necessary conversions. When the client sends a request, the data is encoded in the character set given by character_set_client; when the server receives the SQL or data, it converts it according to character_set_connection; when the server reads or writes table data, it converts again to the character set of the column (or table, or database, which, as described above, may simply come from the defaults); and finally, when the server returns data or error messages, it encodes them according to character_set_results. The server-side settings can be changed with SET CHARACTER SET, and clients can change their character set through the corresponding API. Use the correct character set on both ends to avoid communication problems.
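You can inspect and change these settings from any session; a minimal sketch (SET NAMES sets character_set_client, character_set_connection, and character_set_results in one statement):

SHOW VARIABLES LIKE 'character_set%';
SET NAMES utf8;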

So how do I choose a character set?

When weighing which character set to use, the main factor is the content being stored: on the premise that the content can be represented, prefer the smaller character set, because a smaller character set means less space and better network transmission efficiency, which indirectly improves system performance. If you store only Latin-script characters such as English text, the default latin1 character set is fine. If you need to store non-Latin text such as Chinese, Russian, or Arabic, UTF8 is recommended. Note that under UTF8 different characters occupy different amounts of space: an English character takes one byte, while a Chinese character takes three.

Beyond the character set, the collation also deserves consideration. Generally you only need to decide whether string comparison should be case-sensitive, case-insensitive, or based on the raw encoded bytes; the corresponding collations carry the suffixes _cs, _ci, and _bin respectively. The difference between case-sensitive and binary collations is that binary collations compare the bytes of the characters directly, whereas case-sensitive collations apply more complex, language-aware rules (for languages such as German). A simple example: the UTF8 character set has three corresponding collations:

utf8_bin compares each character in a string by its binary data and is case-sensitive

utf8_general_ci is case-insensitive; ci is an abbreviation of case insensitive

utf8_general_cs is case-sensitive; cs is an abbreviation of case sensitive
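A quick way to feel the difference between these collations, assuming the connection character set is utf8:

SELECT 'abc' = 'ABC' COLLATE utf8_general_ci AS ci_equal,  -- 1: case is ignored
       'abc' = 'ABC' COLLATE utf8_bin        AS bin_equal; -- 0: the bytes differ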

For example, when you create a table that uses UTF8 encoding and is case-sensitive, you can use the following statement:

CREATE TABLE sales (
    order_no VARCHAR(32) NOT NULL PRIMARY KEY,
    order_amount INT NOT NULL DEFAULT 0,
    ...
) ENGINE=InnoDB COLLATE=utf8_general_cs;

So using the UTF8 character set throughout a project is unproblematic, but keep one rule firmly in mind: do not mix different character sets within one database. Incompatibilities between character sets are very hard to deal with; everything may look fine until one special character appears and everything breaks, and the cause of the error is then hard to find.

Does the character set affect the performance of the database?

As mentioned earlier, some character sets and collations require more CPU operations and consume more memory and storage space; mixing character sets within one database can make the impact even greater.

Conversions between character sets and collations can also add overhead. For example, if the sales table has an index on the buyer field, the index can speed up the ORDER BY below:

SELECT order_no, order_amount FROM sales ORDER BY buyer;

The index can be used for sorting only when the character set the query sorts by matches the character set of the stored data. You might say: isn't that obvious? Well, MySQL actually lets you specify a separate collation for sorting, such as:

-- you might ask: isn't this asking for trouble? I think so too, but presumably there are scenarios where it applies
-- in this case the index cannot be used for sorting; only a filesort can be used
SELECT order_no, order_amount FROM sales ORDER BY buyer COLLATE utf8_bin;

When you join two tables on columns with different character sets, MySQL tries to convert the character set of one of the columns. That is just like wrapping the column in a function, and it prevents MySQL from using the index on that column. There are more pitfalls in MySQL character sets, but in practice character-set problems are not often encountered, so let's stop here.

Conclusion

MySQL has other advanced features too, but we rarely use them in most scenarios, so they are not discussed here; still, it is always good to know more, at least so you recognize the option when you need it. Many of us feel that what we have learned is fragmentary and never comes together into solutions. Have you considered that maybe you simply do not have enough fragments yet? With too few points you cannot connect lines, and with too few lines you cannot weave a net. So there is no other way: stay curious, learn more, accumulate more, and one day quantity will turn into quality. I will stop here and leave this with you.

As I have said in earlier articles, architectural design is an art of balance; its essence is compromise, a compromise with the resources at hand. Sometimes we slip into one extreme without noticing. For example, chasing data scalability, many people shard databases and tables from day one, making the application enormously complicated, and the project dies before the tables ever fill with data. When resources are limited or the future is unknown, it is often better to exploit the capabilities of the database and the language themselves. Partitioning should come to mind for large data volumes, some work need not be lifted into the business layer even in a distributed environment, and sensible use of stored procedures and triggers may make things easier.

Finally, all the topics discussed in this article come from High Performance MySQL, a book I strongly recommend reading.

References

Baron Schwartz et al.; translated by Ning Haiyuan, Zhou Zhenxing, et al.; High Performance MySQL (Third Edition); Electronic Industry Press, 2013
