What is Schema and data type optimization like 04/10 Update SLTechnology News&Howtos

What is Schema and data type optimization like

2025-04-10 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >

Shulou(Shulou.com)06/02 Report--

In this issue, the editor will bring you about what Schema and data type optimization is like. The article is rich in content and analyzed and described from a professional point of view. I hope you can get something after reading this article.

Good logical and physical design is the cornerstone of high performance. Schema should be designed according to the query statements to be executed by the system, often weighing various factors.

Select optimized data type

Choosing the right data type is critical to achieving high performance.

Smaller ones are usually better.

In general, you should try to use the smallest data type that can store data correctly, for example, if you only need to store 0-200, then tinyint unsigned is better. Smaller data types are generally faster because they take up less disk, memory, and CPU cache, and require fewer CPU cycles to process, but need to determine the smallest type that does not exceed the range.

Just keep it simple.

Operations with simple data types usually require fewer CPU cycles, such as integers are less expensive than string operations, because character sets and proofreading rules (collations) make character comparisons more complex than integers. For example, MySQL built-in types should be used instead of strings to store dates and times, and integers should be used to store IP addresses.

Try to avoid NULL

In general, it is best to specify the column bit NOT NULL unless you really need to store null values. If the query contains columns that can be NULL, it is more difficult for MySQL to optimize, because columns that can be NULL make indexing, index statistics, and value comparisons more complex. Columns that are NULL will use more storage space, and special handling is required in MySQL. When columns that can be NULL are indexed, an additional byte is required for each index record. If you plan to index columns, you should try to avoid being designed to be NULL-capable columns.

MySQL data type

MySQL data types-Rookie tutorial

Some experience and skills

CHAR and VARCHAR

To make a choice based on the storage engine:

(1) MyISAM storage engine: fixed-length data columns are recommended

(2) MEMORY storage engine: currently, fixed-length data rows are stored, and both CHAR and VARCHAR are treated as CHAR types.

(3) InnoDB storage engine: VARCHAR type is recommended. The main performance factor is the storage capacity used by data rows. Since CAHR takes up more space than VARCHAR on average, it is better to use VARCHAR to minimize the storage capacity of rows of data and disk IO that need to be processed

Because CHAR fixed length, processing speed is much faster than VARCHAR, the disadvantage is a waste of storage space, the program needs to deal with line trailing spaces (storage will remove trailing spaces), so for those data with little change in length and higher requirements for query speed, you can consider using the CAHR type to store.

TEXT and BLOB

BLOB binary data

TEXT character data

(1) the delete operation will leave a lot of holes in the data table, and the records that fill these holes in the future will affect the performance of the insert. In order to improve performance, it is recommended that you use the optimizetable (optimize English for performance optimization) feature to defragment such tables regularly to avoid performance problems caused by "holes".

(2) use composite indexes to improve the accurate query performance of large text fields (BLOB or TEXT)

(3) use the prefix index (index the first few characters of the column instead of all) to vaguely query the BLOB or CLOB fields

(4) avoid retrieving large BLOB or text values when it is not necessary

(5) separate BLOB or TEXT columns into separate tables

Floating point number and fixed point number

Floating point numbers: there are error problems in float and double. In actual programming, it is best to use range comparison rather than direct comparison.

Fixed points: decimal (or numberic) string storage, more accurate storage of data, such as applied to currency and other precision-sensitive data

Date type selection

According to the actual needs, selecting the date type that can meet the minimum storage of the application can not only save storage, but also improve the operation efficiency of the table.

Other

MySQL can specify a width for an integer type such as int (11), but it does not make sense for most applications and does not limit the legal range of values, but only specifies the number of characters used by interactive tools such as the client. Int (1) and int (20) are the same for storage and computing.

DECIMAL requires additional space and computing overhead. In the case of a large amount of data, you can consider using BIGINT instead. You can multiply the monetary units that need to be stored by the corresponding multiple according to the decimal places. For example, if the financial data you need to store is accurate to 1/10000, you can multiply all the amounts by 1 million, and then store them in BIGINT. This can avoid both the imprecise calculation of floating-point storage and the high cost of DECIMAL precision.

Integers are usually the best choice for flag columns because they are fast and can use auto_increment. If possible, you should try to avoid using string types as flag columns because they are space-intensive and usually slower than numeric types.

You also need to pay attention to strings that are completely random, such as strings generated by MD5 (), SHA1 (), or UUID (). The new values generated by these functions will be arbitrarily distributed over a large space, which will cause INSERT and SELECT statements to become very slow:

(1) because the new insert values are randomly written to different locations of the index, the INSERT statement slows down, causing page splitting, random disk access, and clustered index fragmentation for the clustered index engine.

(2) select statements become slower because logically adjacent rows are distributed across disk and memory.

(3) Random values cause the cache to have a poor effect on all types of query statements, because it invalidates the principle of access locality on which the cache works.

IPv4 addresses are actually 32-bit unsigned integers, not strings, segmented representations are only easy to read, and MySQL provides INET_ATON () and INET_NTOA () functions to convert between the two representations.

Too many traps in MySQL schema design

When working, MySQL's storage engine API needs to copy data in row cache format between server layer and storage engine layer, and then decode the buffer content into columns in the server layer. The operation cost of converting encoded columns into row data structures from row buffering is very high, and the cost of conversion depends on the number of columns. Too many columns will result in high CPU usage.

Too many connections.

The entity-attribute-value (EAV) design pattern is a common bad design pattern, especially one that doesn't work reliably under MySQL. Alibaba Java Development Manual stipulates that associations are prohibited for more than three tables, associated fields are required, data types need to be absolutely consistent and associated fields need to be indexed (try to use the same data types to store similar or related values, especially columns used in association conditions).

Be careful with enumerations

ALTER TABLE is a blocking operation in MySQL 5.0 and earlier; even in version 5.1 or later, ALTER TABLE is required to cause blocking if you don't add a value at the end of the list

Paradigm and anti-paradigm

In a standardized database, each factual data appears only once. In contrast, in an anti-normalized database, the information is redundant and may be stored in multiple places.

Pros and cons of stylization

For write-intensive scenarios, stylized design is conducive to performance optimization. Normalized update operations are usually faster than anti-normalization; when data is better normalized, there is little or no duplicate data, so only less data needs to be modified; normalized tables are usually smaller and can be better placed in memory, so operations are performed faster.

The disadvantage is that the schema of the paradigm design usually needs to be associated, and slightly more complex query statements may need to be associated at least once, perhaps more, on the normal form-compliant schema, which is not only expensive, but also may invalidate some indexing strategies.

Advantages and disadvantages of anti-normalization

Anti-normalized schema can well avoid associations because all the data is in one table. If associative tables are not needed, the worst-case scenario for most queries-even if the index is not used is a full table scan, which may be much faster than the association when the data is larger than memory, because this avoids random Imax O (full table scans are basically sequential Imax O, but also depends on the engine implementation). Individual tables can also use a more efficient indexing strategy.

Mixed use of paradigm and anti-normalization

Mixing is often required in practical applications, and partial normalized schema, cached tables, and other techniques may be used. The most common way to de-model data is to copy or cache, storing the same specific columns in different tables, which helps to query the data efficiently, but also makes the update operation more expensive because multiple tables may need to be updated at the same time. In practice, you need to consider the frequency and duration of updates, and compare them with the frequency with which SELECT queries are executed (compromise).

Cache table and summary table

Sometimes the best way to improve performance is to store derived redundant data in the same table, but sometimes it is also necessary to create a completely separate summary or cache table (especially for retrieval needs). This is a great way to allow a small amount of dirty data, but sometimes there is no choice (for example, you need to avoid complex and expensive real-time update operations). Here, cached tables represent tables that can easily get data from other schema tables (but more slowly each time), while summary tables hold tables that aggregate data using GROUP BY statements.

Real-time computing statistics is a relatively expensive operation, because either most of the data in the table needs to be scanned, or query statements can only be run effectively on specific indexes, which generally have an impact on UPDATE operations, so generally do not want to create such indexes, while the establishment of summary tables can be avoided by sacrificing some timeliness. Cached tables, by contrast, are effective for optimizing search and retrieval statements. When using cached and summary tables, you must decide whether to maintain the data in real time or rebuild it periodically, which not only saves resources, but also keeps the table from being fragmented, and indexes that are organized in full order (which is more efficient). When rebuilding summary and cache tables, it is usually necessary to ensure that the data is still available during operation, and shadow tables can be used to achieve this.

Materialized view

Materialized views are pre-calculated and stored on disk tables that can be refreshed and updated through a variety of policies. MySQL does not support materialized views, can be achieved using the open source tool Flexviews, to achieve the change data capture function, you can read the server's binary log and parse the changes of related lines. Compared with the traditional methods of maintaining summary table and caching table, Flexviews extracts the changes to the source table based on the mirror before and after row updates in the binary log, and can incrementally calculate the contents of the materialized view, which means that the view does not need to be updated by querying the original data, which is much more efficient than reading data from the source table.

Counter table

When a counter table updates a row of records concurrently, there will be a global mutex, which will lead to the serial execution of the transaction. in order to achieve higher concurrent update performance, the counter can be saved in multiple rows (the concept of introducing slots). One row at a time is randomly selected for update.

Update hit_counter set cnt = cnt + 1 where slot = RAND () * 100

Read faster and write more slowly

In order to speed up queries, it is often necessary to build additional indexes, add redundant columns, and even create cached and summary tables. These methods increase the burden of writing queries and require additional maintenance tasks, but these are common techniques when designing high-performance databases: although writes become slower, they significantly improve the performance of read operations.

However, slower write operations are not the only price to pay for faster read operations, and may also make it more difficult to develop both read and write operations.

Speed up ALTER TABLE operation

The performance of MySQL's ALTER TABLE operation is a big problem for large tables, which in most cases lock the table and rebuild the entire table.

Modify only .frm files

MySQL performs most of the operations to modify the table structure by using the new table structure an empty table, finding all the data from the old table into the new table, and then deleting the old table. This can take a long time, especially if there is not enough memory and the table is large, and there are many indexes. Many people have the experience that ALTER TABLE operations can take hours or even days to complete. In general, most ALTER TABLRE operations will cause MySQL services to be interrupted. For common scenarios, there are only two scenarios that can be used: one is to ALTER TABLE on a machine that does not provide services, and then switch with the main library that provides services; the other technique is "shadow copy". The trick of shadow copying is to create a new table independent of the source table with the required table structure, and then exchange the two tables by renaming and deleting the table. There are also some tools that can help copy shadow tables: such as the "online schema change" tool of the Facebook data operation and maintenance tools team.

ALTER TABLE allows modifications using ALTER COLUMN, MODIFY COLUMN, and CHANGE COLUMN statements, all of which are different. ALTER COLUMN: set or delete default values for columns (very fast); CHANGE COLUMN: rename columns, change column types, and move column positions; MODIFY COLUMN: same as CHANGE COLUMN except that columns cannot be renamed.

In theory, MySQL can skip the step of creating a new table. The default value for the column actually exists in the .frm file of the table, so you can modify the file directly without changing the table itself.

Quickly create MyISAM index

In order to load data into MyISAM tables efficiently, a common technique is to disable the index, load the data, and then re-enable the index (only for non-unique indexes). In the modern version of InnoDB, there is a similar technique, which depends on InnoDB's fast online index creation capability: first delete all unique indexes, then add new columns, and finally recreate the deleted indexes.

This is what the editor shares with you about Schema and data type optimization. If you happen to have similar doubts, you might as well refer to the above analysis to understand. If you want to know more about it, you are welcome to follow the industry information channel.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.