
Is TiFlash fast?

2025-07-09 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--


So how fast is TiFlash?

To answer this question more intuitively, we ran a new comparison test with the latest version of TiFlash. For comparison, the test covers traditional transactional databases (and a column-store extension), an analytical database, and a big data computing engine: Oracle, MySQL, MariaDB ColumnStore, Greenplum, and Apache Spark.

Of these, MySQL can carry online transactional workloads, but its analytical speed falls far behind products built specifically for analytics. Column-store databases, on the other hand, cannot carry online transactions: they lack a storage structure that supports real-time updates, and their performance on high-frequency, small-volume data access is too low to meet the requirements of online transactional workloads.

TiDB is an HTAP database that has already been verified in a large number of transactional scenarios. What analytical performance can it reach once TiFlash is added? And with TiFlash's consistent data synchronization, can users analyze real-time data directly at an excellent speed?

This time, let's look at an interesting dataset from the U.S. Department of Transportation, which records aircraft takeoffs, landings, and punctuality since 1987. You can use Percona Lab's download script to get the dataset. In total, it contains more than 100 million takeoff and landing records. The table structure of the dataset is here.

The queries used in the test are listed below. First, the comparison results:

Query  TiDB + TiFlash  MySQL 5.7.29  Greenplum 6.1  MariaDB ColumnStore 1.2.5  Spark 2.4.5 + Parquet  Oracle 12.2.0.1
Q1     0.508           290.340       4.206          1.209                      2.044                  88.53
Q2     0.295           262.650       3.795          0.740                      0.564                  76.05
Q3     0.395           247.260       2.339          0.583                      0.684                  74.76
Q4     0.512           254.960       2.923          0.625                      1.306                  74.75
Q5     0.184           242.530       2.077          0.258                      0.627                  67.44
Q6     0.273           288.290       4.471          0.462                      1.084                  134.08
Q7     0.659           514.700       9.698          1.213                      1.536                  147.06
Q8     0.453           487.890       3.927          1.629                      1.099                  165.35
Q9     0.277           261.820       3.160          0.951                      0.681                  76.5
Q10    2.615           407.360       8.344          2.020                      18.219                 127.29

Note: to keep the scale of the chart readable, the MySQL and Oracle results are omitted from the figure above.

The comparison above shows that, in a single-node environment, TiDB + TiFlash is:

Hundreds of times faster than MySQL (not to mention that TiFlash can also scale out).

Several times to ten times faster than analytical databases and engines that cannot be updated in real time, such as MPP databases or the newer MariaDB ColumnStore.
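The ratios behind these claims can be recomputed from the comparison table above. The following is a minimal Python sketch, not part of the original test: the figures are copied from the table, they are assumed to be execution times in seconds, and the names `SYSTEMS`, `TIMINGS`, and `speedups` are made up for illustration.

```python
# Figures copied from the comparison table; assumed to be seconds per query.
SYSTEMS = [
    "TiDB + TiFlash", "MySQL 5.7.29", "Greenplum 6.1",
    "MariaDB ColumnStore 1.2.5", "Spark 2.4.5 + Parquet", "Oracle 12.2.0.1",
]
TIMINGS = {
    "Q1": [0.508, 290.340, 4.206, 1.209, 2.044, 88.53],
    "Q2": [0.295, 262.650, 3.795, 0.740, 0.564, 76.05],
    "Q3": [0.395, 247.260, 2.339, 0.583, 0.684, 74.76],
    "Q4": [0.512, 254.960, 2.923, 0.625, 1.306, 74.75],
    "Q5": [0.184, 242.530, 2.077, 0.258, 0.627, 67.44],
    "Q6": [0.273, 288.290, 4.471, 0.462, 1.084, 134.08],
    "Q7": [0.659, 514.700, 9.698, 1.213, 1.536, 147.06],
    "Q8": [0.453, 487.890, 3.927, 1.629, 1.099, 165.35],
    "Q9": [0.277, 261.820, 3.160, 0.951, 0.681, 76.5],
    "Q10": [2.615, 407.360, 8.344, 2.020, 18.219, 127.29],
}


def speedups(timings, systems, baseline="TiDB + TiFlash"):
    """Return {system: {query: other_time / baseline_time}} for non-baseline systems."""
    base = systems.index(baseline)
    result = {}
    for i, name in enumerate(systems):
        if i == base:
            continue
        result[name] = {q: row[i] / row[base] for q, row in timings.items()}
    return result


# Print the smallest and largest ratio per system; a ratio above 1 means
# the baseline (TiDB + TiFlash) finished that query faster.
for name, ratios in speedups(TIMINGS, SYSTEMS).items():
    low, high = min(ratios.values()), max(ratios.values())
    print(f"{name}: {low:.1f}x to {high:.1f}x")
```

Printing the range per system rather than a single average keeps the per-query spread visible, which matters because the ratios differ considerably from query to query.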

The following ten SQL statements are the analytical queries used in the test.

Query 1: average number of flight records per month

select avg(c1) from (select year, month, count(*) as c1 from ontime group by year, month) A

Query 2: number of flights by day of the week from 2000 to 2008

select dayofweek, count(*) as c from ontime where year >= 2000 and year [...] 10 and year = 2007 group by carrier order by count(*) desc

Query 6: percentage of delayed flights by airline in 2007

select carrier, c, c2, c*100/c2 as c3 from (select carrier, count(*) as c from ontime where depdelay > 10 and year = 2007 group by carrier) A inner join (select carrier, count(*) as c2 from ontime where year = 2007 group by carrier) B using (carrier) order by c3 desc

Query 7: delay rate by airline from 2000 to 2008

select carrier, c, c2, c*100/c2 as c3 from (select carrier, count(*) as c from ontime where depdelay > 10 and year >= 2000 and year [...] = 2000 and year [...] 10 group by year) A inner join (select year, count(*) as c2 from ontime group by year) B using (year) order by year

Query 9: number of flights per year

select year, count(*) as c1 from ontime group by year

Query 10: multi-dimensional complex filtering and aggregation

select min(year), max(year), carrier, count(*) as cnt, sum(arrdelayminutes > 30) as flights_delayed, round(sum(arrdelayminutes > 30) / count(*), 2) as rate from ontime where dayofweek not in (6) and originstate not in ('ak', 'hi', 'pr', 'vi') and deststate not in ('ak', 'hi', 'pr', 'vi') and flightdate < '2010-01-01' group by carrier having cnt > 100000 and max(year) > 1990 order by rate desc limit 1000;

True row-column hybrid

Don't forget that there is also a row store. TiDB has not only the TiFlash column-store engine but also a corresponding row store, with fine-grained index support to go with it.

For columns with a very high number of unique values (such as a specific time or a unique product serial number), a column store generally has no good way to filter precisely. In the OnTime dataset above, for example, a query that filters on the scheduled departure time column, CRSDepTime, becomes faster once that column is indexed.

Count the total number of flights scheduled to depart at 18:45.

mysql> select count(*) from ontime where 1845 = CRSDepTime;
+----------+
| count(*) |
+----------+
|   766539 |
+----------+
1 row in set (0.09 sec)

On pure column stores, MariaDB, Spark, and Greenplum take 0.447, 0.449, and 1.576 seconds respectively for this query, which is 4 to 17 times slower than TiDB + TiFlash, because they have to scan the whole table by brute force.
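The difference between a brute-force scan and an index lookup can be seen on a small scale with Python's built-in `sqlite3` module. This is only an illustrative sketch under loud assumptions: SQLite is a row store standing in for any engine, the table holds synthetic rows rather than the OnTime data, and the names `flight_id`, `idx_crsdeptime`, and `query_plan` are invented for this example. It shows the access path changing once the column is indexed, not TiDB's or TiFlash's actual behavior.

```python
import sqlite3


def query_plan(conn, sql):
    """Return the detail text of SQLite's EXPLAIN QUERY PLAN for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable description is the fourth column of each plan row.
    return " | ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ontime (flight_id INTEGER PRIMARY KEY, crsdeptime INTEGER)"
)

# Synthetic data: one row per minute of the day, repeated, stored as HHMM.
rows = [(i, ((i // 60) % 24) * 100 + (i % 60)) for i in range(100_000)]
conn.executemany("INSERT INTO ontime VALUES (?, ?)", rows)

sql = "SELECT count(*) FROM ontime WHERE crsdeptime = 1845"

# Without an index the engine has to read every row.
plan_without_index = query_plan(conn, sql)
count_without_index = conn.execute(sql).fetchone()[0]

# With an index on the filtered column it can seek straight to the matches.
conn.execute("CREATE INDEX idx_crsdeptime ON ontime (crsdeptime)")
plan_with_index = query_plan(conn, sql)
count_with_index = conn.execute(sql).fetchone()[0]

print("without index:", plan_without_index)
print("with index:   ", plan_with_index)
```

The result of the query is the same in both cases; only the amount of data the engine has to touch changes, which is why the indexed row store wins on this kind of precise filter.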

In addition, TiDB's row-column hybrid is not the traditional design in which a table must pick either row storage or column storage. In TiDB, the same table can have both a row store and a column store, and the two are always kept strongly consistent (rather than eventually consistent).

If you have read this far, you may ask: does having both a row store and a column store in TiDB put a mental burden on users? The answer is no. You can leave it to TiDB to decide when to use the row store and when to use the column store; users only need to force a choice when they want to isolate HTAP workloads from each other. When the row store is the better option (as in the case above), TiDB automatically switches to reading from it based on statistics: the query above runs on TiFlash at only half the speed it reaches on the TiKV row store with an index.
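For readers who do want to force the choice, the sketch below builds the kind of statements involved. It rests on assumptions that do not come from this article: the `ALTER TABLE ... SET TIFLASH REPLICA`, `READ_FROM_STORAGE` hint, and `tidb_isolation_read_engines` syntax are taken from TiDB's documentation as best recalled and should be checked against the TiDB version in use, and the Python helper names are hypothetical. The code only builds SQL strings; it does not connect to a database.

```python
VALID_ENGINES = {"tikv", "tiflash"}


def add_tiflash_replica(table, replicas=1):
    """Build the DDL that asks TiDB to keep a columnar (TiFlash) copy of a table."""
    return f"ALTER TABLE {table} SET TIFLASH REPLICA {replicas};"


def read_from(engine, table, select_body):
    """Build a SELECT that pins one table to one engine via an optimizer hint."""
    if engine.lower() not in VALID_ENGINES:
        raise ValueError(f"unknown engine: {engine}")
    hint = f"/*+ READ_FROM_STORAGE({engine.upper()}[{table}]) */"
    return f"SELECT {hint} {select_body};"


def isolate_session(engines):
    """Build a statement restricting the current session to the given engines."""
    unknown = [e for e in engines if e.lower() not in VALID_ENGINES]
    if unknown:
        raise ValueError(f"unknown engines: {unknown}")
    joined = ",".join(e.lower() for e in engines)
    return f"SET SESSION tidb_isolation_read_engines = '{joined}';"


# Example: create the columnar replica, then force the 18:45 query onto TiFlash.
print(add_tiflash_replica("ontime"))
print(read_from("tiflash", "ontime", "count(*) FROM ontime WHERE 1845 = CRSDepTime"))
print(isolate_session(["tiflash"]))
```

A hint affects a single statement, while the session variable affects every query in the session, so the hint suits one-off comparisons and the session setting suits keeping an analytical workload away from the row store.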

Faster data arrival

TiFlash has a storage engine designed for high-frequency updates that works together with TiDB's data synchronization, so it can apply updates at high speed. Its "fast" therefore means not only that queries return quickly, but also that data becomes queryable sooner.

Traditional analytical databases and Hadoop data lakes need bulk loads from the source database on a T+1 schedule (often a one-day delay). By comparison, TiFlash can read the latest (not merely fresh) data, and you don't have to worry about data arriving out of order or about consistency. Compared with maintaining additional data replication jobs, you get a simpler architecture and more real-time access to the data.
