What is ClickHouse, the big data analysis engine?

Updated: 2025-01-28. Source: SLTechnology News&Howtos (Shulou), Internet Technology

Shulou(Shulou.com)06/01 Report--

This article explains what ClickHouse, an analysis engine for big data, is. It is written for beginners and covers the background of analytical databases, the history of ClickHouse, its suitable scenarios, and its architecture.

1. What is ClickHouse?

ClickHouse is a column-oriented database management system (DBMS) for online analytical processing (OLAP).

In terms of storage, a column-oriented database always stores the values of the same column together, and the values of different columns separately.
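
The difference between the two layouts can be illustrated with a minimal Python sketch. The table, column names and values below are hypothetical and only show the idea; this is not how ClickHouse stores data internally.

```python
# Hypothetical sample data; the column names are illustrative only.
rows = [
    {"user_id": 1, "country": "CN", "clicks": 12},
    {"user_id": 2, "country": "US", "clicks": 7},
    {"user_id": 3, "country": "CN", "clicks": 30},
]


def to_columns(rows):
    """Convert a row-oriented table into a column-oriented one."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: summing clicks has to walk over every whole row.
total_from_rows = sum(row["clicks"] for row in rows)

# Column layout: the same aggregate reads only the "clicks" column.
total_from_columns = sum(columns["clicks"])
```

An analytical query usually needs a few columns out of many, so reading only those columns reduces the amount of data scanned.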

Common column-oriented databases include: Vertica, ParAccel (Actian Matrix, Amazon Redshift), Sybase IQ, Exasol, Infobright, InfiniDB, MonetDB (VectorWise, Actian Vector), LucidDB, SAP HANA, Google Dremel, Google PowerDrill, Druid, kdb+.

2. The traditional analytical database approach

1. The traditional way to handle large volumes of data is to layer it, building data marts layer by layer so that the final query has less data to scan. Concepts such as the data cube pre-process the data, trading space for time to improve query performance.
2. OLAP classification:
   Relational OLAP (ROLAP): built on the relational model; the data model is usually a star schema or a snowflake schema.
   Multidimensional OLAP (MOLAP): stores data as multidimensional arrays; the core idea is to pre-compute aggregate results, trading space for time to improve query performance.
   Hybrid OLAP (HOLAP): can be understood as a combination of ROLAP and MOLAP.
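
The "space for time" trade-off of pre-aggregation can be sketched as follows. This is a toy data cube in Python; the fact rows, the dimension names and the helper functions build_cube and query are assumptions made for illustration, not part of any real OLAP product.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (region, product, sales amount).
facts = [
    ("north", "book", 10),
    ("north", "pen", 5),
    ("south", "book", 7),
    ("south", "book", 3),
]
DIMENSIONS = ("region", "product")


def build_cube(facts):
    """Pre-aggregate the sales amount for every subset of the dimensions."""
    cube = defaultdict(int)
    for region, product, amount in facts:
        values = {"region": region, "product": product}
        for size in range(len(DIMENSIONS) + 1):
            for dims in combinations(DIMENSIONS, size):
                key = tuple((d, values[d]) for d in dims)
                cube[key] += amount
    return dict(cube)


def query(cube, **filters):
    """Answer an aggregate query with a single lookup in the cube."""
    key = tuple((d, filters[d]) for d in DIMENSIONS if d in filters)
    return cube.get(key, 0)


cube = build_cube(facts)
total = query(cube)                        # all sales
north = query(cube, region="north")        # sales in one region
books = query(cube, product="book")        # sales of one product
```

Queries become lookups, but the cube holds more entries than the original facts, and it must be rebuilt or updated whenever new data arrives. This is the cost that an engine without preprocessing avoids.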

3. ClickHouse, a dark horse that came out of nowhere

ClickHouse (full name: Click Stream, Data WareHouse) has the characteristics of ROLAP: online real-time queries, a complete DBMS, column storage, no data preprocessing required, support for batch updates, comprehensive SQL and function support, support for high availability, no dependence on the Hadoop ecosystem, and out-of-the-box usability.

The historical evolution of ClickHouse: MySQL period -> Metrage period -> OLAPServer period -> ClickHouse period.

1. MySQL period: used the MyISAM table engine, which stores indexes in a B+ tree and keeps the data in separate files (unlike the InnoDB table engine, which stores indexes and data together in a B+ tree, with the data attached directly to the leaf nodes).
2. Metrage period:
   Data model: the relational model was replaced by a Key-Value model.
   Index: the LSM tree replaced the B+ tree.
   Data processing: real-time queries were replaced by preprocessing.
   The most representative database built on an LSM tree is HBase. In essence, an LSM tree splits one big tree into many small trees: each batch of written data builds a small tree in memory, and the write completes as soon as that tree is built (a write-ahead log guards against losing the in-memory data on failure). Apart from the sequential append to that log, a write touches only memory and involves no random disk I/O, so write speed is greatly improved.
3. OLAPServer period:
   Design idea: combine the strengths of the earlier approaches.
   Data model: back to the relational model, because it describes data better.
   Storage: like the MyISAM table engine, split into index files and data files.
   Index: continues to use the LSM tree; index files and data files are split by column, and each column is stored independently.
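
The LSM write path described for the Metrage period can be sketched as a toy store. This is a simplified illustration under stated assumptions: the class name TinyLSM, the memtable_limit parameter and the in-memory list standing in for the log file are all hypothetical, and real LSM engines also merge (compact) their sorted runs in the background.

```python
class TinyLSM:
    """Toy LSM-style store: a write-ahead log, an in-memory table, sorted runs."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit
        self.wal = []        # stands in for an append-only log file on disk
        self.memtable = {}   # the "small tree" currently being built in memory
        self.runs = []       # flushed, sorted, immutable runs (newest last)

    def put(self, key, value):
        self.wal.append((key, value))   # log first so the write survives a crash
        self.memtable[key] = value      # then the write completes in memory
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Freeze the memtable into a sorted run and start a new one."""
        self.runs.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.wal = []   # flushed data no longer needs the log

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):   # newer runs override older ones
            for k, v in run:
                if k == key:
                    return v
        return None


store = TinyLSM(memtable_limit=2)
store.put("a", 1)
store.put("b", 2)   # reaches the limit, so the memtable is flushed
store.put("a", 3)   # a newer value for "a" lives in the new memtable
```

Reads have to check the memtable and then each run from newest to oldest, which is why LSM designs favor write speed over point-lookup speed.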

4. ClickHouse (full name: Click Stream, Data WareHouse)

1. Scenarios where ClickHouse fits: it is very well suited to business intelligence, and is also widely used for advertising traffic, web and app traffic, telecommunications, finance, e-commerce, information security, online games, the Internet of Things, and so on.
2. Scenarios where ClickHouse does not fit:
   It does not support transactions.
   It is not good at row-granularity queries by primary key (although they are supported), so ClickHouse should not be used as a Key-Value database.
   It is not good at deleting data row by row (although this is supported).

5. Detailed explanation of ClickHouse architecture

ClickHouse is a columnar storage database based on an MPP architecture. It draws on the best of many technologies and pushes every detail to the extreme.

1. Complete DBMS (database management system) functionality:
   DDL (data definition language): dynamically create databases, tables and views.
   DML (data manipulation language): dynamically query, insert, update and delete data.
   Access control, data backup and recovery, distributed management, and so on.
2. Column storage and data compression: different columns are stored in different files. The more repetition there is in the data, the higher the compression ratio; the smaller the data volume, the faster the transfer and the lower the pressure on network bandwidth and disk I/O. With LZ4 compression, the compression ratio can reach 8:1.
3. Vectorized execution engine: this can be understood simply as an optimization that eliminates loops in the program. The principle is data parallelism at the register level. Register access is about 300 times faster than memory access and about 30 million times faster than hard disk access.
4. Relational model and SQL queries: the relational model (including the star schema, the snowflake schema and even the wide-table model) describes data better than other models. Note also that ClickHouse SQL syntax is case-sensitive.
5. A variety of table engines: there are more than 20 table engines in six categories, including MergeTree, memory, file and interface engines. Each engine has its own characteristics and suits different scenarios.
6. Multithreading and distribution: if vectorized execution improves performance through data-level parallelism, multithreading does so through thread-level parallelism. Compared with SIMD (single instruction, multiple data) vectorized execution, which is implemented by the underlying hardware, thread-level parallelism is controlled at a higher, software level. The idea of distributed design is divide and conquer. In distributed systems there is a golden rule: moving computation is cheaper than moving data. For storage, ClickHouse supports both partitioning (vertical scaling, using multithreading) and sharding (horizontal scaling, using distribution). It can be said that multithreading and distributed techniques have both been pushed to the extreme.
7. Multi-master architecture: distributed systems such as HDFS, Spark, HBase and Elasticsearch all use a master-slave architecture in which one node controls the others, while ClickHouse uses a multi-master architecture. A client gets the same result whichever node it accesses.
8. Online queries: responses are fast, and the data does not need to be preprocessed.
9. Data sharding and distributed queries: data sharding splits the data horizontally. ClickHouse provides local tables (Local Table) and distributed tables (Distributed Table). A local table is equivalent to one shard of the data, while a distributed table stores no data itself: it is an access proxy for the local tables, similar in function to database sharding middleware. Distributed queries are achieved by accessing multiple local tables through a distributed table.
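
The relationship between local tables and a distributed table described above can be sketched in Python. This is an illustration of the proxy idea only, not ClickHouse's implementation: the class names, the choice of user_id as the sharding key and the crc32-based routing are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from zlib import crc32


class LocalTable:
    """Holds the rows of one shard."""

    def __init__(self):
        self.rows = []

    def insert(self, row):
        self.rows.append(row)

    def sum_clicks(self):
        return sum(row["clicks"] for row in self.rows)


class DistributedTable:
    """Stores no data itself; routes writes and fans out reads to local tables."""

    def __init__(self, shards):
        self.shards = shards

    def insert(self, row):
        # Route by a stable hash of the sharding key (user_id is an assumption).
        index = crc32(str(row["user_id"]).encode()) % len(self.shards)
        self.shards[index].insert(row)

    def sum_clicks(self):
        # Each shard computes a partial result next to its own data;
        # only the small partial sums travel back to be merged.
        with ThreadPoolExecutor(max_workers=len(self.shards)) as pool:
            partials = list(pool.map(LocalTable.sum_clicks, self.shards))
        return sum(partials)


shards = [LocalTable() for _ in range(3)]
table = DistributedTable(shards)
for user_id in range(10):
    table.insert({"user_id": user_id, "clicks": user_id * 2})

total = table.sum_clicks()
```

Sending the aggregate to each shard and merging partial sums follows the rule that moving computation is cheaper than moving data: the rows never leave the shard that stores them.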

6. The design principles of ClickHouse: the secret of its speed

1. Focus on the hardware first, and think before acting.
2. Algorithms come first, abstraction second.
3. Be willing to try new things; if something does not work, replace it.
4. Optimize specifically for specific scenarios.
5. Test continuously and improve continuously.

Thank you for reading. You should now have a basic understanding of what the ClickHouse big data analysis engine is; the next step is to try it out in practice.
