2025-03-09 Update From: SLTechnology News&Howtos
Shulou (Shulou.com) 06/01 Report --
This article introduces the advantages, disadvantages, and core features of ClickHouse. Many people run into questions about these topics in practice, so let's work through them together. I hope you read carefully and come away having learned something!
The full name of ClickHouse comes from "Click Stream, Data Warehouse": it performs OLAP analysis over page click-event streams in a data warehouse. ClickHouse is an open-source analytical database developed by the Russian company Yandex. Yandex runs a search engine, similar to Google or Baidu. As we all know, search-engine revenue comes mainly from traffic and advertising, so search-engine companies invest heavily in analyzing users' web traffic: Google has Analytics, Baidu has Baidu Statistics, and Yandex correspondingly has Yandex.Metrica. ClickHouse is the technology that grew out of Yandex.Metrica. According to the official benchmark (https://clickhouse.tech/benchmark/dbms/), with the same server configuration and data volume, ClickHouse's average response speed is:
2.63x that of Vertica (a commercial column-oriented database)
17x that of InfiniDB (a scalable analytical engine built on MySQL)
27x that of MonetDB (an open-source column-oriented database)
126x that of Hive
429x that of MySQL
10x that of Greenplum
1x that of Spark
Main features of ClickHouse
ROLAP (relational online analytical processing; contrast with OLTP, online transaction processing, to which common ERP and CRM systems belong)
Online real-time queries
A complete DBMS (relational database)
Column storage (the difference from HBase: ClickHouse is fully column-oriented, while HBase uses column-family storage)
No data preprocessing required
Support for batch updates
Complete SQL support and functions
Support for high availability (multi-master architecture, discussed later under structural design)
No dependence on the complex Hadoop ecosystem (like ES, it works out of the box)
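To make the column-storage point above concrete, here is a minimal sketch (plain Python, not ClickHouse code, with made-up data) of the same table stored row-wise and column-wise. An aggregate over one column only needs to scan that column's contiguous array in the columnar layout, skipping unrelated fields entirely.

```python
# Minimal sketch (not ClickHouse code): the same table stored row-wise
# and column-wise. Aggregating one column in the columnar layout scans
# a single dense list instead of probing every row object.
rows = [
    {"user_id": 1, "url": "/home", "duration_ms": 120},
    {"user_id": 2, "url": "/search", "duration_ms": 85},
    {"user_id": 1, "url": "/cart", "duration_ms": 40},
]

# Columnar layout: one contiguous list per column.
columns = {
    "user_id": [r["user_id"] for r in rows],
    "url": [r["url"] for r in rows],
    "duration_ms": [r["duration_ms"] for r in rows],
}

# Row store: every row must be visited and each dict probed for the key.
total_row_store = sum(r["duration_ms"] for r in rows)

# Column store: a single dense scan over one list.
total_col_store = sum(columns["duration_ms"])

assert total_row_store == total_col_store == 245
```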
Some deficiencies
No transaction support (this is actually a disadvantage of most OLAP databases)
Not good at point queries by primary key at row granularity (though the operation is supported)
Not good at deleting data row by row (though the operation is supported)
ClickHouse Infrastructure
1. Column and Field
Column and Field are the most basic units ClickHouse uses to map data. A column of data in memory is represented by a Column object. The Column object is split into interface and implementation: the IColumn interface defines the methods for the various relational operations on the data. In most cases ClickHouse operates on data a whole column at a time, but there are exceptions: if you need to work with a single concrete value (that is, one row of a single column), you use a Field object, which represents an individual value. Unlike the generalized design of Column objects, Field uses an aggregated design pattern: thirteen data types such as Null, UInt64, String and Array, together with their corresponding processing logic, are aggregated inside the Field object.
2. DataType
DataType is responsible for serializing and deserializing data. The IDataType interface defines many serialization and deserialization methods, which appear in pairs. IDataType also uses a generalized design pattern: the implementation logic of each concrete method is carried by the instance of the corresponding data type. Although DataType handles serialization-related work, it does not read data directly; instead it fetches values from Column or Field objects.
3. Block and Block streams
Data operations inside ClickHouse are oriented around Block objects and are carried out in streaming form. A Block object can be thought of as a subset of a data table. In essence, a Block object is a triple of data, data type, and column name, that is, Column, DataType, and the column-name string. Only through Block objects can a full series of data operations be performed. Block does not aggregate Column and DataType objects directly; it references them indirectly through ColumnWithTypeAndName objects.
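The triple described above can be modeled in a few lines of Python. The class names below mirror the text, but this is an illustrative simplification, not the actual C++ classes in ClickHouse.

```python
# Illustrative model of the Block triple described above:
# (column data, data type, column name). Not ClickHouse's real classes.
from dataclasses import dataclass

@dataclass
class ColumnWithTypeAndName:
    column: list      # stand-in for an IColumn: a whole column of values
    type_name: str    # stand-in for an IDataType
    name: str

@dataclass
class Block:
    columns: list     # list of ColumnWithTypeAndName

    def get_by_name(self, name):
        return next(c for c in self.columns if c.name == name)

    def rows(self):
        return len(self.columns[0].column) if self.columns else 0

block = Block([
    ColumnWithTypeAndName([1, 2, 3], "UInt64", "user_id"),
    ColumnWithTypeAndName(["a", "b", "c"], "String", "tag"),
])

# A Field corresponds to a single value inside a single column.
field = block.get_by_name("user_id").column[1]
assert field == 2
assert block.rows() == 3
```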
Block streaming has two families of top-level interfaces: IBlockInputStream is responsible for reading data and for relational operations, while IBlockOutputStream is responsible for outputting data to the next stage. The IBlockInputStream interface defines several read virtual methods for pulling data; the concrete logic is filled in by its implementing classes. IBlockInputStream has more than 60 implementing classes, which fall roughly into three categories:
The first category handles DDL operations for data definition
The second category handles relational operations
The third category corresponds to the table engines: each table engine has a matching BlockInputStream implementation
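The block-streaming idea above can be sketched with Python generators: an "input stream" yields blocks, a transforming stream applies a relational operation block by block, and an "output stream" consumes the result. The names mirror the interfaces in the text, but the code is illustrative, not ClickHouse's implementation.

```python
# Hedged sketch of block streaming: blocks here are just lists of rows.
def table_input_stream(data, block_size=2):
    """Plays the role of an IBlockInputStream bound to a table engine."""
    for i in range(0, len(data), block_size):
        yield data[i:i + block_size]

def filter_stream(upstream, predicate):
    """A relational operation applied block by block."""
    for block in upstream:
        yield [row for row in block if predicate(row)]

def collect_output_stream(upstream):
    """Plays the role of an IBlockOutputStream: writes to a destination."""
    out = []
    for block in upstream:
        out.extend(block)
    return out

data = [{"x": 1}, {"x": 5}, {"x": 3}, {"x": 7}]
pipeline = filter_stream(table_input_stream(data), lambda r: r["x"] > 2)
result = collect_output_stream(pipeline)
assert result == [{"x": 5}, {"x": 3}, {"x": 7}]
```

Because each stage is a generator, blocks flow through the pipeline lazily, one at a time, which is the streaming behavior the text describes.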
The design of IBlockOutputStream mirrors that of IBlockInputStream. Its implementing classes are mostly used by the table engines and are responsible for writing data to the next stage or to the final destination.
4. Table
There is no Table object in the underlying design of the data table; ClickHouse uses the IStorage interface directly to refer to a data table. The table engine is a prominent feature of ClickHouse, and different table engines are implemented by different IStorage subclasses. The IStorage interface is responsible for defining, querying and writing data: it returns the raw data of the specified columns according to the instructions of the AST of the query statement, while subsequent processing, computation and filtering are carried out by the components described below.
5. Parser and Interpreter
The Parser is responsible for creating AST objects; the Interpreter interprets the AST and builds the execution pipeline for the query. Together with IStorage, they string together the whole process of a data query. The Parser turns a SQL statement into an AST syntax tree via recursive descent, with different SQL statements parsed by different Parser implementation classes. The Interpreter acts like a service layer: it ties the whole query process together and gathers the resources it needs according to the interpreter type. First it parses the AST object; then it executes the "business logic" (branch decisions, setting parameters, calling interfaces, and so on); finally it returns an IBlock object, establishing a query execution pipeline in the form of threads.
6. Functions and Aggregate Functions
ClickHouse mainly provides two kinds of functions: ordinary functions (Functions) and aggregate functions (Aggregate Functions).
Ordinary functions are defined by the IFunction interface; there are dozens of them, and they act directly on whole columns of data in vectorized fashion. Aggregate functions are defined by the IAggregateFunction interface. In contrast to stateless ordinary functions, aggregate functions are stateful. Take the COUNT aggregate function as an example: the state of AggregateFunctionCount is recorded as a UInt64 integer. The state of an aggregate function supports serialization and deserialization, so it can be transferred between distributed nodes for incremental computation.
7. Cluster and Replication
A ClickHouse cluster is made up of shards, and each shard is made up of replicas. This layered concept is common in popular distributed systems, but ClickHouse has a few distinctive points. One ClickHouse node can carry only one shard, which means that to achieve 1 shard with 1 replica you need to deploy at least 2 service nodes. A shard is only a logical concept; its physical load is borne by the replicas.
ClickHouse Table Engines
MergeTree: allows you to create indexes based on primary keys and dates and supports real-time data updates. MergeTree is the most advanced table engine in ClickHouse.
ReplacingMergeTree: this engine differs from MergeTree in that it removes duplicate rows with the same primary key. Deduplication happens only during merges, so some data may remain unprocessed for a while. ReplacingMergeTree is therefore suitable for clearing duplicate data in the background to save space, but it does not guarantee the absence of duplicates. To some extent it compensates for ClickHouse's weakness at updating data, and it can be used to deduplicate in scenarios where data is heavily repeated.
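The merge-time deduplication described above can be sketched in Python: duplicates coexist until a background merge runs, and the merge keeps the last-inserted row per primary key. This is an illustration of the behavior, not the engine's real merge algorithm.

```python
# Sketch of ReplacingMergeTree behavior: duplicates with the same
# primary key are only removed when data parts are merged, so a query
# before the merge may still see duplicate rows.
def merge_replacing(parts):
    """Merge data parts, keeping the last-inserted row per primary key."""
    latest = {}
    for part in parts:                 # parts in insertion order
        for key, row in part:
            latest[key] = row          # later rows replace earlier ones
    return sorted(latest.items())

part1 = [(1, {"city": "Moscow"}), (2, {"city": "Kazan"})]
part2 = [(1, {"city": "Moscow (updated)"})]

# Before the background merge, both versions of key 1 coexist.
unmerged_rows = len(part1) + len(part2)
assert unmerged_rows == 3

# After the merge, one row per key remains.
merged = merge_replacing([part1, part2])
assert merged == [(1, {"city": "Moscow (updated)"}), (2, {"city": "Kazan"})]
```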
SummingMergeTree: when merging data parts of a table, ClickHouse collapses all rows with the same primary key into a single row that contains the sums of the columns with numeric data types. If the primary key is composed so that a single key value corresponds to a large number of rows, this significantly reduces storage space and speeds up queries. For non-summable columns, the first value encountered is kept. It suits scenarios with long-running summary queries over certain fields.
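The SummingMergeTree merge rule above can be sketched as follows: rows sharing a primary key collapse into one row whose numeric columns are summed, while non-numeric columns keep the first value. Illustrative Python with made-up data, not the engine implementation.

```python
# Sketch of the SummingMergeTree merge rule: sum numeric columns per
# key, keep the first value for everything else.
def merge_summing(rows, key_col, sum_cols):
    merged = {}
    for row in rows:
        key = row[key_col]
        if key not in merged:
            merged[key] = dict(row)       # first value wins for
        else:                             # non-summed columns
            for col in sum_cols:
                merged[key][col] += row[col]
    return list(merged.values())

rows = [
    {"date": "2024-06-01", "clicks": 3, "cost": 1.5, "label": "a"},
    {"date": "2024-06-01", "clicks": 7, "cost": 0.5, "label": "b"},
    {"date": "2024-06-02", "clicks": 1, "cost": 2.0, "label": "c"},
]
result = merge_summing(rows, "date", ["clicks", "cost"])
assert result[0] == {"date": "2024-06-01", "clicks": 10, "cost": 2.0, "label": "a"}
assert result[1] == {"date": "2024-06-02", "clicks": 1, "cost": 2.0, "label": "c"}
```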
AggregatingMergeTree: this engine inherits from MergeTree and changes the merge logic of data parts. ClickHouse replaces all rows with the same primary key (within one data part) with a single row that stores a series of aggregate-function states. AggregatingMergeTree tables can be used for incremental statistical aggregation, including aggregation for materialized views. The engine requires all aggregated columns to use the AggregateFunction type. If you want to merge rows down according to a set of aggregation rules, AggregatingMergeTree is appropriate. Note that you cannot write data to an AggregatingMergeTree table with a plain INSERT; INSERT SELECT is normally used. In practice it is most commonly used behind a materialized view for incremental statistical aggregation.
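The idea of storing aggregate-function states per key, and of merging states rather than raw rows, can be sketched like this. The "state" here is a (sum, count) pair for an average; this is illustrative Python, not the engine or the real AggregateFunction type.

```python
# Sketch of the AggregatingMergeTree idea: each primary key stores an
# aggregate-function state, and merging data parts merges the states
# without re-reading raw rows.
def new_state():
    return [0, 0]                      # [sum, count]

def add(state, value):                 # accumulate one raw value
    state[0] += value
    state[1] += 1

def merge_states(a, b):                # what a background merge does
    return [a[0] + b[0], a[1] + b[1]]

def finalize(state):                   # like finishing avg() at query time
    return state[0] / state[1]

# Two data parts hold partial states for the same primary key.
part1, part2 = new_state(), new_state()
for v in [10, 20]:
    add(part1, v)
for v in [30]:
    add(part2, v)

merged = merge_states(part1, part2)
assert merged == [60, 3]
assert finalize(merged) == 20.0
```

This is also why aggregate states transfer well between nodes: merging two states is cheap and order-independent, so partial aggregation can happen anywhere.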
Distributed: the Distributed engine does not store data itself, but it can run distributed queries across multiple servers. Reads are automatically parallelized, and during a read the indexes of the remote server tables are used if they exist. The Distributed engine takes the following parameters: the cluster name from the server configuration file, the remote database name, the remote table name, and the data sharding key.
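The scatter-gather read described above can be sketched in Python: the distributed table holds no data itself; it fans the query out to the shards (here plain dicts standing in for remote servers), reads them in parallel, and combines the partial results.

```python
# Sketch of a Distributed-style scatter-gather read. The shards are
# in-process stand-ins for remote servers; real ClickHouse sends the
# query over the network to each shard.
from concurrent.futures import ThreadPoolExecutor

shards = [
    {"events": [("2024-06-01", 3), ("2024-06-01", 2)]},   # shard 1
    {"events": [("2024-06-01", 5), ("2024-06-02", 1)]},   # shard 2
]

def local_query(shard):
    """Runs on each shard: sum event counts for one day."""
    return sum(n for day, n in shard["events"] if day == "2024-06-01")

# Reads are issued to all shards in parallel, then merged.
with ThreadPoolExecutor(max_workers=len(shards)) as pool:
    partials = list(pool.map(local_query, shards))

total = sum(partials)
assert partials == [5, 5]
assert total == 10
```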
This concludes "what are the advantages and disadvantages and core features of ClickHouse". Thank you for reading.