
How to Use ClickHouse in Python

2025-02-24 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)05/31 Report--

In this article, the editor explains in detail how to use ClickHouse in Python. The content is detailed, the steps are clear, and the details are handled carefully; I hope this article helps resolve your doubts.

ClickHouse is an open-source columnar database management system (DBMS) that has attracted much attention in recent years. It is mainly used for online analytical processing (OLAP) and was open-sourced in 2016. The community in China is currently very active, and many large companies have adopted it at scale.

At Jinri Toutiao, ClickHouse is used internally to analyze user behavior. There are thousands of ClickHouse nodes, with up to 1,200 nodes in a single cluster, tens of PB of total data, and roughly 300 TB of new raw data per day.

Tencent uses ClickHouse internally to analyze game data and has built a complete monitoring and operations system around it.

Ctrip began trialing it in July 2018, and 80% of its business now runs on ClickHouse, with more than 1 billion rows of new data and nearly 1 million query requests per day.

Kuaishou also uses ClickHouse internally, with about 10 PB of total storage, 200 TB added per day, and 90% of queries completing in under 3 s.

Abroad, Yandex uses hundreds of nodes for user click-behavior analysis, and leading companies such as Cloudflare and Spotify also use it.

ClickHouse was originally developed for Yandex.Metrica, the world's second-largest web analytics platform, and has served as a core component of that system for many years.

1. About the practice of using ClickHouse

First, let's review some basic concepts:

OLTP (online transaction processing): the traditional relational database workload, dominated by inserts, deletes, updates, and lookups, with an emphasis on transactional consistency; examples include banking and e-commerce systems.

OLAP (online analytical processing): the data warehouse workload, dominated by reads and complex analytical queries, focused on supporting decision making with intuitive, concise results.

1.1. Application of ClickHouse in data Warehouse scenario

ClickHouse is a columnar database, and columnar databases are better suited to OLAP scenarios. The key characteristics of OLAP scenarios are:

The vast majority are read requests

Data is updated in sizable batches (> 1000 rows) rather than single rows, or not updated at all.

Data that has been added to the database cannot be modified.

For reads, a considerable number of rows are read from the database, but only a small subset of the columns.

Wide tables, that is, each table contains a large number of columns

Relatively few queries (usually hundreds or less per server per second)

For simple queries, latencies of around 50 milliseconds are acceptable

Column values are relatively small: numbers and short strings (for example, 60 bytes per URL)

High throughput is required when processing a single query (billions of rows per server per second)

Transactions are not necessary.

Low requirements for data consistency

Each query involves one large table; all the others are small.

Query results are significantly smaller than the source data; in other words, the data is filtered or aggregated, so the result fits in the RAM of a single server.

1.2. Client tool DBeaver

A commonly used ClickHouse client tool is DBeaver; the official website is https://dbeaver.io/.

DBeaver is a free and open-source (GPL) universal database tool for developers and database administrators. [Baidu Baike]

Ease of use is the project's main goal. It is a carefully designed and developed database management tool: free, cross-platform, based on an open-source framework, and extensible via plug-ins.

It supports any database with a JDBC driver.

It can handle any external data source.

Create and configure a new connection via the "Database" menu in the interface, then select and download the ClickHouse driver (no driver is bundled by default), as shown in the following figure.

DBeaver's configuration is JDBC-based; the default URL and port are typically:

jdbc:clickhouse://192.168.17.61:8123

This is shown in the following figure.

When using DBeaver to query ClickHouse, the connection or query sometimes times out. In that case, set the socket_timeout parameter in the connection settings to resolve it:

jdbc:clickhouse://{host}:{port}[/{database}]?socket_timeout=600000
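As a small illustration, the URL above can be assembled programmatically. The helper name build_jdbc_url and its defaults are assumptions for this sketch, not part of DBeaver or the JDBC driver:

```python
def build_jdbc_url(host, port=8123, database=None, socket_timeout_ms=600000):
    """Build a ClickHouse JDBC URL with a socket timeout.

    Hypothetical helper: the defaults (HTTP port 8123, 600 s timeout)
    mirror the article's example, not any official API.
    """
    url = "jdbc:clickhouse://{}:{}".format(host, port)
    if database:
        url += "/" + database
    return url + "?socket_timeout={}".format(socket_timeout_ms)

print(build_jdbc_url("192.168.17.61"))
# → jdbc:clickhouse://192.168.17.61:8123?socket_timeout=600000
```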

1.3. A big data application practice

Brief description of the environment:

Hardware resources are limited: only 16 GB of memory for 100 million rows of transaction data.

The application involves transaction big data, mainly a transaction master table plus related customer information, material information, historical prices, and discount and loyalty-point information; the main transaction table has a self-referencing tree structure.

To analyze customer trading behavior under these limited resources, transaction details are extracted and aggregated into per-day, per-trading-point records, as shown in the following figure.

On ClickHouse, the transaction data structure consists of 60 columns (fields); an excerpt is shown below:

Given frequent out-of-memory errors such as "would use 10.20 GiB, maximum: 9.31 GiB", a SQL statement that extracts an aggregated dataset was written in ClickHouse SQL, as shown below.
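The article's exact SQL is not reproduced here. As a sketch of the general approach, ClickHouse's max_bytes_before_external_group_by setting lets a large GROUP BY spill to disk instead of failing with the memory error above; clickhouse_driver's execute() accepts per-query settings. The table and column names below are illustrative assumptions, not the article's schema:

```python
# Sketch only: trade_detail and its columns are made-up names.
AGG_SQL = """
SELECT card_id,
       toDate(trade_time) AS trade_day,
       count() AS trade_cnt,
       sum(amount) AS total_amount
FROM trade_detail
GROUP BY card_id, trade_day
"""

def run_aggregation(client):
    # Let the aggregation state spill to disk once it exceeds ~4 GiB,
    # instead of hitting "would use X GiB, maximum: Y GiB".
    settings = {"max_bytes_before_external_group_by": 4 * 1024 ** 3}
    return client.execute(AGG_SQL, settings=settings)
```

Here `client` would be a clickhouse_driver.Client; external aggregation trades speed for staying within the 16 GB memory budget.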

The result returns in about 60 s, as shown below:

2. Using ClickHouse from Python

2.1. The third-party Python driver clickhouse_driver

ClickHouse does not provide an official Python driver. The most common third-party driver is clickhouse_driver, which can be installed with pip, as shown below:

pip install clickhouse_driver
Collecting clickhouse_driver
  Downloading https://files.pythonhosted.org/packages/88/59/c570218bfca84bd0ece896c0f9ac0bf1e11543f3c01d8409f5e4f801f992/clickhouse_driver-0.2.1-cp36-cp36m-win_amd64.whl (173kB)
  100% |████| 174kB 27kB/s
Collecting tzlocal ...

The extraction and aggregation code is excerpted below:

        ...
        if k > 0:
            self.get_trade(df_trade, filename.format(i))
            n = n + batch
        if k == 0:
            flag = False
        print('Completed ' + str(k) + ' trade details')
        print('Usercard count: ' + str(n))
        return n

# price change dataset
class Price_Table(object):
    def __init__(self, cityname, startdate):
        self.cityname = cityname
        self.startdate = startdate
        self.filename = 'price20210531.csv'

    def get_price(self):
        df_price = pd.read_csv(self.filename)
        ...
        self.price_table = self.price_table.append(data_dict, ignore_index=True)
        print('generate price table')

class CardTradeDB(object):
    def __init__(self, db_obj):
        self.db_obj = db_obj

    def insertDatasByCSV(self, filename):
        # the CSV mixes data types
        df = pd.read_csv(filename, low_memory=False)
        ...

    # get transactions
    def getTradeDatasByID(self, ID_list=None):
        # the string is too long, so use '''...'''
        query_sql = 'select C.cardusername as ... limit {}, {}) group by C.cardusername ...'
        n = self.db_obj.get_datas(query_sql)
        return n

if __name__ == '__main__':
    PTable = Price_Table('Hubei', '2015-12-01')
    PTable.get_price()
    db_obj = DB_Obj('ebd_all_b04')
    db_obj.setPriceTable(PTable.price_table)
    CTD = CardTradeDB(db_obj)
    df = CTD.getTradeDatasByID()
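The listing above pages through trades in batches via LIMIT offset, size. A minimal, self-contained sketch of that paging logic (the helper name batched_limits is an assumption, not from the original code):

```python
def batched_limits(total, batch):
    """Yield (offset, size) pairs for paging a query with LIMIT offset, size."""
    for offset in range(0, total, batch):
        yield offset, min(batch, total - offset)

# Each pair would be formatted into a query template such as the
# article's "... limit {}, {} ..." string, e.g. query_sql.format(offset, size).
for offset, size in batched_limits(250, 100):
    print(offset, size)
# → 0 100 / 100 100 / 200 50
```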

The local file returned is:

3. A brief summary

When ClickHouse is used in OLAP scenarios, queries are very fast but require ample memory. Python's third-party clickhouse-driver basically meets data-processing needs, and it is best when results can be returned as a Pandas DataFrame.
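As a sketch of returning results as a DataFrame: with with_column_types=True, clickhouse_driver's execute() returns the rows plus (name, type) pairs, which map directly onto a DataFrame (newer driver versions also ship a client.query_dataframe() convenience method). The helper below is an illustration, not the article's code:

```python
import pandas as pd

def query_dataframe(client, sql):
    """Run sql via a clickhouse_driver-style client and return a DataFrame."""
    data, columns = client.execute(sql, with_column_types=True)
    return pd.DataFrame(data, columns=[name for name, _ in columns])
```

Any object with a compatible execute() method works here, which also makes the helper easy to test without a live server.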

Both ClickHouse and Pandas aggregate very quickly, and ClickHouse's aggregate functions are rich (for example, anyLast(x) returns the last value encountered). If the aggregation can be expressed in SQL, it is better to do it in ClickHouse and feed the smaller result set back to Python for machine learning.

Using ClickHouse to delete specified data:

from clickhouse_driver import Client as click_client

def info_del2(i):
    client = click_client(host='address', port=port, user='username',
                          password='password', database='database')
    sql_detail = 'alter table SS_GOODS_ORDER_ALL delete where order_id=' + str(i)
    try:
        client.execute(sql_detail)
    except Exception as e:
        print(e, 'failed to delete commodity data')

When deleting data, Python operates ClickHouse differently from MySQL: instead of the usual %s placeholders with separately passed parameters, you must build the complete statement, as in the method above, passing the parameter as a str.
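Following the author's approach of building the full statement as a string, a small helper can at least force the id to an integer before interpolation. build_delete_sql is a hypothetical name for this sketch, not part of clickhouse_driver:

```python
def build_delete_sql(table, order_id):
    """Build an ALTER TABLE ... DELETE mutation as one complete string.

    int() rejects non-numeric input before it reaches the statement.
    """
    return "alter table {} delete where order_id={}".format(table, int(order_id))

print(build_delete_sql("SS_GOODS_ORDER_ALL", 42))
# → alter table SS_GOODS_ORDER_ALL delete where order_id=42
```

The resulting string would then be passed to client.execute() as in info_del2 above.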

That concludes this article on how to use ClickHouse in Python. Mastering the material also takes hands-on practice; if you want to learn more, follow our industry information channel.
