What are the characteristics of Parquet

2025-02-14 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article introduces the characteristics of Parquet. Many people have questions about them in day-to-day work, so the editor went through the available material and put together a simple, easy-to-follow explanation. Hopefully it answers those questions. Let's get started.

Writing process

Although Parquet stores data by column, the data arrives one row at a time, so when should the data held in memory be written to the file? A file is written sequentially, and if every row were written to disk as soon as it arrived, the result would simply be row storage.

One solution is to open a file for each column. If the data has n attributes, n files are needed, and every incoming row has to be appended to all n files. But users of a file format want to keep complex data in a single file rather than manage a pile of small files (imagine a slide deck in which every slide is saved as a separate file), so a Parquet file must store all the attributes of the data.

Another solution is to buffer some rows in memory and, once the buffer reaches a certain size, package the values of each column together so that the packages can be written to the file one after another in a fixed order. This is the essence of column storage: buffer the rows, then package them by column.
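The buffering idea above can be sketched in a few lines. This is only an illustration, not Parquet's actual writer: the class name `ColumnBuffer`, the `max_rows` threshold, and the `written` list that stands in for the file are all assumptions made for the example.

```python
from typing import Any


class ColumnBuffer:
    """Hypothetical buffer: rows arrive one at a time, columns leave together."""

    def __init__(self, column_names: list[str], max_rows: int) -> None:
        self.column_names = column_names
        self.max_rows = max_rows
        self.columns: dict[str, list[Any]] = {name: [] for name in column_names}
        self.buffered_rows = 0
        # Stands in for the output file: packages are appended in order.
        self.written: list[tuple[str, list[Any]]] = []

    def append(self, row: dict[str, Any]) -> None:
        # Split the incoming row across the per-column lists.
        for name in self.column_names:
            self.columns[name].append(row[name])
        self.buffered_rows += 1
        if self.buffered_rows >= self.max_rows:
            self.flush()

    def flush(self) -> None:
        # Write each column's buffered values as one package, column by column.
        if self.buffered_rows == 0:
            return
        for name in self.column_names:
            self.written.append((name, self.columns[name]))
            self.columns[name] = []
        self.buffered_rows = 0


buf = ColumnBuffer(["id", "name"], max_rows=2)
for row in [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3, "name": "c"}]:
    buf.append(row)
buf.flush()  # write the leftover partial batch
print(buf.written)
# [('id', [1, 2]), ('name', ['a', 'b']), ('id', [3]), ('name', ['c'])]
```

Note how the output is still a single sequential stream, but values of the same column sit next to each other inside each batch.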

File format

Following the approach above, Parquet also has to split the data of each column into packages. The unit of this split is called a Page. A Page can be cut by number of values (for example, one Page for every 1000 rows) or by size (for example, one Page for every 8 KB of a column's data).

A Page holds values from a single column, all of the same type, and is generally encoded and compressed before it is stored on disk. To make it quicker to query and decompress a Page, statistics such as the maximum and minimum values are computed when it is written. They are kept in the PageHeader, which is stored at the beginning of the Page and is in effect the Page's metadata. The data follows the PageHeader. When reading a Page, you can check the PageHeader first and filter out Pages that cannot match.

Parquet also stores multiple Pages together as a Column Chunk. Each column therefore consists of multiple Column Chunks, and each Column Chunk has its own Column Chunk Metadata. Note that a Column Chunk covers only one attribute of the records. The other attributes are stored in their own Column Chunks, and the Column Chunks that cover the same rows together are called a Row Group.
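Both splitting rules can be sketched as small helper functions. The function names and the exact cut-off behaviour are assumptions for illustration; in particular, the size-based variant here closes a Page just before it would exceed the limit, which is one possible reading of "8 KB per Page".

```python
def split_by_rows(values: list[int], rows_per_page: int) -> list[list[int]]:
    """Start a new page every `rows_per_page` values."""
    return [values[i:i + rows_per_page] for i in range(0, len(values), rows_per_page)]


def split_by_size(values: list[bytes], max_page_bytes: int) -> list[list[bytes]]:
    """Close the current page when adding a value would exceed `max_page_bytes`."""
    pages: list[list[bytes]] = []
    current: list[bytes] = []
    current_size = 0
    for value in values:
        if current and current_size + len(value) > max_page_bytes:
            pages.append(current)
            current, current_size = [], 0
        current.append(value)
        current_size += len(value)
    if current:
        pages.append(current)
    return pages


row_pages = split_by_rows(list(range(2500)), rows_per_page=1000)
print([len(p) for p in row_pages])   # [1000, 1000, 500]

size_pages = split_by_size([b"x" * 3000] * 5, max_page_bytes=8 * 1024)
print([len(p) for p in size_pages])  # [2, 2, 1]
```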
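As a rough sketch of how such a header helps at read time, the example below keeps only a minimum and a maximum per Page and skips Pages whose range cannot contain the value being searched for. The `PageHeader` and `Page` classes here are simplified stand-ins invented for the example, not Parquet's real structures.

```python
from dataclasses import dataclass


@dataclass
class PageHeader:
    # Hypothetical, simplified header: only the statistics mentioned above.
    min_value: int
    max_value: int


@dataclass
class Page:
    header: PageHeader
    values: list[int]


def make_page(values: list[int]) -> Page:
    # Statistics are computed once, when the page is written.
    return Page(PageHeader(min(values), max(values)), values)


def scan_equal(pages: list[Page], target: int) -> tuple[list[int], int]:
    """Return the matching values and the number of pages actually read."""
    matches: list[int] = []
    pages_read = 0
    for page in pages:
        # Skip the page if the target cannot lie inside [min, max].
        if target < page.header.min_value or target > page.header.max_value:
            continue
        pages_read += 1
        matches.extend(v for v in page.values if v == target)
    return matches, pages_read


pages = [make_page([1, 5, 9]), make_page([10, 12, 18]), make_page([20, 25, 25])]
print(scan_equal(pages, 25))  # ([25, 25], 1)
```

Only one of the three Pages is read here; the other two are ruled out from their headers alone, without touching (or decompressing) their data.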
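The nesting of Row Group, Column Chunk, and Page can be pictured with a small model. This is a sketch under simplifying assumptions: the class and function names are made up for the example, Pages are plain Python lists, and the metadata that accompanies each Column Chunk is left out.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ColumnChunk:
    column: str
    # Each page holds consecutive values of this one column.
    pages: list[list[Any]] = field(default_factory=list)


@dataclass
class RowGroup:
    num_rows: int
    # One column chunk per attribute, all covering the same rows.
    chunks: dict[str, ColumnChunk] = field(default_factory=dict)


def build_row_group(rows: list[dict[str, Any]], rows_per_page: int) -> RowGroup:
    """Turn a batch of rows into one row group with one column chunk per attribute."""
    group = RowGroup(num_rows=len(rows))
    if not rows:
        return group
    for column in rows[0]:
        values = [row[column] for row in rows]
        pages = [values[i:i + rows_per_page] for i in range(0, len(values), rows_per_page)]
        group.chunks[column] = ColumnChunk(column, pages)
    return group


rows = [{"id": i, "city": f"c{i}"} for i in range(5)]
group = build_row_group(rows, rows_per_page=2)
print(group.chunks["id"].pages)    # [[0, 1], [2, 3], [4]]
print(group.chunks["city"].pages)  # [['c0', 'c1'], ['c2', 'c3'], ['c4']]
```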

The official Parquet documentation describes the file layout as follows:

4-byte magic number "PAR1"

...

...

...

...

File Metadata

4-byte length in bytes of file metadata

4-byte magic number "PAR1"
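Given this layout, a reader can find the File Metadata by working backwards from the end of the file. The sketch below does that on an in-memory byte string. The function name `locate_file_metadata` is an assumption for the example, and the little-endian byte order of the length field comes from the Parquet format specification rather than from the listing above. The sample bytes are placeholders that only mimic the layout; they are not a real Parquet file.

```python
import struct

MAGIC = b"PAR1"


def locate_file_metadata(data: bytes) -> tuple[int, int]:
    """Return (offset, length) of the File Metadata bytes inside `data`."""
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic number")
    # The 4 bytes before the trailing magic number hold the metadata length
    # (little-endian, per the Parquet format specification).
    (length,) = struct.unpack("<I", data[-8:-4])
    offset = len(data) - 8 - length
    if offset < 4:
        raise ValueError("metadata length exceeds file size")
    return offset, length


# Synthetic bytes that follow the layout; the payloads are placeholders.
body = b"column-chunks"
metadata = b"file-metadata"
blob = MAGIC + body + metadata + struct.pack("<I", len(metadata)) + MAGIC

offset, length = locate_file_metadata(blob)
print(blob[offset:offset + length])  # b'file-metadata'
```

Decoding the metadata bytes themselves is a separate step and is not attempted here.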

The magic number works like a watermark that marks the file as Parquet, and the Metadata of the entire file comes at the end. The figure in Parquet's official documentation shows the format like this:

On the left is the data and on the right is File Metadata.

If that looks too complicated, here is a simplified version I drew:

Much cleaner, isn't it? File Metadata contains the corresponding Row Group Metadata and Column Chunk Metadata, organized in the same way as the data itself, so they are not expanded on here.

That concludes this look at the characteristics of Parquet. Hopefully it has answered your questions. Theory sticks best when paired with practice, so go and try it out!
