
What is the big data storage format Parquet?

2025-02-27 Update From: SLTechnology News&Howtos


This article explains the big data storage format Parquet in detail. The content is shared for your reference, and hopefully you will come away with a solid understanding of the relevant concepts after reading it.

Big data storage formats: why Parquet uses column storage

Let's start with relational databases and look at some queries that are common there. For a relational database, typical queries are either detail (row-level) lookups or grouped statistical queries. With very large data volumes, however, relational databases cannot support statistical queries well, which is why they are mostly used for OLTP workloads. Big data queries, on the other hand, share one characteristic: because the data volume is so large, fetching specific detail rows is of little interest, so most queries are statistical, for example: select count(1) cnt, ip from access_log group by ip, which counts the number of visits per ip. Of course, there may also be detail-level queries for deeper data analysis.

As for query speed, the faster the better. Leaving caching aside, we can improve a system's query speed in the following ways: 1. optimize the query framework (execution plans, distributed computation, and so on); 2. optimize the underlying file storage format to minimize IO; 3. and, of course, apply indexing techniques.

Consider the SQL above: select count(1) cnt, ip from access_log group by ip.

We assume that access_log has the following format:

access_log {
    string ip;
    string access_time;
    ... (N other attributes)
}

If it is row storage:

We need to read every row in full, extract the ip field from it, and then do the grouped count. This increases disk IO and also increases the computational load.

What if it is column storage:

For the statistics above we only need to read the ip column and do not have to read the columns that are not needed; that is the advantage of columnar storage for big data statistical queries. In addition, because big data is often sparse, storing data by column also makes it easier to optimize the storage structure and reduce storage space.
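To make the IO saving concrete, here is a minimal sketch using the pyarrow library, assuming the access log has already been written out as a Parquet file; the file name and the column set are hypothetical and only for illustration:

    import pyarrow.parquet as pq
    import pyarrow.compute as pc

    # Read ONLY the ip column; the other N columns of the file are
    # never pulled off disk, which is the IO saving described above.
    table = pq.read_table("access_log.parquet", columns=["ip"])

    # Roughly: select count(1) cnt, ip from access_log group by ip
    print(pc.value_counts(table["ip"]))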

Parquet introduction

Parquet is a storage format open-sourced by Twitter in 2013, based on Google's Dremel paper "Interactive Analysis of Web-Scale Datasets" (http://research.google.com/pubs/pub36632.html). That paper, describing this powerful tool for big data queries, is in essence mainly about the underlying file storage format.

Parquet supports complex nested data structures and encodes them using the repetition level / definition level scheme. This nested model can in fact avoid the join problem on big data.

Parquet supports compressing data column by column and can be read and written from many programming languages. It is currently a top-level Apache project.
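As a small illustration of the compression support, the sketch below (not from the original article) writes a toy table with pyarrow, once with a single codec for the whole file and once with a different codec per column; the table contents, file names and codec choices are assumptions made only for this example:

    import pyarrow as pa
    import pyarrow.parquet as pq

    # A toy two-column table standing in for the access log.
    table = pa.table({"ip": ["1.1.1.1", "2.2.2.2"],
                      "access_time": ["t1", "t2"]})

    # One codec for the whole file ...
    pq.write_table(table, "access_log_snappy.parquet", compression="snappy")

    # ... or a different codec per column.
    pq.write_table(table, "access_log_mixed.parquet",
                   compression={"ip": "zstd", "access_time": "gzip"})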

Analysis of the Parquet file format

First, let's look at how a Parquet schema is defined; the definition format is similar to Protocol Buffers (PB). We use classroom information as the example: there are multiple classrooms, each classroom has one master and multiple students (who may be absent), and each student has a name, multiple hobbies, and a name and score for each course. Schema definition:

message classroom {
    required string master;
    repeated group students {
        required string name;
        repeated string hobbys;
        repeated group coursescore {
            required string name;
            optional string score;
        }
    }
}

Here required means the field must appear exactly once, repeated means it can appear any number of times, and optional means it appears 0 or 1 times. Data relations can be expressed as Lists and Maps: a List is modelled with repeated, and a Map uses the group keyword.
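As a rough illustration (not part of the original article), the classroom schema above could be expressed with pyarrow's nested types, mapping repeated to a list, group to a struct and optional to a nullable field; this is only one possible sketch of such a mapping:

    import pyarrow as pa

    coursescore = pa.struct([
        pa.field("name", pa.string(), nullable=False),    # required
        pa.field("score", pa.string()),                    # optional
    ])
    student = pa.struct([
        pa.field("name", pa.string(), nullable=False),     # required
        pa.field("hobbys", pa.list_(pa.string())),         # repeated
        pa.field("coursescore", pa.list_(coursescore)),    # repeated group
    ])
    classroom = pa.schema([
        pa.field("master", pa.string(), nullable=False),   # required
        pa.field("students", pa.list_(student)),           # repeated group
    ])
    print(classroom)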

Here we define a set of data:

{master: "jack",
  student: name: "tom",
  student: name: "joy",
    hobbys: "basketball",
    hobbys: "football",
    coursescore: name: "math",
    coursescore: name: "chinese", score: 100}

{master: "BoBo"}
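To make the example runnable, here is a small sketch (again an illustration, not from the article) that writes these two records with pyarrow and reads them back; the empty lists for absent fields, the None for the missing score and the file name are assumptions made only for this example:

    import pyarrow as pa
    import pyarrow.parquet as pq

    rows = [
        {"master": "jack",
         "students": [
             {"name": "tom", "hobbys": [], "coursescore": []},
             {"name": "joy",
              "hobbys": ["basketball", "football"],
              "coursescore": [{"name": "math", "score": None},
                              {"name": "chinese", "score": "100"}]},
         ]},
        {"master": "BoBo", "students": []},
    ]

    # Nested Python dicts -> Arrow table (schema inferred) -> Parquet file.
    table = pa.Table.from_pylist(rows)
    pq.write_table(table, "classroom.parquet")

    # Reading back restores the nested records from the columnar layout.
    print(pq.read_table("classroom.parquet").to_pylist())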

Of the two classrooms, one has no students at all; of the other classroom's two students, one has hobbies and courses and the other has nothing. Let's first look at how this is stored.

Storage

According to the schema defined above, the data can first be transformed into a tree structure.

Only the leaf nodes of the tree are data nodes that actually need to be stored; the remaining nodes merely maintain the relationships between them. The following figure shows the nesting relationship.

Actual storage follows the same idea: each column stores all of its values together, so storing the data itself is the easy part.

Read

Now that all values of the same column are stored together, how do we restore them to the original nested data? This is where Parquet's metadata comes in: restoring the data requires two pieces of metadata, the Repetition Level and the Definition Level.

First of all, we need to know which values belong to the same record, and we also need to know the hierarchical relationship between the values. These two pieces of metadata let us do exactly that (and, of course, they are written as metadata at storage time).

The Repetition Level records at which level of the schema the value of the column repeats.

Take the two classroom records above as an example: for the master field of the two classrooms, the records are independent of each other with no direct relationship, so their repetition level is 0.

Jack  0

BoBo  0

For jack's students tom and joy: tom is the first student, so its level is 0, while joy repeats at the same (students) level, so its level is 1.

Master    Repetition Level
Jack      0
BoBo      0

Student.name    Repetition Level
tom             0
joy             1
null            2

Student.coursescore.score    Repetition Level
null                         0
null                         2
100                          1

Based on the repetition level, the original values can be grouped back together, but we still do not know where a record stops, nor the full relationship between the values; at this point the Definition Level is introduced.

The Definition Level is introduced as a "defined depth": it records down to which level a value's path is actually defined, that is, whether a null value is real or stands in for a missing ancestor. For non-NULL values it carries little extra meaning, since the value is always the same (the maximum for the path). Using the same example: for master, the depth is 0 because the field is required.

Master    Definition Level
Jack      0
BoBo      0

Student.name    Definition Level
tom             1
joy             2
null            1

Together with the repetition level, the definition level allows the nesting relationship of the data to be restored.
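As a rough illustration of that restoration step, the sketch below (not from the original article) reassembles a single column, classroom.students.name, from (value, repetition level, definition level) triples. It follows the convention of Google's Dremel paper, in which repetition level 0 starts a new record and definition level 0 means the students list itself is absent, so the level values used here may differ slightly from the tables above; the function and data are purely illustrative:

    # Reassemble nested records from one column's values plus its
    # repetition and definition levels (Dremel-style convention).
    def assemble_student_names(column):
        records = []
        for value, rep_level, def_level in column:
            if rep_level == 0:           # a new classroom record begins
                records.append({"students": []})
            if def_level >= 1:           # the students list is defined here
                records[-1]["students"].append({"name": value})
            # def_level == 0: NULL placeholder, the students list is absent
        return records

    # (value, repetition level, definition level) for the two classrooms
    column = [("tom", 0, 1), ("joy", 1, 1), (None, 0, 0)]
    print(assemble_student_names(column))
    # -> [{'students': [{'name': 'tom'}, {'name': 'joy'}]}, {'students': []}]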

That is all we will share here about the big data storage format Parquet. Hopefully the content above has been of some help and lets you learn something new. If you think the article is good, feel free to share it so that more people can see it.

