Hadoop file storage divides into row storage and column storage, and each category offers several formats. How do you use them in practice, and which should you choose? Read on!
When do we specify the file storage format? When creating a table in Hive or Impala, for example, in addition to specifying the columns and delimiters, the statement ends with a STORED AS clause. This clause defaults to text format, but text does not suit every scenario, so this is where we can change the format.
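As a minimal sketch (the table name and columns here are hypothetical), this is where the STORED AS clause appears in a Hive CREATE TABLE statement:

    -- Hypothetical table: delimited text is the default storage format.
    CREATE TABLE page_views (
        user_id BIGINT,
        url     STRING,
        ts      TIMESTAMP
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;  -- could also be SEQUENCEFILE, AVRO, PARQUET, ...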
So which format should we choose? What are the characteristics of each format, and why would we pick one over another?
1. Text files
Text files are the most basic file type in Hadoop. They can be read and written from any programming language and are compatible with comma- and tab-separated files and many other applications. A text file is also directly human-readable, which makes it very useful for debugging, since everything is a string. However, once data reaches a certain scale this format becomes very inefficient: (1) text files waste storage space by representing every value as a string; (2) they represent binary data such as images poorly, usually relying on other techniques such as Base64 encoding.
To sum up, the text file format is easy to work with, but performs poorly.
2. Sequence files
Sequence files are essentially a binary container format based on key-value pairs. They are less redundant and more efficient than text, and they are suitable for storing binary data such as images. However, the format is Java-specific and tightly integrated with Hadoop.
So the sequence file format can be summed up as: good performance, but harder to work with.
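As a hedged sketch, converting the hypothetical page_views table above into this format only requires changing the STORED AS clause:

    -- Hypothetical example: copy the text table into a binary
    -- key-value sequence file table.
    CREATE TABLE page_views_seq
    STORED AS SEQUENCEFILE
    AS SELECT * FROM page_views;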
3. Avro data files
Avro data files are binary encoded and store data efficiently. The format is not only widely supported across the Hadoop ecosystem but can also be used outside Hadoop, and it can be read and written from many languages, which makes it an ideal choice for long-term storage of important data.
The schema is embedded in the file itself, so we can define the structure of the data much as we would a table, flexibly specifying fields and field types. Schema evolution lets the format adapt to change: we can specify a schema now and later add fields, delete fields, or change a field's type or length.
So Avro data files can be summed up as: excellent operability and performance, and the best choice for general-purpose storage in Hadoop.
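Here is a minimal sketch of an Avro-backed Hive table (the names are hypothetical): the schema is declared as JSON, and because every Avro file embeds its schema, a field with a default value illustrates the kind of change schema evolution allows:

    -- Hypothetical example: the schema travels with the data. The optional
    -- "referrer" field has a default, so a reader using this schema can
    -- still read older files that were written without it.
    CREATE TABLE page_views_avro
    STORED AS AVRO
    TBLPROPERTIES ('avro.schema.literal' = '{
      "type": "record",
      "name": "PageView",
      "fields": [
        {"name": "user_id",  "type": "long"},
        {"name": "url",      "type": "string"},
        {"name": "referrer", "type": ["null", "string"], "default": null}
      ]
    }');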
All three formats described above are row-oriented, but Hadoop also has column storage formats. A typical OLTP system stores data by row: consecutive rows go to consecutive blocks. For a random lookup with a filter condition, row storage can quickly locate the right block and pull out the whole row. Column storage keeps data in column units instead, so if it were used for OLTP, fetching a single row would mean touching every column, which makes column storage a terrible fit for online transaction processing.
The value of column storage lies in big data analysis scenarios such as feature extraction and variable filtering. Big data applications usually work with very wide tables, and a given analysis may need only one column, or a few dozen; with column storage we can scan just those columns rather than the whole table. Neither row storage nor column storage is absolutely better; they simply suit different scenarios.
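To make the wide-table point concrete, here is a hypothetical analytical query against the example table: a column store only needs to read the url and ts columns it references, while a row store has to read every row in full:

    -- Hypothetical query: with column storage, only the url and ts
    -- columns are scanned, however wide the table is.
    SELECT url, MAX(ts) AS last_view
    FROM page_views
    GROUP BY url;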
Let's take a look at the most important column storage format:
1. Parquet files
The Parquet file format is very important and will only see wider use. If HDFS is the de facto standard for big data storage, Parquet is the de facto standard for file storage formats; Spark has already adopted it as its default file format, which shows its importance. This open source column storage format, originally developed by Cloudera and Twitter, is supported in MapReduce, Hive, Pig, Impala, Spark, Crunch, and other projects. Like Avro data files, Parquet files carry schema metadata; the difference is that Parquet is column-oriented while Avro is row-oriented. It is worth emphasizing that Parquet applies additional encoding optimizations to reduce storage space and improve performance.
So Parquet files can be summed up as: excellent operability and performance, and the best choice for column-oriented access patterns.
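As a final hedged sketch, switching the hypothetical table to this format is again just a matter of the STORED AS clause:

    -- Hypothetical example: store the same data column by column
    -- as Parquet for analytical access.
    CREATE TABLE page_views_parquet
    STORED AS PARQUET
    AS SELECT * FROM page_views;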
File storage formats deserve focused study, especially each format's strengths and weaknesses; only by mastering them can you choose the right one for the job. Beyond that, sharing and discussing with others in day-to-day work is a good way to round out your knowledge and raise your technical level.