This article explains how to do data compression on HDFS. It goes into some detail and should be a useful reference; if you are interested, read on.
The company's Hadoop cluster has fewer than 30 machines, with a total HDFS capacity of 120T. Recently the monitoring kept raising alarms about insufficient disk space (less than 5% free). We had been busy with business work and had no time to tidy up the cluster. After cleaning up, we found the existing files came to about 34T; with 3-way replication, the whole HDFS footprint was 103T. Before the cleanup everything was stored as plain text with no compression at all, so there should be plenty of room for optimization. Among the data, one log file recording the applications installed on users' phones takes up about 5T, so we started with that one.
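To get a feel for where the space goes, the existing table's storage format and files can be checked from the Hive CLI. A minimal sketch, assuming the source table is tb1 (the name used in the INSERT further below); the partition path and day value are placeholders, not the actual ones:

-- confirm the existing table is plain TextFile with no compression settings
DESCRIBE FORMATTED tb1;
-- list one partition's files; uncompressed text files carry no .gz or .lzo suffix (placeholder path)
dfs -ls /user/hive/data/tb1/day=20140101;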
Hive has three file storage formats: TEXTFILE, SEQUENCEFILE, and RCFILE. The first two are row-oriented; RCFile is a columnar data format introduced by Hive, designed around the idea of "partition horizontally first, then partition vertically by column". When a query does not reference certain columns, their data is skipped at the IO level. So we chose RCFILE and compressed the data with Gzip.
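To illustrate the point about skipping unused columns, here is a sketch of the kind of query that benefits; pkg is a hypothetical column on the install-log table, not one named in this article:

-- with RCFile, only the bytes of the columns actually referenced are read from each row group
SELECT pkg, count(*) FROM app_install WHERE day = 20140101 GROUP BY pkg;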
I also made a rather dumb mistake along the way. A colleague (who has since left) had looked into RCFile before, so I checked that table's DDL with show create table XX and got:
CREATE EXTERNAL TABLE XX (...) PARTITIONED BY (day int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' COLLECTION ITEMS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
LOCATION '/user/hive/data/XX'
I just copied it, changed the fields, created an RCFile table for app_install, and imported the old data with SQL:
set mapred.job.priority=VERY_HIGH;
set hive.merge.mapredfiles=true;
set hive.merge.smallfiles.avgsize=200000000;
set hive.exec.compress.output=true;
set mapred.output.compress=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec;
set mapred.job.name=app_install.$_DAY;
insert overwrite table app_install1 PARTITION (day=$_DAY) select XXX from tb1 where day=$_DAY
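Once a run like this eventually succeeds (it did not on the first try, as shown below), a quick way to confirm the compression is paying off is to compare the size of one day's partition in the old text table with the same day in the new RCFile + Gzip table from the Hive CLI. A sketch; both paths and the day value are placeholders, since the real table locations are not shown above:

-- total size of one day's partition in the original uncompressed text table (on newer Hadoop, dfs -du -s -h works too)
dfs -dus /user/hive/data/tb1/day=20140101;
-- total size of the same day after rewriting into the RCFile + Gzip table
dfs -dus /user/hive/data/app_install1/day=20140101;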
The job failed, and checking the Hadoop task log showed:
FATAL ExecReducer: java.lang.UnsupportedOperationException: Currently the writer can only accept BytesRefArrayWritable
    at org.apache.hadoop.hive.ql.io.RCFile$Writer.append(RCFile.java:880)
    at org.apache.hadoop.hive.ql.io.RCFileOutputFormat$2.write(RCFileOutputFormat.java:140)
    at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:588)
    at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84)
    at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:471)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:762)
    at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.createForwardJoinObject(CommonJoinOperator.java:389)
    at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:715)
    at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:697)
    at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.genObject(CommonJoinOperator.java:697)
    at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:856)
    at org.apache.hadoop.hive.ql.exec.JoinOperator.endGroup(JoinOperator.java:265)
    at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:198)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:519)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:420)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
Posts online said this was a Hive bug, and for a long while I assumed it was. After tossing around for a whole day, I finally tried changing the CREATE TABLE statement the way those posts suggested:
CREATE EXTERNAL TABLE XX (...) PARTITIONED BY (day int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' COLLECTION ITEMS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS RCFILE
LOCATION '/user/hive/data/XX'
This time it worked normally. Viewing the statement again with show create table XX, I found it had become:
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
Maddening. The whole problem came down to the difference between the CREATE TABLE statement you write and what show create table displays. It is a small issue, but it cost a lot of effort.
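The likely explanation (my reading of the error, not something spelled out in the output above): STORED AS RCFILE sets the table's SerDe to the columnar SerDe as well as the input/output formats, whereas the show create table output keeps ROW FORMAT DELIMITED, so re-running it leaves the default row SerDe in place; that SerDe serializes rows as Text rather than the BytesRefArrayWritable the RCFile writer insists on, hence the exception. If the formats must be spelled out explicitly instead of using STORED AS RCFILE, a sketch of an equivalent definition would look like this (column list and location are placeholders):

-- explicitly pair the columnar SerDe with the RCFile input/output formats
CREATE EXTERNAL TABLE XX (...) PARTITIONED BY (day int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
LOCATION '/user/hive/data/XX'

Either way, DESCRIBE FORMATTED XX shows the SerDe library actually attached to the table, which is a quicker check than rebuilding it.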
That is all of "how to achieve data compression in HDFS". Thank you for reading! I hope the content is helpful; for more related knowledge, welcome to follow the industry information channel.