[Summary] A comparison of Spark's processing of LZO-compressed files read as plain text (textFile) versus with LzoTextInputFormat

Shulou(Shulou.com)06/03 Report--

1. Describes how to load LZO-compressed files in Spark.

2. Compares the number of running tasks when an LZO-format file is processed with textFile versus with LzoTextInputFormat.

a. Make sure the lzo.index file has been generated in the directory where the .lzo file resides.

(Run an index operation on the LZO-compressed file to generate the lzo.index file first; only then can the map stage split the file. A verification sketch follows the command.

hadoop jar ${HADOOP_HOME}/lib/hadoop-lzo.jar com.hadoop.compression.lzo.DistributedLzoIndexer /wh/source/)
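
As a quick sanity check before submitting the Spark job, the presence of the index can be verified programmatically with Hadoop's FileSystem API. This is a minimal sketch and is not part of the original article; it only reuses the /wh/source/ directory from the command above, and the object and method names here are illustrative.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Minimal sketch: confirm every .lzo file under /wh/source/ has a companion .lzo.index.
// The directory path is taken from the article; the rest is illustrative only.
object CheckLzoIndex {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())
    val dir = new Path("/wh/source/")
    fs.listStatus(dir)
      .map(_.getPath)
      .filter(_.getName.endsWith(".lzo"))
      .foreach { lzo =>
        // DistributedLzoIndexer writes the index next to the file as <name>.lzo.index
        val index = new Path(lzo.toString + ".index")
        println(s"${lzo.getName} indexed: ${fs.exists(index)}")
      }
  }
}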

b. When processed with LzoTextInputFormat, tasks are allocated normally, according to the number of HDFS blocks.

View the number of file blocks:

[tech@dx2 ~]$ hdfs fsck /wh/source/hotel.2017-08-07.txt_10.10.10.10_20170807.lzo
Connecting to namenode via http://nn1.zdp.ol:50070
FSCK started by bwtech (auth:SIMPLE) from /10.10.10.10 for path /wh/source/hotel.2017-08-07.txt_10.10.16.105_20170807.lzo at Tue Aug 08 15:27:52 CST 2017
Status: HEALTHY
 Total size:                    2892666412 B
 Total dirs:                    0
 Total files:                   1
 Total symlinks:                0
 Total blocks (validated):      11 (avg. block size 262969673 B)
 Minimally replicated blocks:   11 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     3.0
 Corrupt blocks:                0
 Missing replicas:              0 (0.0 %)
 Number of data-nodes:          21
 Number of racks:               2
FSCK ended at Tue Aug 08 15:27:52 CST 2017 in 3 milliseconds
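
The same block count can also be read programmatically, which makes the link between HDFS blocks and Spark partitions explicit: 2,892,666,412 B spread over 11 blocks is roughly 263 MB per block, so LzoTextInputFormat can schedule 11 tasks. The sketch below is not from the original article; it only assumes the file path used above and standard Hadoop FileSystem calls.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Minimal sketch: count the HDFS blocks of the .lzo file, mirroring the fsck output above.
// The path comes from the article; everything else is illustrative.
object CountBlocks {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())
    val path = new Path("/wh/source/hotel.2017-08-07.txt_10.10.10.10_20170807.lzo")
    val status = fs.getFileStatus(path)
    val blocks = fs.getFileBlockLocations(status, 0L, status.getLen)
    println(s"total size: ${status.getLen} B, blocks: ${blocks.length}")
    // Expect 11 blocks here, matching the 11 partitions LzoTextInputFormat produces below.
  }
}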

For the Spark source code, refer to https://github.com/chocolateBlack/LearningSpark/blob/master/src/main/scala-2.11/SparkLzoFile.scala

import com.hadoop.mapreduce.LzoTextInputFormat
import org.apache.hadoop.io.{Text, LongWritable}
import org.apache.spark.{SparkContext, SparkConf}

object SparkLzoFile {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Spark_Lzo_File")
    val sc = new SparkContext(conf)
    // File path
    val filePath = "/wh/source/hotel.2017-08-07.txt_10.10.10.10_20170807.lzo"
    // Load the file via textFile
    val textFile = sc.textFile(filePath)
    // Load the file via LzoTextInputFormat
    val lzoFile = sc.newAPIHadoopFile[LongWritable, Text, LzoTextInputFormat](filePath)
    println(textFile.partitions.length) // number of partitions: prints 1
    println(lzoFile.partitions.length)  // number of partitions: prints 11
    // Run word count both ways and compare the tasks in the Spark UI
    lzoFile.map(_._2.toString).flatMap(x => x.split("-")).map((_, 1)).reduceByKey(_ + _).collect
    textFile.flatMap(x => x.split("\t")).map((_, 1)).reduceByKey(_ + _).collect
  }
}
