Today I will talk to you about how to merge small files for MapReduce. Many people may not know much about this, so I have summarized the following content to make it easier to understand. I hope you get something out of this article.
PathFilter class in HDFS
It is a common requirement to process a batch of files in a single operation. For example, a MapReduce job that processes logs may need to analyze a month's worth of log files contained in a large number of directories. It is convenient to use wildcards in a single expression to match multiple files, rather than enumerating every file and directory in the input specification. Hadoop provides two FileSystem methods for expanding wildcards:
public FileStatus[] globStatus(Path pathPattern) throws IOException
public FileStatus[] globStatus(Path pathPattern, PathFilter filter) throws IOException
The globStatus() method returns an array of FileStatus objects for all files matching the path pattern, sorted by path. Hadoop supports the same set of wildcards as the Unix bash shell.
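For example, the following minimal sketch prints every file matched by a glob pattern (the /logs directory and the 2023-* pattern are hypothetical examples, not paths from this article):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GlobExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Uses the default file system configured in core-site.xml
        FileSystem fs = FileSystem.get(conf);
        // Match every file under directories such as /logs/2023-01, /logs/2023-02, ...
        FileStatus[] statuses = fs.globStatus(new Path("/logs/2023-*/*"));
        for (FileStatus status : statuses) {
            System.out.println(status.getPath());
        }
    }
}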
The second method takes a PathFilter object as an additional parameter; the PathFilter can restrict the matches further. PathFilter is an interface with a single method, accept(Path path). Refer to the following code for a concrete use.
package com.tv;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.io.IOUtils;

public class MergeSmallFilesToHDFS {

    private static FileSystem fs = null;    // HDFS file system
    private static FileSystem local = null; // local file system

    // Excludes every path that matches the given regex
    public static class RegexExcludePathFilter implements PathFilter {
        private final String regex;

        public RegexExcludePathFilter(String regex) {
            this.regex = regex;
        }

        public boolean accept(Path path) {
            boolean flag = path.toString().matches(regex);
            // Filter out files matching the regex, so return !flag
            return !flag;
        }
    }

    // Accepts only the paths that match the given regex
    public static class RegexAcceptPathFilter implements PathFilter {
        private final String regex;

        public RegexAcceptPathFilter(String regex) {
            this.regex = regex;
        }

        public boolean accept(Path path) {
            boolean flag = path.toString().matches(regex);
            // Accept files matching the regex, so return flag
            return flag;
        }
    }

    public static void list() throws IOException, URISyntaxException {
        // Read the configuration
        Configuration conf = new Configuration();
        URI uri = new URI("hdfs://zbc:9000");
        // FileSystem is the core class for operating on HDFS; get the file system for this URI
        fs = FileSystem.get(uri, conf);
        // Get the local file system
        local = FileSystem.getLocal(conf);
        // List all subdirectories (named by date) under the data directory, excluding .svn directories
        FileStatus[] dirstatus = local.globStatus(
                new Path("C:/Users/zaish/Documents/learning/hadoop analysis data/tvdata/*"),
                new RegexExcludePathFilter("^.*svn$"));
        Path[] dirs = FileUtil.stat2Paths(dirstatus);
        FSDataOutputStream out = null;
        FSDataInputStream in = null;
        for (Path dir : dirs) {
            // File name: the date directory name with the hyphens removed
            String fileName = dir.getName().replace("-", "");
            // Only accept the .txt files in the date directory
            FileStatus[] localStatus = local.globStatus(
                    new Path(dir + "/*"), new RegexAcceptPathFilter("^.*txt$"));
            // All the files in the date directory
            Path[] listedPaths = FileUtil.stat2Paths(localStatus);
            // Output path of the merged file
            Path block = new Path("hdfs://zbc:9000/middle/tv/" + fileName + ".txt");
            // Open the output stream
            out = fs.create(block);
            for (Path p : listedPaths) {
                // Open the input stream
                in = local.open(p);
                // Copy the data
                IOUtils.copyBytes(in, out, 4096, false);
                // Close the input stream
                in.close();
            }
            if (out != null) {
                // Close the output stream
                out.close();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        list();
    }
}
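One detail worth noting: IOUtils.copyBytes is called with false as its last argument, so it does not close the streams itself. Each input stream is closed inside the inner loop after its file is copied, while the merged output stream is closed once per date directory. To run the program, you would typically package the class and launch it with something like hadoop jar merge.jar com.tv.MergeSmallFilesToHDFS (the jar name here is only a placeholder).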
After reading the above, do you have a better understanding of how to merge small files in MapReduce? I hope you have gotten something useful out of this article.