File Name Wildcards and Filtering in MapReduce: An Example Analysis

Updated 2025-02-27. From: SLTechnology News&Howtos (shulou)

This article walks through file name wildcards (glob patterns) and path filtering in MapReduce, with worked examples.

1. Introduction to wildcards and their use

Processing a batch of files in a single operation is a common requirement. For example, a MapReduce job for log processing might analyze a month's worth of files spread across many directories. Rather than enumerating each file and directory to specify the input, you can use a wildcard expression to match multiple files at once, an operation known as globbing. Hadoop provides two FileSystem methods for processing globs:

public FileStatus[] globStatus(Path pathPattern) throws IOException
public FileStatus[] globStatus(Path pathPattern, PathFilter filter) throws IOException

globStatus() returns an array of FileStatus objects whose paths match the supplied pattern, sorted by path. The optional PathFilter argument restricts the matches further.

Hadoop supports the same range of wildcards as Unix bash (see Table 3-2).

Table 3-2: Wildcards and their meanings

Wildcard   Name                      Matches
*          Asterisk                  Zero or more characters
?          Question mark             A single character
[ab]       Character class           A single character in the set {a, b}
[^ab]      Negated character class   A single character not in the set {a, b}
[a-b]      Character range           A single character in the closed range [a, b], where a is lexicographically less than or equal to b
[^a-b]     Negated character range   A single character not in the closed range [a, b], where a is lexicographically less than or equal to b
{a,b}      Alternation               Either expression a or expression b
\c         Escaped character         The character c when it is a metacharacter

Suppose log files are stored in a directory structure organized by date, so that the log files for the last day of 2007 go in a directory named /2007/12/31. Suppose the full directory listing is:

/2007/12/30
/2007/12/31
/2008/01/01
/2008/01/02

Here are some file wildcards and their expansions.

Wildcard             Expansion
/*                   /2007 /2008
/*/*                 /2007/12 /2008/01
/*/12/*              /2007/12/30 /2007/12/31
/200?                /2007 /2008
/200[78]             /2007 /2008
/200[7-8]            /2007 /2008
/200[^01234569]      /2007 /2008
/*/*/{31,01}         /2007/12/31 /2008/01/01
/*/*/3{0,1}          /2007/12/30 /2007/12/31
/*/{12/31,01/01}     /2007/12/31 /2008/01/01
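
To make the expansions above concrete, here is a minimal sketch that translates a wildcard into a Java regular expression and applies it to a list of directory paths. It uses only the JDK and is not Hadoop's implementation: the class name GlobSketch and the methods toRegex and expand are hypothetical names chosen for illustration. It assumes that * and ? never cross a / separator and that bracket contents can be passed to the regex engine unchanged, so in this sketch a negated class such as [^ab] would also match /. Hadoop's actual behavior may differ in such edge cases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import java.util.regex.Pattern;

// Illustrative sketch only: this is NOT Hadoop's glob implementation.
final class GlobSketch {

  // Translate a wildcard (per the wildcard table) into a Java regex string.
  static String toRegex(String glob) {
    StringBuilder re = new StringBuilder();
    int braceDepth = 0; // > 0 while inside {a,b}
    for (int i = 0; i < glob.length(); i++) {
      char c = glob.charAt(i);
      switch (c) {
        case '*': re.append("[^/]*"); break; // zero or more chars within one path component
        case '?': re.append("[^/]"); break;  // exactly one char within one path component
        case '[': {
          // Copy [ab], [^ab], [a-b], [^a-b] through unchanged (assumes simple contents).
          int close = glob.indexOf(']', i + 1);
          if (close < 0) throw new IllegalArgumentException("unclosed [ in " + glob);
          re.append('[').append(glob, i + 1, close).append(']');
          i = close;
          break;
        }
        case '{': re.append("(?:"); braceDepth++; break;
        case '}':
          if (braceDepth > 0) { re.append(')'); braceDepth--; }
          else re.append(Pattern.quote("}"));
          break;
        case ',': re.append(braceDepth > 0 ? "|" : ","); break;
        case '\\':
          // \c matches the metacharacter c literally.
          if (++i >= glob.length()) throw new IllegalArgumentException("dangling \\ in " + glob);
          re.append(Pattern.quote(String.valueOf(glob.charAt(i))));
          break;
        default: re.append(Pattern.quote(String.valueOf(c)));
      }
    }
    return re.toString();
  }

  // Match the wildcard against each path and each of its ancestor directories,
  // returning the matches sorted by path.
  static List<String> expand(String glob, List<String> paths) {
    Pattern pattern = Pattern.compile(toRegex(glob));
    TreeSet<String> matches = new TreeSet<>();
    for (String path : paths) {
      // Visits e.g. /2007/12/31, then /2007/12, then /2007.
      for (int end = path.length(); end > 0; end = path.lastIndexOf('/', end - 1)) {
        String candidate = path.substring(0, end);
        if (pattern.matcher(candidate).matches()) matches.add(candidate);
      }
    }
    return new ArrayList<>(matches);
  }
}
```

Given the four directories listed earlier, GlobSketch.expand("/*/12/*", paths) yields /2007/12/30 and /2007/12/31, and GlobSketch.expand("/*/{12/31,01/01}", paths) yields /2007/12/31 and /2008/01/01. Matching against ancestor directories is a simplification that stands in for walking a real file system level by level.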

2. PathFilter object

Wildcard patterns cannot always describe the exact set of files we want to access. For example, it is generally not possible to exclude one particular file using a wildcard pattern. The listStatus() and globStatus() methods of FileSystem take an optional PathFilter, which lets us control the matching programmatically:

package org.apache.hadoop.fs;

public interface PathFilter {
  boolean accept(Path path);
}

PathFilter is the equivalent of java.io.FileFilter, but for Path objects rather than File objects.

Example 3-7 shows a PathFilter that excludes paths matching a regular expression.

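import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Rejects any path whose full string form matches the given regular expression.
public class RegexExcludePathFilter implements PathFilter {
  private final String regex;

  public RegexExcludePathFilter(String regex) {
    this.regex = regex;
  }

  @Override
  public boolean accept(Path path) {
    return !path.toString().matches(regex);
  }
}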

This filter passes only the files that do not match the regular expression. We combine it with a wildcard that first selects an initial set of files, and the filter then refines the results. For example:

fs.globStatus(new Path("/2007/*/*"), new RegexExcludePathFilter("^.*/2007/12/31$"))

With the directories listed earlier, this expands to /2007/12/30.
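
The wildcard-then-filter flow can be tried without a Hadoop cluster. The sketch below is a JDK-only stand-in that works on plain path strings: the class ExcludeFilterSketch and the methods regexExclude and refine are hypothetical names, not part of Hadoop, and the list passed to refine is assumed to be what the wildcard already matched.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;

// JDK-only stand-in for the wildcard-then-filter flow; no Hadoop classes are used.
final class ExcludeFilterSketch {

  // Hypothetical analogue of RegexExcludePathFilter over plain path strings:
  // accepts a path only when the whole string does NOT match the regex.
  static Predicate<String> regexExclude(String regex) {
    Pattern pattern = Pattern.compile(regex);
    return path -> !pattern.matcher(path).matches();
  }

  // Keep only the wildcard matches that the filter accepts, preserving their order.
  static List<String> refine(List<String> globMatches, Predicate<String> filter) {
    List<String> kept = new ArrayList<>();
    for (String path : globMatches) {
      if (filter.test(path)) {
        kept.add(path);
      }
    }
    return kept;
  }
}
```

Calling refine on the two matches of /2007/*/* (that is, /2007/12/30 and /2007/12/31) with regexExclude("^.*/2007/12/31$") leaves only /2007/12/30. The leading ^.* in the pattern lets it match regardless of what precedes /2007 in the path string.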
