MapReduce Word Segmentation for Text Mining


Software version

paoding-analysis 3.0

Import the project jar packages and copy Paoding's dic directory to the project's classpath.

Modify the paoding-dic-home.properties file inside paoding-analysis.jar to set the dictionary path:

paoding.dic.home=classpath:dic

Word segmentation demo program

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import net.paoding.analysis.analyzer.PaodingAnalyzer;

public class TokenizeWithPaoding {

    public static void main(String[] args) {
        String line = "Republic of the Chinese nation";
        PaodingAnalyzer analyzer = new PaodingAnalyzer();
        StringReader sr = new StringReader(line);
        // Token stream; the first parameter (the field name) is meaningless here
        TokenStream ts = analyzer.tokenStream("", sr);
        // Iterate over the token stream and print each token
        try {
            while (ts.incrementToken()) {
                CharTermAttribute ta = ts.getAttribute(CharTermAttribute.class);
                System.out.println(ta.toString());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

News text classification source file

http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz

Each folder represents a category, and each file under a category is one news item.

Chinese news classification needs word segmentation first.

For a large number of small files, you can use CombineFileInputFormat, an abstract subclass of FileInputFormat, and implement its createRecordReader method.

CombineFileInputFormat overrides the getSplits method; the returned split type is CombineFileSplit, a subclass of InputSplit that can contain multiple files.

How the RecordReader turns each file into a key-value pair is determined by its nextKeyValue method.

Custom CombineFileInputFormat class

package org.conan.myhadoop.fengci;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

/**
 * Custom MyInputFormat class, used to build a split that contains multiple files.
 * @author BOB
 */
public class MyInputFormat extends CombineFileInputFormat<Text, Text> {

    // Do not split individual files
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }

    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException {
        return new CombineFileRecordReader<Text, Text>((CombineFileSplit) split, context, MyRecordReader.class);
    }
}

Custom RecordReader class

package org.conan.myhadoop.fengci;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

/**
 * Custom MyRecordReader class, used to read the contents of a split produced by MyInputFormat.
 * @author BOB
 */
public class MyRecordReader extends RecordReader<Text, Text> {

    private CombineFileSplit combineFileSplit;   // the split currently being processed
    private Configuration conf;                  // job configuration
    private Text currentKey = new Text();        // key currently being read
    private Text currentValue = new Text();      // value currently being read
    private int totalLength;                     // number of files in the current split
    private int currentIndex;                    // index of the file being read within the split
    private float currentProgress = 0F;          // current progress
    private boolean processed = false;           // whether the current file has already been processed

    // Constructor
    public MyRecordReader(CombineFileSplit combineFileSplit, TaskAttemptContext context, Integer fileIndex) {
        super();
        this.combineFileSplit = combineFileSplit;
        this.currentIndex = fileIndex;
        this.conf = context.getConfiguration();
        this.totalLength = combineFileSplit.getPaths().length;
    }

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
    }

    @Override
    public Text getCurrentKey() throws IOException, InterruptedException {
        return currentKey;
    }

    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        return currentValue;
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        if (currentIndex >= 0 && currentIndex < totalLength) {
            return currentProgress = (float) currentIndex / totalLength;
        }
        return currentProgress;
    }

    @Override
    public void close() throws IOException {
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (!processed) {
            // The key is built from the file's parent directory name, the file name and directory separators
            Path file = combineFileSplit.getPath(currentIndex);
            StringBuilder sb = new StringBuilder();
            sb.append("/");
            sb.append(file.getParent().getName()).append("/");
            sb.append(file.getName());
            currentKey.set(sb.toString());

            // The contents of the entire file become the value
            FSDataInputStream in = null;
            byte[] content = new byte[(int) combineFileSplit.getLength(currentIndex)];
            FileSystem fs = file.getFileSystem(conf);
            in = fs.open(file);
            in.readFully(content);
            currentValue.set(content);
            in.close();

            processed = true;
            return true;
        }
        return false;
    }
}

Word segmentation driver class

package org.conan.myhadoop.fengci;

import java.io.IOException;
import java.io.StringReader;

import net.paoding.analysis.analyzer.PaodingAnalyzer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/**
 * Word segmentation driver class.
 * @author BOB
 */
public class TokenizerDriver extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new TokenizerDriver(), args);
        System.exit(res);
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Parameter setting
        conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 4000000);
        // Job name
        Job job = new Job(conf, "Tokenizer");
        job.setJarByClass(TokenizerDriver.class);
        job.setMapperClass(Map.class);
        job.setInputFormatClass(MyInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        Path inpath = new Path(args[0]);
        Path outpath = new Path(args[1]);

        FileSystem fs = inpath.getFileSystem(conf);
        FileStatus[] status = fs.listStatus(inpath);
        Path[] paths = FileUtil.stat2Paths(status);
        for (Path path : paths) {
            FileInputFormat.addInputPath(job, path);
        }
        FileOutputFormat.setOutputPath(job, outpath);

        // Delete the output folder if it already exists
        FileSystem hdfs = outpath.getFileSystem(conf);
        if (hdfs.exists(outpath)) {
            hdfs.delete(outpath, true);
            hdfs.close();
        }

        // No reduce tasks
        job.setNumReduceTasks(0);
        return job.waitForCompletion(true) ? 0 : 1;
    }

    /**
     * Map class used to run the text word segmentation task in parallel.
     * @author BOB
     */
    static class Map extends Mapper<Text, Text, Text, Text> {

        @Override
        protected void map(Text key, Text value, Context context) throws IOException, InterruptedException {
            // Create an analyzer
            Analyzer analyzer = new PaodingAnalyzer();
            String line = value.toString();
            StringReader reader = new StringReader(line);
            // Get the token stream
            TokenStream ts = analyzer.tokenStream("", reader);
            StringBuilder sb = new StringBuilder();
            // Iterate over the tokens in the stream, joining them with spaces
            while (ts.incrementToken()) {
                CharTermAttribute ta = ts.getAttribute(CharTermAttribute.class);
                if (sb.length() != 0) {
                    sb.append(" ").append(ta.toString());
                } else {
                    sb.append(ta.toString());
                }
            }
            value.set(sb.toString());
            context.write(key, value);
        }
    }
}

The word segmentation results are then preprocessed: all news items of one category are merged into a single text, with the category as the key, one news item per line, and the words separated by spaces.
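
The post does not show this preprocessing step, so below is a minimal local sketch of one way to do it, assuming the SequenceFile output of TokenizerDriver (keys of the form "/category/filename" as built in MyRecordReader, values holding the space-separated tokens). The class name GroupByCategory, the "part-" file filter, and the per-category output file naming are illustrative, not from the original article.

package org.conan.myhadoop.fengci;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

/**
 * Illustrative preprocessing sketch: reads the SequenceFile output of
 * TokenizerDriver and appends each news item as one line to a
 * per-category text file.
 */
public class GroupByCategory {

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path in = new Path(args[0]);    // output directory of the tokenizer job
        Path out = new Path(args[1]);   // directory for the per-category text files
        FileSystem fs = in.getFileSystem(conf);

        Map<String, FSDataOutputStream> writers = new HashMap<String, FSDataOutputStream>();
        Text key = new Text();
        Text value = new Text();

        for (FileStatus st : fs.listStatus(in)) {
            if (!st.getPath().getName().startsWith("part-")) {
                continue;   // skip _SUCCESS and other non-data files
            }
            SequenceFile.Reader reader = new SequenceFile.Reader(fs, st.getPath(), conf);
            try {
                while (reader.next(key, value)) {
                    // key is "/category/filename" as built in MyRecordReader
                    String category = key.toString().split("/")[1];
                    FSDataOutputStream w = writers.get(category);
                    if (w == null) {
                        w = fs.create(new Path(out, category + ".txt"));
                        writers.put(category, w);
                    }
                    // one news item per line, words already separated by spaces
                    w.write(value.toString().getBytes("UTF-8"));
                    w.write('\n');
                }
            } finally {
                reader.close();
            }
        }
        for (FSDataOutputStream w : writers.values()) {
            w.close();
        }
    }
}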

The processed data can then be used to train a Bayesian classifier in Mahout.

Reference articles:

http://f.dataguru.cn/thread-244375-1-1.html

http://www.cnblogs.com/panweishadow/p/4320720.html
