2015-10-02

I am trying to write a word count program using Hadoop MapReduce. What I need to do is develop an indexed word count application that counts the occurrences of each word in each file of a given input file set; the file set lives in an Amazon S3 bucket. It should also count the total number of occurrences of each word. I have attached the code that counts word occurrences across the given file set. Beyond this, I need to print, for each word, which files it appears in and how many times it occurs in each of those particular files.

I know it is a bit complex, but any help would be appreciated.

Map.java

import java.io.IOException; 
import java.util.*; 

import org.apache.hadoop.io.*; 
import org.apache.hadoop.mapreduce.*; 
import org.apache.hadoop.mapreduce.lib.input.FileSplit; 

public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    // Only lowercase words of letters and digits, starting with a letter, are counted.
    private String pattern = "^[a-z][a-z0-9]*$";

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        // Name of the file this split came from -- extracted here, but not yet
        // part of the output key, which is the gap this question is about.
        InputSplit inputSplit = context.getInputSplit();
        String fileName = ((FileSplit) inputSplit).getPath().getName();
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            String stringWord = word.toString().toLowerCase();
            if (stringWord.matches(pattern)) {
                context.write(new Text(stringWord), one);
            }
        }
    }
}

Reduce.java

import java.io.IOException; 

import org.apache.hadoop.io.*; 
import org.apache.hadoop.mapreduce.*; 

public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

WordCount.java

import org.apache.hadoop.fs.Path; 
import org.apache.hadoop.conf.*; 
import org.apache.hadoop.io.*; 
import org.apache.hadoop.mapreduce.*; 
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; 
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat; 
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; 
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat; 

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job = new Job(conf, "WordCount");
        job.setJarByClass(WordCount.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setNumReduceTasks(3);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }
}

Where is your question? – Roman


Sorry, I didn't understand. –


This site is for questions and answers. There is not a single question mark in your post, so what exactly are you asking? – Roman

Answer


In the mapper, create a custom Writable text pair that will be the output key; it holds the file name and the word, with 1 as the value.

Mapper output:

<K,V> ==> <MytextpairWritable, new IntWritable(1)>

You can get the file name in the mapper with the following snippet:

FileSplit fileSplit = (FileSplit)context.getInputSplit(); 
String filename = fileSplit.getPath().getName(); 

Then pass the file name and the word to the constructor of the custom Writable in context.write, something like this:

context.write(new MytextpairWritable(filename,word),new IntWritable(1)); 
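The answer references MytextpairWritable but never defines it. Below is a minimal sketch of what such a composite key could look like: the class name and the (filename, word) constructor are taken from the context.write call above, while the rest is an assumption following the usual Hadoop WritableComparable pattern.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

// Composite map-output key holding (fileName, word). Hadoop groups reduce
// input by key, so each reduce call sees all counts for one word in one file.
public class MytextpairWritable implements WritableComparable<MytextpairWritable> {
    private Text fileName = new Text();
    private Text word = new Text();

    // Hadoop instantiates keys reflectively, so a no-arg constructor is required.
    public MytextpairWritable() {
    }

    public MytextpairWritable(String fileName, String word) {
        this.fileName.set(fileName);
        this.word.set(word);
    }

    public void write(DataOutput out) throws IOException {
        fileName.write(out);
        word.write(out);
    }

    public void readFields(DataInput in) throws IOException {
        fileName.readFields(in);
        word.readFields(in);
    }

    // Sort by file name first, then by word.
    public int compareTo(MytextpairWritable other) {
        int cmp = fileName.compareTo(other.fileName);
        return cmp != 0 ? cmp : word.compareTo(other.word);
    }

    // hashCode() decides which reducer a key is sent to (HashPartitioner).
    @Override
    public int hashCode() {
        return fileName.hashCode() * 163 + word.hashCode();
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof MytextpairWritable)) {
            return false;
        }
        MytextpairWritable other = (MytextpairWritable) o;
        return fileName.equals(other.fileName) && word.equals(other.word);
    }

    // TextOutputFormat prints the key via toString(). With the default
    // key/value separator the line comes out as "File1,hello<TAB>2"; setting
    // mapreduce.output.textoutputformat.separator to "," gives the comma
    // form shown at the end of this answer.
    @Override
    public String toString() {
        return fileName + "," + word;
    }
}

Hadoop needs the no-arg constructor to deserialize keys, compareTo to sort and group them, and hashCode to partition them consistently.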

In the reducer, just sum the values; that gives you, for each file, how many times a particular word occurs in it. The reducer code would be something like this:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class Reduce extends Reducer<MytextpairWritable, IntWritable, MytextpairWritable, IntWritable> {

    public void reduce(MytextpairWritable key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        // The key is (fileName, word), so the sum is that word's count in that file.
        context.write(key, new IntWritable(sum));
    }
}
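One detail the answer leaves implicit: once the mapper emits MytextpairWritable instead of Text, the driver must declare the new key type, because it no longer matches the job's output key class. Assuming the WordCount driver from the question, the extra wiring would look roughly like this (setMapOutputKeyClass and setMapOutputValueClass are standard org.apache.hadoop.mapreduce.Job methods):

// In WordCount.main(), after the Job is created:
job.setMapOutputKeyClass(MytextpairWritable.class); // key type emitted by the mapper
job.setMapOutputValueClass(IntWritable.class);      // value type emitted by the mapper
job.setOutputKeyClass(MytextpairWritable.class);    // the reducer emits the same composite key
job.setOutputValueClass(IntWritable.class);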

Your output will then look something like this:

File1,hello,2 
File2,hello,3 
File3,hello,1