2017-07-02

What are the empty files after RDD.saveAsTextFile?

I'm learning Spark by working through some of the examples from Learning Spark: Lightning Fast Data Analysis and then adding my own changes.

I created this class to look at basic transformations and actions.

/**
 * Find error and warning lines in a log file
 */

package com.oreilly.learningsparkexamples.mini.java;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class FindErrors {
    public static void main(String[] args) {
        String inputFile = args[0];
        String outputFile = args[1];
        // Create a Spark context
        SparkConf conf = new SparkConf().setAppName("findErrors");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // Load input data, one RDD element per line
        JavaRDD<String> input = sc.textFile(inputFile);
        // Keep only the lines that contain "error"
        JavaRDD<String> errorsRDD = input.filter(
            new Function<String, Boolean>() {
                public Boolean call(String x) {
                    return x.contains("error");
                }
            });
        //errorsRDD.saveAsTextFile(outputFile);

        // Keep only the lines that contain "warning"
        JavaRDD<String> warningsRDD = input.filter(
            new Function<String, Boolean>() {
                public Boolean call(String x) {
                    return x.contains("warning");
                }
            });

        // Combine both filtered RDDs; union keeps the partitions of each side
        JavaRDD<String> badLinesRDD = errorsRDD.union(warningsRDD);

        // Writes one part-NNNNN file per partition into the output directory
        badLinesRDD.saveAsTextFile(outputFile);

        System.out.println("I had " + badLinesRDD.count() + " concerning lines.");
        System.out.println("Here are 10 examples:");
        for (String line : badLinesRDD.take(10)) {
            System.out.println(line);
        }

        // Release the Spark context
        sc.stop();
    }
}

This is the command I use to run it:

$SPARK_HOME/bin/spark-submit --class com.oreilly.learningsparkexamples.mini.java.FindErrors ./target/learning-spark-mini-example-0.0.1.jar ../files/fake_logs/log1.log ./errorLog 

Here is the content of the log file:

66.249.69.97 - - [24/Sep/2014:22:25:44 +0000] "GET /071300/242153 HTTP/1.1" 404 514 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 
71.19.157.174 - - [24/Sep/2014:22:26:12 +0000] "GET /error HTTP/1.1" 404 505 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36" 
71.19.157.174 - - [24/Sep/2014:22:26:12 +0000] "GET /favicon.ico HTTP/1.1" 200 1713 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36" 
71.19.157.174 - - [24/Sep/2014:22:26:37 +0000] "GET / HTTP/1.1" 200 18785 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36" 
71.19.157.174 - - [24/Sep/2014:22:26:37 +0000] "GET /jobmineimg.php?q=m HTTP/1.1" 200 222 "http://www.holdenkarau.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36" 
71.19.157.175 - - [24/Sep/2014:22:26:12 +0000] "GET /error HTTP/1.1" 404 505 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36" 
71.19.157.175 - - [24/Sep/2014:22:26:12 +0000] "GET /error HTTP/1.1" 404 505 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36" 
71.19.157.174 - - [24/Sep/2014:22:26:37 +0000] "GET /jobmineimg.php?q=m HTTP/1.1" 200 222 "http://www.holdenkarau.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36" 
71.19.157.175 - - [24/Sep/2014:22:26:12 +0000] "GET /warning HTTP/1.1" 404 505 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36" 
71.19.157.175 - - [24/Sep/2014:22:26:12 +0000] "GET /warning HTTP/1.1" 404 505 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36" 

One thing I noticed is that the output consists of several files rather than the single file I expected.

The files are:

_SUCCESS 


part-00000 
71.19.157.174 - - [24/Sep/2014:22:26:12 +0000] "GET /error HTTP/1.1" 404 505 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36" 
71.19.157.175 - - [24/Sep/2014:22:26:12 +0000] "GET /error HTTP/1.1" 404 505 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36" 

part-00001 
71.19.157.175 - - [24/Sep/2014:22:26:12 +0000] "GET /error HTTP/1.1" 404 505 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36" 

part-00002 


part-00003 
71.19.157.175 - - [24/Sep/2014:22:26:12 +0000] "GET /warning HTTP/1.1" 404 505 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36" 
71.19.157.175 - - [24/Sep/2014:22:26:12 +0000] "GET /warning HTTP/1.1" 404 505 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36" 

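It looks as though a file is created for each "grouping" of warnings/errors. What are the empty files, though?

Also, could this be caused by something in my code that I haven't spotted, or is it a feature of Spark?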

Answer


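This is a feature. With saveAsTextFile, Spark writes one output file per partition, whether or not that partition contains any data. Because you applied filter, some partitions that originally held data can end up empty, hence the empty files.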
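To make the partition behaviour concrete, here is a minimal plain-Java sketch that does not use Spark at all. It models an RDD as a list of partitions, filters each partition independently, and treats union as simple concatenation of the two partition lists. The class PartitionSketch, its helper methods, the shortened log lines, and the 6/4 split of the ten input lines across two partitions are all assumptions made for illustration; the real split depends on how Spark divides the input file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical model: an "RDD" is a list of partitions, each a list of lines.
class PartitionSketch {

    // Filter every partition on its own; the number of partitions never changes,
    // so a partition with no matching lines simply becomes empty.
    static List<List<String>> filter(List<List<String>> partitions, Predicate<String> keep) {
        List<List<String>> out = new ArrayList<>();
        for (List<String> partition : partitions) {
            out.add(partition.stream().filter(keep).collect(Collectors.toList()));
        }
        return out;
    }

    // Model union as concatenation of the partition lists of both sides.
    static List<List<String>> union(List<List<String>> a, List<List<String>> b) {
        List<List<String>> out = new ArrayList<>(a);
        out.addAll(b);
        return out;
    }

    // Number of lines per partition, i.e. per part-NNNNN file.
    static List<Integer> partSizes(List<List<String>> partitions) {
        return partitions.stream().map(List::size).collect(Collectors.toList());
    }

    // Shortened stand-ins for the ten log lines, split 6/4 (assumed split).
    static List<List<String>> sampleInput() {
        List<String> p0 = Arrays.asList("GET /071300/242153", "GET /error",
                "GET /favicon.ico", "GET /", "GET /jobmineimg.php?q=m", "GET /error");
        List<String> p1 = Arrays.asList("GET /error", "GET /jobmineimg.php?q=m",
                "GET /warning", "GET /warning");
        return Arrays.asList(p0, p1);
    }

    public static void main(String[] args) {
        List<List<String>> input = sampleInput();
        List<List<String>> errors = filter(input, line -> line.contains("error"));
        List<List<String>> warnings = filter(input, line -> line.contains("warning"));
        List<List<String>> badLines = union(errors, warnings);

        // One output file per partition, empty or not.
        List<Integer> sizes = partSizes(badLines);
        for (int i = 0; i < sizes.size(); i++) {
            System.out.println(String.format("part-%05d: %d line(s)", i, sizes.get(i)));
        }
    }
}
```

Under these assumptions the four partitions hold 2, 1, 0 and 2 lines, which matches the part-00000 to part-00003 files in the question: the empty part-00002 would be the first input partition after the "warning" filter. If a single output file is wanted, calling coalesce(1) on the RDD before saveAsTextFile reduces it to one partition, at the cost of writing everything through a single task. The empty _SUCCESS file is a separate thing: it is a marker written by the Hadoop output committer when the job completes successfully.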


Cheers user6910411. – runnerpaul
