火花数据表格集

-4

我做了一个独立的Apache集群7个。要运行Scala代码，代码是火花数据表格集

/** Our main function where the action happens */ 

def main(args: Array[String]) { 

    // Set the log level to only print errors 

    Logger.getLogger("org").setLevel(Level.ERROR) 

    // Create a SparkContext without much actual configuration 

    // We want EMR's config defaults to be used. 

    val conf = new SparkConf() 

    conf.setAppName("MovieSimilarities1M") 

    val sc = new SparkContext(conf) 

    val input = sc.textFile("file:///home/ralfahad/LearnSpark/SBTCreate/customer-orders.csv") 

    val mappedInput = input.map(extractCustomerPricePairs) 

    val totalByCustomer = mappedInput.reduceByKey((x,y) => x + y) 

    val flipped = totalByCustomer.map(x => (x._2, x._1)) 

    val totalByCustomerSorted = flipped.sortByKey() 

    val results = totalByCustomerSorted.collect() 

    // Print the results. 

    results.foreach(println) 

    } 

}

步骤是：

我创建使用.jar文件SBT
使用提交作业火花提交* .jar

但是我的执行程序找不到sc.textFile("file:///home/ralfahad/LearnSpark/SBTCreate/customer-orders.csv")

此customer-orders.csv文件存储在我的主PC中。

完整堆栈跟踪：

error: [Stage 0:> (0 + 2)/2]17/09/25 17:32:35 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 5, 141.225.166.191, executor 2): java.io.FileNotFoundException: File file:/home/ralfahad/LearnSpark/SBTCreate/customer-orders.csv does not exist

我怎么解决这个问题呢？

请修改代码以在群集中运行。

来源

2017-09-25 Rakib Al-Fahad

错误：[阶段0：>（0 + 2）/ 2] 17/09/25 17:32:35错误TaskSetManager：阶段0.0中的任务0失败4次;中止作业线程“main”中的异常org.apache.spark.SparkException：由于阶段失败而导致作业中止：阶段0中的任务0。0失败4次，最近失败：在阶段0.0（TID 5,141.225.166.191，执行器2）中丢失任务0.3：java.io.FileNotFoundException：文件文件：/home/ralfahad/LearnSpark/SBTCreate/customer-orders.csv不存在 –

为了让您的工作节点能够访问该文件，您有几个选项。

1.手动将文件复制到所有节点。

每个节点应正好这条道路有此文件：/home/ralfahad/LearnSpark/SBTCreate/customer-orders.csv

2.附加文件提交作业。

有一个选项调用--files，使您可以复制任意数量的文件，同时提交这样的作业：

spark-submit --master ... -jars ... --files /home/ralfahad/LearnSpark/SBTCreate/customer-orders.csv

不要滥用这一点。此选项更适用于测试目的和小文件。

3.使用一些可供所有节点访问的外部通用可用存储。

S3和NFS共享是流行的选择。

sc.textFile("s3n://bucketname/customer-orders.csv")

4.您可以在你的驱动程序读取数据，然后将其转换为加工做RDD。

val bufferedSource = io.Source.fromFile("/home/ralfahad/LearnSpark/SBTCreate/customer-orders.csv") 
val lines = (for (line <- bufferedSource.getLines()) yield line).toList 
val rdd = sc.makeRdd(lines)

一般不推荐使用，但可用于快速检测。

来源

2017-09-26 07:25:36

感谢您的帮助。这个概念现在很清楚 –

火花数据表格集

回答

相关问题