Apache的火花和Python拉姆达

我有以下代码Apache的火花和Python拉姆达

file = spark.textFile("hdfs://...") 
counts = file.flatMap(lambda line: line.split(" ")) \ 
      .map(lambda word: (word, 1)) \ 
      .reduceByKey(lambda a, b: a + b) 
counts.saveAsTextFile("hdfs://...")

http://spark.apache.org/examples.html我抄在这里

我无法理解这段代码尤其是关键字

flatmap的例子，
地图和
reduceby

有人可以发生的事情用简单的英语解释。

来源

2014-07-04 jhon.smith

我不是专家，但我认为flatMap建立从嵌套结构（字行的名单？）的列表，地图应用功能的所有元素，一个d reduceByKey按键对这些元素进行分组（我猜这里是相同的单词），并将函数（这里是一个和）成对地应用。这可能会计数文本中每个单词的出现次数。 – user189

如果您使用函数式语言来进行函数式编程，事情会变得更加简洁和易于阅读。即我强烈建议使用Scala而不是OO脚本语言。 Scala功能更强大，对Spark更具性能，并且更容易挖掘Spark代码。你的代码变成：'spark.textFile（“hdfs：// ...”）.flatMap（_。split（“”））。map（（_，1））。reduceByKey（_ + _）。saveAsTextFile “hdfs：// ...”）' – samthebest

map是最简单的，它本质上说，做序列的每个元素在给定的操作并返回结果序列（非常类似于foreach）。 flatMap是同样的事情，但不是只有一个元素每个元素返回你被允许返回一个序列（可以为空）。这是一个解释difference between map and flatMap的答案。最后reduceByKey需要聚合函数（意味着它采用两个相同类型的参数和返回类型，也应该是交换和关联，否则你会得到不一致的结果），这是用于聚合每V每个K在你的(K,V)对序列。

例^*：
reduce (lambda a, b: a + b,[1,2,3,4])

这是说集合体+整个列表，它会做

1 + 2 = 3 
3 + 3 = 6 
6 + 4 = 10 
final result is 10

者皆减少，除了你同样的事情做一个减少每个独特键。

所以要解释它在你的榜样

file = spark.textFile("hdfs://...") // open text file each element of the RDD is one line of the file 
counts = file.flatMap(lambda line: line.split(" ")) //flatMap is needed here to return every word (separated by a space) in the line as an Array 
      .map(lambda word: (word, 1)) //map each word to a value of 1 so they can be summed 
      .reduceByKey(lambda a, b: a + b) // get an RDD of the count of every unique word by aggregating (adding up) all the 1's you wrote in the last step 
counts.saveAsTextFile("hdfs://...") //Save the file onto HDFS

那么，为什么算的话这种方式，原因是节目的MapReduce的范例是高度并行，从而扩展到这样做计算TB或甚至数PB的数据。

_{我不使用python多告诉我，如果我犯了一个错误。}

来源

2014-07-04 15:59:47 aaronman

见直列评论：

file = spark.textFile("hdfs://...") # opens a file 
counts = file.flatMap(lambda line: line.split(" ")) \ # iterate over the lines, split each line by space (into words) 
      .map(lambda word: (word, 1)) \ # for each word, create the tuple (word, 1) 
      .reduceByKey(lambda a, b: a + b) # go over the tuples "by key" (first element) and sum the second elements 
counts.saveAsTextFile("hdfs://...")

reduceByKey的更详细的解释可以发现here

来源

2014-07-04 13:58:19 alfasin

对不起，我不明白reduceByKey。在一个正常的lambda表达式lambda a中，b：a + b表示输入对（a，b）给我a + b结果不是吗？但是在这里它做了其他奇怪的语法？ –

要了解reduceBykey，您首先必须了解reduce。一个简单的简化例子：'print reduce（lambda a，b：a + b，[1,2,3]）'迭代一个迭代器并将函数（第一个参数 - 这里是lambda表达式）应用于前两个元素然后使用结果与第三个元素等。 – alfasin

我alfasin我重读你的解释，我只希望我也可以奖励点给你too.Your评论清除我的困惑reduceByKey –

的答案在这里是在代码级别准确，但它可能有助于了解引擎盖下发生的事情。

我的理解是，当调用reduce操作时，会有一个海量数据混洗，导致通过map（）操作获得的所有KV对具有相同的键值，并将其分配给总计值的任务在KV对的集合中。然后将这些任务分配给不同的物理处理器，然后将结果与另一个数据洗牌进行比较。

因此，如果在地图操作产生（猫1）（猫1）（狗1）（猫1）（猫1）（狗1）

的降低操作产生（猫4）（狗2）

希望这有助于

来源

2014-09-13 23:32:22 Alexk

Apache的火花和Python拉姆达

回答

相关问题