Spark rdd.count() yields inconsistent results

A simple rdd.count() returns different results across multiple runs.

Here is the code I run:

val inputRdd = sc.newAPIHadoopRDD(inputConfig,
  classOf[com.mongodb.hadoop.MongoInputFormat],
  classOf[Long],
  classOf[org.bson.BSONObject])

println(inputRdd.count())

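It opens a connection to a MongoDB server and simply counts the objects. That seems pretty straightforward to me.

According to MongoDB there are 3349495 entries.

Here is my Spark output; every run used the same jar: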

spark1 : 3.257.048 
spark2 : 3.303.272 
spark3 : 3.303.272 
spark4 : 3.303.272 
spark5 : 3.303.271 
spark6 : 3.303.271 
spark7 : 3.303.272 
spark8 : 3.303.272 
spark9 : 3.306.300 
spark10: 3.303.272 
spark11: 3.303.271

Spark and MongoDB run on the same cluster.
We are running:

Spark version 1.5.0-cdh5.6.1 
Scala version 2.10.4 
MongoDb version 2.6.12

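Unfortunately, we cannot upgrade these.

Is Spark non-deterministic?
Can anyone enlighten me?

Thanks in advance

EDIT / further info
I just noticed errors in our mongod.log. Could these errors cause the inconsistent behaviour?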

[rsBackgroundSync] replSet not trying to sync from hadoop04:27017, it is vetoed for 333 more seconds 
[rsBackgroundSync] replSet syncing to: hadoop05:27017 
[rsBackgroundSync] replSet not trying to sync from hadoop05:27017, it is vetoed for 600 more seconds 
[rsBackgroundSync] replSet not trying to sync from hadoop04:27017, it is vetoed for 333 more seconds 
[rsBackgroundSync] replSet not trying to sync from hadoop05:27017, it is vetoed for 600 more seconds 
[rsBackgroundSync] replSet not trying to sync from hadoop04:27017, it is vetoed for 333 more seconds 
[rsBackgroundSync] replSet error RS102 too stale to catch up, at least from hadoop05:27017 
[rsBackgroundSync] replSet our last optime : Jul 2 10:19:44 57777920:111 
[rsBackgroundSync] replSet oldest at hadoop05:27017 : Jul 5 15:17:58 577bb386:59 
[rsBackgroundSync] replSet See http://dochub.mongodb.org/core/resyncingaverystalereplicasetmember 
[rsBackgroundSync] replSet error RS102 too stale to catch up

Source

2017-01-25 PeterLudolf

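Did you check the number of items in MongoDB several times (in parallel with the Spark 'count()' runs)? – Yaron

The number of entries in MongoDB did not change during the runs. And thanks for the reformatting :) – PeterLudolf

a) What is your MongoDB deployment topology (replica set or sharded cluster)? Perhaps the Spark workers return different answers depending on which MongoDB member they read from, i.e. some members have not replicated the data yet. b) MongoDB v2.6 reached end of life in October 2016; please upgrade if possible. –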

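As you have already discovered, the problem does not seem to lie with Spark (or Scala) but with MongoDB.

So the question about the differing counts seems to be resolved.

You will still want to fix the actual MongoDB error; the link from the log can be a good starting point: http://dochub.mongodb.org/core/resyncingaverystalereplicasetmember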
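
To test the stale-member hypothesis from the comments, one option is to pin all reads to the primary. The following is a minimal sketch, assuming the job is configured through the mongo-hadoop connector's mongo.input.uri setting and that the connector version in use honours the standard readPreference connection-string option (worth verifying). The host names are taken from the log above; the database, collection, and replica set names (mydb, mycoll, rs0) and the helper primaryOnlyUri are hypothetical.

```scala
// Hypothetical helper: builds a MongoDB connection string in the
// "mongodb://hosts/db.collection?options" form, asking for primary-only reads.
def primaryOnlyUri(hosts: Seq[String], db: String, coll: String, replicaSet: String): String = {
  val hostList = hosts.mkString(",")
  s"mongodb://$hostList/$db.$coll?replicaSet=$replicaSet&readPreference=primary"
}

// Assumed values: hosts from the mongod.log excerpt, the other names are placeholders.
val inputUri = primaryOnlyUri(Seq("hadoop04:27017", "hadoop05:27017"), "mydb", "mycoll", "rs0")

// Usage inside the Spark job (needs Spark and mongo-hadoop on the classpath):
// val inputConfig = new org.apache.hadoop.conf.Configuration()
// inputConfig.set("mongo.input.uri", inputUri)
// val inputRdd = sc.newAPIHadoopRDD(inputConfig,
//   classOf[com.mongodb.hadoop.MongoInputFormat],
//   classOf[Object],
//   classOf[org.bson.BSONObject])
// println(inputRdd.count())
```

If the count then stays stable across runs and matches the number MongoDB reports, reads from the out-of-date member were the likely cause. The member still needs to be resynced either way.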

Source

2017-07-31 15:06:33
