Whenever I try to run a simple job in pyspark, it fails to open a socket ("Exception: could not open socket on pyspark").
>>> myRDD = sc.parallelize(range(6), 3)
>>> sc.runJob(myRDD, lambda part: [x * x for x in part])
The above throws the following exception:
port 53554 , proto 6 , sa ('127.0.0.1', 53554)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Volumes/work/bigdata/spark-custom/python/pyspark/context.py", line 917, in runJob
return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
File "/Volumes/work/bigdata/spark-custom/python/pyspark/rdd.py", line 143, in _load_from_socket
raise Exception("could not open socket")
Exception: could not open socket
>>> 15/08/30 19:03:05 ERROR PythonRDD: Error while sending iterator
java.net.SocketTimeoutException: Accept timed out
at java.net.PlainSocketImpl.socketAccept(Native Method)
at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:404)
at java.net.ServerSocket.implAccept(ServerSocket.java:545)
at java.net.ServerSocket.accept(ServerSocket.java:513)
at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:613)
I checked through rdd.py's _load_from_socket and realized that it does get the port, but the server has not even started, or the runJob call may be the issue:
port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
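For context, a simplified sketch of the handshake that fails here, assuming the hard-coded 3-second timeout mentioned in the comment below; this is only an illustration of the mechanism, not the actual Spark source:

import socket

def load_from_socket_sketch(port):
    # Illustrative only: the Python driver connects to a server socket that the
    # JVM side is expected to open on the returned port. If the JVM side never
    # starts listening within the timeout, connect() fails and PySpark raises
    # Exception("could not open socket"), as seen in the traceback above.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(3)  # assumption: 3-second timeout, per the comment below
    try:
        sock.connect(("127.0.0.1", port))
    except socket.error:
        raise Exception("could not open socket")
    return sock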
It doesn't work for me. I am using Spark 1.5.2 with JDK 1.7. – prabhugs
The error occurs when the Python driver tries to connect to the Scala side of the driver during any Spark action (count, reduce, ...). The line below shows that the timeout is hard-coded to 3 seconds:
https://github.com/apache/spark/blob/master/python/pyspark/rdd.py#L121
Technically there is currently no way to configure this timeout, so the only way for the Python code to recover is to catch the exception in application-level code and retry a configurable number of times, e.g. as sketched below. – farmi
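A minimal sketch of that application-level retry, reusing the sc and myRDD objects from the question; the run_with_retry helper and its retry/backoff parameters are hypothetical, not part of the PySpark API:

import time

def run_with_retry(sc, rdd, func, retries=5, backoff_seconds=1):
    # Retry a PySpark action that fails with "could not open socket".
    last_error = None
    for attempt in range(retries):
        try:
            return sc.runJob(rdd, func)
        except Exception as e:
            # PySpark raises a plain Exception("could not open socket"),
            # so match on the message rather than on a specific exception type.
            if "could not open socket" not in str(e):
                raise
            last_error = e
            time.sleep(backoff_seconds)
    raise last_error

# Usage, mirroring the failing call from the question:
# myRDD = sc.parallelize(range(6), 3)
# result = run_with_retry(sc, myRDD, lambda part: [x * x for x in part])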