
I followed the instructions at https://github.com/basho/spark-riak-connector and am running Spark 2.0.2-hadoop2.7. How do I use the Spark-Riak connector from pyspark?

Things I have tried:

1)pyspark --repositories https://dl.bintray.com/basho/data-platform --packages com.basho.riak:spark-riak-connector_2.11:1.6.0

2)pyspark --driver-class-path /path/to/spark-riak-connector_2.11-1.6.0-uber.jar

3) Adding spark.driver.extraClassPath /path/to/jars/* to the master's spark-defaults.conf (see the sketch after this list)

4) Trying older versions of the connector (1.5.0 and 1.5.1)
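
For reference, a minimal sketch of the spark-defaults.conf entries from option 3; /path/to/jars is a placeholder, and the spark.executor.extraClassPath line is an assumption added here so the executor JVMs see the same jars as the driver:

# conf/spark-defaults.conf on the master
spark.driver.extraClassPath   /path/to/jars/*
spark.executor.extraClassPath /path/to/jars/*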

I can verify in the master's web UI that the Riak jars are loaded in pyspark's application environment. I also double-checked that Spark's Scala version is 2.11.
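
(If you want to repeat that check, the banner printed by spark-submit --version includes the Scala version Spark was built against; the output below is approximate:)

$ spark-submit --version
... Using Scala version 2.11.8 ...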

But no matter what I do, I cannot import pyspark_riak:

>>> import pyspark_riak 
Traceback (most recent call last): 
    File "<stdin>", line 1, in <module> 
ImportError: No module named pyspark_riak 

How can I fix this?

When I try option #1, the jars are loaded and I get this report, which looks fine:

:: modules in use: 
    com.basho.riak#riak-client;2.0.7 from central in [default] 
    com.basho.riak#spark-riak-connector_2.11;1.6.0 from central in [default] 
    com.fasterxml.jackson.core#jackson-annotations;2.8.0 from central in [default] 
    com.fasterxml.jackson.core#jackson-core;2.8.0 from central in [default] 
    com.fasterxml.jackson.core#jackson-databind;2.8.0 from central in [default] 
    com.fasterxml.jackson.datatype#jackson-datatype-joda;2.4.4 from central in [default] 
    com.fasterxml.jackson.module#jackson-module-scala_2.11;2.4.4 from central in [default] 
    com.google.guava#guava;14.0.1 from central in [default] 
    joda-time#joda-time;2.2 from central in [default] 
    org.erlang.otp#jinterface;1.6.1 from central in [default] 
    org.scala-lang#scala-reflect;2.11.2 from central in [default] 
    :: evicted modules: 
    com.fasterxml.jackson.core#jackson-core;2.4.4 by [com.fasterxml.jackson.core#jackson-core;2.8.0] in [default] 
    com.fasterxml.jackson.core#jackson-annotations;2.4.4 by [com.fasterxml.jackson.core#jackson-annotations;2.8.0] in [default] 
    com.fasterxml.jackson.core#jackson-databind;2.4.4 by [com.fasterxml.jackson.core#jackson-databind;2.8.0] in [default] 
    com.fasterxml.jackson.core#jackson-annotations;2.4.0 by [com.fasterxml.jackson.core#jackson-annotations;2.8.0] in [default] 
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   15  |   11  |   11  |   4   ||   11  |   11  |
    ---------------------------------------------------------------------

Also, if I print sys.path I can see /tmp/spark-b2396e0a-f329-4066-b3b1-4e8c21944a66/userFiles-7e423d94-5aa2-4fe4-935a-e06ab2d423ae/com.basho.riak_spark-riak-connector_2.11-1.6.0.jar (which I verified exists).
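
Worth noting: a jar on sys.path only makes a Python import work if the jar actually contains the Python package, since Python's zipimport treats jars as zip files. A quick hypothetical check, reusing the jar path from sys.path above; per the answer below, the list comes back empty because the published jar ships no pyspark_riak package:

>>> import zipfile
>>> jar = '/tmp/spark-b2396e0a-f329-4066-b3b1-4e8c21944a66/userFiles-7e423d94-5aa2-4fe4-935a-e06ab2d423ae/com.basho.riak_spark-riak-connector_2.11-1.6.0.jar'
>>> [n for n in zipfile.ZipFile(jar).namelist() if 'pyspark_riak' in n]
[]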

Answer


The spark-riak-connector published in the repository does not include pyspark support. But you can build it yourself with the pyspark module attached:

git clone https://github.com/basho/spark-riak-connector.git 
cd spark-riak-connector/connector/python/ 
python setup.py bdist_egg  # creates the egg file under dist/ 
cd ../..                   # back to the repository root 

Then add the newly built egg to the Python path:

pyspark --repositories https://dl.bintray.com/basho/data-platform --packages com.basho.riak:spark-riak-connector_2.11:1.6.0 
>>> import sys 
>>> sys.path.append('connector/python/dist/pyspark_riak-1.0.0-py2.7.egg') 
>>> import pyspark_riak 
>>> 
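
As an alternative to appending to sys.path by hand, the egg can be passed with pyspark's standard --py-files option, which also ships it to the executors (same connector package as above; the egg path is assumed relative to the repository root):

pyspark --repositories https://dl.bintray.com/basho/data-platform --packages com.basho.riak:spark-riak-connector_2.11:1.6.0 --py-files connector/python/dist/pyspark_riak-1.0.0-py2.7.egg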

But be careful using the Spark-Riak connector with Spark 2.0.2: the latest package version I have seen was tested against Spark 1.6.2, so the API may not work as expected.
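
(To confirm which Spark version the shell is actually running, sc.version on the pyspark SparkContext returns it as a string; output assumes the setup from the question:)

>>> sc.version
u'2.0.2'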
