2017-06-22 24 views
2

我在我的IBM Bluemix PySpark应用程序中使用Cloudant Python API。IBM Bluemix Spark:将python依赖项提供给spark-submit.sh

如何提供依赖包来提交spark? py-files选项到spark-submit.sh只需要py, zip or egg文件,我的包装格式为tar.gzwhl格式。

这是链接到的Cloudant Python客户端库,我试图用 - https://pypi.python.org/pypi/cloudant

文章How to install dependencies for python关于同一主题的讲座,但我想看到requirements.txt的例子,解决方案中提到的Procfile和manifest.yml文件。

+0

您链接的文章是关于python云代工应用程序,而不是关于spark应用程序,所以不幸的是不相关。 –

回答

1

您应该能够从您的python脚本编程式地使用pip,例如,

import pip 
pip.main(['install', '--user', 'cloudant']) 

这为我工作:

helloSpark.py运行后

import sys 
from pyspark import SparkContext 

import pip 
pip.main(['install', '--user', 'cloudant']) 

from cloudant.client import Cloudant 
client = Cloudant('username', 'password', account='account', connect=True) 

# do some spark processing 
def computeStatsForCollection(sc,countPerPartitions=100000,partitions=5): 
    totalNumber = min(countPerPartitions * partitions, sys.maxsize) 
    rdd = sc.parallelize(range(totalNumber),partitions) 
    return (rdd.mean(), rdd.variance()) 

if __name__ == "__main__": 
    sc = SparkContext(appName="Hello Spark") 
    print("Hello Spark Demo. Compute the mean and variance of a collection") 
    stats = computeStatsForCollection(sc); 
    print(">>> Results: ") 
    print(">>>>>>>Mean: " + str(stats[0])); 
    print(">>>>>>>Variance: " + str(stats[1])); 
    sc.stop() 

run.sh

./spark-submit.sh --vcap ./vcap.json --deploy-mode cluster \ 
    --master https://169.54.219.20:8443 \ 
    --conf spark.service.spark_version=1.6 
    helloSpark.py 

标准输出:

$ cat stdout_1498114277669877424 
no extra config 
load default config from : /usr/local/src/spark160master/spark/profile/batch/ 
Requirement already satisfied: cloudant in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s9c8-cbcae60bfa1d3e-39ca506ba762/.local/lib/python2.7/site-packages 
Requirement already satisfied: requests<3.0.0,>=2.7.0 in /usr/local/src/bluemix_jupyter_bundle.v47/notebook/lib/python2.7/site-packages (from cloudant) 
Traceback (most recent call last): 
    File "/tmp/spark-160-ego-master/work/spark-driver-380d8ae7-4ddc-452e-bb29-1665375a348c/helloSpark.py", line 8, in <module> 
    client = Cloudant('username', 'password', account='account', connect=True) 
    File "/gpfs/fs01/user/s9c8-cbcae60bfa1d3e-39ca506ba762/.local/lib/python2.7/site-packages/cloudant/client.py", line 443, in __init__ 
    self.connect() 
    File "/gpfs/fs01/user/s9c8-cbcae60bfa1d3e-39ca506ba762/.local/lib/python2.7/site-packages/cloudant/client.py", line 114, in connect 
    self.session_login(self._user, self._auth_token) 
    File "/gpfs/fs01/user/s9c8-cbcae60bfa1d3e-39ca506ba762/.local/lib/python2.7/site-packages/cloudant/client.py", line 172, in session_login 
    resp.raise_for_status() 
    File "/usr/local/src/bluemix_jupyter_bundle.v47/notebook/lib/python2.7/site-packages/requests/models.py", line 840, in raise_for_status 
    raise HTTPError(http_error_msg, response=self) 
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://account.cloudant.com/_session 

不幸的是,我没有保存输出的我第一次运行该通报说,它已经安装Cloudant脚本。但在这里您可以看到Cloudant库可用,并尝试使用无效凭据连接到群集,因此Cloudant返回了401错误。

你可能不希望尝试点子安装每次运行脚本的时候,所以你可以试试这个:

try: 
    import cloudant 
except: 
    import pip 
    pip.main(['install', '--user', 'cloudant']) 

这将尝试加载Cloudant库。如果加载出错(例如,因为尚未安装),它将与pip一起安装。

+0

谢谢。这工作。 –

+0

欢迎您。你可以请upvote并接受答案? –