Connecting to PostgreSQL from Spark/pyspark

I have installed Spark on a Windows machine and want to use it through Spyder. After some troubleshooting, the basic functionality seems to work:

import os

# Point PySpark at the local Spark installation; a raw string keeps the
# backslashes from being read as escape sequences
os.environ["SPARK_HOME"] = r"D:\Analytics\Spark\spark-1.4.0-bin-hadoop2.6"

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

# Local mode with 8 worker threads
spark_config = SparkConf().setMaster("local[8]")
sc = SparkContext(conf=spark_config)
sqlContext = SQLContext(sc)

# Sanity check: count the lines of the bundled README
textFile = sc.textFile(r"D:\Analytics\Spark\spark-1.4.0-bin-hadoop2.6\README.md")
textFile.count()
textFile.filter(lambda line: "Spark" in line).count()

sc.stop()

This runs as expected. I now want to connect to a Postgres 9.3 database running on the same server. I downloaded the JDBC driver from here and put it in the folder D:\Analytics\Spark\spark_jars. I then created a new file D:\Analytics\Spark\spark-1.4.0-bin-hadoop2.6\conf\spark-defaults.conf containing this line:

spark.driver.extraClassPath  'D:\\Analytics\\Spark\\spark_jars\\postgresql-9.3-1103.jdbc41.jar' 
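
As a quick sanity check (a sketch of my own, not from the original post), you can confirm after creating the SparkContext that the value from spark-defaults.conf was actually picked up; note that sc._conf is PySpark's internal handle to the resolved configuration:

# Debugging aid: print the resolved driver classpath, or "NOT SET" if the
# conf file was not read. sc._conf is an internal PySpark attribute.
print(sc._conf.get("spark.driver.extraClassPath", "NOT SET"))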

I ran the following code to test the connection:

import os

os.environ["SPARK_HOME"] = r"D:\Analytics\Spark\spark-1.4.0-bin-hadoop2.6"

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

spark_config = SparkConf().setMaster("local[8]")
sc = SparkContext(conf=spark_config)
sqlContext = SQLContext(sc)

# Load the "pubs" table over JDBC (Spark 1.4 data source API)
df = (sqlContext
    .load(source="jdbc",
          url="jdbc:postgresql://[hostname]/[database]?user=[username]&password=[password]",
          dbtable="pubs")
)
sc.stop()

but I get the following error:

Py4JJavaError: An error occurred while calling o22.load. 
: java.sql.SQLException: No suitable driver found for  jdbc:postgresql://uklonana01/stonegate?user=analytics&password=pMOe8jyd 
at java.sql.DriverManager.getConnection(Unknown Source) 
at java.sql.DriverManager.getConnection(Unknown Source) 
at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:118) 
at org.apache.spark.sql.jdbc.JDBCRelation.<init>(JDBCRelation.scala:128) 
at org.apache.spark.sql.jdbc.DefaultSource.createRelation(JDBCRelation.scala:113) 
at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:265) 
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114) 
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) 
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) 
at java.lang.reflect.Method.invoke(Unknown Source) 
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) 
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) 
at py4j.Gateway.invoke(Gateway.java:259) 
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) 
at py4j.commands.CallCommand.execute(CallCommand.java:79) 
at py4j.GatewayConnection.run(GatewayConnection.java:207) 
at java.lang.Thread.run(Unknown Source) 

How can I check whether I have downloaded the right .jar file, or where else this error might be coming from?
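
One quick way to check (a sketch of my own, using PySpark's py4j gateway rather than anything from the thread) is to ask the driver JVM to load the driver class directly; a Py4JJavaError wrapping a ClassNotFoundException means the jar is not on the driver's classpath:

# Try to load the PostgreSQL driver class in the driver JVM via py4j.
# Raises py4j.protocol.Py4JJavaError (java.lang.ClassNotFoundException)
# if the jar is missing from the driver's classpath.
sc._jvm.java.lang.Class.forName("org.postgresql.Driver")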


I tried postgresql-9.3-1103.jdbc41.jar and quite a few other .jar files. I also tried adding os.environ["SPARK_CLASSPATH"] = "D:\\Analytics\\Spark\\spark_jars\\*", but that gives the error "Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext. : org.apache.spark.SparkException: Found both spark.driver.extraClassPath and SPARK_CLASSPATH. Use only the former." which implies that the version above should work. – phildeutsch

Answer


Remove spark-defaults.conf and instead add SPARK_CLASSPATH to the system environment from Python, like this:

os.environ["SPARK_CLASSPATH"] = 'PATH\\TO\\postgresql-9.3-1101.jdbc41.jar'