
Using graphframes in PyCharm

I've spent almost two days trawling the internet, but I haven't been able to solve this. I'm trying to install the graphframes package (version 0.2.0-spark2.0-s_2.11) to run on Spark through PyCharm, but despite my best efforts it has been impossible.

I have tried pretty much everything. Please know that I also checked this site here before asking.

Here is the code I am trying to run:

# IMPORT OTHER LIBS -------------------------------------------------------- 
import os 
import sys 
import pandas as pd 

# IMPORT SPARK ------------------------------------------------------------------------------------# 
# Path to Spark source folder 
USER_FILE_PATH = "/Users/<username>" 
SPARK_PATH = "/PycharmProjects/GenesAssociation" 
SPARK_FILE = "/spark-2.0.0-bin-hadoop2.7" 
SPARK_HOME = USER_FILE_PATH + SPARK_PATH + SPARK_FILE 
os.environ['SPARK_HOME'] = SPARK_HOME 

# Append pySpark to Python Path 
sys.path.append(SPARK_HOME + "/python") 
sys.path.append(SPARK_HOME + "/python" + "/lib/py4j-0.10.1-src.zip") 

try: 
    from pyspark import SparkContext 
    from pyspark import SparkConf 
    from pyspark.sql import SQLContext 
    import pyspark.graphframes as gf

except ImportError as ex: 
    print "Can not import Spark Modules", ex 
    sys.exit(1) 

# GLOBAL VARIABLES ---------------------------------------------------------------------------------#
SC = SparkContext('local') 
SQL_CONTEXT = SQLContext(SC) 

# MAIN CODE ---------------------------------------------------------------------------------------# 
if __name__ == "__main__": 

    # Main Path to CSV files 
    DATA_PATH = '/PycharmProjects/GenesAssociation/data/' 
    FILE_NAME = 'gene_gene_associations_50k.csv' 

    # LOAD DATA CSV USING PANDAS -----------------------------------------------------------------# 
    print "STEP 1: Loading Gene Nodes -------------------------------------------------------------" 
    # Read csv file and load as df 
    GENES = pd.read_csv(USER_FILE_PATH + DATA_PATH + FILE_NAME, 
         usecols=['OFFICIAL_SYMBOL_A'], 
         low_memory=True, 
         iterator=True, 
         chunksize=1000) 

    # Concatenate chunks into list & convert to dataFrame 
    GENES_DF = pd.DataFrame(pd.concat(list(GENES), ignore_index=True)) 

    # Remove duplicates 
    GENES_DF_CLEAN = GENES_DF.drop_duplicates(keep='first') 

    # Name Columns 
    GENES_DF_CLEAN.columns = ['gene_id'] 

    # Output dataFrame 
    print GENES_DF_CLEAN 

    # Create vertices 
    VERTICES = SQL_CONTEXT.createDataFrame(GENES_DF_CLEAN) 

    # Show some vertices 
    print VERTICES.take(5) 

    print "STEP 2: Loading Gene Edges -------------------------------------------------------------" 
    # Read csv file and load as df 
    EDGES = pd.read_csv(USER_FILE_PATH + DATA_PATH + FILE_NAME, 
         usecols=['OFFICIAL_SYMBOL_A', 'OFFICIAL_SYMBOL_B', 'EXPERIMENTAL_SYSTEM'], 
         low_memory=True, 
         iterator=True, 
         chunksize=1000) 

    # Concatenate chunks into list & convert to dataFrame 
    EDGES_DF = pd.DataFrame(pd.concat(list(EDGES), ignore_index=True)) 

    # Name Columns 
    EDGES_DF.columns = ["src", "dst", "rel_type"] 

    # Output dataFrame 
    print EDGES_DF 

    # Create edges
    EDGES = SQL_CONTEXT.createDataFrame(EDGES_DF) 

    # Show some edges 
    print EDGES.take(5) 

    g = gf.GraphFrame(VERTICES, EDGES) 

Needless to say, I have tried everything, including copying the graphframes directory (see here to see what I did) into Spark's pyspark directory. But that does not seem to be enough... everything else I have tried has failed as well. Any help with this would be appreciated. You can see the error message I get below:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 
Setting default log level to "WARN". 
To adjust logging level use sc.setLogLevel(newLevel). 
16/09/19 12:46:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 
16/09/19 12:46:03 WARN Utils: Service 'SparkUI' could not bind on port 4040.  Attempting port 4041. 

STEP 1: Loading Gene Nodes ------------------------------------------------------------- 
     gene_id 
0   MAP2K4 
1   MYPN 
2   ACVR1 
3   GATA2 
4   RPA2 
5   ARF1 
6   ARF3 
8   XRN1 
9   APP 
10   APLP1 
11  CITED2 
12   EP300 
13   APOB 
14   ARRB2 
15   CSF1R 
16  PRRC2A 
17   LSM1 
18  SLC4A1 
19   BCL3 
20   ADRB1 
21   BRCA1 
25   ARVCF 
26   PCBD1 
27   PSEN2 
28   CAPN3 
29   ITPR1 
30   MAGI1 
31   RB1 
32  TSG101 
33   ORC1 
...   ... 
49379  WDR26 
49380  WDR5B 
49382  NLE1 
49383  WDR12 
49385  WDR53 
49386  WDR59 
49387  WDR61 
49409  CHD6 
49422  DACT1 
49424  KMT2B 
49438 SMARCA1 
49459 DCLRE1A 
49469  F2RL1 
49472  SENP8 
49475  TSPY1 
49479 SERPINB5 
49521  HOXA11 
49548  SYF2 
49553  FOXN3 
49557  MLANA 
49608  REPIN1 
49609  GMNN 
49670 HIST2H2BE 
49767  BCL7C 
49797  SIRT3 
49810  KLF4 
49858  RHO 
49896  MAGEA2 
49907 SUV420H2 
49958  SAP30L 

[6025 rows x 1 columns] 
16/09/19 12:46:08 WARN TaskSetManager: Stage 0 contains a task of very large size (107 KB). The maximum recommended task size is 100 KB. 
[Row(gene_id=u'MAP2K4'), Row(gene_id=u'MYPN'), Row(gene_id=u'ACVR1'), Row(gene_id=u'GATA2'), Row(gene_id=u'RPA2')] 
STEP 2: Loading Gene Edges ------------------------------------------------------------- 
      src  dst     rel_type 
0  MAP2K4  FLNC    Two-hybrid 
1   MYPN  ACTN2    Two-hybrid 
2  ACVR1  FNTA    Two-hybrid 
3  GATA2  PML    Two-hybrid 
4   RPA2  STAT3    Two-hybrid 
5   ARF1  GGA3    Two-hybrid 
6   ARF3 ARFIP2    Two-hybrid 
7   ARF3 ARFIP1    Two-hybrid 
8   XRN1  ALDOA    Two-hybrid 
9   APP APPBP2    Two-hybrid 
10  APLP1  DAB1    Two-hybrid 
11  CITED2 TFAP2A    Two-hybrid 
12  EP300 TFAP2A    Two-hybrid 
13  APOB  MTTP    Two-hybrid 
14  ARRB2 RALGDS    Two-hybrid 
15  CSF1R  GRB2    Two-hybrid 
16  PRRC2A  GRB2    Two-hybrid 
17  LSM1  NARS    Two-hybrid 
18  SLC4A1 SLC4A1AP    Two-hybrid 
19  BCL3  BARD1    Two-hybrid 
20  ADRB1  GIPC1    Two-hybrid 
21  BRCA1  ATF1    Two-hybrid 
22  BRCA1  MSH2    Two-hybrid 
23  BRCA1  BARD1    Two-hybrid 
24  BRCA1  MSH6    Two-hybrid 
25  ARVCF  CDH15    Two-hybrid 
26  PCBD1 CACNA1C    Two-hybrid 
27  PSEN2  CAPN1    Two-hybrid 
28  CAPN3  TTN    Two-hybrid 
29  ITPR1  CA8    Two-hybrid 
...  ...  ...      ... 
49969 SAP30  HDAC3 Affinity Capture-Western 
49970 BRCA1  RBBP8   Co-localization 
49971 BRCA1  BRCA1  Biochemical Activity 
49972  SET  TREX1   Co-purification 
49973  SET  TREX1  Reconstituted Complex 
49974 PLAGL1  EP300  Reconstituted Complex 
49975 PLAGL1 CREBBP  Reconstituted Complex 
49976 EP300 PLAGL1 Affinity Capture-Western 
49977  MTA1  ESR1  Reconstituted Complex 
49978 SIRT2  EP300 Affinity Capture-Western 
49979 EP300  SIRT2 Affinity Capture-Western 
49980 EP300  HDAC1 Affinity Capture-Western 
49981 EP300  SIRT2  Biochemical Activity 
49982 MIER1 CREBBP  Reconstituted Complex 
49983 SMARCA4  SIN3A Affinity Capture-Western 
49984 SMARCA4  HDAC2 Affinity Capture-Western 
49985  ESR1  NCOA6 Affinity Capture-Western 
49986  ESR1  TOP2B Affinity Capture-Western 
49987  ESR1  PRKDC Affinity Capture-Western 
49988  ESR1  PARP1 Affinity Capture-Western 
49989  ESR1  XRCC5 Affinity Capture-Western 
49990  ESR1  XRCC6 Affinity Capture-Western 
49991 PARP1  TOP2B Affinity Capture-Western 
49992 PARP1  PRKDC Affinity Capture-Western 
49993 PARP1  XRCC5 Affinity Capture-Western 
49994 PARP1  XRCC6 Affinity Capture-Western 
49995 SIRT3  XRCC6 Affinity Capture-Western 
49996 SIRT3  XRCC6  Reconstituted Complex 
49997 SIRT3  XRCC6  Biochemical Activity 
49998 HDAC1  PAX3 Affinity Capture-Western 

[49999 rows x 3 columns] 
16/09/19 12:46:11 WARN TaskSetManager: Stage 1 contains a task of very large size (1211 KB). The maximum recommended task size is 100 KB. 
[Row(src=u'MAP2K4', dst=u'FLNC', rel_type=u'Two-hybrid'), Row(src=u'MYPN', dst=u'ACTN2', rel_type=u'Two-hybrid'), Row(src=u'ACVR1', dst=u'FNTA', rel_type=u'Two-hybrid'), Row(src=u'GATA2', dst=u'PML', rel_type=u'Two-hybrid'), Row(src=u'RPA2', dst=u'STAT3', rel_type=u'Two-hybrid')] 
Traceback (most recent call last): 
    File "/Users/username/PycharmProjects/GenesAssociation/__init__.py", line 99, in <module> 
    g = gf.GraphFrame(VERTICES, EDGES) 
    File "/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/pyspark/graphframes/graphframe.py", line 62, in __init__ 
    self._jvm_gf_api = _java_api(self._sc) 
    File "/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/pyspark/graphframes/graphframe.py", line 34, in _java_api 
    return jsc._jvm.Thread.currentThread().getContextClassLoader().loadClass(javaClassName) \ 
    File "/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__ 
    File "/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco 
    return f(*a, **kw) 
    File "/Users/username/PycharmProjects/GenesAssociation/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value 
py4j.protocol.Py4JJavaError: An error occurred while calling o50.loadClass. 
: java.lang.ClassNotFoundException: org.graphframes.GraphFramePythonAPI 
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381) 
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424) 
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357) 
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
    at java.lang.reflect.Method.invoke(Method.java:498) 
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) 
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) 
    at py4j.Gateway.invoke(Gateway.java:280) 
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128) 
    at py4j.commands.CallCommand.execute(CallCommand.java:79) 
    at py4j.GatewayConnection.run(GatewayConnection.java:211) 
    at java.lang.Thread.run(Thread.java:745) 


Process finished with exit code 1 

Thanks in advance.

Answer


You can set PYSPARK_SUBMIT_ARGS either in your code:

os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages graphframes:graphframes:0.2.0-spark2.0-s_2.11 pyspark-shell" 
) 
spark = SparkSession.builder.getOrCreate() 

or in the PyCharm run configuration (Run -> Edit Configurations -> select your configuration -> Configuration tab -> Environment variables -> add PYSPARK_SUBMIT_ARGS):

(Screenshot: the PyCharm Environment variables dialog, with PYSPARK_SUBMIT_ARGS set to the --packages string shown above.)
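
In either case, PYSPARK_SUBMIT_ARGS is read when the JVM gateway is launched, so it has to be in place before the first SparkContext or SparkSession is created. As a quick sanity check, this hypothetical snippet mirrors the loadClass call that fails in the traceback above:

# If the package was picked up, this succeeds instead of raising the
# ClassNotFoundException from the question.
spark.sparkContext._jvm.Thread.currentThread().getContextClassLoader() \
    .loadClass("org.graphframes.GraphFramePythonAPI")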

A minimal working example:

import os 
import sys 

SPARK_HOME = ... 
os.environ["SPARK_HOME"] = SPARK_HOME 
# os.environ["PYSPARK_SUBMIT_ARGS"] = ... If not set in PyCharm config 

sys.path.append(os.path.join(SPARK_HOME, "python")) 
sys.path.append(os.path.join(SPARK_HOME, "python/lib/py4j-0.10.3-src.zip")) 

from pyspark.sql import SparkSession 

spark = SparkSession.builder.getOrCreate() 

v = spark.createDataFrame([("a", "foo"), ("b", "bar"),], ["id", "attr"]) 
e = spark.createDataFrame([("a", "b", "foobar")], ["src", "dst", "rel"]) 


from graphframes import * 

g = GraphFrame(v, e) 
g.inDegrees.show() 

spark.stop() 

You can also add the packages (or jars) in spark-defaults.conf.
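For example (a sketch; the file usually lives at $SPARK_HOME/conf/spark-defaults.conf, though the exact location depends on your installation):

spark.jars.packages graphframes:graphframes:0.2.0-spark2.0-s_2.11

With this set, spark-submit resolves the package for every application without per-run flags.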

If you use Python 3 with graphframes 0.2, there is a known issue with extracting the Python library from the JAR, so you have to do it by hand. For example, you can download the JAR file, unzip it, and make sure the root directory of graphframes is on your Python path. This has been fixed in graphframes 0.3.
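A minimal sketch of that manual step, assuming a locally downloaded JAR (both paths below are hypothetical placeholders):

import sys
import zipfile

JAR_PATH = "/tmp/graphframes-0.2.0-spark2.0-s_2.11.jar"  # hypothetical download location
EXTRACT_DIR = "/tmp/graphframes_py"                      # hypothetical target directory

with zipfile.ZipFile(JAR_PATH) as jar:
    # The Python package sits at the root of the JAR under graphframes/
    members = [m for m in jar.namelist() if m.startswith("graphframes/")]
    jar.extractall(EXTRACT_DIR, members)

# Make the directory containing the extracted graphframes/ package importable
sys.path.insert(0, EXTRACT_DIR)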


Thanks for your reply. Could you take another look at the follow-up question? Cheers. –


To be honest, I have no idea what is going on there. It looks like the same issue as https://forums.databricks.com/questions/9530/pyspark-graphframes-init-error.html, but I cannot reproduce it. – zero323


I was able to get it running by building the package and copying the graphframes folder into the pyspark directory (including the .pyc files). Thanks anyway! –