2016-04-19

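I'm using Python with Spark RDDs and DataFrames. How can I check whether something is an RDD or a DataFrame?

I tried isinstance(thing, RDD), but RDD wasn't recognized.

The reason I need this:

I'm writing a function that can be passed either an RDD or a DataFrame, so I need to call input.rdd to get the underlying RDD when a DataFrame is passed in.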

Answer

isinstance will work just fine, as long as both classes are imported:

from pyspark.sql import DataFrame
from pyspark.rdd import RDD

def foo(x):
    if isinstance(x, RDD):
        return "RDD"
    if isinstance(x, DataFrame):
        return "DataFrame"

foo(sc.parallelize([]))
## 'RDD'
foo(sc.parallelize([("foo", 1)]).toDF())
## 'DataFrame'

But single dispatch is a much more elegant approach:

from functools import singledispatch 

@singledispatch 
def bar(x): 
    pass 

@bar.register(RDD) 
def _(arg): 
    return "RDD" 

@bar.register(DataFrame) 
def _(arg): 
    return "DataFrame" 

bar(sc.parallelize([])) 
## 'RDD' 

bar(sc.parallelize([("foo", 1)]).toDF()) 
## 'DataFrame' 
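
One thing to keep in mind with singledispatch is that the base function handles every unregistered type, so with a body of pass it silently returns None. Below is a minimal sketch that applies the same pattern to the question's goal of getting the underlying RDD and fails loudly on anything else. FakeRDD, FakeDataFrame and to_rdd are hypothetical names, used here only so the snippet runs without a Spark installation; in real code you would presumably register pyspark's RDD and DataFrame instead.

```python
from functools import singledispatch


class FakeRDD:
    """Hypothetical stand-in for pyspark.rdd.RDD (assumption, not the real class)."""


class FakeDataFrame:
    """Hypothetical stand-in for pyspark.sql.DataFrame (assumption)."""

    def __init__(self, rdd):
        # Mirrors the DataFrame attribute that exposes the underlying RDD.
        self.rdd = rdd


@singledispatch
def to_rdd(x):
    # Fallback for unregistered types: raise instead of returning None.
    raise TypeError("expected an RDD or a DataFrame, got " + type(x).__name__)


@to_rdd.register(FakeRDD)
def _(x):
    # Already an RDD: return it unchanged.
    return x


@to_rdd.register(FakeDataFrame)
def _(x):
    # DataFrame-like: unwrap to the underlying RDD.
    return x.rdd
```

With this in place, the caller can pass either kind of object and always work with the RDD, while a wrong argument type surfaces immediately as a TypeError.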

If you don't mind an additional dependency, multipledispatch is also an interesting option:

from multipledispatch import dispatch 

@dispatch(RDD) 
def baz(x): 
    return "RDD" 

@dispatch(DataFrame) 
def baz(x): 
    return "DataFrame" 

baz(sc.parallelize([])) 
## 'RDD' 

baz(sc.parallelize([("foo", 1)]).toDF()) 
## 'DataFrame' 

Finally, the most Pythonic approach is to simply check the interface:

def foobar(x):
    if hasattr(x, "rdd"):
        ## It is a DataFrame
        return "DataFrame"
    else:
        ## It (probably) is an RDD
        return "RDD"
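
For the use case in the question, the interface check can be folded into a single helper. This is only a sketch under the same assumption the check above makes, namely that DataFrames have an rdd attribute and RDDs do not. as_rdd and FakeDataFrame are hypothetical names, and a plain list stands in for an RDD so the snippet runs without Spark.

```python
def as_rdd(x):
    """Return x.rdd if x has that attribute (DataFrame-like), otherwise x itself."""
    return getattr(x, "rdd", x)


class FakeDataFrame:
    """Hypothetical stand-in for a DataFrame, used only to exercise as_rdd."""

    def __init__(self, rdd):
        self.rdd = rdd


underlying = ["pretend", "rdd"]  # a list standing in for an RDD (assumption)
assert as_rdd(FakeDataFrame(underlying)) is underlying  # DataFrame-like: unwrapped
assert as_rdd(underlying) is underlying                 # RDD-like: passed through
```

The trade-off is the usual one with duck typing: any object that happens to have an rdd attribute is treated as a DataFrame, and anything else is passed through unchecked.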