2014-10-12 109 views
8

I noticed that this is already an issue on GitHub. Does anyone have code to convert a pandas DataFrame to an Orange Table?

To be explicit, I have the following table.

   user  hotel  star_rating  user  home_continent  gender
0     1     39          4.0     1               2  female
1     1     44          3.0     1               2  female
2     2     63          4.5     2               3  female
3     2      2          2.0     2               3  female
4     3     26          4.0     3               1    male
5     3     37          5.0     3               1    male
6     3     63          4.5     3               1    male
+0

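The Orange format doesn't look too difficult to just output: http://docs.orange.biolab.si/reference/rst/Orange.data.formats.html. It also supports importing CSV files and guessing the data types. Have you tried anything? – EdChum 2014-10-12 08:54:48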

+0

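So I can understand how data is saved to *.tab files, but specifically, is there a function or a series of calls that lets you convert a pandas DataFrame to an Orange Table? (Side comment: it's interesting how this page talks about how data is stored in external files but doesn't talk about how to save to or load from a file. I personally don't think Orange is very well documented.) – hlin117 2014-10-12 13:19:04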

+0

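So would a workflow be to save the table from pandas as a file and then import that file in Orange? Or is that too much? I'm guessing the field data types might not carry over well. – BKay 2014-10-16 19:01:00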
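For anyone who wants to try the file route raised in the comment above, here is a minimal sketch. It assumes, based on my reading of the formats page linked earlier, that Orange's native tab-delimited files begin with three header lines: column names, column types ("c" for continuous, "d" for discrete, "string"), and optional flags. The helper name `write_orange_tab` and the sample columns are hypothetical, and only the standard library is used.

```python
import csv
import io


def write_orange_tab(handle, names, types, rows, flags=None):
    """Write rows as tab-delimited text with a three-line header.

    names: column names; types: one type code per column
    (assumed here: "c" continuous, "d" discrete, "string");
    flags: optional per-column flags, left blank by default.
    """
    if len(names) != len(types):
        raise ValueError("names and types must have the same length")
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(names)
    writer.writerow(types)
    writer.writerow(flags if flags is not None else [""] * len(names))
    for row in rows:
        writer.writerow(row)


# Write to an in-memory buffer; a real file opened with open(path, "w") works the same way.
buffer = io.StringIO()
write_orange_tab(
    buffer,
    names=["hotel", "star_rating", "gender"],
    types=["d", "c", "d"],
    rows=[(39, 4.0, "female"), (26, 4.0, "male")],
)
tab_text = buffer.getvalue()
```

With a DataFrame, `names` could come from `df.columns` and `rows` from `df.itertuples(index=False)`; the type codes still have to be chosen per column, which is where the data type concern above comes in.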

Answers

17

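The documentation of the Orange package does not cover all the details. According to lib_kernel.cpp, Table.__init__(Domain, numpy.ndarray) works only for int and float.

They really should provide a C-level interface for pandas.DataFrames, or at least support numpy.dtype("str").

Update: added table2df; df2table now performs much better by using numpy for int and float columns.

Keep this script in your collection of Orange Python scripts, and you are equipped with pandas in your Orange environment.

Usage: a_pandas_dataframe = table2df(a_orange_table), a_orange_table = df2table(a_pandas_dataframe)

Note: this script works only in Python 2.x; refer to @DustinTang's answer for a Python 3.x compatible script.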

import pandas as pd 
import numpy as np 
import Orange 

#### For those who are familiar with pandas 
#### Correspondence: 
#### value <-> Orange.data.Value 
####  NaN <-> ["?", "~", "."] # Don't know, Don't care, Other 
#### dtype <-> Orange.feature.Descriptor 
####  category, int <-> Orange.feature.Discrete # category: > pandas 0.15 
####  int, float <-> Orange.feature.Continuous # Continuous = core.FloatVariable 
####             # refer to feature/__init__.py 
####  str <-> Orange.feature.String 
####  object <-> Orange.feature.Python 
#### DataFrame.dtypes <-> Orange.data.Domain 
#### pandas.DataFrame <-> Orange.data.Table = Orange.orange.ExampleTable
####                  # You will need this if you are reading sources

def series2descriptor(d, discrete=False):
    # Map a pandas Series to an Orange feature descriptor based on its dtype.
    if d.dtype == np.dtype("float"):
        return Orange.feature.Continuous(str(d.name))
    elif d.dtype == np.dtype("int"):
        return Orange.feature.Continuous(str(d.name), number_of_decimals=0)
    else:
        t = d.unique()
        # Discrete when requested, or when fewer than half of the values are distinct.
        if discrete or len(t) < len(d)/2:
            t.sort()
            return Orange.feature.Discrete(str(d.name), values=list(t.astype("str")))
        else:
            return Orange.feature.String(str(d.name))


def df2domain(df): 
    featurelist = [series2descriptor(df.iloc[:, col]) for col in xrange(len(df.columns))]
    return Orange.data.Domain(featurelist) 


def df2table(df): 
    # Orange.data types appear to use native Python objects/lists internally (?),
    # and I didn't find a constructor suitable for pandas.DataFrame, since a
    # DataFrame may carry multiple dtypes.
    # --> The closest match is Orange.data.Table.__init__(domain, numpy.ndarray),
    # --> but the dtype of the numpy array can only be "int" or "float".
    # --> * Refer to src/orange/lib_kernel.cpp 3059:
    # --> * if (((*vi)->varType != TValue::INTVAR) && ((*vi)->varType != TValue::FLOATVAR))
    # --> The documentation never mentions this >_<
    # So we use the numpy constructor for int/float columns and the Python list
    # constructor for the others.

    tdomain = df2domain(df)
    ttables = [series2table(df.iloc[:, i], tdomain[i]) for i in xrange(len(df.columns))]
    return Orange.data.Table(ttables)

    # For performance concerns, here are my results 
    # dtndarray = np.random.rand(100000, 100) 
    # dtlist = list(dtndarray) 
    # tdomain = Orange.data.Domain([Orange.feature.Continuous("var" + str(i)) for i in xrange(100)]) 
    # tinsts = [Orange.data.Instance(tdomain, list(dtlist[i]))for i in xrange(len(dtlist))] 
    # t = Orange.data.Table(tdomain, tinsts) 
    # 
    # timeit list(dtndarray) # 45.6ms 
    # timeit [Orange.data.Instance(tdomain, list(dtlist[i])) for i in xrange(len(dtlist))] # 3.28s 
    # timeit Orange.data.Table(tdomain, tinsts) # 280ms 

    # timeit Orange.data.Table(tdomain, dtndarray) # 380ms 
    # 
    # As illustrated above, utilizing constructor with ndarray can greatly improve performance 
    # So one may conceive better converter based on these results 


def series2table(series, variable):
    if series.dtype == np.dtype("int") or series.dtype == np.dtype("float"):
        # Use numpy
        # Table.__init__(Domain, numpy.ndarray)
        return Orange.data.Table(Orange.data.Domain(variable), series.values[:, np.newaxis])
    else:
        # Build instance list
        # Table.__init__(Domain, list_of_instances)
        tdomain = Orange.data.Domain(variable)
        tinsts = [Orange.data.Instance(tdomain, [i]) for i in series]
        return Orange.data.Table(tdomain, tinsts)
        # 5x performance


def column2df(col):
    if type(col.domain[0]) is Orange.feature.Continuous:
        return (col.domain[0].name, pd.Series(col.to_numpy()[0].flatten()))
    else:
        tmp = pd.Series(np.array(list(col)).flatten())  # type(tmp) -> np.array(dtype=list (Orange.data.Value))
        tmp = tmp.apply(lambda x: str(x[0]))
        return (col.domain[0].name, tmp)

def table2df(tab):
    # Orange.data.Table().to_numpy() cannot handle strings,
    # so we build the frame column by column;
    # string columns go through a Python list.
    series = [column2df(tab.select(i)) for i in xrange(len(tab.domain))]
    series_name = [i[0] for i in series]  # To keep the order of variables unchanged
    series_data = dict(series)
    return pd.DataFrame(series_data, columns=series_name)
+0

So it looks like you've provided a very thorough answer, thank you! Do these functions work for every Orange Table / pandas DataFrame? – hlin117 2014-10-19 16:15:59

+0

Hopefully yes. I tested it on my own datasets, but it may need more testing. – TurtleIzzy 2014-10-19 16:20:04

+0

This didn't work for me with Python 3 and Orange3. But thanks! – 2016-07-06 01:26:53

1

Like this?

table = Orange.data.Table(df.as_matrix()) 

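Columns in Orange will get generic names (a1, a2, ...). If you want to copy the names and types from the data frame, construct an Orange.data.Domain object from the data frame (http://docs.orange.biolab.si/reference/rst/Orange.data.domain.html#Orange.data.Domain.init) and pass it as the first argument above.

See the constructors at http://docs.orange.biolab.si/reference/rst/Orange.data.table.html.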

+0

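I get a domain error when I try this: "TypeError: invalid arguments for constructor (domain or examples or both expected)". Could you provide some code for adding a domain? – hlin117 2014-10-17 18:48:07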

+1

Suppose you have 'df = DataFrame({"A": [1, 2, 3, 4], "B": [8, 7, 6, 5]})'. Build a domain with 'domain = Orange.data.Domain([Orange.feature.Continuous(name) for name in df.columns])' and then create the table with 'table = Orange.data.Table(domain, df.as_matrix())'. – JanezD 2014-10-18 14:56:50

+0

Oh, and if it doesn't work: what does your data frame look like? If 'df.as_matrix().dtype' is 'object', Orange will not accept it. You have to convert categorical data to indices. – JanezD 2014-10-18 15:04:33
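A minimal sketch of that conversion, using only pandas (no Orange calls), in case it helps. The helper name `encode_categoricals` and the sample frame are hypothetical, and `.values` is used in place of `as_matrix()`, which newer pandas releases no longer provide. Each non-numeric column is replaced by integer codes, and the labels are kept so they can later be supplied as the values of a discrete feature.

```python
import pandas as pd

# Hypothetical sample resembling the table in the question.
df = pd.DataFrame({
    "star_rating": [4.0, 3.0, 4.5],
    "gender": ["female", "female", "male"],
})


def encode_categoricals(frame):
    """Return (numeric copy of frame, {column name: list of category labels})."""
    encoded = frame.copy()
    labels = {}
    for name in frame.columns:
        if not pd.api.types.is_numeric_dtype(frame[name]):
            cat = frame[name].astype("category")
            labels[name] = list(cat.cat.categories)  # position in this list == code
            encoded[name] = cat.cat.codes            # integer index per row
    return encoded, labels


encoded, labels = encode_categoricals(df)
matrix = encoded.values.astype(float)  # purely numeric now, no object dtype
```

The resulting `matrix` has a float dtype, which is the kind of array the comment above says Orange accepts; whether the codes are then interpreted as discrete values depends on the domain passed alongside it.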

2

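To convert a pandas DataFrame to an Orange Table, you need to construct a domain that specifies the column types.

For continuous variables you only need to provide the name of the variable, but for discrete variables you also need to provide a list of all possible values.

The following code constructs a domain for your DataFrame and converts it to an Orange Table: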

import numpy as np 
from Orange.feature import Discrete, Continuous 
from Orange.data import Domain, Table 
domain = Domain([ 
    Discrete('user', values=[str(v) for v in np.unique(df.user)]), 
    Discrete('hotel', values=[str(v) for v in np.unique(df.hotel)]), 
    Continuous('star_rating'), 
    Discrete('user', values=[str(v) for v in np.unique(df.user)]), 
    Discrete('home_continent', values=[str(v) for v in np.unique(df.home_continent)]), 
    Discrete('gender', values=['male', 'female'])], False) 
table = Table(domain, [map(str, row) for row in df.as_matrix()]) 

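The map(str, row) step is needed so that Orange knows the data contains the values of the discrete features (and not the indices of values in the values list).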
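The hand-written value lists above could also be derived from the frame. The sketch below uses only pandas and makes no Orange calls; the helper name `describe_columns`, the sample data, and the rule "float columns are continuous, everything else is discrete" are assumptions chosen to match the domain above, not something Orange prescribes.

```python
import pandas as pd

# Hypothetical sample with a subset of the question's columns.
df = pd.DataFrame({
    "hotel": [39, 44, 26],
    "star_rating": [4.0, 3.0, 4.0],
    "gender": ["female", "female", "male"],
})


def describe_columns(frame):
    """Map each column to None (continuous) or its sorted string values (discrete)."""
    spec = {}
    for name in frame.columns:
        column = frame[name]
        if pd.api.types.is_float_dtype(column):
            spec[name] = None
        else:
            spec[name] = [str(v) for v in sorted(column.unique())]
    return spec


spec = describe_columns(df)

# Stringify every cell so discrete entries are passed as values, not indices.
rows = [[str(v) for v in row] for row in df.itertuples(index=False)]
```

Each entry of `spec` that is a list would become the `values` argument of a discrete feature, and `rows` plays the role of the `map(str, row)` list in the answer.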

+0

This works great! I tested it, and it seems I can sort the table by gender, so I'll assume most of the other table functions work too. – hlin117 2014-10-18 18:02:18

+0

If you want to describe a feature as an ID, is there no other data type for that? (For example, a user ID.) – hlin117 2014-10-19 16:17:46

2

This code is modified from @TurtleIzzy's for Python 3.

import numpy as np 
from Orange.data import Table, Domain, ContinuousVariable, DiscreteVariable 


def series2descriptor(d):
    # Numeric columns become continuous; everything else becomes discrete.
    if d.dtype == np.dtype("float") or d.dtype == np.dtype("int"):
        return ContinuousVariable(str(d.name))
    else:
        t = d.unique()
        t.sort()
        return DiscreteVariable(str(d.name), list(t.astype("str")))

def df2domain(df): 
    featurelist = [series2descriptor(df.iloc[:,col]) for col in range(len(df.columns))] 
    return Domain(featurelist) 

def df2table(df): 
    tdomain = df2domain(df) 
    ttables = [series2table(df.iloc[:,i], tdomain[i]) for i in range(len(df.columns))] 
    ttables = np.array(ttables).reshape((len(df.columns),-1)).transpose() 
    return Table(tdomain , ttables) 

def series2table(series, variable):
    if series.dtype == np.dtype("int") or series.dtype == np.dtype("float"):
        series = series.values[:, np.newaxis]
        return Table(series)
    else:
        # .cat.codes is a Series; go through .values to get an array that supports reshape
        series = series.astype('category').cat.codes.values.reshape((-1, 1))
        return Table(series)