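Pandas DataFrame performance

Pandas is really great, but I am really surprised by how inefficient it is to retrieve values from a pandas.DataFrame. In the following toy example, even the DataFrame.iloc method is more than 100 times slower than a dictionary.

The question: is the lesson here just that dictionaries are the better way to look up values? Yes, I get that this is precisely what they were made for. But I wonder whether I am missing something about DataFrame lookup performance.

I realize this question is more "musing" than "asking", but I will accept an answer that provides insight or perspective on this. Thanks.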

import timeit 

setup = ''' 
import numpy, pandas 
df = pandas.DataFrame(numpy.zeros(shape=[10, 10])) 
dictionary = df.to_dict() 
''' 

f = ['value = dictionary[5][5]', 'value = df.loc[5, 5]', 'value = df.iloc[5, 5]'] 

for func in f: 
    print(func) 
    print(min(timeit.Timer(func, setup).repeat(3, 100000)))

value = dictionary[5][5]
0.130625009537
value = df.loc[5, 5]
19.4681699276
value = df.iloc[5, 5]
17.2575249672

2014-02-28 Owen

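A dict is to a DataFrame as a bicycle is to a car. You can pedal 10 feet on a bicycle faster than you can start a car, get it in gear, and so on. But if you need to go a mile, the car wins.

For certain small, targeted purposes, a dict may be faster. And if that is all you need, then use a dict, for sure! But if you need or want the power and luxury of a DataFrame, then a dict is no substitute. It is meaningless to compare speed if the data structure does not first satisfy your needs.

Now, to be more concrete: a dict is good for accessing columns, but it is not so convenient for accessing rows.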

import timeit 

setup = ''' 
import numpy, pandas 
df = pandas.DataFrame(numpy.zeros(shape=[10, 1000])) 
dictionary = df.to_dict() 
''' 

# f = ['value = dictionary[5][5]', 'value = df.loc[5, 5]', 'value = df.iloc[5, 5]'] 
f = ['value = [val[5] for col,val in dictionary.items()]', 'value = df.loc[5]', 'value = df.iloc[5]'] 

for func in f: 
    print(func) 
    print(min(timeit.Timer(func, setup).repeat(3, 100000)))

yields

value = [val[5] for col,val in dictionary.items()] 
25.5416321754 
value = df.loc[5] 
5.68071913719 
value = df.iloc[5] 
4.56006002426

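So the dict of lists is about 5 times slower at retrieving rows than df.iloc. The speed deficit becomes greater as the number of columns grows. (The number of columns is like the number of feet in the bicycle analogy. The longer the distance, the more convenient the car becomes...)

This is just one example of when a dict of lists is less convenient or slower than a DataFrame.

Another example is when you have a DatetimeIndex for the rows and wish to select all rows between certain dates. With a DataFrame you can use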
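To make the column-versus-row asymmetry concrete, here is a minimal sketch using the same 10 x 1000 shape as the benchmark. With the default orientation, `to_dict()` returns a nested dict of the form `{column: {index: value}}`, so a column is one lookup while a row needs a Python-level loop over every column. The variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Same shape as the benchmark above: 10 rows, 1000 columns.
df = pd.DataFrame(np.arange(10 * 1000, dtype=float).reshape(10, 1000))

# With the default orientation, to_dict() returns {column: {index: value}}.
dictionary = df.to_dict()

# One column is a single dictionary lookup.
column_5 = dictionary[5]

# One row requires visiting every column in a Python-level loop.
row_5_from_dict = [col_values[5] for col_values in dictionary.values()]

# The DataFrame returns the same row as a Series in one call.
row_5_from_df = df.iloc[5]

print(len(column_5), len(row_5_from_dict), len(row_5_from_df))
```

The loop touches all 1000 inner dicts for a single row, which is why the gap widens as columns are added.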

df.loc['2000-1-1':'2000-3-31']

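If you were to use a dict of lists, there is no easy analogue for that. And the Python loops you would need to select the right rows would again be terribly slow compared to the DataFrame.
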
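A small sketch of that comparison, assuming made-up daily data for the year 2000 (the `value` column and the dict layout are hypothetical, not from the benchmark above):

```python
import numpy as np
import pandas as pd

# Hypothetical data: one value per day for the year 2000.
dates = pd.date_range("2000-01-01", "2000-12-31", freq="D")
df = pd.DataFrame({"value": np.arange(len(dates))}, index=dates)

# Label-based slicing on a DatetimeIndex includes both endpoints.
first_quarter = df.loc["2000-1-1":"2000-3-31"]

# Plain-Python counterpart: a dict of lists plus an explicit filter loop.
data = {"date": list(dates), "value": df["value"].tolist()}
start, end = pd.Timestamp("2000-01-01"), pd.Timestamp("2000-03-31")
selected = [v for d, v in zip(data["date"], data["value"]) if start <= d <= end]

print(len(first_quarter), len(selected))
```

Both approaches select the same rows, but the dict version has to compare every date in Python, while the DataFrame slice works on the sorted index.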

2014-02-28 02:02:17 unutbu

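Answers like this could perhaps be added to the FAQ, see here: https://github.com/pydata/pandas/issues/3871 – Jeff

Thanks for the two really illuminating examples, and also for the analogy, which, as a cyclist, I appreciate. – Owen

I encountered the same problem. You can use at to improve it.

"Since indexing with [] must handle a lot of cases (single-label access, slicing, boolean indexing, etc.), it has a bit of overhead in order to figure out what you're asking for. If you only want to access a scalar value, the fastest way is to use the at and iat methods, which are implemented on all of the data structures."

See the official reference at http://pandas.pydata.org/pandas-docs/stable/indexing.html, chapter "Fast scalar value getting and setting": use at or iat for scalar operations.
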
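As a quick illustration of the two accessors, here is a sketch on a small hypothetical frame with string row labels, so that label-based and position-based access are visibly different:

```python
import pandas as pd

# Hypothetical frame with string labels so labels and positions differ.
df = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]},
    index=["x", "y", "z"],
)

# at: scalar access by row label and column label.
by_label = df.at["y", "b"]

# iat: scalar access by integer row position and column position.
by_position = df.iat[1, 1]

# Both accessors also support assigning a single cell.
df.at["z", "a"] = 30.0
df.iat[0, 1] = 40.0

print(by_label, by_position, df.at["z", "a"], df.iat[0, 1])
```

Both accessors only handle a single cell, so slices and boolean masks still go through loc and iloc.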

2014-04-24 00:58:46 user3566825

This is a good reference, but not as detailed as the answer above. – BCR

+1. Example benchmark:

In [1]: import numpy, pandas 
    ...: df = pandas.DataFrame(numpy.zeros(shape=[10, 10])) 
    ...: dictionary = df.to_dict() 

In [2]: %timeit value = dictionary[5][5] 
The slowest run took 34.06 times longer than the fastest. This could mean that an intermediate result is being cached 
1000000 loops, best of 3: 310 ns per loop 

In [4]: %timeit value = df.loc[5, 5] 
10000 loops, best of 3: 104 µs per loop 

In [5]: %timeit value = df.iloc[5, 5] 
10000 loops, best of 3: 98.8 µs per loop 

In [6]: %timeit value = df.iat[5, 5] 
The slowest run took 6.67 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 9.58 µs per loop 

In [7]: %timeit value = df.at[5, 5] 
The slowest run took 6.59 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 9.26 µs per loop

It seems that using at (iat) is about 10 times faster than loc (iloc).

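For anyone who wants to rerun the comparison outside IPython, a rough sketch with the standard timeit module follows. The repetition counts and the `results` dict are arbitrary choices for illustration, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros(shape=[10, 10]))
dictionary = df.to_dict()

statements = [
    "dictionary[5][5]",
    "df.loc[5, 5]",
    "df.iloc[5, 5]",
    "df.at[5, 5]",
    "df.iat[5, 5]",
]

# Explicit namespace so timeit can see the two objects being compared.
namespace = {"df": df, "dictionary": dictionary}

number = 10_000
results = {}
for stmt in statements:
    # Take the best of 3 runs, as %timeit does, to reduce noise.
    best = min(timeit.repeat(stmt, globals=namespace, repeat=3, number=number))
    results[stmt] = best / number  # seconds per call
    print(f"{stmt:<20} {results[stmt] * 1e6:8.2f} µs per call")
```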

2015-09-18 18:34:57 joon

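I encountered a different phenomenon when accessing DataFrame rows. I tested this simple example on a DataFrame with about 10,000,000 rows. The dictionary rocks.

import time

def testRow(go): 
    # go is a pandas DataFrame; convert it once to {column: {index: value}}.
    go_dict = go.to_dict() 
    times = 100000 
    ot = time.time() 
    for i in range(times): 
        go.iloc[100, :] 
    nt = time.time() 
    print('for iloc {}'.format(nt - ot)) 
    ot = time.time() 
    for i in range(times): 
        go.loc[100, 2] 
    nt = time.time() 
    print('for loc {}'.format(nt - ot)) 
    ot = time.time() 
    for i in range(times): 
        # items() replaces the Python 2-only iteritems().
        [val[100] for col, val in go_dict.items()] 
    nt = time.time() 
    print('for dict {}'.format(nt - ot))
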

2017-04-19 09:41:15 amityaffliction

I think the fastest way to access a cell is

df.get_value(row,column) 
df.set_value(row,column,value)

Both are faster than (I think)

df.iat[...] 
df.at[...]

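One caveat for current installs: to the best of my knowledge, `get_value` and `set_value` were deprecated in pandas 0.21 and removed in 1.0, so on recent versions those calls raise AttributeError. A minimal sketch of label-based stand-ins built on `at` is below; `get_cell` and `set_cell` are hypothetical helper names, not pandas API.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros(shape=[10, 10]))


def get_cell(frame, row, column):
    """Hypothetical stand-in for frame.get_value(row, column)."""
    return frame.at[row, column]


def set_cell(frame, row, column, value):
    """Hypothetical stand-in for frame.set_value(row, column, value)."""
    frame.at[row, column] = value
    return frame


set_cell(df, 5, 5, 1.5)
print(get_cell(df, 5, 5))
```

For positional access, the same pattern works with `iat` in place of `at`.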

2017-05-30 11:20:34
