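Pandas: element-wise comparison and creating a selection

In a DataFrame, I would like to compare the elements of a column with a value and sort the elements that pass the comparison into a new column.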

import pandas

df = pandas.DataFrame([{'A':3,'B':10}, 
         {'A':2, 'B':30}, 
         {'A':1,'B':20}, 
         {'A':2,'B':15}, 
         {'A':2,'B':100}]) 

df['C'] = [x for x in df['B'] if x > 18]

I can't figure out what is wrong and why I get:

ValueError: Length of values does not match length of index

Source

2016-05-24 mati

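As Darren said, all columns in a DataFrame must have the same length.

If you print [x for x in df['B'] if x > 18], you get only the values [30, 20, 100], but the DataFrame has five index entries (rows). That is why you get the "Length of values does not match length of index" error.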
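To make the mismatch concrete, here is a minimal sketch that compares the two lengths and catches the error. The names `filtered` and `error_message` are illustrative and not part of the original question.

```python
import pandas as pd

df = pd.DataFrame({'A': [3, 2, 1, 2, 2], 'B': [10, 30, 20, 15, 100]})

# The filter keeps only the values that pass the test, so the list is shorter
# than the DataFrame.
filtered = [x for x in df['B'] if x > 18]
print(len(filtered), len(df))

# Assigning a 3-element list as a column of a 5-row frame raises ValueError.
error_message = None
try:
    df['C'] = filtered
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```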

You can change the code as follows:

df['C'] = [x if x > 18 else None for x in df['B']] 
print(df)

You will get:

   A    B      C
0  3   10    NaN
1  2   30   30.0
2  1   20   20.0
3  2   15    NaN
4  2  100  100.0

Source

2016-05-24 07:27:53

I think you can use loc with boolean indexing:

print(df)
   A    B
0  3   10
1  2   30
2  1   20
3  2   15
4  2  100

print(df['B'] > 18)
0    False
1     True
2     True
3    False
4     True
Name: B, dtype: bool

df.loc[df['B'] > 18, 'C'] = df['B']
print(df)
   A    B      C
0  3   10    NaN
1  2   30   30.0
2  1   20   20.0
3  2   15    NaN
4  2  100  100.0

If you need to select rows by a condition, use boolean indexing:

print(df[df['B'] > 18])
   A    B
1  2   30
2  1   20
4  2  100
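One detail worth knowing about this kind of selection is that the filtered rows keep their original index labels. A minimal sketch, in which the names `selected` and `renumbered` are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({'A': [3, 2, 1, 2, 2], 'B': [10, 30, 20, 15, 100]})

# Boolean indexing returns only the matching rows, with their original labels.
selected = df[df['B'] > 18]
print(selected.index.tolist())    # [1, 2, 4]

# reset_index(drop=True) renumbers the rows from 0 if the old labels are not needed.
renumbered = selected.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1, 2]
```

Whether to renumber depends on what comes next: keeping the labels makes it possible to align the selection back against the original frame.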

If you need something faster, you can use where:

df['C'] = df.B.where(df['B'] > 18)
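As a small illustration of how `Series.where` behaves, the sketch below shows the default fill (NaN, which turns the integer column into floats) next to an explicit replacement value passed as `other`. The column names `C_nan` and `C_zero` and the fill value 0 are assumptions chosen for the example.

```python
import pandas as pd

df = pd.DataFrame({'A': [3, 2, 1, 2, 2], 'B': [10, 30, 20, 15, 100]})

# Default: values failing the condition become NaN, so the result is float64.
df['C_nan'] = df['B'].where(df['B'] > 18)

# With `other`, failing values are replaced by the given value instead of NaN.
df['C_zero'] = df['B'].where(df['B'] > 18, other=0)

print(df)
```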

Timings (len(df)=50k):

In [1367]: %timeit (a(df)) 
The slowest run took 8.34 times longer than the fastest. This could mean that an intermediate result is being cached. 
1000 loops, best of 3: 1.14 ms per loop 

In [1368]: %timeit (b(df1)) 
100 loops, best of 3: 15.5 ms per loop 

In [1369]: %timeit (c(df2)) 
100 loops, best of 3: 2.93 ms per loop

Code for the timings:

import pandas as pd 

df = pd.DataFrame([{'A':3,'B':10}, 
         {'A':2, 'B':30}, 
         {'A':1,'B':20}, 
         {'A':2,'B':15}, 
         {'A':2,'B':100}]) 
print (df) 
df = pd.concat([df]*10000).reset_index(drop=True) 
df1 = df.copy() 
df2 = df.copy() 

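# where-based version
def a(df): 
    df['C'] = df.B.where(df['B'] > 18) 
    return df 

# list-comprehension version (operates on its own argument, df1)
def b(df1):  
    df1['C'] = [x if x > 18 else None for x in df1['B']] 
    return df1 

# loc-based version (operates on its own argument, df2)
def c(df2):  
    df2.loc[df2['B'] > 18, 'C'] = df2['B'] 
    return df2 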

print (a(df)) 
print (b(df1)) 
print (c(df2))
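Outside IPython, a comparison like this can be approximated with the standard `timeit` module. The following is only a sketch: the function names `with_where`, `with_list` and `with_loc` are made up for this example, each function copies its input (which adds some overhead to every measurement), and the numbers will differ by machine and pandas version. It also checks that the three approaches produce the same column.

```python
import timeit

import pandas as pd

base = pd.DataFrame({'A': [3, 2, 1, 2, 2], 'B': [10, 30, 20, 15, 100]})
big = pd.concat([base] * 10000).reset_index(drop=True)  # 50,000 rows


def with_where(frame):
    out = frame.copy()
    out['C'] = out['B'].where(out['B'] > 18)
    return out


def with_list(frame):
    out = frame.copy()
    out['C'] = [x if x > 18 else None for x in out['B']]
    return out


def with_loc(frame):
    out = frame.copy()
    out.loc[out['B'] > 18, 'C'] = out['B']
    return out


funcs = (with_where, with_list, with_loc)

# Build each result once so the outputs can be compared.
results = {f.__name__: f(big) for f in funcs}

# Time each approach; `number=10` keeps the run short.
timings = {f.__name__: timeit.timeit(lambda f=f: f(big), number=10) for f in funcs}
for name, seconds in timings.items():
    print(f"{name}: {seconds / 10 * 1000:.2f} ms per call")
```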

Source

2016-05-24 07:10:35 jezrael

I added a new, faster method; please check it. Thanks. – jezrael

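All columns in a DataFrame must be the same length. Because you filter out some values, you are trying to insert fewer values into column C than there are in columns A and B.

So you have two options. Either keep the filtered values in a new, separate object for C: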

dfC = [x for x in df['B'] if x > 18]

Or put some dummy value in the column for the rows where x is not greater than 18. For example:

import numpy as np
df['C'] = np.where(df['B'] > 18, True, False)

Or even:

df['C'] = np.where(df['B'] > 18, 'Yay', 'Nay')
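The same `np.where` pattern can also keep the original value where the condition holds, which gets closer to what the question asked for. A minimal sketch, assuming NaN is an acceptable placeholder for the rows that fail the test:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [3, 2, 1, 2, 2], 'B': [10, 30, 20, 15, 100]})

# Take the value from B where B > 18, otherwise use NaN as the dummy value.
df['C'] = np.where(df['B'] > 18, df['B'], np.nan)
print(df)
```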

P.S. See also "Pandas conditional creation of a series/dataframe column" for other approaches.

Source

2016-05-24 07:10:57
