2017-04-19 27 views
1

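Efficient way to avoid looping over a Pandas DataFrame

I am converting an Excel spreadsheet to Python in order to automate and speed up several tasks. I need to add several columns to a DataFrame and fill them based on the values in an earlier column. I have it working with two nested for loops, but it is really slow, and I know Pandas is not designed for cell-by-cell work. Here is a sample of my problem: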

import pandas as pd

results = pd.DataFrame({'scores': [78.5, 91.0, 103.5], 'outcomes': [1, 0, 1]})

thresholds = [103.5, 98.5, 93.5, 88.5, 83.5, 78.5]

for threshold in thresholds:
    # Start every threshold column at 0
    results[str(threshold)] = 0
    for index, row in results.iterrows():
        if row['scores'] > threshold:
            # .at replaces set_value, which was removed in pandas 1.0
            results.at[index, str(threshold)] = row['outcomes']

print(results)

And the correct output:

   outcomes  scores  103.5  98.5  93.5  88.5  83.5  78.5
0         1    78.5      0     0     0     0     0     0
1         0    91.0      0     0     0     0     0     0
2         1   103.5      0     1     1     1     1     1

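What is a more efficient way to do this? I have been trying to transpose the DataFrame so that I work by columns instead of rows, but I could not get anything to work. Thanks for your help!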

+0

http://stackoverflow.com/questions/43398468/rounding-to-specific-numbers-in-python-3-6/43398652#43398652 – Serge

+0

http://stackoverflow.com/questions/14947909/python-checking-to-which-bin-a-value-belong?noredirect=1&lq=1 – Serge

Answers

2

This will do the job:

import pandas as pd 

results = pd.DataFrame({'scores':[78.5, 91.0, 103.5], 'outcomes':[1,0,1]}) 

thresholds = [103.5, 98.5, 93.5, 88.5, 83.5, 78.5] 

for threshold in thresholds: 
    results[str(threshold)] = results[['scores','outcomes']].apply(lambda x: x['outcomes'] if x['scores']>threshold else 0, axis=1) 

print (results) 

This prints:

   outcomes  scores  103.5  98.5  93.5  88.5  83.5  78.5
0         1    78.5      0   0.0   0.0   0.0   0.0   0.0
1         0    91.0      0   0.0   0.0   0.0   0.0   0.0
2         1   103.5      0   1.0   1.0   1.0   1.0   1.0
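
The apply call above still runs a Python lambda once per row. As a possible variation (a sketch, not the answerer's method), the loop over thresholds can stay while the row-wise apply is replaced by `Series.where`, which works on the whole column at once. It assumes the same `scores` and `outcomes` columns as in the question.

```python
import pandas as pd

results = pd.DataFrame({'scores': [78.5, 91.0, 103.5], 'outcomes': [1, 0, 1]})
thresholds = [103.5, 98.5, 93.5, 88.5, 83.5, 78.5]

for threshold in thresholds:
    # Keep `outcomes` where the score exceeds the threshold, otherwise use 0.
    exceeds = results['scores'] > threshold
    results[str(threshold)] = results['outcomes'].where(exceeds, 0)

print(results)
```

Because `where` fills the rows that fail the condition with the integer 0, the new columns should keep the integer dtype of `outcomes` instead of turning into floats.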
+0

Thanks! Works perfectly. – Greg

+0

You're welcome :) –

1

Here is a fully vectorized solution that computes the values without loops or list comprehensions over the data.

import numpy as np
import pandas as pd

results = pd.DataFrame({'scores': [78.5, 91.0, 103.5], 'outcomes': [1, 0, 1]})
thresholds = [4.7562029077978352, 4.6952820449271861, 4.6343611820565371, 4.5734403191858881, 103.5, 98.5, 93.5, 88.5, 83.5, 78.5]
# Fixed-precision strings make the column labels predictable
thresholds_col = ['{:.16f}'.format(e) for e in thresholds]
# Reshape to (n_rows, 1) so the comparison broadcasts against all thresholds at once
scores = results['scores'].to_numpy()[:, np.newaxis]
outcomes = results['outcomes'].to_numpy()[:, np.newaxis]
data = outcomes * (scores > np.asarray(thresholds))
results = results.join(pd.DataFrame(data=data, columns=thresholds_col, index=results.index))
print(results)
print(results[thresholds_col])

Out[79]:
   4.7562029077978352  4.6952820449271861  4.6343611820565371  \
0                   1                   1                   1
1                   0                   0                   0
2                   1                   1                   1

   4.5734403191858881  103.5000000000000000  98.5000000000000000  \
0                   1                     0                    0
1                   0                     0                    0
2                   1                     0                    1

   93.5000000000000000  88.5000000000000000  83.5000000000000000  \
0                    0                    0                    0
1                    0                    0                    0
2                    1                    1                    1

   78.5000000000000000
0                    0
1                    0
2                    1
+0

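When I run this code on the full dataset, I get KeyError: '4.7562029078'. The actual dataset has 200 thresholds, the first of which is 4.7562029077978352; is your code somehow rounding the thresholds to a certain number of digits? – Greg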

+0

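When you use floats as Pandas column names, they get rounded automatically. Do your thresholds all have the same length and number of decimal places? Can you give a few examples? – Allen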

+0

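The thresholds are computed dynamically from the incoming data as `(max - min) / number_of_bins`. Sometimes they come out neat, other times not so much. In this set, the first four thresholds are "4.7562029077978352, 4.6952820449271861, 4.6343611820565371, 4.5734403191858881". – Greg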
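
One way to sidestep the KeyError discussed in these comments is to build each column label exactly once and always look columns up through that stored label, instead of formatting the float again at lookup time. The sketch below assumes Python 3, where `repr()` of a float round-trips to the same value; the names `labels` and `flags` are hypothetical and not from the thread.

```python
import numpy as np
import pandas as pd

results = pd.DataFrame({'scores': [78.5, 91.0, 103.5], 'outcomes': [1, 0, 1]})
thresholds = [4.7562029077978352, 4.6952820449271861, 103.5, 98.5]

# Build each label once; repr() of a float converts back to the same float.
labels = {t: repr(t) for t in thresholds}

# Broadcast (n_rows, 1) against (n_thresholds,) to get one column per threshold.
scores = results['scores'].to_numpy()[:, np.newaxis]
outcomes = results['outcomes'].to_numpy()[:, np.newaxis]
data = outcomes * (scores > np.asarray(thresholds))

flags = pd.DataFrame(data, columns=[labels[t] for t in thresholds],
                     index=results.index)
results = results.join(flags)

# Look the column up through the mapping rather than re-formatting the float.
first = results[labels[thresholds[0]]]
print(first.tolist())
```

Since the mapping is keyed by the threshold itself, a lookup never depends on how many digits a particular formatting call happens to keep.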