How to read two rows and two columns of a pandas DataFrame at once and apply conditions to those row/column values?

I want to read two rows and two columns of a pandas DataFrame at a time, and then apply either zip or product, depending on a condition, between the two-row/two-column matrices of the DataFrame.

import pandas as pd 
import itertools as it 
from itertools import product 

cond_mcve = pd.read_csv('condition01.mcve.txt', sep='\t') 

    alfa alfa_index beta beta_index delta delta_index 
0 a,b   23 c,d   36 a,c   32 
1 a,c   23 b,e   37 c,d   32 
2 g,h   28 d,f   37 e,g   32 
3 a,b   28 c,d   39 a,c   34 
4 c,e   28 b,g   39 d,k   34

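Here alfa, beta and delta are string values, and each has its own corresponding index column.
I want to zip two adjacent strings (row-wise) if they have the same index value. So for the first two rows of the alfa column the output should be aa,cb, because alfa_index is 23 for both rows.
For the second and third rows of the alfa column, however, the two index values differ (23 and 28), so the output should be the product of the strings, i.e. ga,gc,ha,hc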
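
To make the two cases concrete, here is a minimal sketch in plain Python, using the alfa values from the table above. The variable names are only illustrative; the current row's letters come first in each pair, as in the expected output.

```python
from itertools import product

# Rows 0 and 1 of alfa share the index 23, so pair the letters position by position.
cur, prev = 'a,c'.split(','), 'a,b'.split(',')
zipped = ','.join(a + b for a, b in zip(cur, prev))

# Rows 1 and 2 of alfa have different indices (23 vs. 28), so take every combination.
cur, prev = 'g,h'.split(','), 'a,c'.split(',')
crossed = ','.join(a + b for a, b in product(cur, prev))

print(zipped)   # aa,cb
print(crossed)  # ga,gc,ha,hc
```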

This is how I have thought of doing it in my head. I hope I have stated the problem clearly.

# write a function 
def some_function(): 
    read_two columns at once (based on prefix similarity) 

    then: 
    if two integer_index are same: 
     zip(of strings belonging to that index) 

    if two integer index are different: 
     product(of strings belonging to that index) 

# take this function and apply it to pandas dataframe: 
cond_mcve_updated = cond_mcve+cond_mcve.shift(1).dropna(how='all').applymap(some_function)

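Here shift lets me read two rows at a time, so the problem of reading two rows at once is solved. However, I still have problems with reading two columns and implementing the condition: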

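Read two columns at once in the pandas DataFrame (based on prefix similarity).
Separate out the index values (integers) of those columns for comparison.
Apply zip vs. product based on the condition.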
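
The first two steps can be sketched roughly as follows. This is only an illustration on a cut-down copy of the data: it assumes every value column has a companion named `<name>_index`, and the names `pairs` and `same_as_prev` are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    'alfa': ['a,b', 'a,c', 'g,h'],
    'alfa_index': [23, 23, 28],
    'beta': ['c,d', 'b,e', 'd,f'],
    'beta_index': [36, 37, 37],
})

# Step 1: pair each value column with its "<name>_index" companion.
pairs = [(c, c + '_index') for c in df.columns if c + '_index' in df.columns]

# Step 2: per pair, flag the rows whose index equals the previous row's index.
# The first row has no predecessor, so it compares against NaN and is False.
same_as_prev = pd.DataFrame({
    value_col: df[index_col].eq(df[index_col].shift(1))
    for value_col, index_col in pairs
})

print(pairs)
print(same_as_prev)
```

Where `same_as_prev` is True the strings would be zipped; where it is False they would go through product.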

The expected final output would be:

alfa   alfa_index beta    beta_index delta delta_index 
1 aa,cb   23   bc,bd,ec,ed  37   ca,dc   32 
2 ga,gc,ha,hc 28   db,fe   37   ec,gd   32 
same for other line..... 

# the first index(i.e 0 is lost) but that's ok. I can work it out using `head/tail` method in pandas.

Source

2017-03-15 everestial007

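Here is one way to get the result. This function uses shift, concat and apply to run the data through a function that does the prod/sum thing depending on whether the index values match.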

Code:

import itertools as it 
import pandas as pd 

def crazy_prod_sum_thing(frame): 
    # pair each value column with its matching '<name>_index' column 
    labels = [(l, l + '_index') 
              for l in frame.columns.values if not l.endswith('_index')] 

    def func(row): 
        # row holds row n followed by row n-1 (see the concat below) 
        half = len(row) >> 1 
        front = row.iloc[:half]  # row n 
        back = row.iloc[half:]   # row n-1 

        # loop through the labels 
        results = [] 
        for l, i in labels: 
            x = front[l].split(',') 
            y = back[l].split(',') 
            if front[i] == back[i]: 
                # same index value: zip the two strings element-wise 
                results.append(x[0] + y[0] + ',' + x[1] + y[1]) 
            else: 
                # different index values: product of the two strings 
                results.append(
                    ','.join([x1 + y1 for x1, y1 in it.product(x, y)])) 

        return pd.Series(results) 

    # put row n-1 next to row n, drop the first row (it has no 
    # predecessor), and apply func row by row 
    df = pd.concat([frame, frame.shift(1)], axis=1)[1:].apply(
        func, axis=1) 

    df.rename(columns={i: x[0] + '_cpst' for i, x in enumerate(labels)}, 
              inplace=True) 
    return pd.concat([frame, df], axis=1)

Test code:

import pandas as pd 
from io import StringIO 
data = [x.strip() for x in """ 
     alfa alfa_index beta beta_index delta delta_index 
    0 a,b   23 c,d   36 a,c   32 
    1 a,c   23 b,e   37 c,d   32 
    2 g,h   28 d,f   37 e,g   32 
    3 a,b   28 c,d   39 a,c   34 
    4 c,e   28 b,g   39 d,k   34 
""".split('\n')[1:-1]] 
df = pd.read_csv(StringIO(u'\n'.join(data)), sep=r'\s+') 
print(df) 

print(crazy_prod_sum_thing(df))

Results:

alfa alfa_index beta beta_index delta delta_index 
0 a,b   23 c,d   36 a,c   32 
1 a,c   23 b,e   37 c,d   32 
2 g,h   28 d,f   37 e,g   32 
3 a,b   28 c,d   39 a,c   34 
4 c,e   28 b,g   39 d,k   34 

1   [aa,cb, bc,bd,ec,ed, ca,dc] 
2   [ga,gc,ha,hc, db,fe, ec,gd] 
3 [ag,bh, cd,cf,dd,df, ae,ag,ce,cg] 
4    [ca,eb, bc,gd, da,kc]

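Note:

This does not marshal the results back into the data frame in the layout shown in the question, because I was not sure how to pick the index values when they do not match.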
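
One way the results could be written back into the question's layout is sketched below. It rests on an assumption: when the two index values differ, the current row's index is kept, which is what the expected output in the question appears to show (beta_index 37 on row 1). The helper names `combine` and `zip_or_product` are invented for this sketch and are not part of the answer above.

```python
import itertools as it
import pandas as pd

def combine(cur, prev, same_index):
    """Zip element-wise when the indices match, otherwise take the product."""
    x, y = cur.split(','), prev.split(',')
    pairs = zip(x, y) if same_index else it.product(x, y)
    return ','.join(a + b for a, b in pairs)

def zip_or_product(frame):
    prev = frame.shift(1)          # row n-1 aligned with row n
    out = frame.iloc[1:].copy()    # row 0 has no predecessor, so it is dropped
    labels = [c for c in frame.columns if not c.endswith('_index')]
    for label in labels:
        idx = label + '_index'
        same = frame[idx].iloc[1:] == prev[idx].iloc[1:]
        out[label] = [
            combine(c, p, s)
            for c, p, s in zip(frame[label].iloc[1:], prev[label].iloc[1:], same)
        ]
    # ASSUMPTION: the *_index columns keep the current row's values.
    return out

df = pd.DataFrame({
    'alfa': ['a,b', 'a,c', 'g,h', 'a,b', 'c,e'],
    'alfa_index': [23, 23, 28, 28, 28],
    'beta': ['c,d', 'b,e', 'd,f', 'c,d', 'b,g'],
    'beta_index': [36, 37, 37, 39, 39],
    'delta': ['a,c', 'c,d', 'e,g', 'a,c', 'd,k'],
    'delta_index': [32, 32, 32, 34, 34],
})
result = zip_or_product(df)
print(result)
```

Using `zip` directly rather than indexing `x[0]` and `x[1]` also copes with strings that hold more than two comma-separated items, although `zip` silently truncates to the shorter side if the lengths differ.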

Source

2017-03-15 20:26:01

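This should be workable. If there is a way to retain the index values, I will try to work it out. Thank you very much. I am not accepting the answer just yet, since I would also like to see whether other answers come in. I will wait a few days for the question to get some attention. Thanks. – everestial007

I just tried the code but got an error at print(crazy_prod_sum_thing(df)). Error message: TypeError: ("cannot ... with these indexers [6.0] ...", 'occurred at index 1'). It hints at something about 'float', but the index values should be integers. What could be the problem? – everestial007

Tried both. I also printed the output of the file you created and of the one I posted. Both are exactly the same and of the same type. Both give exactly the same error message. idk why? – everestial007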
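
The `[6.0]` in the error message suggests a float being used as a slice bound. One possible cause, which cannot be confirmed from this thread alone, is the row length being halved with `/`, which gives a float under Python 3. The code shown above halves with `>> 1`, which stays an integer, so this would only apply to a variant that uses `/`. The sketch below uses a plain list as a stand-in for the concatenated row; pandas rejects float slice bounds in a similar way.

```python
row_len = 12                    # six columns from row n plus six from row n-1
values = list(range(row_len))   # stand-in for the concatenated row

half_float = row_len / 2        # true division: a float under Python 3
half_int = row_len // 2         # floor division: an int, same as row_len >> 1

try:
    values[:half_float]
    float_slice_worked = True
except TypeError:
    # slice bounds must be integers
    float_slice_worked = False

front, back = values[:half_int], values[half_int:]
print(half_float, half_int, float_slice_worked)
```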
