2017-10-09 189 views
0

Case-sensitive pandas Series matching and cleaning up pandas Series logic

I have a pandas DataFrame of fruits:

df = pd.read_csv(newfile, header=None) 
df 
            0        1         2           3       4           5        6      7
0       Apple  Bananas       Fig  Elderberry  Cherry    Honeydew      NaN    NaN
1     Bananas   Cherry    Dragon  Elderberry     NaN         NaN      NaN    NaN
2      Cherry    Grape       NaN         NaN     NaN         NaN      NaN    NaN
3      Dragon      NaN     Apple     Bananas  Cherry  Elderberry      NaN    NaN
4  Elderberry    Apple   Bananas         Fig   Grape         NaN      NaN    NaN
5         Fig   Cherry  Honeydew       Apple     NaN         NaN      NaN    NaN
6       Grape      NaN       NaN         NaN     NaN         NaN      NaN    NaN
7    Honeydew    Grape       Fig  Elderberry  Dragon      Cherry  Bananas  Apple

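I am trying to find "fruit pairs". For example, Apple and Fig are a pair: the first row (Apple) contains Fig, and the sixth row (Fig) contains Apple. The same goes for Apple-Elderberry and Elderberry-Apple. Apple and Bananas are not a pair, because the row starting with Bananas does not contain Apple.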

I have the following code, which works and does this:

fruits = df[0] 
stock = df.drop(0, axis=1) 

for i in range(len(fruits)):
    string1 = str(fruits[i])
    full_line = (stock.iloc[i])
    line = np.array(full_line.dropna(axis=0))
    if len(line) > 0:
        for j in range(len(stock)):
            iind = (fruits[fruits == line[j]].index[0])
            this_line = stock.iloc[iind]
            logic_out = this_line.str.match(string1)
            print(logic_out)

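BUT! (1) Because pandas Series comparisons are case-sensitive, it breaks at fruits == line[j], and (2) the boolean output is a mix of True, False and NaN. Ideally, I just want to count the Trues. Any and all help is very much appreciated!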
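The two issues described above can be reproduced in isolation. The following is only a minimal sketch under assumptions: the sample row and the helper name count_matches are made up for illustration and are not part of the asker's code. It shows one way to compare while ignoring case and to get a mask that contains no NaN, so that the True values can simply be summed.

```python
import numpy as np
import pandas as pd

# Hypothetical sample row; not taken from the asker's file.
row = pd.Series(["apple", "FIG", np.nan, "Cherry", "APPLE"])


def count_matches(values: pd.Series, name: str) -> int:
    """Count entries equal to `name`, ignoring case and treating NaN as no match."""
    # str.lower() leaves NaN as NaN, and eq() compares NaN as False,
    # so the mask is purely boolean and sum() counts the True values.
    mask = values.str.lower().eq(name.lower())
    return int(mask.sum())


# The regex route used in the question: case=False ignores case,
# na=False maps NaN to False instead of propagating it.
# Note that str.match only anchors at the start of the string,
# so the pattern "Apple" would also match "Applesauce".
regex_mask = row.str.match("Apple", case=False, na=False)

print(count_matches(row, "Apple"))
print(int(regex_mask.sum()))
```

Exact comparison after lower-casing is probably the safer choice when the values are whole fruit names; the regex variant is closer to the original loop.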

Answers

1

I would use set logic, pandas stacking, and NumPy broadcasting:

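import numpy as np
import pandas as pd

# Normalise case so that e.g. 'apple' and 'Apple' compare equal; NaN passes through.
f = lambda x: x.title() if isinstance(x, str) else x

# One set of stocked fruits per row, keyed by the fruit in the first column.
# df.columns[0] works whether that column is labelled 0 (header=None) or '0'.
s = df.applymap(f).set_index(df.columns[0]).rename_axis(None).stack().groupby(level=0).apply(set)

f = s.index
p = s.values

# one_way[i, j] is True when fruit j appears in the row of fruit i.
one_way = (p[:, None] & [{x} for x in f]).astype(bool)
# Keep only the pairs that hold in both directions.
[(f[i], f[j]) for i, j in zip(*np.where(one_way & one_way.T))]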

[('Apple', 'Elderberry'), 
('Apple', 'Fig'), 
('Apple', 'Honeydew'), 
('Bananas', 'Dragon'), 
('Bananas', 'Elderberry'), 
('Dragon', 'Bananas'), 
('Elderberry', 'Apple'), 
('Elderberry', 'Bananas'), 
('Fig', 'Apple'), 
('Fig', 'Honeydew'), 
('Honeydew', 'Apple'), 
('Honeydew', 'Fig')] 
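
For readers who want to check the result without broadcasting, here is a self-contained sketch of the same idea with plain Python sets. It is a swapped-in, simpler technique and not the answer's code. The assumptions: the CSV text is transcribed from the DataFrame printed in the question, the file is read with header=None so the first column is labelled with the integer 0, and the names CSV_TEXT, stock_sets and mutual_pairs are hypothetical. It reports each pair once, as (a, b) with a < b, rather than in both orders.

```python
import io

import pandas as pd

# Transcribed from the DataFrame shown in the question (an assumption about the file).
CSV_TEXT = """\
Apple,Bananas,Fig,Elderberry,Cherry,Honeydew,,
Bananas,Cherry,Dragon,Elderberry,,,,
Cherry,Grape,,,,,,
Dragon,,Apple,Bananas,Cherry,Elderberry,,
Elderberry,Apple,Bananas,Fig,Grape,,,
Fig,Cherry,Honeydew,Apple,,,,
Grape,,,,,,,
Honeydew,Grape,Fig,Elderberry,Dragon,Cherry,Bananas,Apple
"""

# header=None gives integer column labels 0..7, so the key column is 0, not '0'.
df = pd.read_csv(io.StringIO(CSV_TEXT), header=None)


def stock_sets(frame: pd.DataFrame) -> dict:
    """Map each fruit in the first column to the set of fruits in the rest of its row."""
    result = {}
    for _, row in frame.iterrows():
        key = str(row.iloc[0]).strip().title()
        # Drop the NaN padding and normalise case so comparisons ignore case.
        others = {str(v).strip().title() for v in row.iloc[1:].dropna()}
        result[key] = others
    return result


def mutual_pairs(stock: dict) -> list:
    """Return sorted pairs (a, b) with a < b where a's row lists b and b's row lists a."""
    return sorted(
        (a, b)
        for a, others in stock.items()
        for b in others
        if a < b and a in stock.get(b, set())
    )


pairs = mutual_pairs(stock_sets(df))
print(pairs)
```

On this data the sketch yields the six unordered pairs that correspond to the twelve ordered pairs listed above.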

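Hi @piRSquared, this looks great, but it crashes on the first line with a KeyError exception and the message '0'. ... I have edited the code above to show how I read in the df, and the .csv file is as below. – npross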


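Apple,Bananas,Fig,Elderberry,Cherry,Honeydew,, Bananas,Cherry,Dragon,Elderberry,,,, Cherry,Grape,,,,,, Dragon,Apple,,Bananas,Cherry,Elderberry, Elderberry,Apple,Bananas,Fig,Grape,,, Fig,Cherry,Honeydew,Apple,,,, Grape,,,,,,, Honeydew,Grape,Fig,Elderberry,Dragon,Cherry,Bananas,Apple – npross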


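What is the actual name of the first column? I assumed it was '0', because that is how it was parsed when I copied and pasted the DataFrame you provided. Try my update now. – piRSquared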