2016-04-05 52 views
1

Apply a function to a DataFrame column

I have a pandas DataFrame:

name sample 
1 a  Category 1: qwe, asd (line break) Category 2: sdf, erg 
2 b  Category 2: sdf, erg(line break) Category 5: zxc, eru 
... 
30 p  Category 1: asd, Category PE: 2134, EFDgh, Pdr tke, err 

I want to end up with:

    name  qwe  asd  sdf  erg  zxc  eru  2134  EFDgh  Pdr tke  err
1   a     1    1    1    1    0    0    0     0      0        0
2   b     0    0    1    1    1    1    0     0      0        0
...
30  p     0    1    0    0    0    0    0     1      1        0

I created the following function:

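def cleanattributes(istring):
    # Work on a string and split it into lines on the literal "\n" marker
    istring = str(istring)
    istring = istring.rstrip().split('\\n')

    # Keep only the text after the last ': ' on each line (drops the category title)
    for counter, line in enumerate(istring):
        istring[counter] = line.rpartition(': ')[-1]

    # Turn the list back into a string and strip the quote characters
    istring = str(istring)
    istring = istring.replace("'", "")
    istring = istring.replace("\"", "")
    return istring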

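This function produces the expected result, returning the category information without the category titles (the idea is to then use get_dummies to get the columns):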

teststring="Category 1: qwe, asd\\nCategory 2: sdf, erg" 
cleanattributes(teststring) 
OUTPUT: '[qwe, asd, sdf, erg]' 

I can't work out how best to apply this function to every record, so that the DataFrame looks like this:

name sample 
1 a  qwe, asd, sdf, erg 
2 b  sdf, erg, zxc, eru 
... 
30 p  asd, 2134, EFDgh, Pdr tke, err 

Or whether this is even the best way to approach the problem.
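
One possible way to run a cleaning function over every record is Series.apply, which calls the function once per value in a column. The sketch below is not the cleanattributes function from the question: the helper name clean_attributes and the two-row frame are assumptions made for illustration, and it assumes the records contain real newline characters rather than a literal backslash followed by n.

```python
import pandas as pd

def clean_attributes(text):
    # Hypothetical helper: split on real newlines, keep the text after the
    # last ': ' on each line, and join the pieces back with ', '.
    lines = str(text).rstrip().split('\n')
    return ', '.join(line.rpartition(': ')[-1] for line in lines)

# Illustrative data modelled on the first two rows of the question
df = pd.DataFrame({
    'name': ['a', 'b'],
    'sample': ['Category 1: qwe, asd\nCategory 2: sdf, erg',
               'Category 2: sdf, erg\nCategory 5: zxc, eru'],
})

# Series.apply runs the helper on each value of the column
df['sample'] = df['sample'].apply(clean_attributes)
print(df)
```

Because rpartition keeps only what follows the last ': ', a line that holds two categories (such as row 30 in the question) would lose the items of the first category, so this helper only fits records with one category per line.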

As requested:

df['sample'].iat[0]
OUTPUT: 'Category 1: qwe, asd\nCategory 2: sdf, erg'
+0

What is the EXACT output of df['sample'].iat[0]? – Alexander

+0

The output is 'Category 1: qwe, asd\nCategory 2: sdf, erg' (edit: removed an extra \n that I had accidentally added for testing purposes) –

Answers

2
import pandas as pd

df = pd.DataFrame(
    {'name': ['a', 'b'],
     'sample': ['Category 1: asd, Category PE: 2134, EFDgh, Pdr tke, err',
                'Category 2: sdf, erg\nCategory 5: zxc, eru\nCategory 1: asd, Category PE: 2134, EFDgh, Pdr tke, err']})

# Strip the category titles and newlines, then split each record into one column per item
df2 = pd.concat([df.name,
                 df['sample']
                 .str.replace("(Category .*:)+", '', regex=True)  # Remove "Category [*]:"
                 .str.replace(r'\n', '', regex=True)  # Remove "\n"
                 .str.split(', ', expand=True)],
                axis=1)

# Reshape to long format: one row per (name, item) pair
df3 = pd.melt(df2, id_vars='name')[['name', 'value']]

>>> pd.concat([df3['name'], pd.get_dummies(df3['value'])], axis=1)
    name  2134  EFDgh  Pdr tke  ergzxc  err  eru2134  sdf
0   a     1     0      0        0       0    0        0
1   b     0     0      0        0       0    0        1
2   a     0     1      0        0       0    0        0
3   b     0     0      0        1       0    0        0
4   a     0     0      1        0       0    0        0
5   b     0     0      0        0       0    1        0
6   a     0     0      0        0       1    0        0
7   b     0     1      0        0       0    0        0
8   a     0     0      0        0       0    0        0
9   b     0     0      1        0       0    0        0
10  a     0     0      0        0       0    0        0
11  b     0     0      0        0       1    0        0
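
The output above has one row per (name, item) pair, and items that were separated only by a newline end up merged (for example ergzxc). As a hedged alternative, not the answerer's method, the sketch below aims for the one-row-per-name layout asked for in the question by using Series.str.get_dummies. The three-row frame and the "|" separator are assumptions for illustration, and it assumes no item contains a ":" or "|" character.

```python
import pandas as pd

# Illustrative data modelled on rows 1, 2 and 30 of the question
df = pd.DataFrame({
    'name': ['a', 'b', 'p'],
    'sample': ['Category 1: qwe, asd\nCategory 2: sdf, erg',
               'Category 2: sdf, erg\nCategory 5: zxc, eru',
               'Category 1: asd, Category PE: 2134, EFDgh, Pdr tke, err'],
})

cleaned = (df['sample']
           # Drop every "Category <label>:" title; [^:]* stops at the first colon
           .str.replace(r'Category [^:]*:\s*', '', regex=True)
           # Turn commas and newlines (with surrounding spaces) into one separator
           .str.replace(r'\s*[,\n]\s*', '|', regex=True))

# str.get_dummies builds one 0/1 indicator column per distinct item
dummies = cleaned.str.get_dummies(sep='|')
result = pd.concat([df['name'], dummies], axis=1)
print(result)
```

Each input record stays on a single row, so no melt or regrouping step is needed; the indicator columns come back in sorted order rather than in order of first appearance.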