2017-08-19 51 views
1

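New pandas columns from regex parsing

I want to parse text data held in one column of a pandas DataFrame, based on tags and their values in that column, and store each tag's value in its own column. For example, if I create this DataFrame, df: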

import re
import pandas as pd

df = pd.DataFrame([[1,2],['A: this is a value B: this is the b val C: and here is c.','A: and heres another a. C: and another c']])
df = df.T
df.columns = ['col1','col2']


df['tags'] = df['col2'].apply(lambda x: re.findall(r'(?:\s|)(\w*)(?::)',x))
all_tags = [] 

for val in df['tags']: 
    all_tags = all_tags + val 
all_tags = list(set(all_tags)) 
for val in all_tags: 
    df[val] = '' 

df: 
    col1            col2  tags A C B 
0 1 A: this is a value B: this is the b val C: and... [A, B, C]  
1 2   A: and heres another a. C: and another c  [A, C] 

How can I populate each new "tag" column with its value from col2, so that I get this df:

col1            col2   tags \ 
0 1 A: this is a value B: this is the b val C: and... [A, B, C] 
1 2   A: and heres another a. C: and another c  [A, C] 

        A    C     B 
0  this is a value and here is c. this is the b val 
1 and heres another a. and another c 

Answers

4

Another option is to use str.extractall with the regex (?P<key>\w+):(?P<val>[^:]*)(?=\w+:|$).

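The regex captures the key before the colon, (?P<key>\w+), and the value after the colon, (?P<val>[^:]*), as two separate columns, key and val. val matches non-colon characters until it reaches the next key-value pair, which is delimited by the lookahead (?=\w+:|$). This assumes the key is always a single word; otherwise the split would be ambiguous.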
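As a rough illustration of how the pattern splits one of the sample strings, here is a minimal sketch using only the standard re module. The variable names (pat, text, pairs) are chosen for illustration, and the .strip() call is an addition, since the val group keeps the spaces around each value.

```python
import re

# Pattern quoted in the answer: a one-word key, a colon, then a value
# that runs until the next "word:" or the end of the string.
pat = r'(?P<key>\w+):(?P<val>[^:]*)(?=\w+:|$)'

text = 'A: this is a value B: this is the b val C: and here is c.'

# strip() is an addition: the raw val group keeps surrounding spaces.
pairs = [(m.group('key'), m.group('val').strip())
         for m in re.finditer(pat, text)]

print(pairs)
# [('A', 'this is a value'), ('B', 'this is the b val'), ('C', 'and here is c.')]
```

Because the lookahead does not consume the next key, each match ends just before the following tag, so the next search starts exactly there.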

​​



Here, str.extractall gives:

pat = r'(?P<key>\w+):(?P<val>[^:]*)(?=\w+:|$)'
df.col2.str.extractall(pat)


Then you pivot the result and concatenate it with the original data frame.
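One possible way to do the pivot and the join is sketched below. This is an assumption about how the steps could be written, not necessarily the answerer's exact code: the names pairs, wide, and result are illustrative, unstack is used as the pivot, and the whitespace stripping is an addition.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2],
    'col2': ['A: this is a value B: this is the b val C: and here is c.',
             'A: and heres another a. C: and another c'],
})

pat = r'(?P<key>\w+):(?P<val>[^:]*)(?=\w+:|$)'

# One row per (original row, match number), with columns 'key' and 'val'.
pairs = df['col2'].str.extractall(pat)
pairs['val'] = pairs['val'].str.strip()  # addition: trim surrounding spaces

# Drop the match counter, move 'key' into the index, then pivot it into columns.
wide = (pairs.reset_index(level='match', drop=True)
             .set_index('key', append=True)['val']
             .unstack('key'))

# Align on the original row index; tags missing from a row become NaN.
result = df.join(wide)
print(result[['A', 'B', 'C']])
```

Note that unstack requires each tag to appear at most once per row; a repeated tag in the same string would raise an error in this sketch.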

1

Here's one way

In [683]: (df.col2.str.findall(r'[\S]+(?:\s(?!\S+:)\S+)+')
      .apply(lambda x: pd.Series(dict([v.split(':', 1) for v in x]))) 
     ) 
Out[683]: 
         A     B    C 
0  this is a value this is the b val and here is c. 
1 and heres another a.     NaN and another c 

You can append the result back to the original frame using join

In [690]: df.join(df.col2.str.findall(r'[\S]+(?:\s(?!\S+:)\S+)+')
        .apply(lambda x: pd.Series(dict([v.split(':', 1) for v in x])))) 
Out[690]: 
    col1            col2  tags \ 
0 1 A: this is a value B: this is the b val C: and... [A, B, C] 
1 2   A: and heres another a. C: and another c  [A, C] 

         A     B    C 
0  this is a value this is the b val and here is c. 
1 and heres another a.     NaN and another c 

In fact, you can also get df['tags'] using the string method

In [688]: df.col2.str.findall(r'(?:\s|)(\w*)(?::)')
Out[688]: 
0 [A, B, C] 
1  [A, C] 
Name: col2, dtype: object 

Details

Split each string into a list of "tag: value" groups

In [684]: df.col2.str.findall(r'[\S]+(?:\s(?!\S+:)\S+)+')
Out[684]: 
0 [A: this is a value, B: this is the b val, C: ... 
1   [A: and heres another a., C: and another c] 
Name: col2, dtype: object 

Now, convert each group into a key and value pair.

In [685]: (df.col2.str.findall(r'[\S]+(?:\s(?!\S+:)\S+)+')
      .apply(lambda x: [v.split(':', 1) for v in x])) 
Out[685]: 
0 [[A, this is a value], [B, this is the b val... 
1 [[A, and heres another a.], [C, and another c]] 
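The remaining step, turning each list of pairs into columns, can be sketched as a plain script. This is a variation on the approach above rather than the answerer's exact code: parse_pairs is a hypothetical helper, the frame is built from a list of dicts instead of apply with pd.Series, and the .strip() call is an addition because split(':', 1) leaves a leading space on each value.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2],
    'col2': ['A: this is a value B: this is the b val C: and here is c.',
             'A: and heres another a. C: and another c'],
})

def parse_pairs(text):
    """Hypothetical helper: split 'tag: value' groups into a dict."""
    # Each group runs until the next whitespace-delimited token ending in ':'.
    groups = re.findall(r'\S+(?:\s(?!\S+:)\S+)+', text)
    # Split on the first colon only; strip() trims the leftover space.
    return {key: value.strip()
            for key, value in (g.split(':', 1) for g in groups)}

# A list of dicts becomes one column per tag; missing tags are NaN.
parsed = pd.DataFrame([parse_pairs(t) for t in df['col2']], index=df.index)
result = df.join(parsed)
print(result[['A', 'B', 'C']])
```

Building the frame once from a list of dicts avoids constructing a separate Series per row, which tends to matter on larger inputs.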
Name: col2, dtype: object