New pandas columns with regex parsing

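I want to parse text data in one column of a pandas DataFrame, based on the tags and values it contains, and store each value in its own column. For example, if I create this DataFrame, df: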

import re
import pandas as pd

df = pd.DataFrame([[1,2],['A: this is a value B: this is the b val C: and here is c.','A: and heres another a. C: and another c']])
df = df.T
df.columns = ['col1','col2']


# collect the tag names (the word before each colon) for every row
df['tags'] = df['col2'].apply(lambda x: re.findall(r'(?:\s|)(\w*)(?::)',x))
all_tags = []

for val in df['tags']:
    all_tags = all_tags + val
all_tags = list(set(all_tags))
# add one empty column per distinct tag
for val in all_tags:
    df[val] = ''

df:
  col1                                               col2       tags A C B
0    1  A: this is a value B: this is the b val C: and...  [A, B, C]
1    2           A: and heres another a. C: and another c     [A, C]

How can I populate each new "tag" column with its value from col2, so that I get this df:

  col1                                               col2       tags  \
0    1  A: this is a value B: this is the b val C: and...  [A, B, C]
1    2           A: and heres another a. C: and another c     [A, C]

                      A               C                  B
0       this is a value  and here is c.  this is the b val
1  and heres another a.   and another c

2017-08-19 this_is_david

Another option is to use str.extractall with the regex (?P<key>\w+):(?P<val>[^:]*)(?=\w+:|$):

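The regex captures the key before the colon, (?P<key>\w+), and the value after the colon, (?P<val>[^:]*), as two separate columns named key and val. val matches non-colon characters until it reaches the next key-value pair, as constrained by the lookahead (?=\w+:|$). This assumes the key is always a single word; otherwise the match would be ambiguous.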
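To see what the pattern captures, here is a minimal sketch that applies it to the first sample string with the standard re module (the variable names pat, text and pairs are only for illustration). Note that the captured values keep their surrounding whitespace.

```python
import re

# The regex from the answer, written as a raw string.
pat = re.compile(r"(?P<key>\w+):(?P<val>[^:]*)(?=\w+:|$)")

text = "A: this is a value B: this is the b val C: and here is c."

# Each match yields one (key, val) pair; val stops just before the next "word:".
pairs = [(m.group("key"), m.group("val")) for m in pat.finditer(text)]
print(pairs)
# [('A', ' this is a value '), ('B', ' this is the b val '), ('C', ' and here is c.')]
```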

Here pat is the regex above, and the str.extractall call is:

df.col2.str.extractall(pat)

Then you pivot the result and concatenate it with the original DataFrame.
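The answer's full code did not survive on this page, so the following is only a sketch of how the pivot-and-combine step could look, assuming the sample data from the question. The names pairs, wide and result are illustrative, and the str.strip() and fillna('') calls are additions to match the desired output.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [1, 2],
    "col2": [
        "A: this is a value B: this is the b val C: and here is c.",
        "A: and heres another a. C: and another c",
    ],
})

pat = r"(?P<key>\w+):(?P<val>[^:]*)(?=\w+:|$)"

# One row per match, indexed by (original row, match number).
pairs = df["col2"].str.extractall(pat)
pairs["val"] = pairs["val"].str.strip()  # drop the whitespace around each value

# Pivot: one column per key, one row per original row.
wide = (
    pairs.reset_index(level="match", drop=True)
         .set_index("key", append=True)["val"]
         .unstack("key")
)

# Combine with the original frame; missing tags become empty strings.
result = df.join(wide).fillna("")
print(result[["A", "B", "C"]])
```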


2017-08-19 16:58:51 Psidom

Here's one way

In [683]: (df.col2.str.findall(r'[\S]+(?:\s(?!\S+:)\S+)+')
             .apply(lambda x: pd.Series(dict([v.split(':', 1) for v in x])))
          )
Out[683]:
                       A                   B                C
0        this is a value   this is the b val   and here is c.
1   and heres another a.                 NaN    and another c

You can append the result back to the original using join

In [690]: df.join(df.col2.str.findall(r'[\S]+(?:\s(?!\S+:)\S+)+')
                    .apply(lambda x: pd.Series(dict([v.split(':', 1) for v in x]))))
Out[690]:
  col1                                               col2       tags  \
0    1  A: this is a value B: this is the b val C: and...  [A, B, C]
1    2           A: and heres another a. C: and another c     [A, C]

                       A                   B                C
0        this is a value   this is the b val   and here is c.
1   and heres another a.                 NaN    and another c

In fact, you can get df['tags'] using a string method

In [688]: df.col2.str.findall(r'(?:\s|)(\w*)(?::)')
Out[688]:
0    [A, B, C]
1       [A, C]
Name: col2, dtype: object

Details:

Split each string into a list of "key: value" groups

In [684]: df.col2.str.findall(r'[\S]+(?:\s(?!\S+:)\S+)+')
Out[684]:
0    [A: this is a value, B: this is the b val, C: ...
1          [A: and heres another a., C: and another c]
Name: col2, dtype: object

Now, turn each list into key and value pairs.

In [685]: (df.col2.str.findall(r'[\S]+(?:\s(?!\S+:)\S+)+')
             .apply(lambda x: [v.split(':', 1) for v in x]))
Out[685]:
0    [[A,  this is a value], [B,  this is the b val...
1    [[A,  and heres another a.], [C,  and another c]]
Name: col2, dtype: object
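As a variation on the steps above (not the answer's exact code), the split can be wrapped in a small helper that also strips the leading space from each value, and the resulting dicts passed straight to the DataFrame constructor instead of building a pd.Series per row. The helper name parse_pairs is hypothetical.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "col1": [1, 2],
    "col2": [
        "A: this is a value B: this is the b val C: and here is c.",
        "A: and heres another a. C: and another c",
    ],
})

def parse_pairs(text):
    """Hypothetical helper: split 'K: value K2: value2' into {K: value, K2: value2}."""
    # Same chunking regex as above: a token, then following tokens that are not "word:".
    chunks = re.findall(r"\S+(?:\s(?!\S+:)\S+)+", text)
    return {key: value.strip() for key, value in (c.split(":", 1) for c in chunks)}

# A list of dicts becomes one column per key; tags missing from a row become NaN.
wide = pd.DataFrame([parse_pairs(s) for s in df["col2"]], index=df.index)
result = df.join(wide)
print(result[["A", "B", "C"]])
```

Building the frame from a list of dicts avoids creating one Series per row, which tends to be slow on larger frames.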


2017-08-19 16:53:43 Zero
