2016-03-03

Creating a list of anagrams from a pandas column

I have a simple DataFrame as follows:

df = pd.DataFrame({ 
'notes': pd.Series(['meth cook makes meth with purity of over 96%', 'meth cook is also called Heisenberg', 'meth cook has cancer', 'he is known as the best meth cook', 'Meth Dealer added chili powder to his batch', 'Meth Dealer learned to make the best meth', 'everyone goes to this Meth Dealer for best shot', 'girlfriend of the meth dealer died', 'this lawyer is a people pleasing person', 'cinnabon has now hired the lawyer as a baker', 'lawyer had to take off in the end', 'lawyer has a lot of connections who knows other guy']), 
'name': pd.Series([np.nan, 'Walter White', np.nan, np.nan, np.nan, np.nan, 'Jessie Pinkman', np.nan, 'Saul Goodman', np.nan, np.nan, np.nan]), 
'occupation': pd.Series(['meth cook', np.nan, np.nan, np.nan, np.nan, np.nan, 'meth dealer', np.nan, np.nan, 'lawyer', np.nan, np.nan]) 
}) 

It looks like this:

name         notes          occupation 
NaN      meth cook makes meth with purity of over 96%    meth cook 
Walter White   meth cook is also called Heisenberg        NaN 
NaN      meth cook has cancer           NaN 
NaN      he is known as the best meth cook        NaN 
NaN      Meth Dealer added chili powder to his batch      NaN 
NaN      Meth Dealer learned to make the best meth      NaN 
Jessie Pinkman   everyone goes to this Meth Dealer for best shot    meth dealer 
NaN      girlfriend of the meth dealer died        NaN 
Saul Goodman   this lawyer is a people pleasing person       NaN 
NaN      cinnabon has now hired the lawyer as a baker     lawyer 
NaN      lawyer had to take off in the end        NaN 
NaN      lawyer has a lot of connections who knows other guy    NaN 

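I want to create a list of words/anagrams from the 'notes' column. I also want to exclude any numbers and special characters in the 'notes' column (for example, I don't want 96% in the output).

I also want to write all the individual words (without duplicates) to a text file.

How can I do this in Python?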

Answer


IIUC, you can use str.replace to remove numbers and special characters:

import pandas as pd 
import numpy as np 

df = pd.DataFrame({ 
'notes': pd.Series(['meth cook makes meth with purity of over 96%', 'meth cook is also called Heisenberg', 'meth cook has cancer', 'he is known as the best meth cook', 'Meth Dealer added chili powder to his batch', 'Meth Dealer learned to make the best meth', 'everyone goes to this Meth Dealer for best shot', 'girlfriend of the meth dealer died', 'this lawyer is a people pleasing person', 'cinnabon has now hired the lawyer as a baker', 'lawyer had to take off in the end', 'lawyer has a lot of connections who knows other guy']), 
'name': pd.Series([np.nan, 'Walter White', np.nan, np.nan, np.nan, np.nan, 'Jessie Pinkman', np.nan, 'Saul Goodman', np.nan, np.nan, np.nan]), 
'occupation': pd.Series(['meth cook', np.nan, np.nan, np.nan, np.nan, np.nan, 'meth dealer', np.nan, np.nan, 'lawyer', np.nan, np.nan]) 
}) 

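# remove all digits and the characters % and *
# (regex=True is required in recent pandas versions, where the pattern is literal by default)
df['notes'] = df['notes'].str.replace(r"[0-9%*]+", "", regex=True)
print(df)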
       name            notes \ 
0    NaN   meth cook makes meth with purity of over  
1  Walter White    meth cook is also called Heisenberg 
2    NaN        meth cook has cancer 
3    NaN     he is known as the best meth cook 
4    NaN  Meth Dealer added chili powder to his batch 
5    NaN   Meth Dealer learned to make the best meth 
6 Jessie Pinkman everyone goes to this Meth Dealer for best shot 
7    NaN     girlfriend of the meth dealer died 
8  Saul Goodman   this lawyer is a people pleasing person 
9    NaN  cinnabon has now hired the lawyer as a baker 
10    NaN     lawyer had to take off in the end 
11    NaN lawyer has a lot of connections who knows othe... 

    occupation 
0  meth cook 
1   NaN 
2   NaN 
3   NaN 
4   NaN 
5   NaN 
6 meth dealer 
7   NaN 
8   NaN 
9  lawyer 
10   NaN 
11   NaN 
# join all strings into one big string, separated by spaces
# (summing the column would glue the last word of one row to the first word of the next)
text = " ".join(df['notes'])
print(text)
meth cook makes meth with purity of over  meth cook is also called Heisenberg meth cook has cancer he is known as the best meth cook Meth Dealer added chili powder to his batch Meth Dealer learned to make the best meth everyone goes to this Meth Dealer for best shot girlfriend of the meth dealer died this lawyer is a people pleasing person cinnabon has now hired the lawyer as a baker lawyer had to take off in the end lawyer has a lot of connections who knows other guy

print(type(text))
<class 'str'>

# remove duplicate words, keeping first-seen order (case-sensitive: 'Meth' and 'meth' both stay)
words = text.split()
individual_words = " ".join(sorted(set(words), key=words.index))
print(individual_words)
meth cook makes with purity of over is also called Heisenberg has cancer he known as the best Meth Dealer added chili powder to his batch learned make everyone goes this for shot girlfriend dealer died lawyer a people pleasing person cinnabon now hired baker had take off in end lot connections who knows other guy

#write to file 
with open("Output.txt", "w") as text_file: 
    text_file.write(individual_words) 
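
If every digit and special character should be excluded (not only digits, % and *), one possible variation is to extract runs of letters instead of deleting unwanted characters. The sketch below is not part of the original answer. It assumes ASCII-only words, assumes that 'Meth' and 'meth' should count as the same word (hence the lowercasing), and uses a shortened stand-in for the 'notes' column plus a placeholder file name, unique_words.txt.

```python
import pandas as pd

# Shortened stand-in for the 'notes' column from the question.
notes = pd.Series([
    "meth cook makes meth with purity of over 96%",
    "Meth Dealer added chili powder to his batch",
    "this lawyer is a people pleasing person",
])

# Keep only runs of ASCII letters, so digits and punctuation never become words.
# Lowercasing first is an assumption: it merges 'Meth' and 'meth'.
word_lists = notes.str.lower().str.findall(r"[a-z]+")

# Flatten the per-row lists and drop duplicates, keeping first-seen order.
unique_words = list(dict.fromkeys(w for row in word_lists for w in row))

# Write one word per line (the file name is a placeholder).
with open("unique_words.txt", "w") as text_file:
    text_file.write("\n".join(unique_words))
```

Here word_lists holds one list of words per row, which covers the "list of words" part of the question, and dict.fromkeys removes duplicates in a single pass while preserving order (Python 3.7+).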

If my answer was helpful, don't forget to [accept](http://meta.stackexchange.com/questions/5234/how-do-accepting-an-answer-work) it and upvote. Thanks. – jezrael

Thanks! I will now apply this to my larger DataFrame. This solution makes a lot of sense –

Thank you very much. – jezrael