2016-12-29 44 views
6

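Create a pandas DataFrame from a txt file with a specific pattern

I need to create a pandas DataFrame from a text file with the following structure: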

Alabama[edit] 
Auburn (Auburn University)[1] 
Florence (University of North Alabama) 
Jacksonville (Jacksonville State University)[2] 
Livingston (University of West Alabama)[2] 
Montevallo (University of Montevallo)[2] 
Troy (Troy University)[2] 
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4] 
Tuskegee (Tuskegee University)[5] 
Alaska[edit] 
Fairbanks (University of Alaska Fairbanks)[2] 
Arizona[edit] 
Flagstaff (Northern Arizona University)[6] 
Tempe (Arizona State University) 
Tucson (University of Arizona) 
Arkansas[edit] 

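The lines ending in "[edit]" are states and the lines ending in [number] are regions. I need to split the text as shown below and repeat the state name for each region name.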

Index   State   Region Name 
0    Alabama  Aurburn... 
1    Alabama  Florence... 
2    Alabama  Jacksonville... 
... 
9    Alaska   Fairbanks... 
10    Alaska   Arizona... 
11    Alaska   Flagstaff... 

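Pandas DataFrame

I don't know how to split the text file based on "[edit]" and "[number]" or "(characters)" into the respective columns and repeat the state name for each region name. Could anyone give me a starting point for accomplishing this task?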

+0

Possible duplicate of [read_table in pandas, how to get input from text to a dataframe](http://stackoverflow.com/questions/40413380/read-table-in-pandas-how-to-get-input-from-text-to-a-dataframe) – root

Answers

0

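Before putting the file into a DataFrame, you will probably need to do some extra processing on it.

A starting point would be to split the file into lines and search each line for the string [edit], using the state name as a dictionary key when it is present...

I don't think pandas has any built-in method for handling a file in this format.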
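The approach described above could be sketched roughly as follows. This is only an illustration, not the answerer's code: the helper name `parse_states` and the short sample input are made up for the example, and it assumes every state line ends with `[edit]` and every region name is followed by ` (`.

```python
import pandas as pd


def parse_states(lines):
    """Group region names under the most recent '[edit]' state line.

    Hypothetical helper for illustration; not part of pandas.
    """
    regions_by_state = {}
    state = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.endswith('[edit]'):
            # a state line: start a new dictionary entry
            state = line[:-len('[edit]')]
            regions_by_state[state] = []
        elif state is not None:
            # a region line: keep only the text before the first " ("
            regions_by_state[state].append(line.split(' (')[0])
    return regions_by_state


sample = [
    'Alabama[edit]',
    'Auburn (Auburn University)[1]',
    'Florence (University of North Alabama)',
    'Alaska[edit]',
    'Fairbanks (University of Alaska Fairbanks)[2]',
]
regions_by_state = parse_states(sample)

# flatten the dictionary into (state, region) rows
rows = [(state, region)
        for state, regions in regions_by_state.items()
        for region in regions]
df = pd.DataFrame(rows, columns=['State', 'Region Name'])
print(df)
```

For the real file, the list `sample` would be replaced by the open file object, since iterating over a file yields its lines.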

4

Assuming you have the following DF:

In [73]: df 
Out[73]: 
               text 
0          Alabama[edit] 
1      Auburn (Auburn University)[1] 
2    Florence (University of North Alabama) 
3  Jacksonville (Jacksonville State University)[2] 
4   Livingston (University of West Alabama)[2] 
5   Montevallo (University of Montevallo)[2] 
6       Troy (Troy University)[2] 
7 Tuscaloosa (University of Alabama, Stillman Co... 
8     Tuskegee (Tuskegee University)[5] 
9          Alaska[edit] 
10  Fairbanks (University of Alaska Fairbanks)[2] 
11          Arizona[edit] 
12   Flagstaff (Northern Arizona University)[6] 
13     Tempe (Arizona State University) 
14      Tucson (University of Arizona) 
15          Arkansas[edit] 
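The answer does not show how `df` was built. One way to get a single-column frame like this, offered as a sketch: read the file line by line and put the stripped lines into a `text` column. The file name `unis.txt` is an assumption borrowed from another answer in this thread; a small `StringIO` stands in for the file here so the snippet runs on its own.

```python
from io import StringIO

import pandas as pd

# stand-in for open('unis.txt'); the file name is an assumption
f = StringIO(
    "Alabama[edit]\n"
    "Auburn (Auburn University)[1]\n"
    "Alaska[edit]\n"
    "Fairbanks (University of Alaska Fairbanks)[2]\n"
)

# one row per non-empty line, surrounding whitespace stripped
lines = [line.strip() for line in f if line.strip()]
df = pd.DataFrame({'text': lines})
print(df)
```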

You can use the Series.str.extract() method:

In [117]: df['State'] = df.loc[df.text.str.contains('[edit]', regex=False), 'text'].str.extract(r'(.*?)\[edit\]', expand=False) 

In [118]: df['Region Name'] = df.loc[df.State.isnull(), 'text'].str.extract(r'(.*?)\s*[\(\[]+.*[\n]*', expand=False) 

In [120]: df.State = df.State.ffill() 

In [121]: df 
Out[121]: 
               text  State Region Name 
0          Alabama[edit] Alabama   NaN 
1      Auburn (Auburn University)[1] Alabama  Auburn 
2    Florence (University of North Alabama) Alabama  Florence 
3  Jacksonville (Jacksonville State University)[2] Alabama Jacksonville 
4   Livingston (University of West Alabama)[2] Alabama Livingston 
5   Montevallo (University of Montevallo)[2] Alabama Montevallo 
6       Troy (Troy University)[2] Alabama   Troy 
7 Tuscaloosa (University of Alabama, Stillman Co... Alabama Tuscaloosa 
8     Tuskegee (Tuskegee University)[5] Alabama  Tuskegee 
9          Alaska[edit] Alaska   NaN 
10  Fairbanks (University of Alaska Fairbanks)[2] Alaska  Fairbanks 
11          Arizona[edit] Arizona   NaN 
12   Flagstaff (Northern Arizona University)[6] Arizona  Flagstaff 
13     Tempe (Arizona State University) Arizona   Tempe 
14      Tucson (University of Arizona) Arizona  Tucson 
15          Arkansas[edit] Arkansas   NaN 

In [122]: df = df.dropna() 

In [123]: df 
Out[123]: 
               text State Region Name 
1      Auburn (Auburn University)[1] Alabama  Auburn 
2    Florence (University of North Alabama) Alabama  Florence 
3  Jacksonville (Jacksonville State University)[2] Alabama Jacksonville 
4   Livingston (University of West Alabama)[2] Alabama Livingston 
5   Montevallo (University of Montevallo)[2] Alabama Montevallo 
6       Troy (Troy University)[2] Alabama   Troy 
7 Tuscaloosa (University of Alabama, Stillman Co... Alabama Tuscaloosa 
8     Tuskegee (Tuskegee University)[5] Alabama  Tuskegee 
10  Fairbanks (University of Alaska Fairbanks)[2] Alaska  Fairbanks 
12   Flagstaff (Northern Arizona University)[6] Arizona  Flagstaff 
13     Tempe (Arizona State University) Arizona   Tempe 
14      Tucson (University of Arizona) Arizona  Tucson 
3

You can parse the file into tuples first:

import pandas as pd
from collections import namedtuple

Item = namedtuple('Item', 'state area')
items = []
suffix = '[edit]'

with open('unis.txt') as f:
    for line in f:
        l = line.strip()
        if l.endswith(suffix):
            # slice the marker off; rstrip('[edit]') strips a set of
            # characters and would also eat the end of e.g. "Delaware"
            state = l[:-len(suffix)]
        elif l:
            # the area is everything before the first " ("
            area = l.split(' (')[0]
            items.append(Item(state, area))

df = pd.DataFrame.from_records(items, columns=['State', 'Area'])

print(df)

Output:

 State   Area 
0 Alabama  Auburn 
1 Alabama  Florence 
2 Alabama Jacksonville 
3 Alabama Livingston 
4 Alabama Montevallo 
5 Alabama   Troy 
6 Alabama Tuscaloosa 
7 Alabama  Tuskegee 
8 Alaska  Fairbanks 
9 Arizona  Flagstaff 
10 Arizona   Tempe 
11 Arizona  Tucson 
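A note on removing the `[edit]` marker: `str.rstrip('[edit]')` treats its argument as a set of characters, not as a suffix. For the sample data that gives the right result, but a state name ending in one of those letters loses them too. A small demonstration (the state name is chosen only to show the effect):

```python
suffix = '[edit]'

# rstrip removes any trailing run of the characters [, e, d, i, t, ]
stripped = 'Delaware[edit]'.rstrip(suffix)
print(stripped)   # Delawar

# slicing removes exactly the marker and nothing else
sliced = 'Delaware[edit]'[:-len(suffix)]
print(sliced)     # Delaware
```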
1

TL;DR
s.groupby(s.str.extract(r'(?P<State>.*?)\[edit\]', expand=False).ffill()).apply(pd.Series.tail, n=-1).reset_index(name='Region_Name').iloc[:, [0, 2]]


regex = r'(?P<State>.*?)\[edit\]' # pattern to match (raw string, so \[ is not an invalid escape)
print(s.groupby(
    # will get nulls where we don't have "[edit]" 
    # forward fill fills in the most recent line 
    # where we did have an "[edit]" 
    s.str.extract(regex, expand=False).ffill() 
).apply(
    # I still have all the original values 
    # If I group by the forward filled rows 
    # I'll want to drop the first one within each group 
    pd.Series.tail, n=-1 
).reset_index(
    # munge the dataframe to get columns sorted 
    name='Region_Name' 
)[['State', 'Region_Name']]) 

     State          Region_Name 
0 Alabama      Auburn (Auburn University)[1] 
1 Alabama    Florence (University of North Alabama) 
2 Alabama Jacksonville (Jacksonville State University)[2] 
3 Alabama   Livingston (University of West Alabama)[2] 
4 Alabama   Montevallo (University of Montevallo)[2] 
5 Alabama       Troy (Troy University)[2] 
6 Alabama Tuscaloosa (University of Alabama, Stillman Co... 
7 Alabama     Tuskegee (Tuskegee University)[5] 
8 Alaska  Fairbanks (University of Alaska Fairbanks)[2] 
9 Arizona   Flagstaff (Northern Arizona University)[6] 
10 Arizona     Tempe (Arizona State University) 
11 Arizona      Tucson (University of Arizona) 
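The `tail(n=-1)` step may look odd: a negative `n` asks for every row except the first `|n|`, which is what drops the `[edit]` line at the top of each group. A tiny illustration on made-up data:

```python
import pandas as pd

group = pd.Series(['Alaska[edit]',
                   'Fairbanks (University of Alaska Fairbanks)[2]'])

# negative n: everything except the first row
rest = group.tail(-1)
print(rest.tolist())
```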

Setup

txt = """Alabama[edit] 
Auburn (Auburn University)[1] 
Florence (University of North Alabama) 
Jacksonville (Jacksonville State University)[2] 
Livingston (University of West Alabama)[2] 
Montevallo (University of Montevallo)[2] 
Troy (Troy University)[2] 
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4] 
Tuskegee (Tuskegee University)[5] 
Alaska[edit] 
Fairbanks (University of Alaska Fairbanks)[2] 
Arizona[edit] 
Flagstaff (Northern Arizona University)[6] 
Tempe (Arizona State University) 
Tucson (University of Arizona) 
Arkansas[edit]""" 

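import pandas as pd
from io import StringIO

# squeeze=True was removed from read_csv in pandas 2.0; take the single column instead
s = pd.read_csv(StringIO(txt), sep='|', header=None).iloc[:, 0]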
5

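You can first use read_csv with the names parameter to create a DataFrame with a Region Name column; the separator should be a value that does not occur in the data (e.g. ;):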

df = pd.read_csv('filename.txt', sep=";", names=['Region Name']) 

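Then insert a new column State, using extract on the rows that contain the text [edit], and replace everything from ( to the end of the string in the column Region Name:

df.insert(0, 'State', df['Region Name'].str.extract(r'(.*)\[edit\]', expand=False).ffill())
# regex=True is required on pandas 2.0+, where str.replace defaults to literal matching
df['Region Name'] = df['Region Name'].str.replace(r' \(.+$', '', regex=True)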

Finally, remove the rows that contain the text [edit] by boolean indexing; the mask is created by str.contains:

df = df[~df['Region Name'].str.contains(r'\[edit\]')].reset_index(drop=True)
print (df) 
     State Region Name 
0 Alabama  Auburn 
1 Alabama  Florence 
2 Alabama Jacksonville 
3 Alabama Livingston 
4 Alabama Montevallo 
5 Alabama   Troy 
6 Alabama Tuscaloosa 
7 Alabama  Tuskegee 
8 Alaska  Fairbanks 
9 Arizona  Flagstaff 
10 Arizona   Tempe 
11 Arizona  Tucson 
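The `str.replace` step depends on the pattern being treated as a regular expression. As far as I know, pandas 2.0 and later treat the pattern as a literal string unless `regex=True` is passed, so it seems safer to state it explicitly. A minimal check on two sample values:

```python
import pandas as pd

names = pd.Series(['Auburn (Auburn University)[1]',
                   'Tempe (Arizona State University)'])

# drop everything from " (" to the end of the string
cleaned = names.str.replace(r' \(.+$', '', regex=True)
print(cleaned.tolist())   # ['Auburn', 'Tempe']
```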

If you need the full values, the solution is simpler:

df = pd.read_csv('filename.txt', sep=";", names=['Region Name'])
df.insert(0, 'State', df['Region Name'].str.extract(r'(.*)\[edit\]', expand=False).ffill())
df = df[~df['Region Name'].str.contains(r'\[edit\]')].reset_index(drop=True)
print (df) 
     State          Region Name 
0 Alabama      Auburn (Auburn University)[1] 
1 Alabama    Florence (University of North Alabama) 
2 Alabama Jacksonville (Jacksonville State University)[2] 
3 Alabama   Livingston (University of West Alabama)[2] 
4 Alabama   Montevallo (University of Montevallo)[2] 
5 Alabama       Troy (Troy University)[2] 
6 Alabama Tuscaloosa (University of Alabama, Stillman Co... 
7 Alabama     Tuskegee (Tuskegee University)[5] 
8 Alaska  Fairbanks (University of Alaska Fairbanks)[2] 
9 Arizona   Flagstaff (Northern Arizona University)[6] 
10 Arizona     Tempe (Arizona State University) 
11 Arizona      Tucson (University of Arizona) 
+0

Thank you @jezrael, I used your suggestion and it works perfectly, thanks. –

+0

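Don't forget to click the check mark next to this answer if it worked for you. See [What should I do when someone answers my question](https://stackoverflow.com/help/someone-answers) for some guidance! – charlesreid1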