2013-06-03 26 views
1

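Aggregate text fields from multiple lines (javascript/python)

First of all, I'd like you to know that I'm fairly new to coding, and I only have a superficial knowledge of Python and Javascript.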

I have a huge TXT file containing names and their team names, structured as follows:

Name1, Surname1 Team1 
        Team2 
        Team3 
Name2, Surname2 Team2 
        Team4 
Name3, Surname3 Team1 
        Team5 

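Ideally, I would like to search my data by Team# and get back the names of the people who belong to that team.

For example, I need the members of Team1 and Team2. My new TXT output should look like this: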

Team1, Name1, Surname1, Name3, Surname3 
Team2, Name1, Surname1, Name2, Surname2 

Thank you very much for your help.

+0

What exactly is your input structure? One line, multiple lines, and where are the line breaks? – Johannes

+0

Are there spaces in the surnames and/or team names? Are there tabs in between, or are the team names in a fixed column? –

+0

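@Johannes: The input is quite messy. The only "structured" part is "Name1, Surname1", which always has a comma followed by one space. As for the teams, they are usually placed in a fixed column; however, the first team reported (on the name-surname line) is often not aligned with the team column, depending on the "Name, Surname" field on that line. – user2447387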

Answers

0

Here is a Python version you can look at:

import io
from collections import defaultdict

# Sample input; a line holding only a team belongs to the person above it
fobj_in = io.StringIO("""Name1, Surname1 Team1
        Team2
        Team3
Name2, Surname2 Team2
        Team4
Name3, Surname3 Team1
        Team5""")

fobj_out = io.StringIO()

teams = defaultdict(list)

for line in fobj_in:
    items = line.split()
    if len(items) == 3:
        # New person, e.g. ['Name1,', 'Surname1', 'Team1']
        name = items[:2]
        team = items[2]
    else:
        # Continuation line: only a team, for the most recent person
        team = items[0]
    teams[team].append(name)

for team_name in sorted(teams.keys()):
    fobj_out.write(team_name + ', ')
    for name in teams[team_name][:-1]:
        fobj_out.write('{} {}, '.format(name[0], name[1]))
    name = teams[team_name][-1]
    fobj_out.write('{} {}\n'.format(name[0], name[1]))

fobj_out.seek(0)
print(fobj_out.read())

Output:

Team1, Name1, Surname1, Name3, Surname3 
Team2, Name1, Surname1, Name2, Surname2 
Team3, Name1, Surname1 
Team4, Name2, Surname2 
Team5, Name3, Surname3 
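
The question only asks for Team1 and Team2, while the output above lists every team. A minimal sketch of restricting the result to selected teams follows. It assumes the same three-token layout as the sample input; `group_by_team` and `format_teams` are hypothetical helper names introduced for illustration, not part of the original answer.

```python
import io
from collections import defaultdict

SAMPLE = """Name1, Surname1 Team1
        Team2
        Team3
Name2, Surname2 Team2
        Team4
Name3, Surname3 Team1
        Team5"""


def group_by_team(lines):
    """Map each team to the list of 'Name, Surname' strings on it."""
    teams = defaultdict(list)
    name = None
    for line in lines:
        items = line.split()
        if not items:
            continue  # skip blank lines
        if len(items) == 3:
            # New person: 'Name,' 'Surname' 'Team' (assumed single-word fields)
            name = ' '.join(items[:2])
            team = items[2]
        else:
            # Continuation line: only a team, same person as before
            team = items[0]
        teams[team].append(name)
    return teams


def format_teams(teams, wanted):
    """Return one 'Team, Name, Surname, ...' line per wanted team that exists."""
    return [', '.join([team] + teams[team]) for team in wanted if team in teams]


teams = group_by_team(io.StringIO(SAMPLE))
for row in format_teams(teams, ['Team1', 'Team2']):
    print(row)
```

Teams that do not occur in the input are silently skipped, so a misspelled team name produces no line rather than an error.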

To read from and write to actual files, just do this:

fobj_in = open('in_file.txt') 
fobj_out = open('out_file.txt', 'w') 
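
When working with real files it is usually safer to let `with` close them. The following is a sketch under the same three-token assumption as the sample input; `regroup_file` is a hypothetical name, and the UTF-8 encoding is an assumption about the data, not something stated in the question.

```python
from collections import defaultdict


def regroup_file(in_path, out_path):
    """Read 'Name, Surname Team' lines and write one line per team."""
    teams = defaultdict(list)
    name = None
    # Assumption: the file is UTF-8 encoded
    with open(in_path, encoding='utf-8') as fobj_in:
        for line in fobj_in:
            items = line.split()
            if not items:
                continue  # skip blank lines
            if len(items) == 3:
                name = ' '.join(items[:2])
                team = items[2]
            else:
                team = items[0]
            teams[team].append(name)
    # Both files are closed automatically when their 'with' block ends
    with open(out_path, 'w', encoding='utf-8') as fobj_out:
        for team_name in sorted(teams):
            fobj_out.write(', '.join([team_name] + teams[team_name]) + '\n')
```

It would be called as `regroup_file('in_file.txt', 'out_file.txt')`.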

EDIT: The sample data does not seem to contain a case that would lead to multiple names on one output line.

With this input data, we need to change the code:

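from collections import defaultdict

# fobj_in and fobj_out are the file objects opened above
teams = defaultdict(list)
for line in fobj_in:
    if not line.strip():
        continue  # skip blank lines
    # Fields are tab-separated; drop empty and whitespace-only entries
    items = [entry.strip() for entry in line.split('\t') if entry.strip()]
    if len(items) == 2:
        # New person: name and first team
        name = items[0]
        team = items[1]
    else:
        # Continuation line: only a team, for the most recent person
        team = items[0]
    teams[team].append(name)
for team_name in sorted(teams.keys()):
    fobj_out.write(team_name + ', ')
    for name in teams[team_name][:-1]:
        fobj_out.write('{}, '.format(name))
    name = teams[team_name][-1]
    fobj_out.write('{}\n'.format(name))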

The resulting file content looks like this:

"Décore ta vie" (2003), Boilard, Naggy 
"Mouki" (2010), Boileau, Sonia 
A chacun sa place (2011), Boinem, Victor Emmanuel 
Absence (2009) (V), Boillat, Patricia 
C.A.L.L.E. (2005), Boillat, Patricia 
Comment devenir un trou de cul et enfin plaire aux femmes (2004), Boire, Roger 
Couleur de peau: Miel (2012), Boileau, Laurent 
Hergé:Les aventures de Tintin (2004), Boillot, Olivier 
Isola, là dove si parla la lingua di Bacco (2011) (co-director), Boillat, Patricia 
L'île (2011), Boillot, Olivier 
La beauté fatale et féroce... (1996), Boire, Roger 
Last Call Indian (2010), Boileau, Sonia 
Le Temple Oublié (2005), Boillot, Olivier 
Le pied tendre (1988), Boire, Roger 
Legit (2006), Boinski, James W. 
Nubes (2010), Boira, Francisco 
Questions nationales (2009), Boire, Roger 
Reconciling Rwanda (2007), Boiko, Patricia 
Soviet Gymnasts (1955), Boikov, Vladimir 
The Corporal's Diary (2008) (V) (head director), Boiko, Patricia 
Un gars ben chanceux (1977), Boire, Roger 
+0

Thanks, but will it handle multiple names, i.e. double first names/surnames, separate team names... (see the comments above) – user2447387

+0

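It will handle the sample input. How am I supposed to know what your actual input looks like? Are compound names also separated by spaces, commas, or something else? How many parts do first names, surnames, or teams have? The code needs to be adapted to that. –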

+0

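Yes, I know, I'm sorry. I edited my question to post a link to a sample of my database to clarify things (https://www.dropbox.com/s/sl3tu7m77gei987/sample.txt). Well, there may in fact be multiple first names and surnames. Also, the team field is quite long, since other kinds of information (when available) and quotation marks may be added to it over time. Ideally, I would search for my keyword in the "team" string (which contains the description above plus other information), and the code should return the names of the people associated with it. – user2447387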
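
The keyword search described in this comment could be sketched as below. This is only an illustration under stated assumptions: fields are tab-separated, a line that starts with whitespace continues the previous person, and the match is a case-insensitive substring test. The tab layout of `SAMPLE` and the name `find_names` are hypothetical; the name/title pairs are taken from the output shown in the answer.

```python
import io

# Hypothetical tab-separated layout; the real sample file may differ
SAMPLE = (
    "Boire, Roger\tLe pied tendre (1988)\n"
    "\t\t\tUn gars ben chanceux (1977)\n"
    "Boillat, Patricia\tAbsence (2009) (V)\n"
    "\t\t\tC.A.L.L.E. (2005)\n"
)


def find_names(lines, keyword):
    """Return the names whose team field contains keyword (case-insensitive)."""
    keyword = keyword.lower()
    found = []
    name = None
    for line in lines:
        items = [entry.strip() for entry in line.split('\t') if entry.strip()]
        if not items:
            continue  # skip blank lines
        if line[0].isspace():
            # Continuation line: everything on it is team text
            team = ' '.join(items)
        else:
            # New person: first field is the name, the rest is team text
            name = items[0]
            team = ' '.join(items[1:])
        if name is not None and keyword in team.lower() and name not in found:
            found.append(name)
    return found


print(find_names(io.StringIO(SAMPLE), 'absence'))
```

Because the name is taken as a whole field rather than split on spaces, compound first names and surnames need no special handling as long as the separator between name and team really is a tab.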