2012-10-11 25 views
8

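Parsing text file data from rows and outputting it as columns

I want to parse a test file. The file contains user names, addresses and phone numbers in the following format: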

Name: John Doe1 
address : somewhere 
phone: 123-123-1234 

Name: John Doe2 
address : somewhere 
phone: 123-123-1233 

Name: John Doe3 
address : somewhere 
phone: 123-123-1232 

Only there are almost 10,000 users :) What I want to do is convert those rows into columns, for example:

Name: John Doe1    address : somewhere   phone: 123-123-1234 
Name: John Doe2    address : somewhere   phone: 123-123-1233 
Name: John Doe3    address : somewhere   phone: 123-123-1232 

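I would prefer to do this in bash, but a Python solution is welcome too. The file with this information is located at /root/docs/information. Any tips or help would be greatly appreciated.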

+2

What have you tried? – nneonneo

+0

Nice first question, @tafiela. But on your next question, don't forget to point out what you have tried. – Yamaneko

+0

Is the address really just one line after the colon? – 2012-10-11 03:46:37

Answers

5

One way, using GNU awk:

awk 'BEGIN { FS="\n"; RS=""; OFS="\t\t" } { print $1, $2, $3 }' file.txt 

Result:

Name: John Doe1  address : somewhere  phone: 123-123-1234 
Name: John Doe2  address : somewhere  phone: 123-123-1233 
Name: John Doe3  address : somewhere  phone: 123-123-1232 

Note that I have set the output field separator (OFS) to two tab characters (\t\t). You can change this to any character or set of characters you like. HTH.

+0

+1 - you beat me to it. –

+0

What does 'RS' do? – Yamaneko

+1

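@VictorHugo: 'RS' is short for record separator. By default 'RS' is set to '\n', a newline, which lets 'awk' process a file line by line. When we set it to nothing (that is, ""), we are changing what 'awk' treats as a record: records are then separated by blank lines. Since each record in the file is separated by a blank line, setting 'RS=""' solves the problem easily. HTH. – Steve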
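
For readers more comfortable in Python, the record-splitting idea described in the comment above can be sketched roughly as follows. This only illustrates the idea and is not how awk is implemented; the function name `records_to_rows` and the sample string are made up for the example, and unlike awk this sketch also strips whitespace around each line.

```python
import re

def records_to_rows(text, sep="\t\t"):
    """Split text into blank-line-separated records (the role of RS="")
    and join each record's lines (the role of FS="\n") with sep (the role of OFS)."""
    rows = []
    # One or more blank lines end a record.
    for record in re.split(r"\n\s*\n", text.strip()):
        # Assumption for this sketch: trailing spaces on a line are not wanted.
        fields = [line.strip() for line in record.split("\n")]
        rows.append(sep.join(fields))
    return rows

# Hypothetical sample in the question's format.
sample = (
    "Name: John Doe1 \naddress : somewhere \nphone: 123-123-1234 \n"
    "\n"
    "Name: John Doe2 \naddress : somewhere \nphone: 123-123-1233 \n"
)

for row in records_to_rows(sample):
    print(row)
```

Reading the real file would replace `sample` with the result of `open(path).read()`, which is fine for a file of roughly 10,000 short records.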

0

In Python:

results = []
cur_item = None

with open('/root/docs/information') as f:
    for line in f:
        if ':' not in line:
            continue  # skip the blank lines between records
        key, value = line.split(':', 1)
        key = key.strip()
        value = value.strip()

        if key == "Name":
            # a Name line starts a new record
            cur_item = {}
            results.append(cur_item)
        cur_item[key] = value

for item in results:
    print(item)
+0

You should state the language ;) –

+0

@sputnick I don't quite understand what you mean –

+0

Just say which language it is: it is Python. – Matthias

0

You should be able to parse this using the split() method on strings:

line = "Name: John Doe1"
key, value = line.split(":", 1)  # split on the first colon only
print(key)            # Name
print(value.strip())  # John Doe1 (strip() drops the space after the colon)
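
The sample lines in the question have spaces around the colon ("address : somewhere") and trailing spaces, and a real value could itself contain a colon. A small helper along these lines may be more forgiving; `parse_line` is a hypothetical name for this sketch and is not part of the answer above.

```python
def parse_line(line):
    """Return (key, value) for a 'key : value' line, or None if the line has no colon."""
    # partition() splits on the first colon only, so colons in the value survive.
    key, sep, value = line.partition(":")
    if not sep:
        return None  # blank line or a line without a colon
    return key.strip(), value.strip()

print(parse_line("address : somewhere "))
print(parse_line("note: call after 5:30"))
print(parse_line(""))
```

Returning None for lines without a colon lets the caller skip the blank separator lines without a try/except.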
3

With a short Perl one-liner:

$ perl -ne 'END{print "\n"}chomp; /^$/ ? print "\n" : print "$_\t\t"' file.txt 

Output:

Name: John Doe1   address : somewhere    phone: 123-123-1234 
Name: John Doe2   address : somewhere    phone: 123-123-1233 
Name: John Doe3   address : somewhere    phone: 123-123-1232 
1

This seems to do basically what you want:

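information = 'information'  # file path

with open(information, 'rt') as infile:  # avoid shadowing the built-in input()
    data = infile.read()

# records are separated by one blank line
data = data.split('\n\n')

for group in data:
    # join the lines of each record with two spaces
    print(group.replace('\n', '  '))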

Output:

Name: John Doe1  address : somewhere  phone: 123-123-1234 
Name: John Doe2  address : somewhere  phone: 123-123-1233 
Name: John Doe3  address : somewhere  phone: 123-123-1232  
0

You can iterate over the lines and print them in columns like this:

for line in open("/path/to/data"):
    if len(line) != 1:
        # strip the \n from the end of the line and suppress the one
        # print() would add, so the next field continues on the same row
        # (Python 3 syntax)
        print("%s\t\t" % line.strip(), end="")
    else:
        # reuse the blank lines to end the current row
        print()
2

Using paste, you can join the lines of the file:

$ paste -s -d"\t\t\t\n" file 
Name: John Doe1 address : somewhere  phone: 123-123-1234 
Name: John Doe2 address : somewhere  phone: 123-123-1233 
Name: John Doe3 address : somewhere  phone: 123-123-1232 
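
The delimiter list may look odd at first: with -s, paste uses the characters of the -d list one after another and starts over when the list runs out. Three tabs followed by a newline therefore join four input lines (three fields plus the blank separator line) into one output row. A rough Python model of that behaviour is sketched below; `paste_serial` is a hypothetical name and the sketch does not cover every paste option.

```python
from itertools import cycle

def paste_serial(lines, delimiters):
    """Join lines roughly the way `paste -s -d LIST` does:
    the delimiters are used in turn and reused cyclically."""
    out = []
    delims = cycle(delimiters)
    for i, line in enumerate(lines):
        out.append(line)
        if i < len(lines) - 1:
            out.append(next(delims))  # a delimiter goes between lines only
    return "".join(out) + "\n"

# Hypothetical sample: two records separated by one blank line.
lines = [
    "Name: John Doe1", "address : somewhere", "phone: 123-123-1234", "",
    "Name: John Doe2", "address : somewhere", "phone: 123-123-1233",
]
print(paste_serial(lines, "\t\t\t\n"), end="")
```

Because the third tab lands before the blank line, every row except the last ends in a trailing tab, and fields are separated by a single tab rather than two.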
+0

Not so nicely formatted =) –

+0

@sputnick True, but that is hardly difficult to fix. There are countless utilities for expanding tabs. –

+0

Yes, but in that case you need 2 pipes ;) –

1

I know you didn't mention awk, but it solves your problem nicely:

awk 'BEGIN {RS="";FS="\n"} {print $1,$2,$3}' data.txt 
0
#!/usr/bin/env python

def parse(inputfile, outputfile):
    dictInfo = {'Name': None, 'address': None, 'phone': None}
    for line in inputfile:
        if line.startswith('Name'):
            dictInfo['Name'] = line.split(':')[1].strip()
        elif line.startswith('address'):
            dictInfo['address'] = line.split(':')[1].strip()
        elif line.startswith('phone'):
            dictInfo['phone'] = line.split(':')[1].strip()
            # phone is the last field of a record, so write the row now
            s = 'Name: ' + dictInfo['Name'] + '\t' + 'address: ' + dictInfo['address'] \
                + '\t' + 'phone: ' + dictInfo['phone'] + '\n'
            outputfile.write(s)

if __name__ == '__main__':
    with open('output.txt', 'w') as outputfile:
        with open('infomation.txt') as inputfile:
            parse(inputfile, outputfile)
0

A solution using sed.

cat input.txt | sed '/^$/d' | sed 'N; s:\n:\t\t:; N; s:\n:\t\t:' 
  1. The first stage of the pipe, sed '/^$/d', removes the blank lines.
  2. The second stage of the pipe, sed 'N; s:\n:\t\t:; N; s:\n:\t\t:', joins the lines in groups of three.
 
Name: John Doe1  address : somewhere  phone: 123-123-1234 
Name: John Doe2  address : somewhere  phone: 123-123-1233 
Name: John Doe3  address : somewhere  phone: 123-123-1232 
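
The same two steps (drop the blank lines, then join every three lines) can be sketched in Python. Like the sed version, this assumes every record has exactly three lines; a record with a missing field would shift all later rows. `join_in_threes` is a hypothetical name, and unlike `/^$/d` this sketch also drops whitespace-only lines.

```python
def join_in_threes(lines, sep="\t\t"):
    """Drop blank lines, then join each run of three lines into one row."""
    # Step 1: remove blank lines and trailing newlines.
    non_blank = [line.rstrip("\n") for line in lines if line.strip()]
    # Step 2: join consecutive groups of three lines with the separator.
    return [sep.join(non_blank[i:i + 3]) for i in range(0, len(non_blank), 3)]

# Hypothetical sample in the question's format.
sample_lines = [
    "Name: John Doe1\n", "address : somewhere\n", "phone: 123-123-1234\n",
    "\n",
    "Name: John Doe2\n", "address : somewhere\n", "phone: 123-123-1233\n",
]
for row in join_in_threes(sample_lines):
    print(row)
```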
1

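Most of the solutions here just reformat the data in the file you are reading. Maybe that is all you want.

If you actually want to parse the data, put it into a data structure.

This example is in Python: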

data="""\ 
Name: John Doe2 
address : 123 Main St, Los Angeles, CA 95002 
phone: 213-123-1234 

Name: John Doe1 
address : 145 Pearl St, La Jolla, CA 92013 
phone: 858-123-1233 

Name: Billy Bob Doe3 
address : 454 Heartland St, Mobile, AL 00103 
phone: 205-123-1232""".split('\n\n')  # just a fill-in for your file 
              # you would use `with open(file) as data:` 

addr={} 
w0,w1,w2=0,0,0    # these keep track of the max width of the field 
for line in data: 
    fields=[e.split(':')[1].strip() for e in [f for f in line.split('\n')]] 
    nam=fields[0].split() 
    name=nam[-1]+', '+' '.join(nam[0:-1]) 
    addr[(name,fields[2])]=fields 
    w0,w1,w2=[max(t) for t in zip(map(len,fields),(w0,w1,w2))] 

Now you are free to sort the data, change its format, put it in a database, and so on.

This prints the data formatted and sorted:

for add in sorted(addr.keys()):
    print('Name: {0:{w0}} Address: {1:{w1}} phone: {2:{w2}}'.format(*addr[add], w0=w0, w1=w1, w2=w2))

Prints:

Name: John Doe1      Address: 145 Pearl St, La Jolla, CA 92013   phone: 858-123-1233
Name: John Doe2      Address: 123 Main St, Los Angeles, CA 95002 phone: 213-123-1234
Name: Billy Bob Doe3 Address: 454 Heartland St, Mobile, AL 00103 phone: 205-123-1232

That output is sorted by the last name, first name used as the dictionary key.

Now print it sorted by area code:

for add in sorted(addr.keys(), key=lambda x: addr[x][2]):
    print('Name: {0:{w0}} Address: {1:{w1}} phone: {2:{w2}}'.format(*addr[add], w0=w0, w1=w1, w2=w2))

Prints:

Name: Billy Bob Doe3 Address: 454 Heartland St, Mobile, AL 00103 phone: 205-123-1232
Name: John Doe2      Address: 123 Main St, Los Angeles, CA 95002 phone: 213-123-1234
Name: John Doe1      Address: 145 Pearl St, La Jolla, CA 92013   phone: 858-123-1233

But since you have the data in an indexed dictionary, you can also print it as a table sorted by zip code:

# print table header
print('|{0:^{w0}}|{1:^{w1}}|{2:^{w2}}|'.format('Name', 'Address', 'Phone', w0=w0+2, w1=w1+2, w2=w2+2))
print('|{0:^{w0}}|{1:^{w1}}|{2:^{w2}}|'.format('----', '-------', '-----', w0=w0+2, w1=w1+2, w2=w2+2))
# print data sorted by last field of the address - probably a zip code
for add in sorted(addr.keys(), key=lambda x: addr[x][1].split()[-1]):
    print('|{0:>{w0}}|{1:>{w1}}|{2:>{w2}}|'.format(*addr[add], w0=w0+2, w1=w1+2, w2=w2+2))

Prints:

|      Name      |              Address               |    Phone     |
|      ----      |              -------               |    -----     |
|  Billy Bob Doe3|  454 Heartland St, Mobile, AL 00103|  205-123-1232|
|       John Doe1|    145 Pearl St, La Jolla, CA 92013|  858-123-1233|
|       John Doe2|  123 Main St, Los Angeles, CA 95002|  213-123-1234|
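
Once the fields are parsed, moving them elsewhere is also straightforward. As one possible illustration, the sketch below writes parsed records to CSV with the standard csv module. The function name `records_to_csv`, the header names and the in-memory buffer are assumptions made for the example; a real script would pass an open file instead.

```python
import csv
import io

def records_to_csv(records, out):
    """Write [name, address, phone] records to a file-like object as CSV."""
    writer = csv.writer(out)
    writer.writerow(["Name", "Address", "Phone"])  # header names chosen for this sketch
    for fields in records:
        # The csv module quotes fields that contain commas, such as addresses.
        writer.writerow(fields)

# Hypothetical records in the same shape as the `fields` lists above.
records = [
    ["John Doe2", "123 Main St, Los Angeles, CA 95002", "213-123-1234"],
    ["John Doe1", "145 Pearl St, La Jolla, CA 92013", "858-123-1233"],
]

buf = io.StringIO()  # stands in for an open output file
records_to_csv(records, buf)
print(buf.getvalue())
```

A comma-separated file with quoted fields is safer than tab-joined text here, because the addresses themselves contain commas and spaces.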