2013-05-30
0

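Rearranging data for a pandas DataFrame?

I receive a tab-delimited file from a server that lists each respondent's answers to a set of questions. I want to import the data into a pandas DataFrame in which each column is a question and each row holds one respondent's answers. Here is what the data for one respondent looks like: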

[2072] Anonymous 
Q-0 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.14 Student (Graduate/ Undergraduate) 
Q-1 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.00  
Q-1 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.00 1|1|1|1|4| 
Q-2 [01] Sat 25 May 2013 7:43 PM UTC +0000 1.00 1-3 
Q-3 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.50 Male 
Q-4 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.33 18-24 
Q-5 [01] Sat 25 May 2013 7:43 PM UTC +0000 1.00  
Q-6 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.00 Prefer not to answer 
Q-7 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.50 Yes 
Q-8 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.13 Bachelor's Degree 
Q-9 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.00 Other 
Q-10 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.00 Mathematics 
Q-11 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.33 High school 
Q-11 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.33 College (introductory courses) 
Q-12 [01] Sat 25 May 2013 7:43 PM UTC +0000 1.00 Professional 
Q-13 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.50 Mac OS X 
Q-14 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.25 Every week 
Q-15 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.00 A test that proves or disproves of some abstract theory about the world 
Q-16 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.00  
Q-17 [01] Sat 25 May 2013 7:43 PM UTC +0000 2.00 Yes 
Q-18 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.00  
Q-19 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.20 Timely feedback from the instructor 
Q-20 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.00  

There is a carriage return between one respondent's answers and the next. Thanks for your help!

+0

Hmm... why the downvote, gang? This seems like a good use case that could apply to other people. –

Answers

1

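The non-trivial step is delimiting each respondent's block. How about rewriting the file so that every line is prefixed with the respondent's ID? For example, in the case of "Anonymous" I see "2072".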

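import re

respondent_id = None
with open('filename') as src, open('new_file', 'w') as dst:
    for line in src:
        # A line is either a header like "[####] Student_Name" or an answer like "Q-...".
        m = re.match(r'\[(\d+)\] .*', line)
        if m:
            # Header line: remember the respondent's ID and move on.
            respondent_id = m.group(1)
            continue
        if not line.strip():
            # Blank separator line between respondents: nothing to write.
            continue
        # Answer line: write it back prefixed with the ID and a tab, e.g. "####<TAB>Q-...".
        # The tab assumes answer lines do not already start with a delimiter.
        dst.write(respondent_id + '\t' + line)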

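Then load this modified file with pandas read_csv, assigning the first two columns to the index. (Together they form a MultiIndex.) Then use unstack to turn the question level of the index into columns.

(Full disclosure: I tested the regex, but I have not tested the whole thing.)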
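
The read_csv and unstack steps described above might look roughly like the sketch below. It is not part of the original answer and rests on several assumptions: the rewritten file is tab-separated, each answer line has the fields question, code, timestamp, score and answer (these column names are made up for illustration), and the inline sample stands in for new_file. Because a question such as Q-11 can appear more than once for the same respondent, and unstack raises an error on duplicate index entries, repeated answers are joined into one string first.

```python
import io
import pandas as pd

# Hypothetical sample of the rewritten file: respondent ID, then the original
# tab-separated fields. The exact field layout is an assumption.
sample = (
    "2072\tQ-0\t[01]\tSat 25 May 2013 7:43 PM UTC +0000\t0.14\tStudent\n"
    "2072\tQ-3\t[01]\tSat 25 May 2013 7:43 PM UTC +0000\t0.50\tMale\n"
    "2072\tQ-11\t[01]\tSat 25 May 2013 7:43 PM UTC +0000\t0.33\tHigh school\n"
    "2072\tQ-11\t[01]\tSat 25 May 2013 7:43 PM UTC +0000\t0.33\tCollege\n"
    "2073\tQ-0\t[01]\tSun 26 May 2013 9:10 AM UTC +0000\t0.14\tFaculty\n"
    "2073\tQ-3\t[01]\tSun 26 May 2013 9:10 AM UTC +0000\t0.50\tFemale\n"
)

# Column names are illustrative; the file itself has no header row.
columns = ["respondent", "question", "code", "timestamp", "score", "answer"]
df = pd.read_csv(
    io.StringIO(sample), sep="\t", header=None, names=columns, index_col=[0, 1]
)

# Drop empty answers, then collapse repeated (respondent, question) pairs
# into a single string so the index is unique before unstacking.
answers = df["answer"].dropna().astype(str)
joined = answers.groupby(level=["respondent", "question"]).agg("; ".join)

# One row per respondent, one column per question.
wide = joined.unstack("question")
print(wide)
```

For the real data, `io.StringIO(sample)` would be replaced by the path of the rewritten file. Joining duplicates with "; " is only one option; keeping the first or last answer would work just as well if that suits the survey better.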

+0

Actually, if the blocks are a fixed size (say 10 rows each), you could just read the file in and then use BinGrouper, I think – Jeff

+0

Cool. I didn't know that was a thing. –

+0

Actually, this is easier: `df.groupby(df.index.to_series() // 3).sum()` (every 3 rows). With `BinGrouper` you have to specify the labels directly – Jeff
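
A minimal sketch of the fixed-size-block idea from the comment above, using a made-up frame with a default integer index and exactly three rows per block. It only applies when every respondent has the same number of rows, which the sample in the question may not satisfy. Under Python 3 the division has to be floor division (`//`), otherwise every row ends up in its own group.

```python
import pandas as pd

# Hypothetical data: two blocks of exactly three rows each.
df = pd.DataFrame({"value": [1, 2, 3, 4, 5, 6]})

# Integer-divide the row position by the block size:
# rows 0-2 get label 0, rows 3-5 get label 1.
block = df.index.to_series() // 3

# Aggregate each block; sum() is just the aggregation used in the comment.
totals = df.groupby(block).sum()
print(totals)
```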

0

Here is what worked for me:

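import re

with open('filename') as src, open('new_file', 'w') as dst:
    for line in src:
        # Header lines start with the bracketed respondent ID, e.g. "[2072]".
        m = re.match(r'\[\d+\]', line)
        if m:
            # Keep the matched text, brackets included, as the ID.
            respondent_id = m.group()
            continue
        # Every other line gets the current ID prepended.
        dst.write(str(respondent_id) + line)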