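Rearranging data in a pandas DataFrame?

I receive a tab-delimited file from a server that lists the answers to each question, one respondent at a time. I want to import the data into a pandas DataFrame in which each column is a question and each row holds one respondent's answers. Here is what a single respondent's block looks like: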

[2072] Anonymous 
Q-0 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.14 Student (Graduate/ Undergraduate) 
Q-1 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.00  
Q-1 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.00 1|1|1|1|4| 
Q-2 [01] Sat 25 May 2013 7:43 PM UTC +0000 1.00 1-3 
Q-3 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.50 Male 
Q-4 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.33 18-24 
Q-5 [01] Sat 25 May 2013 7:43 PM UTC +0000 1.00  
Q-6 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.00 Prefer not to answer 
Q-7 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.50 Yes 
Q-8 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.13 Bachelor's Degree 
Q-9 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.00 Other 
Q-10 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.00 Mathematics 
Q-11 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.33 High school 
Q-11 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.33 College (introductory courses) 
Q-12 [01] Sat 25 May 2013 7:43 PM UTC +0000 1.00 Professional 
Q-13 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.50 Mac OS X 
Q-14 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.25 Every week 
Q-15 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.00 A test that proves or disproves of some abstract theory about the world 
Q-16 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.00  
Q-17 [01] Sat 25 May 2013 7:43 PM UTC +0000 2.00 Yes 
Q-18 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.00  
Q-19 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.20 Timely feedback from the instructor 
Q-20 [01] Sat 25 May 2013 7:43 PM UTC +0000 0.00

Each respondent's block of answers is separated from the next by a carriage return. Thanks for your help!

2013-05-30 dannycab

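Hmm... why the downvote, gang? This seems like a good use case that could apply to others.

The nontrivial step is delimiting each respondent's block. How about rewriting the file so that every line is prefixed with the respondent's ID? For example, for "Anonymous" I see "2072".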

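import re

# Prefix every answer line with the ID from the preceding "[####] Name" header.
with open('filename') as src, open('new_file', 'w') as dst:
    respondent_id = None
    for line in src:
        # line might be like "[####] Student_Name" or "Q-..."
        m = re.match(r'\[(\d+)\] .*', line)
        if m:
            # Line is like "[####] Student_Name".
            respondent_id = m.group(1)
            continue
        if not line.strip():
            # Blank separator line between respondents.
            continue
        # Line is like "Q-..."; write a new line like "####<tab>Q-...".
        dst.write(respondent_id + '\t' + line)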

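Then load the modified file with pandas read_csv, assigning the first two columns to the index. (Together they form a MultiIndex.) Then use unstack to turn the question level of the index into columns.

(Full disclosure: I tested the regex, but I have not tested the whole thing.)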
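The read_csv and unstack step described above could look roughly like the sketch below. It is only an illustration: the inline data, the column names (respondent, question, answer) and the "; " separator are assumptions, and the timestamp and score columns are left out for brevity. Because the sample in the question repeats some questions for one respondent (Q-1, Q-11), and unstack refuses duplicate index entries, the sketch joins repeated answers before reshaping.

```python
import io

import pandas as pd

# Hypothetical contents of the rewritten file: respondent ID, question and
# answer, tab-separated (timestamp and score columns omitted for brevity).
rewritten = (
    "2072\tQ-0\tStudent\n"
    "2072\tQ-3\tMale\n"
    "2072\tQ-5\t\n"
    "2072\tQ-11\tHigh school\n"
    "2072\tQ-11\tCollege (introductory courses)\n"
    "2073\tQ-0\tFaculty\n"
    "2073\tQ-3\tFemale\n"
)

df = pd.read_csv(
    io.StringIO(rewritten),
    sep="\t",
    header=None,
    names=["respondent", "question", "answer"],
)

# Empty answers are read as NaN; drop them so the string join below works.
df = df.dropna(subset=["answer"])

# A question can appear more than once per respondent (Q-11 here), and
# unstack() rejects duplicate index entries, so join the repeats first.
answers = df.groupby(["respondent", "question"])["answer"].agg("; ".join)

# Move the question level of the MultiIndex into the columns.
wide = answers.unstack("question")
print(wide)
```

Questions a respondent did not answer end up as NaN in the wide frame. With a real file, the io.StringIO object would be replaced by the path to the rewritten file.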

2013-05-30 15:28:01

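Actually, if the blocks have a fixed size (say, 10 rows each), you could just read the file in and then use BinGrouper, I think – Jeff

Cool. I didn't know that was a thing.

Actually, it's easier to do this: `df.groupby(df.index.to_series() // 3).sum()` (every 3 rows). `BinGrouper` needs the labels specified directly – Jeff

Here is what worked for me: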
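As a small illustration of the fixed-size grouping idea in the comment above, the following sketch groups a frame into blocks of three rows by integer-dividing the row position. The frame, the column name score and the variable block_size are made up for the example, and the approach only applies when every block really has the same number of rows, which the sample survey data (with its repeated questions) may not guarantee.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a default 0..n-1 integer index.
df = pd.DataFrame({"score": [1, 2, 3, 4, 5, 6, 7]})

block_size = 3
# Integer-divide each row position by the block size to get a block label:
# rows 0-2 -> 0, rows 3-5 -> 1, row 6 -> 2.
block = np.arange(len(df)) // block_size

# Aggregate each block; the last block is simply shorter.
totals = df.groupby(block).sum()
print(totals)
```

Using the row position rather than the index keeps the sketch working even if the index is not a plain 0..n-1 range.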

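import re

with open('filename') as src, open('new_file', 'w') as dst:
    respondent_id = None
    for line in src:
        # Header lines look like "[####] Name"; keep the bracketed ID.
        m = re.match(r'\[\d+\]', line)
        if m:
            respondent_id = m.group()
            continue
        if not line.strip():
            continue  # skip the blank line between respondents
        # Tab-separate the ID so it becomes its own column.
        dst.write(respondent_id + '\t' + line)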

2013-05-30 16:21:40 dannycab
