2015-12-08 162 views
0

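I'm using the JotForm Configurable List widget to collect data, but I'm having trouble parsing the resulting data correctly. When I read the CSV with Python/pandas using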

testdf = pd.read_csv ("TestLoad.csv") 

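the data is read in as two records, with the details stored in the "Information" column. I understand why it is parsed this way, but I would like to break the details out into multiple records, as shown below.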

Any help would be appreciated.

Sample CSV

"Date","Information","Type" 
"2015-12-06","First: Tom, Last: Smith, School: MCAA; First: Tammy, Last: Smith, School: MCAA;","New" 
"2015-12-06","First: Jim, Last: Jones, School: MCAA; First: Jane, Last: Jones, School: MCAA;","New" 

Current result

Date  Information                  Type 
2015-12-06 First: Tom, Last: Smith, School: MCAA; First: Tammy, Last: Smith, School: MCAA; New 
2015-12-06 First: Jim, Last: Jones, School: MCAA; First: Jane, Last: Jones, School: MCAA; New 

Desired result

Date        First  Last   School  Type
2015-12-06  Tom    Smith  MCAA    New
2015-12-06  Tammy  Smith  MCAA    New
2015-12-06  Jim    Jones  MCAA    New
2015-12-06  Jane   Jones  MCAA    New

Answers

2

This is filler text, required by a moderator to keep this as an answer. Here is the data I used:

"Date","Information","Type" 
"2015-12-07","First: Jim, Last: Jones, School: MCAA; First: Jane, Last: Jones, School: MCAA;","Old" 
"2015-12-06","First: Tom, Last: Smith, School: MCAA; First: Tammy, Last: Smith, School: MCAA;","New" 

import pandas as pd 
import numpy as np 
import csv 
import re 
import itertools as it 
import pprint 
import datetime as dt 

records = [] #Construct a complete record for each person 

colon_pairs = r""" 
    (\w+) #Match a 'word' character, one or more times, captured in group 1, followed by.. 
    :  #A colon, followed by... 
    \s*  #Whitespace, 0 or more times, followed by... 
    (\w+) #A 'word' character, one or more times, captured in group 2. 
""" 

colon_pairs_per_person = 3 

with open("csv1.csv", encoding='utf-8') as f: 
    next(f) #skip header line 
    record = {} 

    for date, info, the_type in csv.reader(f): 
        info_parser = re.finditer(colon_pairs, info, flags=re.X) 

        for i, match_obj in enumerate(info_parser): 
            key, val = match_obj.groups() 
            record[key] = val 

            if (i+1) % colon_pairs_per_person == 0: #then done with info for a person 
                record['Date'] = dt.datetime.strptime(date, '%Y-%m-%d') #So that you can sort the DataFrame rows by date. 
                record['Type'] = the_type 

                records.append(record) 
                record = {} 

pprint.pprint(records) 
df = pd.DataFrame(
     sorted(records, key=lambda record: record['Date']) 
) 
print(df) 
df.set_index('Date', inplace=True) 
print(df) 

--output:-- 
[{'Date': datetime.datetime(2015, 12, 7, 0, 0), 
    'First': 'Jim', 
    'Last': 'Jones', 
    'School': 'MCAA', 
    'Type': 'Old'}, 
{'Date': datetime.datetime(2015, 12, 7, 0, 0), 
    'First': 'Jane', 
    'Last': 'Jones', 
    'School': 'MCAA', 
    'Type': 'Old'}, 
{'Date': datetime.datetime(2015, 12, 6, 0, 0), 
    'First': 'Tom', 
    'Last': 'Smith', 
    'School': 'MCAA', 
    'Type': 'New'}, 
{'Date': datetime.datetime(2015, 12, 6, 0, 0), 
    'First': 'Tammy', 
    'Last': 'Smith', 
    'School': 'MCAA', 
    'Type': 'New'}] 

        Date  First   Last School Type
0 2015-12-06    Tom  Smith   MCAA  New
1 2015-12-06  Tammy  Smith   MCAA  New
2 2015-12-07    Jim  Jones   MCAA  Old
3 2015-12-07   Jane  Jones   MCAA  Old

            First   Last School Type
Date
2015-12-06    Tom  Smith   MCAA  New
2015-12-06  Tammy  Smith   MCAA  New
2015-12-07    Jim  Jones   MCAA  Old
2015-12-07   Jane  Jones   MCAA  Old
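
A possible variation on the code above, offered only as a sketch and not as the answer author's code: instead of counting three key/value pairs per person, split the Information field on ';' so that each chunk describes one person. That should cope with any number of people per row. The inline SAMPLE data, the PAIR pattern name and the explode_people helper are made up for illustration, and \w+ will cut off values that contain spaces or punctuation.

```python
import csv
import io
import re

# Hypothetical in-memory sample; the second row holds only one person.
SAMPLE = '''"Date","Information","Type"
"2015-12-07","First: Jim, Last: Jones, School: MCAA; First: Jane, Last: Jones, School: MCAA;","Old"
"2015-12-06","First: Tom, Last: Smith, School: MCAA;","New"
'''

# Same key/value idea as colon_pairs: a word, a colon, optional spaces, a word.
PAIR = re.compile(r"(\w+):\s*(\w+)")


def explode_people(text):
    """Return one dict per person found in the CSV text."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        # Each person's block ends with ';', so split there.
        for chunk in row["Information"].split(";"):
            pairs = dict(PAIR.findall(chunk))
            if pairs:  # the piece after the final ';' is empty, skip it
                rows.append({"Date": row["Date"], **pairs, "Type": row["Type"]})
    return rows


people = explode_people(SAMPLE)
for person in people:
    print(person)
```

The resulting list of dicts can be passed to pd.DataFrame in the same way as records above.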

7stud - Thanks for your solution. This is the approach I ended up using, because the number of people in a record can be 1:n. – Zymurgist66

0

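I used a regex separator with the python engine so that I could specify multiple delimiters. I then used the usecols parameter to select which columns of the csv file should end up in the dataframe. The header is not read from the file, since it does not contain any of the data, so I skipped the first row. I read the first set of records and the second set of records into 2 dataframes and then concatenated the 2 dataframes.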

a = pd.read_csv('sample.csv', sep=',|:|;', skiprows = 1, usecols = (0,2,4,6, 14), header = None, engine='python') 
b = pd.read_csv('sample.csv', sep=',|:|;', skiprows = 1, usecols = (0,8,10,12,14), header = None, engine='python') 
a.columns = ['Date', 'First', "Last", 'School', 'Type'] 
b.columns = ['Date', 'First', "Last", 'School', 'Type'] 
final_data = pd.concat([a,b], axis = 0) 
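
To see where the usecols positions come from, here is a small sketch that splits one sample line with the same pattern. It assumes the python engine splits each line on the regex much like re.split does and gives the quotes no special treatment, which matches the quoted values in the output further down; the variable names are only for illustration.

```python
import re

# One data line from the sample CSV, exactly as it appears in the file.
line = ('"2015-12-06","First: Tom, Last: Smith, School: MCAA; '
        'First: Tammy, Last: Smith, School: MCAA;","New"')

# Same separator pattern as sep=',|:|;' above.
fields = re.split(r",|:|;", line)

# Print each field with its position so the usecols numbers can be checked.
for position, field in enumerate(fields):
    print(position, repr(field))
```

Positions 2, 4, 6 hold the first person, 8, 10, 12 the second, and 14 the Type. The values keep their leading spaces and quote characters, and a row with a different number of people shifts every position, so this only fits exactly two people per row.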

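If you need the order preserved, so that the second name appears directly below the first name, you can sort by the index. I use mergesort because it is a stable sort, which ensures that the first Information record of each row (from a) stays above the second one (from b).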

final_data.sort_index(kind='mergesort', inplace = True) 
>>>final_data 
     Date  First Last  School Type 
0 "2015-12-06" Tom Smith MCAA "New" 
0 "2015-12-06" Tammy Smith MCAA "New" 
1 "2015-12-06" Jim Jones MCAA "New" 
1 "2015-12-06" Jane Jones MCAA "New" 
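
A minimal sketch of why the stable sort matters, using two tiny made-up frames in place of a and b: after the concat the index is 0, 1, 0, 1, and mergesort keeps rows with equal index in their concat order.

```python
import pandas as pd

# Stand-ins for the two frames read above (illustrative data only).
a = pd.DataFrame({"First": ["Tom", "Jim"]})     # first person of each row
b = pd.DataFrame({"First": ["Tammy", "Jane"]})  # second person of each row

stacked = pd.concat([a, b], axis=0)             # index is 0, 1, 0, 1
ordered = stacked.sort_index(kind="mergesort")  # stable: ties keep concat order

print(ordered)
```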

Edit: Included the second set of records in the data. Changed the axis to 0.

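
Thanks for the approach. I was able to replicate it, but when I tried it, the code did not pick up the second name in each row (e.g., Tammy Smith and Jane Jones). Is there something I need to do differently to iterate through the text in the "Information" column? – Zymurgist66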

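
@Zymurgist66 Do the records have to appear so that Tom Smith is directly above Tammy Smith? In any case, I edited my answer to read both sets of names and added an option to preserve the order. – imp9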

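
user1435522 - No, the order doesn't matter. The initial example I tested had only 2 people per record. When I tried it on the whole dataset, I found that the number of people can be 1:n, so I ended up needing to iterate over the people. – Zymurgist66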