2017-07-14 44 views
1

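How to use dates to split a DataFrame column into multiple columns in Python

I have a DataFrame `data` with 2 columns, ID and Text. The goal is to split the values in the Text column into multiple columns based on dates. Normally a date starts a run of string values that belong together in one column, unless the date sits at the end of the string (in which case it is treated as part of the string that started with the previous date).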

data: 
ID  Text 
10  6/26/06 begin tramadol, penicilin X 6 CYCLES. 1000mg tylenol X 1 YR after 11/2007 
20  7/17/06-advil, qui; 
10  7/19/06-ibuprofen. 8/31/06-penicilin, tramadol; 
40  9/26/06-penicilin, tramadol; 
91  5/23/06-penicilin, amoxicilin, tylenol; 
84  10/20/06-ibuprofen, tramadol; 
17  12/19/06-vit D, tramadol. 12/1/09 -6/18/10 vit D only for 5 months. 3/7/11 f/up 
23  12/19/06-vit D, tramadol; 12/1/09 -6/18/10 vit D; 3/7/11 video follow-up 
15  Follow up appt. scheduled 
69  talk to care giver 
32  12/15/06-2/16/07 everyday Follow-up; 6/8/16 discharged after 2 months 
70  12/1/06?Follow up but no serious allergies 
70  12/12/06-tylenol, vit D,advil; 1/26/07 scheduled surgery but had to cancel due to severe allergic reactions to advil 

Expected output:

ID  Text                     Text2                     Text3 
10  6/26/06 begin tramadol, penicilin X 6 CYCLES. 1000mg tylenol X 1 YR after 11/2007 
20  7/17/06-advil, qui; 
10  7/19/06-ibuprofen.                  8/31/06-penicilin, tramadol; 
40  9/26/06-penicilin, tramadol; 
91  5/23/06-penicilin, amoxicilin, tylenol; 
84  10/20/06-ibuprofen, tramadol; 
17  12/19/06-vit D, tramadol.                12/1/09 -6/18/10 vit D only for 5 months.            3/7/11 f/up 
23  12/19/06-vit D, tramadol;                12/1/09 -6/18/10 vit D;                 3/7/11 video follow-up 
15  Follow up appt. scheduled 
69  talk to care giver 
32  12/15/06-2/16/07 everyday Follow-up;             6/8/16 discharged after 2 months 
70  12/1/06?Follow up but no serious allergies 
70  12/12/06-tylenol, vit D,advil;               1/26/07 scheduled surgery but had to cancel due to severe allergic reactions to advil 

My code so far:

import re
import datefinder

d = []
for i in data.Text:
    d = list(datefinder.find_dates(i)) #I can get the dates so far but still want to format the date values as %m/%d/%Y

    if len(d) > 1: #Checks for every record that has more than 1 date
        for j in range(0, len(d)):
            i = " " + " ".join(re.split(r'[^a-z 0-9/-]', i.lower())) + " " #cleans the text strings of any special characters
            #data.Text[j] = d[j]r'[/^(.*?)]'d[j+1]'/'#this is not working

            #The goal is for the Text column to retain the string from the first date up to before the second date. Then create a new Text1, get every value from the second date up to before the third date. And if there are more dates, create Textn and so on.
            #Exception, if a date immediately follows a date (i.e. 12/1/09 -6/18/10) or a date ends a value string (i.e. 6/26/06 begin tramadol, penicilin X 6 CYCLES. 1000mg tylenol X 1 YR after 11/2007), they should be considered to be in the same column

Any ideas on how to make this work would save my day. Thanks!

Will all the relevant dates be in MM/DD/YY format? –

@Brad Solomon - Preferably in mm/dd/yyyy. Thanks! – CodeLearner

I mean in your input data –
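
On the side question of formatting the dates as %m/%d/%Y: a minimal sketch, assuming the input dates always carry a two-digit year as in the sample data. It relies only on the standard library's `datetime.strptime` and `strftime`; the helper name `to_long_date` is made up for illustration.

```python
from datetime import datetime


def to_long_date(short_date):
    """Convert an 'm/d/yy' string such as '6/26/06' to 'mm/dd/yyyy'."""
    # %y follows the POSIX convention: 00-68 become 2000-2068,
    # 69-99 become 1969-1999.
    parsed = datetime.strptime(short_date, "%m/%d/%y")
    return parsed.strftime("%m/%d/%Y")


print(to_long_date("6/26/06"))   # 06/26/2006
print(to_long_date("12/1/09"))   # 12/01/2009
```

A partial date such as 11/2007 does not fit this format and would raise a `ValueError`, so it would need separate handling.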

Answers

1

Here you go:

from itertools import chain, count, starmap, zip_longest
import re

import pandas as pd

ids = [10, 20, 10, 40, 91, 84, 17, 23, 15, 69, 32, 70, 70]

text = [
    "6/26/06 begin tramadol, penicilin X 6 CYCLES. 1000mg tylenol X 1 YR after 11/2007",
    "7/17/06-advil, qui;",
    "7/19/06-ibuprofen. 8/31/06-penicilin, tramadol;",
    "9/26/06-penicilin, tramadol;",
    "5/23/06-penicilin, amoxicilin, tylenol;",
    "10/20/06-ibuprofen, tramadol;",
    "12/19/06-vit D, tramadol. 12/1/09 -6/18/10 vit D only for 5 months. 3/7/11 f/up",
    "12/19/06-vit D, tramadol; 12/1/09 -6/18/10 vit D; 3/7/11 video follow-up",
    "Follow up appt. scheduled",
    "talk to care giver",
    "12/15/06-2/16/07 everyday Follow-up; 6/8/16 discharged after 2 months",
    "12/1/06?Follow up but no serious allergies",
    "12/12/06-tylenol, vit D,advil; 1/26/07 scheduled surgery but had to cancel due to severe allergic reactions to advil"]

# A split point is a date (m/d/yy), optionally followed by a range end
# ("-", "to" or "through" plus a second date), and then at least one
# non-space character, so a date at the very end of the string is ignored.
# Raw strings keep \d, \s and \S from being read as string escapes.
by_date = re.compile(
    r"""((?:0?[1-9]|1[012])/(?:0?[1-9]|[12]\d|3[01])/\d\d\s*"""
    r"""(?:(?:-|to |through)\s*(?:0?[1-9]|1[012])/(?:0?[1-9]|[12]\d|3[01])/\d\d)?\s*\S)""")


def to_items(line):
    # Offsets at which each date-led chunk begins.
    starts = [m.start() for m in by_date.finditer(line)]
    # Text before the first date (or a line with no date) is its own chunk.
    if not starts or starts[0] > 0:
        starts.insert(0, 0)
    # Pair each start with the next one; the last pair is (start, None).
    stops = iter(starts)
    next(stops)
    return map(line.__getitem__, starmap(slice, zip_longest(starts, stops)))


# Transpose the per-row chunks into per-column tuples, padded with None.
cleaned = zip_longest(*map(to_items, text))
col_names = chain(["Text"], map("Text{}".format, count(2)))
df = pd.DataFrame(dict(zip(col_names, cleaned), ID=ids))

print(df)
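
The `to_items` helper packs the slicing into `starmap` and `zip_longest`. The same idea can be sketched with an explicit list of start and end offsets, which may be easier to step through. This is a rewrite for illustration only, not part of the code above: `split_on_dates` and `DATE_START` are hypothetical names, the pattern is a simplified assumption (any `m/d/yy`, with an optional `-m/d/yy` range end), and unlike the code above it strips surrounding whitespace from each piece.

```python
import re

# Simplified assumption: m/d/yy, optionally followed by "-m/d/yy" (a range),
# then at least one non-space character so a date ending the line never matches.
DATE_START = re.compile(
    r"\d{1,2}/\d{1,2}/\d\d\s*(?:-\s*\d{1,2}/\d{1,2}/\d\d)?\s*\S"
)


def split_on_dates(line):
    """Cut `line` into pieces, each beginning where a date match begins."""
    starts = [m.start() for m in DATE_START.finditer(line)]
    # Leading text without a date (or a line with no date) forms the first piece.
    if not starts or starts[0] > 0:
        starts.insert(0, 0)
    # Each piece ends where the next one starts; the last ends at the line end.
    ends = starts[1:] + [len(line)]
    return [line[a:b].strip() for a, b in zip(starts, ends)]


print(split_on_dates("7/19/06-ibuprofen. 8/31/06-penicilin, tramadol;"))
print(split_on_dates("12/15/06-2/16/07 everyday Follow-up; 6/8/16 discharged after 2 months"))
print(split_on_dates("talk to care giver"))
```

The date range `12/15/06-2/16/07` stays in one piece because the optional group consumes the second date as part of the same match.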

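You are a lifesaver. Thanks! A quick observation: I found that a date at the end of a string still gets pulled into a new column, which should not happen. I mean, any date at the end of a string should be considered part of that string, so it should stay in the same column. How do we get rid of this wrong split? – CodeLearner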

Please see my comment above. Thanks. – CodeLearner

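@CodeLearner Are you talking about a line in the records? Sorry, I don't see a date at the end of a string forming a new column. Are you testing with other data? The regex uses `\S` to make sure there is content after the date. – frogcoder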
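
One possible explanation, which is only a guess since the other test data is not shown: `\S` accepts any non-space character, so a closing date followed by a period or semicolon still counts as a split point. The sketch below compares that behaviour with a stricter lookahead that requires a letter or digit somewhere after the date. The simplified `DATE` pattern, the two compiled patterns and `split_points` are all hypothetical names introduced for this illustration.

```python
import re

# Simplified assumption: any m/d/yy date.
DATE = r"\d{1,2}/\d{1,2}/\d\d"

# As in the answer: a date is a split point if any non-space character follows.
followed_by_anything = re.compile(DATE + r"\s*\S")

# Stricter variant: a letter or digit must appear somewhere after the date,
# so a date trailed only by punctuation is not a split point.
followed_by_text = re.compile(DATE + r"(?=\W*\w)")


def split_points(pattern, line):
    """Return the offsets at which `pattern` would start a new column."""
    return [m.start() for m in pattern.finditer(line)]


bare_end = "1/5/07 tylenol until 2/1/07"
punct_end = "1/5/07 tylenol until 2/1/07."

print(split_points(followed_by_anything, bare_end))    # [0]
print(split_points(followed_by_anything, punct_end))   # [0, 21]
print(split_points(followed_by_text, punct_end))       # [0]
```

With the stricter variant, a date that closes a sentence but is followed by more text is still treated as a split point, so it only addresses dates at the very end of the value.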
