2017-09-04

Extracting and parsing dates in a pandas DataFrame

I want to turn notes containing messy dates into a sorted series of dates in pandas.

0   03/25/93 Total time of visit (in minutes):\n 
1       6/18/85 Primary Care Doctor:\n 
2  sshe plans to move as of 7/8/71 In-Home Servic... 
3     7 on 9/27/75 Audit C Score Current:\n 
4  2/6/96 sleep studyPain Treatment Pain Level (N... 
5      .Per 7/06/79 Movement D/O note:\n 
6  4, 5/18/78 Patient's thoughts about current su... 
7  10/24/89 CPT Code: 90801 - Psychiatric Diagnos... 
8       3/7/86 SOS-10 Total Score:\n 
9    (4/10/71)Score-1Audit C Score Current:\n 
10  (5/11/85) Crt-1.96, BUN-26; AST/ALT-16/22; WBC... 
11       4/09/75 SOS-10 Total Score:\n 
12  8/01/98 Communication with referring physician... 
13  1/26/72 Communication with referring physician... 
14  5/24/1990 CPT Code: 90792: With medical servic... 
15  1/25/2011 CPT Code: 90792: With medical servic... 

I have several date formats, such as 04/20/2009; 04/20/09; 4/20/09; 4/3/09. I want to convert all of them to mm/dd/yyyy in a new column.

So far I have done:

df2['date']= df2['text'].str.extractall(r'(\d{1,2}[/-]\d{1,2}[/-]\d{2,})') 

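Also, I am not sure how to extract dates that appear only in MM/YY or YYYY format without interfering with the code above. Keep in mind that when the day or the month is missing, I would use the 1st and January as the default values.

Answer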


You can use pd.Series.str.extract with a regular expression and then apply pd.to_datetime:

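import pandas as pd

# Take the first date-like match in each row, then parse each string on its own.
df['Date'] = (df.Text.str.extract(r'(?P<Date>\d+(?:\/\d+){2})', expand=False)
                .apply(pd.to_datetime))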

df 

    Text                                               Date
 0  03/25/93 Total time of visit (in minutes):\n       1993-03-25
 1  6/18/85 Primary Care Doctor:\n                     1985-06-18
 2  sshe plans to move as of 7/8/71 In-Home Servic...  1971-07-08
 3  7 on 9/27/75 Audit C Score Current:\n              1975-09-27
 4  2/6/96 sleep studyPain Treatment Pain Level (N...  1996-02-06
 5  .Per 7/06/79 Movement D/O note:\n                  1979-07-06
 6  4, 5/18/78 Patient's thoughts about current su...  1978-05-18
 7  10/24/89 CPT Code: 90801 - Psychiatric Diagnos...  1989-10-24
 8  3/7/86 SOS-10 Total Score:\n                       1986-03-07
 9  (4/10/71)Score-1Audit C Score Current:\n           1971-04-10
10  (5/11/85) Crt-1.96, BUN-26; AST/ALT-16/22; WBC...  1985-05-11
11  4/09/75 SOS-10 Total Score:\n                      1975-04-09
12  8/01/98 Communication with referring physician...  1998-08-01
13  1/26/72 Communication with referring physician...  1972-01-26
14  5/24/1990 CPT Code: 90792: With medical servic...  1990-05-24
15  1/25/2011 CPT Code: 90792: With medical servic...  2011-01-25

str.extract returns a Series of strings whose values look like this:

array(['03/25/93', '6/18/85', '7/8/71', '9/27/75', '2/6/96', '7/06/79', 
     '5/18/78', '10/24/89', '3/7/86', '4/10/71', '5/11/85', '4/09/75', 
     '8/01/98', '1/26/72', '5/24/1990', '1/25/2011'], dtype=object) 
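As a small illustration of that intermediate step, the sketch below rebuilds three of the question's rows and compares str.extract with the str.extractall call used in the question. The variable names first_match and all_matches are made up for this example.

```python
import pandas as pd

# Three sample notes copied from the question's data.
df = pd.DataFrame({'Text': [
    '03/25/93 Total time of visit (in minutes):\n',
    'sshe plans to move as of 7/8/71 In-Home Servic...',
    '5/24/1990 CPT Code: 90792: With medical servic...',
]})

pattern = r'(?P<Date>\d+(?:\/\d+){2})'

# With one capture group and expand=False, extract gives a Series
# holding the first match of each row.
first_match = df.Text.str.extract(pattern, expand=False)

# extractall keeps every match per row and returns a DataFrame
# whose index has two levels: (original row, match number).
all_matches = df.Text.str.extractall(pattern)

print(first_match.tolist())       # ['03/25/93', '7/8/71', '5/24/1990']
print(all_matches.index.nlevels)  # 2
```

Because the extractall result carries a two-level index, it does not line up one-to-one with the rows of the original frame, which is likely why assigning it straight to a new column is awkward.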

Regex details

(?P<Date>\d+(?:\/\d+){2}) 
  • (?P<Date>....) - named capture group
  • \d+ - one or more digits
  • (?:\/\d+){2} - non-capturing group repeated twice, where
    • \/ - escaped slash
    • {2} - quantifier (exactly 2 repetitions)
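To see the pieces of the pattern at work outside pandas, here is a minimal sketch using Python's re module on a few lines from the question. The names date_re, samples and found are chosen only for this example.

```python
import re

date_re = re.compile(r'(?P<Date>\d+(?:\/\d+){2})')

samples = [
    '(4/10/71)Score-1Audit C Score Current:',
    '.Per 7/06/79 Movement D/O note:',
    '(5/11/85) Crt-1.96, BUN-26; AST/ALT-16/22; WBC...',
]

# search() returns the first match; the named group holds the date text.
found = [date_re.search(s).group('Date') for s in samples]
print(found)  # ['4/10/71', '7/06/79', '5/11/85']

# "16/22" has only one slash, so it is not a match on its own.
print(date_re.search('AST/ALT-16/22'))  # None

# \d+ puts no limit on the number of digits, so non-dates can match too.
print(date_re.search('ratio 100/200/300').group('Date'))  # 100/200/300
```

The last line shows that the pattern is deliberately loose. If the notes may contain other slash-separated numbers, a tighter pattern such as the \d{1,2} form from the question may be a safer choice.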

Regex for missing days

To handle an optional day, a slightly modified regex is needed:

(?P<Date>(?:\d+\/)?\d+/\d+) 

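Details

  • (?P<Date>....) - named capture group
  • (?:\d+\/)? - nested non-capturing group; the trailing ? makes \d+\/ optional
  • \d+ - one or more digits
  • \/ - escaped slash

The rest is the same. Use this regex in place of the current one. pd.to_datetime will handle the missing day. Two-digit MM/YY strings can be ambiguous with month/day, so it is worth checking how those are parsed.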
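If you would rather not depend on how pd.to_datetime guesses at incomplete dates, one option is to normalise the extracted strings yourself before parsing. The sketch below is only an illustration under stated assumptions: normalise is a hypothetical helper, the third sample note is made up, and two-digit years are assumed to mean 19xx (as in the output shown earlier). It does not cover YYYY-only dates.

```python
import pandas as pd

# Optional leading "number/" so both M/D/Y and M/Y are captured.
pattern = r'(?P<Date>(?:\d+\/)?\d+\/\d+)'

def normalise(date_str):
    """Hypothetical helper: return the date as mm/dd/yyyy."""
    parts = date_str.split('/')
    if len(parts) == 2:       # M/Y: default the day to the 1st
        month, year = parts
        day = '1'
    else:                     # M/D/Y
        month, day, year = parts
    if len(year) == 2:        # assumption: two-digit years are 19xx
        year = '19' + year
    return f'{int(month):02d}/{int(day):02d}/{year}'

texts = pd.Series([
    '6/18/85 Primary Care Doctor:',
    '1/25/2011 CPT Code: 90792',
    'seen again 6/1998 for follow-up',  # made-up note with no day
])

extracted = texts.str.extract(pattern, expand=False)
as_text = extracted.map(normalise)                   # mm/dd/yyyy strings
parsed = pd.to_datetime(as_text, format='%m/%d/%Y')  # explicit format, no guessing
result = parsed.sort_values()

print(as_text.tolist())  # ['06/18/1985', '01/25/2011', '06/01/1998']
print(result.index.tolist())  # [0, 2, 1]
```

Note that the optional-day pattern will also match fragments such as 16/22 in lab values when no full date comes earlier in the row, so some filtering of implausible months may be needed on real data.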
