Extracting and parsing dates in a pandas DataFrame

I want to turn a set of notes containing messy dates into a sorted Series of dates in pandas.

0   03/25/93 Total time of visit (in minutes):\n 
1       6/18/85 Primary Care Doctor:\n 
2  sshe plans to move as of 7/8/71 In-Home Servic... 
3     7 on 9/27/75 Audit C Score Current:\n 
4  2/6/96 sleep studyPain Treatment Pain Level (N... 
5      .Per 7/06/79 Movement D/O note:\n 
6  4, 5/18/78 Patient's thoughts about current su... 
7  10/24/89 CPT Code: 90801 - Psychiatric Diagnos... 
8       3/7/86 SOS-10 Total Score:\n 
9    (4/10/71)Score-1Audit C Score Current:\n 
10  (5/11/85) Crt-1.96, BUN-26; AST/ALT-16/22; WBC... 
11       4/09/75 SOS-10 Total Score:\n 
12  8/01/98 Communication with referring physician... 
13  1/26/72 Communication with referring physician... 
14  5/24/1990 CPT Code: 90792: With medical servic... 
15  1/25/2011 CPT Code: 90792: With medical servic...

I have several date formats, such as 04/20/2009; 04/20/09; 4/20/09; 4/3/09. I want to convert all of them to mm/dd/yyyy in a new column.

So far I have tried:

df2['date']= df2['text'].str.extractall(r'(\d{1,2}[/-]\d{1,2}[/-]\d{2,})')

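I also don't know how to extract dates that appear only as MM/YY or YYYY without interfering with the rows matched by the code above. Note that when the day or month is missing, I want to default to the 1st and to January respectively.

Source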

2017-09-04 JPV

You can use pd.Series.str.extract with a regular expression and then apply pd.to_datetime:

# Parenthesized instead of a trailing backslash, which breaks if followed by whitespace.
df['Date'] = (df.Text.str.extract(r'(?P<Date>\d+(?:\/\d+){2})', expand=False)
                .apply(pd.to_datetime))

df 

    Text                                               Date
0   03/25/93 Total time of visit (in minutes):\n       1993-03-25
1   6/18/85 Primary Care Doctor:\n                     1985-06-18
2   sshe plans to move as of 7/8/71 In-Home Servic...  1971-07-08
3   7 on 9/27/75 Audit C Score Current:\n              1975-09-27
4   2/6/96 sleep studyPain Treatment Pain Level (N...  1996-02-06
5   .Per 7/06/79 Movement D/O note:\n                  1979-07-06
6   4, 5/18/78 Patient's thoughts about current su...  1978-05-18
7   10/24/89 CPT Code: 90801 - Psychiatric Diagnos...  1989-10-24
8   3/7/86 SOS-10 Total Score:\n                       1986-03-07
9   (4/10/71)Score-1Audit C Score Current:\n           1971-04-10
10  (5/11/85) Crt-1.96, BUN-26; AST/ALT-16/22; WBC...  1985-05-11
11  4/09/75 SOS-10 Total Score:\n                      1975-04-09
12  8/01/98 Communication with referring physician...  1998-08-01
13  1/26/72 Communication with referring physician...  1972-01-26
14  5/24/1990 CPT Code: 90792: With medical servic...  1990-05-24
15  1/25/2011 CPT Code: 90792: With medical servic...  2011-01-25

str.extract returns a Series of strings that looks like this:

array(['03/25/93', '6/18/85', '7/8/71', '9/27/75', '2/6/96', '7/06/79', 
     '5/18/78', '10/24/89', '3/7/86', '4/10/71', '5/11/85', '4/09/75', 
     '8/01/98', '1/26/72', '5/24/1990', '1/25/2011'], dtype=object)
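The two steps above (extract a string, then parse it) can be sketched end to end on a small sample. This is a minimal illustration, not the answer's exact code: the three-row DataFrame and the helper name parse_mdy are made up for the example, and the format is chosen explicitly from the length of the year instead of relying on format inference. With %y, strptime maps 69-99 to 1969-1999 and 00-68 to 2000-2068, which suits the years in this data.

```python
from datetime import datetime

import pandas as pd

# Hypothetical sample: three of the rows shown above.
df = pd.DataFrame({"Text": [
    "03/25/93 Total time of visit (in minutes):\n",
    "sshe plans to move as of 7/8/71 In-Home Servic...",
    "5/24/1990 CPT Code: 90792: With medical servic...",
]})

# Step 1: extract the first m/d/y substring from each row (NaN if none).
raw = df["Text"].str.extract(r"(?P<Date>\d+(?:/\d+){2})", expand=False)


def parse_mdy(s):
    """Parse 'm/d/yy' or 'm/d/yyyy'; return NaT for missing values."""
    if pd.isna(s):
        return pd.NaT
    # Pick %Y or %y from the length of the year part.
    year = s.rsplit("/", 1)[1]
    fmt = "%m/%d/%Y" if len(year) == 4 else "%m/%d/%y"
    return datetime.strptime(s, fmt)


# Step 2: parse element-wise, then let pandas build a datetime64 column.
df["Date"] = pd.to_datetime(raw.apply(parse_mdy))
```

Parsing element by element is slower than a single vectorized call, but it lets two-digit and four-digit years coexist in one column.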

Regex details

(?P<Date>\d+(?:\/\d+){2})

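(?P<Date>....) - named capture group
\d+ - one or more digits
(?:\/\d+){2} - non-capturing group repeated twice, where
- \/ - escaped forward slash
- {2} - quantifier (exactly two repetitions)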
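The breakdown above can be checked with Python's built-in re module, outside pandas. This is only a small sketch; the second test string is an invented example of a month/year-only date.

```python
import re

# Same pattern as above. In Python "/" needs no escaping, so "\/" and "/" match the same thing.
pattern = re.compile(r"(?P<Date>\d+(?:\/\d+){2})")

# The named group gives access to the matched date by name.
match = pattern.search("sshe plans to move as of 7/8/71 In-Home Servic...")
found = match.group("Date")

# {2} demands exactly two "/digits" parts, so a month/year-only date is not matched.
no_match = pattern.search("seen in 6/2008")
```

Here found holds the string 7/8/71 and no_match is None, which is why a modified pattern is needed for dates without a day.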

Regex for a missing day

To handle an optional day, a slightly modified regex is needed:

(?P<Date>(?:\d+\/)?\d+/\d+)

Details

(?P<Date>....) - named capture group
(?:\d+\/)? - nested non-capturing group; the \d+\/ part is optional
\d+ - one or more digits
\/ - escaped forward slash

The rest is the same. Use this regex in place of the current one. pd.to_datetime will handle the missing day.

Source
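As a hedged sketch of how the optional-day pattern could feed the mm/dd/yyyy column the question asks for, the example below fills in the missing day explicitly instead of leaving it to the parser. The sample strings and the helper name parse_partial are invented for illustration, and year-only (YYYY) dates are not covered. Because the looser pattern also matches fragments such as 16/22 in a lab value, anything that is not a valid date becomes NaT.

```python
from datetime import datetime

import pandas as pd

# Hypothetical sample mixing full dates with a month/year-only entry.
texts = pd.Series([
    "1/25/2011 CPT Code: 90792: With medical servic...",
    "seen in 6/2008 for follow-up",
    "3/7/86 SOS-10 Total Score:\n",
])

# Optional-day pattern from above ("/" left unescaped, which is equivalent).
raw = texts.str.extract(r"(?P<Date>(?:\d+/)?\d+/\d+)", expand=False)


def parse_partial(s):
    """Parse 'm/d/y' or 'm/y'; a missing day defaults to the 1st."""
    if pd.isna(s):
        return pd.NaT
    parts = s.split("/")
    if len(parts) == 2:  # month/year only: insert day 1
        parts.insert(1, "1")
    year_fmt = "%Y" if len(parts[2]) == 4 else "%y"
    try:
        return datetime.strptime("/".join(parts), "%m/%d/" + year_fmt)
    except ValueError:  # matched text that is not a real date, e.g. "16/22"
        return pd.NaT


dates = pd.to_datetime(raw.apply(parse_partial))

# Sort chronologically and render as mm/dd/yyyy strings.
formatted = dates.sort_values().dt.strftime("%m/%d/%Y")
```

The result keeps the original row labels, so formatted can be assigned back to a DataFrame column and will align by index.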

2017-09-04 21:48:05
