I have an uncleaned DataFrame, shown below:
date                     time      order_type
2021-11-03               18:08:27  Dinner
2021-19-12               10:11:49  Breakfast
2021-fri-17              17:38:01  Dinner
Fri Apr 2 00:00:00 2021  18:38:52  Dinner
2021-01-08               08:10:08  Breakfast
I want to check whether the dates in this DataFrame follow the "year-day-month" format and locate the rows where the format is wrong. How can I do this?
You need to validate not only the format but also whether the date itself is a real calendar date:
import re
import pandas as pd

df = pd.DataFrame([
    {'date': '2021-11-03', 'time': '18:08:27', 'order_type': 'Dinner'},
    {'date': '2021-19-12', 'time': '10:11:49', 'order_type': 'Breakfast'},
    {'date': '2021-fri-17', 'time': '17:38:01', 'order_type': 'Dinner'},
    {'date': 'Fri Apr 2 00:00:00', 'time': '18:38:52', 'order_type': 'Dinner'},
    {'date': '2021-01-08', 'time': '08:10:08', 'order_type': 'Breakfast'}
])

# Matches YYYYMMDD for years 1900-2099, enforcing month lengths and leap years.
reg = re.compile(r'^((((19|20)\d{2})(0[13-9]|1[012])(0[1-9]|[12]\d|30))|(((19|20)\d{2})(0[13578]|1[02])31)|'
                 r'(((19|20)\d{2})02(0[1-9]|1\d|2[0-8]))|((((19|20)([13579][26]|[2468][048]|0[48]))|'
                 r'(2000))0229))$')

# Strip the hyphens, then keep the rows whose date does not match the pattern.
invalid_df = df[df['date'].apply(lambda x: reg.match(x.replace('-', '')) is None)]
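Here df is the original data, and invalid_df holds the rows filtered out as invalid. For the sample above, those are the rows at index 1, 2, and 3.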
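Because the regex approach strips the hyphens before matching, a string such as '2021-1103' would still pass. The following is a minimal sketch of a stricter check, under the assumption that dates must be zero-padded and hyphen-separated. It checks the shape with a short pattern and leaves the calendar rules (month lengths, leap years) to datetime.strptime. The names DATE_SHAPE and is_valid_date are hypothetical helpers, not part of the answer above, and the default format string assumes year-month-day.

```python
import re
from datetime import datetime

# Hypothetical helper names; the default format is an assumption (year-month-day).
DATE_SHAPE = re.compile(r'\d{4}-\d{2}-\d{2}')

def is_valid_date(text, fmt='%Y-%m-%d'):
    """Return True if text looks like NNNN-NN-NN and is a real calendar date."""
    # Reject non-strings and anything that is not four digits, two digits, two digits.
    if not isinstance(text, str) or DATE_SHAPE.fullmatch(text) is None:
        return False
    try:
        # strptime raises ValueError for impossible dates such as month 19.
        datetime.strptime(text, fmt)
    except ValueError:
        return False
    return True

dates = ['2021-11-03', '2021-19-12', '2021-fri-17', 'Fri Apr 2 00:00:00', '2021-01-08']

# Positions of the entries that fail the check.
invalid_positions = [i for i, d in enumerate(dates) if not is_valid_date(d)]
print(invalid_positions)  # [1, 2, 3]
```

With a DataFrame, the same helper could be applied as df[~df['date'].map(is_valid_date)]. If the column really is year-day-month, as the question states, passing fmt='%Y-%d-%m' changes which rows count as valid.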
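import pandas as pd
from datetime import datetime

df = pd.DataFrame({'date': ['2022-01-01', '2022/01/01', '01/01/2022']})

# strptime raises ValueError when the string does not match the format
# or is not a real calendar date, so report every row that fails to parse.
for i, datestr in enumerate(df.date):
    try:
        datetime.strptime(datestr, '%Y-%m-%d')
    except ValueError:
        print(f'{i} row, date: {datestr}')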
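The row-by-row loop can also be written as a single vectorized call. This is a rough sketch using pd.to_datetime with errors='coerce', applied to the sample data from the question. The function name find_invalid_dates and its parameters are hypothetical, and the format string is an assumption: the answers here use year-month-day, while the question mentions year-day-month, so both are shown.

```python
import pandas as pd

def find_invalid_dates(df, column='date', fmt='%Y-%m-%d'):
    """Return the rows of df whose `column` cannot be parsed with `fmt`."""
    # errors='coerce' turns unparseable strings into NaT instead of raising.
    parsed = pd.to_datetime(df[column], format=fmt, errors='coerce')
    # NaT marks the rows that failed to parse.
    return df[parsed.isna()]

df = pd.DataFrame({
    'date': ['2021-11-03', '2021-19-12', '2021-fri-17', 'Fri Apr 2 00:00:00', '2021-01-08'],
    'time': ['18:08:27', '10:11:49', '17:38:01', '18:38:52', '08:10:08'],
    'order_type': ['Dinner', 'Breakfast', 'Dinner', 'Dinner', 'Breakfast'],
})

invalid_ymd = find_invalid_dates(df)                  # assumes year-month-day
invalid_ydm = find_invalid_dates(df, fmt='%Y-%d-%m')  # assumes year-day-month
print(invalid_ymd.index.tolist())
print(invalid_ydm.index.tolist())
```

The index of the returned frame gives the positions of the offending rows. Values that were already missing in the column would also be flagged, since they parse to NaT as well.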