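I ran into the following problem when using pd.to_datetime.
The first snippet is my test code. It reads 200,000 rows with df = read_chunks.get_chunk(), then normalizes the dates with pd.to_datetime. This runs fine and produces the expected result.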
from datetime import datetime
import pandas as pd

read_chunks = pd.read_csv(r'D:训练用数据.csv', encoding='gbk', iterator=True, chunksize=200000)
df = read_chunks.get_chunk()  # read the current chunk
df['日期'] = pd.to_datetime(df['日期'])  # 日期 is the date column
# difference in days between each date and 2018-03-01
cha = (df['日期'] - datetime(2018, 3, 1)).dt.days
df['day'] = df['日期'].dt.day
df['weekday'] = df['日期'].dt.weekday
df['week'] = (cha // 7) + 1
df['hour'] = df['时间'].apply(lambda x: int(x.split(':')[0]))  # 时间 is the time column; keep the part before the first ':'
print(df)
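To make the derived columns concrete, here is a minimal sketch on a made-up two-row sample. The sample values, the %Y-%m-%d date format, and the HH:MM:SS time format are assumptions, since the real file contents are not shown. It also uses the vectorized .str accessor for the hour instead of apply, which may be faster on large chunks.

```python
import pandas as pd

# Hypothetical two-row sample; the real file's contents are not shown.
df = pd.DataFrame({
    "日期": ["2018-03-01", "2018-03-08"],
    "时间": ["08:15:00", "23:59:59"],
})
df["日期"] = pd.to_datetime(df["日期"], format="%Y-%m-%d")

# Whole days elapsed since the reference date 2018-03-01.
cha = (df["日期"] - pd.Timestamp(2018, 3, 1)).dt.days
df["day"] = df["日期"].dt.day
df["weekday"] = df["日期"].dt.weekday  # Monday is 0
df["week"] = cha // 7 + 1              # week 1 starts on 2018-03-01
# Vectorized alternative to apply(): take the text before the first ':'.
df["hour"] = df["时间"].str.split(":").str[0].astype(int)
print(df)
```

With this sample, both dates fall on a Thursday (weekday 3), and the second row lands in week 2 because it is exactly seven days after the reference date.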
However, when I process the full data set (about 140 million rows), iterating over the chunks with a for loop and chunksize changed to 10000000, I get an error:
from datetime import datetime
import pandas as pd

read_chunks = pd.read_csv(r'D:训练用数据.csv', encoding='gbk', iterator=True, chunksize=10000000)
# With iterator=True / chunksize set, read_csv returns a TextFileReader rather than a DataFrame;
# it can be iterated over, or read block by block with get_chunk().
# Parameters:
# iterator=True: return an iterator
# chunksize=10000000: number of rows per chunk, here 10,000,000 rows at a time
chunk_list = list()
# process each chunk and collect it in chunk_list
for df in read_chunks:
    df['日期'] = pd.to_datetime(df['日期'])
    # difference in days between each date and 2018-03-01
    cha = (df['日期'] - datetime(2018, 3, 1)).dt.days
    df['day'] = df['日期'].dt.day
    df['weekday'] = df['日期'].dt.weekday
    df['week'] = (cha // 7) + 1
    df['hour'] = df['时间'].apply(lambda x: int(x.split(':')[0]))  # keep the part before the first ':'
    chunk_list.append(df)
df_all = pd.concat(chunk_list, ignore_index=False)
print(df_all)
The error message is:
Traceback (most recent call last):
File "D:\05tools\pycharm\xxxx.py", line 2187, in objects_to_datetime64ns
values, tz_parsed = conversion.datetime_to_datetime64(data.ravel("K"))
File "pandas\_libs\tslibs\conversion.pyx", line 359, in pandas._libs.tslibs.conversion.datetime_to_datetime64
TypeError: Unrecognized value type: <class 'str'>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "D:/05tools/pycharm/xxxx.py", line 16, in <module>
df['日期'] = pd.to_datetime(df['日期'])
File "D:\05tools\pycharm\PycharmProjects\venv\lib\site-packages\pandas\core\tools\datetimes.py", line 883, in to_datetime
cache_array = _maybe_cache(arg, format, cache, convert_listlike)
File "D:\05tools\pycharm\PycharmProjects\venv\lib\site-packages\pandas\core\tools\datetimes.py", line 195, in _maybe_cache
cache_dates = convert_listlike(unique_dates, format)
File "D:\05tools\pycharm\PycharmProjects\venv\lib\site-packages\pandas\core\tools\datetimes.py", line 401, in _convert_listlike_datetimes
result, tz_parsed = objects_to_datetime64ns(
File "D:\05tools\pycharm\PycharmProjects\venv\lib\site-packages\pandas\core\arrays\datetimes.py", line 2193, in objects_to_datetime64ns
raise err
File "D:\05tools\pycharm\PycharmProjects\venv\lib\site-packages\pandas\core\arrays\datetimes.py", line 2175, in objects_to_datetime64ns
result, tz_parsed = tslib.array_to_datetime(
File "pandas\_libs\tslib.pyx", line 379, in pandas._libs.tslib.array_to_datetime
File "pandas\_libs\tslib.pyx", line 606, in pandas._libs.tslib.array_to_datetime
File "pandas\_libs\tslib.pyx", line 602, in pandas._libs.tslib.array_to_datetime
File "pandas\_libs\tslib.pyx", line 557, in pandas._libs.tslib.array_to_datetime
File "pandas\_libs\tslibs\conversion.pyx", line 516, in pandas._libs.tslibs.conversion.convert_datetime_to_tsobject
File "pandas\_libs\tslibs\np_datetime.pyx", line 120, in pandas._libs.tslibs.np_datetime.check_dts_bounds
pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 18-04-01 00:00:00
Process finished with exit code 1
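The last line of the traceback suggests that at least one value in the 日期 column was interpreted as a date in the year 18, which is outside the range a nanosecond timestamp can represent (roughly the years 1677 to 2262). A minimal sketch for locating such values in a chunk follows. The sample values and the expected %Y-%m-%d format are assumptions, since the actual column contents are not shown.

```python
import pandas as pd

# Hypothetical sample; the real 日期 column is not shown in the post.
sample = pd.DataFrame(
    {"日期": ["2018-03-31", "18-04-01", "0018-04-01", "2018-04-02"]}
)

# errors="coerce" turns values that cannot be converted (wrong format or
# out of the representable range) into NaT instead of raising, so the
# offending rows can be inspected afterwards.
parsed = pd.to_datetime(sample["日期"], format="%Y-%m-%d", errors="coerce")
bad_rows = sample[parsed.isna()]
print(bad_rows)
```

Running the same check inside the chunk loop, and printing bad_rows whenever it is non-empty, would show which rows differ from the first 200,000 that parsed cleanly.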
The values in the 日期 column look like this:
Could anyone tell me what causes this error and how to fix it? Thanks!