Before a merge, you can check key uniqueness (and do other data-cleaning work), or estimate how much memory the merge will need, to guard against running out of memory.
1. Specify whether the join should be one-to-one, one-to-many, or many-to-many
result = pd.merge(left, right, on='B', how='left', validate="one_to_one")
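As a small illustration (the frames below are made up for this example, not taken from the original question), validate fails fast instead of silently inflating the row count:

import pandas as pd

left = pd.DataFrame({'B': [1, 2, 3], 'x': ['a', 'b', 'c']})
right = pd.DataFrame({'B': [1, 1, 2], 'y': ['p', 'q', 'r']})

# 'B' is duplicated in `right`, so one-to-one validation raises MergeError
# instead of silently producing extra rows in the result.
try:
    pd.merge(left, right, on='B', how='left', validate='one_to_one')
except pd.errors.MergeError as exc:
    print('merge validation failed:', exc)

# one-to-many passes: keys only need to be unique on the left side.
result = pd.merge(left, right, on='B', how='left', validate='one_to_many')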
2. Estimate the memory required by the merge operation
import numpy as np

def merge_size(left_frame, right_frame, group_by):
    # Per-key row counts on each side; the merge produces the product
    # of the two counts for every key present in both frames.
    left_groups = left_frame.groupby(group_by).size()
    right_groups = right_frame.groupby(group_by).size()
    left_keys = set(left_groups.index)
    right_keys = set(right_groups.index)
    intersection = right_keys & left_keys
    left_diff = left_keys - intersection
    right_diff = right_keys - intersection

    # If the joining column contains np.nan values, these get missed by the
    # intersection but are present in the merge, so count them separately.
    left_nan = len(left_frame.query('{0} != {0}'.format(group_by)))
    right_nan = len(right_frame.query('{0} != {0}'.format(group_by)))
    left_nan = 1 if left_nan == 0 and right_nan != 0 else left_nan
    right_nan = 1 if right_nan == 0 and left_nan != 0 else right_nan

    sizes = [(left_groups[group_name] * right_groups[group_name])
             for group_name in intersection]
    sizes += [left_groups[group_name] for group_name in left_diff]
    sizes += [right_groups[group_name] for group_name in right_diff]
    sizes += [left_nan * right_nan]
    return sum(sizes)

rows = merge_size(df1, df2, 'unique_key')  # estimated rows in the merge result
cols = len(df1.columns) + (len(df2.columns) - 1)  # merged columns; the key column is shared
required_memory = (rows * cols) * np.dtype(np.float64).itemsize  # bytes, assuming float64
# Source: https://github.com/pandas-dev/pandas/issues/15068
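As a possible follow-up (my sketch, not from the thread): compare the estimate against currently available RAM before committing to the merge. This assumes the third-party psutil package is installed, and the safety factor is an arbitrary cushion for pandas' intermediate allocations.

import numpy as np
import psutil  # third-party package; assumed to be available

def merge_fits_in_memory(left, right, key, safety_factor=2.0):
    # Reuses merge_size() from above to estimate the result size,
    # then compares it against available RAM with a safety margin.
    rows = merge_size(left, right, key)
    cols = len(left.columns) + (len(right.columns) - 1)
    estimated_bytes = rows * cols * np.dtype(np.float64).itemsize
    return estimated_bytes * safety_factor <= psutil.virtual_memory().available

if not merge_fits_in_memory(df1, df2, 'unique_key'):
    raise MemoryError('merge result is unlikely to fit in RAM')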
This looks like an out-of-memory problem; the files may simply be too large. First test whether a small subset of the data merges correctly.
250,000 rows is really not that many. Check whether you have nested loops anywhere; 250,000 rows run through several levels of nested loops is no small workload.
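For example (a minimal sketch reusing the df1/df2 names from above), merge a small slice first and inspect the result before running the full join:

sample = pd.merge(df1.head(1000), df2, on='unique_key', how='left')
print(sample.shape)          # is the row count in the expected range?
print(sample.isna().mean())  # what fraction of keys failed to match?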
Take a look at this: https://blog.csdn.net/shywang001/article/details/90719398
https://blog.csdn.net/weixin_43064185/article/details/90665301