Why does my Python merge-matching code raise a MemoryError? Of the two Excel files being matched, one is under 250,000 rows and the other only a few hundred rows.


Before merging, you can check the uniqueness of the join keys (along with other data-cleaning work), or estimate the memory the merge will need, to guard against memory overflow.

1. Specify whether the join is one-to-one, one-to-many, or many-to-many

result = pd.merge(left, right, on='B', how='left', validate="one_to_one")
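
With validate set, pandas raises pandas.errors.MergeError instead of silently producing a many-to-many explosion. A minimal sketch of checking for duplicate keys up front (the DataFrames left and right and the column name 'B' follow the snippet above):

import pandas as pd

# Count duplicated join keys on each side; duplicates on both sides
# turn the merge into a per-key cross product.
print(left['B'].duplicated().sum())
print(right['B'].duplicated().sum())

try:
    result = pd.merge(left, right, on='B', how='left', validate='one_to_one')
except pd.errors.MergeError as exc:
    # validate='one_to_one' fails fast if either side has duplicate keys
    print('duplicate keys detected:', exc)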


2. Estimate the memory the merge will need

import numpy as np
import pandas as pd

def merge_size(left_frame, right_frame, group_by):
    # Estimate the number of rows an outer merge on `group_by` will produce.
    left_groups = left_frame.groupby(group_by).size()
    right_groups = right_frame.groupby(group_by).size()
    left_keys = set(left_groups.index)
    right_keys = set(right_groups.index)
    intersection = right_keys & left_keys
    left_diff = left_keys - intersection
    right_diff = right_keys - intersection

    # If the joining column contains np.nan values, these get missed by the
    # intersection above but are present in the merge result, so they need to
    # be counted separately. `x != x` is only true for NaN.
    left_nan = len(left_frame.query('{0} != {0}'.format(group_by)))
    right_nan = len(right_frame.query('{0} != {0}'.format(group_by)))
    left_nan = 1 if left_nan == 0 and right_nan != 0 else left_nan
    right_nan = 1 if right_nan == 0 and left_nan != 0 else right_nan

    # Keys on both sides multiply (many-to-many); keys on one side pass through.
    sizes = [(left_groups[group_name] * right_groups[group_name]) for group_name in intersection]
    sizes += [left_groups[group_name] for group_name in left_diff]
    sizes += [right_groups[group_name] for group_name in right_diff]
    sizes += [left_nan * right_nan]
    return sum(sizes)

rows = merge_size(df1, df2, 'unique_key')  # estimated rows in the merged result, e.g. 185030090
cols = len(df1.columns) + (len(df2.columns) - 1)  # the join key appears only once, e.g. 7
required_memory = (rows * cols) * np.dtype(np.float64).itemsize  # assumes 8-byte cells, e.g. ~10361685040 bytes (~10 GB)
# https://github.com/pandas-dev/pandas/issues/15068
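
Once you have the estimate, you can compare it against the memory actually available before running the merge. A minimal sketch, assuming the psutil package is installed (merge_size, df1, and df2 as above):

import numpy as np
import psutil

rows = merge_size(df1, df2, 'unique_key')
cols = len(df1.columns) + (len(df2.columns) - 1)
required = rows * cols * np.dtype(np.float64).itemsize

# Only run the merge when the estimate fits in the memory currently available.
available = psutil.virtual_memory().available
if required > available:
    raise MemoryError(f'merge needs ~{required} bytes, only {available} available')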

This is a memory-overflow problem; the files may simply be too large. First test whether a small subset matches correctly, as in the sketch below.
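
A minimal sketch of that test, assuming the join column is named 'key' (substitute your own column and DataFrame names):

import pandas as pd

# Merge only the first 1000 rows; if even this explodes in size,
# the join keys are duplicated and the full merge will blow up memory.
sample = pd.merge(df1.head(1000), df2, on='key', how='left')
print(len(df1.head(1000)), '->', len(sample))  # a large jump signals duplicate keys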

250,000 rows really isn't a lot. Check whether there is any nested looping in your code; iterating over 250,000 rows inside several nested loop levels is no small amount of work. See the sketch below for a vectorized alternative.
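
A hand-written nested loop that compares every row of one frame against every row of the other does roughly len(df1) * len(df2) comparisons, whereas a single merge does the same matching in one pass. A minimal sketch (the column name 'key' is an assumption):

import pandas as pd

# Slow: nested Python loops over both frames, O(len(df1) * len(df2))
# matched = [(a, b) for _, a in df1.iterrows() for _, b in df2.iterrows()
#            if a['key'] == b['key']]

# Fast: one vectorized merge performs the same matching internally
matched = pd.merge(df1, df2, on='key', how='inner')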


Have a look at these:
https://blog.csdn.net/shywang001/article/details/90719398

https://blog.csdn.net/weixin_43064185/article/details/90665301