I built a database of about 10,000 rows with a web crawler, and I want to delete the duplicate rows.
DELETE FROM newtable WHERE id IN (
SELECT * FROM (
SELECT id FROM newtable WHERE (job_name,company_href,issuedate)
IN (
SELECT job_name,company_href,issuedate FROM newtable GROUP BY job_name,company_href,issuedate HAVING COUNT(1) > 1
) AND id NOT IN (
SELECT MIN(id) FROM newtable GROUP BY job_name,company_href,issuedate HAVING COUNT(1) > 1
)
) AS newtable_copy
);
With the statement above, execution timed out. I then ran just the inner query:
SELECT * FROM (
SELECT id FROM newtable WHERE (job_name,company_href,issuedate)
IN (
SELECT job_name,company_href,issuedate FROM newtable GROUP BY job_name,company_href,issuedate HAVING COUNT(1) > 1
) AND id NOT IN (
SELECT MIN(id) FROM newtable GROUP BY job_name,company_href,issuedate HAVING COUNT(1) > 1
)
) AS newtable_copy
This query took 15 minutes to run. Could someone explain why it takes so long?
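Reason: for every row your WHERE condition checks, the subquery on newtable is run and grouped again, which works out to roughly 10000*10000 row visits or more. If there is no index on the grouped columns, or the index is poorly chosen, each of those passes becomes a full table scan, which makes it even worse. You can optimize this by rewriting it as a table join.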
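A rough sketch of the join version, assuming MySQL (the derived-table wrapper in your DELETE suggests it) and that id is a unique key. The index name idx_job_company_date and the aliases t, d, and keep_id are made up for illustration. The grouping now runs once, and each row is matched against its group's smallest id:

```sql
-- Assumption: the three columns are short enough to index directly;
-- long VARCHAR/TEXT columns would need prefix lengths.
CREATE INDEX idx_job_company_date
    ON newtable (job_name, company_href, issuedate);

-- Preview: ids of the duplicate rows that would be removed.
SELECT t.id
FROM newtable AS t
JOIN (
    -- One pass over the table: the smallest id of each duplicated group.
    SELECT job_name, company_href, issuedate, MIN(id) AS keep_id
    FROM newtable
    GROUP BY job_name, company_href, issuedate
    HAVING COUNT(*) > 1
) AS d
  ON  t.job_name     = d.job_name
  AND t.company_href = d.company_href
  AND t.issuedate    = d.issuedate
WHERE t.id > d.keep_id;

-- Delete: same join, keeping only the row with the smallest id per group.
DELETE t
FROM newtable AS t
JOIN (
    SELECT job_name, company_href, issuedate, MIN(id) AS keep_id
    FROM newtable
    GROUP BY job_name, company_href, issuedate
    HAVING COUNT(*) > 1
) AS d
  ON  t.job_name     = d.job_name
  AND t.company_href = d.company_href
  AND t.issuedate    = d.issuedate
WHERE t.id > d.keep_id;
```

As in your original query, rows with NULL in any of the three columns are not matched by the equality comparison. Run the SELECT first and check the ids before running the DELETE.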
If this helps, please accept the answer. Thanks.