I have the following code for scraping stock data from the web; 天元浪子 pointed out that it runs with multiple processes. For this program, which is more efficient and appropriate, multithreading or multiprocessing? I need to fetch data for about five thousand stocks in total, so which one saves the most time? Also, I would like to display completion progress (completed count / total count). How should I add that? Many thanks.
import pandas as pd
import os
from multiprocessing.pool import Pool

def updateStock_pool_per(arg_list):
    download_url = arg_list[0]
    stock_path = arg_list[1]
    stock_code = arg_list[2]
    df = pd.read_csv(download_url, encoding='gbk')  # read the file data straight off the web
    path = os.path.join(stock_path, stock_code + '.csv')
    df.to_csv(path, index=False, encoding='gbk')  # save to a local file
    return stock_code

def updateStock(stock_path, stock_code_list):
    arg_list = []
    for stock_code in stock_code_list:
        download_url = ('http://quotes.money.163.com/service/chddata.html?code=' + stock_code
                        + '&start=20220719&end=20220819&fields=TCLOSE;HIGH;LOW;TOPEN;'
                          'LCLOSE;VOTURNOVER;VATURNOVER;TCAP;MCAP')  # build the URL
        arg_list.append([download_url, stock_path, stock_code])
    # pool.map passes a single argument per call, so multiple arguments are packed as a list of lists
    with Pool(processes=4) as pool:
        update_success_list = pool.map(updateStock_pool_per, arg_list)
    print('*' * 50)
    update_all_count = len(stock_code_list)  # total number of stocks to update
    update_success_count = len(update_success_list)
    update_fail_list = []
    for m in stock_code_list:
        if m not in update_success_list:
            update_fail_list.append(m)
    update_fail_count = len(update_fail_list)
    print('{} stocks queued for update, {} updated with new data, {} failed to download'.format(
        update_all_count, update_success_count, update_fail_count))
    if update_fail_list:
        print(f'The following stocks failed to download:\n{update_fail_list}')

if __name__ == '__main__':
    stock_path = r'C:\Users\Administrator\Desktop\stock'
    stock_code_list = ['1002296', '1002364', '1000826', '0600981', '1000997',
                       '1000901', '1300291', '1300002', '1002051', '1300252']
    updateStock(stock_path, stock_code_list)
1. Which is faster here, multithreading or multiprocessing? Please give code. 2. How to display progress.
An efficiency comparison of single-threading, multithreading, and multiprocessing in Python:
https://www.runoob.com/w3cnote/python-single-thread-multi-thread-and-multi-process.html
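For an IO-bound job you can also measure the difference yourself. Below is a minimal timing sketch of my own (not taken from the linked article), using time.sleep as a stand-in for one network request:

import time
from multiprocessing.pool import Pool, ThreadPool

def io_task(_):
    time.sleep(0.5)  # stand-in for one network request

if __name__ == '__main__':
    jobs = range(20)

    t0 = time.time()
    for j in jobs:              # plain serial loop
        io_task(j)
    print('serial:       %.2fs' % (time.time() - t0))

    t0 = time.time()
    with ThreadPool(10) as tp:  # 10 threads in one process
        tp.map(io_task, jobs)
    print('thread pool:  %.2fs' % (time.time() - t0))

    t0 = time.time()
    with Pool(4) as pp:         # 4 separate worker processes
        pp.map(io_task, jobs)
    print('process pool: %.2fs' % (time.time() - t0))

With sleep standing in for the request, both pools finish far ahead of the serial loop; the 10-thread pool beats the 4-process pool simply because it runs more tasks at once, and since the task is pure waiting, the GIL is not a bottleneck.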
Generally speaking, for CPU-bound tasks multithreading does not improve efficiency, while multiprocessing does; for IO-bound tasks, multithreading or coroutines can speed things up. The OP's problem is data scraping, where the bottleneck is IO, both network IO and disk IO, so I recommend a combination of multiprocessing plus multithreading or coroutines: for example, start 4 or 8 processes and run 5-10 scraping threads inside each one; taken further, you can even use coroutines inside the threads. Below is a multiprocessing + thread pool demo in which the main process displays the current progress, for reference.
import time, random
import multiprocessing as mp
from concurrent.futures import ThreadPoolExecutor

n_processing = 4  # number of worker processes
n_threading = 10  # thread pool size inside each process

def func_threading(args):
    """Thread worker function."""
    code, lock, counter = args
    # simulate doing some work with a random 0.5-2.5 second delay
    time.sleep(random.random() * 2 + 0.5)
    with lock:  # mutex protects the shared counter
        counter.value += 1

def func_processing(codes, lock, counter):
    """Process worker function (runs a thread pool)."""
    with ThreadPoolExecutor(max_workers=n_threading) as pool:  # thread pool
        # map submits all tasks; leaving the with-block waits for them to finish
        list(pool.map(func_threading, [(code, lock, counter) for code in codes]))

if __name__ == '__main__':
    stock_codes = [
        '1002296', '1002364', '1000826', '0600981', '1000997',
        '1000901', '1300291', '1300002', '1002051', '1300252',
        '2002296', '2002364', '2000826', '2600981', '2000997',
        '3002296', '3002364', '3000826', '3600981', '3000997'
    ]
    lock = mp.Lock()            # mutex shared across processes
    counter = mp.Value('i', 0)  # 'i' means a C int
    total = len(stock_codes)
    piece = -(-total // n_processing)  # ceiling division so no codes are dropped
    for i in range(n_processing):
        p = mp.Process(target=func_processing,
                       args=(stock_codes[i*piece : (i+1)*piece], lock, counter))
        p.daemon = True
        p.start()
    while True:  # main process polls the counter to show progress
        k = counter.value
        s = '\rdone: %d/%d' % (k, total)
        print(s.rjust(20), end='', flush=True)
        time.sleep(0.2)
        if k == total:
            break
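As for the coroutine route mentioned above, a minimal asyncio sketch could look like the following. It assumes the third-party aiohttp package is installed; the URL format is copied from the question, and the progress counter needs no lock because the event loop runs in a single thread:

import asyncio
import aiohttp

async def fetch_one(session, sem, code, done, total):
    url = ('http://quotes.money.163.com/service/chddata.html?code=' + code
           + '&start=20220719&end=20220819&fields=TCLOSE;HIGH;LOW;TOPEN;'
             'LCLOSE;VOTURNOVER;VATURNOVER;TCAP;MCAP')
    async with sem:                           # cap the number of requests in flight
        async with session.get(url) as resp:
            data = await resp.read()
    done[0] += 1                              # single-threaded loop: no lock needed
    print('\rdone: %d/%d' % (done[0], total), end='', flush=True)
    return code, data

async def main(codes):
    sem = asyncio.Semaphore(10)               # at most 10 concurrent requests
    done = [0]
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_one(session, sem, c, done, len(codes)) for c in codes]
        return await asyncio.gather(*tasks)

if __name__ == '__main__':
    results = asyncio.run(main(['1002296', '1002364', '1000826']))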
Also, someone above said Python's multiprocessing is fake. That is a matter of opinion, but the claim is probably one person's view rather than a consensus in the community.
Processes and threads do not conflict: a single process can run multiple threads, and you can also run multiple processes, each with multiple threads.
If you are working in Python, don't expect much from this kind of acceleration; the effort is large and the payoff small. Multiprocessing won't gain you much either, while its resource usage is considerable.
Multithreading alone is enough: count each finished request as one completed task, i.e. multithreading plus a counting loop, as in the sketch below.
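Along those lines, here is a sketch of a pure thread-pool version of the question's downloader with that kind of progress counter. It reuses the URL and gbk encoding from the question; error handling is omitted, so a failed download will raise:

import os
import pandas as pd
from concurrent.futures import ThreadPoolExecutor, as_completed

def download_one(stock_code, stock_path):
    url = ('http://quotes.money.163.com/service/chddata.html?code=' + stock_code
           + '&start=20220719&end=20220819&fields=TCLOSE;HIGH;LOW;TOPEN;'
             'LCLOSE;VOTURNOVER;VATURNOVER;TCAP;MCAP')
    df = pd.read_csv(url, encoding='gbk')                  # fetch straight off the web
    df.to_csv(os.path.join(stock_path, stock_code + '.csv'),
              index=False, encoding='gbk')                 # save to disk
    return stock_code

if __name__ == '__main__':
    stock_path = r'C:\Users\Administrator\Desktop\stock'
    codes = ['1002296', '1002364', '1000826', '0600981', '1000997']
    total, done = len(codes), 0
    with ThreadPoolExecutor(max_workers=10) as pool:
        futures = [pool.submit(download_one, c, stock_path) for c in codes]
        for f in as_completed(futures):  # each finished future = one request done
            done += 1
            print('\rdone: %d/%d (%s)' % (done, total, f.result()), end='', flush=True)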