How can I batch-scrape tables from Eastmoney pages with Python?
The URL is:
http://emweb.securities.eastmoney.com/PC_HSF10/ShareholderResearch/Index?type=web&code=sz000001
I have a file, 1.txt, containing a list of stock codes. For each code in the list, substitute it after the "=" in the URL above, scrape the second table on the page, and save it to an Excel file named after the stock code.
Thanks for the invite~
Since you asked, I have to deliver. Your txt file should be formatted like this,
and placed in the same directory as the Python script (or edit the path yourself if you want it elsewhere):
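(The original screenshot of the file format is not shown here; assuming one code per line, with the market prefix that the URLs above and below expect, 1.txt would look something like this:)

sz000001
sz000002
sh600000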
Then run the following code:
import requests
import pandas as pd
import time

# Read the stock code list (one code per line)
with open('1.txt', 'r') as f:
    stock_codes = f.read().splitlines()

headers = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate',
    'Connection': 'keep-alive',
    'Referer': 'http://emweb.securities.eastmoney.com/PC_HSF10/ShareholderResearch/Index?type=web',
    'Host': 'emweb.securities.eastmoney.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36'}

for gpdm in stock_codes:
    # The page fills its tables from this Ajax endpoint, which returns JSON
    url = 'http://emweb.securities.eastmoney.com/PC_HSF10/ShareholderResearch/PageAjax?code=' + gpdm
    print('Fetching data for stock {}'.format(gpdm))
    res = requests.get(url, headers=headers)
    if res.status_code == 200:
        # 'sdltgd' is the key holding the top-10 tradable shareholders table
        target = res.json()['sdltgd']
        df = pd.DataFrame(target)
        df.to_excel(gpdm + '.xlsx', index=False)
        time.sleep(1)  # pause between requests to avoid hammering the server
    else:
        print('Failed to fetch stock {}, moving on to the next one!'.format(gpdm))
        continue
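By the way, 'sdltgd' is just the one table I grabbed; if you want a different table from the same page, print the keys the Ajax response actually returns and pick the one you need (a quick sketch; the key names are Eastmoney's and not guaranteed to stay stable):

import requests

headers = {'User-Agent': 'Mozilla/5.0'}  # the full headers dict above works too
url = 'http://emweb.securities.eastmoney.com/PC_HSF10/ShareholderResearch/PageAjax?code=sz000001'
res = requests.get(url, headers=headers)
print(sorted(res.json().keys()))  # every table key in the JSON; 'sdltgd' is one of them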
Need anything else? You know where to find me~
requests to fetch the page
pandas.read_html to extract the tables
pandas.to_excel to write the Excel file
Here is an example; adapt it as needed:
import requests
import pandas as pd

# Read the stock code list
with open('1.txt', 'r') as f:
    stock_codes = f.read().splitlines()

# Process each stock code
for stock_code in stock_codes:
    url = f'http://emweb.securities.eastmoney.com/PC_HSF10/ShareholderResearch/Index?type=web&code={stock_code}'
    # Send a GET request
    response = requests.get(url)
    # Let pandas parse every table on the page
    tables = pd.read_html(response.text)
    # The table we want is the second one, at index 1
    df = tables[1]
    # Save the data to an Excel file named after the stock code
    df.to_excel(f'{stock_code}.xlsx', index=False)
If this helped, please click to accept the answer~ Thanks!
Thanks for the answer above, but the code fails when run:
Traceback (most recent call last):
......
ValueError: No tables found
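That error is actually expected here: the Index page builds its tables with JavaScript after the page loads, so the raw HTML that requests receives contains no table elements for pd.read_html to parse. A quick check to confirm (a sketch; it should print 0, matching the "No tables found" error):

import requests

url = 'http://emweb.securities.eastmoney.com/PC_HSF10/ShareholderResearch/Index?type=web&code=sz000001'
html = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).text
print(html.count('<table'))  # no static tables: they are rendered client-side

This is why the PageAjax JSON endpoint used in the first answer works where read_html on the Index page does not.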
Here is one possible solution:
import requests
from bs4 import BeautifulSoup
import pandas as pd

def crawl_and_save(code, excel_path):
    # Build the URL for this stock code
    url = 'http://quote.eastmoney.com/stock/%s.html' % code
    # Fetch the page
    response = requests.get(url)
    # Extract the second table on the page and turn it into a DataFrame
    soup = BeautifulSoup(response.content, 'html.parser')
    data = []
    table = soup.findAll('table')[1]
    rows = table.findAll('tr')
    for tr in rows:
        cols = tr.findAll('td')
        row = []
        for td in cols:
            row.append(td.text.strip())
        data.append(row)
    table_data = pd.DataFrame(data)
    # Write the DataFrame to an Excel file (the context manager saves and closes it)
    with pd.ExcelWriter(excel_path) as writer:
        table_data.to_excel(writer, sheet_name=code)

def batch_crawl_and_save(codes, excel_folder):
    for code in codes:
        excel_path = '%s%s.xlsx' % (excel_folder, code)
        crawl_and_save(code, excel_path)

codes = ['600000', '600001', '600002']  # change to your list of stock codes
excel_folder = './data/'  # change to your output folder (must already exist)
batch_crawl_and_save(codes, excel_folder)
That is a simple solution; plenty of details will still need adjusting for your specific situation.
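One caveat worth adding (my assumption, not verified against the live page): quote.eastmoney.com pages may also render some tables client-side, so soup.findAll('table')[1] can raise an IndexError when fewer than two static tables come back. A minimal defensive sketch:

from bs4 import BeautifulSoup

def second_table_or_none(html):
    # Return the second table if the page has one, else None,
    # instead of letting an IndexError escape from findAll('table')[1].
    tables = BeautifulSoup(html, 'html.parser').findAll('table')
    return tables[1] if len(tables) > 1 else None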