Target site: https://sc.chinaz.com/yinxiao/
Requirements:
1. Page through the listing and scrape each music item's name and link
2. Save the results to a CSV file
import csv

import requests
from lxml import etree

start = int(input('Enter the start page: '))
end = int(input('Enter the end page: '))
lis = []
for k in range(start, end + 1):
    url = f'https://sc.chinaz.com/yinxiao/index_{k}.html'  # adjust this
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'}
    response = requests.get(url=url, headers=headers)
    response.encoding = 'utf-8'
    data = response.text
    # print(data)
    html = etree.HTML(data)
    # every <a> under the "right-head" div is one item on the page
    div_tag = html.xpath('//div[@class="right-head"]/a')
    for a in div_tag:
        name = a.xpath('./p/text()')
        href = a.xpath('./@href')
        # print(name, href)
        name = [s.strip() for s in name]
        for i in zip(name, href):
            dic = {}
            dic['name'] = i[0]
            # the domain is prepended to whatever the href attribute contains
            dic['href'] = 'https://sc.chinaz.com' + i[1]
            lis.append(dic)
print(lis)
# write all collected rows to the CSV file
with open('音效.csv', 'w', encoding='utf-8', newline='') as f:
    write = csv.DictWriter(f, fieldnames=['name', 'href'])
    write.writeheader()
    write.writerows(lis)
The output contains an extra 'https:' in every link:
{'name': '欢快积极向上背景音乐', 'href': 'https://sc.chinaz.comhttps:/yinxiao/220712528962.htm'},
{'name': '闹钟计时器倒计时', 'href': 'https://sc.chinaz.comhttps:/yinxiao/220712517521.htm'},
{'name': '寺庙钟声MP3音效', 'href': 'https://sc.chinaz.comhttps:/yinxiao/220712498280.htm'},
{'name': '汽车加速离开的声音', 'href': 'https://sc.chinaz.comhttps:/yinxiao/220711552202.htm'},
{'name': '用力关门的声音MP3', 'href': 'https://sc.chinaz.comhttps:/yinxiao/220711517571.htm'},
{'name': '风铃清脆的响声MP3', 'href': 'https://sc.chinaz.comhttps:/yinxiao/220711485070.htm'},
{'name': '动感鼓点节奏音乐', 'href': 'https://sc.chinaz.comhttps:/yinxiao/220710448212.htm'}
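The code itself is fine. The site uses domain forwarding, and as the output shows, the extracted href already begins with https:/, so prepending the domain produces the doubled prefix. Normalize every link to the form
https://sc.chinaz.com/yinxiao/220731521332.htm
and it works. Put simply, just drop the https:/ in the middle.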
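
The fix described above can be sketched as a small helper. This is only an illustration: normalize_href and BASE are hypothetical names that are not part of the original script, and the only href shape actually confirmed by the output above is 'https:/yinxiao/<id>.htm'. The other branches are defensive assumptions about what the site might return.

```python
BASE = 'https://sc.chinaz.com'  # domain taken from the original script


def normalize_href(href):
    """Hypothetical helper: return a link shaped like
    https://sc.chinaz.com/yinxiao/<id>.htm."""
    href = href.strip()
    # Assumption: an href that is already a full URL is left untouched.
    if href.startswith(('http://', 'https://')):
        return href
    # Assumption: a protocol-relative href only needs a scheme.
    if href.startswith('//'):
        return 'https:' + href
    # The case seen in the output: 'https:/yinxiao/...' -> '/yinxiao/...'
    if href.startswith('https:/'):
        href = href[len('https:'):]
    # Make sure exactly one slash separates the domain and the path.
    if not href.startswith('/'):
        href = '/' + href
    return BASE + href


print(normalize_href('https:/yinxiao/220712528962.htm'))
```

In the scraping loop, the line that builds dic['href'] would then call normalize_href(i[1]) instead of concatenating the domain directly.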