import requests
from bs4 import BeautifulSoup

url = 'https://699pic.com/qingnianshenghuo.html'
resp = requests.get(url)
resp.encoding = 'utf-8'
main_page = BeautifulSoup(resp.text, 'html.parser')
alist = main_page.find_all("div", class_="photo-tag")
child_href_list = []
for a in alist:
    w = a.find("a")
    hrefs = "https:" + w.get("href")
    child_href_list.append(hrefs)
    for href in child_href_list:
        child_page_resp = requests.get(href)
        child_page_resp.encoding = "utf_8"
        child_page_text = child_page_resp.text
        child_page = BeautifulSoup(child_page_text, "html.parser")
        p = child_page.find("a", class_="photo-img-link")
        img = p.find("img")
        print("https:" + img.get("src"))
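The main problem is that the scraped images are repeated far too often, especially the first image, and the repeats keep cycling with no obvious pattern...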
for href in child_href_list:
    child_page_resp = requests.get(href)
    child_page_resp.encoding = "utf_8"
    child_page_text = child_page_resp.text
    child_page = BeautifulSoup(child_page_text, "html.parser")
    p = child_page.find("a", class_="photo-img-link")
    img = p.find("img")
    print("https:" + img.get("src"))
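Don't put this block inside the `for a in alist:` loop. Move it outside, so that it runs once after that loop has finished collecting the links, and the problem goes away.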
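To see why the nesting causes the repeats, here is a minimal sketch that needs no network access. The link names and the two helper functions are made up for illustration; they only mimic the shape of the loops in the question and are not part of the original code.

```python
from collections import Counter


def nested_version(links):
    """Mimics the question: the visiting loop sits inside the collecting loop."""
    collected = []
    visited = []
    for link in links:
        collected.append(link)
        # Runs once per outer iteration, over everything collected so far.
        for href in collected:
            visited.append(href)
    return visited


def flat_version(links):
    """Mimics the fix: collect every link first, then visit each one once."""
    collected = []
    for link in links:
        collected.append(link)
    visited = []
    for href in collected:
        visited.append(href)
    return visited


links = ["page-1", "page-2", "page-3"]  # hypothetical detail-page links
print(Counter(nested_version(links)))
# Counter({'page-1': 3, 'page-2': 2, 'page-3': 1})
print(Counter(flat_version(links)))
# Counter({'page-1': 1, 'page-2': 1, 'page-3': 1})
```

With three links, the nested version visits the first link three times, the second twice, and the third once. That matches the symptom in the question, where the first image repeats the most.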
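I looked at your code above and the image path is wrong: there is no img tag under the `div class="photo-tag"` element. I went through the page again and found that the img tag sits under the a tag inside the `div class="list"` element. Below is the code I wrote myself, which you can use as a reference.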
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests, os
from lxml import etree
from urllib.request import urlretrieve
from time import sleep

start_url = 'https://699pic.com/qingnianshenghuo.html'
headers = {
    "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    # The standard header name is "Referer"; "Refer" is not recognised by servers.
    "Referer": start_url,
    "Host": "699pic.com",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
}
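

def getImageUrl(url):
    # Fetch the listing page and return its HTML as text.
    session = requests.session()
    # Merge the custom headers into the session defaults instead of replacing them.
    session.headers.update(headers)
    html = session.get(url).content.decode('utf-8')
    return html


def parseHtml(html):
    res = etree.HTML(html)
    # The img tags sit under the a tags inside div class="list".
    data = res.xpath('//div[@class="list"]/a/img')
    # Download the images
    for item in data:
        # The real image address is in data-original and is protocol-relative.
        srcUrl = "http:" + item.xpath('./@data-original')[0]
        # File name: the image title plus the extension taken from the URL.
        name = item.xpath("./@title")[0] + os.path.splitext(srcUrl)[1]
        urlretrieve(srcUrl, name)
        sleep(0.5)  # short pause between downloads


if __name__ == '__main__':
    html = getImageUrl(url=start_url)
    parseHtml(html)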
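One thing that may need care in the code above is the file name, which is built directly from the `title` attribute and the URL. The sketch below is only a suggestion under stated assumptions: `build_image_url`, `build_filename`, the example host `img.example.com`, and the `.jpg` fallback are all hypothetical and were not checked against the actual site. It shows one way to add the scheme to a protocol-relative address and to keep characters such as `/` or `:` out of the file name.

```python
import os
import re
from urllib.parse import urlsplit


def build_image_url(src, scheme="https"):
    """Prefix a protocol-relative URL such as //host/a.jpg with a scheme."""
    if src.startswith("//"):
        return f"{scheme}:{src}"
    return src


def build_filename(title, image_url, default_ext=".jpg"):
    """Combine a cleaned title with the extension taken from the URL path."""
    # Use only the path so a query string does not end up in the extension.
    ext = os.path.splitext(urlsplit(image_url).path)[1] or default_ext
    # Replace characters that Windows does not allow in file names.
    safe_title = re.sub(r'[\\/:*?"<>|]', "_", title).strip() or "image"
    return safe_title + ext


url = build_image_url("//img.example.com/photo/1234.jpg?x=1")  # hypothetical URL
print(url)                               # https://img.example.com/photo/1234.jpg?x=1
print(build_filename("a/b: test", url))  # a_b_ test.jpg
```

In `parseHtml`, the two helpers could stand in for the lines that build `srcUrl` and `name`, if the titles on the page turn out to contain such characters.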