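I've recently been learning web scraping, but the output of my code looks very wrong. Could someone experienced take a look and tell me what the problem is? Thanks.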

import requests
from bs4 import BeautifulSoup

url = 'https://699pic.com/qingnianshenghuo.html'

resp = requests.get(url)
resp.encoding = 'utf-8'

main_page = BeautifulSoup(resp.text, 'html.parser')

alist = main_page.find_all("div", class_="photo-tag")
child_href_list = []
for a in alist:
    w = a.find("a")

    hrefs = "https:" + w.get("href")
    child_href_list.append(hrefs)

    for href in child_href_list:
        child_page_resp = requests.get(href)
        child_page_resp.encoding = "utf_8"
        child_page_text = child_page_resp.text
        child_page = BeautifulSoup(child_page_text, "html.parser")
        p = child_page.find("a", class_="photo-img-link")

        img = p.find("img")

        print("https:" + img.get("src"))

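The main problem is that the scraped images repeat far too much, especially the first one, which keeps coming back in a loop with no obvious pattern.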

for href in child_href_list:
    child_page_resp = requests.get(href)
    child_page_resp.encoding = "utf_8"
    child_page_text = child_page_resp.text
    child_page = BeautifulSoup(child_page_text, "html.parser")
    p = child_page.find("a", class_="photo-img-link")

    img = p.find("img")

    print("https:" + img.get("src"))

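Don't put this block inside the `for a in alist:` loop; moving it outside is enough. While it sits inside, every pass of the outer loop re-requests all the links collected so far, so the first link is fetched on every pass and its image is printed most often.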
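The effect of the loop placement can be reproduced without any network access. The following is a minimal sketch, not the original script: the names `crawl_nested` and `crawl_flat` and the placeholder links are made up for illustration, and appending to a list stands in for `requests.get`.

```python
def crawl_nested(links):
    """Mimic the question's layout: the fetch loop sits inside the collecting loop."""
    collected, fetched = [], []
    for link in links:
        collected.append(link)
        # This inner loop runs again on every outer pass,
        # over everything collected so far.
        for href in collected:
            fetched.append(href)  # stands in for requests.get(href)
    return fetched


def crawl_flat(links):
    """Collect every link first, then fetch each one exactly once."""
    collected = []
    for link in links:
        collected.append(link)
    fetched = []
    for href in collected:
        fetched.append(href)  # stands in for requests.get(href)
    return fetched


links = ["page1", "page2", "page3"]  # hypothetical placeholder links
print(crawl_nested(links))  # ['page1', 'page1', 'page2', 'page1', 'page2', 'page3']
print(crawl_flat(links))    # ['page1', 'page2', 'page3']
```

With n links the nested layout makes n(n+1)/2 requests instead of n, and the first link is requested n times, which matches the symptom described in the question.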

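I looked at your code above, and the path used to find the images is wrong: there is no img tag under the `div class="photo-tag"` tag. I went through the page again and found that the img tag sits under the a tag inside `div class="list"`. Below is the code I wrote myself, which you can use as a reference.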

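#!/usr/bin/env python
# -*- coding: utf-8 -*-
import os
import re
from time import sleep
from urllib.request import urlretrieve

import requests
from lxml import etree

start_url = 'https://699pic.com/qingnianshenghuo.html'

headers = {
    "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Referer": start_url,  # the HTTP header is spelled "Referer", not "Refer"
    "Host": "699pic.com",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
}

def getImageUrl(url):
    """Fetch the listing page and return its HTML as text."""
    session = requests.session()
    # update() keeps the session's default headers and adds ours on top
    session.headers.update(headers)

    html = session.get(url).content.decode('utf-8')
    return html

def parseHtml(html):
    """Find the images on the listing page and download them to the current directory."""
    res = etree.HTML(html)
    data = res.xpath('//div[@class="list"]/a/img')

    # Download the images
    for item in data:
        src = item.xpath('./@data-original')
        title = item.xpath('./@title')
        if not src or not title:
            continue  # skip img tags that lack a lazy-load URL or a title
        srcUrl = "http:" + src[0]
        # Replace characters that are not allowed in file names on some systems
        safeTitle = re.sub(r'[\\/:*?"<>|]', '_', title[0])
        name = safeTitle + os.path.splitext(srcUrl)[1]
        urlretrieve(srcUrl, name)
        sleep(0.5)  # pause between downloads to go easy on the server

if __name__ == '__main__':
    html = getImageUrl(url=start_url)
    parseHtml(html)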
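
The selection rule used above (an img inside an a inside `div class="list"`) can be checked offline before hitting the site. The sketch below swaps lxml for the standard library's `html.parser` so it runs without extra packages. The `SAMPLE` fragment, the `example.com` URLs, and the names `ListImageParser` and `extract_list_images` are all invented for illustration; the real page's markup is only assumed to follow the structure described in this answer.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


class ListImageParser(HTMLParser):
    """Collect URLs of <img> tags whose parent is an <a> directly under <div class="list">."""

    def __init__(self):
        super().__init__()
        self.stack = []       # currently open tags as (tag, class string) pairs
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and self._inside_list_link():
            # Lazy-loaded pages often keep the real URL in data-original.
            src = attrs.get("data-original") or attrs.get("src")
            if src:
                if src.startswith("//"):
                    src = "https:" + src  # protocol-relative URL
                self.image_urls.append(src)
        if tag not in VOID_TAGS:
            self.stack.append((tag, attrs.get("class") or ""))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, if there is one.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def _inside_list_link(self):
        if len(self.stack) < 2:
            return False
        (parent_tag, parent_class), (tag, _) = self.stack[-2], self.stack[-1]
        return tag == "a" and parent_tag == "div" and "list" in parent_class.split()


def extract_list_images(html):
    parser = ListImageParser()
    parser.feed(html)
    return parser.image_urls


# Hypothetical fragment: one tag block without images, two list blocks with images.
SAMPLE = """
<div class="photo-tag"><a href="//example.com/tag/1">tag</a></div>
<div class="list">
  <a href="//example.com/detail/1"><img data-original="//img.example.com/1.jpg" title="one"></a>
</div>
<div class="list">
  <a href="//example.com/detail/2"><img data-original="//img.example.com/2.jpg" title="two"></a>
</div>
"""

print(extract_list_images(SAMPLE))
```

Running the same check against a saved copy of the real page is a quick way to confirm which container actually holds the images before writing the download loop.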

 
