How can a Python crawler scrape dynamically generated content?

I am trying to scrape the full list of Anhui attractions from Mafengwo (http://www.mafengwo.cn/jd/12719/gonglve.html), but I cannot get the li tags.

With BeautifulSoup, the ul comes back empty.

soup = BeautifulSoup(html, 'html.parser')  
print(soup.select('html body div#container div.row-allScenic div.wrapper div.bd ul.scenic-list '))

The result is:

[<ul class="scenic-list clearfix">
</ul>]

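In the browser, the ul contains the markup below. It is probably generated dynamically, because the ul is empty when I view the page source directly.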

    <li>
        <a href="/poi/9602.html" target="_blank" title="黄山风景区">
            <div class="img"><img src="http://b1-q.mafengwo.net/s13/M00/6E/FE/wKgEaVyFR3SAKchQAAJXQXSOpZc87.jpeg?imageMogr2%2Fthumbnail%2F%21192x130r%2Fgravity%2FCenter%2Fcrop%2F%21192x130%2Fquality%2F100" width="192" height="130"></div>
            <h3>黄山风景区</h3>
        </a>

    </li>
    <li>
        <a href="/poi/7730080.html" target="_blank" title="宏村">
            <div class="img"><img src="http://p1-q.mafengwo.net/s15/M00/E6/DF/CoUBGV5HaamAcXx3AAGlNmbI4_U76.jpeg?imageMogr2%2Fthumbnail%2F%21192x130r%2Fgravity%2FCenter%2Fcrop%2F%21192x130%2Fquality%2F100" width="192" height="130"></div>
            <h3>宏村</h3>
        </a>

    </li>
    <li>
        <a href="/poi/9684.html" target="_blank" title="西海大峡谷">
            <div class="img"><img src="http://b1-q.mafengwo.net/s14/M00/13/97/wKgE2l1ipPeAO6aYAATuez1Jq3U09.jpeg?imageMogr2%2Fthumbnail%2F%21192x130r%2Fgravity%2FCenter%2Fcrop%2F%21192x130%2Fquality%2F100" width="192" height="130"></div>
            <h3>西海大峡谷</h3>
        </a>

    </li>
    <li>
        <a href="/poi/6328735.html" target="_blank" title="西递">
            <div class="img"><img src="http://b1-q.mafengwo.net/s15/M00/3B/4B/CoUBGV2kNdqADjy0AAPBiWhgJBo736.jpg?imageMogr2%2Fthumbnail%2F%21192x130r%2Fgravity%2FCenter%2Fcrop%2F%21192x130%2Fquality%2F100" width="192" height="130"></div>
            <h3>西递</h3>
        </a>

    </li>
    <li>
        <a href="/poi/9720.html" target="_blank" title="屯溪老街">
            <div class="img"><img src="http://b1-q.mafengwo.net/s13/M00/B3/0D/wKgEaV2bMp6AMMdwAAQqIROm1GA735.jpg?imageMogr2%2Fthumbnail%2F%21192x130r%2Fgravity%2FCenter%2Fcrop%2F%21192x130%2Fquality%2F100" width="192" height="130"></div>
            <h3>屯溪老街</h3>
        </a>

    </li>
    <li>
        <a href="/poi/5426908.html" target="_blank" title="徽州古城">
            <div class="img"><img src="http://n1-q.mafengwo.net/s10/M00/4F/A7/wKgBZ1jrgESAHGHQAAHt-nVAMu051.jpeg?imageMogr2%2Fthumbnail%2F%21192x130r%2Fgravity%2FCenter%2Fcrop%2F%21192x130%2Fquality%2F100" width="192" height="130"></div>
            <h3>徽州古城</h3>
        </a>

    </li>
    <li>
        <a href="/poi/5426501.html" target="_blank" title="黄山翡翠谷景区">
            <div class="img"><img src="http://b1-q.mafengwo.net/s12/M00/60/C4/wKgED1xIMMeAL4quAAqdKm2SP-Q74.jpeg?imageMogr2%2Fthumbnail%2F%21192x130r%2Fgravity%2FCenter%2Fcrop%2F%21192x130%2Fquality%2F100" width="192" height="130"></div>
            <h3>黄山翡翠谷景区</h3>
        </a>

    </li>
    <li>
        <a href="/poi/9605.html" target="_blank" title="光明顶">
            <div class="img"><img src="http://p1-q.mafengwo.net/s12/M00/58/27/wKgED1vkGQOAI7zOAAYh6jFZne054.jpeg?imageMogr2%2Fthumbnail%2F%21192x130r%2Fgravity%2FCenter%2Fcrop%2F%21192x130%2Fquality%2F100" width="192" height="130"></div>
            <h3>光明顶</h3>
        </a>

    </li>
    <li>
        <a href="/poi/1548.html" target="_blank" title="月沼湖">
            <div class="img"><img src="http://p1-q.mafengwo.net/s17/M00/92/D4/CoUBXl-Np1iEZLaDAAAAADwBCO0947.jpg?imageMogr2%2Fthumbnail%2F%21192x130r%2Fgravity%2FCenter%2Fcrop%2F%21192x130%2Fquality%2F100" width="192" height="130"></div>
            <h3>月沼湖</h3>
        </a>

    </li>
    <li>
        <a href="/poi/9724.html" target="_blank" title="南湖">
            <div class="img"><img src="http://b1-q.mafengwo.net/s10/M00/2F/F2/wKgBZ1nty7uAPRz6AAT5d2JPkUw44.jpeg?imageMogr2%2Fthumbnail%2F%21192x130r%2Fgravity%2FCenter%2Fcrop%2F%21192x130%2Fquality%2F100" width="192" height="130"></div>
            <h3>南湖</h3>
        </a>

    </li>
    <li>
        <a href="/poi/6328738.html" target="_blank" title="木坑竹海">
            <div class="img"><img src="http://b1-q.mafengwo.net/s12/M00/89/29/wKgED1wPqW2AQO81AA5hRvN8lqU60.jpeg?imageMogr2%2Fthumbnail%2F%21192x130r%2Fgravity%2FCenter%2Fcrop%2F%21192x130%2Fquality%2F100" width="192" height="130"></div>
            <h3>木坑竹海</h3>
        </a>

    </li>
    <li>
        <a href="/poi/5429154.html" target="_blank" title="查济古镇">
            <div class="img"><img src="http://n1-q.mafengwo.net/s12/M00/73/A6/wKgED1uTK3iAJGhOAEVmsM4Yp5c20.jpeg?imageMogr2%2Fthumbnail%2F%21192x130r%2Fgravity%2FCenter%2Fcrop%2F%21192x130%2Fquality%2F100" width="192" height="130"></div>
            <h3>查济古镇</h3>
        </a>

    </li>
    <li>
        <a href="/poi/6625188.html" target="_blank" title="徽杭古道">
            <div class="img"><img src="http://n1-q.mafengwo.net/s12/M00/C1/45/wKgED1veKgeAWJimAB4yzt6mrKE05.jpeg?imageMogr2%2Fthumbnail%2F%21192x130r%2Fgravity%2FCenter%2Fcrop%2F%21192x130%2Fquality%2F100" width="192" height="130"></div>
            <h3>徽杭古道</h3>
        </a>

    </li>
    <li>
        <a href="/poi/5426678.html" target="_blank" title="三河古镇">
            <div class="img"><img src="http://p1-q.mafengwo.net/s12/M00/55/06/wKgED1xD5QKAAeOgAAyamhBQPlM35.jpeg?imageMogr2%2Fthumbnail%2F%21192x130r%2Fgravity%2FCenter%2Fcrop%2F%21192x130%2Fquality%2F100" width="192" height="130"></div>
            <h3>三河古镇</h3>
        </a>

    </li>
    <li>
        <a href="/poi/5426350.html" target="_blank" title="呈坎">
            <div class="img"><img src="http://n1-q.mafengwo.net/s10/M00/E0/CB/wKgBZ1t-zXeAEEM6AG0HFweCAxw84.jpeg?imageMogr2%2Fthumbnail%2F%21192x130r%2Fgravity%2FCenter%2Fcrop%2F%21192x130%2Fquality%2F100" width="192" height="130"></div>
            <h3>呈坎</h3>
        </a>

    </li>

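With webdriver I can get the text, but I do not know how to get the attribute values of the tags (this is the problem I currently need to solve).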

    text_class=browser.find_element_by_css_selector('.scenic-list.clearfix')
    text=text_class.text  # get the visible text
    print(text)

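Locating the element with XPath does not give me the information either; printing the result only shows the WebElement object.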

print(browser.find_element_by_xpath('//div[@class="row row-allScenic"]//div[@class="wrapper"]//div[@class="bd"]//ul[@class="scenic-list clearfix"]//li[1]'))

It returns:

<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="c4104180-ab74-44de-a274-620ffff68289", element="382f1abb-1795-48db-987d-80e5985cdef5")>

 

You can first use webdriver to get the HTML after the page has been updated dynamically, then hand that HTML to BeautifulSoup.

from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep

browser = webdriver.Chrome()
browser.get('http://www.mafengwo.cn/jd/12719/gonglve.html')
sleep(3)
html = browser.find_element_by_tag_name("html").get_attribute("outerHTML")
soup = BeautifulSoup(html, 'html.parser')
print(soup.select('html body div#container div.row-allScenic div.wrapper div.bd ul.scenic-list '))
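
A fixed sleep(3) can be too short on a slow connection and wasteful on a fast one. Below is a minimal sketch of the same idea with an explicit wait. It assumes Selenium 4, where the find_element_by_* helpers were deprecated and later removed in favor of find_element(By..., ...). The names rendered_html and count_list_items, the selector "ul.scenic-list li", and the 10-second timeout are my own choices, not something from the page. The Selenium imports sit inside the function so the small standard-library checker can be tried without a browser.

```python
from html.parser import HTMLParser

URL = "http://www.mafengwo.cn/jd/12719/gonglve.html"
LIST_ITEM = "ul.scenic-list li"  # assumed selector for the generated items


def rendered_html(url=URL, css=LIST_ITEM, timeout=10):
    """Open the page in Chrome, wait until `css` matches, return the DOM as HTML."""
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    browser = webdriver.Chrome()
    try:
        browser.get(url)
        # Raises TimeoutException if no matching element appears in time.
        WebDriverWait(browser, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css))
        )
        return browser.page_source
    finally:
        browser.quit()


class _LiCounter(HTMLParser):
    """Counts <li> start tags; a quick check, not a BeautifulSoup replacement."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.count += 1


def count_list_items(html):
    """Return how many <li> tags a piece of HTML contains."""
    parser = _LiCounter()
    parser.feed(html)
    return parser.count
```

The string returned by rendered_html() can be passed to BeautifulSoup(html, 'html.parser') exactly as in the code above. Calling count_list_items on the raw source and on the rendered HTML is one way to confirm that the items only exist after the scripts have run.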

 

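You need to find the target URL (the request that actually returns the data), fetch everything it returns, and then analyze that response.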

Not necessarily. You have to test and analyze it yourself.

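To read an attribute of an element with webdriver, use the .get_attribute("attribute name") method.

The attribute name can be outerHTML, innerHTML, id, value, or any other attribute or property of the DOM element.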
 

print(browser.find_element_by_xpath('//div[@class="row row-allScenic"]//div[@class="wrapper"]//div[@class="bd"]//ul[@class="scenic-list clearfix"]//li[1]').get_attribute("outerHTML"))
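
To get the link and title of every item rather than the outerHTML of the first one, the same method can be applied to each a element. This is a rough sketch, assuming Selenium 4 locators (find_elements with By.CSS_SELECTOR) and the selector "ul.scenic-list li a", which I inferred from the markup in the question. link_info, scenic_links and FakeLink are hypothetical names; FakeLink only stands in for a WebElement so the helper can be tried without a browser.

```python
def link_info(link):
    """Return the title and href of one <a> element.

    `link` is a Selenium WebElement, or anything else with get_attribute().
    """
    return {
        "title": link.get_attribute("title"),
        "href": link.get_attribute("href"),
    }


def scenic_links(browser):
    """Collect title/href for every <a> in the rendered scenic list."""
    from selenium.webdriver.common.by import By  # Selenium 4 locator style

    links = browser.find_elements(By.CSS_SELECTOR, "ul.scenic-list li a")
    return [link_info(link) for link in links]


class FakeLink:
    """Stand-in for a WebElement, used only to demonstrate link_info."""

    def __init__(self, **attrs):
        self.attrs = attrs

    def get_attribute(self, name):
        return self.attrs.get(name)


print(link_info(FakeLink(title="黄山风景区", href="/poi/9602.html")))
```

With a real browser, call scenic_links(browser) after the page has loaded. In that case href will usually come back as an absolute URL, since the browser resolves it against the page address.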

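You can look for the content you want in the Network panel of the browser's developer tools, find the request that returns it, and check that request's headers to get the URL. If the content is split into chunks (pages), the URLs will follow a pattern too.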
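
Once the request has been found in the Network panel, it can be replayed without a browser. The sketch below only shows the shape of that approach. The endpoint https://example.com/ajax/list, the parameter name page, and the assumption that the response body is JSON are all placeholders; the real URL, parameters and headers have to be copied from the request you find, and the site may require more headers than the one shown.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Placeholder: replace with the request URL copied from the Network panel.
ENDPOINT = "https://example.com/ajax/list"


def page_url(page, endpoint=ENDPOINT):
    """Build the URL for one page; the parameter name `page` is an assumption."""
    return endpoint + "?" + urlencode({"page": page})


def fetch_page(page):
    """Request one page and decode the body, assuming it is JSON."""
    request = Request(page_url(page), headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


# If the content is paged, the URLs differ only in the page number.
print([page_url(n) for n in range(1, 4)])
```

If the response turns out to contain an HTML fragment rather than structured data, that fragment can be handed to BeautifulSoup in the same way as a full page.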