How do I scrape the second and later pages of a Baidu Wenku document?

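I've just started learning web scraping with Python. When I scrape a Baidu Wenku document I only get the first page; nothing from the second page onward comes back.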

The p tags on the first page look like this:

img

The p tags on the second and later pages are different, as shown here:

img

Here is my code:

import requests
from bs4 import BeautifulSoup

headers =  {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, compress',
    'Accept-Language': 'en-us;q=0.5,en;q=0.3',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0'
}

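def getHTMLText(url):
    """Fetch the page and return its text, or '' if the request fails."""
    try:
        r = requests.get(url, headers=headers, timeout=10)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ''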

def fillList(ulist, html):
    """Append the text of every <p> tag in html to ulist and return it."""
    soup = BeautifulSoup(html, 'lxml')
    # Collect every <p> element present in the fetched HTML
    for p in soup.find_all('p'):
        ulist.append(p.get_text())
    return ulist

url = 'https://wenku.baidu.com/view/fda4f37d905f804d2b160b4e767f5acfa1c783ed.html'
wenben = []
html = getHTMLText(url)
text = fillList(wenben, html)
print(text)
# Print all paragraphs as one continuous string
for t in text:
    print(t, end="")


I'd be very grateful if someone more experienced could help clear this up!


I'd suggest giving XPath a try.
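To make that a bit more concrete, here is a minimal sketch of selecting p tags with XPath through lxml (which your code already uses as the BeautifulSoup parser). The markup in SAMPLE_HTML, the id "page-2", the class names, and the helper extract_paragraphs are all made up for illustration, since I can't see the real structure from your screenshots; swap in whatever the second-page tags actually look like.

```python
from lxml import html as lxml_html

# Hypothetical markup standing in for the real page.
# The ids and class names here are assumptions, not Baidu Wenku's actual ones.
SAMPLE_HTML = """
<html><body>
  <div id="page-1">
    <p class="txt">First page, line one.</p>
    <p class="txt">First page, line two.</p>
  </div>
  <div id="page-2">
    <p class="reader-word-layer">Second page, line one.</p>
  </div>
</body></html>
"""


def extract_paragraphs(page_source, xpath_expr="//p"):
    """Return the stripped text of every element matched by xpath_expr."""
    tree = lxml_html.fromstring(page_source)
    texts = []
    for node in tree.xpath(xpath_expr):
        text = node.text_content().strip()
        if text:  # skip empty paragraphs
            texts.append(text)
    return texts


# Every <p> in the document, regardless of class
all_text = extract_paragraphs(SAMPLE_HTML)

# Only the <p> tags inside the (hypothetical) second-page container
second_page = extract_paragraphs(
    SAMPLE_HTML,
    '//div[@id="page-2"]//p[contains(@class, "reader-word-layer")]',
)

print(all_text)
print(second_page)
```

In your script you would pass the string returned by getHTMLText(url) in place of SAMPLE_HTML. One caveat: XPath can only match what is actually in that string. If the later pages are filled in by JavaScript after the first response (I haven't verified this for this site), they won't be in r.text no matter which selector you use. A quick check is to search the returned HTML for a sentence you can see on page two.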