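I've just started learning Python web scraping. When I scrape a Baidu Wenku document, I only get the first page; nothing from the second page onward is extracted.
The p tags on the first page look like this (see screenshot):
The p tags on the second and later pages are different (see screenshot):
The code is as follows: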
import requests
from bs4 import BeautifulSoup

# Browser-like request headers
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, compress',
    'Accept-Language': 'en-us;q=0.5,en;q=0.3',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0'
}


def getHTMLText(url):
    """Download the page and return its text, or '' if the request fails."""
    try:
        r = requests.get(url, headers=headers, timeout=10)
        r.raise_for_status()
        # Guess the encoding from the body so Chinese text decodes correctly
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ''


def fillList(ulist, html):
    """Append the text of every <p> tag in html to ulist and return it."""
    soup = BeautifulSoup(html, 'lxml')
    # print(soup)
    ads = soup.find_all('p')
    # print(ads)
    for ad in ads:
        abstract = ad.get_text()
        ulist.append(abstract)
    return ulist


url = 'https://wenku.baidu.com/view/fda4f37d905f804d2b160b4e767f5acfa1c783ed.html'
wenben = []
html = getHTMLText(url)
text = fillList(wenben, html)
print(text)
# Print the collected paragraphs as one continuous string
for t in text:
    print(t, end="")
I'd be very grateful to anyone who can help clear this up!
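Since the p tags reportedly differ between the first page and the later ones, it may help to check which p tags the downloaded HTML actually contains. The sketch below tallies p tags by their class attribute using only the standard library. The names `PTagCounter` and `count_p_classes` and the sample markup are made up for illustration; they are not taken from the Baidu Wenku page.

```python
from collections import Counter
from html.parser import HTMLParser


class PTagCounter(HTMLParser):
    """Tally <p> tags by their class attribute (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            # attrs is a list of (name, value) pairs; class may be absent
            cls = dict(attrs).get('class') or '(no class)'
            self.counts[cls] += 1


def count_p_classes(page_source):
    """Return a Counter mapping each <p> class value to its frequency."""
    parser = PTagCounter()
    parser.feed(page_source)
    parser.close()
    return parser.counts


# Illustrative markup only, not the real page structure
sample = '<p class="a">x</p><p class="a">y</p><p class="b">z</p><p>w</p>'
print(count_p_classes(sample))
```

Running `count_p_classes(html)` on the string returned by `getHTMLText(url)` would show whether the classes seen on later pages in the browser appear in the response at all. If they do not, the text of those pages is probably not in the initial HTML, and no change of parser or selector will find it there.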
Suggestion: try using XPath.
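A minimal sketch of the XPath suggestion, assuming `lxml` is installed (the code in the question already uses it as the BeautifulSoup parser). The function name `extract_paragraphs` and the sample markup are hypothetical; the `//p` expression simply selects every p element regardless of its class or nesting.

```python
from lxml import html as lxml_html


def extract_paragraphs(page_source):
    """Return the non-empty text of every <p> element, found via XPath."""
    # lxml raises an error on an empty document, so guard against ''
    if not page_source.strip():
        return []
    tree = lxml_html.fromstring(page_source)
    # text_content() also gathers text nested in child tags such as <span>
    texts = [p.text_content().strip() for p in tree.xpath('//p')]
    return [t for t in texts if t]


# Illustrative markup only, not the real page structure
sample = """
<html><body>
<p class="first">Page one text</p>
<div><p class="other"><span>Page two </span>text</p></div>
<p>   </p>
</body></html>
"""
print(extract_paragraphs(sample))
```

Note that XPath can only match elements present in the HTML that `requests` downloaded. `requests` does not execute JavaScript, so if the later pages are filled in by scripts after the page loads (an assumption about this site, not something confirmed in the thread), those paragraphs would have to be fetched some other way.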