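As the title says, I want to get the complete response to my request and scrape the content of a particular div tag. However, when I inspected the result with BeautifulSoup, I found that the content of that div had been left out.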
import requests
from lxml import etree
from bs4 import BeautifulSoup

# Headers shared by every request. 'method' is not an HTTP header, so it is dropped.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36 Edg/96.0.1054.34',
}
# Cookie string copied from the browser.
cookieStr = '_ga=GA1.2.1190422675.1612939709; sensorsdata2015jssdkcross={"distinct_id":"17d3d83d4df9bc-0950b2ac5a2dee8-561a1154-1327104-17d3d83d4e0e26","first_id":"","props":{"$latest_traffic_source_type":"直接流量","$latest_search_keyword":"未取到值_直接打开","$latest_referrer":""},"$device_id":"17d3d83d4df9bc-0950b2ac5a2dee8-561a1154-1327104-17d3d83d4e0e26"}; provider=xuetang; _gid=GA1.2.1824666357.1638002159; django_language=zh; JG_016f5b1907c3bc045f8f48de1_PV=1638008153767|1638009299317'
# Header values must be latin-1 encodable, so re-encode the non-ASCII characters.
cookieStr = cookieStr.encode("utf-8").decode("latin-1")
# Send the raw string as the Cookie header; cookies={'Cookie': ...} would create a cookie literally named "Cookie".
headers['Cookie'] = cookieStr

# Use the loop variable as the page number (the original always requested page 1).
for page_num in range(1, 52):
    url = "https://www.xuetangx.com/search?query=&org=&classify=1&type=&status=&page={}".format(page_num)
    res = requests.get(url, headers=headers).text
    dom = etree.HTML(res)
    # Name the parser explicitly to avoid the "no parser specified" warning.
    soup = BeautifulSoup(res, 'lxml')
    print(soup.prettify())
    print(soup.select('div'))
    for list_num in range(2, 10):
        result = []
        # result.append(dom.xpath('//*[@id="app"]/div/div[2]/div[1]/div[1]/div[2]/div[1]/div[{}]/div[2]/p[1]/span[1]/text()'.format(list_num)))
        # result.append(dom.xpath('/html/body/div[1]/div/div[2]/div[1]/div[1]/div[2]/div[1]/div[{}]/div[2]/p[2]/span[1]'.format(list_num)))
        # result.append(dom.xpath('/html/body/div[1]/div/div[2]/div[1]/div[1]/div[2]/div[1]/div[{}]/div[2]/p[2]/span[2]/span'.format(list_num)))
        # result.append(dom.xpath('/html/body/div[1]/div/div[2]/div[1]/div[1]/div[2]/div[1]/div[{}]/div[2]/p[2]/span[3]/text()'.format(list_num)))
        print(result)
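I tried adding headers and cookies to the get call, without success. The problem is not that nothing is returned, but that the returned result is incomplete. I hope this problem can be solved.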
The data on this page is loaded dynamically, so you need to fetch it with a POST request to this URL:
https://www.xuetangx.com/api/v1/lms/get_product_list/?page=1
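A minimal sketch of such a POST request, using only the standard library so it runs without extra packages (with requests the equivalent call would be requests.post(url, json=payload, headers=headers)). The helper names build_request and fetch_page are made up for illustration, and the JSON body is an assumption that simply mirrors the query parameters of the search URL in the question; the real field names should be copied from the XHR request shown in the browser's Network tab.

```python
import json
import urllib.request

# Endpoint quoted in the answer above; only the page number varies.
API_URL = "https://www.xuetangx.com/api/v1/lms/get_product_list/?page={}"

# ASSUMPTION: this body just mirrors the search URL's query parameters.
# Replace it with the payload the browser actually sends.
EXAMPLE_PAYLOAD = {"query": "", "org": "", "classify": "1", "type": "", "status": ""}


def build_request(page, payload):
    """Build (but do not send) a JSON POST request for one page of results."""
    body = json.dumps(payload).encode("utf-8")
    headers = {
        "User-Agent": "Mozilla/5.0",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(
        API_URL.format(page), data=body, headers=headers, method="POST"
    )


def fetch_page(page, payload):
    """Send the request and decode the JSON response (needs network access)."""
    with urllib.request.urlopen(build_request(page, payload), timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling fetch_page(1, EXAMPLE_PAYLOAD) and printing the returned dictionary should show where the course list sits in the response, if the server accepts the payload; looping page over range(1, 52) would then replace the HTML scraping loop.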
Don't use Inspect Element. Use View Page Source instead, and check whether the data you want is actually present in the HTML file.
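The same check can be done from Python. The sketch below tests whether a piece of text you can see in the browser appears in the visible text of the raw HTML, ignoring script and style blocks. The function name text_in_static_html and both sample pages are invented for illustration; in practice you would pass in the .text of the requests.get response.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def text_in_static_html(html, marker):
    """Return True if marker occurs in the page's static, visible text."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return any(marker in chunk for chunk in parser.chunks)


# Made-up samples: one page ships the text in its HTML, the other only
# ships an empty container that JavaScript would fill in later.
static_page = '<html><body><div id="app"><p>Python 101</p></div></body></html>'
shell_page = '<html><body><div id="app"></div><script>var t = "Python 101";</script></body></html>'

print(text_in_static_html(static_page, "Python 101"))
print(text_in_static_html(shell_page, "Python 101"))
```

If the function returns False for text that is clearly visible in the browser, the content is most likely filled in by JavaScript after the page loads, and the Network tab is the place to look for the request that supplies it.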
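Generally speaking, many pages load their data through XHR requests. requests.get only downloads the initial HTML and does not execute JavaScript, so a plain requests.get of the page cannot retrieve the data those XHR calls load.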