I'm scraping one of the Douban Read subpages, but the returned content looks wrong: it differs from the page source I see in the browser, so I can't parse the page. What is going on?
hash_url = 'https://read.douban.com/category/1?sort=hot&page=1'
resp = requests.get(hash_url, headers=headers)
print(resp.text)
Printing it produces output such as:
Ark.kindTree = [{"children": [{"children": [], "id": 501, "name": "\u8a00\u60c5\u5c0f\u8bf4"}, {"children": [], "id": 532, "name": "\u5973\u6027\u5c0f\u8bf4"}, {"children": [], "id": 508, "name": "\u60ac\u7591\u5c0f\u8bf4"}, {"children": [], "id": 506, "name": "\u5e7b\u60f3\u5c0f\u8bf4"}, {"children": [], "id": 505, "name": "\u79d1\u5e7b\u5c0f\u8bf4"},
and so on. Is this the site's anti-scraping policy at work?
Do I need to use this request instead?
import requests
import pandas as pd
headers = {
'Referer': 'https://read.douban.com/category/1?sort=hot&page=1',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36',
}
json_data = {
'sort': 'hot',
'page': 1,
'kind': 1,
'query': '\n query getFilterWorksList($works_ids: [ID!]) {\n worksList(worksIds: $works_ids) {\n \n \n title\n cover(useSmall: false)\n url\n isBundle\n coverLabel(preferVip: true)\n \n \n url\n title\n\n \n author {\n name\n url\n }\n origAuthor {\n name\n url\n }\n translator {\n name\n url\n }\n\n \n abstract\n authorHighlight\n editorHighlight\n\n \n isOrigin\n kinds {\n \n name @skip(if: true)\n shortName @include(if: true)\n id\n \n }\n ... on WorksBase @include(if: true) {\n wordCount\n wordCountUnit\n }\n ... on WorksBase @include(if: false) {\n inLibraryCount\n }\n ... on WorksBase @include(if: false) {\n \n isEssay\n \n ... on EssayWorks {\n favorCount\n }\n \n \n \n averageRating\n ratingCount\n url\n isColumn\n isFinished\n \n \n \n }\n ... on EbookWorks @include(if: false) {\n \n ... on EbookWorks {\n book {\n url\n averageRating\n ratingCount\n }\n }\n \n }\n ... on WorksBase @include(if: false) {\n isColumn\n isEssay\n onSaleTime\n ... on ColumnWorks {\n updateTime\n }\n }\n ... on WorksBase @include(if: true) {\n isColumn\n ... on ColumnWorks {\n isFinished\n }\n }\n ... on EssayWorks {\n essayActivityData {\n \n title\n uri\n tag {\n name\n color\n background\n icon2x\n icon3x\n iconSize {\n height\n }\n iconPosition {\n x y\n }\n }\n \n }\n }\n highlightTags {\n name\n }\n ... on WorksBase @include(if: false) {\n fanfiction {\n tags {\n id\n name\n url\n }\n }\n }\n \n \n ... on WorksBase {\n copyrightInfo {\n newlyAdapted\n newlyPublished\n adaptedName\n publishedName\n }\n }\n\n isInLibrary\n ... on WorksBase @include(if: false) {\n \n fixedPrice\n salesPrice\n isRebate\n \n }\n ... on EbookWorks {\n \n fixedPrice\n salesPrice\n isRebate\n \n }\n ... on WorksBase @include(if: true) {\n ... on EbookWorks {\n id\n isPurchased\n isInWishlist\n }\n }\n ... on WorksBase @include(if: false) {\n fanfiction {\n fandoms {\n title\n url\n }\n }\n }\n ... on WorksBase @include(if: false) {\n fanfiction {\n kudoCount\n }\n }\n \n id\n isOrigin\n isEssay\n }\n }\n ',
'variables': {},
}
response = requests.post('https://read.douban.com/j/kind/', headers=headers, json=json_data)
data = pd.DataFrame(response.json()['list'])
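A quick way to sanity-check what this endpoint returns before building the DataFrame (this assumes the response is JSON and that a 'list' key is present; the actual schema may differ):
# Inspect the JSON payload before assuming a 'list' key exists
payload = response.json()
print(response.status_code)   # expect 200
print(list(payload.keys()))   # confirm 'list' is among the keys
data = pd.DataFrame(payload['list'])
print(data.head())            # preview the first few works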
【The following answer was generated by GPT】
Judging from the code and output you provided, the problem is not caused by an anti-scraping policy. What comes back is not the rendered page HTML but a block of JavaScript that carries the data, which indicates that the target page loads its content dynamically via JavaScript rather than embedding it directly in the static HTML. We therefore need to actually execute that JavaScript to obtain the dynamically loaded content.
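Incidentally, the \uXXXX sequences in your output are standard JSON Unicode escapes rather than deliberate obfuscation; decoding one shows an ordinary category name. A minimal check:
import json

# The \uXXXX sequences are plain JSON Unicode escapes; this one decodes
# to the category name "言情小说" (romance fiction).
print(json.loads(r'"\u8a00\u60c5\u5c0f\u8bf4"'))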
Once the page has rendered, the complete HTML can be retrieved through the WebDriver's page_source attribute. Here is a concrete code example:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Create a Chrome WebDriver instance (Selenium 4 takes the driver path via Service)
driver = webdriver.Chrome(service=Service('/path/to/chromedriver'))

# Open the target page
hash_url = 'https://read.douban.com/category/1?sort=hot&page=1'
driver.get(hash_url)

# Wait for the page to finish loading.
# You can wait for a specific element to appear to decide that loading is done;
# here the page title is used as an example.
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.XPATH, "//title")))

# Get the page source
page_source = driver.page_source

# Print the page source
print(page_source)

# Close the WebDriver
driver.quit()
The code above uses the Selenium library to drive a real browser, waits for the page to finish loading, and then grabs the page source. Running it gives you the fully rendered HTML, which you can then parse.
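For example, the rendered HTML can be handed to a parser such as BeautifulSoup (a minimal sketch; 'h2 a' is a hypothetical selector, so inspect the rendered page to find the element that really wraps each book title):
from bs4 import BeautifulSoup

soup = BeautifulSoup(page_source, 'html.parser')
# 'h2 a' is a hypothetical selector; replace it with whatever element
# actually holds each book title on the rendered page.
for link in soup.select('h2 a')[:10]:
    print(link.get_text(strip=True), link.get('href'))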
Note that scraping with Selenium adds page-load time and consumes more system resources. If the target page loads slowly or involves a large amount of data, add appropriate waits or optimize the code to improve efficiency.
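For instance, instead of waiting only for the <title> element, you can wait until the list items themselves have rendered ('.works-item' below is again a hypothetical selector; check the live page for the real one):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Block for up to 15 seconds until at least one (hypothetical) book entry
# appears in the DOM, then read driver.page_source as before.
WebDriverWait(driver, 15).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.works-item'))
)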
Hope this solution helps. If you have any other questions, feel free to ask.