Unexpected response when scraping a Douban Read page

I'm scraping one of Douban Read's category sub-pages, but the returned content looks wrong: it doesn't match the page source I see in the browser, so I can't parse the page. What is going on?
hash_url = 'https://read.douban.com/category/1?sort=hot&page=1'
resp = requests.get(hash_url, headers=headers)
print(resp.text)
The printed output looks like:
Ark.kindTree = [{"children": [{"children": [], "id": 501, "name": "\u8a00\u60c5\u5c0f\u8bf4"}, {"children": [], "id": 532, "name": "\u5973\u6027\u5c0f\u8bf4"}, {"children": [], "id": 508, "name": "\u60ac\u7591\u5c0f\u8bf4"}, {"children": [], "id": 506, "name": "\u5e7b\u60f3\u5c0f\u8bf4"}, {"children": [], "id": 505, "name": "\u79d1\u5e7b\u5c0f\u8bf4"},
and so on. Is this the site's anti-scraping policy at work?


Is this what you need?


import requests
import pandas as pd
headers = {
    'Referer': 'https://read.douban.com/category/1?sort=hot&page=1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36',
}

json_data = {
    'sort': 'hot',
    'page': 1,
    'kind': 1,
    'query': '\n    query getFilterWorksList($works_ids: [ID!]) {\n      worksList(worksIds: $works_ids) {\n        \n    \n    title\n    cover(useSmall: false)\n    url\n    isBundle\n    coverLabel(preferVip: true)\n  \n    \n  url\n  title\n\n    \n  author {\n    name\n    url\n  }\n  origAuthor {\n    name\n    url\n  }\n  translator {\n    name\n    url\n  }\n\n    \n  abstract\n  authorHighlight\n  editorHighlight\n\n    \n    isOrigin\n    kinds {\n      \n    name @skip(if: true)\n    shortName @include(if: true)\n    id\n  \n    }\n    ... on WorksBase @include(if: true) {\n      wordCount\n      wordCountUnit\n    }\n    ... on WorksBase @include(if: false) {\n      inLibraryCount\n    }\n    ... on WorksBase @include(if: false) {\n      \n    isEssay\n    \n    ... on EssayWorks {\n      favorCount\n    }\n  \n    \n    \n    averageRating\n    ratingCount\n    url\n    isColumn\n    isFinished\n  \n  \n  \n    }\n    ... on EbookWorks @include(if: false) {\n      \n    ... on EbookWorks {\n      book {\n        url\n        averageRating\n        ratingCount\n      }\n    }\n  \n    }\n    ... on WorksBase @include(if: false) {\n      isColumn\n      isEssay\n      onSaleTime\n      ... on ColumnWorks {\n        updateTime\n      }\n    }\n    ... on WorksBase @include(if: true) {\n      isColumn\n      ... on ColumnWorks {\n        isFinished\n      }\n    }\n    ... on EssayWorks {\n      essayActivityData {\n        \n    title\n    uri\n    tag {\n      name\n      color\n      background\n      icon2x\n      icon3x\n      iconSize {\n        height\n      }\n      iconPosition {\n        x y\n      }\n    }\n  \n      }\n    }\n    highlightTags {\n      name\n    }\n    ... on WorksBase @include(if: false) {\n      fanfiction {\n        tags {\n          id\n          name\n          url\n        }\n      }\n    }\n  \n    \n  ... on WorksBase {\n    copyrightInfo {\n      newlyAdapted\n      newlyPublished\n      adaptedName\n      publishedName\n    }\n  }\n\n    isInLibrary\n    ... on WorksBase @include(if: false) {\n      \n    fixedPrice\n    salesPrice\n    isRebate\n  \n    }\n    ... on EbookWorks {\n      \n    fixedPrice\n    salesPrice\n    isRebate\n  \n    }\n    ... on WorksBase @include(if: true) {\n      ... on EbookWorks {\n        id\n        isPurchased\n        isInWishlist\n      }\n    }\n    ... on WorksBase @include(if: false) {\n      fanfiction {\n        fandoms {\n          title\n          url\n        }\n      }\n    }\n    ... on WorksBase @include(if: false) {\n      fanfiction {\n        kudoCount\n      }\n    }\n  \n        id\n        isOrigin\n        isEssay\n      }\n    }\n  ',
    'variables': {},
}

response = requests.post('https://read.douban.com/j/kind/', headers=headers, json=json_data)
data = pd.DataFrame(response.json()['list'])
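
If the request goes through, it is worth inspecting what actually came back before trusting response.json()['list']; the 'list' key here is taken from the snippet above, not from any documented schema. A minimal sanity check, continuing from the code above:

response.raise_for_status()             # fail fast on HTTP errors
payload = response.json()
print(list(payload.keys()))             # confirm the 'list' key is really there
print(pd.DataFrame(payload.get('list', [])).head())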


【The following answer was generated by GPT】

Problem Analysis

Judging from the code and the output, this is not caused by an anti-scraping policy. What requests.get returns is the page's real HTML, but that HTML mostly contains bootstrap JavaScript (such as the Ark.kindTree = [...] assignment) rather than the rendered book list; the list itself is assembled in the browser by JavaScript (for example via the POST to /j/kind/ shown above). The static source therefore never matches the DOM you see in the browser, and an HTML parser finds nothing to extract. To obtain the rendered content, the JavaScript has to be executed.
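
Incidentally, the \uXXXX sequences in the output are ordinary JSON unicode escapes (\u8a00\u60c5\u5c0f\u8bf4 decodes to 言情小说), so the embedded data itself is readable without a browser. Here is a minimal sketch that pulls the Ark.kindTree array straight out of the HTML, assuming the assignment ends with "];" exactly as in the output above:

import json
import re

import requests

headers = {'User-Agent': 'Mozilla/5.0'}  # same idea as the headers above
resp = requests.get('https://read.douban.com/category/1?sort=hot&page=1', headers=headers)

# Extract the JSON array assigned to Ark.kindTree from the inline script
match = re.search(r'Ark\.kindTree\s*=\s*(\[.*?\]);', resp.text, re.S)
if match:
    kind_tree = json.loads(match.group(1))       # json.loads decodes the \uXXXX escapes
    print(kind_tree[0]['children'][0]['name'])   # e.g. 言情小说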

Solution

  1. Use the Selenium library to drive a real browser, so the dynamically loaded page content can be retrieved.
  2. First make sure Selenium and the matching browser driver (e.g. ChromeDriver) are installed.
  3. Create a Selenium WebDriver instance and open the target Douban Read page.
  4. Wait until the page has fully loaded; waiting for a specific element to appear (e.g. the text of some element on the page) is a reliable way to decide that loading has finished.
  5. Read the page source from the WebDriver's page_source attribute.
  6. Close the WebDriver to release its resources.

Here is a concrete code example:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Create a Chrome WebDriver instance (Selenium 4 style; with Selenium 4.6+
# you can simply call webdriver.Chrome() and let Selenium Manager find the driver)
driver = webdriver.Chrome(service=Service('/path/to/chromedriver'))

# Open the target page
hash_url = 'https://read.douban.com/category/1?sort=hot&page=1'
driver.get(hash_url)

# Wait for the page to finish loading.
# Waiting for a specific element to appear is the reliable way to decide this;
# the <title> element is used here only as an example
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.XPATH, "//title")))

# Get the rendered page source
page_source = driver.page_source

# Print the page source
print(page_source)

# Close the WebDriver and release resources
driver.quit()

The code above uses Selenium to drive a browser, waits for the page to finish loading, and then reads the page source. Running it gives you the fully rendered source, which can then be parsed.
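
Once page_source holds the rendered HTML, parsing works the usual way. A short sketch with BeautifulSoup; note that '.works-item' is a hypothetical selector used only for illustration, so inspect the real DOM to find the right one:

from bs4 import BeautifulSoup

soup = BeautifulSoup(page_source, 'html.parser')
# '.works-item' is a hypothetical class name -- check the actual page markup
for item in soup.select('.works-item'):
    print(item.get_text(strip=True))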

Note that scraping with Selenium adds page-load time and consumes more system resources. If the target page loads slowly or pulls in a lot of data, add appropriate waits or optimize the code to keep things efficient.
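
One common optimization is to run Chrome headless, so no window has to be rendered. A sketch using the Selenium 4 options API:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless=new')   # run Chrome without a visible window
driver = webdriver.Chrome(options=options)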

I hope this solution helps; feel free to ask if you have further questions.


