I used `r = requests.get(url1, headers=headers)` and `r.text` to get the page's HTML, then separately saved the page from the browser, opened it in Notepad, and pasted it into Word to compare. The HTML I scraped is only 700+ characters, while the saved page opened in Notepad is 25,000+ characters. Why the difference?
For pages that load their content asynchronously, you can use the selenium library to drive a real browser and scrape the rendered data.
You can refer to this code:
# Import packages
import pandas as pd
import numpy as np
import time
from selenium import webdriver
from selenium.webdriver.common.by import By  # Selenium 4+: find_elements_by_* was removed

driver = webdriver.Chrome()
# Pages to scrape
url = ['https://qd.xiaozhu.com/search-duanzufang-p{}-0/'.format(i) for i in range(1, 14)]
lis = []
for urli in url:
    driver.get(urli)
    driver.implicitly_wait(10)  # wait up to 10 s for elements to appear
    # Grab the listing fields (selectors copied from the page's DOM)
    # Name: #page_list > ul > li:nth-child(21) > div.result_btm_con.lodgeunitname > div.result_intro > a > span
    names = driver.find_elements(By.CSS_SELECTOR, 'div.result_btm_con.lodgeunitname > div.result_intro > a > span')
    # Price: #page_list > ul > li:nth-child(1) > div.result_btm_con.lodgeunitname > div:nth-child(1) > span > i
    jiages = driver.find_elements(By.CSS_SELECTOR, 'div.result_btm_con.lodgeunitname > div > span > i')
    # Description: #page_list > ul > li:nth-child(21) > div.result_btm_con.lodgeunitname > div.result_intro > em
    jianjies = driver.find_elements(By.CSS_SELECTOR, 'div.result_btm_con.lodgeunitname > div.result_intro > em')
    # Link: #page_list > ul > li:nth-child(1) > a
    lianjies = driver.find_elements(By.CSS_SELECTOR, '#page_list > ul > li > a')
    # Latitude/longitude: #page_list > ul > li:nth-child(1)
    jwdus = driver.find_elements(By.CSS_SELECTOR, '#page_list > ul > li')
    # Collect the rows
    for name, jiage, jianjie, lianjie, jwdu in zip(names, jiages, jianjies, lianjies, jwdus):
        namei = name.text
        jiagei = jiage.text
        jianjiei = jianjie.text.strip().replace('\n', '').replace(' ', '')
        lianjiei = lianjie.get_attribute('href')
        jwdui = jwdu.get_attribute('latlng')
        lis.append([namei, jiagei, jianjiei, lianjiei, jwdui])
    time.sleep(np.random.randint(5, 15))  # random pause between pages to avoid hammering the site
result1 = pd.DataFrame(lis)
result1.columns = ['name', 'price', 'description', 'link', 'latlng']
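If it helps, the resulting DataFrame can be written straight to CSV afterwards (a small sketch with a made-up row shaped like the scraper's output; the filename and sample values are just illustrative):

```python
import pandas as pd

# Illustrative row in the scraper's shape: name, price, description, link, lat/lng
rows = [
    ["Cozy loft near the beach", "328", "Entire place, 2 guests",
     "https://example.com/room/1", "36.06,120.38"],
]
df = pd.DataFrame(rows, columns=["name", "price", "description", "link", "latlng"])
# utf-8-sig adds a BOM so Chinese text opens cleanly in Excel
df.to_csv("listings.csv", index=False, encoding="utf-8-sig")
```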
If the site loads its data asynchronously with JS, then scraping the page only gets you their page source, not the rendered data, so of course the two differ.
Worth taking a look at selenium.
Better to just post the URL you're trying to scrape so we can take a look.
Use the selenium library to create a browser object and simulate a browser visiting the target URL: call the driver object's get method to load the page, then read driver.page_source to get the complete rendered HTML.
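The approach above can be sketched as a small helper (a sketch, assuming Selenium 4+ and a matching chromedriver on PATH; the import is deferred so the function can be defined even without Selenium installed):

```python
def rendered_html(url, wait_seconds=10):
    """Return the fully rendered HTML of `url`, JS included.

    Drives a real browser, lets scripts run, then reads back the complete DOM,
    unlike requests, which only sees the raw server response.
    """
    from selenium import webdriver  # deferred: defining this helper needs no Selenium

    driver = webdriver.Chrome()  # assumes chromedriver is on PATH
    try:
        driver.get(url)
        driver.implicitly_wait(wait_seconds)  # wait up to N seconds for elements
        return driver.page_source  # the complete rendered HTML
    finally:
        driver.quit()  # always release the browser process
```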
Thanks everyone for the answers, I'll go try it.
Hi, did you ever solve this? I'm running into the same problem, except my site isn't loading asynchronously; it seems to be some kind of anti-scraping mechanism.
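Not the original poster, but a common first check when a site serves different HTML to scripts: many basic anti-bot checks key on request headers. A minimal sketch with the standard library (the header values are illustrative assumptions, not what any particular site requires):

```python
import urllib.request

def browser_like_request(url):
    """Build a request carrying browser-like headers.

    Sites that fingerprint clients often inspect User-Agent, Accept and
    Accept-Language; a bare script sends none of these the way a browser does.
    """
    headers = {
        # Illustrative browser UA string; swap in your own browser's value
        "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/120.0 Safari/537.36"),
        "Accept": "text/html,application/xhtml+xml",
        "Accept-Language": "zh-CN,zh;q=0.9",
    }
    return urllib.request.Request(url, headers=headers)

req = browser_like_request("https://example.com/")
# The request can then be opened with urllib.request.urlopen(req)
```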