I've run into a few problems while scraping rental data from Beike (贝壳网):
Problem 1: the area, layout and orientation values are not wrapped in their own nodes. Here is a screenshot of the page source:
I tried using following-sibling to grab the text between two / separators, but it doesn't seem to work inside the loop and keeps raising "list index out of range". If I change ./div to //div it runs, but the data it returns is wrong. How should I fix this?
size = div.xpath('./div[@class="content__list--item--main"]/p[2]/i[1]/following-sibling::node()[position() <count(./div[@class="content__list--item--main"]/p[2]/i[1]/following-sibling::node())-count(./div[@class="content__list--item--main"]/p[2]/i[2]/following-sibling::node())]')[0]
direction = div.xpath('./div[@class="content__list--item--main"]/p[2]/i[2]/following-sibling::node()[position() <count(./div[@class="content__list--item--main"]/p[2]/i[2]/following-sibling::node())-count(./div[@class="content__list--item--main"]/p[2]/i[3]/following-sibling::node())]')[0]
pattern = div.xpath('./div[@class="content__list--item--main"]/p[2]/i[3]/following-sibling::node()[position() <count(./div[@class="content__list--item--main"]/p[2]/i[3]/following-sibling::node())-count(./div[@class="content__list--item--main"]/p[2]/span[@class="hide"]/following-sibling::node())]')[0]
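For comparison, a simpler extraction that avoids the counting predicates might look like the sketch below. It is only a guess based on the structure described above: it assumes the area / orientation / layout values sit directly in p[2] as bare text nodes between the <i>/</i> separators, so they can be pulled with a plain text() query relative to each row and cleaned afterwards:

# Hypothetical alternative for size / direction / pattern, relative to one row `div`.
# Assumes p[2]'s loose text nodes are the area, orientation and layout strings.
texts = div.xpath('./div[@class="content__list--item--main"]/p[2]/text()')
fields = [t.strip() for t in texts if t.strip() and t.strip() not in ('/', '-')]
# e.g. fields -> ['40.00㎡', '南', '2室1厅1卫'] for a typical listing
size, direction, pattern = (fields + ['', '', ''])[:3]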
Problem 2: the floor value is inside a hidden (class="hide") element and I can never get it out. Screenshot of the page source:
When I query it with a normal XPath, the result is always empty:
floor = div.xpath('./div[@class="content__list--item--main"]/p[2]/span[@class="hide"]/text()')[0]
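Before concluding that the hidden span is empty, it may be worth dumping the raw HTML that requests actually received for one row, since the markup shown in the browser's Elements panel can differ from the server response once JavaScript has run. A small debugging sketch, assuming div is one row element from the loop in the full code below:

# Print the raw markup of one listing row as requests received it,
# to check whether <span class="hide"> really carries the floor text here.
print(etree.tostring(div, encoding='unicode', pretty_print=True))
# string(.) concatenates all descendant text, so it also works if the floor
# text is nested one level deeper or split across child nodes.
floor_raw = div.xpath('string(./div[@class="content__list--item--main"]/p[2]/span[@class="hide"])').strip()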
The full source code is as follows:
import requests
import csv
from lxml import etree

if __name__ == "__main__":
    headers={
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0'
    }
    url='https://sz.zu.ke.com/zufang/futianqu/pg%drt200600000001/#contentList'
    fp = open('beike.csv','w',newline='',encoding='utf-8')
    writer = csv.writer(fp)
    writer.writerow(['名称','面积','户型','区','街道','地点','价格','朝向','楼层','维护时间','标签','链接'])
    for pageNum in range(0,1):
        new_url=format(url%pageNum)
        page_text = requests.get(url=new_url,headers=headers).text
        # parse the page
        tree = etree.HTML(page_text)
        # one div per listing card
        div_list = tree.xpath('//div[@class="content__list"]/div[@class="content__list--item"]')
        for div in div_list:
            title = div.xpath('./div[@class="content__list--item--main"]/p[1]/a/text()')[0]
            link = 'https://sz.zu.ke.com'+div.xpath('./div[@class="content__list--item--main"]/p[1]/a/@href')[0]
            district_1=div.xpath('./div[@class="content__list--item--main"]/p[2]/a[1]/text()')[0]
            district_2=div.xpath('./div[@class="content__list--item--main"]/p[2]/a[2]/text()')[0]
            district_3=div.xpath('./div[@class="content__list--item--main"]/p[2]/a[3]/text()')[0]
            size = div.xpath('./div[@class="content__list--item--main"]/p[2]/i[1]/following-sibling::node()[position() <count(./div[@class="content__list--item--main"]/p[2]/i[1]/following-sibling::node())-count(./div[@class="content__list--item--main"]/p[2]/i[2]/following-sibling::node())]')[0]
            direction = div.xpath('./div[@class="content__list--item--main"]/p[2]/i[2]/following-sibling::node()[position() <count(./div[@class="content__list--item--main"]/p[2]/i[2]/following-sibling::node())-count(./div[@class="content__list--item--main"]/p[2]/i[3]/following-sibling::node())]')[0]
            pattern = div.xpath('./div[@class="content__list--item--main"]/p[2]/i[3]/following-sibling::node()[position() <count(./div[@class="content__list--item--main"]/p[2]/i[3]/following-sibling::node())-count(./div[@class="content__list--item--main"]/p[2]/span[@class="hide"]/following-sibling::node())]')[0]
            floor = div.xpath('./div[@class="content__list--item--main"]/p[2]/span[@class="hide"]/text()')[0]
            label_is = div.xpath('./div[@class="content__list--item--main"]/p[3]')[0]
            label = label_is.xpath('string(.)')
            time = div.xpath('./div[@class="content__list--item--main"]/p[4]/span[@class="content__list--item--time oneline"]/text()')[0]
            price = div.xpath('./div[@class="content__list--item--main"]/span[@class="content__list--item-price"]/em/text()')[0]
            print(title,link,district_1,district_2,district_3,size,floor,time,label,price)
            house=[title,size,district_1,district_2,district_3,price,floor,time,label,link]
            writer.writerow(house)
If you right-click the element in the page source and choose Copy XPath, you get:
//*[@id="content"]/div[1]/div[1]/div[10]/div/p[2]/text()[3]
Notice that the bracketed index sits inside the path: it is text()[3], i.e. the third text node of p[2].
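If that text()[3] position is stable across rows, the copied absolute path can be rewritten relative to each row's div so it works inside the loop. A sketch, assuming the third text node of p[2] really is the layout field for every listing:

# Hypothetical relative form of the copied XPath, usable inside `for div in div_list:`.
# text()[3] is assumed to be the layout (户型) field, as in the copied absolute path.
nodes = div.xpath('./div[@class="content__list--item--main"]/p[2]/text()[3]')
pattern = nodes[0].strip() if nodes else ''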