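I'm trying to scrape employee reviews of a company from Indeed. The overall review text (reviews) comes through fine, but the specific pros and cons always come back as NA.
Below is the code I'm using to scrape the Airbnb reviews on Indeed.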
from bs4 import BeautifulSoup
import pandas as pd
import requests
import numpy as np

lst = []
header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36"}

# Walk the review pages, 20 reviews per page.
for i in range(0, 240, 20):
    print(i)
    url = f'https://www.indeed.com/cmp/Airbnb/reviews?start={i}'
    page = requests.get(url, headers=header)
    soup = BeautifulSoup(page.content, 'lxml')
    main_data = soup.find_all("div", attrs={"data-tn-section": "reviews"})
    for data in main_data:
        try:
            title = data.find("h2").get_text(strip=True)
        except AttributeError:
            title = np.nan
        # The author line has the form "status-location"; IndexError covers a missing "-".
        try:
            location = data.find("span", attrs={"itemprop": "author"}).get_text(strip=True).split("-")[1]
        except (AttributeError, IndexError):
            location = np.nan
        try:
            status = data.find("span", attrs={"itemprop": "author"}).get_text(strip=True).split("-")[0]
        except AttributeError:
            status = np.nan
        try:
            review = data.find("span", attrs={"itemprop": "reviewBody"}).get_text(strip=True)
        except AttributeError:
            review = np.nan
        # find() returns None instead of raising when nothing matches, so call
        # get_text() and let AttributeError signal a missing element.
        # The class names are the ones from the question; confirm they appear in the fetched HTML.
        try:
            pros = data.find('div', class_='cmp-review-pro-text').get_text(strip=True)
        except AttributeError:
            pros = np.nan
        try:
            cons = data.find('div', class_='cmp-review-con-text').get_text(strip=True)
        except AttributeError:
            cons = np.nan
        try:
            rating = data.find("div", attrs={"itemprop": "reviewRating"}).find("button")['aria-label'].split(" ")[0]
        except (AttributeError, TypeError, KeyError):
            rating = np.nan
        lst.append([title, location, status, pros, cons, review, rating])

df_airbnb = pd.DataFrame(data=lst, columns=['title', 'location', 'status', 'pros', 'cons', 'review', 'rating'])
df_airbnb
I've been at this for a long time, and however I change the code, pros and cons always come back as NA. Could someone take a look? Thanks!
Inspect the page you actually get back and check whether it contains these two elements.
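One rough way to do that check is to count how often each class name occurs in the raw HTML. This is only a sketch: `count_markers` is a hypothetical helper, and `sample_html` is made-up markup standing in for `page.text` from the question's `requests.get` call.

```python
def count_markers(html, markers):
    """Return how many times each marker string occurs in the raw HTML."""
    return {marker: html.count(marker) for marker in markers}


# Made-up markup for illustration; in the real script pass page.text instead.
sample_html = (
    '<div class="cmp-review-pro-text">Great perks</div>'
    '<div>no cons block here</div>'
)

counts = count_markers(sample_html, ["cmp-review-pro-text", "cmp-review-con-text"])
print(counts)
```

If a count is 0 for the real page, the element is not in the HTML that requests received, and no BeautifulSoup selector for that class will find it.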
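I work in Java. I tried it with XPath in Java and was able to extract them, so check the part where you select pros and cons; it is probably written incorrectly.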
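On the Python side, one thing worth checking is that `find()` returns `None` when nothing matches rather than raising, so a `try`/`except` around a bare `find()` never triggers. A minimal sketch, assuming the class names from the question; `text_or_none` is a hypothetical helper and the sample markup is invented for the demonstration, not copied from Indeed.

```python
from bs4 import BeautifulSoup


def text_or_none(parent, name, class_name):
    """Return the stripped text of the first matching child, or None if absent."""
    node = parent.find(name, class_=class_name)
    return node.get_text(strip=True) if node is not None else None


# Invented markup: one review block that has a pros element but no cons element.
sample_html = """
<div data-tn-section="reviews">
  <div class="cmp-review-pro-text"> Flexible hours </div>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
review = soup.find("div", attrs={"data-tn-section": "reviews"})

pros = text_or_none(review, "div", "cmp-review-pro-text")
cons = text_or_none(review, "div", "cmp-review-con-text")
print(pros, cons)
```

With this pattern a missing element shows up as an explicit `None`, which makes it easier to tell a wrong selector apart from a review that simply has no pros or cons.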