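I'm scraping with xpath and the requests module.
Whenever the site's page updates with new data, my code stops crawling and throws an error.
When nothing has been updated, it keeps crawling without any problem.
Is this because of the difference between dynamic and static websites?
How can I solve this?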
[Related recommendation]
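While scraping, I found that once more than a certain amount of data had been returned, the site started returning empty data. My guess was that the site had used the IP or the request headers to identify the requests as coming from a crawler, so I set up proxies and random request headers.
For the proxies, I saved the usable ones to a file named "https_ips_pool.csv". The code for getting a proxy is as follows: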
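import random

def get_proxies(ip_pool_name='https_ips_pool.csv'):
    # Each line of the pool file is expected to look like: scheme,host,port
    with open(ip_pool_name, 'r') as f:
        datas = f.readlines()
    # Pick one line at random and split it into its three fields.
    ran_num = random.choice(datas)
    ip = ran_num.strip().split(',')
    # Map the scheme to "host:port", the shape requests takes for proxies.
    proxies = {ip[0]: ip[1] + ':' + ip[2]}
    return proxies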
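The function above raises an IndexError if the file contains a blank or malformed line. A minimal sketch of a more defensive variant follows; it assumes the same "scheme,host,port" row layout, and the names `load_proxy_pool` and `pick_proxy` are hypothetical, not part of the original code:

```python
import csv
import random


def load_proxy_pool(path='https_ips_pool.csv'):
    """Read every well-formed 'scheme,host,port' row from the pool file."""
    pool = []
    with open(path, newline='') as f:
        for row in csv.reader(f):
            fields = [field.strip() for field in row]
            # Skip blank or malformed rows instead of failing on them later.
            if len(fields) >= 3 and all(fields[:3]):
                pool.append(tuple(fields[:3]))
    return pool


def pick_proxy(pool):
    """Build a proxies dict from one randomly chosen pool entry."""
    if not pool:
        return None  # the caller can fall back to a direct connection
    scheme, host, port = random.choice(pool)
    return {scheme: f'{host}:{port}'}
```

Loading the pool once and calling `pick_proxy(pool)` per request also avoids re-reading the file on every call.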
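For the random request headers, I randomized the UA: the candidate User-Agent strings are saved in "user_agent.txt", and the code for building the headers is as follows: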
from random import choice

def get_headers():
    # Read the candidate User-Agent strings, one per line.
    with open('user_agent.txt', 'r') as file:
        user_agent_list = file.readlines()
    user_agent = str(choice(user_agent_list)).replace('\n', '')
    # Fall back to a fixed UA if the chosen line is too short to be a real one.
    user_agent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36" if len(user_agent) < 20 else user_agent
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "zh-CN,zh;q=0.9",
        "Cache-Control": "max-age=0",
        "Connection": "keep-alive",
        "Cookie": "__uuid=1576743949178.08; need_bind_tel=false; new_user=false; c_flag=a99aaaa31f739e3b04d3fa768574cabd; gr_user_id=bdb451db-1bc4-4899-a3b3-410b19c06161; bad1b2d9162fab1f80dde1897f7a2972_gr_last_sent_cs1=3872aec89444b8931a667e00ad0d9493; grwng_uid=6c0a08dc-d2e4-407b-b227-89fe9281943e; fe_work_exp_add=true; gr_session_id_bad1b2d9162fab1f80dde1897f7a2972=39311f19-2c25-419e-9a69-64940ae15c78; gr_cs1_39311f19-2c25-419e-9a69-64940ae15c78=UniqueKey%3A3872aec89444b8931a667e00ad0d9493; AGL_USER_ID=15f1f78f-e535-4ccc-8da0-ec0728eb9fb7; abtest=0; __s_bid=5ec9f0f87b044308fb05861763266522a1d4; access_system=C; _fecdn_=1; bad1b2d9162fab1f80dde1897f7a2972_gr_session_id=b500cb67-657d-4f10-9211-7e69d7e319c4; user_roles=0; user_photo=5e7c0e2937483e328d66574804u.jpg; user_name=%E5%B8%B8%E4%BF%8A%E6%9D%B0; fe_se=-1587405434042; Hm_lvt_a2647413544f5a04f00da7eee0d5e200=1586223102,1586753746,1587356899,1587405434; __tlog=1587405434243.54%7C00000000%7CR000000075%7C00000000%7C00000000; UniqueKey=3872aec89444b8931a667e00ad0d9493; lt_auth=7bsJaSdWzg%2Bv4iTRiTBf7fpI3Yr5VmTL%2FX0Mh0gJh4W6W%2FWw4PzqRQiDrbIPxAMhwUxzf8ULNLj5Men%2FznJL7UYQwGmulICyv%2F2k03sEUeVhIsW2vezHg%2FXSQp4ilEAC8nJbpEIL%2BQ%3D%3D; bad1b2d9162fab1f80dde1897f7a2972_gr_last_sent_sid_with_cs1=b500cb67-657d-4f10-9211-7e69d7e319c4; imClientId=deef7ae9f2746887611c3686cabc4d86; imId=deef7ae9f2746887f5aceb762480da5b; imClientId_0=deef7ae9f2746887611c3686cabc4d86; imId_0=deef7ae9f2746887f5aceb762480da5b; bad1b2d9162fab1f80dde1897f7a2972_gr_session_id_b500cb67-657d-4f10-9211-7e69d7e319c4=true; JSESSIONID=8801E5B4E379482E21B82766ACE7C16F; __uv_seq=38; __session_seq=22; bad1b2d9162fab1f80dde1897f7a2972_gr_cs1=3872aec89444b8931a667e00ad0d9493; Hm_lpvt_a2647413544f5a04f00da7eee0d5e200=1587411721; fe_im_socketSequence_0=11_11_11",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "same-origin",
        "Sec-Fetch-User": "?1",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": user_agent
    }
    return headers
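One way the two helpers might be combined is to retry a request with a fresh proxy and fresh headers whenever the response comes back empty. The following is only a sketch under assumptions: `fetch_with_retry` and its parameters are hypothetical names, and `send` stands for any callable that performs the request and returns the response body as text.

```python
import time


def fetch_with_retry(url, send, make_headers, make_proxies,
                     max_tries=3, delay=1.0):
    """Call send() until it returns a non-empty body or tries run out."""
    for _ in range(max_tries):
        try:
            # Draw new headers and a new proxy for every attempt.
            body = send(url, headers=make_headers(), proxies=make_proxies())
        except OSError:
            # Network-level failures; requests' RequestException is an
            # OSError subclass, so it is caught here as well.
            body = None
        if body:
            return body
        time.sleep(delay)  # brief pause before trying another proxy
    return None
```

With requests, `send` could be a small wrapper such as `lambda url, headers, proxies: requests.get(url, headers=headers, proxies=proxies, timeout=10).text`, with `get_headers` and `get_proxies` passed as `make_headers` and `make_proxies`.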