1. The response my crawler gets back is complete, but after parsing it with etree.HTML the content shrinks and I can no longer locate anything with XPath. Why is that?
import requests
from lxml import etree

url = "https://tieba.baidu.com/f?fr=wwwt&kw=%E4%B8%8D%E8%89%AF%E4%BA%BA"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36",
}
response = requests.get(url, headers=headers).content.decode()
print(response)  # the raw response does contain the thread list
html_str = etree.HTML(response)
print(etree.tostring(html_str).decode())  # but it is gone from the parsed output
# li = html_str.xpath("//ul[@id='thread_list']/li[@class='j_thread_list clearfix']")
# print(li)  # -> empty list
In the page the server returns, the content you actually want is wrapped inside HTML comments. The raw response is complete, but once you run it through etree.HTML the useful part is no longer reachable as element nodes, so your XPath matches nothing. I ran into the same pitfall; parsing with a regular expression should let you get at the content.
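To illustrate, here is a minimal, self-contained sketch of the regex approach. The HTML snippet below is made up to mimic Tieba's markup (the real page wraps the `thread_list` in a comment the same way); the idea is to strip the `<!--` / `-->` markers before parsing, so the commented-out nodes become real elements that XPath can reach.

```python
import re
from lxml import etree

# Hypothetical snippet mimicking Tieba's markup: the thread list
# is wrapped in an HTML comment, invisible to element XPath.
raw_html = """
<html><body>
<div class="content">
<!--
<ul id="thread_list">
    <li class="j_thread_list clearfix">post 1</li>
    <li class="j_thread_list clearfix">post 2</li>
</ul>
-->
</div>
</body></html>
"""

# Parsed as-is, the <li> nodes live inside a comment node,
# so an element XPath finds nothing.
tree = etree.HTML(raw_html)
print(tree.xpath("//ul[@id='thread_list']/li"))  # []

# Strip the comment markers first, then re-parse: the same
# XPath now matches real element nodes.
cleaned = re.sub(r"<!--|-->", "", raw_html)
tree = etree.HTML(cleaned)
items = tree.xpath(
    "//ul[@id='thread_list']/li[@class='j_thread_list clearfix']"
)
print(len(items))  # 2
```

An alternative without regex is to select the comment nodes directly with `tree.xpath("//comment()")` and feed each comment's `.text` back through `etree.HTML`; the regex version above is just the shortest fix for this page.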
See this answer: https://blog.csdn.net/WBerica/article/details/88745406
Just call this function:
create_root_node(text, base_url=None, doc_type='html')