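My scraper returns empty content

I wanted to fetch the ten most popular images, but something goes wrong at the marked part of the code: I get back what looks like an empty array. I'm not sure how to fix it, so any corrections are welcome.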

import re
import urllib.request, urllib.error
from bs4 import BeautifulSoup

def main():
    baseurl = "https://stock.tuchong.com/topic?topicId=50344&from=%E7%B2%BE%E9%80%89%E5%9B%BE%E9%9B%86-%E4%B8%8B%E8%BD%BD%E6%8E%92%E8%A1%8C-%E5%AD%A3%E5%BA%A6%E6%A6%9C%E5%8D%95"
    datalist = getDate(baseurl)

findImgSrc = re.compile(r'<a href="(.*?)">')
def getDate(baseurl):
    datalist = []
    html = askURL(baseurl)
    # --- marked section: this is where the result comes back empty ---
    soup = BeautifulSoup(html,"html.parser")
    link = re.findall(findImgSrc,str(soup))[10]
    print(link)
    # --- end of marked section ---
def askURL(baseurl):
    head = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36"}
    request = urllib.request.Request(baseurl,headers=head)
    html = ""
    try:
        response = urllib.request.urlopen(request)
        html = response.read().decode("utf-8")
    except urllib.error.URLError as e:
        if hasattr(e,"code"):
            print(e.code)
        if hasattr(e,"reason"):
            print(e.reason)
    return html
if __name__ =="__main__":
    main()

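The images are generated by JavaScript after the page loads; you are not being blocked by anti-scraping measures.

The source returned by the request is not the same as what the browser shows after rendering. The data is in the JS variable goods, in a script in the page head.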
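
To illustrate the difference, here is a minimal sketch run against a made-up HTML string rather than the real page. SAMPLE_HTML and extract_goods are hypothetical names used only for this example, and the real page's layout may differ. It shows an anchor-tag regex finding nothing while the goods variable still carries the data.

```python
import json
import re

# Hypothetical stand-in for the raw HTML the server returns: there are no
# <a href> tags, only a script that assigns the image data to a JS variable.
SAMPLE_HTML = '''<html><head>
<script>window.goods=[{"image_id": "111"}, {"image_id": "222"}];</script>
</head><body><div id="app"></div></body></html>'''

GOODS_RE = re.compile(r'goods=([\s\S]+?)</script>')
LINK_RE = re.compile(r'<a href="(.*?)">')


def extract_goods(html):
    """Return the list assigned to `goods`, or [] if the variable is absent."""
    match = GOODS_RE.search(html)
    if match is None:
        return []
    # Drop surrounding whitespace and the trailing JS semicolon before parsing.
    return json.loads(match.group(1).strip().rstrip(';'))


links = LINK_RE.findall(SAMPLE_HTML)   # the regex from the question
goods = extract_goods(SAMPLE_HTML)     # the data embedded in the script
print(links)
print([g["image_id"] for g in goods])
```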

The field that matters is image_id. The image URL is built from it as:

//cdn6-banquan.ituchong.com/weili/smh/{image_id}.webp

and the link to the detail page as:

https://stock.tuchong.com/image/detail?imageId={image_id}&platform=image&term=&requestId=&searchId=&page=1&entryFrom=%E4%B8%93%E9%A2%98%E5%88%97%E8%A1%A8&index=29
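
The two templates above can be filled in with a small helper. This is only a sketch: build_urls is a hypothetical name, and prefixing the protocol-relative image address with https: is an assumption, not something the page states.

```python
from urllib.parse import urlencode

IMG_TEMPLATE = "//cdn6-banquan.ituchong.com/weili/smh/{image_id}.webp"
DETAIL_BASE = "https://stock.tuchong.com/image/detail"


def build_urls(image_id, index, scheme="https:"):
    """Build the image URL and the detail-page URL for one image_id."""
    # Assumption: the protocol-relative image URL is served over HTTPS.
    img = scheme + IMG_TEMPLATE.format(image_id=image_id)
    # urlencode percent-encodes the Chinese entryFrom value as UTF-8.
    query = urlencode({
        "imageId": image_id,
        "platform": "image",
        "term": "",
        "requestId": "",
        "searchId": "",
        "page": 1,
        "entryFrom": "专题列表",
        "index": index,
    })
    return {"url": DETAIL_BASE + "?" + query, "img": img}


result = build_urls("123", 0)
print(result["img"])
print(result["url"])
```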

The code is as follows:

import json
import re
import urllib.error
import urllib.request


def main():
    baseurl = "https://stock.tuchong.com/topic?topicId=50344&from=%E7%B2%BE%E9%80%89%E5%9B%BE%E9%9B%86-%E4%B8%8B%E8%BD%BD%E6%8E%92%E8%A1%8C-%E5%AD%A3%E5%BA%A6%E6%A6%9C%E5%8D%95"
    datalist = getDate(baseurl)
    print(datalist)


# The image data is embedded in the page as a JS assignment: goods=[...];</script>
reJs = re.compile(r'goods=([\s\S]+?)</script>')


def getDate(baseurl):
    html = askURL(baseurl)
    match = reJs.search(html)
    if match is None:  # the request failed or the page layout changed
        return []
    # Strip whitespace and the trailing JS semicolon so the text parses as JSON
    jsonstr = match.group(1).strip().rstrip(';')
    data = json.loads(jsonstr)
    arr = []
    for i, item in enumerate(data[:10]):  # take the first 10 images
        image_id = str(item['image_id'])  # str() in case the id is numeric
        arr.append({
            'url': 'https://stock.tuchong.com/image/detail?imageId=' + image_id + '&platform=image&term=&requestId=&searchId=&page=1&entryFrom=%E4%B8%93%E9%A2%98%E5%88%97%E8%A1%A8&index=' + str(i),
            'img': '//cdn6-banquan.ituchong.com/weili/smh/' + image_id + '.webp'
        })
    return arr


def askURL(baseurl):
    # Send a browser-like User-Agent with the request
    head = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36"}
    request = urllib.request.Request(baseurl, headers=head)
    html = ""
    try:
        response = urllib.request.urlopen(request)
        html = response.read().decode("utf-8")
    except urllib.error.URLError as e:
        # On failure, print whatever details are available and return ""
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
    return html


if __name__ == "__main__":
    main()
 

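This happens because the site has anti-scraping measures. The soup you get back contains no link information, so the regex you run afterwards finds nothing.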
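
One way to confirm this is to count the matches before indexing into them. The sketch below reuses the regex from the question; nth_link is a hypothetical helper name, and the HTML strings are made up for the example.

```python
import re

findImgSrc = re.compile(r'<a href="(.*?)">')


def nth_link(html, n):
    """Return the n-th matched href, or None if there are not enough matches."""
    links = findImgSrc.findall(html)
    print("matches found:", len(links))
    return links[n] if n < len(links) else None


# Made-up HTML with no anchor tags, like the response described above
empty_result = nth_link('<html><body><div id="app"></div></body></html>', 10)
# Made-up HTML with one anchor tag, for comparison
one_result = nth_link('<a href="/image/1">', 0)
```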