在爬图片网站的图片,要把"src="后面的链接取出来,但是一直取出来是空的,不知道正则表达式哪里有错!
要爬取的链接:
想要得到的结果:
//img.ivsky.com/img/tupian/li/202107/15/xingkong-005.jpg
目前的正则写法:
ex = '
完整的代码:
import requests
import re
import os
if __name__ == "__main__":
if not os.path.exists('./otherLibs'):
os.mkdir('./otherLibs')
url = 'https://www.ivsky.com/tupian/ziranfengguang/'
headers = {
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:96.0) Gecko/20100101 Firefox/96.0'
}
page_text = requests.get(url=url,headers=headers).text
ex = '<div class="il_img">.*?<img src="(.*?)" alt.*?</div>'
img_src_list = re.findall(ex,page_text,re.S)
for src in img_src_list:
src = 'https:'+src
iamge_data = requests.get(url=url,headers=headers).content
img_name = src.split('/')[-1]
img_Path= './otherLibs'+img_name
with open(img_Path,'wb') as fp:
fp.write(iamge_data)
print(img_name,'下载成功')
望被采纳
import os
import re
import requests
if __name__ == "__main__":
if not os.path.exists('./otherLibs'):
os.mkdir('./otherLibs')
url = 'https://www.ivsky.com/tupian/ziranfengguang/'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36',
'cookie': 't = 15dc3f49950b911ad395e4d2b5f21bd0;r = 9777;Hm_lvt_a0c29956538209f8d51a5eede55c74f9 = 1642561701;'
'Hm_lpvt_c13cf8e9faf62071ac13fd4eafaf1acf = 1642561835;Hm_lpvt_a0c29956538209f8d51a5eede55c74f9 = 1642561836'
}
# 必须要加cookie ,可以在浏览器登录后获取,或这接口请求再提取cookie
page_text = requests.get(url=url, headers=headers).text
ex = '<img src="(.*?)" alt="(.*?)">'
img_src_list = re.findall(ex,page_text) #找出src 和alt 返回元组
for src in img_src_list:
url = 'https:' + src[0] #拼接图片地址
alt=src[1] #获取alt 标题
if not os.path.exists('./otherLibs/{}'.format(alt)):
os.mkdir('./otherLibs/{}'.format(alt)) #创建以标题为名的文件夹
with open("./otherLibs/{}/{}.txt".format(alt,alt), 'w+') as fp: #打开或创建以标题为名的 文件
fp.write(alt+"-->"+url) #向文件中写入src
print(alt, '下载成功')
生成的文件和下载地址
望被采纳
你可以复制一段网页代码出来,测试下你的正则能否匹配,如果能,也许你获得的pag_text有问题
我用js是这么写的
ex = /<img.*src="(.*?)"/g
str.match(ex)