Use multi-threaded crawling to fetch information from a website. The site you crawl is your own choice, but every record must contain at least 5 pieces of information, and the range of pages crawled must be controllable (from page m to page n).
Please don't just repost the generic examples that are all over the web.
import requests
from bs4 import BeautifulSoup
import threading

class MovieSpider:
    def __init__(self, start_page, end_page):
        self.start_page = start_page
        self.end_page = end_page
        # Douban tends to reject the default requests User-Agent, so send a browser-like one
        self.headers = {'User-Agent': 'Mozilla/5.0'}

    def run(self):
        threads = []
        # one thread per page in the requested (inclusive) range; Top 250 pages hold 25 items each
        for page in range(self.start_page, self.end_page + 1):
            url = f'https://movie.douban.com/top250?start={(page - 1) * 25}'
            t = threading.Thread(target=self.parse_page, args=(url,))
            threads.append(t)
            t.start()
        for t in threads:
            t.join()

    def parse_page(self, url):
        resp = requests.get(url, headers=self.headers, timeout=10)
        soup = BeautifulSoup(resp.text, 'html.parser')
        items = soup.select('.grid_view .item')
        for item in items:
            title = item.select_one('.title').text.strip()
            info = item.select_one('.bd p').text.strip()
            rating = item.select_one('.rating_num').text.strip()
            num_ratings = item.select_one('.star span:last-child').text.strip()
            # not every movie has a one-line quote, so guard against None
            quote_tag = item.select_one('.quote .inq')
            quote = quote_tag.text.strip() if quote_tag else ''
            # each record carries 5 fields: title, info, rating, rating count, quote
            print(title, info, rating, num_ratings, quote)

if __name__ == '__main__':
    spider = MovieSpider(1, 2)  # crawl pages 1 through 2
    spider.run()
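One caveat with the class above: each thread prints its records directly, so lines from different pages can interleave in the output. Below is a minimal sketch of collecting records into a shared list under a lock instead; the SafeCollector name and its methods are hypothetical additions, not part of the original code:

import threading

class SafeCollector:
    def __init__(self):
        self.records = []
        self._lock = threading.Lock()

    def add(self, record):
        # serialize appends so records arriving from concurrent page
        # threads are stored one at a time
        with self._lock:
            self.records.append(record)

Inside MovieSpider.parse_page you would call collector.add((title, info, rating, num_ratings, quote)) instead of print, then dump collector.records once every thread has joined.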
1. Pick a scraping toolkit that suits you. Here we use Python's requests and BeautifulSoup libraries as the example. Install both:
pip install requests
pip install beautifulsoup4
2. Import the required libraries (ThreadPool comes from multiprocessing.pool; see the note in step 5):
import requests
from bs4 import BeautifulSoup
from multiprocessing.pool import ThreadPool
3. Define a function that parses a page and extracts the required fields. For example, suppose we are crawling a site that lists movie information; we can define a function that extracts the movie title, director, cast, release date, and rating:
def extract_info(url):
    try:
        response = requests.get(url, timeout=10)
        soup = BeautifulSoup(response.text, 'html.parser')
        # extract the fields; these tag/class names are placeholders for the
        # hypothetical example.com markup, so adjust them to the real site
        title = soup.find('h1', {'class': 'title'})
        director = soup.find('span', {'class': 'director'})
        actor = soup.find('span', {'class': 'actor'})
        release_date = soup.find('span', {'class': 'release-date'})
        rating = soup.find('span', {'class': 'rating'})
        # build the result dict (5 fields per record)
        info = {'title': title.text, 'director': director.text, 'actor': actor.text,
                'release_date': release_date.text, 'rating': rating.text}
        return info
    except Exception:
        # network failures or missing tags (AttributeError on .text) land here
        return None
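As a quick sanity check, you can call the function on a single detail page first; the URL below is hypothetical:
info = extract_info('https://www.example.com/movie/1')  # hypothetical detail page
if info is not None:
    print(info)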
4. Create a queue that holds the URLs to crawl. In this example we assume we want the home page and the page after it, so we add those two URLs to the queue (a sketch for making the page range configurable follows the snippet):
url_queue = []
url_queue.append('https://www.example.com/')         # home page URL
url_queue.append('https://www.example.com/page/2/')  # next page URL
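Since the assignment asks for a controllable page range, you can generate the queue from start and end page numbers instead of hard-coding URLs. A minimal sketch, assuming the site paginates as /page/<n>/ with page 1 at the site root; build_url_queue is a hypothetical helper:
def build_url_queue(start_page, end_page):
    # assumed URL scheme: page 1 is the root, later pages live at /page/<n>/
    base = 'https://www.example.com'
    return [base + '/' if page == 1 else f'{base}/page/{page}/'
            for page in range(start_page, end_page + 1)]

url_queue = build_url_queue(1, 2)  # pages 1 through 2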
5. Create a thread pool sized to the number of downloader threads. Note that the threading module has no Pool class; the standard library's thread-based pool lives in multiprocessing.pool, which is why step 2 imports ThreadPool (an equivalent using concurrent.futures is sketched just below):
pool = ThreadPool(5)  # pool of 5 downloader threads
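Alternatively, concurrent.futures offers the same idea behind a more modern interface; this sketch replaces steps 5 through 7 in one go, while the steps below continue with ThreadPool:
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=5) as executor:
    # map dispatches extract_info over the URLs and yields results in order
    for info in executor.map(extract_info, url_queue):
        if info is not None:
            print(info)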
6. Run the extraction function through the thread pool:
results = []
for url in url_queue:
    result = pool.apply_async(extract_info, (url,))
    results.append(result)
pool.close()
pool.join()
7. Collect and print the results:
for result in results:
    info = result.get()
    if info is not None:
        print(info)
With that, you can use multi-threaded crawling techniques to fetch website information over a specified page range.