This is my first question! I'm building a web crawler, and I want to use it to scrape all the hotel links and names from invia.cz.
import scrapy

y = 0

class invia(scrapy.Spider):
    name = 'Kreta'
    start_urls = ['https://dovolena.invia.cz/?d_start_from=13.01.2017&sort=nl_sell&page=1']

    def parse(self, response):
        for x in range(1, 9):
            yield {
                'titles': response.css("#main > div > div > div > div.col.col-content > div.product-list > div > ul > li:nth-child(%d)>div.head>h2>a>span.name::text" % (x)).extract(),
            }

        if response.css('#main > div > div > div > div.col.col-content > div.product-list > div > p > a.next').extract_first():
            y = y + 1
            go = ["https://dovolena.invia.cz/d_start_from=13.01.2017&sort=nl_sell&page=%d" % y]
            print go
            yield scrapy.Request(
                response.urljoin(go),
                callback=self.parse
            )
The site's pages are loaded with Ajax. I change the URL value manually, and I only increment it when the Next button appears on the page. When I test whether the button shows up, the condition works fine, but when I launch the spider it only crawls the first page. This is my first crawler project, so it's probably not very polished yet. Thanks in advance for any answers!
The error log is here: Error Log
Your usage of a "global" y variable is not only peculiar but won't work either. You're using y to count how many times parse was called. Ideally you don't want to access anything outside of the function's scope; you can achieve the same thing by using the request.meta attribute:
def parse(self, response):
    y = response.meta.get('index', 1)  # default is page 1
    y += 1
    # ...
    # next page
    url = 'http://example.com/?p={}'.format(y)
    yield Request(url, self.parse, meta={'index': y})
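To make the meta hand-off concrete, here is a minimal, self-contained sketch of that pattern; the spider name and the example.com URLs are placeholders, not part of the original answer:

import scrapy

class PagedSpider(scrapy.Spider):
    # hypothetical spider, for illustration only
    name = 'paged_example'
    start_urls = ['http://example.com/?p=1']

    def parse(self, response):
        # requests from start_urls carry no meta, so the default of 1 applies on the first call
        index = response.meta.get('index', 1)
        self.logger.info('parsing page %d', index)
        # ... extract and yield items here ...
        next_url = 'http://example.com/?p={}'.format(index + 1)
        # the dict passed as meta comes back as response.meta in the next callback
        yield scrapy.Request(next_url, self.parse, meta={'index': index + 1})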
Regarding your pagination issue: your next-page CSS selector is incorrect, since the <a> node you're selecting doesn't have an absolute href attached to it. Fixing this also makes your y issue obsolete. To solve it, try:
def parse(self, response):
    next_page = response.css("a.next::attr(data-page)").extract_first()
    # replace the "page=1" part of the url with the next page number
    url = re.sub(r'page=\d+', 'page=' + next_page, response.url)
    yield Request(url, self.parse)
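For clarity, this is all the re.sub call does; the following is a standalone sketch you can run in a plain Python session, where the page number '2' is just an example value:

import re

url = 'https://dovolena.invia.cz/?d_start_from=13.01.2017&sort=nl_sell&page=1'
next_page = '2'  # value read from the next link's data-page attribute
# swap the old page number in the query string for the new one
print(re.sub(r'page=\d+', 'page=' + next_page, url))
# https://dovolena.invia.cz/?d_start_from=13.01.2017&sort=nl_sell&page=2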
EDIT: Here's the whole working spider:
import scrapy
import re

class InviaSpider(scrapy.Spider):
    name = 'invia'
    start_urls = ['https://dovolena.invia.cz/?d_start_from=13.01.2017&sort=nl_sell&page=1']

    def parse(self, response):
        names = response.css('span.name::text').extract()
        for name in names:
            yield {'name': name}

        # next page: the "next" link carries the page number in its data-page attribute
        next_page = response.css("a.next::attr(data-page)").extract_first()
        if next_page:  # on the last page the link is absent and extract_first() returns None
            url = re.sub(r'page=\d+', 'page=' + next_page, response.url)
            yield scrapy.Request(url, self.parse)
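If you want to try it without setting up a full Scrapy project, a minimal driver sketch like the one below runs the spider in-process (the file name run_invia.py is just an example; scrapy runspider yourfile.py -o hotels.json does the same from the command line):

# run_invia.py -- minimal in-process runner, assuming InviaSpider is defined above
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()
process.crawl(InviaSpider)
process.start()  # blocks until the crawl finishes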