python - Scrapy Cloud skips loop iterations
Problem description
This spider is supposed to loop through https://lihkg.com/thread/`2479991 - i*10`/page/1, but for some reason it skips pages in the loop.
I checked the items scraped in Scrapy Cloud, and items with the following URLs were scraped:
...
Item 10: https://lihkg.com/thread/2479941/page/1
Item 11: https://lihkg.com/thread/2479981/page/1
Item 12: https://lihkg.com/thread/2479971/page/1
Item 13: https://lihkg.com/thread/2479931/page/1
Item 14: https://lihkg.com/thread/2479751/page/1
Item 15: https://lihkg.com/thread/2479991/page/1
Item 16: https://lihkg.com/thread/1504771/page/1
Item 17: https://lihkg.com/thread/1184871/page/1
Item 18: https://lihkg.com/thread/1115901/page/1
Item 19: https://lihkg.com/thread/1062181/page/1
Item 20: https://lihkg.com/thread/1015801/page/1
Item 21: https://lihkg.com/thread/955001/page/1
Item 22: https://lihkg.com/thread/955011/page/1
Item 23: https://lihkg.com/thread/955021/page/1
Item 24: https://lihkg.com/thread/955041/page/1
...
About a million pages are skipped.
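To pin down exactly which pages are missing, the scraped item URLs can be compared against the sequence the loop is meant to produce. The following is a minimal sketch, not part of the original question: the helper names (`expected_ids`, `scraped_ids`, `missing_ids`) are hypothetical, and the start ID, step, and count are assumed to match the spider's loop (`2479991 - i*10` for `i` in `range(152500)`).

```python
import re

# Assumed values, taken from the loop in the spider.
START_ID = 2479991
STEP = 10
COUNT = 152500

# Matches item URLs of the form https://lihkg.com/thread/<id>/page/1
THREAD_RE = re.compile(r"/thread/(\d+)/page/1$")


def expected_ids(start=START_ID, step=STEP, count=COUNT):
    """Thread IDs the loop is meant to request, in request order."""
    return [start - i * step for i in range(count)]


def scraped_ids(urls):
    """Extract thread IDs from scraped item URLs, ignoring anything else."""
    ids = set()
    for url in urls:
        match = THREAD_RE.search(url)
        if match:
            ids.add(int(match.group(1)))
    return ids


def missing_ids(urls, **kwargs):
    """Expected thread IDs that have no scraped item, highest first."""
    seen = scraped_ids(urls)
    return [tid for tid in expected_ids(**kwargs) if tid not in seen]


if __name__ == "__main__":
    # A few URLs from the item list; a real check would load the full export.
    sample = [
        "https://lihkg.com/thread/2479991/page/1",
        "https://lihkg.com/thread/2479981/page/1",
        "https://lihkg.com/thread/2479971/page/1",
        "https://lihkg.com/thread/2479941/page/1",
    ]
    print(missing_ids(sample, count=6))  # [2479961, 2479951]
```

Running this over the full item export would show whether the gaps are contiguous ranges or scattered IDs, which helps narrow down where the requests are being dropped.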
Here is the spider code:
from lihkg.items import LihkgItem
import scrapy
import time
from scrapy_splash import SplashRequest


class LihkgSpider13(scrapy.Spider):
    name = 'lihkg1-950000'
    http_user = '(my splash api key here)'
    allowed_domains = ['lihkg.com']
    start_urls = ['https://lihkg.com/']

    script1 = """
        function main(splash, args)
            splash.images_enabled = false
            assert(splash:go(args.url))
            assert(splash:wait(2))
            return {
                html = splash:html(),
                png = splash:png(),
                har = splash:har(),
            }
        end
    """

    def parse(self, response):
        for i in range(152500):
            time.sleep(0)
            url = "https://lihkg.com/thread/" + str(2479991 - i*10) + "/page/1"
            yield SplashRequest(url=url, callback=self.parse_article, endpoint='execute',
                                args={
                                    'html': 1,
                                    'lua_source': self.script1,
                                    'wait': 2,
                                })

    def parse_article(self, response):
        item = LihkgItem()
        item['author'] = response.xpath('//*[@id="1"]/div/small/span[2]/a/text()').get()
        item['time'] = response.xpath('//*[@id="1"]/div/small/span[4]/@data-tip').get()
        item['texts'] = response.xpath('//*[@id="1"]/div/div[1]/div/text()').getall()
        item['images'] = response.xpath('//*[@id="1"]/div/div[1]/div/a/@href').getall()
        item['emoji'] = response.xpath('//*[@id="1"]/div/div[1]/div/img/@src').getall()
        item['title'] = response.xpath('//*[@id="app"]/nav/div[2]/div[1]/span/text()').get()
        item['likes'] = response.xpath('//*[@id="1"]/div/div[2]/div/div[1]/div/div[1]/label/text()').get()
        item['dislikes'] = response.xpath('//*[@id="1"]/div/div[2]/div/div[1]/div/div[2]/label/text()').get()
        item['category'] = response.xpath('//*[@id="app"]/nav/div[1]/div[2]/div/span/text()').get()
        item['url'] = response.url
        yield item
I have Crawlera, DeltaFetch, and DotScrapy Persistence enabled in the project.
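One thing that may be worth ruling out, given this setup: DeltaFetch is designed to skip requests for pages that already produced items in earlier runs of the same spider, and DotScrapy Persistence keeps that state between Scrapy Cloud jobs. Whether this accounts for the gaps here is not confirmed; the fragment below is only a diagnostic sketch, assuming the add-on is configured through the project's `settings.py` rather than the Scrapy Cloud UI.

```python
# settings.py (fragment): diagnostic run only, an assumption about where
# the project configures scrapy-deltafetch.

# Option A: turn DeltaFetch off so no request is skipped because of
# items seen in a previous job.
DELTAFETCH_ENABLED = False

# Option B: keep DeltaFetch on but clear its stored state for this run.
# DELTAFETCH_RESET = True
```

If the missing thread IDs reappear with DeltaFetch disabled or reset, the skipped pages were most likely ones already scraped in an earlier job; if they stay missing, the cause lies elsewhere, for example in failed or dropped requests.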
Solution