python - How to paginate with scrapy-splash
Problem description
Goal
I want to scrape https://www.livecoinwatch.com with scrapy + splash (I don't want to use Selenium), but I don't know how to handle pagination, so I can only crawl the first page.
- I'd like to know how to paginate from within the Splash (Lua) script.
- Is that possible?
- The URL does not change when the next-page button is clicked.
Here is my spider code:
import scrapy
from scrapy_splash import SplashRequest
from coins.items import CoinsItem


class CoinsSpiderSpider(scrapy.Spider):
    name = 'coins_spider'
    allowed_domains = ['livecoinwatch.com']
    start_urls = ['https://www.livecoinwatch.com']
    Pages = 3

    lua_script = '''
    function main(splash, args)
        splash.private_mode_enabled = false
        url = args.url
        headers = {
            ['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36 Edg/94.0.992.38'
        }
        splash:set_custom_headers(headers)
        assert(splash:go(url))
        assert(splash:wait(1))
        assert(splash:wait(5))
        splash:set_viewport_full()
        return splash:html()
    end
    '''

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse, endpoint='execute', args={
                'lua_source': self.lua_script
            })

    def parse(self, response):
        # 50 results in first page
        rows = response.xpath('//tr[@class="table-row filter-row"]')
        for row in rows:
            item = CoinsItem()
            item['coin'] = row.xpath('./td[2]//div[@class="item-name ml10"]/div/text()').extract_first()
            item['price'] = row.xpath('./td[3]').extract_first()
            item['marketCap'] = row.xpath('./td[4]/text()').extract_first()
            item['volumn24h'] = row.xpath('./td[5]/text()').extract_first()
            item['Liquidity'] = row.xpath('./td[6]/text()').extract_first()
            item['allTimeHigh'] = row.xpath('./td[7]/text()').extract_first()
            item['hour1_value'] = row.xpath('./td[8]/span/text()').extract_first()
            item['hour1_class'] = row.xpath('./td[8]/@class').extract_first()
            item['hour24_value'] = row.xpath('./td[9]/span/text()').extract_first()
            item['hour24_class'] = row.xpath('./td[9]/@class').extract_first()
            yield item
        # next page
        # do not know how to code!!!
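For reference, both this spider and the answer below import CoinsItem from coins.items, which the post does not show. A minimal items.py sketch that would make the code runnable, assuming one Field per key assigned in parse:

import scrapy

# hypothetical items.py; field names mirror the keys used in the spider above
class CoinsItem(scrapy.Item):
    coin = scrapy.Field()
    price = scrapy.Field()
    marketCap = scrapy.Field()
    volumn24h = scrapy.Field()
    Liquidity = scrapy.Field()
    allTimeHigh = scrapy.Field()
    hour1_value = scrapy.Field()
    hour1_class = scrapy.Field()
    hour24_value = scrapy.Field()
    hour24_class = scrapy.Field()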
Solution
Sending requests directly to the site's API is simpler than using scrapy_splash. If you inspect the XHR requests fired when you click the page navigation at the bottom, you will notice a request to https://http-api.livecoinwatch.com/coins?offset=50&limit=50&sort=rank&order=ascending&currency=USD that returns a JSON response. Adjust the offset and limit parameters to control how much data each request returns.
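Before writing the spider, you can sanity-check the endpoint with a quick script. A minimal sketch using requests; the response shape (a top-level 'data' list of coin objects) is inferred from the spider below and may change, since this is an unofficial API:

import requests

# fetch the first 5 coins; parameters mirror the XHR request above
resp = requests.get(
    'https://http-api.livecoinwatch.com/coins',
    params={'offset': 0, 'limit': 5, 'sort': 'rank',
            'order': 'ascending', 'currency': 'USD'},
)
for coin in resp.json()['data']:  # 'data' key assumed from the spider below
    print(coin.get('code'), coin.get('price'))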
See an example implementation below:
import scrapy
from coins.items import CoinsItem


class CoinsSpiderSpider(scrapy.Spider):
    name = 'coins_spider'
    allowed_domains = ['livecoinwatch.com']
    # start from the first item and fetch 500 items per request; adjust as suits you
    offset = 0
    limit = 500
    start_urls = [f'https://http-api.livecoinwatch.com/coins?offset={offset}&limit={limit}&sort=rank&order=ascending&currency=USD']

    def parse(self, response):
        data = response.json()
        for coin in data['data']:
            item = CoinsItem()
            item['coin'] = coin.get('code')
            item['price'] = coin.get('price')
            item['marketCap'] = coin.get('cap')
            item['volumn24h'] = coin.get('volume')
            # ... check the json response and add the other fields you need
            yield item
        # yield the next page; stop once the API returns an empty batch,
        # otherwise the spider would keep requesting forever
        if data['data']:
            self.offset += self.limit
            next_url = f'https://http-api.livecoinwatch.com/coins?offset={self.offset}&limit={self.limit}&sort=rank&order=ascending&currency=USD'
            yield scrapy.Request(next_url, callback=self.parse)
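To answer the original question directly: if you would rather keep scrapy-splash, pagination on a page whose URL never changes can be scripted in Lua by clicking the next-page button and collecting the HTML after each click. A minimal sketch; the a.pagination-next selector is a placeholder assumption (inspect the live page for the real button), and splash:select requires Splash 2.3+:

local treat = require('treat')

function main(splash, args)
    splash.private_mode_enabled = false
    assert(splash:go(args.url))
    assert(splash:wait(5))
    local pages = {}
    for i = 1, 3 do
        -- keep a snapshot of the page currently rendered
        pages[#pages + 1] = splash:html()
        -- placeholder selector: replace with the site's real next-page button
        local button = splash:select('a.pagination-next')
        if not button then
            break
        end
        button:mouse_click()
        -- clicks are asynchronous, so give the table time to re-render
        assert(splash:wait(3))
    end
    -- treat.as_array makes the table serialize as a JSON array
    return treat.as_array(pages)
end

With the execute endpoint, returning a Lua table produces a JSON response, so the Scrapy callback would read the page snapshots from response.data rather than parsing a single rendered page.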