Scraping web pages with the Kimurai gem

Question

I'm using the Kimurai Ruby gem to do some web scraping. I have this script, which works well:

require 'kimurai'

class SimpleSpider < Kimurai::Base
  @name = "simple_spider"
  @engine = :selenium_chrome
  @start_urls = ["https://apply.workable.com/taxjar/"]

  def parse(response, url:, data: {})
    # Update response to current response after interaction with a browser
    count = 0
    # browser.click_button "Show more"
    doc = browser.current_response
    returned_jobs = doc.css('.careers-jobs-list-styles__jobsList--3_v12')
    returned_jobs.css('li').each do |char_element|
      # puts char_element
      title = char_element.css('a')[0]['aria-label']
      link = "https://apply.workable.com" + char_element.css('a')[0]['href']

      # click on job link and get description
      browser.visit(link)
      job_page = browser.current_response
      description = job_page.xpath('/html/body/div[1]/div/div[1]/div[2]/div[2]/div[2]').text

      puts '*******'
      puts title
      puts link
      puts description
      puts count += 1
    end
    puts "There are #{count} jobs total"
  end
end

SimpleSpider.crawl!

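However, in this case I'd like all of this to return an array of objects... or jobs. I want to create a jobs array in the parse method, do something like jobs << [title, link, description, company] inside the returned_jobs loop, and have it returned when I call SimpleSpider.crawl!, but that doesn't work.

Any help is appreciated.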

Tags: ruby, web-scraping, kimurai

Answer


You can modify your code slightly, like this:

require 'kimurai'

class SimpleSpider < Kimurai::Base
  @name = "simple_spider"
  @engine = :selenium_chrome
  @start_urls = ["https://apply.workable.com/taxjar/"]

  def parse(response, url:, data: {})
    # Update response to current response after interaction with a browser
    # browser.click_button "Show more"
    doc = browser.current_response
    returned_jobs = doc.css('.careers-jobs-list-styles__jobsList--3_v12')

    # Collect one [title, link, description] entry per job
    jobs = []
    returned_jobs.css('li').each do |char_element|
      # puts char_element
      title = char_element.css('a')[0]['aria-label']
      link = "https://apply.workable.com" + char_element.css('a')[0]['href']

      # click on job link and get description
      browser.visit(link)
      job_page = browser.current_response
      description = job_page.xpath('/html/body/div[1]/div/div[1]/div[2]/div[2]/div[2]').text

      jobs << [title, link, description]
    end

    puts "There are #{jobs.count} jobs total"
    puts jobs
  end
end

SimpleSpider.crawl!

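I'm not sure about company, because I don't see that variable in your code. Still, the code above shows the idea: build the jobs array inside the loop, then work with it once the loop is done.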
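
If you want objects rather than bare arrays, one option is a small Struct. This is only a sketch in plain Ruby: Job, build_jobs and SAMPLE_ROWS are names I made up for illustration, and the sample rows are placeholder data standing in for what the loop would collect.

```ruby
# Hypothetical value object for one job posting (not part of Kimurai).
Job = Struct.new(:title, :link, :description, keyword_init: true)

# Placeholder rows in the same [title, link, description] shape that the
# scraping loop pushes into jobs. The values are made up for illustration.
SAMPLE_ROWS = [
  ["Example Job A", "https://example.com/j/A", "First description"],
  ["Example Job B", "https://example.com/j/B", "Second description"]
].freeze

# Convert each [title, link, description] row into a Job object.
def build_jobs(rows)
  rows.map do |title, link, description|
    Job.new(title: title, link: link, description: description)
  end
end

jobs = build_jobs(SAMPLE_ROWS)
jobs.each { |job| puts "#{job.title}: #{job.link}" }
puts "There are #{jobs.count} jobs total"
```

Inside parse you could push Job.new(...) straight into jobs instead of the three-element array. As for getting the array back from the call itself: as far as I know, crawl! does not hand back whatever parse returns. If I read the Kimurai README correctly, SimpleSpider.parse!(:parse, url: ...) does return the method's result, but please verify that against the docs for your version.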

Here is part of the output from running it in the terminal:

[Screenshot of the terminal output]

I also have a blog post on how to use the Kimurai framework in a Ruby on Rails application.

