Scrapy: crawl the URLs on a page and extract values from each new page

Problem description

This is my first post here, and I'm very new to Scrapy. I'm trying to crawl the pages linked from this .htm page, https://www.bls.gov/bls/news-release/empsit.htm#2008, and extract two data points from each crawled page, using Scrapy to save each result to a CSV file.

Here's what I have so far:

import scrapy
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor

class EmpsitSpider(scrapy.Spider):
    name = "EmpsitSpider"
    allowed_domains = ["bls.gov"]
    start_urls = [
        'https://www.bls.gov/bls/news-release/empsit.htm#2008'
    ]

    rules = [
        Rule(
            LinkExtractor(allow_domains=("bls.gov"), restrict_xpaths=('//div[@id="bodytext"]/a[following-sibling::text()[contains(., ".htm")]]')),
            follow=True, callback="parse_items"),
    ]

    def parse_items(self, response):
        self.logger.info("bls item page %s", response.url)
        item = scrapy.Item()
        item["SA"] = response.xpath('//*[@id="ces_table1"]/tbody/tr[138]/td[8]/span/text()').extract()
        item["NSA"] = response.xpath('//*[@id="ces_table1"]/tbody/tr[138]/td[4]/span/text()').extract()
        return item

which I then run with:

    scrapy crawl EmpsitSpider -o data.csv

This is where I've hit a wall with Scrapy: I can't loop through the .htm pages. I can pull the two data points from each individual page using their XPaths and lxml:

from lxml import html
import requests

page = requests.get('https://www.bls.gov/news.release/archives/empsit_01042019.htm')
tree = html.fromstring(page.content)

#part time for economic reasons, seasonally adjusted
SA = tree.xpath('//*[@id="ces_table1"]/tbody/tr[138]/td[8]/span')[0].text
print(SA)

#part time for economic reasons, not seasonally adjusted
NSA = tree.xpath('//*[@id="ces_table1"]/tbody/tr[138]/td[4]/span')[0].text
print(NSA)

I just can't loop through the URLs. Any ideas on how to proceed? Thanks for your help.

Tags: python, scrapy

Solution
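The spider above never reaches parse_items for two reasons. First, the rules attribute is only honored by CrawlSpider; a plain scrapy.Spider silently ignores it, so no links are ever followed. Second, a bare scrapy.Item() declares no fields, so assigning item["SA"] raises a KeyError; yielding a plain dict is the simplest fix, and Scrapy's feed exporter writes dicts to CSV directly. Below is a minimal sketch of a corrected spider; the link-extraction XPath and table XPaths are carried over from the question as-is and are assumptions here, not verified against the live BLS pages.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class EmpsitSpider(CrawlSpider):
    # CrawlSpider, not scrapy.Spider: only CrawlSpider applies the rules below
    name = "EmpsitSpider"
    allowed_domains = ["bls.gov"]
    start_urls = [
        'https://www.bls.gov/bls/news-release/empsit.htm#2008'
    ]

    rules = [
        Rule(
            LinkExtractor(
                allow_domains=("bls.gov",),
                # the question's XPath for the archive links, assumed correct
                restrict_xpaths='//div[@id="bodytext"]/a[following-sibling::text()[contains(., ".htm")]]'),
            # assumes the release pages are leaf pages with no further links to crawl
            follow=False,
            callback="parse_items"),
    ]

    def parse_items(self, response):
        self.logger.info("bls item page %s", response.url)
        # yield a plain dict; a field-less scrapy.Item() cannot hold these keys
        yield {
            # part time for economic reasons, seasonally adjusted
            "SA": response.xpath('//*[@id="ces_table1"]/tbody/tr[138]/td[8]/span/text()').get(),
            # part time for economic reasons, not seasonally adjusted
            "NSA": response.xpath('//*[@id="ces_table1"]/tbody/tr[138]/td[4]/span/text()').get(),
        }

Note that .get() requires Scrapy 1.8 or newer; on older versions, use .extract_first() instead. The same command as before, scrapy crawl EmpsitSpider -o data.csv, should then write one SA/NSA row per archive page.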

