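python - Scrapy spider does not iterate correctly and has an If statement problem
Problem description
I am trying to use Scrapy to scrape applicant data from a table. I have two problems:
1) I want a CSV with one applicant per row: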
'username': ['clickclack123'],'lsat':['170'],'gpa':['3.57']...
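My code currently extracts the data for all applicants into a single row, ignores empty values, and repeats that row once for every applicant on the page (100 identical rows, each containing all the data on the page):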
'username': ['clickclack123','UM2014','3litersaday'...
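2) The table contains a class of elements ("signifiers") that indicate an applicant's characteristics. I want to include an If statement that checks the signifier and saves each characteristic as True where it applies. I included an If statement with this logic in lawschool.py (below), but it does not let my spider run.
My thoughts and attempts:
- For problem #1, I have seen posts about similar problems, but those solutions do not work here because my data contains empty values that I do not want to ignore.
- I believe my For loop is the problem, because it does not iterate over each applicant correctly, but I have not been able to fix it. It currently extracts all the data on the page into a single row of my CSV and repeats that row once per applicant on the page (100 identical rows, each containing all the data on the page). If I change extract() to extract_first(), the spider extracts only the first applicant's data (100 identical rows, each containing the first applicant's data).
- For problem #2, I am not sure why my code fails to run with this If statement. I had to comment it out in order to work on problem #1.
lawschool.py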
    import scrapy
    from ..items import ApplicantItem

    class LawschoolSpider(scrapy.Spider):
        name = "lawschool"
        start_urls = [
            'http://nyu.lawschoolnumbers.com/applicants',
            'http://columbia.lawschoolnumbers.com/applicants'
        ]

        def parse(self, response):
            items = []
            for applicant in response.xpath("//tr[@class='row']"):
                signifier = response.xpath("//span[@class='signifier']/text()").extract()
                if signifier == 'W':
                    withdrawn = True
                elif signifier == 'A':
                    accepted == True
                elif signifier == 'U':
                    minority == True
                elif signifier == 'N':
                    non_traditional == True
                elif signifier == 'I':
                    international = True
                else:
                    return False
                school = response.xpath("//h1/text()").extract()
                school = [i.replace(' Applicants','') for i in school]
                item = ApplicantItem(
                    school = school,
                    username = response.xpath("//td/a/text()").extract(),
                    lsat = response.xpath("//td[contains(@style, 'font-weight:bold')]/following-sibling::td[1]/text()").extract(),
                    gpa = response.xpath("//td[contains(@style, 'font-weight:bold')]/following-sibling::td[2]/text()").extract(),
                    scholarship = response.xpath("//td[contains(@style, 'font-weight:bold')]/following-sibling::td[4]/text()").extract(),
                    status = response.xpath("//td[contains(@style, 'font-weight:bold')]/following-sibling::td[5]/text()").extract(),
                    sent = response.xpath("//td[contains(@style, 'font-weight:bold')]/following-sibling::td[6]/text()").extract(),
                    complete = response.xpath("//td[contains(@style, 'font-weight:bold')]/following-sibling::td[7]/text()").extract(),
                    decision = response.xpath("//td[contains(@style, 'font-weight:bold')]/following-sibling::td[8]/text()").extract(),
                    last_updated = response.xpath("//td[contains(@style, 'font-weight:bold')]/following-sibling::td[9]/text()").extract()
                    withdrawn_application = withdrawn,
                    accepted_offer = accepted,
                    minority = minority,
                    non_traditional = non_traditional,
                    international = international
                )
                yield item
            for a in response.xpath("//*[@id='applicants_list']/div/a[9]"):
                yield response.follow(a, callback=self.parse)
items.py
    from scrapy import Item, Field

    class ApplicantItem(Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        school = Field()
        username = Field()
        lsat = Field()
        gpa = Field()
        scholarship = Field()
        status = Field()
        sent = Field()
        complete = Field()
        decision = Field()
        last_updated = Field()
        withdrawn_application = Field()
        accepted_offer = Field()
        minority = Field()
        non_traditional = Field()
        international = Field()
pipelines.py
    from scrapy import signals
    from scrapy.exporters import CsvItemExporter
    from .items import ApplicantItem

    class LSNPipeline(object):
        @classmethod
        def from_crawler(cls, crawler):
            pipeline = cls()
            crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
            crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
            return pipeline

        def spider_opened(self, spider):
            # One CSV file and one exporter per item type.
            item_names = ['applicant']
            self.files = {n: open('%s.csv' % n, 'w+b') for n in item_names}
            self.exporters = {n: CsvItemExporter(f) for n, f in self.files.items()}
            for exporter in self.exporters.values():
                exporter.start_exporting()

        def spider_closed(self, spider):
            for exporter in self.exporters.values():
                exporter.finish_exporting()
            for file in self.files.values():
                file.close()

        def process_item(self, item, spider):
            if isinstance(item, ApplicantItem):
                self.exporters['applicant'].export_item(item)
            return item
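Solution
You need relative XPath expressions. An expression that starts with // and is called on response searches the whole page on every iteration, so every row ends up with every applicant's data. Calling it on applicant with a leading . restricts the search to the current row:
    username = applicant.xpath(".//td/a/text()").extract(),
    lsat = applicant.xpath(".//td[2]/text()").extract(),
    gpa = applicant.xpath(".//td[3]/text()").extract(),
    ...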