python-3.x - Scrapy ignores formatted data when scraping a table
Question
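I am trying to scrape UFC dates from https://en.wikipedia.org/wiki/List_of_UFC_events using CSS selectors. However, I found that if any data in a cell is wrapped in additional tags such as <b></b>, <a></a>, or <p></p>, that data is not scraped at all.
I tried both .getall() and .extract_first(), and they give the same output. What am I missing?
Follow-up question: how do I scrape a table that has a specific ID?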
import scrapy


class UFCEVENTSSpider(scrapy.Spider):
    name = "ufcevents"

    def start_requests(self):
        url = "https://en.wikipedia.org/wiki/List_of_UFC_events"
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # TODO
        # row.css calls ignore data if it has html tags around it such as <b>13</b>
        for row in response.css("tbody tr"):
            ## Use the below to output to console
            ##
            #event = row.css("td:nth-child(1)::text").get()
            #date = row.css("td:nth-child(2)::text").get()
            #venue = row.css("td:nth-child(3)::text").get()
            #location = row.css("td:nth-child(4)::text").get()
            #ref = row.css("td:nth-child(5)::text").get()
            #notes = row.css("td:nth-child(6)::text").get()
            #ufce = UFCEvent(event, date, venue, location, ref, notes)
            #ufce.displayEvent()
            ## Use the below to create a json file with
            ## scrapy crawl ufcevents -o ufcevents.json
            yield {
                "event": row.css("td:nth-child(1)::text").getall(),
                "date": row.css("td:nth-child(2)::text").extract_first(),
                "venue": row.css("td:nth-child(3)::text").extract_first(),
                "location": row.css("td:nth-child(4)::text").extract_first(),
                "ref": row.css("td:nth-child(5)::text").extract_first(),
                "notes": row.css("td:nth-child(6)::text").extract_first(),
            }


class UFCEvent:
    def __init__(self, event, date, venue, location, ref, notes):
        self.event = event
        self.date = date
        self.venue = venue
        self.location = location
        self.ref = ref
        self.notes = notes

    def displayEvent(self):
        print("Event : ", self.event, ", Date: ", self.date, ", Venue: ", self.venue, ", Location: ", self.location, ", Reference: ", self.ref, ", Notes: ", self.notes)
Solution
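A selector such as row.css("td:nth-child(1)::text").getall() returns only the text nodes that are direct children of the td tag. If you need the text of the td tag and of its descendants, you have to use a selector like this:
row.css("td:nth-child(1) ::text").getall()
Note the space added before ::text. That space is what makes the selector include the text of child elements.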
You need to make the following changes in your code:
....
yield {
    "event": row.css("td:nth-child(1) ::text").getall(),
    "date": row.css("td:nth-child(2) ::text").extract_first(),
    "venue": row.css("td:nth-child(3) ::text").extract_first(),
    "location": row.css("td:nth-child(4) ::text").extract_first(),
    "ref": row.css("td:nth-child(5) ::text").extract_first(),
    "notes": row.css("td:nth-child(6) ::text").extract_first(),
}
....