Scrapy ignores formatted data when scraping a table

Problem description

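I am trying to scrape UFC dates from https://en.wikipedia.org/wiki/List_of_UFC_events using CSS selectors. However, I found that if the data in a cell is wrapped in additional tags such as <b></b>, <a></a> or <p></p>, it is not scraped at all.

I tried both .getall() and .extract_first(), and they give the same output. What am I missing?

Follow-up question: how can I scrape a table with a specific ID?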

import scrapy

class UFCEVENTSSpider(scrapy.Spider):
    name = "ufcevents"

    def start_requests(self):
        url = "https://en.wikipedia.org/wiki/List_of_UFC_events"
        yield scrapy.Request(url=url, callback=self.parse)
    
    def parse(self, response):

        #TODO
        # row.css calls ignore data if it has html tags around it such as <b>13</b>

        for row in response.css("tbody tr"):

            ## Use the below to output to console
            ##
            #event = row.css("td:nth-child(1)::text").get()
            #date = row.css("td:nth-child(2)::text").get()
            #venue = row.css("td:nth-child(3)::text").get()
            #location = row.css("td:nth-child(4)::text").get()
            #ref = row.css("td:nth-child(5)::text").get()
            #notes = row.css("td:nth-child(6)::text").get()

            #ufce = UFCEvent(event, date, venue, location, ref, notes)
            #ufce.displayEvent()

            ## Use the below to create a json file with
            ## scrapy crawl ufcevents -o ufcevents.json
            yield{
                "event": row.css("td:nth-child(1)::text").getall(),
                "date": row.css("td:nth-child(2)::text").extract_first(),
                "venue": row.css("td:nth-child(3)::text").extract_first(),
                "location": row.css("td:nth-child(4)::text").extract_first(),
                "ref": row.css("td:nth-child(5)::text").extract_first(),
                "notes": row.css("td:nth-child(6)::text").extract_first()
            }


class UFCEvent:

    def __init__(self, event, date, venue, location, ref, notes):
        self.event = event
        self.date = date
        self.venue = venue
        self.location = location
        self.ref = ref
        self.notes = notes

    def displayEvent(self):
        print ("Event : ", self.event,  ", Date: ", self.date, ", Venue: ", self.venue, ", Location: ", self.location, ", Reference: ", self.ref, ", Notes: ", self.notes)

Tags: python-3.x, scrapy

Solution


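With a selector like row.css("td:nth-child(1)::text").getall() you only get the text nodes that sit directly inside the td tag. If you need the text from the td tag and from its child elements, you have to use a selector like this: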

row.css("td:nth-child(1) ::text").getall()

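Note the space before ::text. That space is what makes the selector match the text of child elements as well.

You need to make the following changes in your code: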

....
yield{
                "event": row.css("td:nth-child(1) ::text").getall(),
                "date": row.css("td:nth-child(2) ::text").extract_first(),
                "venue": row.css("td:nth-child(3) ::text").extract_first(),
                "location": row.css("td:nth-child(4) ::text").extract_first(),
                "ref": row.css("td:nth-child(5) ::text").extract_first(),
                "notes": row.css("td:nth-child(6) ::text").extract_first()
            }
....
