Extracting data from HTML table using scrapy: response.xpath() yields None

Problem description

I've been building a web scraper in Python 3 using the Scrapy library and I'm running into a problem I don't understand. I've successfully scraped other tables by using Inspect Element on the table to get its XPath. With this table, however, I can't figure out how to extract the data. I am new to HTML but not new to programming, so please tell me if I'm way off here.

An example of this web page would be: http://land.elpasoco.com/ResidentialBuilding.aspx?schd=5317443025&bldg=1

Inspecting the page and getting the xpath for the target table yields //*[@id="aspnetForm"]/table/tbody/tr[3]/td[1]/table/tbody/tr[1]/td/table/tbody/tr[3]/td/table

However, in a scrapy shell, response.xpath(target).extract() returns []. Targeting any individual cell gives the same empty result. My intended result is a dataframe or dictionary along the lines of {'Dwelling Units': 1, 'Year Built': 2010 ... }. Any help identifying where I'm going wrong, or how to get the data into that format, would be appreciated. Thanks!

Tags: python, html, xpath, web-scraping, scrapy

Solution


import scrapy


class ResidentialRecordsSpider(scrapy.Spider):
    name = "residential_records"

    start_urls = [
        'http://land.elpasoco.com/ResidentialBuilding.aspx?schd=5317443025&bldg=1',
    ]

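    def parse(self, response):
        # Match the cells of the table with width="90%" by attribute
        # rather than by an absolute path copied from the browser.
        for record in response.xpath('//table[@width="90%"]//td'):
            # The label is in a <strong> child; the value is the cell's own text.
            key = record.xpath('./strong/text()').extract_first(default='')
            value = record.xpath('./text()').extract_first(default='')

            yield {key: value}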

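The absolute XPath copied from the browser most likely fails because browsers add <tbody> elements to the DOM that are usually absent from the raw HTML Scrapy receives; selecting the table by attribute avoids this. With this spider, the only remaining work is data cleaning, such as stripping whitespace and discarding cells that have no label.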
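As a rough illustration of that cleaning step, the sketch below merges the single-pair dictionaries the spider yields into one dictionary of the shape asked for in the question. The helper name clean_records and the sample records are hypothetical, and the sketch assumes labels may carry a trailing colon and surrounding whitespace; the real page may need different rules.

```python
def clean_records(records):
    """Merge single-pair dicts into one dict, skipping cells with no label."""
    cleaned = {}
    for record in records:
        for key, value in record.items():
            # Assumption: labels may end with a colon and have stray whitespace.
            key = key.strip().rstrip(':').strip()
            if not key:
                continue  # the cell had no <strong> label
            value = value.strip()
            # Turn purely numeric strings into integers; keep everything else as text.
            cleaned[key] = int(value) if value.isdecimal() else value
    return cleaned


# Hypothetical sample of items in the form the spider yields.
sample = [
    {'': ''},
    {'Dwelling Units:': ' 1 '},
    {'Year Built:': ' 2010'},
    {'Style:': ' Ranch '},
]
result = clean_records(sample)
print(result)
```

In a real project this logic could equally live in an item pipeline; a plain function is used here only to keep the example independent of Scrapy.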

