python - Extracting data from HTML table using scrapy: response.xpath() yields None
Problem description
I've been building a web scraper in Python 3 using the Scrapy library and I'm running into a problem I don't understand. I've successfully scraped other tables by using Inspect Element on the table to get the XPath expressions. With this table, however, I can't figure out how to extract the data. I am new to HTML but not new to programming, so please help me if I'm way off here.
An example of this web page would be: http://land.elpasoco.com/ResidentialBuilding.aspx?schd=5317443025&bldg=1
Inspecting the page and getting the xpath for the target table yields //*[@id="aspnetForm"]/table/tbody/tr[3]/td[1]/table/tbody/tr[1]/td/table/tbody/tr[3]/td/table
However, running response.xpath(target).extract() with this path in a scrapy shell returns []. Trying to target any individual cell gives the same empty result. My intended result would be a dataframe or dictionary correlating something like {'Dwelling Units': 1, 'Year Built': 2010 ... }
Any help identifying where I'm going wrong, or how to get the data formatted this way, would be appreciated. Thanks!
Solution
import scrapy


class ResidentialRecordsSpider(scrapy.Spider):
    name = "residential_records"
    start_urls = [
        'http://land.elpasoco.com/ResidentialBuilding.aspx?schd=5317443025&bldg=1',
    ]

    def parse(self, response):
        # Match the table by its width attribute rather than a browser-copied path.
        for record in response.xpath('//table[@width="90%"]//td'):
            # The label sits in a <strong> child; the value is the cell's own text.
            key = record.xpath('./strong/text()').extract_first(default='')
            value = record.xpath('./text()').extract_first(default='')
            yield {key: value}
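With this spider, the only remaining step is some data cleaning, such as stripping whitespace and dropping cells that have no label. The XPath copied from the browser most likely fails because browsers insert tbody elements into the rendered DOM that are often absent from the raw HTML that Scrapy receives; the spider above avoids this by matching the table on its width attribute instead.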
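The data cleaning step could look something like the sketch below. It is only an illustration: clean_records is a hypothetical helper, the sample items are made up to resemble the spider's output rather than taken from the real page, and the trailing colon on labels is an assumption about how the page formats them.

```python
def clean_records(items):
    """Merge single-pair dicts into one dict, dropping cells without a label."""
    cleaned = {}
    for item in items:
        for key, value in item.items():
            # Assumption: labels may carry surrounding whitespace and a trailing colon.
            key = key.strip().rstrip(':').strip()
            value = value.strip()
            if key:  # skip cells that had no <strong> label
                cleaned[key] = value
    return cleaned


# Hypothetical sample shaped like the spider's output, not real page data.
sample = [
    {'Dwelling Units: ': ' 1 '},
    {'': '\r\n   '},
    {'Year Built:': ' 2010'},
]
print(clean_records(sample))
```

The values stay as strings here; converting them to numbers (for example with int()) would be a further step, depending on which fields the page actually contains. The resulting dictionary can also be passed to pandas if a dataframe is preferred.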