scrapy returns too many rows in a table

Question

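It feels like I am missing some concept here, or trying to fly before I can crawl (pun intended).

The page does have 5 tables, and the one I am interested in is the third. But executing this: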

#!/usr/bin/python
# python 3.x

import sys
import os
import re
import requests
import scrapy

class iso3166_spider( scrapy.Spider):
  name = "countries"

  def start_requests( self):
    urls = ["https://en.wikipedia.org/wiki/ISO_3166-1"]
    for url in urls:
      yield scrapy.Request( url=url, callback=self.parse) 

  def parse( self, response):
    title = response.xpath('//title/text()').get()
    print("-- title -- {0}".format(title))
    list_table_selector = response.xpath('//table')   # gets all tables on page
    print("-- table count -- {0}".format( len( list_table_selector)))
    table_selector = response.xpath('//table[2]')     # inspect to figure out which one u want
    table_selector_text = table_selector.getall()     # got the right table, starts with Afghanistan
#   print( table_selector_text)
#
#   here is where things go wrong
    list_row_selector = table_selector.xpath('//tr')
    print("number of rows in table: {0}".format( len( list_row_selector)))  # gives 302, should be close to 247
    for i in range(0,20):
      row_selector = list_row_selector[i]
      row_selector_text = row_selector.getall()
      print("i={0}, getall:{1}".format(i, row_selector_text))

Printing the getall() of each row, I see that the Afghanistan row is row 8 rather than row 2.

Changing

    list_row_selector = table_selector.xpath('//tr')

to

    list_row_selector = table_selector.xpath('/tr')

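results in zero rows found, where I expected about 247.

Ultimately I want each country's name and its three codes, which should be straightforward.

What am I doing wrong?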

TIA,

克重克

Tags: python, scrapy

Answer


Try this XPath. I checked the page source, and the header row (the "th" elements) also sits under tbody, so position()>1 skips it:

tbl = response.xpath("//th[starts-with(text(),'English short name')]/ancestor::table/tbody/tr[position()>1]")

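You can also replace '//tr' with './/tr'. On a nested selector, an XPath that starts with '//' is still evaluated against the whole document, so table_selector.xpath('//tr') returns every row on the page; the leading dot makes the path relative to the selected table. Likewise, '/tr' looks for a tr at the document root, which never matches.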

