python-3.x - Stuck in a loop / wrong XPath while scraping a website
Problem description
I am trying to scrape data from this website: https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM
I wrote the following spider to get the initial data:
import scrapy


class WaiascrapSpider(scrapy.Spider):
    name = 'waiascrap'
    allowed_domains = ['clsaa-dc.org']
    start_urls = ['https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM']

    def parse(self, response):
        rows = response.xpath("//tr")
        for row in rows:
            day = rows.xpath("(//tr/td[@class='time']/span)[1]/text()").get()
            time = rows.xpath("//tr/td[@class='time']/span/time/text()").get()
            yield {
                'day': day,
                'time': time,
            }
But the data I get is duplicated, as if I were not advancing through the for loop at all:
PS C:\Users\gasgu\PycharmProjects\ScrapingProject\projects\waia> scrapy crawl waiascrap
2021-08-20 15:25:11 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: waia)
2021-08-20 15:25:11 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.6 (tags/v3.9.6:db3ff76, Jun 28 2021, 15:26:21) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1k 25 Mar 2021), cryptography 3.4.7, Platform Windows-10-10.0.19042-SP0
2021-08-20 15:25:11 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-08-20 15:25:11 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'waia', 'NEWSPIDER_MODULE': 'waia.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['waia.spiders']}
2021-08-20 15:25:11 [scrapy.extensions.telnet] INFO: Telnet Password: 9299b6be5840b21c
2021-08-20 15:25:11 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.logstats.LogStats']
2021-08-20 15:25:11 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-08-20 15:25:11 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-08-20 15:25:11 [scrapy.middleware] INFO: Enabled item pipelines: []
2021-08-20 15:25:11 [scrapy.core.engine] INFO: Spider opened
2021-08-20 15:25:11 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-08-20 15:25:11 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-08-20 15:25:12 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://aa-dc.org/robots.txt> (referer: None)
2021-08-20 15:25:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM> (referer: None)
2021-08-20 15:25:16 [scrapy.core.scraper] DEBUG: Scraped from <200 https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM>
{'day': 'Sunday', 'time': '6:45 am'}
2021-08-20 15:25:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM>
{'day': 'Sunday', 'time': '6:45 am'}
2021-08-20 15:25:22 [scrapy.core.scraper] DEBUG: Scraped from <200 https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM>
{'day': 'Sunday', 'time': '6:45 am'}
[... the identical item {'day': 'Sunday', 'time': '6:45 am'} keeps repeating through 15:25:39 ...]
EDIT:
It works now; the problem was the error @Prophet pointed out, combined with a mistake in my XPath.
My working code is below:
import scrapy


class WaiascrapSpider(scrapy.Spider):
    name = 'waiascrap'
    allowed_domains = ['clsaa-dc.org']
    start_urls = ['https://aa-dc.org/meetings?tsml-day=any&tsml-type=IPM']

    def parse(self, response):
        rows = response.xpath("//tr")
        for row in rows:
            day = row.xpath(".//td[@class='time']/span/text()").get()
            time = row.xpath(".//td[@class='time']/span/time/text()").get()
            yield {
                'day': day,
                'time': time,
            }
Solution
To select an element inside another element you have to put a dot (.) in front of the XPath expression, meaning "from here on". Otherwise, an expression like (//tr/td[@class='time']/span)[1]/text() is evaluated against the whole page every time, so it keeps returning the first match on the entire page, which is exactly what you are seeing.
Also, since you are iterating over each row, it should be row.xpath(...), not rows.xpath(...): rows is a list of selectors, while each row is a single selector.
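The difference between the absolute and the relative locator can be reproduced outside Scrapy with a minimal sketch using lxml (the library underneath Scrapy's selectors); the two-row HTML table here is invented for illustration:

```python
from lxml import etree

# Two rows, each with its own time cell.
html = ("<table>"
        "<tr><td class='time'><span>Sunday</span></td></tr>"
        "<tr><td class='time'><span>Monday</span></td></tr>"
        "</table>")
rows = etree.HTML(html).xpath("//tr")

# Absolute path: // always searches from the document root,
# so every row yields the first match on the whole page.
absolute = [r.xpath("(//td[@class='time']/span)[1]/text()")[0] for r in rows]
print(absolute)  # ['Sunday', 'Sunday']

# Relative path (leading dot): searches from the current row only.
relative = [r.xpath(".//td[@class='time']/span/text()")[0] for r in rows]
print(relative)  # ['Sunday', 'Monday']
```

The same rule applies inside a Scrapy spider: row.xpath("//td[...]") still scans the whole response, while row.xpath(".//td[...]") stays inside that row.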
Note that each row is still a Scrapy Selector, so you keep calling its .xpath() method with a relative locator; find_element_by_xpath is a Selenium method and does not exist on Scrapy selectors.
def parse(self, response):
    rows = response.xpath("//tr")
    for row in rows:
        day = row.xpath(".//td[@class='time']/span/text()").get()
        time = row.xpath(".//td[@class='time']/span/time/text()").get()
        yield {
            'day': day,
            'time': time,
        }
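One caveat worth adding: //tr matches every table row, including header or layout rows that have no td.time cell, and for those the lookup comes back empty. A sketch of filtering such rows out, again using lxml with invented sample HTML:

```python
from lxml import etree

# One header row without a time cell, one data row with one.
html = ("<table>"
        "<tr><th>Header</th></tr>"
        "<tr><td class='time'><span>Sunday<time>6:45 am</time></span></td></tr>"
        "</table>")

items = []
for row in etree.HTML(html).xpath("//tr"):
    day = row.xpath(".//td[@class='time']/span/text()")
    time = row.xpath(".//td[@class='time']/span/time/text()")
    if not day:  # header/layout row: no time cell, skip it
        continue
    items.append({'day': day[0], 'time': time[0]})

print(items)  # [{'day': 'Sunday', 'time': '6:45 am'}]
```

In the Scrapy spider itself the equivalent guard is `if day is None: continue`, since Selector.get() returns None when nothing matches.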