Unable to understand XPath siblings behaviour

Problem description

I am trying to scrape an HTML page in a scenario where the information is only available as consecutive sibling tags.

From the following code I would like to get the text of the "a" tags (e.g. Name1, Name2, ...), taking into consideration:

"a" followed by "span" gives information about that ID being a Customer or not.

"a" followed by "a" means that ID is anonymous.

<span class="list">
    <em>List 1:</em>
</span>
<a href="/ID/423006">Name1</a>, 
<a href="/ID/115325">Name2</a>
<span class="small">(Customer)</span>, 
<a href="/ID/248819">Name3</a>
<span class="small">(Non Customer)</span>, 
<a href="/ID/658259">Name4</a>
<span class="small">(Customer)</span>, 
<a href="/ID/294083">Name5</a>
<a href="/ID/218292">Name6</a>
<span class="small">(Non Customer)</span>

I'm using the following XPath to try to match an "a" followed by a "span":

//a[contains(@href,'ID/') and ./following-sibling::span[1][text() = '(Customer)']]/text()

This will return Name1, Name2 and Name4, even if Name1 is not a Customer. What am I doing wrong?

Tags: xpath, scrapy

Solution


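It's because following-sibling::span[1] selects the first span among all the following siblings, skipping any "a" elements in between. For Name1 that is the span after Name2, and its text does indeed equal "(Customer)".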

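Instead, what you should do is select the first following sibling of any type ( *[1]), check that this sibling is a span ( [self::span]), and only then check whether its text equals "(Customer)":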

//a[contains(@href,'ID/') and ./following-sibling::*[1][self::span][text() = '(Customer)']]/text()
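To see the difference between the two predicates, here is a minimal sketch using only the Python standard library. ElementTree's XPath subset has no following-sibling axis, so the two predicates are emulated with plain list indexing; with lxml or Scrapy selectors you would pass the XPath strings directly. The wrapping div root and the helper names (is_id_link, first_following_span, immediate_span, customers) are assumptions made for this illustration and are not part of the original page.

```python
import xml.etree.ElementTree as ET

# Markup from the question, wrapped in a hypothetical <div> root so it parses as XML.
SNIPPET = """<div>
<span class="list"><em>List 1:</em></span>
<a href="/ID/423006">Name1</a>,
<a href="/ID/115325">Name2</a>
<span class="small">(Customer)</span>,
<a href="/ID/248819">Name3</a>
<span class="small">(Non Customer)</span>,
<a href="/ID/658259">Name4</a>
<span class="small">(Customer)</span>,
<a href="/ID/294083">Name5</a>
<a href="/ID/218292">Name6</a>
<span class="small">(Non Customer)</span>
</div>"""


def is_id_link(el):
    """Emulates a[contains(@href, 'ID/')]."""
    return el.tag == "a" and "ID/" in el.get("href", "")


def first_following_span(siblings, i):
    """Emulates following-sibling::span[1]: skip ahead to the next <span>."""
    return next((s for s in siblings[i + 1:] if s.tag == "span"), None)


def immediate_span(siblings, i):
    """Emulates following-sibling::*[1][self::span]: the very next element,
    and only if that element is a <span>."""
    nxt = siblings[i + 1] if i + 1 < len(siblings) else None
    return nxt if nxt is not None and nxt.tag == "span" else None


def customers(pick):
    """Return the link texts whose span (chosen by `pick`) reads '(Customer)'."""
    siblings = list(ET.fromstring(SNIPPET))  # element children in document order
    names = []
    for i, el in enumerate(siblings):
        if not is_id_link(el):
            continue
        span = pick(siblings, i)
        if span is not None and span.text == "(Customer)":
            names.append(el.text)
    return names


original = customers(first_following_span)
corrected = customers(immediate_span)
print(original)   # ['Name1', 'Name2', 'Name4']
print(corrected)  # ['Name2', 'Name4']
```

The first helper reproduces the behaviour from the question: Name1 is matched because the search skips over Name2 to reach its span. The second only looks at the immediately following element, so anonymous IDs such as Name1 and Name5 are left out.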
