How to skip paragraphs with comments in XPath expression?

Problem description

I'm trying to scrape websites like this with the following XPath expression:

.//div[@class="tresc"]/p[not(starts-with(text(), "<!--"))]

The thing is that the first paragraph is a comment section, so I'd like to skip it:

<!--[if gte mso 9]><xml>
<w:WordDocument>
<w:View>Normal</w:View>
<w:Zoom>0</w:Zoom>
<w:HyphenationZone>21</w:HyphenationZone>
<w:PunctuationKerning />
<w:ValidateAgainstSchemas />
<w:SaveIfXMLInvalid>false</w:SaveIfXMLInvalid>
<w:IgnoreMixedContent>false</w:IgnoreMixedContent>
<w:AlwaysShowPlaceholderText>false</w:AlwaysShowPlaceholderText>
<w:Compatibility>
<w:BreakWrappedTables />
<w:SnapToGridInCell />
<w:WrapTextWithPunct />
<w:UseAsianBreakRules />
<w:DontGrowAutofit />
</w:Compatibility>
<w:BrowserLevel>MicrosoftInternetExplorer4</w:BrowserLevel>
</w:WordDocument>
</xml><![endif]-->

Unfortunately, my expression does not skip the paragraph with comments. Anyone know what I'm doing wrong?

Tags: xpath, web-scraping, scrapy, xpath-2.0

Solution

Comments are not part of text(); they form a node type of their own: comment(). To exclude the p elements that contain a comment, use

p[not(comment())]
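
As a rough illustration of the difference, here is a minimal Python sketch. It assumes the lxml library is installed, and the SAMPLE markup is a hypothetical stand-in for the scraped page (a shortened Word comment inside the first p, followed by two ordinary paragraphs), not the real site.

```python
from lxml import html

# Hypothetical stand-in for the scraped page: the first <p> holds only an
# MS Word conditional comment, the other two hold ordinary text.
SAMPLE = """<html><body>
<div class="tresc">
  <p><!--[if gte mso 9]><xml><w:WordDocument></w:WordDocument></xml><![endif]--></p>
  <p>First real paragraph.</p>
  <p>Second real paragraph.</p>
</div>
</body></html>"""

doc = html.fromstring(SAMPLE)

# Expression from the question: text() only selects text nodes, so the
# comment is never seen and the first <p> is not filtered out.
original = doc.xpath('.//div[@class="tresc"]/p[not(starts-with(text(), "<!--"))]')

# Expression from the answer: drop every <p> that has a comment() child.
fixed = doc.xpath('.//div[@class="tresc"]/p[not(comment())]')

paragraphs = [p.text_content() for p in fixed]
print(len(original), paragraphs)
```

With this sample, the expression from the question still matches all three paragraphs, while the one from the answer keeps only the two with real text. Note that p[not(comment())] drops any p with a comment child, even one that also contains visible text.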
