How do I select all nodes between certain headers?

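Problem description

Each <header> tag contains the title of a conference. Each <ul> tag contains the links for that conference.

When I scrape the site, I try to associate each <header> tag with the links in its <ul> tags. But I don't know how to select only the <ul> tags that are siblings sitting between two given <header> tags.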

HTML:

<header>... 0 ... </header>
<ul class="publ-list">... 0 ...</ul>
<header>... 1 ... </header> 
<ul class="publ-list">... 0 ...</ul>
<header>... 2 ... </header>
<ul class="publ-list">... 0 ...</ul>
<p>...</p>
<ul class="publ-list">... 1 ...</ul>
<header>... 3 ...</header>
<ul class="publ-list">... 0 ...</ul>
<ul class="publ-list">... 1 ...</ul>
<ul class="publ-list">... 2 ....</ul>
<ul class="publ-list">... 3 ....</ul>
<ul class="publ-list">... 4 ....</ul>
<header>... 4 ...</header>

My code:

TITLE_OF_EDITIONS_SELECTOR = 'header h2'
GROUP_OF_TYPES_OF_EDITION_SELECTOR = ".publ-list"

editions = {}
size_editions = len(response.css(GROUP_OF_TYPES_OF_EDITION_SELECTOR))
i = 0
while i < size_editions:

    # Get the title of the conference.
    # Note: this pairs the i-th header with the i-th <ul> purely by position.
    title_edition_conference = response.css(TITLE_OF_EDITIONS_SELECTOR)[i]

    # Get the data and links from the <ul> tags (.publ-list)
    TYPES_OF_CONFERENCE = response.css(GROUP_OF_TYPES_OF_EDITION_SELECTOR)[i]
    TYPE = TYPES_OF_CONFERENCE.css('.entry')
    types_of_edition = {}
    size_type_editions = 0
    for type_of_conference in TYPE:
        title_type = type_of_conference.css('.data .title ::text').extract()
        link_type = type_of_conference.css('.publ ul .drop-down .body ul li a ::attr(href)').extract_first()
        types_of_edition[size_type_editions] = {
            "title": title_type,
            "link": link_type,
            }
        size_type_editions = size_type_editions + 1

    editions[i] = {
        "title_edition_conference": title_edition_conference,
        "types_of_edition": types_of_edition
        }
    i = i + 1

The problem with my code

I tested this with jQuery in the Google Chrome console, for example:

"$($('header')[0]).nextUntil($('header')[1])"

But how can I select the same nodes using XPath or a CSS selector? Thanks!

Tags: python-3.x, xpath, scrapy, css-selectors

Solution


Try using the following-sibling axis here:

>>> txt = """<header>..</header>
... <ul class="publ-list">...</ul>
... <header>..</header>
... <ul class="publ-list">...</ul>
... <header>..</header>
... <ul class="publ-list">...</ul>
... <p>...</p>
... <ul class="publ-list">...</ul>
... <header>..</header>
... <ul class="publ-list">...</ul>
... <ul class="publ-list">...</ul>
... <header>..</header>"""
>>> from scrapy import Selector
>>> sel = Selector(text=txt)
>>> sel.xpath('//header/following-sibling::*[not(self::header)]').extract()
[u'<ul class="publ-list">...</ul>', u'<ul class="publ-list">...</ul>', u'<ul class="publ-list">...</ul>', u'<p>...</p>', u'<ul class="publ-list">...</ul>', u'<ul class="publ-list">...</ul>', u'<ul class="publ-list">...</ul>']

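So with //header/following-sibling::*[not(self::header)] we select every sibling that follows a header, excluding the header elements themselves. Note that the result is one flat list; it is not grouped per header.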

