
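Problem description

I am using BeautifulSoup to scrape a website. I can get all the data in the <li class="level-item"> tags, but I also need the date from the <h2> tag that corresponds to each <li> tag.

Desired output: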

05182018,/somedirectoryname/anothername/009,sometext,another value,long description 
05182018,/somedirectoryname/anothername/008,sometext,another value,long description 
03092018,/somedirectoryname/anothername/007,sometext,another value,long description 
03092018,/somedirectoryname/anothername/006,sometext,another value,long description 
03092018,/somedirectoryname/anothername/005,sometext,another value,long description 
03092018,/somedirectoryname/anothername/004,sometext,another value,long description 

Page structure:

<h2>May 18, 2018</h2>
<ul>
<li class="level-item"><a href="/somedirectoryname/anothername/009"><span class="some text">another value</span> long description </a></li>
<li class="level-item"><a href="/somedirectoryname/anothername/008"><span class="some text">another value</span> long description </a></li>
</ul>

<h2>March 9, 2018</h2>
<ul>
<li class="level-item"><a href="/somedirectoryname/anothername/007"><span class="some text">another value</span> long description </a></li>
<li class="level-item"><a href="/somedirectoryname/anothername/006"><span class="some text">another value</span> long description </a></li>
<li class="level-item"><a href="/somedirectoryname/anothername/005"><span class="some text">another value</span> long description </a></li>
<li class="level-item"><a href="/somedirectoryname/anothername/004"><span class="some text">another value</span> long description </a></li>
</ul>

<h2>December 1, 2017</h2>
<ul>
<li class="level-item"><a href="/somedirectoryname/anothername/003"><span class="some text">another value</span> long description </a></li>
<li class="level-item"><a href="/somedirectoryname/anothername/002"><span class="some text">another value</span> long description </a></li>
<li class="level-item"><a href="/somedirectoryname/anothername/001"><span class="some text">another value</span> long description </a></li>
</ul>

My code snippet: I only need the date that sits directly above the <ul> tag each <li> tag belongs to.

date = results_table.find_all('h2', string=re.compile('January|February|March|April|May|June|July|August|September|October|November|December'))
locale.setlocale(locale.LC_ALL, 'en_US')
changeDateFormat = date.text.strip()
datePublished = datetime.datetime.strptime(changeDateFormat, '%B %d, %Y').strftime('%m%d%Y')
ul = results_table.find('ul')

for item in results_table.find_all('li', {'class': 'level-item'}):
    # try to obtain the correct date
    print(ul.previous_element)
    for nextLink in item.find_all('a'):
        for ad_id in nextLink.find_all('span'):
            print(ad_id.text.strip())

Tags: python, beautifulsoup

Solution


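After finding all the <h2> tags the way you already do, you can get the corresponding <ul> tag with find_next() or .next_sibling. Then simply iterate over all the <li> tags inside it.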
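
The difference between the two navigation options can be sketched as follows. This is only an illustration on a cut-down copy of the page structure: the `html` string and the variable names are assumptions made for the example, and `html.parser` is chosen arbitrarily. When there is whitespace between the tags, `.next_sibling` usually returns the newline text node rather than the <ul>, while `find_next('ul')` and `find_next_sibling('ul')` go straight to the tag.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Assumed sample markup, reduced from the page structure in the question.
html = """<h2>May 18, 2018</h2>
<ul>
<li class="level-item"><a href="/somedirectoryname/anothername/009"><span class="some text">another value</span> long description </a></li>
</ul>"""

soup = BeautifulSoup(html, "html.parser")
h2 = soup.find("h2")

# .next_sibling is the very next node, which here is the newline after </h2>.
raw_sibling = h2.next_sibling
print(type(raw_sibling).__name__, repr(raw_sibling))

# Both calls below search forward for a <ul> tag and skip text nodes.
ul_by_find_next = h2.find_next("ul")
ul_by_sibling = h2.find_next_sibling("ul")
print(ul_by_find_next.name, ul_by_sibling.name)
```

Note that `find_next()` looks at everything that follows in document order, not only at siblings, so if a heading had no list of its own it could pick up the list of a later heading. `find_next_sibling('ul')` restricts the search to the same nesting level.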

Code:

for date_tag in results_table.find_all('h2'):
    date = date_tag.text
    for item in date_tag.find_next('ul').find_all('li'):
        print(date, item.a['href'], item.span['class'][0], item.get_text(',', strip=True), sep=',')

Output:

May 18, 2018,/somedirectoryname/anothername/009,some,another value,long description
May 18, 2018,/somedirectoryname/anothername/008,some,another value,long description
March 9, 2018,/somedirectoryname/anothername/007,some,another value,long description
March 9, 2018,/somedirectoryname/anothername/006,some,another value,long description
March 9, 2018,/somedirectoryname/anothername/005,some,another value,long description
March 9, 2018,/somedirectoryname/anothername/004,some,another value,long description
December 1, 2017,/somedirectoryname/anothername/003,some,another value,long description
December 1, 2017,/somedirectoryname/anothername/002,some,another value,long description
December 1, 2017,/somedirectoryname/anothername/001,some,another value,long description
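
The output above keeps the date as it appears in the heading, while the question asks for the form 05182018. One possible way to convert it, reusing the `strptime`/`strftime` format strings from the question's own snippet, is sketched below. The helper name `to_mmddyyyy` is hypothetical, and `%B` depends on the active locale, so this assumes English month names are in effect.

```python
from datetime import datetime


def to_mmddyyyy(date_text: str) -> str:
    """Convert a heading such as 'May 18, 2018' to '05182018'.

    Hypothetical helper; assumes an English locale for the %B month name.
    """
    parsed = datetime.strptime(date_text.strip(), "%B %d, %Y")
    return parsed.strftime("%m%d%Y")


# The three headings from the page structure in the question.
for heading in ("May 18, 2018", "March 9, 2018", "December 1, 2017"):
    print(to_mmddyyyy(heading))
```

Inside the loop from the answer, this would amount to writing `date = to_mmddyyyy(date_tag.text)` instead of `date = date_tag.text`. Separately, `item.span['class']` is a list of class names, so `''.join(item.span['class'])` would give `sometext` as in the desired output, whereas `item.span['class'][0]` gives only `some`.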
