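python - Get
Problem description
I am using BeautifulSoup to scrape a website. I can get all the data in the <li class="level-item"> tags, but I also need the date from the <h2> tag that each <li> tag belongs to.
Desired output: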
05182018,/somedirectoryname/anothername/009,sometext,another value,long description
05182018,/somedirectoryname/anothername/008,sometext,another value,long description
03092018,/somedirectoryname/anothername/007,sometext,another value,long description
03092018,/somedirectoryname/anothername/006,sometext,another value,long description
03092018,/somedirectoryname/anothername/005,sometext,another value,long description
03092018,/somedirectoryname/anothername/004,sometext,another value,long description
Page structure:
<h2>May 18, 2018</h2>
<ul>
<li class="level-item"><a href="/somedirectoryname/anothername/009"><span class="some text">another value</span> long description </a></li>
<li class="level-item"><a href="/somedirectoryname/anothername/008"><span class="some text">another value</span> long description </a></li>
</ul>
<h2>March 9, 2018</h2>
<ul>
<li class="level-item"><a href="/somedirectoryname/anothername/007"><span class="some text">another value</span> long description </a></li>
<li class="level-item"><a href="/somedirectoryname/anothername/006"><span class="some text">another value</span> long description </a></li>
<li class="level-item"><a href="/somedirectoryname/anothername/005"><span class="some text">another value</span> long description </a></li>
<li class="level-item"><a href="/somedirectoryname/anothername/004"><span class="some text">another value</span> long description </a></li>
</ul>
<h2>December 1, 2017</h2>
<ul>
<li class="level-item"><a href="/somedirectoryname/anothername/003"><span class="some text">another value</span> long description </a></li>
<li class="level-item"><a href="/somedirectoryname/anothername/002"><span class="some text">another value</span> long description </a></li>
<li class="level-item"><a href="/somedirectoryname/anothername/001"><span class="some text">another value</span> long description </a></li>
</ul>
I only need the date from the <h2> tag directly above the <ul> tag that contains each <li> tag.
My code snippet:
date = results_table.find_all('h2', string=re.compile('January|February|March|April|May|June|July|August|September|October|November|December'))
locale.setlocale(locale.LC_ALL, 'en_US')
changeDateFormat = date.text.strip()
datePublished = datetime.datetime.strptime(changeDateFormat, '%B %d, %Y').strftime('%m%d%Y')
ul = results_table.find('ul')
for item in results_table.find_all('li', {'class': 'level-item'}):
    # try to obtain the correct date
    print(ul.previous_element)
    for nextLink in item.find_all('a'):
        for ad_id in nextLink.find_all('span'):
            print(ad_id.text.strip())
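Solution
Once you have found all the <h2> tags, as you are already doing, you can get the corresponding <ul> tag with find_next() or .next_sibling. Note that when the tags are separated by whitespace, .next_sibling returns that whitespace text node rather than the <ul>, so find_next_sibling('ul') is the safer sibling-based option. Then simply iterate over all the <li> tags inside it.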
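To make the difference between these lookups concrete, here is a minimal sketch. It assumes the `bs4` package is installed and uses a cut-down copy of the page structure from the question; the variable names are illustrative only and do not come from the original post.

```python
from bs4 import BeautifulSoup, NavigableString

# Cut-down sample based on the page structure in the question.
html = """
<h2>May 18, 2018</h2>
<ul>
<li class="level-item"><a href="/somedirectoryname/anothername/009">item</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
h2 = soup.find("h2")

# .next_sibling returns the very next node. Here that is the newline
# between </h2> and <ul>, which is a text node, not the <ul> tag.
raw_sibling = h2.next_sibling

# find_next("ul") walks forward in document order to the first <ul>.
ul_via_find_next = h2.find_next("ul")

# find_next_sibling("ul") looks only at later siblings that are <ul> tags.
ul_via_sibling = h2.find_next_sibling("ul")

print(repr(raw_sibling))
print(ul_via_find_next.a["href"])
print(ul_via_sibling.a["href"])
```

With markup like this, both find_next('ul') and find_next_sibling('ul') reach the list that follows the heading, while plain .next_sibling only works if there is no whitespace between the tags.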
Code:
for date_tag in results_table.find_all('h2'):
    date = date_tag.text
    for item in date_tag.find_next('ul').find_all('li'):
        print(date, item.a['href'], item.span['class'][0], item.get_text(',', strip=True), sep=',')
Output:
May 18, 2018,/somedirectoryname/anothername/009,some,another value,long description
May 18, 2018,/somedirectoryname/anothername/008,some,another value,long description
March 9, 2018,/somedirectoryname/anothername/007,some,another value,long description
March 9, 2018,/somedirectoryname/anothername/006,some,another value,long description
March 9, 2018,/somedirectoryname/anothername/005,some,another value,long description
March 9, 2018,/somedirectoryname/anothername/004,some,another value,long description
December 1, 2017,/somedirectoryname/anothername/003,some,another value,long description
December 1, 2017,/somedirectoryname/anothername/002,some,another value,long description
December 1, 2017,/somedirectoryname/anothername/001,some,another value,long description
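The output above keeps the date as written on the page and only the first class name of the <span>, whereas the question asked for dates like 05182018 and the value sometext. The following is a rough sketch of one way to get closer to that format. It is not part of the original answer: it assumes `bs4` is installed, that the page's dates always match '%B %d, %Y' with English month names, and that joining the <span> class names is what "sometext" was meant to be. The `extract_rows` helper and the shortened sample HTML are hypothetical.

```python
from datetime import datetime

from bs4 import BeautifulSoup

# Shortened sample based on the page structure in the question.
html = """
<h2>May 18, 2018</h2>
<ul>
<li class="level-item"><a href="/somedirectoryname/anothername/009"><span class="some text">another value</span> long description </a></li>
<li class="level-item"><a href="/somedirectoryname/anothername/008"><span class="some text">another value</span> long description </a></li>
</ul>
<h2>March 9, 2018</h2>
<ul>
<li class="level-item"><a href="/somedirectoryname/anothername/007"><span class="some text">another value</span> long description </a></li>
</ul>
"""


def extract_rows(markup):
    """Return one comma-separated row per <li>, prefixed with its heading date."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = []
    for date_tag in soup.find_all("h2"):
        # Convert e.g. "May 18, 2018" to "05182018" (assumed date format).
        raw_date = date_tag.get_text(strip=True)
        date = datetime.strptime(raw_date, "%B %d, %Y").strftime("%m%d%Y")

        # The list belonging to this heading is its next <ul> sibling.
        ul = date_tag.find_next_sibling("ul")
        if ul is None:
            continue  # heading without a list: nothing to report

        for item in ul.find_all("li", class_="level-item"):
            # Assumption: "sometext" is the class names joined together.
            css = "".join(item.span["class"])
            text = item.get_text(",", strip=True)
            rows.append(",".join([date, item.a["href"], css, text]))
    return rows


rows = extract_rows(html)
for row in rows:
    print(row)
```

Parsing with strptime relies on English month names, which is the default behaviour unless the process locale has been changed.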