python - Scraping with specific criteria when similar classes used in html source
Problem description
I am trying to scrape the 8 instances of x between the td tags in the following markup:
<th class="first"> Temperature </th>
<td> x </td> # repeated for 8 lines
However, there are numerous <th class="first"> elements on the page.
The only unique identifier is the text inside the th element, in this example Temperature.
I am not sure what to add to the code below so that it only matches the <th class="first">
whose text is Temperature (other rows contain other strings):
for tag in soup.find_all("th", {"class":"first"}):
    temps.append(tag.text)
Is it a matter of additional code (re.compile?) or should I use something else entirely?
Edit: HTML of interest below
<tbody>
<tr> <th class="first">Temperature</th> <td>x</td> <td>x</td> <td>x</td> <td>x</td> <td>x</td> <td>x</td> <td>x</td> <td>x</td> </tr>
Edit: current code
from bs4 import BeautifulSoup as bs
from selenium import webdriver
driver = webdriver.Firefox(executable_path=r'c:\program files\firefox\geckodriver.exe')
driver.get("http://www.bom.gov.au/places/nsw/sydney/forecast/detailed/")
html = driver.page_source
soup = bs(html, "lxml")
dates = []
for tag in soup.find_all("a", {"class":"toggle"}):
    dates.append(tag.text)
temps = [item.text for item in soup.select('th.first:contains(Temperature) ~ td')]
print(dates)
print(temps)
Solution
If I understand correctly, try this:
from bs4 import BeautifulSoup
import re
s = '''
<tr>
<th class="first">Temperature</th>
<td>x</td>
<td>x</td>
<td>x</td>
<td>x</td>
<td>x</td>
<td>x</td>
<td>x</td>
<td>x</td>
</tr>
'''
soup = BeautifulSoup(s, "lxml")
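# Find the th whose text matches, then collect the td cells that follow it in the same row
temps = [td.text for td in soup.find('th', string=re.compile("Temperature")).find_next_siblings('td')]
print(temps)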
and you get:
['x', 'x', 'x', 'x', 'x', 'x', 'x', 'x']
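Since the page has several <th class="first"> rows, the same idea can be extended to collect every row at once, keyed by its header text. The following is only a sketch under assumptions: the sample markup, the Humidity row, the cell values, and the helper name rows_by_label are made up for illustration and are not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: two rows that share the same th class.
html = """
<table><tbody>
<tr><th class="first">Temperature</th><td>21</td><td>23</td><td>19</td></tr>
<tr><th class="first">Humidity</th><td>60</td><td>55</td><td>70</td></tr>
</tbody></table>
"""

def rows_by_label(markup):
    """Map each th.first label to the text of the td cells that follow it."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = {}
    for th in soup.find_all("th", class_="first"):
        label = th.get_text(strip=True)  # strip() tolerates " Temperature "
        rows[label] = [td.get_text(strip=True)
                       for td in th.find_next_siblings("td")]
    return rows

rows = rows_by_label(html)
print(rows["Temperature"])
```

Note that if the same label appears in more than one table on the page, later rows overwrite earlier ones in this dictionary. In that case it may be better to run the loop once per table.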