Scraping with specific criteria when similar classes are used in the HTML source

Question

I am trying to scrape the 8 instances of x between the td tags in the following:

<th class="first"> Temperature </th>
<td> x </td> # repeated for 8 lines

However, there are numerous elements on the page that are <th class="first">. The only unique identifier is the string that follows the class, in this example Temperature.

I am not sure what to add to the code below to restrict the scrape to the <th class="first"> that contains Temperature (other rows contain other strings):

for tag in soup.find_all("th", {"class":"first"}):
    temps.append(tag.text)

Is it a matter of additional code (re.compile?) or should I use something else entirely?

Edit: HTML of interest below

   <tbody>
<tr>
    <th class="first">Temperature</th>
    <td>x</td>
    <td>x</td>
    <td>x</td>
    <td>x</td>
    <td>x</td>
    <td>x</td>
    <td>x</td>
    <td>x</td>
</tr>

Edit: current code

from bs4 import BeautifulSoup as bs
from selenium import webdriver

driver = webdriver.Firefox(executable_path=r'c:\program files\firefox\geckodriver.exe')
driver.get("http://www.bom.gov.au/places/nsw/sydney/forecast/detailed/")

html = driver.page_source
soup = bs(html, "lxml")

dates = []

for tag in soup.find_all("a", {"class":"toggle"}):
    dates.append(tag.text)

temps = [item.text for item in soup.select('th.first:contains(Temperature) ~ td')]

print(dates)
print(temps)

Tags: python, web-scraping, beautifulsoup

Answer


If I understand correctly, try this:

from bs4 import BeautifulSoup
import re

s = '''
    <tr>
        <th class="first">Temperature</th>
        <td>x</td>
        <td>x</td>
        <td>x</td>
        <td>x</td>
        <td>x</td>
        <td>x</td>
        <td>x</td>
        <td>x</td>
    </tr>

'''

soup = BeautifulSoup(s, "lxml")

[td.text for td in soup.find('th', string=re.compile("Temperature")).find_next_siblings()]

and you get:

['x', 'x', 'x', 'x', 'x', 'x', 'x', 'x']
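
Since the page has several rows whose header cell is <th class="first">, the same idea can be wrapped in a small helper that picks a row by its label. The following is only a minimal sketch: the helper name row_values, the second row label "Feels like" and the numeric cell values are made up for illustration, and it uses the built-in "html.parser" instead of "lxml" so it runs without extra dependencies. It matches the label exactly rather than with a regular expression, which avoids picking up other rows that merely contain the word.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: two rows share class="first", as in the question.
# The labels and values are invented, not taken from the real page.
html = """
<table><tbody>
<tr><th class="first">Temperature</th><td>21</td><td>23</td><td>19</td></tr>
<tr><th class="first">Feels like</th><td>20</td><td>22</td><td>18</td></tr>
</tbody></table>
"""


def row_values(soup, label):
    """Return the text of each <td> following the <th class="first"> whose text equals label."""
    for th in soup.find_all("th", class_="first"):
        # Compare the stripped header text so surrounding whitespace does not matter.
        if th.get_text(strip=True) == label:
            return [td.get_text(strip=True) for td in th.find_next_siblings("td")]
    # No row with that label was found.
    return []


soup = BeautifulSoup(html, "html.parser")
temps = row_values(soup, "Temperature")
print(temps)
```

With the live page, soup would instead be built from driver.page_source as in the question, and row_values(soup, "Temperature") should return the cells of the matching row, provided the header text on the page is exactly "Temperature".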
