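python - How do I get specific "font-size" values while ignoring others with BeautifulSoup in Python?
Question
I'm currently scraping a website and need to get certain font sizes. For example, from style="font-size: 140%" I want to get 140%, or preferably just 140, so that I can use it in some calculations, since each tag will have a different font size.
More specifically, I want to get the font sizes from tags like these: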
<div style="font-size: 141%; line-height: 110%"><a href="engenremap-latinrock.html" style="color: #AF7E1C">latin rock</a></div>
<div style="font-size: 139%; line-height: 110%"><a href="engenremap-mexicanindie.html" style="color: #B18230">mexican indie</a></div>
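Normally I could do this without any problem. However, the content I want to scrape has no distinguishing tags, and it is preceded by a bunch of rows made up of tags that also carry font sizes, like this: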
<tr valign=top class="datarow firstrow" style="white-space: nowrap"><td align=right class=note style="font-size: 20px; line-height: 24px">1</td><td style="font-size: 20px; line-height: 24px"> <a href="spotify:playlist:0DsV1U8e3xXsmsDSaW88XT" class=note target=spotify title="See this playlist in Spotify.">☊</a></td><td class=note style="font-size: 20px; line-height: 24px"><a href="?scope=MX&vector=activity" title="Show only schools from Mexico." style="color: #BA890D">Mexico</a></td><td class=note style="font-size: 20px; line-height: 24px"><a href="?root=Universidad%20Nacional%20Aut%C3%B3noma%20De%20M%C3%A9xico%20%28UNAM%29&scope=all" title="Re-sort the list by similarity to Universidad Nacional Autónoma De México (UNAM)." style="color: #BA890D">Universidad Nacional Autónoma De México (UNAM)</a></td></tr>
<tr valign=top class="datarow " style="white-space: nowrap"><td align=right class=note style="font-size: 20px; line-height: 24px">2</td><td style="font-size: 20px; line-height: 24px"> <a href="spotify:playlist:5QAomgXhxwYjg975DWtTTv" class=note target=spotify title="See this playlist in Spotify.">☊</a></td><td class=note style="font-size: 20px; line-height: 24px"><a href="?scope=US&vector=activity" title="Show only schools from USA." style="color: #948F04">USA</a></td><td class=note style="font-size: 20px; line-height: 24px"><a href="?root=Texas%20A%20%26%20M%20University-College%20Station&scope=all" title="Re-sort the list by similarity to Texas A & M University-College Station." style="color: #948F04">Texas A & M University-College Station</a></td></tr>
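Keep in mind that I am already iterating over the links in the previous snippet (they are static and stay consistent with where I am iterating), and that the tags in the first snippet (a school's genres) change for each link (school). How do I ignore the font sizes in the tr tags and get only the font sizes from the first HTML snippet? I'm sure there is a simple solution, but I would appreciate any help.
**I am already iterating over the links and getting each school's genres; I just need to also get the font sizes for those specific genres.**
Here is some of my code for more context: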
import re
import time
import urllib.request

import requests
from bs4 import BeautifulSoup

data = []             # used to sort between country and university <td> tags
links = []            # stores links from clicking on the university name and used to get genres
countries = []        #
universities = []     # indices match for these lists
spotifyLinks = []     #
fontSizes = []        #
genres = [[]]         #
genres_weight = [[]]  #

page = requests.get("http://everynoise.com/everyschool.cgi")  # stores response from Every Noise into page
soup = BeautifulSoup(page.content, 'html.parser')    # used to create data list
soup1 = BeautifulSoup(page.content, 'html.parser')   # used to create links and spotifyLinks lists
soupList = list(soup.find_all('td', class_="note"))  # creates list of <td> tags where class="note"

for td in soupList:
    if not td.get_text().isnumeric():  # stores all country and university names in data list
        data.append(td.get_text())

for i in range(len(data)):
    if i % 2 == 0:                    # separates data list into two individual lists
        countries.append(data[i])     # for country and university names respectively
    else:
        universities.append(data[i])

for a in soup1.find_all('a', attrs={'href': re.compile(r"\?root=")}):
    links.append('http://everynoise.com/everyschool.cgi' + a['href'])

for a in soup1.find_all('a', attrs={'href': re.compile("spotify:playlist:")}):
    spotifyLinks.append('https://open.spotify.com/playlist/' + a['href'][17:])
spotifyLinks = spotifyLinks[:-1]

linkSubset = links[0:4]  # subset of links for quicker testing
j = 1
for link in linkSubset:  # switch out linkSubset with links for full dataset
    time.sleep(1)        # so we don't spam their servers
    schoolGenres = []
    nextPage = urllib.request.urlopen(link)
    bs_obj = BeautifulSoup(nextPage, "html.parser")
    for a in bs_obj.find_all('a', attrs={'href': re.compile("^engenremap-")}):
        schoolGenres.append(a.get_text())
    genres.append(schoolGenres)
    print("Scraping...", j)
    j = j + 1

genres = genres[1:]  # drop the empty placeholder list
distinct_genres = set()
for genre in genres:
    distinct_genres.update(genre)
print("\nDistinct Genres:", distinct_genres)
Edit/Answer: I ended up solving this with a slightly modified version of the accepted answer.
pattern = re.compile(r'font-size: (\d+)')
for a in bs_obj.select('div[style*="font-size"]'):
    genreWeights.append(int(pattern.search(str(a)).group(1)))
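The loop above collects only the numbers. As a minimal sketch of how each genre name could be kept together with its weight, the following selects `div` tags only, so the `td` cells with `font-size: 20px` are never visited. The function name `genre_weights` and the `sample` string are assumptions made for illustration; they are not part of the original code.

```python
import re
from bs4 import BeautifulSoup

FONT_SIZE = re.compile(r"font-size:\s*(\d+)")


def genre_weights(html):
    """Return (genre, font_size) pairs taken from genre <div> tags only."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    # The selector matches <div> elements, so <td style="font-size: 20px"> is skipped.
    for div in soup.select('div[style*="font-size"]'):
        link = div.find("a", href=re.compile(r"^engenremap-"))
        match = FONT_SIZE.search(div.get("style", ""))
        if link is None or match is None:
            continue  # not a genre entry, or no numeric font size
        pairs.append((link.get_text(), int(match.group(1))))
    return pairs


# Hypothetical sample: one table cell to ignore, followed by two genre entries.
sample = """<table><tr><td class=note style="font-size: 20px; line-height: 24px">1</td></tr></table>
<div style="font-size: 141%; line-height: 110%"><a href="engenremap-latinrock.html" style="color: #AF7E1C">latin rock</a></div>
<div style="font-size: 139%; line-height: 110%"><a href="engenremap-mexicanindie.html" style="color: #B18230">mexican indie</a></div>"""

print(genre_weights(sample))
# [('latin rock', 141), ('mexican indie', 139)]
```

Inside the scraping loop, a call such as `genre_weights(str(bs_obj))` would give one list of pairs per school, under the assumption that every genre entry on the page follows the `div` plus `engenremap-` link pattern shown in the question.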
Solution
You can search for tags that contain specific text and then extract the value of font-size. For example:
import re
from bs4 import BeautifulSoup

txt = """<div style="font-size: 141%; line-height: 110%"><a href="engenremap-latinrock.html" style="color: #AF7E1C">latin rock</a></div>
<div style="font-size: 139%; line-height: 110%"><a href="engenremap-mexicanindie.html" style="color: #B18230">mexican indie</a></div>
<div style="font-size: 139%; line-height: 110%"><a href="engenremap-mexicanindie.html" style="color: #B18230">hello world</a></div>"""

soup = BeautifulSoup(txt, "html.parser")
pattern = re.compile(r'font-size: (\d+)')

# Select only <div> tags whose text contains "latin" or "mexican".
# Note: newer soupsieve releases deprecate :contains() in favor of :-soup-contains().
for tag in soup.select("div:contains(latin, mexican)"):
    font_size = pattern.search(str(tag)).group(1)
    print(font_size)
Output:
141
139
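The regular expression above is applied to `str(tag)`, which is the whole serialized tag including its children. A possible alternative, sketched here under the assumption that the inline style is a plain `name: value; name: value` list, is to parse only the tag's own `style` attribute (for example `tag["style"]`). The helper names `parse_style` and `numeric_font_size` are hypothetical.

```python
import re


def parse_style(style):
    """Split an inline style string into a {property: value} dict."""
    declarations = {}
    for part in style.split(";"):
        if ":" not in part:
            continue  # skip empty or malformed pieces
        name, value = part.split(":", 1)
        declarations[name.strip().lower()] = value.strip()
    return declarations


def numeric_font_size(style):
    """Return the leading integer of the font-size value, or None if absent."""
    value = parse_style(style).get("font-size")
    if value is None:
        return None
    match = re.match(r"\d+", value)
    return int(match.group()) if match else None


print(parse_style("font-size: 141%; line-height: 110%"))
# {'font-size': '141%', 'line-height': '110%'}
print(numeric_font_size("font-size: 141%; line-height: 110%"))
# 141
print(numeric_font_size("color: #AF7E1C"))
# None
```

Because only the attribute of the selected tag is inspected, a `font-size` declared on a nested child tag would not be picked up by accident.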