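python - How to find the next instance of text in HTML using Beautiful Soup?
Question
I'm writing a program that looks up the current day's national food holiday using this website: https://foodimentary.com/today-in-national-food-holidays/may-holidays/.
So far I can consistently get the tag containing the current date, but I'm having trouble using it as a base reference for getting the associated food day. Here's what I have so far: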
from datetime import date

import requests
from bs4 import BeautifulSoup

today = date.today()
month = today.strftime('%b')  # Abbreviated month name, e.g. 'May'
day = today.strftime('%d')  # Zero-padded day of the month, e.g. '29'
date_slug = f'{month.lower()}-{day}'  # Format date, e.g. 'may-29' (renamed so it no longer shadows datetime.date)
# Get HTML from home page
url = 'https://foodimentary.com/today-in-national-food-holidays/todayinfoodhistorycalenderfoodnjanuary/'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')  # Parse HTML with Beautiful Soup
# Get the current month URL
months = soup.find('ul', id='menu-months', class_='menu')  # Isolate the months table
monthUrl = months.find('a', href=True, string=month)['href']  # Get the month URL for the current month
# Get HTML from month page, parse
r = requests.get(monthUrl)
soup = BeautifulSoup(r.text, 'html.parser')
# Find tag with URL that contains formatted date
holidayTag = soup.select_one(f'a[href*="{date_slug}"]')
print(holidayTag)
# TODO: Get the name of the food day based on holidayTag
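Using my browser's developer console, the most consistent pattern relating the date to the food holiday name seems to be that the holiday is always the next instance of text after the date tag. Here's a sample of the HTML: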
<div style="text-align:center;">
<strong><a title="May 29" href="https://foodimentaryguy.wordpress.com/2011/05/29/may-29/">May 29</a></strong><br>
<span style="color:#000000;"><a style="color:#000000;" href="https://foodimentary.com/2017/02/12/february-12th-is-national-biscotti-day/">National Biscuit Day</a></span>
<div style="text-align:center;"><strong><a title="May 28" href="https://foodimentaryguy.wordpress.com/2011/05/28/may-28/">May 28</a></strong><br>
<span style="color:#000000;"><a style="color:#000000;" href="https://foodimentary.com/2016/05/28/may-28-is-national-brisket-day/">National Brisket Day</a></span>
</div>
</div>
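My question is: how can I use Beautiful Soup to get the holiday name from the date tag?
Answer
The text is quite unstructured (most likely written by hand rather than machine-generated). I suggest using the re module for the main parsing: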
import re
from pprint import pprint

import requests
from bs4 import BeautifulSoup

url = 'https://foodimentary.com/today-in-national-food-holidays/may-holidays/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
txt = soup.select_one('section[role="main"]').text
out = {}
# Each entry is a date line (e.g. "May 29") followed by one or more holiday lines and a blank line
for day, names in re.findall(r'^([A-Z][^\n]+\d\s*)$(.*?)\n\n', txt, flags=re.DOTALL|re.M):
    out[day.strip()] = [name.replace('\xa0', ' ') for name in names.strip().split('\n')]
# pretty print on screen:
pprint(out)
Prints:
{'May 1': ['National Chocolate Parfait Day'],
'May 10': ['National Liver and Onions Day'],
'May 11': ['National “Eat What You Want” Day'],
'May 12': ['National Nutty Fudge Day'],
'May 13': ['National Apple Pie Day',
'National Fruit Cocktail Day',
'National Hummus Day'],
'May 14': ['National Brioche Day', 'National Buttermilk Biscuit Day'],
'May 15': ['National Chocolate Chip Day'],
'May 16': ['National Barbecue Day'],
'May 17': ['National Cherry Cobbler Day'],
'May 18': ['National Cheese Souffle Day', 'I love Reese’s Day'],
'May 19': ['National Devil’s Food Cake Day'],
'May 2': ['National Chocolate Truffle Day'],
'May 20': ['National Quiche Lorraine Day', 'National Pick Strawberries Day'],
'May 21': ['National Strawberries and Cream Day'],
'May 22': ['National Vanilla Pudding Day'],
'May 23': ['National Taffy Day'],
'May 24': ['National Escargot Day'],
'May 25': ['National Brown-Bag-It Day', 'National Wine Day'],
'May 26': ['National Blueberry Cheesecake Day', 'National Cherry Dessert Day'],
'May 27': ['National Italian Beef Day', 'National Grape Popsicle Day'],
'May 28': ['National Brisket Day'],
'May 29': ['National Biscuit Day'],
'May 3': ['National Raspberry Popover Day',
'National Raspberry Tart Day',
'National Chocolate Custard Day'],
'May 30': ['National Mint Julep Day'],
'May 31': ['National Macaroon Day'],
'May 4': ['National Candied Orange Peel Day',
'National Homebrew Day',
'National Hoagie Day'],
'May 5': ['National Enchilada Day – Happy Cinco de Mayo!'],
'May 6': ['National Crepe Suzette Day'],
'May 7': ['National Roast Leg of Lamb Day'],
'May 8': ['National Coconut Cream Pie Day'],
'May 9': ['National Shrimp Day', 'National Foodies Day*']}
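If you would rather stay close to the pattern described in the question (the holiday is the next piece of text after the date tag), the following is a minimal sketch of that idea. It runs against the HTML sample from the question rather than the live page, so it assumes the real markup follows the same shape. The helper name holiday_after and its date_slug parameter are made up for illustration; they are not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Sample markup copied from the question (two days of the May page).
html = """
<div style="text-align:center;">
<strong><a title="May 29" href="https://foodimentaryguy.wordpress.com/2011/05/29/may-29/">May 29</a></strong><br>
<span style="color:#000000;"><a style="color:#000000;" href="https://foodimentary.com/2017/02/12/february-12th-is-national-biscotti-day/">National Biscuit Day</a></span>
<div style="text-align:center;"><strong><a title="May 28" href="https://foodimentaryguy.wordpress.com/2011/05/28/may-28/">May 28</a></strong><br>
<span style="color:#000000;"><a style="color:#000000;" href="https://foodimentary.com/2016/05/28/may-28-is-national-brisket-day/">National Brisket Day</a></span>
</div>
</div>
"""


def holiday_after(soup, date_slug):
    """Hypothetical helper: return the first non-blank text after the date link, or None."""
    # First link whose href contains the slug, e.g. "may-29".
    date_tag = soup.select_one(f'a[href*="{date_slug}"]')
    if date_tag is None:
        return None
    # next_elements walks forward in document order, starting with the
    # date link's own text, so that string has to be skipped explicitly.
    for node in date_tag.next_elements:
        if not isinstance(node, NavigableString):
            continue  # skip tags such as <br> and <span>
        if node.parent is date_tag:
            continue  # skip the date text itself ("May 29")
        text = node.strip()
        if text:
            return text
    return None


soup = BeautifulSoup(html, "html.parser")
print(holiday_after(soup, "may-29"))
print(holiday_after(soup, "may-28"))
```

This only returns the first holiday for a given date, so days that list several holidays would need the loop to keep collecting text until it reaches the next date link. Also note that select_one picks the first matching link in document order, which assumes the date link appears before any holiday link whose URL happens to contain the same slug.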