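python - Trying to get only direct children, but getting all descendants with BeautifulSoup
Problem description
I am trying to build a dictionary of categories, specifically the Food categories, from this URL. When I run the code below, it prints duplicate li items.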
from bs4 import BeautifulSoup as bs
import requests

url = 'https://developer.foursquare.com/docs/build-with-foursquare/categories/'
req = requests.get(url)
soup = bs(req.text)
food_categories = soup.select('div.documentTemplate__Content-sc-5mpekp-0 > ul > li:nth-child(4)')[0]
for tagli in food_categories.find_all("li"):
    print(tagli.find('h3').text)
    for another_tagli in tagli.find_all('ul'):
        for some_tagli in another_tagli.find_all('li'):
            print(some_tagli.find('h3').text)
            for one_tagli in some_tagli.find_all('ul'):
                for aon_tagli in one_tagli.find_all('li'):
                    print(aon_tagli.find('h3').text)
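Following several Stack Overflow posts, I tried passing recursive=False to get only the direct children, but with that argument I get nothing back at all.
I am looking for output like this: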
{
    'food': {
        'Afghan Restaurant': [],
        'African Restaurant': ['Ethiopian Restaurant'],
        'Asian Restaurant': {
            'Chinese Restaurant': ['Anhui Restaurant', 'Beijing Restaurant']
        }
    }
}
Please point me in the right direction.
Solution
This script builds a tree from the Food subcategories:
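import json

import requests
from bs4 import BeautifulSoup

def parse_tree(t):
    """Recursively map each category name to a dict of its subcategories."""
    dct = {}
    if t is None:  # guard: an <li> without a nested <ul> has no subcategories
        return dct
    # recursive=False keeps only the direct <li> children of this <ul>
    for li in t.find_all('li', recursive=False):
        dct[li.find_next('h3').text] = parse_tree(li.select_one('ul'))
    return dct

url = 'https://developer.foursquare.com/docs/build-with-foursquare/categories/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
# first <ul> sibling that follows the <h3> containing "Food"
# (newer soupsieve releases spell this pseudo-class ':-soup-contains')
root = soup.select_one('h3:contains("Food") ~ ul')
tree = parse_tree(root)

# pretty print the tree:
print(json.dumps(tree, indent=4))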
Prints:
{
    "Afghan Restaurant": {},
    "African Restaurant": {
        "Ethiopian Restaurant": {}
    },
    "American Restaurant": {
        "New American Restaurant": {}
    },
    "Asian Restaurant": {
        "Burmese Restaurant": {},
        "Cambodian Restaurant": {},
        "Chinese Restaurant": {
            "Anhui Restaurant": {},
            "Beijing Restaurant": {},
            "Cantonese Restaurant": {},
            "Cha Chaan Teng": {},
            "Chinese Aristocrat Restaurant": {},
... and so on.
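The tree above uses an empty dict for every leaf, whereas the question asked for lists of names at the lowest level. A minimal sketch of one way to convert between the two shapes follows; collapse_leaves and the small sample dict are hypothetical names introduced here for illustration, not part of the original answer, and the sample stands in for the real scraped tree so the snippet runs without network access.

```python
def collapse_leaves(tree):
    """Convert a nested dict of categories into the shape from the question.

    A node whose children all have no children of their own becomes a list
    of child names; any other node stays a dict and is converted recursively.
    """
    if all(not children for children in tree.values()):
        # Also covers the empty dict, which becomes an empty list.
        return list(tree)
    return {name: collapse_leaves(children) for name, children in tree.items()}


# Hypothetical sample in the same shape that parse_tree produces.
sample = {
    "Afghan Restaurant": {},
    "African Restaurant": {"Ethiopian Restaurant": {}},
    "Asian Restaurant": {
        "Chinese Restaurant": {
            "Anhui Restaurant": {},
            "Beijing Restaurant": {},
        },
    },
}

result = {"food": collapse_leaves(sample)}
print(result)
```

Note that a mixed level stays a dict under this rule: a category with no subcategories that sits next to one with subcategories maps to an empty list rather than being merged into a list of names.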