Trying to get direct children, but getting all children with BeautifulSoup

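Problem description

I am trying to build a dictionary of categories, specifically the Food categories from this URL. When I run the code below, it prints duplicate li items.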

from bs4 import BeautifulSoup as bs
import requests

url = 'https://developer.foursquare.com/docs/build-with-foursquare/categories/'
req = requests.get(url)
soup = bs(req.text, 'html.parser')

food_categories = soup.select('div.documentTemplate__Content-sc-5mpekp-0 > ul > li:nth-child(4)')[0]

for tagli in food_categories.find_all("li"):
    print(tagli.find('h3').text)
    for another_tagli in tagli.find_all('ul'):
        for some_tagli in another_tagli.find_all('li'):
            print(some_tagli.find('h3').text)
            for one_tagli in some_tagli.find_all('ul'):
                for aon_tagli in one_tagli.find_all('li'):
                    print(aon_tagli.find('h3').text)

Following many Stack Overflow posts, I tried passing recursive=False to get only the direct children, but when I use it I get nothing.

I am looking for output like this:

{
  'food': {
    'Afghan Restaurant': [],
    'African Restaurant': ['Ethiopian Restaurant'],
    'Asian Restaurant': {
      'Chinese Restaurant': ['Anhui Restaurant', 'Beijing Restaurant']
    }
  }
}

Please point me in the right direction.

Tags: python, web-scraping, beautifulsoup

Solution


This script builds a tree from the Food subcategories:

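import json

import requests
from bs4 import BeautifulSoup


def parse_tree(t):
    """Recursively build a nested dict from the <li> items directly under t."""
    dct = {}
    if t is None:  # no nested <ul>: this category has no subcategories
        return dct

    # recursive=False restricts the search to direct children of this <ul>
    for li in t.find_all('li', recursive=False):
        dct[li.find_next('h3').text] = parse_tree(li.select_one('ul'))

    return dct


url = 'https://developer.foursquare.com/docs/build-with-foursquare/categories/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
# Recent soupsieve releases deprecate :contains in favor of :-soup-contains;
# on older releases, use 'h3:contains("Food") ~ ul' instead.
root = soup.select_one('h3:-soup-contains("Food") ~ ul')

tree = parse_tree(root)

# pretty print the tree:
print(json.dumps(tree, indent=4))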

Prints:

{
    "Afghan Restaurant": {},
    "African Restaurant": {
        "Ethiopian Restaurant": {}
    },
    "American Restaurant": {
        "New American Restaurant": {}
    },
    "Asian Restaurant": {
        "Burmese Restaurant": {},
        "Cambodian Restaurant": {},
        "Chinese Restaurant": {
            "Anhui Restaurant": {},
            "Beijing Restaurant": {},
            "Cantonese Restaurant": {},
            "Cha Chaan Teng": {},
            "Chinese Aristocrat Restaurant": {},

    ... and so on.
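
To see why recursive=False seemed to return nothing in the question, it may help to try the same idea offline. The sketch below uses a small made-up HTML snippet (the markup, the id "food", and the variable names are assumptions for illustration, not the real Foursquare page). It compares find_all("li") with find_all("li", recursive=False), and shows that calling the latter on an li finds nothing, because the nested li items are children of the inner ul, not of the li itself.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the nesting described above;
# the structure of the real page may differ.
HTML = """
<ul id="food">
  <li><h3>Afghan Restaurant</h3><ul></ul></li>
  <li><h3>African Restaurant</h3>
    <ul>
      <li><h3>Ethiopian Restaurant</h3></li>
    </ul>
  </li>
</ul>
"""


def parse_tree(ul):
    """Build a nested dict from a <ul>, visiting only its direct <li> children."""
    tree = {}
    if ul is None:  # an <li> without a nested <ul> is a leaf
        return tree
    for li in ul.find_all("li", recursive=False):
        name = li.find("h3").get_text(strip=True)
        # Look for the nested <ul> among the direct children of this <li> only
        tree[name] = parse_tree(li.find("ul", recursive=False))
    return tree


soup = BeautifulSoup(HTML, "html.parser")
root = soup.find("ul", id="food")

# Default (recursive) search: every <li> at any depth, hence the duplicates
all_items = [li.h3.get_text(strip=True) for li in root.find_all("li")]

# Direct children of the <ul> only
direct_items = [
    li.h3.get_text(strip=True) for li in root.find_all("li", recursive=False)
]

# Called on an <li>, recursive=False sees only <h3> and <ul>, so no <li> matches
african_li = root.find_all("li", recursive=False)[1]
li_under_li = len(african_li.find_all("li", recursive=False))

tree = parse_tree(root)

print(all_items)
print(direct_items)
print(li_under_li)
print(tree)
```

In this sketch the recursion alternates between the two levels: find_all("li", recursive=False) is called on a ul, and find("ul", recursive=False) is called on each li. Applying recursive=False to the wrong level is one likely reason for an empty result.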
