Beautiful Soup - Extracting the value outside the quotes in a div class

Problem description

I am having trouble extracting a specific value from elements that I scrape from a website with the following code:

from bs4 import BeautifulSoup
import pandas as pd  # needed for pd.DataFrame below
import requests

# Get mills and estates information from dashboard
url = 'http://nestetraceabilitydashboard.com/nestes-palm-oil-dashboard' 
page = requests.get(url).text
soup = BeautifulSoup(page, "html.parser")

divList = soup.findAll('div', attrs={"class" : "map-item estate-map-item"})
data = {}
for div in divList:
    for k,v in div.attrs.items(): 
        if k != 'class':  # ('class') is a plain string, not a tuple, so compare directly
            data[k] = data.get(k, []) + [v]

df = pd.DataFrame(data)

An excerpt of divList follows:

[<div class="map-item estate-map-item" data-country="Indonesia" data-latitude="1.926944000" data-location="Riau" data-longitude="99.906390000" data-mills="Aek Nabara" id="map_item_5600">(Aek Nabara) - Aek Nabara</div>,
 <div class="map-item estate-map-item" data-country="Indonesia" data-latitude="0.429444444" data-location="Riau" data-longitude="101.818611100" data-mills="Buatan I " id="map_item_5601">(Buatan I/II ) - Buatan</div>,

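However, the resulting dict and dataframe drop everything that comes after the id map_item_XXXX.

How would I get only the value outside the quotes into my dict, and then put that value into the dataframe's id column, e.g. (Aek Nabara) - Aek Nabara for the first item of divList above?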

Tags: python, beautifulsoup, html-parsing

Solution


(Aek Nabara) - Aek Nabara is not an attribute (.attrs) but the element's text content; use .text to get the value.
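To make the distinction concrete, here is a minimal sketch that parses the first div from the excerpt above as an inline string (no network request) and reads both the attribute and the text. The variable names `html`, `attr_id` and `label` are only illustrative.

```python
from bs4 import BeautifulSoup

# Markup copied from the first item of the divList excerpt in the question.
html = (
    '<div class="map-item estate-map-item" data-country="Indonesia" '
    'data-latitude="1.926944000" data-location="Riau" '
    'data-longitude="99.906390000" data-mills="Aek Nabara" '
    'id="map_item_5600">(Aek Nabara) - Aek Nabara</div>'
)

div = BeautifulSoup(html, "html.parser").div

# Everything written as name="value" inside the opening tag lives in .attrs.
attr_id = div.attrs["id"]

# The text between <div ...> and </div> is not an attribute; read it as text.
label = div.get_text(strip=True)

print(attr_id)  # map_item_5600
print(label)    # (Aek Nabara) - Aek Nabara
```

Here `get_text(strip=True)` is used as an equivalent of `.text.strip()` for an element that contains a single string.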

data = {}  # start from an empty dict so earlier runs do not leave stale values
for div in divList:
    for k,v in div.attrs.items(): 
        if k != 'class':
            if k == 'id':
                # insert "(Aek Nabara) - Aek Nabara" instead of "map_item_5600"
                data[k] = data.get(k, []) + [div.text.strip()]
            else:
                data[k] = data.get(k, []) + [v]

df = pd.DataFrame(data)
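As a possible variation (not part of the original answer), the rows can be collected as one dict per div, keeping the original id and storing the text under a separate key. This is a sketch under assumptions: the helper name `divs_to_rows` and the column name `label` are hypothetical, and the sample markup is the two-item excerpt from the question rather than the live page.

```python
from bs4 import BeautifulSoup


def divs_to_rows(html):
    """Return one dict per estate div: its attributes (minus class) plus its text."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # class_ matches any div whose class list contains "estate-map-item".
    for div in soup.find_all("div", class_="estate-map-item"):
        row = {k: v for k, v in div.attrs.items() if k != "class"}
        # Hypothetical extra column: keep the id and add the visible text beside it.
        row["label"] = div.get_text(strip=True)
        rows.append(row)
    return rows


# Sample markup taken from the divList excerpt in the question.
sample_html = """
<div class="map-item estate-map-item" data-country="Indonesia" data-latitude="1.926944000" data-location="Riau" data-longitude="99.906390000" data-mills="Aek Nabara" id="map_item_5600">(Aek Nabara) - Aek Nabara</div>
<div class="map-item estate-map-item" data-country="Indonesia" data-latitude="0.429444444" data-location="Riau" data-longitude="101.818611100" data-mills="Buatan I " id="map_item_5601">(Buatan I/II ) - Buatan</div>
"""

rows = divs_to_rows(sample_html)
for row in rows:
    print(row["id"], "->", row["label"])
```

Passing `rows` to `pd.DataFrame(rows)` should then give one row per div. Building whole rows this way also avoids columns of different lengths when a div lacks one of the attributes, since pandas fills the missing keys with NaN.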
