Converting a website served entirely as XML into a pandas DataFrame

Question

I am trying to convert the following website into a DataFrame so that I can work with the data: https://www.ifsqn.com/forum/index.php/rss/forums/4-food-safety-quality-discussion/

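I have looked all over the web, but I can only find how to convert an XML file into a DataFrame. I tried the following, but it does not work because this is not an XML file. I can handle the pandas part myself, but first I need data to work with.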

import requests
import xml.etree.ElementTree as ET

headers = {'User-Agent': 'Mozilla/5.0'}

r = requests.get("https://www.ifsqn.com/forum/index.php/rss/forums/4-food-safety-quality-discussion/",headers=headers)

c = r.content

root = ET.parse(r).getroot()

print(root)

What steps am I missing to turn the XML into a readable format so that I can load the data into a pandas DataFrame?

Any input is greatly appreciated!

Tags: python, python-3.x, pandas

Answer


The XML you want to parse is RSS, and because RSS has a specific format, you can use a Python library that parses RSS feeds (feedparser, for example):

import feedparser
import pandas as pd

parsed_rss = feedparser.parse('https://www.ifsqn.com/forum/index.php/rss/forums/4-food-safety-quality-discussion/')

pd.DataFrame(parsed_rss['entries'])
                                                title                                       title_detail  ...                                                 id guidislink
0                      Monitored vs Verifying Records  {'type': 'text/plain', 'language': None, 'base...  ...  https://www.ifsqn.com/forum/index.php/topic/38...      False
1   Is it necessary to follow the new ISO 22000 to...  {'type': 'text/plain', 'language': None, 'base...  ...  https://www.ifsqn.com/forum/index.php/topic/38...      False
2                      usda inspector tagging product  {'type': 'text/plain', 'language': None, 'base...  ...  https://www.ifsqn.com/forum/index.php/topic/38...      False
3                              Chocolate Liquor Discs  {'type': 'text/plain', 'language': None, 'base...  ...  https://www.ifsqn.com/forum/index.php/topic/38...      False
4                              Multi-Pack Beef Sticks  {'type': 'text/plain', 'language': None, 'base...  ...  https://www.ifsqn.com/forum/index.php/topic/38...      False
..                                                ...                                                ...  ...                                                ...        ...
95  HACCP Pan for super critical fluid extraction ...  {'type': 'text/plain', 'language': None, 'base...  ...  https://www.ifsqn.com/forum/index.php/topic/38...      False
96               Illegal Drugs Pictured on Food Label  {'type': 'text/plain', 'language': None, 'base...  ...  https://www.ifsqn.com/forum/index.php/topic/38...      False
97    BRC metal can packaging compliance requirements  {'type': 'text/plain', 'language': None, 'base...  ...  https://www.ifsqn.com/forum/index.php/topic/38...      False
98  Codex Decision tree in ISO 22000:2018 - Clause...  {'type': 'text/plain', 'language': None, 'base...  ...  https://www.ifsqn.com/forum/index.php/topic/38...      False
99           BRC clause 4.3.4 - Battery Charging area  {'type': 'text/plain', 'language': None, 'base...  ...  https://www.ifsqn.com/forum/index.php/topic/38...      False

[100 rows x 10 columns]

Another approach is to parse the XML yourself into a structure that can be used to construct a DataFrame.
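
A minimal sketch of that manual approach, using only the standard library. The sample feed, its example.com URLs, and the helper name rss_to_records are hypothetical stand-ins; with the real feed you would pass r.content instead, and the tags inside each item may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for the bytes in r.content.
SAMPLE_RSS = b"""<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example forum</title>
    <item>
      <title>First topic</title>
      <link>https://example.com/topic/1</link>
      <pubDate>Mon, 01 Jan 2024 10:00:00 +0000</pubDate>
    </item>
    <item>
      <title>Second topic</title>
      <link>https://example.com/topic/2</link>
      <pubDate>Tue, 02 Jan 2024 11:30:00 +0000</pubDate>
    </item>
  </channel>
</rss>"""


def rss_to_records(xml_content):
    """Return one dict per <item>, mapping child tag names to their text."""
    # fromstring parses bytes or str directly and returns the root element.
    root = ET.fromstring(xml_content)
    records = []
    for item in root.iter("item"):
        # Each direct child of <item> (title, link, ...) becomes one column.
        records.append({child.tag: (child.text or "").strip() for child in item})
    return records


records = rss_to_records(SAMPLE_RSS)
print(records[0]["title"])
```

Assuming pandas is installed, passing the list to pd.DataFrame(records) should then give one row per item, with one column per child tag.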

Edit:

I now see that you passed r rather than c in the following line:

root = ET.parse(r).getroot()
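
For context on that line: ET.parse expects a file name or a file object, so handing it the bytes in c would likely fail as well, whereas ET.fromstring accepts the content directly. A minimal sketch of both options, using a hypothetical inline feed in place of r.content:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for c = r.content from the question.
content = b"<rss version='2.0'><channel><title>Example feed</title></channel></rss>"

# Option 1: fromstring takes the bytes (or a str) and returns the root element.
root_from_bytes = ET.fromstring(content)

# Option 2: parse needs a file name or file-like object, so wrap the bytes first.
root_from_file = ET.parse(io.BytesIO(content)).getroot()

print(root_from_bytes.tag, root_from_file.tag)
```

Either root can then be walked with find or iter to collect the fields you need.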
