Parse an XML file and use the retrieved data to build a list of objects

Problem description

Edit: added a working solution at the bottom of the post for review.

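Every time I touch XML, I want to bang my head against a wall. Usually it is to write a file, and I eventually manage to work around all the inconsistencies, but this time I have to parse a document.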

Here is the scenario: I have an XML document listing games, and each game has some attributes (or child nodes? I'm actually not sure). What I want is:

For each game:
 Get its path, name, and genre
 Build a Game object with these
 Store the object in a list

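I understand the "findall" command, but I don't understand how to link the data together. Since it is a tree, I figured I should be able to walk from one game to the next, grab the data I need, and move on to the following game, but I'm stuck.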

Here is an excerpt of the XML file I need to parse:

<?xml version="1.0"?>
<gameList>
    <provider>
        <System>Megadrive</System>
        <software>Skraper</software>
        <database>ScreenScraper.fr</database>
        <web>http://www.screenscraper.fr</web>
    </provider>
    <game id="574" source="ScreenScraper.fr">
        <path>./3 Ninjas Kick Back.zip</path>
        <name>3 Ninjas Kick Back</name>
        <genre>Platform-Action</genre>
    </game>
    <game id="394" source="ScreenScraper.fr">
        <path>./688 Attack Sub.zip</path>
        <name>688 Attack Sub</name>
        <genre>Simulation</genre>
    </game>
</gameList>

And here is my current code, still in a sandbox, trial-and-error state:

import os
from xml.etree import ElementTree


class GameListParser:
    GAMELIST_FILE = 'gamelist.xml'

    GAMELIST_KEY = "gameList"
    GAME_KEY = "game"
    GENRE_KEY = "genre"
    PATH_KEY = "path"
    NAME_KEY = "name"

    keys_map = {
        GAMELIST_KEY: {
            GAME_KEY: [NAME_KEY, GENRE_KEY, PATH_KEY]
        }
    }

    def __init__(self, gamelist_path):
        self.gamelist = os.path.join(gamelist_path, self.GAMELIST_FILE)
        self.parsed_gamelist = None
        self.__parse()

    def __parse(self):
        self.parsed_gamelist = ElementTree.parse(self.gamelist)

    def __get_root(self):
        return self.parsed_gamelist.getroot()

    def get_all_games(self):
        return self.parsed_gamelist.findall(self.GAME_KEY)

    def print_games_details(self):
        for node in self.get_all_games():
            for game in node.iter():  # getiterator() was removed in Python 3.9
                name = game.attrib.get(self.NAME_KEY)
                genre = game.attrib.get(self.GENRE_KEY)

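With the print_games_details method I only wanted to print the game data, but the node and game objects turn out to be the same element, so name and genre are None and I don't retrieve the data I need.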
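
The None values are consistent with how ElementTree separates attributes from child elements: in the excerpt, id and source are attributes of the game tag, while path, name, and genre are child elements, so they are read with find/findtext rather than through attrib. A minimal sketch of the difference, assuming one game from the excerpt is held in a string (the name SAMPLE_XML is illustrative only):

```python
from xml.etree import ElementTree

# Hypothetical inline copy of the excerpt, trimmed to a single game.
SAMPLE_XML = """<gameList>
    <game id="574" source="ScreenScraper.fr">
        <path>./3 Ninjas Kick Back.zip</path>
        <name>3 Ninjas Kick Back</name>
        <genre>Platform-Action</genre>
    </game>
</gameList>"""

root = ElementTree.fromstring(SAMPLE_XML)
game = root.find("game")

# Attributes live on the <game> tag itself.
game_id = game.get("id")
# <name> is a child element, so an attribute lookup finds nothing.
name_as_attribute = game.attrib.get("name")
# findtext returns the text of the first matching child element.
name_as_child = game.findtext("name")

print(game_id, name_as_attribute, name_as_child)
```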

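I'm pretty sure this is simple, but I have only had to use XML three or four times in my life, and the only time I had to parse it into objects was in C++, as part of a complete system refactoring. The other two times were in Matlab and Python, writing objects out to an XML file. Each time I struggled to understand the logic of the tree and how to parse or create it, and online resources did not help me much.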

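Edit: I worked out a solution, and although it gives me results, I'm not at all satisfied with it. My problem is that this solution assumes I know the structure of the XML file very well and simply walk over it.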

I can't do anything generic with it, which is one of my main concerns with the XML approach.

If any of you could look at the following code and offer feedback and improvements, I would appreciate it:

import os
from xml.etree import ElementTree


class GameListParser:
    GAMELIST_FILE = 'gamelist.xml'

    GAME_ID = 'id'
    GAME_KEY = "game"
    GENRE_KEY = "genre"
    PATH_KEY = "path"
    NAME_KEY = "name"

    keys_map = [NAME_KEY, GENRE_KEY, PATH_KEY]

    def __init__(self, gamelist_path):
        self.gamelist = os.path.join(gamelist_path, self.GAMELIST_FILE)
        self.parsed_gamelist = None
        # Per-instance map; a class-level dict would be shared by every parser.
        self.game_map = {}
        self.__parse()

    def __str__(self):
        text_output = []

        for game_id, game in self.game_map.items():
            text_output.append("Game " + game_id + " has properties:")
            for key, value in game.items():
                text_output.append(key + ": " + value)
            text_output.append("\n")

        return "\n".join(text_output)

    def __get_game_id(self, game):
        return game.get(self.GAME_ID)

    def __game_is_valid(self, game):
        return self.__get_game_id(game) is not None

    def __get_all_games(self):
        return self.parsed_gamelist.findall(self.GAME_KEY)

    def __process_all_games(self):
        for game in self.__get_all_games():
            self.__process_game_nodes(game)

    def __process_game_nodes(self, game):

        if self.__game_is_valid(game):

            details = {}
            self.game_map[self.__get_game_id(game)] = details

            for key in self.keys_map:
                game_child = game.find(key)
                if game_child is not None:
                    details[key] = game_child.text
                else:
                    details[key] = ""

    def __parse(self):
        self.parsed_gamelist = ElementTree.parse(self.gamelist)
        self.__process_all_games()
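
One possible way to depend less on the file's structure, offered as a sketch rather than a full review of the class above: iterate over whatever child elements each game has and use their tag names as dictionary keys, so no list of expected keys is needed. The function name games_to_dicts and the inline sample are assumptions made for illustration.

```python
from xml.etree import ElementTree

# Hypothetical inline copy of the excerpt from the question.
SAMPLE_XML = """<gameList>
    <provider>
        <System>Megadrive</System>
    </provider>
    <game id="574" source="ScreenScraper.fr">
        <path>./3 Ninjas Kick Back.zip</path>
        <name>3 Ninjas Kick Back</name>
        <genre>Platform-Action</genre>
    </game>
    <game id="394" source="ScreenScraper.fr">
        <path>./688 Attack Sub.zip</path>
        <name>688 Attack Sub</name>
        <genre>Simulation</genre>
    </game>
</gameList>"""


def games_to_dicts(root, tag="game"):
    """Map each game's id to a dict of {child tag: child text}."""
    games = {}
    for game in root.iter(tag):
        game_id = game.get("id")
        if game_id is None:
            continue  # skip entries without an id, as the class above does
        # Iterating an element yields its direct children, whatever they are.
        games[game_id] = {child.tag: (child.text or "") for child in game}
    return games


games = games_to_dicts(ElementTree.fromstring(SAMPLE_XML))
for game_id, details in games.items():
    print(game_id, details)
```

The only structural knowledge left is the tag name of the repeated element and the attribute used as the key; any new child element added to a game would show up in the dict without a code change.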

Tags: python, xml, xml-parsing

Solution


I recommend a third-party library: SimplifiedDoc. Install it with: pip install -U simplified_scrapy

from simplified_scrapy import SimplifiedDoc
html = '''
<?xml version="1.0"?>
<gameList>
    <provider>
        <System>Megadrive</System>
        <software>Skraper</software>
        <database>ScreenScraper.fr</database>
        <web>http://www.screenscraper.fr</web>
    </provider>
    <game id="574" source="ScreenScraper.fr">
        <path>./3 Ninjas Kick Back.zip</path>
        <name>3 Ninjas Kick Back</name>
        <genre>Platform-Action</genre>
    </game>
    <game id="394" source="ScreenScraper.fr">
        <path>./688 Attack Sub.zip</path>
        <name>688 Attack Sub</name>
        <genre>Simulation</genre>
    </game>
</gameList>
'''
doc = SimplifiedDoc(html)
games = doc.gameList.games
datas = [[g.path.text, g.name.text, g.genre.text] for g in games]
print(datas)

Result:

[['./3 Ninjas Kick Back.zip', '3 Ninjas Kick Back', 'Platform-Action'], ['./688 Attack Sub.zip', '688 Attack Sub', 'Simulation']]

More examples are available here: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
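
For comparison, the same list can be built with the standard library alone, which also covers the original goal of one object per game. This is a sketch under assumptions: the Game dataclass and the load_games function are hypothetical names, not part of the question's code or of SimplifiedDoc.

```python
from dataclasses import dataclass
from xml.etree import ElementTree

# Hypothetical inline copy of the excerpt from the question.
SAMPLE_XML = """<gameList>
    <provider>
        <System>Megadrive</System>
    </provider>
    <game id="574" source="ScreenScraper.fr">
        <path>./3 Ninjas Kick Back.zip</path>
        <name>3 Ninjas Kick Back</name>
        <genre>Platform-Action</genre>
    </game>
    <game id="394" source="ScreenScraper.fr">
        <path>./688 Attack Sub.zip</path>
        <name>688 Attack Sub</name>
        <genre>Simulation</genre>
    </game>
</gameList>"""


@dataclass
class Game:
    path: str
    name: str
    genre: str


def load_games(xml_text):
    """Return one Game per <game> element directly under the root."""
    root = ElementTree.fromstring(xml_text)
    return [
        Game(
            # findtext falls back to the default when the child is missing.
            path=node.findtext("path", default=""),
            name=node.findtext("name", default=""),
            genre=node.findtext("genre", default=""),
        )
        for node in root.findall("game")
    ]


games = load_games(SAMPLE_XML)
for game in games:
    print(game)
```

To read from a file on disk instead of a string, ElementTree.parse(path).getroot() returns the same kind of root element.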

