Python: Parsing an XML (XLIFF) file that contains a header

Problem description

I am trying to parse an XML file (more precisely, an XLIFF translation file) and convert it to the slightly different TMX format.

My source XLIFF file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<xliff version="1.0">
  <file origin="Some/Folder/proj/SomeFile.strings" source-language="en" target-language="hr" datatype="strings" product="Product BlahBlah" product-version="3.9.12" build-num="1" x-train="Blurt">
    <header>
      <count-group name="SomeFile.strings">
        <count count-type="total" unit="word">2</count>
      </count-group>
    </header>
    <body>
      <trans-unit id="8.text" restype="string" resname=""><source>End</source><target match-quality="80" match-description="_predecessor(22) _path(0) _file(15) datatype(5) id(17) restype(6) resname(4) _reserved(11) _one-word-threshold(-25)" state="signed-off" x-match-attributes="preserved-stable" state-qualifier="exact-match" x-leverage-path="predecessor-ice">Kraj</target><note>This is a note</note></trans-unit>
    </body>
  </file>
  <file origin="Some/Folder/proj/SomeOtherFile.strings" source-language="en" target-language="hr" datatype="strings" product="Product BlahBlah2" product-version="3.12.56" build-num="1" x-train="Blurt2">
    <header>
      <count-group name="SomeOtherFile.strings">
        <count count-type="total" unit="word">4</count>
      </count-group>
    </header>
    <body>
      <trans-unit id="14.accessibilityLabel" restype="string" resname=""><source>return to project list</source><target match-quality="80" match-description="_predecessor(22) _path(0) _file(15) datatype(5) id(17) restype(6) resname(4) _reserved(11)" state="signed-off" x-match-attributes="preserved-stable" state-qualifier="exact-match" x-leverage-path="predecessor-ice">povratak na popis projekata</target><note>This is again a note</note></trans-unit>
    </body>
  </file>

  (and more <file> elements continue... some with many more <trans-unit> </trans-unit> elements than these above)

  </xliff>

My goal is to rearrange and simplify this a little, converting the format above into the following:

<tu>
    <prop type="FileSource">SomeFile.strings</prop>
    <tuv xml:lang="en">
        <seg>End</seg>
    </tuv>
    <tuv xml:lang="hr">
        <prop type="Note">This is a note</prop>
        <seg>Kraj</seg>
    </tuv>
</tu>
<tu>
    <prop type="FileSource">SomeOtherFile.strings</prop>
    <tuv xml:lang="en">
        <seg>return to project list</seg>
    </tuv>
    <tuv xml:lang="hr">
        <prop type="Note">This is again a note</prop>
        <seg>povratak na popis projekata</seg>
    </tuv>
</tu>

Note that the original XLIFF file may contain several <file origin ...> sections, each with many <trans-unit ...> elements (these are the actual strings in that file).

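I have managed to write the part that gives me the "source" and "target" pieces, but I still need two things from the <file origin ...> element. The first is the languages defined there (the "source-language" and "target-language" attributes), which I will write out for each string as <tuv xml:lang="en"> and <tuv xml:lang="hr">. The second is the reference to the strings file (i.e. "SomeFile.strings" and "SomeOtherFile.strings"), to be used as <prop type="FileSource">SomeFile.strings</prop>.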

At the moment I have the following Python code, which extracts the required "source" and "target" elements nicely:

#!/usr/bin/env python3
#

import sys

from lxml import etree

if len(sys.argv) < 2:
    print('Wrong number of arguments:\n => You need to provide a filename for processing!')
    sys.exit(1)

file = sys.argv[1]

tree = etree.iterparse(file)
for action, elem in tree:
    if elem.tag == "source":
        print("<TransUnit>")
        print("\t<Source>" + elem.text  + "</Source>")
    elif elem.tag == "target":
        print("\t<Target>" + elem.text + "</Target>")
    elif elem.tag == "note":
        # <note> is the last child of a <trans-unit>, so close the unit here.
        if elem.text is not None:
            print("\t<Note>" + elem.text + "</Note>")
        print("</TransUnit>")

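Now, how can I also extract the "source-language" (i.e. the value "en"), the "target-language" (i.e. the value "hr") and the file reference (i.e. "SomeFile.strings") from the <file origin ...> elements of the original XLIFF file?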

In addition, I need to keep (remember) that file reference for every string in the file, i.e.:

<prop type="FileSource">SomeOtherFile.strings</prop>

So that, for example, I would get:

<tu>
    <prop type="FileSource">SomeFile.strings</prop>
    <tuv xml:lang="en">
        <seg>End</seg>
    </tuv>
    <tuv xml:lang="hr">
        <prop type="Note">This is a note</prop>
        <seg>Kraj</seg>
    </tuv>
</tu>
<tu>
    <prop type="FileSource">SomeFile.strings</prop>
    <tuv xml:lang="en">
        <seg>Start</seg>
    </tuv>
    <tuv xml:lang="hr">
        <prop type="Note">This is a note</prop>
        <seg>Početak</seg>
    </tuv>
</tu>

I would greatly appreciate any help with this.

Tags: python, lxml, xliff

Solution


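import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9

tree = ET.parse('inputfile.xlf')
root = tree.getroot()

# Each <file> element carries the languages and the origin path as attributes.
# Reading them in one loop keeps the values of each <file> together.
for file_elem in root.findall('file'):
    s_value = file_elem.get('source-language')
    t_value = file_elem.get('target-language')
    origin = file_elem.get('origin')
    print(s_value, t_value, origin)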
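
The snippet above only reads the attributes. To also carry the file reference into every string, one option is to nest a loop over the <trans-unit> elements inside the loop over <file>. The following is a minimal sketch under some assumptions, not a definitive implementation: the function name xliff_to_tus and the SAMPLE string are made up for illustration, the file name is taken as the last path component of the origin attribute, and the XLIFF is assumed to have no default namespace (as in the question's sample; a file that declares xmlns on <xliff> would need namespace-qualified tag names).

```python
import posixpath
import xml.etree.ElementTree as ET

# ElementTree serialises this qualified name as xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def xliff_to_tus(xliff_text):
    """Build one <tu> element per <trans-unit> (hypothetical helper)."""
    root = ET.fromstring(xliff_text)
    tus = []
    for file_elem in root.findall("file"):
        # These values are remembered for every trans-unit of this <file>.
        src_lang = file_elem.get("source-language", "")
        tgt_lang = file_elem.get("target-language", "")
        file_source = posixpath.basename(file_elem.get("origin", ""))

        for unit in file_elem.iter("trans-unit"):
            tu = ET.Element("tu")
            ET.SubElement(tu, "prop", {"type": "FileSource"}).text = file_source

            src_tuv = ET.SubElement(tu, "tuv", {XML_LANG: src_lang})
            ET.SubElement(src_tuv, "seg").text = unit.findtext("source", default="")

            tgt_tuv = ET.SubElement(tu, "tuv", {XML_LANG: tgt_lang})
            note = unit.findtext("note")
            if note:  # skip missing or empty notes
                ET.SubElement(tgt_tuv, "prop", {"type": "Note"}).text = note
            ET.SubElement(tgt_tuv, "seg").text = unit.findtext("target", default="")

            tus.append(tu)
    return tus


# Trimmed-down sample based on the question's input (illustrative only).
SAMPLE = """<xliff version="1.0">
  <file origin="Some/Folder/proj/SomeFile.strings" source-language="en" target-language="hr">
    <body>
      <trans-unit id="8.text"><source>End</source><target>Kraj</target><note>This is a note</note></trans-unit>
    </body>
  </file>
</xliff>"""

for tu in xliff_to_tus(SAMPLE):
    print(ET.tostring(tu, encoding="unicode"))
```

Building elements rather than concatenating strings means special characters such as & or < in the translated text are escaped on output. For a real file, ET.parse(filename).getroot() could replace ET.fromstring, and the same loops should work with lxml.etree if that is preferred.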
