How can I recursively include XML files in Python while tracking the original file and line?

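Problem description

I am preparing a Python framework for processing system descriptions stored in XML files. The descriptions are hierarchical and should allow building libraries of submodule descriptions, which requires support for including XML files. I tried the xml.etree.ElementInclude module, but it does not seem to handle nested includes correctly.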

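I therefore wrote my own solution, which replaces each include directive hidden in an XML comment:

<!-- include path/to/the/included_file -->

with the contents of the included file. If the included file contains further include directives, they are processed recursively. The code is very simple: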

import os.path
import re
R1 = r"<!--\s*include\s*(?P<fname>\S+)\s*-->"
P1 = re.compile(R1)

def handle_includes(file_path,base_dir="./"):
    """ Function handle_includes replaces the include directives:
    <!-- include path/to/the/included_file -->
    with the contents of the included file.
    If the included file also contains include directives, they
    are handled recursively.
    The base_dir argument specifies base directory for relative
    paths.
    """
    # os.path.join keeps file_path unchanged when it is absolute,
    # otherwise it resolves it against base_dir
    full_file_path = os.path.join(base_dir, file_path)
    # Read the file contents, closing the file afterwards
    with open(full_file_path, 'r') as f:
        contents = f.read()
    # Create the base directory for possible further includes
    next_base_dir = os.path.dirname(full_file_path)
    # Mark the start position
    start_pos = 0
    # List of the parts of the string
    chunks = []
    # Find the include directives
    incl_iter = P1.finditer(contents)
    for incl_instance in incl_iter:
        # Find the occurrence of include
        include_span = incl_instance.span()
        # Put the unmodified part of the string to the list
        chunks.append(contents[start_pos:include_span[0]])
        # Read the included file and handle nested includes
        replacement = handle_includes(incl_instance.group('fname'), next_base_dir)
        chunks.append(replacement)
        # Adjust the start position
        start_pos = include_span[1]
    # Add the final text (if any)
    chunks.append(contents[start_pos:])
    # Now create and return the content with resolved includes
    res = ''.join(chunks)
    return res

The function is called simply as:

final_xml=handle_includes('path/to/top.xml')
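
To make the recursive behaviour easier to check, here is a minimal end-to-end sketch. It is not the listing above but a compact variant built on re.sub with a callback. The names resolve_includes and build_demo_tree and the three small files are hypothetical, introduced only for this illustration. Relative paths in nested directives are resolved against the directory of the including file, as in handle_includes.

```python
import os
import re
import tempfile

# Same pattern as R1 in the listing above
INCLUDE_RE = re.compile(r"<!--\s*include\s*(?P<fname>\S+)\s*-->")

def resolve_includes(file_path, base_dir="."):
    """Compact variant of handle_includes (hypothetical name)."""
    # os.path.join returns file_path unchanged when it is absolute
    full_path = os.path.join(base_dir, file_path)
    with open(full_path, "r") as f:
        contents = f.read()
    next_base_dir = os.path.dirname(full_path)
    # Each directive is replaced by the recursively resolved included file
    return INCLUDE_RE.sub(
        lambda m: resolve_includes(m.group("fname"), next_base_dir), contents)

def build_demo_tree(root):
    """Write three hypothetical files: top.xml -> lib/mod.xml -> lib/leaf.xml."""
    os.makedirs(os.path.join(root, "lib"))
    files = {
        "top.xml": "<system>\n<!-- include lib/mod.xml -->\n</system>\n",
        os.path.join("lib", "mod.xml"):
            "<module>\n<!-- include leaf.xml -->\n</module>\n",
        os.path.join("lib", "leaf.xml"): "<leaf/>\n",
    }
    for name, text in files.items():
        with open(os.path.join(root, name), "w") as f:
            f.write(text)

with tempfile.TemporaryDirectory() as root:
    build_demo_tree(root)
    demo_xml = resolve_includes("top.xml", root)

print(demo_xml)
```

Note that the newline following each directive is kept, so every include leaves an extra empty line after the inserted text.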

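The handle_includes function above works correctly, and the resulting XML can be processed further with xml.etree.ElementTree.fromstring. However, once the final XML grows large, it becomes hard to locate errors that sit in deeply included XML files. Is there a way to attach information about the original source file and line number to the generated XML?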

Tags: python, xml, include

Solution


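I have managed to track the origin of the included lines. The handle_includes function now returns not only the file contents with the includes inserted, but also a list of objects that record where each block of lines came from. Each LineLocation object stores:

  • the first line of the block in the resulting XML
  • the last line of the block in the resulting XML
  • the position of the block's first line in the original file
  • the path of the file the block was read from

If an error is detected while processing some line of the final XML, this list makes it easy to find the corresponding line in the original source, which is spread over multiple files. All line numbers are zero-based.

The implementation is only slightly more complicated: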

import os.path
import re
R1 = r"<!--\s*include\s*(?P<fname>\S+)\s*-->"
P1 = re.compile(R1)
class LineLocation(object):
    """ Class LineLocation stores the origin of the
    block of source code lines.
    "start" is the location of the first line of the block
    "end" is the location of the last line of the block
    "offset" is the position of the first line of the block in the original file
    "fpath" is the path to the file from where the lines were read.
    """
    def __init__(self, start, end, offset, fpath):
        self.start = start
        self.end = end
        self.offset = offset
        self.fpath = fpath
    def adjust(self, shift):
        self.start += shift
        self.end += shift
    def tostr(self):
        return str(self.start)+"-"+str(self.end)+"->"+str(self.offset)+":"+self.fpath

def handle_includes(file_path, base_dir="./"):
    """ Function handle_includes replaces the include directives:
    <!-- include path/to/the/included_file -->
    with the contents of the included file.
    If the included file also contains include directives, they
    are handled recursively.
    The base_dir argument specifies base directory for relative
    paths.
    """
    # os.path.join keeps file_path unchanged when it is absolute,
    # otherwise it resolves it against base_dir
    full_file_path = os.path.join(base_dir, file_path)
    # Read the file contents, closing the file afterwards
    with open(full_file_path, 'r') as f:
        contents = f.read()
    # Create the base directory for possible further includes
    next_base_dir = os.path.dirname(full_file_path)
    # Find the include directives
    # Mark the start position
    start_pos = 0
    # Current number of lines
    start_line = 0
    # Offset in lines from the beginning of the file
    offset_line = 0
    # List of the parts of the string
    chunks = []
    lines = []
    incl_iter = P1.finditer(contents)
    for incl_instance in incl_iter:
        # Find the occurrence of include
        include_span = incl_instance.span()
        # Put the unmodified part of the string to the list
        part = contents[start_pos:include_span[0]]
        chunks.append(part)
        # Find the number of the end line
        n_of_lines = len(part.split('\n'))-1
        end_line = start_line + n_of_lines
        lines.append(LineLocation(start_line,end_line,offset_line,file_path))
        offset_line += n_of_lines
        start_line = end_line
        # Read the included file and handle nested includes
        replacement, rlines = handle_includes(incl_instance.group('fname'), next_base_dir)
        chunks.append(replacement)
        # Now adjust the line positions according to the first line of the include
        for r in rlines:
            r.adjust(start_line)
        # Adjust the start line after the end of the include;
        # rlines is empty when the included file is empty
        if rlines:
            start_line = rlines[-1].end
        # Append lines positions
        lines += rlines
        # Adjust the start position
        start_pos = include_span[1]
    # Add the final text (if any)
    part = contents[start_pos:]
    if len(part) > 0:
        chunks.append(part)
        # And add the final part line positions
        n_of_lines = len(part.split('\n'))-1
        end_line = start_line + n_of_lines
        lines.append(LineLocation(start_line, end_line, offset_line, file_path))
        offset_line += n_of_lines
    # Now create and return the content with resolved includes
    res = ''.join(chunks)
    return res, lines

The function should now be called as:

final_xml, lines = handle_includes('path/to/the/top.xml')
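
The answer does not show how to query the returned list, so the following is only a sketch of one possible lookup. The Block tuple is a hypothetical stand-in for LineLocation with the same four fields, find_origin is a name made up for this example, and the block list was worked out by hand for three small hypothetical files. The sketch also assumes that the line in ParseError.position is 1-based while the stored positions are 0-based.

```python
from collections import namedtuple
import xml.etree.ElementTree as ET

# Hypothetical stand-in for LineLocation, with the same four fields
Block = namedtuple("Block", "start end offset fpath")

def find_origin(blocks, final_line):
    """Map a 0-based line of the merged XML to (fpath, 0-based line in fpath)."""
    # A boundary line belongs to two neighbouring blocks; searching from the
    # end attributes it to the later one (an assumption of this sketch).
    for b in reversed(blocks):
        if b.start <= final_line <= b.end:
            return b.fpath, b.offset + (final_line - b.start)
    raise ValueError("line %d is outside the merged XML" % final_line)

# Block list written by hand for hypothetical files top.xml, lib/mod.xml, leaf.xml
blocks = [
    Block(0, 1, 0, "top.xml"),
    Block(1, 2, 0, "lib/mod.xml"),
    Block(2, 3, 0, "leaf.xml"),
    Block(3, 5, 1, "lib/mod.xml"),
    Block(5, 7, 1, "top.xml"),
]
# Merged text with a deliberate error (unquoted attribute) in the leaf file
merged = "<system>\n<module>\n<leaf attr=oops/>\n\n</module>\n\n</system>\n"

origin = None
try:
    ET.fromstring(merged)
except ET.ParseError as err:
    line, column = err.position          # line is assumed to be 1-based
    origin = find_origin(blocks, line - 1)
    print("error in %s, line %d" % (origin[0], origin[1] + 1))
```

With the real function, blocks would be the lines list returned by handle_includes, and the attribute names of LineLocation match those of Block. A line that mixes text from two files, for example when a directive sits in the middle of a line, can only be attributed to one of them by this lookup.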
