python - How to recursively include XML files in Python, keeping track of the original file and line?
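Problem description
I am preparing a Python framework for processing system descriptions stored in XML files. The descriptions are hierarchical and should allow building libraries of submodule descriptions. That requires support for including XML files. I tried the xml.etree.ElementInclude module, but it does not seem to handle nested includes correctly.
So I wrote my own solution, which replaces include directives hidden in XML comments: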
<!-- include path/to/the/included_file -->
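with the contents of the included file. If the included file itself contains include directives, they are processed recursively. The code is quite simple: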
import os.path
import re

R1 = r"<!--\s*include\s*(?P<fname>\S+)\s*-->"
P1 = re.compile(R1)

def handle_includes(file_path, base_dir="./"):
    """ Function handle_includes replaces the include directives:
    <!-- include path/to/the/included_file -->
    with the contents of the included file.
    If the included file also contains include directives, they
    are handled recursively.
    The base_dir argument specifies base directory for relative
    paths.
    """
    # Check if the file_path is relative or absolute
    if os.path.isabs(file_path):
        # absolute
        full_file_path = file_path
    else:
        # relative
        full_file_path = os.path.join(base_dir, file_path)
    # Read the file contents (the file is closed when the block exits)
    with open(full_file_path, 'r') as f:
        contents = f.read()
    # Create the base directory for possible further includes
    next_base_dir = os.path.dirname(full_file_path)
    # Mark the start position
    start_pos = 0
    # List of the parts of the string
    chunks = []
    # Find the include directives
    incl_iter = P1.finditer(contents)
    for incl_instance in incl_iter:
        # Find the occurrence of include
        include_span = incl_instance.span()
        # Put the unmodified part of the string to the list
        chunks.append(contents[start_pos:include_span[0]])
        # Read the included file and handle nested includes
        replacement = handle_includes(incl_instance.group('fname'), next_base_dir)
        chunks.append(replacement)
        # Adjust the start position
        start_pos = include_span[1]
    # Add the final text (if any)
    chunks.append(contents[start_pos:])
    # Now create and return the content with resolved includes
    res = ''.join(chunks)
    return res
The function is called simply as:
final_xml=handle_includes('path/to/top.xml')
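To make the recursive behaviour concrete, here is a minimal sketch that builds three small files in a temporary directory and expands them. It uses a compact variant of the same idea based on re.sub with a callback, not the function above; the name expand and the file names top.xml, lib/mid.xml and lib/leaf.xml are assumptions made up for the illustration.

```python
import os
import re
import tempfile

INCLUDE = re.compile(r"<!--\s*include\s*(?P<fname>\S+)\s*-->")

def expand(path):
    """Hypothetical compact variant: replace each directive via a callback."""
    with open(path) as f:
        text = f.read()
    # Relative includes are resolved against the directory of the including file.
    base = os.path.dirname(path)
    return INCLUDE.sub(
        lambda m: expand(os.path.join(base, m.group("fname"))), text
    )

with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "lib"))
    files = {
        "top.xml": "<top>\n<!-- include lib/mid.xml -->\n</top>\n",
        "lib/mid.xml": "<mid>\n<!-- include leaf.xml -->\n</mid>\n",
        "lib/leaf.xml": "<leaf/>\n",
    }
    for name, body in files.items():
        with open(os.path.join(tmp, *name.split("/")), "w") as f:
            f.write(body)
    merged = expand(os.path.join(tmp, "top.xml"))

print(merged)
```

Note that leaf.xml is referenced relative to lib/, the directory of the file that includes it, which matches how next_base_dir is used in the original function.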
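The handle_includes function above works fine, and the resulting XML can be processed further with xml.etree.ElementTree.fromstring. However, when the final XML grows large, it becomes hard to locate errors that originate in deeply included XML files. Is it possible to somehow attach information about the original source file and line number to the generated XML?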
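The difficulty can be seen in a small sketch: when xml.etree.ElementTree.fromstring fails, the ParseError it raises carries a position attribute, a (line, column) pair that refers to the merged text and not to any of the source files. The XML content and the helper name first_error_position below are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Stand-in for text that was already merged from several files
# (hypothetical content; the <port> element is never closed).
merged_xml = "\n".join([
    "<system>",
    "  <module name='a'>",
    "    <port name='p1'>",
    "  </module>",
    "</system>",
])

def first_error_position(text):
    """Return (line, column) of the first parse error, or None if text parses."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        # The position is relative to the merged string, not to a source file.
        return err.position
    return None

pos = first_error_position(merged_xml)
print(pos)
```

Nothing in the reported position says which included file the offending line came from, which is the information the question asks for.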
Solution
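I have managed to implement tracking of the origin of the included lines. The handle_includes function now returns not only the contents of the file with the includes inserted, but also a list of objects that store the origin of each block of lines. Each LineLocation object stores:
- the first line of the block in the generated XML
- the last line of the block in the generated XML
- the position of the block's first line in the original file
- the path to the file the block was read from
If an error is detected while processing some line of the final XML, this list of objects makes it easy to find the location of the corresponding line in the original source, which consists of multiple files.
The implementation is only slightly more complicated: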
import os.path
import re

R1 = r"<!--\s*include\s*(?P<fname>\S+)\s*-->"
P1 = re.compile(R1)

class LineLocation(object):
    """ Class LineLocation stores the origin of the
    block of source code lines.
    "start" is the location of the first line of the block
    "end" is the location of the last line of the block
    "offset" is the position of the first line of the block in the original file
    "fpath" is the path to the file from where the lines were read.
    """
    def __init__(self, start, end, offset, fpath):
        self.start = start
        self.end = end
        self.offset = offset
        self.fpath = fpath

    def adjust(self, shift):
        self.start += shift
        self.end += shift

    def tostr(self):
        return str(self.start)+"-"+str(self.end)+"->"+str(self.offset)+":"+self.fpath

def handle_includes(file_path, base_dir="./"):
    """ Function handle_includes replaces the include directives:
    <!-- include path/to/the/included_file -->
    with the contents of the included file.
    If the included file also contains include directives, they
    are handled recursively.
    The base_dir argument specifies base directory for relative
    paths.
    Returns a tuple: the resulting text and the list of LineLocation
    objects describing where each block of lines came from.
    """
    # Check if the file_path is relative or absolute
    if os.path.isabs(file_path):
        # absolute
        full_file_path = file_path
    else:
        # relative
        full_file_path = os.path.join(base_dir, file_path)
    # Read the file contents (the file is closed when the block exits)
    with open(full_file_path, 'r') as f:
        contents = f.read()
    # Create the base directory for possible further includes
    next_base_dir = os.path.dirname(full_file_path)
    # Mark the start position
    start_pos = 0
    # Current number of lines
    start_line = 0
    # Offset in lines from the beginning of the file
    offset_line = 0
    # List of the parts of the string
    chunks = []
    lines = []
    # Find the include directives
    incl_iter = P1.finditer(contents)
    for incl_instance in incl_iter:
        # Find the occurrence of include
        include_span = incl_instance.span()
        # Put the unmodified part of the string to the list
        part = contents[start_pos:include_span[0]]
        chunks.append(part)
        # Find the number of the end line
        n_of_lines = len(part.split('\n'))-1
        end_line = start_line + n_of_lines
        lines.append(LineLocation(start_line, end_line, offset_line, file_path))
        offset_line += n_of_lines
        start_line = end_line
        # Read the included file and handle nested includes
        replacement, rlines = handle_includes(incl_instance.group('fname'), next_base_dir)
        chunks.append(replacement)
        # Now adjust the line positions according to the first line of the include
        for r in rlines:
            r.adjust(start_line)
        # Adjust the start line after the end of the include
        # (rlines is empty when the included file is empty)
        if rlines:
            start_line = rlines[-1].end
        # Append lines positions
        lines += rlines
        # Adjust the start position
        start_pos = include_span[1]
    # Add the final text (if any)
    part = contents[start_pos:]
    if len(part) > 0:
        chunks.append(part)
        # And add the final part line positions
        n_of_lines = len(part.split('\n'))-1
        end_line = start_line + n_of_lines
        lines.append(LineLocation(start_line, end_line, offset_line, file_path))
        offset_line += n_of_lines
    # Now create and return the content with resolved includes
    res = ''.join(chunks)
    return res, lines
The function should now be called as:
final_xml, lines = handle_includes('path/to/the/top.xml')
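The returned list can then be searched to map a line of the final XML back to its source. Below is a minimal sketch of such a lookup. It assumes, from reading the code above, that start, end and offset are 0-based line numbers and that neighbouring blocks share their boundary line (the line that held the include directive). The helper name locate, the namedtuple stand-in Loc and the sample data are hypothetical.

```python
from collections import namedtuple

# Stand-in with the same fields as LineLocation, used only so that
# the example runs on its own (assumption, not part of the answer).
Loc = namedtuple("Loc", "start end offset fpath")

def locate(lines, merged_line):
    """Return candidate (file, original_line) pairs for a 0-based merged line."""
    return [
        (loc.fpath, loc.offset + merged_line - loc.start)
        for loc in lines
        if loc.start <= merged_line <= loc.end
    ]

# Hand-written list for a top.xml whose second line includes a two-line sub.xml.
demo_lines = [
    Loc(0, 1, 0, "top.xml"),
    Loc(1, 3, 0, "sub.xml"),
    Loc(3, 5, 1, "top.xml"),
]

print(locate(demo_lines, 2))  # [('sub.xml', 1)]
print(locate(demo_lines, 1))  # [('top.xml', 1), ('sub.xml', 0)]
```

A boundary line yields more than one candidate because it is made of text from both files. The line reported in the position attribute of xml.etree.ElementTree.ParseError is 1-based, so under the assumptions above it would need 1 subtracted before being passed to such a lookup.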