python - Wrapping an etree element using a Python markdown Treeprocessor
Question
I'm trying to write a Python markdown Treeprocessor extension that wraps a span tag inside a div tag; so if I have (Markdown) -
before <span>hello world</span> after
then after processing I want to end up with -
before <div><span>hello world</span></div> after
Python markdown seems to have all these different processors that you can use and extend -
https://python-markdown.github.io/extensions/api/
I figured a Treeprocessor was probably the best fit, and came up with the following -
    import markdown
    from markdown.extensions import Extension
    from markdown.treeprocessors import Treeprocessor


    class MyTreeProcessor(Treeprocessor):
        def run(self, doc):
            # Recursively print each element's tag, attributes and text.
            def iterate(parent):
                print("%s %s :: %s" % (parent.tag, parent.attrib, parent.text))
                # Iterate the element directly; getchildren() was removed in Python 3.9.
                for child in parent:
                    iterate(child)
            iterate(doc)


    class MyTreeExtension(Extension):
        def extendMarkdown(self, md, key="my_extension", index=1e8):
            md.registerExtension(self)
            # index is the priority: higher values run earlier.
            md.treeprocessors.register(MyTreeProcessor(md), key, index)


    if __name__ == "__main__":
        md = markdown.Markdown(extensions=[MyTreeExtension()])
        md.convert("before <span>hello world</span> after")
But if I run it with an index value of 1e8, I get the following -
div {} :: None
p {} :: before <span>hello world</span> after
Whereas if I run it with an index value of 0, I get the following -
div {} ::
p {} :: before wzxhzdk:0hello worldwzxhzdk:1 after
Neither of these is what I want - in the first case the span has not been processed yet, and in the second case it has already been processed into some weird format :-/
I'm finding the markdown extension docs pretty turgid for what should seemingly be a simple task. Could someone confirm whether I am barking up the right tree in using a Treeprocessor here, and if so, what I am doing wrong in being unable to get this span parsed as an etree.Element?
TIA
Solution
You probably don't want to use a Treeprocessor here as you are working with raw HTML.
Python-Markdown parses Markdown by converting it to an etree object. However, it does not parse HTML (at least not completely), and etree objects cannot hold raw HTML strings. Well, they can hold them, but they will get escaped when the etree object is rendered to an HTML string. Therefore, the raw HTML is identified and replaced with a placeholder (wzxhzdk:0). After the etree object is rendered to an HTML string, a postprocessor finds all the placeholders and swaps them out for the raw HTML, which was stored using the placeholder as a key.
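To make the placeholder idea concrete, here is a rough stdlib-only sketch. It is a toy model of the mechanism described above, not Python-Markdown's actual code: the stash dict and the store() helper are hypothetical names, and the real placeholders are additionally wrapped in control characters.

```python
import xml.etree.ElementTree as etree

# Toy stash mapping placeholder -> raw HTML (hypothetical; the real stash
# lives on the Markdown instance).
stash = {}


def store(raw_html):
    """Remember a raw HTML fragment and return the placeholder standing in for it."""
    placeholder = "wzxhzdk:%d" % len(stash)
    stash[placeholder] = raw_html
    return placeholder


# 1. Raw HTML placed directly in an element is escaped on serialization.
p = etree.Element("p")
p.text = "before <span>hello world</span> after"
escaped = etree.tostring(p, encoding="unicode")

# 2. So each raw tag is swapped for a placeholder before the tree is rendered...
p2 = etree.Element("p")
p2.text = "before %shello world%s after" % (store("<span>"), store("</span>"))
rendered = etree.tostring(p2, encoding="unicode")

# 3. ...and the raw HTML is restored on the serialized string afterwards.
for placeholder, raw_html in stash.items():
    rendered = rendered.replace(placeholder, raw_html)

print(escaped)
print(rendered)
```

Between steps 2 and 3 the element text is exactly the "before wzxhzdk:0hello worldwzxhzdk:1 after" form from the question, which is why a treeprocessor never sees the span as an element.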
The reason for the different behavior between your two indexes is that one of them (1e8) runs before the inlinepatterns are run (the first treeprocessor is actually a wrapper around all inlinepatterns), and the other (0) runs after them, but before the placeholders are swapped back in. Of course, as the placeholders are swapped back in by a postprocessor, you cannot access that state from a treeprocessor.
In summary:
- Block level raw HTML is handled by a preprocessor and will never be available from a treeprocessor.
- Inline raw HTML is handled by an inlinepattern, so inline raw HTML will be unprocessed before the inlineProcessor treeprocessor runs, but will be unavailable from a treeprocessor after it runs.
A likely solution would be to use an inlinepattern. However, you will need to parse the HTML tags yourself.
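As a rough sketch of what that might look like (not a definitive implementation): the names SPAN_RE, SpanWrapProcessor, SpanWrapExtension and "span_wrap" are made up for illustration, the regex only handles a bare <span>...</span> with no attributes or nesting, and the priority of 95 is an assumption chosen so that the pattern runs before the built-in raw HTML inline pattern.

```python
import xml.etree.ElementTree as etree

import markdown
from markdown.extensions import Extension
from markdown.inlinepatterns import InlineProcessor

# Hypothetical pattern: a bare <span>...</span>, no attributes, no nesting.
# Anything more general means parsing the HTML tags yourself.
SPAN_RE = r"<span>(.*?)</span>"


class SpanWrapProcessor(InlineProcessor):
    def handleMatch(self, m, data):
        # Build <div><span>text</span></div> as real etree elements.
        div = etree.Element("div")
        span = etree.SubElement(div, "span")
        span.text = m.group(1)
        # Return the element and the span of source text it replaces.
        return div, m.start(0), m.end(0)


class SpanWrapExtension(Extension):
    def extendMarkdown(self, md):
        # Assumed priority: it only needs to be higher than the built-in
        # raw HTML inline pattern so this processor sees the tag first.
        md.inlinePatterns.register(SpanWrapProcessor(SPAN_RE, md), "span_wrap", 95)


html = markdown.markdown(
    "before <span>hello world</span> after",
    extensions=[SpanWrapExtension()],
)
print(html)
```

Because the div is built as an element rather than stored as raw HTML, a treeprocessor that runs after the inline patterns would be able to see it.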
Additionally, the output you desire is not valid HTML. Note that before <span>hello world</span> after would get wrapped in a <p> tag, and <p> tags cannot hold any other block level elements, including <div> tags. Sure, there is nothing forcing you to not wrap a <div> tag in a <p> tag, but a browser will never interpret the HTML that way. According to the HTML spec, a block level tag automatically closes a <p> tag. Therefore, a browser will (probably) interpret your output something like this:
<p>before </p><div><span>hello world</span></div> after<p></p>
Note that there is no <p> wrapping after. Yes, the closing tag would be present, but without the opening tag (immediately after the closing </div>), the browser wouldn't know where it starts, and you would get an empty <p> where the orphaned closing tag is.
Now, if you are okay with all this, then why not just use <div> in your Markdown to begin with? Or are <span> and <div> just placeholders in the question for actual tags in real life? If so, whether they are inline or block level could change the answer.