Using a Python markdown Treeprocessor to wrap an etree element

Question

I am trying to write a Python markdown Treeprocessor extension that wraps a span tag inside a div tag; so if I have (Markdown) -

before <span>hello world</span> after

then after processing I want to end up with -

before <div><span>hello world</span></div> after

Python markdown seems to have all these different processors that you can use and extend -

https://python-markdown.github.io/extensions/api/

I thought a Treeprocessor might be the best fit, and came up with the following -

import markdown
from markdown.extensions import Extension
from markdown.treeprocessors import Treeprocessor


class MyTreeProcessor(Treeprocessor):

    def run(self, doc):
        # Print every element in the tree, depth first.
        def iterate(parent):
            print("%s %s :: %s" % (parent.tag, parent.attrib, parent.text))
            # Iterate the element directly; getchildren() was removed in Python 3.9.
            for child in parent:
                iterate(child)
        iterate(doc)


class MyTreeExtension(Extension):

    def extendMarkdown(self,
                       md,
                       key="my_extension",
                       index=1e8):
        md.registerExtension(self)
        # A processor is constructed with the Markdown instance itself.
        md.treeprocessors.register(MyTreeProcessor(md),
                                   key, index)


if __name__ == "__main__":
    md = markdown.Markdown(extensions=[MyTreeExtension()])
    md.convert("before <span>hello world</span> after")

But if I run it with an index value of 1e8, I get the following -

div {} :: None
p {} :: before <span>hello world</span> after

whereas if I run it with an index value of 0, I get the following -

div {} :: 

p {} :: before wzxhzdk:0hello worldwzxhzdk:1 after

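Neither of these is what I want - in the first case the span has not been processed yet, and in the second case it has already been processed into some strange format :-/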

I'm finding the markdown extension docs pretty turgid for what should seemingly be a simple task. Could someone confirm whether I am barking up the right tree in using a Treeprocessor here, and if so, what I am doing wrong in being unable to get this span parsed as an etree.Element?

TIA

Tags: python, markdown

Answer


You probably don't want to use a Treeprocessor here as you are working with raw HTML.

Python-Markdown parses Markdown by converting it to an etree object. However, it does not parse HTML (at least not completely) and etree objects cannot hold raw HTML strings. Well, they can hold them, but they will get escaped when the etree object gets rendered to an HTML string. Therefore, the raw HTML is identified and replaced with a placeholder (wzxhzdk:0). After the etree object is rendered to an HTML string, a postprocessor then finds all the placeholders and swaps them out for the raw HTML which was stored using the placeholder as a key.
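To make the two claims above concrete, here is a minimal standard-library sketch. It does not use Python-Markdown's own stash; `store` and `restore` are hypothetical helpers written for illustration, and the control-character delimiters around the placeholder are an assumption of this toy model. The first part shows raw HTML being escaped when an etree element is serialized. The second part shows the store-then-swap idea.

```python
import re
import xml.etree.ElementTree as etree

# 1. Raw HTML kept as element text is escaped when the tree is serialized.
p = etree.Element("p")
p.text = "before <span>hello world</span> after"
escaped = etree.tostring(p, encoding="unicode")
print(escaped)

# 2. Toy stash (hypothetical helpers, not Python-Markdown's API):
#    raw HTML is swapped for a placeholder, then restored after rendering.
stash = []

def store(raw_html):
    """Remember raw_html and return a placeholder that refers to it."""
    stash.append(raw_html)
    return "\x02wzxhzdk:%d\x03" % (len(stash) - 1)

def restore(rendered):
    """Swap every placeholder in the rendered string back for its raw HTML."""
    return re.sub(r"\x02wzxhzdk:(\d+)\x03",
                  lambda m: stash[int(m.group(1))],
                  rendered)

p2 = etree.Element("p")
p2.text = "before " + store("<span>") + "hello world" + store("</span>") + " after"
rendered = etree.tostring(p2, encoding="unicode")  # placeholders survive serialization
final = restore(rendered)
print(final)
```

In this model the tree never contains a span element at any point, only text with placeholders in it. This matches what the treeprocessor in the question observed when run with index 0.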

The reason for the different behavior between your two indexes is that one of them runs before the inlinepatterns are run (the first treeprocessor is actually a wrapper around all inlinepatterns) and the other runs after, but before the placeholders are swapped back in. Of course, as the placeholders are swapped back in by a postprocessor, you cannot access that state from a treeprocessor.

In summary:

  1. Block level raw HTML is handled by a preprocessor and will never be available from a treeprocessor.
  2. Inline raw HTML is handled by an inlinepattern and so inline raw HTML will be unprocessed before the inlineProcessor treeprocessor runs, but will be unavailable from a treeprocessor after it runs.

A likely solution would be to use an inlinepattern. However, you will need to parse the HTML tags yourself.
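As a rough sketch of the inlinepattern route, assuming Python-Markdown 3.x and its `InlineProcessor` class: the names `WrapSpanProcessor`, `WrapSpanExtension`, `SPAN_RE` and the registry key `"wrap_span"` are made up for this example. The regular expression is deliberately naive (no attributes, no nesting). The priority of 95 is an assumption, chosen to run just ahead of the built-in inline raw HTML pattern, which I believe is registered at 90.

```python
import xml.etree.ElementTree as etree

import markdown
from markdown.extensions import Extension
from markdown.inlinepatterns import InlineProcessor

# Naive on purpose: matches only a bare <span>...</span> without attributes.
SPAN_RE = r"<span>(.*?)</span>"


class WrapSpanProcessor(InlineProcessor):
    """Hypothetical processor that turns a raw span into real elements."""

    def handleMatch(self, m, data):
        # Build <div><span>text</span></div> as etree elements.
        div = etree.Element("div")
        span = etree.SubElement(div, "span")
        span.text = m.group(1)
        # Return the element plus the span of text it replaces.
        return div, m.start(0), m.end(0)


class WrapSpanExtension(Extension):

    def extendMarkdown(self, md):
        # Priority 95 is an assumption: high enough to see the raw tag
        # before the built-in inline HTML pattern stashes it.
        md.inlinePatterns.register(WrapSpanProcessor(SPAN_RE, md),
                                   "wrap_span", 95)


html = markdown.markdown("before <span>hello world</span> after",
                         extensions=[WrapSpanExtension()])
print(html)
```

Because the pattern returns real elements rather than raw strings, a treeprocessor that runs after inline processing would also see the div and span in the tree. The result still places a div inside the paragraph.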

Additionally, the output you desire is not valid HTML. Note that before <span>hello world</span> after would get wrapped in a <p> tag and <p> tags cannot hold any other block level elements, including <div> tags. Sure, there is nothing forcing you to not wrap a <div> tag in a <p> tag, but a browser will never interpret the HTML that way. According to the HTML spec, a block level tag automatically closes a <p> tag. Therefore, a browser will (probably) interpret your output something like this:

<p>before </p><div><span>hello world</span></div> after<p></p>

Note that there is no <p> wrapping after. Yes, the closing tag would be present, but without an opening tag immediately after the closing </div>, the browser wouldn't know where the paragraph starts, and you would get an empty <p> where the orphaned closing tag is.

Now if you are okay with all this, then why not just use <div> in your Markdown to begin with? Or are <span> and <div> just placeholders in the question for actual tags in real life? If so, whether they are inline or block level could change the answer.

