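python - Finding image tags outside of Markdown code blocks

Problem description

Introduction

I have a few hundred Markdown files that contain code blocks. They look like this: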

```html
<img src="fil.png">
```

- [ ] Here is another image <img src="fil.png"> and another `<img src="fil.png">`

  ```html
  <a href="scratch/index.html" id="scratch" data-original-title="" title="" aria-describedby="popover162945">
    <div class="logo-wrapper">
    </div>
    <div class="name">
      <span>Scratch</span>
    </div>
    <img src="fil.png">
  </a>
  ```

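My goal is to find every img tag outside the code blocks that has no alt attribute.

I'm not sure whether I can use an HTML parser, because of the code blocks...

Example

I'm not looking for a perfect solution, just one that finds simple img tags, including tags that span multiple lines.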

```html
<img src="fil.png">
```

This one should not be found, because it is inside a code block.

- [ ] Here is another image `<img src="fil.png">` and another <img src="dog.png" title: "re
aaaaaaaaaaaaaaaallllyl long title">

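The first one should not be found (because it is surrounded by backticks), but the second one should be, even though it spans multiple lines.

Attempt

I have tried several approaches, using everything from bash and grep to Python. I can match img tags with the following regex: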

```
<img(\s*(?!alt)([\w\-])+=([\"\'])[^\"\']+\3)*\s*\/?>
```

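But I feel a cleaner approach might be this:

1. Filter out every code block
2. Find every img tag
3. Find every img tag without an alt attribute

I'm a bit stuck on the first step. I can find every code block with this regex: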

````
```[a-z]*\n[\s\S]*?\n```
````

But I'm not sure how to invert it, i.e. how to find all the text outside those matches. I'll accept any solution that runs in a bash script or in Python.
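The three steps listed above can be sketched directly in Python: delete the code first, then search whatever is left. This is only a rough sketch under a few assumptions. The names `CODE`, `IMG`, `ALT` and `imgs_without_alt` are made up for illustration, the fence pattern is loosened with `[ \t]*` so that an indented closing fence also matches, and inline code is assumed to sit on one line between single backticks.

```python
import re

# Step 1 pattern: fenced blocks (closing fence may be indented), then inline
# code spans. Both are assumptions layered on the fence regex quoted above.
CODE = re.compile(r"```[a-z]*\n[\s\S]*?\n[ \t]*```|`[^`\n]*`")
# Step 2 pattern: any img tag; [^>]* also matches newlines, so a tag may wrap.
IMG = re.compile(r"<img\b[^>]*>")
# Step 3 pattern: a crude check for an alt attribute inside one tag.
ALT = re.compile(r"\balt\s*=", re.IGNORECASE)


def imgs_without_alt(markdown):
    """Return the img tags found outside code that carry no alt attribute."""
    outside = CODE.sub("", markdown)       # step 1: drop all code
    tags = IMG.findall(outside)            # step 2: every remaining img tag
    return [tag for tag in tags if not ALT.search(tag)]  # step 3


sample = (
    "```html\n"
    "<img src=\"fil.png\">\n"
    "```\n\n"
    "Inline `<img src=\"fil.png\">` and a real <img src=\"dog.png\"\n"
    " title=\"long title\"> plus <img src=\"cat.png\" alt=\"a cat\">\n"
)
for tag in imgs_without_alt(sample):
    print(repr(tag))
```

Deleting the code spans sidesteps the "inversion" problem entirely, at the cost of losing the original character offsets of each tag.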

Tags: python, regex, python-3.x, bash, markdown

Solution

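You are absolutely right, this is a classic case for the regex trash-can approach: we let the overall match consume whatever we want to avoid, and use a capture group to get what we actually want, i.e. `What_I_want_to_avoid|(What_I_want_to_match)`: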

````
```.*?```|`.*?`|(<img(?!.*?alt=(['\"]).*?\2)[^>]*)(>)
````

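The idea is to completely disregard the overall matches returned by the regex engine: those are the trash can. Instead, we only check capture group 1, which, when it is set, contains an img tag.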


The pattern for matching img tags without an alt attribute was borrowed from here. The trash-can approach is described here and here.

Sample code:

```python
import re

# Trash-can pattern: the first two alternatives consume code we want to ignore;
# only the third alternative sets capture group 1.
regex = r"```.*?```|`.*?`|(<img(?!.*?alt=(['\"]).*?\2)[^>]*)(>)"
test_str = ("```html\n"
    "<img src=\"fil.png\">\n"
    "```\n\n"
    "- [ ] Here is another image <img src=\"fil.png\"> and another `<img src=\"fil.png\">`\n\n"
    "  ```html\n"
    "  <a href=\"scratch/index.html\" id=\"scratch\" data-original-title=\"\" title=\"\" aria-describedby=\"popover162945\">\n"
    "    <div class=\"logo-wrapper\">\n"
    "    </div>\n"
    "    <div class=\"name\">\n"
    "      <span>Scratch</span>\n"
    "    </div>\n"
    "    <img src=\"fil.png\">\n"
    "  </a>\n"
    "  ```")

# re.DOTALL lets .*? run across newlines, so multi-line blocks and tags match.
matches = re.finditer(regex, test_str, re.DOTALL)
for match in matches:
    # Group 1 is only set when an img tag outside of code was matched.
    if match.group(1):
        print(f"Found at {match.start(1)}-{match.end(1)}: {match.group(1)}")
```

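In fact, a single backtick-pair alternative in the overall match would be enough. The version above is arguably more readable, though, and shows the idea more clearly.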
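One caveat is worth checking before running this over real files: with `re.DOTALL`, the `.*?` inside the negative lookahead can scan past the end of the current tag, so an `alt=` anywhere later in the text may hide an earlier tag that has none. A possible adjustment, offered as a sketch rather than as the pattern from the answer, is to confine the lookahead to the tag with `[^>]*`. The names `PATTERN` and `find_imgs_without_alt` are hypothetical, and here the closing `>` is included in group 1.

```python
import re

# Same trash-can layout as above; only the lookahead differs. [^>]* cannot
# cross the closing '>', so the alt check stays inside the current tag.
PATTERN = re.compile(
    r"```.*?```"                           # fenced code block: discarded
    r"|`.*?`"                              # inline code span: discarded
    r"|(<img\b(?![^>]*\balt\s*=)[^>]*>)",  # img tag without alt: group 1
    re.DOTALL,
)


def find_imgs_without_alt(text):
    """Return (offset, tag) pairs for img tags outside code that lack alt."""
    return [(m.start(1), m.group(1))
            for m in PATTERN.finditer(text)
            if m.group(1)]


text = (
    "intro <img src=\"a.png\"> and `<img src=\"b.png\">`\n"
    "```html\n"
    "<img src=\"c.png\">\n"
    "```\n"
    "<img src=\"d.png\"\n alt=\"described\">\n"
    "<img\n src=\"e.png\">\n"
)
for offset, tag in find_imgs_without_alt(text):
    print(offset, repr(tag))
```

To cover a directory of Markdown files, the same function could be called on `pathlib.Path(name).read_text()` for each file, since the offsets it returns refer to the unmodified file contents.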
