How do I convert an HTML list to a list of text?

Problem description

Suppose you have the following list of HTML strings:

['Welcome: <br>Email: maxdenhil.com<br>Bedrijfsnaam: Dternational<br>KvK-nummer (8-cijfers): 88888888<br>Factuur uploaden: <br>https://yourubk.nl/wp-content/uploads/elementor/forms/60916b7e4f600.pdf<br><br><br>---<br><br>Date: May 4, 2021<br>Time: 3:42 pm<br>Page URL: https://yourubl.nl/Converter/<br>User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36<br>Remote IP: 62.194.173.74<br>Powered by: Elementor<br>\r\n\r\n', 'Welcome: <br>Email: maxdeil.com<br>Bedrijfsnaam: dd<br>KvK-nummer (8-cijfers): 9999999<br>Factuur uploaden: <br>https://yourubk.nl/wp-content/uploads/elementor/forms/60916d04e0d70.pdf<br><br><br>---<br><br>Date: May 4, 2021<br>Time: 3:49 pm<br>Page URL: https://yl.nl/Converter/<br>User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36<br>Remote IP: 62.194.173.74<br>Powered by: Elementor<br>\r\n\r\n']

I want to query this list so that the output becomes the following:

https://yourubk.nl/wp-content/uploads/elementor/forms/60916b7e4f600.pdf
https://yourubk.nl/wp-content/uploads/elementor/forms/60916d04e0d70.pdf

That way I can visit these URLs and download the file behind each link in a loop.

So I came up with the following regular expression and code:

import re
r = re.compile(r'(?<=uploaden:\s).+')
newlist = list(filter(r.match, mylist))
print(newlist)

However, this returns nothing (I assume because the list contains HTML):

[]

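When I change the regular expression to .*, every item matches. How is that possible?

So my question is: how do I create a new list of strings from the HTML code?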

Tags: python, html, regex, pandas, list

Solution


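(?<=prefix): lookbehind; the pattern matches only if it is immediately preceded by prefix

(?=suffix): lookahead; the pattern matches only if it is immediately followed by suffix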
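To make the two assertions concrete, here is a minimal sketch on a made-up string (the `text` value and the names `m` and `matched` are assumptions for illustration, not part of the question's data). Neither the prefix nor the suffix becomes part of the matched text.

```python
import re

# Made-up input, used only to illustrate the two assertions
text = "key: value;"

# (?<=key: ) requires "key: " right before the match,
# (?=;) requires ";" right after it; both are excluded from the result
m = re.search(r"(?<=key: )\w+(?=;)", text)
matched = m.group(0)
print(matched)
```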

import re

s = ['Welcome: <br>Email: maxdenhil.com<br>Bedrijfsnaam: Dternational<br>KvK-nummer (8-cijfers): 88888888<br>Factuur uploaden: <br>https://yourubk.nl/wp-content/uploads/elementor/forms/60916b7e4f600.pdf<br><br><br>---<br><br>Date: May 4, 2021<br>Time: 3:42 pm<br>Page URL: https://yourubl.nl/Converter/<br>User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36<br>Remote IP: 62.194.173.74<br>Powered by: Elementor<br>\r\n\r\n', 'Welcome: <br>Email: maxdeil.com<br>Bedrijfsnaam: dd<br>KvK-nummer (8-cijfers): 9999999<br>Factuur uploaden: <br>https://yourubk.nl/wp-content/uploads/elementor/forms/60916d04e0d70.pdf<br><br><br>---<br><br>Date: May 4, 2021<br>Time: 3:49 pm<br>Page URL: https://yl.nl/Converter/<br>User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36<br>Remote IP: 62.194.173.74<br>Powered by: Elementor<br>\r\n\r\n']


match = re.search(r'(?<=<br>Factuur uploaden: <br>)(.*)(?=<br><br><br>)', s[0])
print(match.group(1))
# https://yourubk.nl/wp-content/uploads/elementor/forms/60916b7e4f600.pdf
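As for why the original `filter(r.match, mylist)` attempt came back empty: `re.match` only tries to match at the very start of each string, and nothing can precede position 0, so the lookbehind fails. A pattern like `.*` can match at position 0, which is why every item passed. The sketch below uses `search` instead and collects one URL per item. The shortened `mylist` and the pattern `https?://[^<]+` are assumptions made for illustration.

```python
import re

# Shortened stand-in for the list in the question (assumed for illustration)
mylist = [
    'Welcome: <br>Factuur uploaden: <br>https://yourubk.nl/wp-content/uploads/elementor/forms/60916b7e4f600.pdf<br><br><br>---<br>',
    'Welcome: <br>Factuur uploaden: <br>https://yourubk.nl/wp-content/uploads/elementor/forms/60916d04e0d70.pdf<br><br><br>---<br>',
]

# Match a URL that directly follows "uploaden: <br>" and runs up to the next tag
r = re.compile(r'(?<=uploaden: <br>)https?://[^<]+')

# r.match is anchored at position 0, so no item passes the filter
anchored = list(filter(r.match, mylist))

# r.search scans the whole string; skip items without a match
urls = [m.group(0) for m in map(r.search, mylist) if m]

print(anchored)
print(urls)
```

The resulting `urls` list is flat, so it can be iterated directly when downloading the files.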

To do this for every item in the list, you can define the prefix and suffix of each field in a dictionary:

ldict = {'item1': ['prefix1', 'suffix1'], 'item2': ['prefix2', 'suffix2'], 'item3': ['prefix3', 'suffix3']}

An example (note that I added "?" to the regular expression, which makes the match non-greedy):
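A minimal sketch of such a loop, assuming a shortened stand-in for `s` and the hypothetical names `patterns`, `rows`, and `greedy`. With `(.*?)` the match stops at the first occurrence of the suffix; the last line shows what the `Email` pattern captures without the `?`.

```python
import re

# Shortened stand-in for the list in the question (assumed for illustration)
s = [
    'Welcome: <br>Email: maxdenhil.com<br>Factuur uploaden: <br>https://yourubk.nl/wp-content/uploads/elementor/forms/60916b7e4f600.pdf<br><br><br>---<br>',
]

# Each entry holds [prefix, suffix] for one field
patterns = {
    'item1': ['<br>Factuur uploaden: <br>', '<br><br><br>'],
    'item2': ['<br>Email: ', '<br>'],
}

rows = []
for e in s:
    row = []
    for prefix, suffix in patterns.values():
        # re.escape keeps any special characters in prefix/suffix literal
        regex = '(?<={0})(.*?)(?={1})'.format(re.escape(prefix), re.escape(suffix))
        m = re.search(regex, e)
        row.append(m.group(1) if m else None)
    rows.append(row)

# Greedy variant for comparison: runs on to the last "<br>" in the string
greedy = re.search(r'(?<=<br>Email: )(.*)(?=<br>)', s[0]).group(1)

print(rows)
print(greedy)
```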

Another, more Pythonic way:

import re

s = ['Welcome: <br>Email: maxdenhil.com<br>Bedrijfsnaam: Dternational<br>KvK-nummer (8-cijfers): 88888888<br>Factuur uploaden: <br>https://yourubk.nl/wp-content/uploads/elementor/forms/60916b7e4f600.pdf<br><br><br>---<br><br>Date: May 4, 2021<br>Time: 3:42 pm<br>Page URL: https://yourubl.nl/Converter/<br>User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36<br>Remote IP: 62.194.173.74<br>Powered by: Elementor<br>\r\n\r\n', 'Welcome: <br>Email: maxdeil.com<br>Bedrijfsnaam: dd<br>KvK-nummer (8-cijfers): 9999999<br>Factuur uploaden: <br>https://yourubk.nl/wp-content/uploads/elementor/forms/60916d04e0d70.pdf<br><br><br>---<br><br>Date: May 4, 2021<br>Time: 3:49 pm<br>Page URL: https://yl.nl/Converter/<br>User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36<br>Remote IP: 62.194.173.74<br>Powered by: Elementor<br>\r\n\r\n']

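# {0} is the prefix, {1} is the suffix; (.*?) is non-greedy
regex_expr = r'(?<={0})(.*?)(?={1})'

# Each entry holds [prefix, suffix] for one field
ldict = {'item1': ['<br>Factuur uploaden: <br>', '<br><br><br>'], 'item2': ['<br>Email: ', '<br>']}

def func(m):
    # Return the captured text, or None when the pattern did not match
    return m.group(1) if m else None

# One inner list per item in s, one value per entry in ldict
result = [
    [func(re.search(regex_expr.format(re.escape(prefix), re.escape(suffix)), e))
     for prefix, suffix in ldict.values()]
    for e in s
]
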
print(result)
# [['https://yourubk.nl/wp-content/uploads/elementor/forms/60916b7e4f600.pdf', 'maxdenhil.com'], 
# ['https://yourubk.nl/wp-content/uploads/elementor/forms/60916d04e0d70.pdf', 'maxdeil.com']]
