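python - How do I convert a list of HTML strings into a list of plain-text strings?
Problem description
Suppose you have the following list of HTML strings: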
['Welcome: <br>Email: maxdenhil.com<br>Bedrijfsnaam: Dternational<br>KvK-nummer (8-cijfers): 88888888<br>Factuur uploaden: <br>https://yourubk.nl/wp-content/uploads/elementor/forms/60916b7e4f600.pdf<br><br><br>---<br><br>Date: May 4, 2021<br>Time: 3:42 pm<br>Page URL: https://yourubl.nl/Converter/<br>User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36<br>Remote IP: 62.194.173.74<br>Powered by: Elementor<br>\r\n\r\n', 'Welcome: <br>Email: maxdeil.com<br>Bedrijfsnaam: dd<br>KvK-nummer (8-cijfers): 9999999<br>Factuur uploaden: <br>https://yourubk.nl/wp-content/uploads/elementor/forms/60916d04e0d70.pdf<br><br><br>---<br><br>Date: May 4, 2021<br>Time: 3:49 pm<br>Page URL: https://yl.nl/Converter/<br>User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36<br>Remote IP: 62.194.173.74<br>Powered by: Elementor<br>\r\n\r\n']
I want to query this list so that the output becomes the following:
https://yourubk.nl/wp-content/uploads/elementor/forms/60916b7e4f600.pdf
https://yourubk.nl/wp-content/uploads/elementor/forms/60916d04e0d70.pdf
That way I can visit these URLs and download the file behind each link in a loop.
So I came up with the following regular expression and code:
import re
r = re.compile(r'((?<=uploaden:\s).+)')
newlist = list(filter(r.match, mylist)) # Note 1
print(newlist)
However, this returns nothing (I think because the list contains HTML):
[]
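When I change the regular expression to .*, every item matches. How is that possible?
So my question is: how do I create a new list of strings from the HTML code?
Solution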
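As background on why the original filter came back empty: `re.match` only attempts a match at the very start of the string, where nothing precedes the text, so a lookbehind for `uploaden:\s` cannot succeed there. The pattern `.*` can match at position 0 of any string, which is why every item passed. The sketch below illustrates this with shortened, hypothetical stand-ins for the strings in the question (the `example.com` URLs are not from the original data).

```python
import re

# Shortened, hypothetical stand-ins for the list items in the question.
mylist = [
    'Welcome: <br>Factuur uploaden: <br>https://example.com/a.pdf<br><br><br>---',
    'Welcome: <br>Factuur uploaden: <br>https://example.com/b.pdf<br><br><br>---',
]

lookbehind = re.compile(r'(?<=uploaden:\s).+')
anything = re.compile(r'.*')

# match() only tries position 0; nothing precedes that position,
# so the lookbehind fails and no item is kept.
matched_at_start = list(filter(lookbehind.match, mylist))

# '.*' matches at position 0 of any string, so every item is kept.
matched_anything = list(filter(anything.match, mylist))

# search() scans the whole string and finds the text after "uploaden: ".
found = [lookbehind.search(item) is not None for item in mylist]

print(matched_at_start)
print(len(matched_anything))
print(found)
```

Note that `filter` keeps or drops whole list items; it never returns the matched substring, so extracting the URL needs `search` plus `group`.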
(?<=prefix) : lookbehind; the regex matches only if it is preceded by prefix
(?=suffix) : lookahead; the regex matches only if it is followed by suffix
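A small sketch of the two constructs on a toy string may help. The sample text `key=value;` is an assumption made for illustration, not part of the question's data.

```python
import re

text = 'key=value;'  # hypothetical sample string

# The lookbehind requires 'key=' before the match;
# the lookahead requires ';' right after it.
m = re.search(r'(?<=key=)\w+(?=;)', text)

value = m.group(0)  # only the text between prefix and suffix
span = m.span()     # prefix and suffix are not part of the match itself

print(value, span)
```

Because lookarounds consume no characters, the match contains only the text between the prefix and the suffix.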
import re
s = ['Welcome: <br>Email: maxdenhil.com<br>Bedrijfsnaam: Dternational<br>KvK-nummer (8-cijfers): 88888888<br>Factuur uploaden: <br>https://yourubk.nl/wp-content/uploads/elementor/forms/60916b7e4f600.pdf<br><br><br>---<br><br>Date: May 4, 2021<br>Time: 3:42 pm<br>Page URL: https://yourubl.nl/Converter/<br>User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36<br>Remote IP: 62.194.173.74<br>Powered by: Elementor<br>\r\n\r\n', 'Welcome: <br>Email: maxdeil.com<br>Bedrijfsnaam: dd<br>KvK-nummer (8-cijfers): 9999999<br>Factuur uploaden: <br>https://yourubk.nl/wp-content/uploads/elementor/forms/60916d04e0d70.pdf<br><br><br>---<br><br>Date: May 4, 2021<br>Time: 3:49 pm<br>Page URL: https://yl.nl/Converter/<br>User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36<br>Remote IP: 62.194.173.74<br>Powered by: Elementor<br>\r\n\r\n']
match = re.search(r'(?<=<br>Factuur uploaden: <br>)(.*)(?=<br><br><br>)', s[0])
print(match.group(1))
# https://yourubk.nl/wp-content/uploads/elementor/forms/60916b7e4f600.pdf
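To get the flat list of URLs the question asks for, the same search can be applied to every item. This is a minimal sketch, assuming shortened stand-in strings with hypothetical `example.com` URLs; it uses a lazy `(.*?)` that stops at the first `<br>`, and it skips items where the field is missing.

```python
import re

# Shortened, hypothetical stand-ins for the strings in the question.
s = [
    'Email: a.com<br>Factuur uploaden: <br>https://example.com/forms/1.pdf<br><br><br>---<br><br>Date: May 4, 2021<br>',
    'Email: b.com<br>Factuur uploaden: <br>https://example.com/forms/2.pdf<br><br><br>---<br><br>Date: May 4, 2021<br>',
    'Email: c.com<br>no upload in this item<br>',
]

# Compile once, then search every item; the lazy group ends at the first <br>.
pattern = re.compile(r'(?<=<br>Factuur uploaden: <br>)(.*?)(?=<br>)')

# search() returns None when the field is absent, so filter those out.
urls = [m.group(1) for m in map(pattern.search, s) if m is not None]

for url in urls:
    print(url)
```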
To do this for every item in the list, you can store the prefix and suffix of each field in a dictionary:
ldict = {'item1': ['prefix1', 'suffix1'], 'item2': ['prefix2', 'suffix2'], 'item3': ['prefix3', 'suffix3']}
An example (note that I added "?" to the regex):
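A minimal loop-based sketch of this idea follows. The helper name `extract_fields`, the labels `url` and `email`, and the shortened sample string are all assumptions made for illustration; each dictionary value is taken to be `[prefix, suffix]`.

```python
import re

# Shortened, hypothetical stand-in for one item of the list.
items = [
    'Welcome: <br>Email: a.com<br>Factuur uploaden: <br>https://example.com/1.pdf<br><br><br>---',
]

# Each entry maps a label to [prefix, suffix].
ldict = {
    'url': ['<br>Factuur uploaden: <br>', '<br><br><br>'],
    'email': ['<br>Email: ', '<br>'],
}

def extract_fields(text, fields):
    """Return {label: text between prefix and suffix}, or None when absent."""
    out = {}
    for label, (prefix, suffix) in fields.items():
        # re.escape keeps any regex metacharacters in the delimiters literal.
        pattern = r'(?<={0})(.*?)(?={1})'.format(re.escape(prefix), re.escape(suffix))
        m = re.search(pattern, text)
        out[label] = m.group(1) if m else None
    return out

results = [extract_fields(item, ldict) for item in items]
print(results)
```

Returning `None` for a missing field avoids the `AttributeError` that calling `.group(1)` on a failed search would raise.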
Another, more Pythonic way:
import re
s = ['Welcome: <br>Email: maxdenhil.com<br>Bedrijfsnaam: Dternational<br>KvK-nummer (8-cijfers): 88888888<br>Factuur uploaden: <br>https://yourubk.nl/wp-content/uploads/elementor/forms/60916b7e4f600.pdf<br><br><br>---<br><br>Date: May 4, 2021<br>Time: 3:42 pm<br>Page URL: https://yourubl.nl/Converter/<br>User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36<br>Remote IP: 62.194.173.74<br>Powered by: Elementor<br>\r\n\r\n', 'Welcome: <br>Email: maxdeil.com<br>Bedrijfsnaam: dd<br>KvK-nummer (8-cijfers): 9999999<br>Factuur uploaden: <br>https://yourubk.nl/wp-content/uploads/elementor/forms/60916d04e0d70.pdf<br><br><br>---<br><br>Date: May 4, 2021<br>Time: 3:49 pm<br>Page URL: https://yl.nl/Converter/<br>User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36<br>Remote IP: 62.194.173.74<br>Powered by: Elementor<br>\r\n\r\n']
regex_expr = r'(?<={0})(.*?)(?={1})'
ldict = {'item1': ['<br>Factuur uploaden: <br>', '<br><br><br>'], 'item2': ['<br>Email: ', '<br>']}
def func(m):
    # Return the captured text between the prefix and the suffix.
    return m.group(1)
result = [list(map(func, [re.search(regex_expr.format(v[0], v[1]), e) for v in ldict.values()])) for e in s]
print(result)
# [['https://yourubk.nl/wp-content/uploads/elementor/forms/60916b7e4f600.pdf', 'maxdenhil.com'],
# ['https://yourubk.nl/wp-content/uploads/elementor/forms/60916d04e0d70.pdf', 'maxdeil.com']]
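The "?" matters for fields such as Email, whose suffix `<br>` occurs many times in each string. The sketch below compares the greedy and lazy forms on a shortened, hypothetical sample string.

```python
import re

# Shortened, hypothetical stand-in for one item of the list.
text = 'Welcome: <br>Email: a.com<br>Bedrijfsnaam: dd<br>'

# Greedy: '.*' grabs as much as possible, so it runs up to the LAST <br>.
greedy = re.search(r'(?<=<br>Email: )(.*)(?=<br>)', text).group(1)

# Lazy: '.*?' grabs as little as possible, so it stops at the FIRST <br>.
lazy = re.search(r'(?<=<br>Email: )(.*?)(?=<br>)', text).group(1)

print(greedy)
print(lazy)
```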