python - Extracting two sentences from a paragraph tag using Python
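Problem description
In each paragraph tag, I have already extracted the text in my local language into a list. How can I extract the meaning and the translation into separate lists?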
from bs4 import BeautifulSoup
import re
html = """
[<div class="excerpt">
<p>A ki i fi ara eni se oogun alokunna. Translation: One does not use oneself as an ingredient in a medicine requiring that the ingredients be pulverized. Meaning; Self-preservation is a compulsory project for all.</p>
</div>, <div class="excerpt">
<p>A ki i fi ai-mo-we mookun. Translation: One does not dive under water without knowing how to swim. Meaning: Never engage in a project for which you lack the requisite skills.</p>
</div>, <div class="excerpt">
<p>A fun o lobe o tami si; o gbon ju olobe lo. Translation: You are given some stew and you add water; you must be wiser than the cook. Meaning: Adding water is a means of stretching stew. A person who thus stretches the stew he or she is given would seem to know better than the person who served it how much would suffice for the meal.</p>
</div>]
"""
soup = BeautifulSoup(html, 'html.parser')
yoruba = []
translation = []
meaning = []
for i in soup.find_all("div", 'excerpt'):
    # Everything before the "Translation" label is the Yoruba proverb.
    a = i.get_text(strip=True).split('Translation')[0].strip().replace('\xa0', ' ')
    yoruba.append(a)
Solution
You can do this with a regular expression and some string manipulation.
Try this code.
from bs4 import BeautifulSoup
import re

html = """
[<div class="excerpt">
<p>A ki i fi ara eni se oogun alokunna. Translation: One does not use oneself as an ingredient in a medicine requiring that the ingredients be pulverized. Meaning; Self-preservation is a compulsory project for all.</p>
</div>, <div class="excerpt">
<p>A ki i fi ai-mo-we mookun. Translation: One does not dive under water without knowing how to swim. Meaning: Never engage in a project for which you lack the requisite skills.</p>
</div>, <div class="excerpt">
<p>A fun o lobe o tami si; o gbon ju olobe lo. Translation: You are given some stew and you add water; you must be wiser than the cook. Meaning: Adding water is a means of stretching stew. A person who thus stretches the stew he or she is given would seem to know better than the person who served it how much would suffice for the meal.</p>
</div>]
"""
soup = BeautifulSoup(html, 'html.parser')
yoruba = []
translation = []
meaning = []
for i in soup.find_all("div", 'excerpt'):
    for item in i.find_all('p'):
        # Drop the "Translation:" label so the translation becomes the second sentence.
        data = re.sub(r'Translation:\s*', '', item.get_text(strip=True))
        translation.append(data.split('.')[1].strip())
        # Drop the "Meaning" label; the ':' or ';' separator stays in the text.
        data1 = re.sub(r'Meaning?\s*', '', data)
        if ':' in data1:
            meaning.append(data1.split(':')[-1].strip())
        if (';' in data1) and (':' not in data1):
            meaning.append(data1.split(';')[-1].strip())
print(translation)
print(meaning)
Output:
Translation
['One does not use oneself as an ingredient in a medicine requiring that the ingredients be pulverized', 'One does not dive under water without knowing how to swim', 'You are given some stew and you add water; you must be wiser than the cook']
Meaning
['Self-preservation is a compulsory project for all.', 'Never engage in a project for which you lack the requisite skills.', 'Adding water is a means of stretching stew. A person who thus stretches the stew he or she is given would seem to know better than the person who served it how much would suffice for the meal.']
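The splitting on '.' and ':' above relies on the proverb being exactly one sentence and on no other colon appearing after the meaning label. As an alternative sketch (not the answer's method), a single regular expression with named groups can capture all three parts at once. The function name split_excerpt and the pattern are assumptions made for illustration; the pattern assumes each paragraph contains the labels "Translation" and "Meaning", each followed by ':' or ';'. The strings in texts stand in for what item.get_text(strip=True) returns in the loop above.

```python
import re

# Hypothetical pattern: proverb text, then the "Translation" and "Meaning"
# labels, each followed by ':' or ';' (the sample data uses both separators).
PATTERN = re.compile(
    r"^(?P<yoruba>.*?)\s*Translation\s*[:;]\s*"
    r"(?P<translation>.*?)\s*Meaning\s*[:;]\s*"
    r"(?P<meaning>.*)$",
    re.DOTALL,
)


def split_excerpt(text):
    """Return (yoruba, translation, meaning), or None if the labels are missing."""
    match = PATTERN.match(text.replace("\xa0", " ").strip())
    if match is None:
        return None
    groups = match.group("yoruba", "translation", "meaning")
    return tuple(part.strip() for part in groups)


# Paragraph texts taken from the sample HTML in the question.
texts = [
    "A ki i fi ara eni se oogun alokunna. Translation: One does not use "
    "oneself as an ingredient in a medicine requiring that the ingredients "
    "be pulverized. Meaning; Self-preservation is a compulsory project for all.",
    "A ki i fi ai-mo-we mookun. Translation: One does not dive under water "
    "without knowing how to swim. Meaning: Never engage in a project for "
    "which you lack the requisite skills.",
]

yoruba, translation, meaning = [], [], []
for text in texts:
    parts = split_excerpt(text)
    if parts is None:
        continue  # skip paragraphs that do not follow the expected layout
    y, t, m = parts
    yoruba.append(y)
    translation.append(t)
    meaning.append(m)

print(yoruba)
print(translation)
print(meaning)
```

Unlike the output shown above, this version keeps the trailing period on each translation, because it does not split on '.'.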