Summarizing a Wikipedia page with beautifulsoup4 and PyTorch

Problem description

I want to use beautifulsoup4 to extract the text from a Wikipedia page, and then summarize that text with PyTorch.

Here is the code:

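# In a notebook, install the dependency first: !pip install transformers
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# AutoModelWithLMHead is deprecated; AutoModelForSeq2SeqLM is the matching class for T5
tokenizer = AutoTokenizer.from_pretrained('t5-base')
model = AutoModelForSeq2SeqLM.from_pretrained('t5-base', return_dict=True)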

Extract the text from the URL:

# In a notebook, install the dependency first: !pip install beautifulsoup4
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Winston_Churchill"
html = urlopen(url).read()
soup = BeautifulSoup(html, features="html.parser")

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
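To see what the three clean-up steps above do, here is a minimal sketch that wraps them in a helper and runs them on a small string instead of a live page. The name `clean_text` and the sample string are illustrative assumptions, not part of the original post.

```python
# Hypothetical helper: the same strip / split / drop-blank steps as above,
# applied to a small in-memory string so the effect is easy to inspect.
DOUBLE_SPACE = " " * 2  # the separator used to split "multi-headlines"


def clean_text(raw: str) -> str:
    # Strip leading and trailing whitespace from every line.
    lines = (line.strip() for line in raw.splitlines())
    # Split lines that contain several phrases separated by two spaces.
    chunks = (phrase.strip() for line in lines for phrase in line.split(DOUBLE_SPACE))
    # Drop empty chunks and rejoin, one phrase per line.
    return "\n".join(chunk for chunk in chunks if chunk)


# Made-up sample that mimics the ragged output of soup.get_text().
sample = (
    "  Winston Churchill  \n"
    "\n"
    "\n"
    "Prime Minister" + DOUBLE_SPACE + "1940-1945\n"
    "   \n"
    "Statesman"
)

cleaned = clean_text(sample)
print(cleaned.split("\n"))
# ['Winston Churchill', 'Prime Minister', '1940-1945', 'Statesman']
```

Note that this only normalizes whitespace. Navigation links, table cells, and reference entries returned by `soup.get_text()` remain in the text and are passed to the model along with the article body.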

Then summarize the text:

input_ids = tokenizer.encode("summarize: " + text, return_tensors='pt', max_length=512, truncation=True)
outputs = model.generate(input_ids, max_length=3000, min_length=200, length_penalty=5., num_beams=2)
summary = tokenizer.decode(outputs[0])
print(summary)

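Any suggestions on how to fix the root cause of the text showing up as <extra_id_1> followed by dots? I suspect the problem is related to how the Wikipedia text is formatted, but I don't know how to fix it.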

The result is as follows:

Winston Churchill (1874–1965) was Prime Minister of the United Kingdom from 1951–1952. He was succeeded by Neville Chamberlain and Anthony Eden. He was also Minister of Defence and Chancellor of the Exchequer. He was succeeded by the Earl of Tunis and the Earl of London. He was also Prime Minister of the United Kingdom from 1939 to 1929. He was also a politician and <extra_id_1> <extra_id_1> <extra_id_1> <extra_id_1> <extra_id_1> <extra_id_1> <extra_id_1> <extra_id_1> <extra_id_1>....... ..................................................... ..................................................... ..................................................... ..................................................... ..................................................... .....................................................
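The trailing `<extra_id_1>` tokens and dot runs can be stripped after decoding. The sketch below is a stopgap that hides the symptom rather than fixing the root cause. The function name `strip_generation_artifacts` and the sample strings are illustrative assumptions. Passing `skip_special_tokens=True` to `tokenizer.decode` should also drop the sentinel tokens, though not the dots.

```python
import re

# T5 sentinel tokens (<extra_id_0>, <extra_id_1>, ...) plus <pad>, <s>, </s>.
SPECIAL_TOKEN = re.compile(r"<extra_id_\d+>|</?s>|<pad>")
# Two or more consecutive dots. This also removes a legitimate "..." ellipsis.
DOT_RUN = re.compile(r"\.{2,}")


def strip_generation_artifacts(summary: str) -> str:
    """Remove special tokens and dot runs, then normalize whitespace."""
    cleaned = SPECIAL_TOKEN.sub(" ", summary)
    cleaned = DOT_RUN.sub(" ", cleaned)
    return " ".join(cleaned.split())


# Made-up string shaped like the tail of the output above.
raw = "<pad> He was also a politician and <extra_id_1> <extra_id_1>....... ......</s>"
print(strip_generation_artifacts(raw))
# He was also a politician and
```

A single sentence-ending period is kept; only runs of two or more dots are removed.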

Tags: python, machine-learning, beautifulsoup, pytorch

Solution
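No answer text is present here, so what follows is a sketch under stated assumptions, not a confirmed diagnosis. Two settings in the question's code look like plausible contributors. First, `max_length=512, truncation=True` in `tokenizer.encode` means only the first 512 tokens of the page reach the model. Second, `min_length=200` in `model.generate` keeps the model from stopping before 200 output tokens, even if it has nothing more to say, and that may be where the sentinel tokens and dots come from. One possible approach is to split the text into smaller pieces, summarize each piece with a lower `min_length`, and join the results. In the sketch below, `split_into_chunks`, `summarize_long_text`, and the `summarize_chunk` callback are hypothetical names. Word count is used as a rough stand-in for token count, and a toy function stands in for the model call so the example runs without downloading anything.

```python
from typing import Callable, List


def split_into_chunks(text: str, max_words: int = 350) -> List[str]:
    """Split text into pieces of at most max_words whitespace-separated words.

    Assumption: 350 words is a rough guess at what fits in 512 T5 tokens;
    the real limit depends on the tokenizer.
    """
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarize_long_text(
    text: str,
    summarize_chunk: Callable[[str], str],
    max_words: int = 350,
) -> str:
    """Summarize each chunk with the supplied callback and join the results."""
    chunks = split_into_chunks(text, max_words)
    return " ".join(summarize_chunk(chunk) for chunk in chunks)


# Stand-in for the model: keeps the first three words of each chunk.
# In practice this callback would wrap tokenizer.encode, model.generate,
# and tokenizer.decode(..., skip_special_tokens=True).
def first_three_words(chunk: str) -> str:
    return " ".join(chunk.split()[:3])


demo_text = " ".join(f"w{i}" for i in range(10))
print(split_into_chunks(demo_text, max_words=4))
# ['w0 w1 w2 w3', 'w4 w5 w6 w7', 'w8 w9']
print(summarize_long_text(demo_text, first_three_words, max_words=4))
# w0 w1 w2 w4 w5 w6 w8 w9
```

Passing the summarizer in as a callback keeps the chunking logic independent of any particular model, so generation parameters can be tuned separately.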

