python - BeautifulSoup: how to recombine split words
Question
When I run this code, some of the words in the output come out split apart. For example, the word "tolerances" is broken into "toleran ces". I looked at the HTML source, and it seems that is just how the page was created.
Words are split this way in many other places too. How can I recombine them before writing the text out?
import requests, codecs
from bs4 import BeautifulSoup
from bs4.element import Comment

path = 'C:\\Users\\jason\\Google Drive\\python\\'

def tag_visible(element):
    if element.parent.name in ['sup']:
        return False
    if isinstance(element, Comment):
        return False
    return True

ticker = 'TSLA'
quarter = '18Q2'
mark1 = 'ITEM 1A'
mark2 = 'UNREGISTERED SALES'
url_new = 'https://www.sec.gov/Archives/edgar/data/1318605/000156459018019254/tsla-10q_20180630.htm'

def get_text(url, mark1, mark2):
    html = requests.get(url)
    soup = BeautifulSoup(html.text, 'html.parser')
    for hr in soup.select('hr'):
        hr.find_previous('p').extract()
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)
    text = u" ".join(t.strip() for t in visible_texts)
    return text[text.find(mark1): text.find(mark2)]

text = get_text(url_new, mark1, mark2)
file = codecs.open(path + "test.txt", 'w', encoding='utf8')
file.write(text)
file.close()
Solution
You are dealing with HTML generated by Microsoft Word. Don't extract the text and try to process it without that context.
The section you want to process is clearly delineated with <a name="..."> tags. Let's start by selecting everything from the element containing the <a name="ITEM_1A_RISK_FACTORS"> marker, up to but not including the <a name="ITEM2_UNREGISTERED_SALES"> marker:
def sec_section(soup, item_name):
    """iterate over SEC document paragraphs for the section named item_name

    Item name must be a link target, starting with ITEM
    """
    # ask BS4 to find the section
    elem = soup.select_one('a[name={}]'.format(item_name))
    # scan up to the parent text element
    # html.parser does not support <text> but lxml does
    while elem.parent is not soup and elem.parent.name != 'text':
        elem = elem.parent
    yield elem
    # now we can yield all next siblings until we find one that
    # also contains an a[name^=ITEM] element:
    for elem in elem.next_siblings:
        if not isinstance(elem, str) and elem.select_one('a[name^=ITEM]'):
            return
        yield elem
This function gives us all the child nodes of the <text> node in the HTML document, starting with the paragraph that contains the given link target and continuing up to the next section whose link-target name starts with ITEM.
Next comes the usual Word clean-up task of removing <font> tags and style attributes:
def clean_word(elem):
    if isinstance(elem, str):
        return elem
    # remove last-rendered break markers, non-rendering but messy
    for lastbrk in elem.select('a[name^=_AEIOULastRenderedPageBreakAEIOU]'):
        lastbrk.decompose()
    # remove font tags and styling from the document, leaving only the contents
    if 'style' in elem.attrs:
        del elem.attrs['style']
    for e in elem:  # recursively do the same for all child nodes
        clean_word(e)
    if elem.name == 'font':
        elem = elem.unwrap()
    return elem
The Tag.unwrap() method is what helps most in your case, because the text is divided up almost arbitrarily by <font> tags.
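To see why unwrap() fixes the split words, here is a minimal sketch, using made-up markup that mimics the Word output, contrasting the question's space-joining approach with unwrapping the <font> tags first:

```python
from bs4 import BeautifulSoup

# Made-up markup mimicking how Word splits a single word across <font> tags.
html = '<p><font>toleran</font><font>ces</font></p>'
soup = BeautifulSoup(html, 'html.parser')

# Joining every text node with spaces, as in the question, breaks the word:
broken = " ".join(t.strip() for t in soup.find_all(string=True))
print(broken)  # toleran ces

# unwrap() removes each <font> tag but leaves its contents in place, so the
# text fragments become adjacent again and get_text() joins them seamlessly:
for font in soup.select('font'):
    font.unwrap()
fixed = soup.p.get_text()
print(fixed)  # tolerances
```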
Now, extracting the text cleanly suddenly becomes trivial:
for elem in sec_section(soup, 'ITEM_1A_RISK_FACTORS'):
    clean_word(elem)
    if not isinstance(elem, str):
        elem = elem.get_text(strip=True)
    print(elem)
Among the rest of the text, this outputs:
•that the equipment and processes which we have selected for Model 3 production will be able to accurately manufacture high volumes of Model 3 vehicles within specified design tolerances and with high quality;
The text is now correctly joined, and no further recombining is required.
The section as a whole is still contained in a table, but clean_word() has now cleaned it up to the much more reasonable:
<div align="left">
<table border="0" cellpadding="0" cellspacing="0">
<tr>
<td valign="top">
<p> </p></td>
<td valign="top">
<p>•</p></td>
<td valign="top">
<p>that the equipment and processes which we have selected for Model 3 production will be able to accurately manufacture high volumes of Model 3 vehicles within specified design tolerances and with high quality;</p></td></tr></table></div>
So you could apply smarter text-extraction techniques here to further ensure a clean conversion; for example, you could convert such bullet tables into a * prefix:
def convert_word_bullets(soup, text_bullet="*"):
    for table in soup.select('div[align=left] > table'):
        div = table.parent
        bullet = div.find(string='\u2022')
        if bullet is None:
            # not a bullet table, skip
            continue
        text_cell = bullet.find_next('td')
        div.clear()
        div.append(text_bullet + ' ')
        for i, elem in enumerate(text_cell.contents[:]):
            if i == 0 and elem == '\n':
                continue  # no need to include the first linebreak
            div.append(elem.extract())
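To check the effect, the function can be exercised on a stripped-down version of the bullet table shown above (the function body is repeated so the sketch runs on its own, and the cell content is abbreviated):

```python
from bs4 import BeautifulSoup

def convert_word_bullets(soup, text_bullet="*"):
    # same function as above: rewrite Word bullet tables as "* "-prefixed text
    for table in soup.select('div[align=left] > table'):
        div = table.parent
        bullet = div.find(string='\u2022')
        if bullet is None:
            continue  # not a bullet table, skip
        text_cell = bullet.find_next('td')
        div.clear()
        div.append(text_bullet + ' ')
        for i, elem in enumerate(text_cell.contents[:]):
            if i == 0 and elem == '\n':
                continue  # no need to include the first linebreak
            div.append(elem.extract())

# Abbreviated version of the bullet table from the cleaned-up document:
html = ('<div align="left"><table border="0"><tr>'
        '<td valign="top"><p> </p></td>'
        '<td valign="top"><p>\u2022</p></td>'
        '<td valign="top"><p>specified design tolerances</p></td>'
        '</tr></table></div>')
soup = BeautifulSoup(html, 'html.parser')
convert_word_bullets(soup)
result = soup.div.get_text()
print(result)  # * specified design tolerances
```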
You probably also want to skip over the page breaks (combinations of <p>[page number]</p> and <hr/> elements). If you run:
for pagebrk in soup.select('p ~ hr[style^=page-break-after]'):
    pagebrk.find_previous_sibling('p').decompose()
    pagebrk.decompose()
those markers are removed. This is more explicit than your own version, where you deleted every <hr/> element together with the preceding <p>, regardless of whether they were actually siblings.
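A quick sketch of the difference, using a made-up fragment with one real Word page break next to an ordinary <hr/> that must survive:

```python
from bs4 import BeautifulSoup

# Made-up fragment: a Word page break (page-number <p> directly followed by
# a styled <hr/>) plus an unrelated plain <hr/> that should be kept.
html = ('<p>17</p><hr style="page-break-after:always"/>'
        '<p>Keep this paragraph.</p><hr/>')
soup = BeautifulSoup(html, 'html.parser')

# Only the <hr/> carrying the page-break style is matched; the plain <hr/>
# and its preceding paragraph are left alone.
for pagebrk in soup.select('p ~ hr[style^=page-break-after]'):
    pagebrk.find_previous_sibling('p').decompose()
    pagebrk.decompose()

remaining = soup.get_text(strip=True)
print(remaining)  # Keep this paragraph.
```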
Perform both of these operations before cleaning up the Word HTML. Combined with your function, this all becomes:
import os

import requests
from bs4 import BeautifulSoup

def get_text(url, item_name):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # remove Word page breaks
    for pagebrk in soup.select('p ~ hr[style^=page-break-after]'):
        pagebrk.find_previous_sibling('p').decompose()
        pagebrk.decompose()
    convert_word_bullets(soup)
    cleaned_section = map(clean_word, sec_section(soup, item_name))
    return ''.join([
        elem.get_text(strip=True) if elem.name else elem
        for elem in cleaned_section])

text = get_text(url_new, 'ITEM_1A_RISK_FACTORS')
with open(os.path.join(path, 'test.txt'), 'w', encoding='utf8') as f:
    f.write(text)