python - Python BeautifulSoup - How to extract text from HTML/XML tags in "linear" order
Problem description
I have a block of markup like the one below and need to extract the text from it (this is mock data):
<text>
<table>
<tbody>
<tr><td> </td><td><content styleCode="Bold">General Adult Exam</content></td></tr>
<tr><td><content styleCode="Bold">Constitutional:</content></td><td>General Appearance: healthy-appearing, well-nourished, well-developed</td></tr>
<tr><td><content styleCode="Bold">Lungs:</content></td><td>Respiratory effort: no dyspnea. Auscultation: breath sounds normal, good air movement, CTA except as noted, no wheezing, no rales/crackles, no rhonchi</td></tr>
<tr><td><content styleCode="Bold">Cardiovascular:</content></td><td>Heart Auscultation: RRR, normal S1, normal S2, no murmurs, no rubs, no gallops. Neck vessels: no carotid bruits</td></tr>
<tr><td><content styleCode="Bold">Musculoskeletal::</content></td><td>Joints, Bones, and Muscles: ; She has decreased range of motion especially to abduction with some pain on internal rotation</td></tr>
</tbody>
</table>
</text>
Read in as a string (note that my actual string contains no line breaks):
medsoup = '<text> <table> <tbody><tr><td> </td><td><content styleCode="Bold">General Adult Exam</content></td></tr><tr><td><content styleCode="Bold">Constitutional:</content></td><td>General Appearance: healthy-appearing, well-nourished, well-developed</td></tr><tr><td><content styleCode="Bold">Lungs:</content></td><td>Respiratory effort: no dyspnea. Auscultation: breath sounds normal, good air movement, CTA except as noted, no wheezing, no rales/crackles, no rhonchi</td></tr><tr><td><content styleCode="Bold">Cardiovascular:</content></td><td>Heart Auscultation: RRR, normal S1, normal S2, no murmurs, no rubs, no gallops. Neck vessels: no carotid bruits</td></tr><tr><td><content styleCode="Bold">Musculoskeletal::</content></td><td>Joints, Bones, and Muscles: ; She has decreased range of motion especially to abduction with some pain on internal rotation</td></tr></tbody> </table> </text>'
medsoup
Out[358]: '<text> <table> <tbody><tr><td> </td><td><content styleCode="Bold">General Adult Exam</content></td></tr><tr><td><content styleCode="Bold">Constitutional:</content></td><td>General Appearance: healthy-appearing, well-nourished, well-developed</td></tr><tr><td><content styleCode="Bold">Lungs:</content></td><td>Respiratory effort: no dyspnea. Auscultation: breath sounds normal, good air movement, CTA except as noted, no wheezing, no rales/crackles, no rhonchi</td></tr><tr><td><content styleCode="Bold">Cardiovascular:</content></td><td>Heart Auscultation: RRR, normal S1, normal S2, no murmurs, no rubs, no gallops. Neck vessels: no carotid bruits</td></tr><tr><td><content styleCode="Bold">Musculoskeletal::</content></td><td>Joints, Bones, and Muscles: ; She has decreased range of motion especially to abduction with some pain on internal rotation</td></tr></tbody> </table> </text>'
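Question: How can I extract the text from each tag in "linear" order, the way a human would read it, left to right, with a space between each piece of text?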
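What I've tried
BeautifulSoup's get_text() comes very close, except that I need a newline (or at least a space) between the individual text entries. Notice how everything runs together as General Adult ExamConstitutional:General Appearance: when what I need is General Adult Exam Constitutional: General Appearance: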
from bs4 import BeautifulSoup
parsed_soup = BeautifulSoup(medsoup, 'lxml')
parsed_soup.get_text().strip()
Out[340]: 'General Adult ExamConstitutional:General Appearance: healthy-appearing, well-nourished, well-developedLungs:Respiratory effort: no dyspnea. Auscultation: breath sounds normal, good air movement, CTA except as noted, no wheezing, no rales/crackles, no rhonchiCardiovascular:Heart Auscultation: RRR, normal S1, normal S2, no murmurs, no rubs, no gallops. Neck vessels: no carotid bruitsMusculoskeletal::Joints, Bones, and Muscles: ; She has decreased range of motion especially to abduction with some pain on internal rotation'
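If I instead iterate over the elements of the soup, hoping to append a space after each piece of text, I get an odd result: there seems to be only a single element to iterate over.
for i, ele in enumerate(parsed_soup):
    print(i, ele, '\n')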
0 <html><body><text> <table> <tbody><tr><td> </td><td><content stylecode="Bold">General Adult Exam</content></td></tr><tr><td><content stylecode="Bold">Constitutional:</content></td><td>General Appearance: healthy-appearing, well-nourished, well-developed</td></tr><tr><td><content stylecode="Bold">Lungs:</content></td><td>Respiratory effort: no dyspnea. Auscultation: breath sounds normal, good air movement, CTA except as noted, no wheezing, no rales/crackles, no rhonchi</td></tr><tr><td><content stylecode="Bold">Cardiovascular:</content></td><td>Heart Auscultation: RRR, normal S1, normal S2, no murmurs, no rubs, no gallops. Neck vessels: no carotid bruits</td></tr><tr><td><content stylecode="Bold">Musculoskeletal::</content></td><td>Joints, Bones, and Muscles: ; She has decreased range of motion especially to abduction with some pain on internal rotation</td></tr></tbody> </table> </text></body></html>
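The single-element result is expected: iterating over a soup yields only its direct children, and with the lxml parser the only direct child is the <html> wrapper. A minimal sketch of walking every nested node in document order with .descendants follows. The shortened sample string and the use of the built-in 'html.parser' (chosen only to avoid the lxml dependency) are assumptions for illustration, not part of the original session.

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened, hypothetical stand-in for the medsoup string (one row only).
sample = (
    '<text><table><tbody><tr>'
    '<td><content styleCode="Bold">Lungs:</content></td>'
    '<td>Respiratory effort: no dyspnea.</td>'
    '</tr></tbody></table></text>'
)

soup = BeautifulSoup(sample, 'html.parser')

# Iterating the soup directly only visits its top-level children.
top_level = list(soup)
print(len(top_level))  # 1 (just the <text> element)

# .descendants visits every nested node in document order;
# keep only the non-empty text nodes.
texts = [
    str(node).strip()
    for node in soup.descendants
    if isinstance(node, NavigableString) and node.strip()
]
print(texts)  # ['Lungs:', 'Respiratory effort: no dyspnea.']
```

Because .descendants follows document order, the text comes out in the same left-to-right order a reader would see in the table.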
I also tried using next_siblings and next_element to iterate over the tags, but I couldn't get either to work.
Solution
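One possible approach, offered as a sketch rather than the accepted answer (which the page does not include): get_text() accepts a separator and a strip flag, and stripped_strings yields each non-empty text node in document order. The shortened sample string and the 'html.parser' choice below are assumptions for illustration; the same calls should work on parsed_soup from the question.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the medsoup string (two rows only).
sample = (
    '<text> <table> <tbody>'
    '<tr><td> </td><td><content styleCode="Bold">General Adult Exam</content></td></tr>'
    '<tr><td><content styleCode="Bold">Constitutional:</content></td>'
    '<td>General Appearance: healthy-appearing</td></tr>'
    '</tbody> </table> </text>'
)

soup = BeautifulSoup(sample, 'html.parser')

# Join every text node with a single space, dropping whitespace-only nodes.
flat = soup.get_text(separator=' ', strip=True)
print(flat)
# General Adult Exam Constitutional: General Appearance: healthy-appearing

# Same text nodes, one per line.
lines = '\n'.join(soup.stripped_strings)
print(lines)

# Alternatively, keep each table row together as one entry.
rows = [tr.get_text(separator=' ', strip=True) for tr in soup.find_all('tr')]
print(rows)
```

The row-wise variant may be the more useful one here, since it keeps each bold label next to its description.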