Python BeautifulSoup - How to extract text from HTML/XML tags in "linear" order

Problem description

I have a block of text like this that I need to extract the text from (this is mock data):

<text>
            <table>
              <tbody>
<tr><td>&#xA0;</td><td><content styleCode="Bold">General Adult Exam</content></td></tr>
<tr><td><content styleCode="Bold">Constitutional:</content></td><td>General Appearance: healthy-appearing, well-nourished, well-developed</td></tr>
<tr><td><content styleCode="Bold">Lungs:</content></td><td>Respiratory effort: no dyspnea. Auscultation: breath sounds normal, good air movement, CTA except as noted, no wheezing, no rales/crackles, no rhonchi</td></tr>
<tr><td><content styleCode="Bold">Cardiovascular:</content></td><td>Heart Auscultation: RRR, normal S1, normal S2, no murmurs, no rubs, no gallops. Neck vessels: no carotid bruits</td></tr>
<tr><td><content styleCode="Bold">Musculoskeletal::</content></td><td>Joints, Bones, and Muscles: ; She has decreased range of motion especially to abduction with some pain on internal rotation</td></tr>
</tbody>
            </table>
          </text>

Read in as a string (note that my actual string has no line breaks):

medsoup = '<text>            <table>              <tbody><tr><td>&#xA0;</td><td><content styleCode="Bold">General Adult Exam</content></td></tr><tr><td><content styleCode="Bold">Constitutional:</content></td><td>General Appearance: healthy-appearing, well-nourished, well-developed</td></tr><tr><td><content styleCode="Bold">Lungs:</content></td><td>Respiratory effort: no dyspnea. Auscultation: breath sounds normal, good air movement, CTA except as noted, no wheezing, no rales/crackles, no rhonchi</td></tr><tr><td><content styleCode="Bold">Cardiovascular:</content></td><td>Heart Auscultation: RRR, normal S1, normal S2, no murmurs, no rubs, no gallops. Neck vessels: no carotid bruits</td></tr><tr><td><content styleCode="Bold">Musculoskeletal::</content></td><td>Joints, Bones, and Muscles: ; She has decreased range of motion especially to abduction with some pain on internal rotation</td></tr></tbody>            </table>          </text>'
medsoup  
Out[358]: '<text>            <table>              <tbody><tr><td>&#xA0;</td><td><content styleCode="Bold">General Adult Exam</content></td></tr><tr><td><content styleCode="Bold">Constitutional:</content></td><td>General Appearance: healthy-appearing, well-nourished, well-developed</td></tr><tr><td><content styleCode="Bold">Lungs:</content></td><td>Respiratory effort: no dyspnea. Auscultation: breath sounds normal, good air movement, CTA except as noted, no wheezing, no rales/crackles, no rhonchi</td></tr><tr><td><content styleCode="Bold">Cardiovascular:</content></td><td>Heart Auscultation: RRR, normal S1, normal S2, no murmurs, no rubs, no gallops. Neck vessels: no carotid bruits</td></tr><tr><td><content styleCode="Bold">Musculoskeletal::</content></td><td>Joints, Bones, and Muscles: ; She has decreased range of motion especially to abduction with some pain on internal rotation</td></tr></tbody>            </table>          </text>'

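Question: how can I extract the text from each tag in "linear" order, the way a human would read it, left to right, with a space between each piece of text?

What I tried

If I use BeautifulSoup's get_text(), the result is very close, except that I need a line break (or at least a space) between the individual text entries. Notice how everything runs together as General Adult ExamConstitutional:General Appearance: when what I need is General Adult Exam Constitutional: General Appearance: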

from bs4 import BeautifulSoup

parsed_soup = BeautifulSoup(medsoup, 'lxml')
parsed_soup.get_text().strip()
Out[340]: 'General Adult ExamConstitutional:General Appearance: healthy-appearing, well-nourished, well-developedLungs:Respiratory effort: no dyspnea. Auscultation: breath sounds normal, good air movement, CTA except as noted, no wheezing, no rales/crackles, no rhonchiCardiovascular:Heart Auscultation: RRR, normal S1, normal S2, no murmurs, no rubs, no gallops. Neck vessels: no carotid bruitsMusculoskeletal::Joints, Bones, and Muscles: ; She has decreased range of motion especially to abduction with some pain on internal rotation'

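If I iterate over the individual elements of the soup, hoping to add a space after each piece of text, I get something odd: there seems to be only one element to iterate over.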

for i, ele in enumerate(parsed_soup):
    print(i, ele, '\n')


0 <html><body><text> <table> <tbody><tr><td> </td><td><content stylecode="Bold">General Adult Exam</content></td></tr><tr><td><content stylecode="Bold">Constitutional:</content></td><td>General Appearance: healthy-appearing, well-nourished, well-developed</td></tr><tr><td><content stylecode="Bold">Lungs:</content></td><td>Respiratory effort: no dyspnea. Auscultation: breath sounds normal, good air movement, CTA except as noted, no wheezing, no rales/crackles, no rhonchi</td></tr><tr><td><content stylecode="Bold">Cardiovascular:</content></td><td>Heart Auscultation: RRR, normal S1, normal S2, no murmurs, no rubs, no gallops. Neck vessels: no carotid bruits</td></tr><tr><td><content stylecode="Bold">Musculoskeletal::</content></td><td>Joints, Bones, and Muscles: ; She has decreased range of motion especially to abduction with some pain on internal rotation</td></tr></tbody> </table> </text></body></html> 

I also tried iterating over the tags with next_siblings and next_element, but I could not get either to work.

Tags: python, xml, beautifulsoup, xml-parsing

Solution
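
One possible approach, offered as a minimal sketch rather than a definitive answer: get_text() accepts a separator argument that is placed between adjacent text nodes, and strip=True trims each node and drops the ones that end up empty. The stripped_strings generator yields the same pieces one at a time in document order. The example uses a shortened copy of the string from the question and the built-in 'html.parser' so that it runs without lxml; the assumption is that 'lxml', as used in the question, gives the same text for this input.

```python
from bs4 import BeautifulSoup

# Shortened copy of the string from the question (no line breaks).
medsoup = (
    '<text> <table> <tbody>'
    '<tr><td>&#xA0;</td><td><content styleCode="Bold">General Adult Exam</content></td></tr>'
    '<tr><td><content styleCode="Bold">Constitutional:</content></td>'
    '<td>General Appearance: healthy-appearing, well-nourished, well-developed</td></tr>'
    '</tbody> </table> </text>'
)

# Assumption: 'html.parser' is used here only to avoid the lxml dependency.
parsed_soup = BeautifulSoup(medsoup, 'html.parser')

# separator goes between adjacent text nodes; strip=True trims each node
# and skips nodes that are empty after trimming (including the &#xA0; cell).
linear_text = parsed_soup.get_text(separator=' ', strip=True)
print(linear_text)

# The same pieces, one per entry, in document order.
pieces = list(parsed_soup.stripped_strings)
print('\n'.join(pieces))
```

Passing separator='\n' instead of a space would put each text entry on its own line, which matches the "line break between entries" variant of the question.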

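On the iteration attempt: iterating over a BeautifulSoup object visits only its direct children, and with the 'lxml' parser the fragment is wrapped in html and body tags, so the single top-level child is the whole document. find_all() searches all descendants in document order instead. The sketch below, again on a shortened copy of the question's string and with 'html.parser' as an assumed stand-in for lxml, builds one line of text per table row; the variable names lines and report are only illustrative.

```python
from bs4 import BeautifulSoup

# Shortened copy of the string from the question (no line breaks).
medsoup = (
    '<text> <table> <tbody>'
    '<tr><td>&#xA0;</td><td><content styleCode="Bold">General Adult Exam</content></td></tr>'
    '<tr><td><content styleCode="Bold">Constitutional:</content></td>'
    '<td>General Appearance: healthy-appearing, well-nourished, well-developed</td></tr>'
    '</tbody> </table> </text>'
)

# Assumption: 'html.parser' is used here only to avoid the lxml dependency.
parsed_soup = BeautifulSoup(medsoup, 'html.parser')

lines = []
# find_all walks every descendant, so each <tr> is found no matter how deeply it is nested.
for row in parsed_soup.find_all('tr'):
    # Join the cells of this row with a space, trimming each text node.
    row_text = row.get_text(separator=' ', strip=True)
    if row_text:  # skip rows that contain no visible text
        lines.append(row_text)

report = '\n'.join(lines)
print(report)
```

This keeps each label next to its finding on the same line, which may be closer to how a person reads the table than one flat string.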
