python - Python BeautifulSoup - How to extract text from HTML/XML tags in "linear" order

Problem description

I have a block of text like the following that I need to extract text from (this is mock data):

<text>
            <table>
              <tbody>
<tr><td>&#xA0;</td><td><content styleCode="Bold">General Adult Exam</content></td></tr>
<tr><td><content styleCode="Bold">Constitutional:</content></td><td>General Appearance: healthy-appearing, well-nourished, well-developed</td></tr>
<tr><td><content styleCode="Bold">Lungs:</content></td><td>Respiratory effort: no dyspnea. Auscultation: breath sounds normal, good air movement, CTA except as noted, no wheezing, no rales/crackles, no rhonchi</td></tr>
<tr><td><content styleCode="Bold">Cardiovascular:</content></td><td>Heart Auscultation: RRR, normal S1, normal S2, no murmurs, no rubs, no gallops. Neck vessels: no carotid bruits</td></tr>
<tr><td><content styleCode="Bold">Musculoskeletal::</content></td><td>Joints, Bones, and Muscles: ; She has decreased range of motion especially to abduction with some pain on internal rotation</td></tr>
</tbody>
            </table>
          </text>

Read in as a string (note that my actual string has no line breaks):

medsoup = '<text>            <table>              <tbody><tr><td>&#xA0;</td><td><content styleCode="Bold">General Adult Exam</content></td></tr><tr><td><content styleCode="Bold">Constitutional:</content></td><td>General Appearance: healthy-appearing, well-nourished, well-developed</td></tr><tr><td><content styleCode="Bold">Lungs:</content></td><td>Respiratory effort: no dyspnea. Auscultation: breath sounds normal, good air movement, CTA except as noted, no wheezing, no rales/crackles, no rhonchi</td></tr><tr><td><content styleCode="Bold">Cardiovascular:</content></td><td>Heart Auscultation: RRR, normal S1, normal S2, no murmurs, no rubs, no gallops. Neck vessels: no carotid bruits</td></tr><tr><td><content styleCode="Bold">Musculoskeletal::</content></td><td>Joints, Bones, and Muscles: ; She has decreased range of motion especially to abduction with some pain on internal rotation</td></tr></tbody>            </table>          </text>'
medsoup  
Out[358]: '<text>            <table>              <tbody><tr><td>&#xA0;</td><td><content styleCode="Bold">General Adult Exam</content></td></tr><tr><td><content styleCode="Bold">Constitutional:</content></td><td>General Appearance: healthy-appearing, well-nourished, well-developed</td></tr><tr><td><content styleCode="Bold">Lungs:</content></td><td>Respiratory effort: no dyspnea. Auscultation: breath sounds normal, good air movement, CTA except as noted, no wheezing, no rales/crackles, no rhonchi</td></tr><tr><td><content styleCode="Bold">Cardiovascular:</content></td><td>Heart Auscultation: RRR, normal S1, normal S2, no murmurs, no rubs, no gallops. Neck vessels: no carotid bruits</td></tr><tr><td><content styleCode="Bold">Musculoskeletal::</content></td><td>Joints, Bones, and Muscles: ; She has decreased range of motion especially to abduction with some pain on internal rotation</td></tr></tbody>            </table>          </text>'

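Question: How can I extract the text from each tag in "linear" order, the way a human would read it, left to right, with a space between each piece of text?

What I tried

If I use BeautifulSoup's get_text(), the result is very close, except that I need a newline (or at least a space) between the individual text entries. Notice how everything runs together as "General Adult ExamConstitutional:General Appearance:" when what I need is "General Adult Exam Constitutional: General Appearance:".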

from bs4 import BeautifulSoup

parsed_soup = BeautifulSoup(medsoup, 'lxml')
parsed_soup.get_text().strip()
Out[340]: 'General Adult ExamConstitutional:General Appearance: healthy-appearing, well-nourished, well-developedLungs:Respiratory effort: no dyspnea. Auscultation: breath sounds normal, good air movement, CTA except as noted, no wheezing, no rales/crackles, no rhonchiCardiovascular:Heart Auscultation: RRR, normal S1, normal S2, no murmurs, no rubs, no gallops. Neck vessels: no carotid bruitsMusculoskeletal::Joints, Bones, and Muscles: ; She has decreased range of motion especially to abduction with some pain on internal rotation'
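
One option that may cover this case is the separator and strip arguments of get_text(). The sketch below is a minimal illustration rather than a definitive fix: it uses a shortened copy of the mock data (the names sample and flat are hypothetical) and assumes the built-in 'html.parser' instead of 'lxml' so that it runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Shortened copy of the mock data (hypothetical name: sample).
sample = (
    '<text>  <table>  <tbody>'
    '<tr><td>&#xA0;</td>'
    '<td><content styleCode="Bold">General Adult Exam</content></td></tr>'
    '<tr><td><content styleCode="Bold">Constitutional:</content></td>'
    '<td>General Appearance: healthy-appearing</td></tr>'
    '</tbody>  </table>  </text>'
)

# Assumption: 'html.parser' is used here only to avoid requiring lxml.
soup = BeautifulSoup(sample, 'html.parser')

# separator=' ' joins the text nodes with a single space;
# strip=True trims each node and skips nodes that become empty
# (the indentation between tags and the &#xA0; cell).
flat = soup.get_text(separator=' ', strip=True)
print(flat)
```

On this shortened sample the entries should come out separated by single spaces, in document order. Passing separator='\n' instead would put each entry on its own line.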

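If I iterate over the individual elements of the soup, hoping to add a space after each piece of text, I get a strange result: there seems to be only one element to iterate over.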

for i, ele in enumerate(parsed_soup):
    print(i, ele, '\n')


0 <html><body><text> <table> <tbody><tr><td> </td><td><content stylecode="Bold">General Adult Exam</content></td></tr><tr><td><content stylecode="Bold">Constitutional:</content></td><td>General Appearance: healthy-appearing, well-nourished, well-developed</td></tr><tr><td><content stylecode="Bold">Lungs:</content></td><td>Respiratory effort: no dyspnea. Auscultation: breath sounds normal, good air movement, CTA except as noted, no wheezing, no rales/crackles, no rhonchi</td></tr><tr><td><content stylecode="Bold">Cardiovascular:</content></td><td>Heart Auscultation: RRR, normal S1, normal S2, no murmurs, no rubs, no gallops. Neck vessels: no carotid bruits</td></tr><tr><td><content stylecode="Bold">Musculoskeletal::</content></td><td>Joints, Bones, and Muscles: ; She has decreased range of motion especially to abduction with some pain on internal rotation</td></tr></tbody> </table> </text></body></html>
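
The single item is probably expected: iterating over a BeautifulSoup object walks only its direct children, and the whole fragment sits under one top-level element (with 'lxml', the <html> wrapper the parser adds). To visit each piece of text in document order, the stripped_strings generator may be closer to what the loop was meant to do. This is a sketch on a shortened sample; the names sample, top_level, and pieces are hypothetical, and 'html.parser' is assumed so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Shortened copy of the mock data (hypothetical name: sample).
sample = (
    '<text>  <table>  <tbody>'
    '<tr><td>&#xA0;</td>'
    '<td><content styleCode="Bold">General Adult Exam</content></td></tr>'
    '<tr><td><content styleCode="Bold">Constitutional:</content></td>'
    '<td>General Appearance: healthy-appearing</td></tr>'
    '</tbody>  </table>  </text>'
)

# Assumption: 'html.parser' is used here only to avoid requiring lxml.
soup = BeautifulSoup(sample, 'html.parser')

# Direct children of the soup: just the one top-level <text> tag.
top_level = list(soup.children)

# stripped_strings yields every text node in document order,
# whitespace-trimmed, skipping nodes that are empty after trimming.
pieces = list(soup.stripped_strings)

for i, piece in enumerate(pieces):
    print(i, piece)
```

Because pieces is a plain list of strings, it can then be joined with any separator, for example ' '.join(pieces).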

I also tried using next_siblings and next_element to iterate over the tags, but I could not get either one to work.

Tags: python, xml, beautifulsoup, xml-parsing
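
One likely reason next_siblings falls short here is that it only yields nodes sharing the same parent, so it never descends into the nested <td> and <content> tags. A possible alternative is to walk the rows with find_all('tr') and flatten each row separately, which gives one line per table row. This is a sketch on a shortened sample; the names sample, lines, and result are hypothetical, and 'html.parser' is assumed so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Shortened copy of the mock data (hypothetical name: sample).
sample = (
    '<text>  <table>  <tbody>'
    '<tr><td>&#xA0;</td>'
    '<td><content styleCode="Bold">General Adult Exam</content></td></tr>'
    '<tr><td><content styleCode="Bold">Constitutional:</content></td>'
    '<td>General Appearance: healthy-appearing</td></tr>'
    '</tbody>  </table>  </text>'
)

# Assumption: 'html.parser' is used here only to avoid requiring lxml.
soup = BeautifulSoup(sample, 'html.parser')

# find_all('tr') searches all descendants, not just siblings.
# Each row is flattened on its own: cells joined by a space,
# empty cells (such as the &#xA0; one) dropped by strip=True.
lines = [tr.get_text(' ', strip=True) for tr in soup.find_all('tr')]

# One table row per output line.
result = '\n'.join(lines)
print(result)
```

Keeping the label and its description on the same line preserves the left-to-right reading order within a row while still separating rows with newlines.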
