How do I remove the HTML tags from a list created with BeautifulSoup?

Question

I am scraping the section headings from a Wikipedia page, for example: https://en.wikipedia.org/wiki/Sun

I collected all the span elements with the class mw-headline:

titles = soup.find_all('span', {"class":"mw-headline"})

Now I want to build a list of the headings and print it:

print(list(titles))

The result is a list that still contains all the HTML markup:

[<span class="mw-headline" id="Name_and_etymology">Name and etymology</span>, <span class="mw-headline" id="General_characteristics">General characteristics</span>, <span class="mw-headline" id="Sunlight">Sunlight</span>, <span class="mw-headline" id="Composition">Composition</span>, <span class="mw-headline" id="Singly_ionized_iron-group_elements">Singly ionized iron-group elements</span>, <span class="mw-headline" id="Isotopic_composition">Isotopic composition</span>, <span class="mw-headline" id="Structure_and_fusion">Structure and fusion</span>, <span class="mw-headline" id="Core">Core</span>, <span class="mw-headline" id="Radiative_zone">Radiative zone</span>, <span class="mw-headline" id="Tachocline">Tachocline</span>, <span class="mw-headline" id="Convective_zone">Convective zone</span>, <span class="mw-headline" id="Photosphere">Photosphere</span>, <span class="mw-headline" id="Atmosphere">Atmosphere</span>, <span class="mw-headline" id="Photons_and_neutrinos">Photons and neutrinos</span>, <span class="mw-headline" id="Magnetic_activity">Magnetic activity</span>, <span class="mw-headline" id="Magnetic_field">Magnetic field</span>, <span class="mw-headline" id="Variation_in_activity">Variation in activity</span>, <span class="mw-headline" id="Long-term_change">Long-term change</span>, <span class="mw-headline" id="Life_phases">Life phases</span>, <span class="mw-headline" id="Formation">Formation</span>, <span class="mw-headline" id="Main_sequence">Main sequence</span>, <span class="mw-headline" id="After_core_hydrogen_exhaustion">After core hydrogen exhaustion</span>, <span class="mw-headline" id="Orbit_and_location">Orbit and location</span>, <span class="mw-headline" id="Orbit_in_Milky_Way">Orbit in Milky Way</span>, <span class="mw-headline" id="Theoretical_problems">Theoretical problems</span>, <span class="mw-headline" id="Coronal_heating_problem">Coronal heating problem</span>, <span class="mw-headline" id="Faint_young_Sun_problem">Faint young Sun 
problem</span>, <span class="mw-headline" id="Observational_history">Observational history</span>, <span class="mw-headline" id="Early_understanding">Early understanding</span>, <span class="mw-headline" id="Development_of_scientific_understanding">Development of scientific understanding</span>, <span class="mw-headline" id="Solar_space_missions">Solar space missions</span>, <span class="mw-headline" id="Observation_and_effects">Observation and effects</span>, <span class="mw-headline" id="Planetary_system">Planetary system</span>, <span class="mw-headline" id="Religious_aspects">Religious aspects</span>, <span class="mw-headline" id="See_also">See also</span>, <span class="mw-headline" id="Notes">Notes</span>, <span class="mw-headline" id="References">References</span>, <span class="mw-headline" id="Further_reading">Further reading</span>, <span class="mw-headline" id="External_links">External links</span>]

How can I remove the tags so that I end up with a list containing only the heading text?

Tags: python, html, beautifulsoup

Answer


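Converting titles to a list only copies the Tag objects, and printing a Tag shows its full markup. Instead, iterate over the tags and read each one's text attribute to get just the text: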

titles = [tag.text for tag in soup.find_all('span', {"class":"mw-headline"})]
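For reference, here is a minimal self-contained sketch of the same idea. The inline HTML string is a made-up stand-in for the downloaded page (the question does not show how soup was created), and it assumes the headings are still wrapped in mw-headline spans as in the output above. It also shows get_text(strip=True), which may be preferable to text when a heading carries stray whitespace:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the fetched Wikipedia page.
html = """
<h2><span class="mw-headline" id="Composition">Composition</span></h2>
<h3><span class="mw-headline" id="Core"> Core </span></h3>
<h2><span class="mw-headline" id="See_also">See also</span></h2>
"""

# "html.parser" is the parser bundled with Python, so no extra install is needed.
soup = BeautifulSoup(html, "html.parser")

# find_all returns Tag objects; get_text(strip=True) returns each tag's
# text with leading and trailing whitespace removed.
titles = [tag.get_text(strip=True)
          for tag in soup.find_all("span", class_="mw-headline")]

print(titles)  # ['Composition', 'Core', 'See also']
```

With tag.text instead of get_text(strip=True), the second entry would keep its padding and come out as ' Core '.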
