python - How do I remove the HTML tags from a list created with BeautifulSoup?
Problem description
I am scraping the section headings from a Wikipedia page, for example: https://en.wikipedia.org/wiki/Sun
I collected all the mw-headline spans:
titles = soup.find_all('span', {"class":"mw-headline"})
Now I want to build a list of the headings and print it:
print(list(titles))
The result is a list that still contains all the HTML markup:
[<span class="mw-headline" id="Name_and_etymology">Name and etymology</span>, <span class="mw-headline" id="General_characteristics">General characteristics</span>, <span class="mw-headline" id="Sunlight">Sunlight</span>, <span class="mw-headline" id="Composition">Composition</span>, <span class="mw-headline" id="Singly_ionized_iron-group_elements">Singly ionized iron-group elements</span>, <span class="mw-headline" id="Isotopic_composition">Isotopic composition</span>, <span class="mw-headline" id="Structure_and_fusion">Structure and fusion</span>, <span class="mw-headline" id="Core">Core</span>, <span class="mw-headline" id="Radiative_zone">Radiative zone</span>, <span class="mw-headline" id="Tachocline">Tachocline</span>, <span class="mw-headline" id="Convective_zone">Convective zone</span>, <span class="mw-headline" id="Photosphere">Photosphere</span>, <span class="mw-headline" id="Atmosphere">Atmosphere</span>, <span class="mw-headline" id="Photons_and_neutrinos">Photons and neutrinos</span>, <span class="mw-headline" id="Magnetic_activity">Magnetic activity</span>, <span class="mw-headline" id="Magnetic_field">Magnetic field</span>, <span class="mw-headline" id="Variation_in_activity">Variation in activity</span>, <span class="mw-headline" id="Long-term_change">Long-term change</span>, <span class="mw-headline" id="Life_phases">Life phases</span>, <span class="mw-headline" id="Formation">Formation</span>, <span class="mw-headline" id="Main_sequence">Main sequence</span>, <span class="mw-headline" id="After_core_hydrogen_exhaustion">After core hydrogen exhaustion</span>, <span class="mw-headline" id="Orbit_and_location">Orbit and location</span>, <span class="mw-headline" id="Orbit_in_Milky_Way">Orbit in Milky Way</span>, <span class="mw-headline" id="Theoretical_problems">Theoretical problems</span>, <span class="mw-headline" id="Coronal_heating_problem">Coronal heating problem</span>, <span class="mw-headline" id="Faint_young_Sun_problem">Faint young Sun 
problem</span>, <span class="mw-headline" id="Observational_history">Observational history</span>, <span class="mw-headline" id="Early_understanding">Early understanding</span>, <span class="mw-headline" id="Development_of_scientific_understanding">Development of scientific understanding</span>, <span class="mw-headline" id="Solar_space_missions">Solar space missions</span>, <span class="mw-headline" id="Observation_and_effects">Observation and effects</span>, <span class="mw-headline" id="Planetary_system">Planetary system</span>, <span class="mw-headline" id="Religious_aspects">Religious aspects</span>, <span class="mw-headline" id="See_also">See also</span>, <span class="mw-headline" id="Notes">Notes</span>, <span class="mw-headline" id="References">References</span>, <span class="mw-headline" id="Further_reading">Further reading</span>, <span class="mw-headline" id="External_links">External links</span>]
How can I remove the tags so that I get a list containing only the heading text?
Solution
Instead of converting the iterable titles to a list, you can iterate over it and read the text attribute of each tag to get its text content:
titles = [tag.text for tag in soup.find_all('span', {"class":"mw-headline"})]
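To try the same idea without fetching the page, here is a minimal sketch that runs against a small inline HTML string. The html snippet is a made-up stand-in for the downloaded Wikipedia page, not its real markup. It uses get_text(strip=True) as a variant of .text that also trims surrounding whitespace:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page (assumption, not the real markup).
html = """
<h2><span class="mw-headline" id="Name_and_etymology">Name and etymology</span></h2>
<h2><span class="mw-headline" id="Sunlight"> Sunlight </span></h2>
<h3><span class="mw-headline" id="Core">Core</span></h3>
<span class="other">Not a headline</span>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only the text of each matching tag; strip=True trims leading/trailing whitespace.
titles = [tag.get_text(strip=True) for tag in soup.find_all("span", class_="mw-headline")]

print(titles)  # ['Name and etymology', 'Sunlight', 'Core']
```

Here class_="mw-headline" is equivalent to passing {"class": "mw-headline"} as in the question. Plain tag.text works just as well when the whitespace does not matter.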