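python - Scraping a website's top ten stories with Beautiful Soup
Problem description
I'm trying to scrape the site http://edition.cnn.com/EVENTS/1996/year.in.review/ and get the top 10 stories. This is my attempt so far. Is there a simpler way, one that I've overlooked, to do this in a single pass? I'm also trying to find a way to remove the blank line between each print, because I don't know why there is a gap between each title.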
import requests
from bs4 import BeautifulSoup
import lxml
html = """
<HTML>
<HEAD>
<TITLE>Top Ten Stories From 1996</TITLE>
</HEAD>
<BODY BGCOLOR="#FFFFCC" LINK="#162323" ALINK="#FFFFCE" VLINK="#162323">
<CENTER>
<P><BR>
<TABLE BORDER="0" CELLPADDING="0" CELLSPACING="0">
<TR>
<TD><IMG SRC="logos.gif" WIDTH="112" HEIGHT="60" ALIGN="TOP"></TD>
<TD><IMG SRC="banner.gif" WIDTH="360" HEIGHT="60" ALIGN="TOP"></TD>
</TR>
</TABLE>
</P>
</CENTER>
<BLOCKQUOTE>
<CENTER>
<TABLE BORDER="0" CELLPADDING="2">
<TR>
<TD WIDTH="90" VALIGN="TOP" ROWSPAN="11">
<P ALIGN="RIGHT"><B><TT>What were the biggest stories of the year?</TT></B><BR>
<BR>
<FONT SIZE="2">It's a question journalists like to ask themselves at the end of every
year. Now you can join in the process. Here are our selections for the top ten news
stories of 1996.<BR>
<BR>
Disagree with our choices? Then tell us what stories you think were most compelling
in the poll below.</FONT>
</TD>
<TD WIDTH="4" ROWSPAN="11"></TD>
<TD VALIGN="MIDDLE" ROWSPAN="11"><IMG SRC="generic/dot.gif" WIDTH="1" HEIGHT="250" ALIGN="MIDDLE"></TD>
<TD WIDTH="10" ROWSPAN="11"></TD>
<TD COLSPAN="4" VALIGN=TOP>
<P ALIGN="CENTER"><IMG SRC="generic/topten.gif" WIDTH="263" HEIGHT="24" ALIGN="MIDDLE" VSPACE="5">
</TD>
</TR>
<TR>
<TD><A HREF="topten/israel/israel.index.html" TARGET=_top><IMG SRC="generic/1.gif" WIDTH="17" HEIGHT="17" ALIGN="MIDDLE" BORDER=0></A></TD>
<TD><A HREF="topten/israel/israel.index.html" TARGET=_top><B>Israel</B> elects <B>Netanyahu</A></B></TD>
</TR>
<TR>
<TD><A HREF="topten/twa/twa.index.html" TARGET=_top><IMG SRC="generic/2.gif" WIDTH="17" HEIGHT="17" ALIGN="MIDDLE" BORDER=0></A></TD>
<TD><A HREF="topten/twa/twa.index.html" TARGET=_top>Crash of TWA Flight 800</A></TD>
</TR>
<TR>
<TD><A HREF="topten/yeltsin/yeltsin.index.html" TARGET=_top><IMG SRC="generic/3.gif" WIDTH="17" HEIGHT="17" ALIGN="MIDDLE" BORDER=0></A></TD>
<TD><A HREF="topten/yeltsin/yeltsin.index.html" TARGET=_top><B>Russia</B> elects <B>Yeltsin</B></A></TD>
</TR>
<TR>
<TD><A HREF="topten/clinton/clinton.index.html" TARGET=_top><IMG SRC="generic/4.gif" WIDTH="17" HEIGHT="17" ALIGN="MIDDLE" BORDER=0></A></TD>
<TD><A HREF="topten/clinton/clinton.index.html" TARGET=_top><B>U.S</B>. elects <B>Clinton</B></A></TD>
</TR>
<TR>
<TD><A HREF="topten/hutu/hutu.index.html" TARGET=_top><IMG SRC="generic/5.gif" WIDTH="17" HEIGHT="17" ALIGN="MIDDLE" BORDER=0></A></TD>
<TD><A HREF="topten/hutu/hutu.index.html" TARGET=_top><B>Hutu-Tutsi</B> conflict in central Africa</A></TD>
</TR>
<TR>
<TD><A HREF="topten/bosnia/bosnia.index.html" TARGET=_top><IMG SRC="generic/6.gif" WIDTH="17" HEIGHT="17" ALIGN="MIDDLE" BORDER=0></A></TD>
<TD><A HREF="topten/bosnia/bosnia.index.html" TARGET=_top>Peace, elections in <B>Bosnia</B></A></TD>
</TR>
<TR>
<TD><A HREF="topten/saudi/saudi.index.html" TARGET=_top><IMG SRC="generic/7.gif" WIDTH="17" HEIGHT="17" ALIGN="MIDDLE" BORDER=0></A></TD>
<TD><A HREF="topten/saudi/saudi.index.html" TARGET=_top><B>U.S</B>. base bombed in <B>Saudi Arabia</B></A></TD>
</TR>
<TR>
<TD><A HREF="topten/olympics/olympics.index.html" TARGET=_top><IMG SRC="generic/8.gif" WIDTH="17" HEIGHT="17" ALIGN="MIDDLE" BORDER=0></A></TD>
<TD><A HREF="topten/olympics/olympics.index.html" TARGET=_top>Centennial <B>Olympic</B> Games</A></TD>
</TR>
<TR>
<TD><A HREF="topten/aids/aids.index.html" TARGET=_top><IMG SRC="generic/9.gif" WIDTH="17" HEIGHT="17" ALIGN="MIDDLE" BORDER=0></A></TD>
<TD><A HREF="topten/aids/aids.index.html" TARGET=_top>Advances against <B>AIDS</B></A></TD>
</TR>
<TR>
<TD><A HREF="topten/unabomb/unabomb.index.html" TARGET=_top><IMG SRC="generic/10.gif" WIDTH="17" HEIGHT="17" ALIGN="MIDDLE" BORDER=0></A></TD>
<TD><A HREF="topten/unabomb/unabomb.index.html" TARGET=_top><B>Unabomb</B> suspect <B>Ted Kaczynski</B> arrested</A></TD>
</TR>
</TABLE>
<BR clear = "all">
<TABLE WIDTH=300>
<TR>
<TD>
<CENTER><A HREF="topten/poll.html" TARGET=_top><IMG SRC="poll.gif" WIDTH="120" HEIGHT="60" ALIGN="MIDDLE" BORDER="0"></CENTER></A>
</TD>
<TD>
<CENTER><A HREF="http://www-cgi.cnn.com/cgi-bin/quiz/yir_main/go.pl/main" TARGET=_top><IMG SRC="quiz.gif" WIDTH="120" HEIGHT="60" ALIGN="MIDDLE" BORDER="0"></CENTER></A>
</TD>
</TR>
<TR><TD COLSPAN=2><CENTER><A TARGET=_top HREF="http://www-cgi.cnn.com/cgi-bin/poll/heavypoll.pl?slug=9612%2Fyir_top_10">The top 10 stories according to our users</A></CENTER></TD></TR>
</TABLE>
<IMG SRC="generic/dot.gif" WIDTH="450" HEIGHT="1" ALIGN="MIDDLE"><BR>
<BR><IMG SRC="generic/dot.gif" WIDTH="450" HEIGHT="1" ALIGN="MIDDLE"><BR>
<BR>
<CENTER>
<A HREF="http://pathfinder.com/time/bestof1996/index.html" TARGET=_top>
T I M E: The Best of 1996</A>
<BR clear = "all"><BR>
<A HREF="http://pathfinder.com/@@qsdFOQcA62PJWEWu/time/moy/index.html" TARGET=_top>
T I M E: Man of the Year</A>
<BR clear = "all"><BR>
<A HREF="http://pathfinder.com/time/1996/" TARGET=_top>
<IMG SRC="time.gif" WIDTH="540" HEIGHT="50" ALIGN="MIDDLE" BORDER="0"></A>
<BR clear = "all"><BR><BR>
<IMG SRC="generic/dot.gif" WIDTH="450" HEIGHT="1" ALIGN="MIDDLE">
</CENTER>
<BR clear = "all">
<TABLE BORDER="0" CELLPADDING="0" CELLSPACING="0" WIDTH="63%">
<TR>
<TD WIDTH="100%">
<P><B><TT>What makes a </TT></B><FONT SIZE="5"><TT><B>big</B></TT></FONT><TT><B>
story </B></TT><FONT SIZE="5"><TT><B>BIG?</B></TT></FONT>
<BLOCKQUOTE>
<P>It depends on your criteria, of course, and your perspective. That's why we offered
a poll to find out what you think.</P>
<P>For our list, we polled producers throughout the CNN/Pathfinder family of networks
and publications, and weighed such criteria as a story's long-term implications,
geopolitical significance, user interest, amount of coverage, and old-fashioned newsworthiness.
All these things help make a "big" story big.</P>
<P>By no means do we think our lists are the final word. Even our polls among CNN
producers turned up a wide variety of responses. The process is meant to encourage
you to reconsider the stories that dominated the media during the past year and determine
for yourself which were mere sensations and which were truly significant.
</BLOCKQUOTE>
</TD>
</TR>
</TABLE>
<BR CLEAR=ALL>
<BR>
<CENTER>
<BR CLEAR=ALL>
<BR>
<IMG SRC="generic/dot.gif" WIDTH="450" HEIGHT="1" ALIGN="MIDDLE">
<BR CLEAR=ALL><BR>
<TABLE WIDTH=300><TR VALIGN=CENTER>
<TD ALIGN=CENTER><IMG SRC="what_you_think.gif" ALT="What you think" WIDTH="60" HEIGHT="59" BORDER="0"></TD>
<TD><STRONG><A NAME="_top" HREF="/feedback/index.html">Tell us what you think</A></STRONG><BR><BR>
<STRONG><A NAME="_top" HREF="/feedback/comments.html">You said it...</A></STRONG></TD>
</TR></TABLE>
<BR CLEAR=ALL>
<BR>
<IMG SRC="generic/dot.gif" WIDTH="450" HEIGHT="1" ALIGN="MIDDLE">
<BR CLEAR=ALL><BR>
</CENTER>
<CENTER><A HREF="generic/credits.index.html" TARGET=_top><TT><B>C R E D I T S</B></TT></A></CENTER>
<BR CLEAR=ALL>
<BR>
<IMG SRC="generic/dot.gif" WIDTH="450" HEIGHT="1" ALIGN="MIDDLE">
<BR CLEAR=ALL><BR>
<CENTER><A HREF="#TOP"><TT><B>Back to top</B></TT></A></CENTER>
<BR CLEAR=ALL><BR>
<FONT SIZE=-1><P>© 1996 Cable News Network, Inc.<BR>
All Rights Reserved.</FONT>
<H6><A HREF="http://cnn.com/interactive_legal.html" target=_top>Terms</A> under which this
service is provided to you.</H6>
</CENTER>
</CENTER>
</BLOCKQUOTE>
</BODY>
</HTML>
"""
soup = BeautifulSoup(html, "lxml")
td_list = soup.find_all('td')
count = 0
for link in td_list:
    if count == 20:
        pass
    elif link.a is not None:
        print(link.text.strip())
        count += 1
Output:
Israel elects Netanyahu
Crash of TWA Flight 800
Russia elects Yeltsin
U.S. elects Clinton
Hutu-Tutsi conflict in central Africa
Peace, elections in Bosnia
U.S. base bombed in Saudi Arabia
Centennial Olympic Games
Advances against AIDS
Unabomb suspect Ted Kaczynski arrested
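Solution
I used re to shorten the path: it selects every a tag whose href value starts with topten. You can also do this another way, for example with a CSS selector:
for item in soup.select("a[href^=topten]"):
Then I take all the text inside each tag, strip it with strip=True, and pass a single space as the separator so that the pieces of text do not run together.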
import requests
from bs4 import BeautifulSoup
import re

def main(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    # Match every <a> tag whose href starts with "topten".
    for item in soup.find_all("a", href=re.compile("^topten")):
        item = item.get_text(strip=True, separator=" ")
        # Links that wrap only an image have no text; skip them.
        if item:
            print(item)

main("http://edition.cnn.com/EVENTS/1996/year.in.review/main.html")
Output:
Israel elects Netanyahu
Crash of TWA Flight 800
Russia elects Yeltsin
U.S . elects Clinton
Hutu-Tutsi conflict in central Africa
Peace, elections in Bosnia
U.S . base bombed in Saudi Arabia
Centennial Olympic Games
Advances against AIDS
Unabomb suspect Ted Kaczynski arrested
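For comparison, here is a minimal sketch of the CSS-selector variant mentioned above, run against a trimmed copy of the question's HTML so that it needs no network access. SAMPLE_HTML and top_stories are names made up for this illustration. It also drops the separator argument, on the assumption that the extra space in "U.S ." is unwanted; with no separator, the text pieces inside a link are joined exactly as they appear in the markup.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: a few rows trimmed from the page in the question.
SAMPLE_HTML = """
<TABLE>
<TR>
<TD><A HREF="topten/twa/twa.index.html"><IMG SRC="generic/2.gif"></A></TD>
<TD><A HREF="topten/twa/twa.index.html">Crash of TWA Flight 800</A></TD>
</TR>
<TR>
<TD><A HREF="topten/clinton/clinton.index.html"><IMG SRC="generic/4.gif"></A></TD>
<TD><A HREF="topten/clinton/clinton.index.html"><B>U.S</B>. elects <B>Clinton</B></A></TD>
</TR>
<TR>
<TD><A HREF="topten/poll.html"><IMG SRC="poll.gif"></A></TD>
</TR>
</TABLE>
"""


def top_stories(html):
    """Return the non-empty texts of links whose href starts with 'topten'."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for link in soup.select('a[href^="topten"]'):
        # No separator: "U.S" + ". elects " + "Clinton" stays "U.S. elects Clinton".
        text = link.get_text().strip()
        # Links that wrap only an image yield an empty string, which is
        # what produces the blank lines if every match is printed.
        if text:
            titles.append(text)
    return titles


print("\n".join(top_stories(SAMPLE_HTML)))
```

Returning a list rather than printing inside the loop keeps the filtering step separate from the output, so the same function can be reused on the downloaded page by passing it r.content.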