python-3.x - Merging two separate tables into one with BeautifulSoup
Question
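I am trying to scrape the box office table on this site, and I am stuck on turning its two separate tables into a single DataFrame. (I understand why the page splits them, but they belong in one and the same table.)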
URL: https://www.the-numbers.com/box-office-records/worldwide/all-movies/cumulative/released-in-2019
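How should I handle the columns when there are two separate tables and neither of them carries any specific identifying name in the markup?
When I scrape the headers with soup.select('table > thead > tr > th'), every column name shows up twice, so I would like to cut the list just before it starts repeating.
Example: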
Columns: [Rank, Movie, Worldwide Box Office, Domestic Box Office, International Box Office, DomesticShare, Rank, Movie, Worldwide Box Office, Domestic Box Office, International Box Office, DomesticShare]
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

URL = "https://www.the-numbers.com/box-office-records/worldwide/all-movies/cumulative/released-in-2019"
rq = requests.get(URL)
soup = bs(rq.content, 'html.parser')

# Header cells of every table on the page; with two tables the names come back twice
columns = soup.select('table > thead > tr > th')
columnlist = []
for column in columns:
    columnlist.append(column.text)
df = pd.DataFrame(columns=columnlist)

# Body rows of every table, one list of cell texts per row
contents = soup.find_all('table')
contents = soup.select('tbody > tr')
dfcontent = []
alldfcontents = []
for content in contents:
    tds = content.find_all('td')
    for td in tds:
        dfcontent.append(td.text)
    alldfcontents.append(dfcontent)
    dfcontent = []

df = pd.DataFrame(columns=columnlist)
This is the DataFrame I want to end up with:
Columns: Rank, Movie, Worldwide Box Office, Domestic Box Office, International Box Office, DomesticShare
Factors: 1, Avengers Endgame, ...
...
100, ~, ...
I hope to use it for machine learning afterwards.
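The "cut the columns before they repeat" step described in the question can be sketched in plain Python. This is only an illustration: the helper name unique_headers is hypothetical, and the input list is the header text reported in the question rather than a live scrape.

```python
def unique_headers(headers):
    """Return the header names up to (not including) the first repeated name."""
    seen = set()
    result = []
    for name in headers:
        name = name.strip()
        if name in seen:
            break  # the second table's header row starts here
        seen.add(name)
        result.append(name)
    return result


# Header texts as listed in the question: both tables back to back.
scraped = [
    "Rank", "Movie", "Worldwide Box Office", "Domestic Box Office",
    "International Box Office", "DomesticShare",
] * 2

columns = unique_headers(scraped)
print(columns)
```

With the scraped code from the question, the same helper could be applied to columnlist before building the DataFrame, assuming both tables really share identical headers.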
Answer
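import requests
import pandas as pd
from io import StringIO
from bs4 import BeautifulSoup

# Read the URL
URL = "https://www.the-numbers.com/box-office-records/worldwide/all-movies/cumulative/released-in-2019"
data = requests.get(URL).text
# Parse the HTML
soup = BeautifulSoup(data, "html.parser")
# Find the tables you want (everything after the first table on the page)
table = soup.find_all("table")[1:]
# Read them into pandas; StringIO avoids the warning newer pandas versions raise for literal HTML strings
df = pd.read_html(StringIO(str(table)))
# Concatenate both tables
df = pd.concat([df[0], df[1]])
df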
   Rank  Movie                                       Worldwide Box Office  Domestic Box Office  International Box Office  DomesticShare
0  1     Avengers: Endgame                           $2,615,368,375        $771,368,375         $1,844,000,000            29.49%
1  2     Captain Marvel                              $1,122,281,059        $425,152,517         $697,128,542              37.88%
2  3     Liu Lang Di Qiu                             $692,163,684          NaN                  $692,163,684              NaN
3  4     How to Train Your Dragon: The Hidden World  $518,846,075          $160,346,075         $358,500,000              30.90%
4  5     Alita: Battle Angel                         $402,976,036          $85,710,210          $317,265,826              21.27%
5  6     Shazam!                                     $358,308,992          $138,067,613         $220,241,379              38.53%
This should do what you need: once pandas has read the right HTML tables, you only have to concatenate the two of them.
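One detail worth knowing about the concatenation step: pd.concat keeps each table's own row labels by default, so the merged frame can end up with repeated index values. The sketch below shows this on two small frames that merely stand in for the scraped tables. The rows are taken from the output above, but the way they are split into two frames and the reduced column set are assumptions made for illustration.

```python
import pandas as pd

columns = ["Rank", "Movie", "Worldwide Box Office"]

# Stand-ins for the two scraped tables (illustrative split, not the site's).
first = pd.DataFrame(
    [[1, "Avengers: Endgame", "$2,615,368,375"],
     [2, "Captain Marvel", "$1,122,281,059"]],
    columns=columns,
)
second = pd.DataFrame(
    [[3, "Liu Lang Di Qiu", "$692,163,684"],
     [4, "How to Train Your Dragon: The Hidden World", "$518,846,075"]],
    columns=columns,
)

# Default behaviour: each frame keeps its own 0..n-1 labels.
stacked = pd.concat([first, second])
print(stacked.index.tolist())

# ignore_index=True renumbers the rows of the merged frame from 0.
merged = pd.concat([first, second], ignore_index=True)
print(merged.index.tolist())
```

If a continuous index matters later (for example when selecting rows by label), passing ignore_index=True to the pd.concat call in the answer is one way to get it.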