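python - How to parse an HTML table that is built from divs rather than a table element in Python
Problem description
I am trying to parse the table from this website. I started with the Username column, and with help from Stack Overflow I was able to get the Username content using the following code: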
from bs4 import BeautifulSoup

with open("Top 50 TikTok users sorted by Followers - Socialblade TikTok Stats _ TikTok Statistics.html", "r", encoding="utf-8") as file:
    # Pass the raw file contents; str(file.readlines()) would parse the repr of a list.
    soup = BeautifulSoup(file.read(), "html.parser")

tiktok = []
for tag in soup.select("div div:nth-of-type(n+5) > div > a"):
    tiktok.append(tag.text)
This gives me:
['addison rae',
'Bella Poarch',
'Zach King',
'TikTok',
'Spencer X',
'Will Smith',
'Loren Gray',
'dixie',
'Michael Le',
'Jason Derulo',
'Riyaz',
.
.
.
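My final goal is a table with the columns [Rank, Grade, Username, Uploads, Followers, Following, Likes].
I have read some articles on parsing HTML tables in Python with BeautifulSoup and pandas, but they did not work here, because the data is not defined as a table in the page source. What are the alternatives for getting it as a table in Python?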
Solution
You can use the following code to load the HTML from a file into a soup and then parse the rows into a DataFrame:
import pandas as pd
from bs4 import BeautifulSoup

with open("page.html", "r", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

data = []
# Each row is a div whose inline style contains one of the two alternating background colors.
for div in soup.select('div[style*="fafafa"], div[style*="f8f8f8"]'):
    data.append(
        [
            d.get_text(strip=True)
            # Only direct child divs are cells; the first eight are the columns.
            for d in div.find_all("div", recursive=False)[:8]
        ]
    )

df = pd.DataFrame(
    data,
    columns=[
        "Rank",
        "Grade",
        "Username",
        "Uploads",
        "Followers",
        "Following",
        "Likes",
        "Interactions",
    ],
)
print(df)
df.to_csv("data.csv", index=False)
Prints:
    Rank  Grade           Username  Uploads    Followers  Following          Likes  Interactions
0    1st    A++    charli d’amelio    1,755  113,600,000      1,210  9,200,000,000            --
1    2nd    A++        addison rae    1,411   79,900,000      2,454  5,100,000,000            --
2    3rd    A++       Bella Poarch      282   63,600,000        588  1,400,000,000            --
3    4th    A++          Zach King      277   58,800,000         41    723,400,000            --
4    5th    A++             TikTok      139   52,900,000        495    250,300,000            91
5    6th    A++          Spencer X    1,250   52,700,000      7,206  1,300,000,000            --
6    7th    A++         Will Smith       73   52,500,000         23    314,400,000            --
7    8th    A++         Loren Gray    2,805   52,100,000        221  2,800,000,000            --
8    9th    A++              dixie      120   51,200,000      1,267  2,900,000,000            --
9   10th    A++         Michael Le    1,158   47,400,000         93  1,300,000,000            --
10  11th     A+       Jason Derulo      675   44,900,000         12  1,000,000,000            --
11  12th     A+              Riyaz    2,056   44,100,000         43  2,100,000,000            --
12  13th     A+  Kimberly Loaiza ✨    1,150   41,000,000        123  2,200,000,000            --
13  14th     A+       Brent Rivera      955   37,800,000        272  1,200,000,000            --
14  15th     A+           cznburak    1,301   37,300,000          1    688,700,000            --
15  16th     A+           The Rock       42   36,200,000          1    200,300,000            --
16  17th     A+      James Charles      238   36,200,000        148    881,400,000            --
17  18th     A+          BabyAriel    2,365   35,300,000        326  1,900,000,000            --
18  19th     A+          JoJo Siwa    1,206   33,500,000        346  1,100,000,000            --
19  20th     A+              avani    5,347   33,300,000      5,003  2,400,000,000            --
20  21st     A+          GIL CROES      693   32,900,000        454    803,200,000            --
21  22nd     A+      Faisal shaikh      461   32,200,000         --  2,000,000,000            --
22  23rd     A+                BTS       39   32,000,000         --    557,100,000           255
23  24th     A+           LILHUDDY    4,187   30,500,000      8,652  1,600,000,000            --
24  25th     A+       Stokes Twins      548   30,100,000         21    781,000,000            --
25  26th     A+                Joe    1,487   29,800,000      8,402  1,200,000,000            --
26  27th     A+                ROD    1,792   29,500,000        536  1,700,000,000            --
27  28th     A+                         899   29,400,000        216  1,700,000,000            --
28  29th     A+       Kylie Jenner       69   29,400,000         14    318,800,000            --
29  30th     A+         Junya/じゅんや    2,823   29,000,000      1,934    533,800,000        12,200
30  31st     A+                 YZ      816   28,900,000        563    554,700,000            --
31  32nd     A+       Arishfa Khan    2,026   28,600,000         27  1,100,000,000            --
32  33rd     A+   Lucas and Marcus    1,248   28,500,000        158    806,500,000            --
33  34th     A+    jannat_zubair29    1,054   28,200,000          6    746,300,000            47
34  35th     A+     Nisha Guragain    1,751   28,000,000         33    756,300,000            --
35  36th     A+       Selena Gomez       40   27,800,000         17     82,300,000            --
36  37th     A+            Kris HC    1,049   27,800,000      1,405  1,200,000,000            --
37  38th     A+        flighthouse    4,200   27,600,000        488  2,300,000,000            --
38  39th     A+         wigofellas    1,251   27,500,000        812    707,200,000            --
39  40th     A+   Savannah LaBrant    1,860   27,300,000        155  1,400,000,000            --
40  41st     A+          noah beck    1,395   26,900,000      2,297  1,700,000,000            --
41  42nd     A+         Liza Koshy      155   26,700,000        104    321,900,000            --
42  43rd     A+   Kirya Kolesnikov    1,338   26,400,000         78    543,200,000            --
43  44th     A+        Awez Darbar    2,708   26,100,000        208  1,100,000,000            --
44  45th     A+       Carlos Feria    2,522   25,700,000        138  1,200,000,000            --
45  46th     A+       Kira Kosarin      837   25,700,000        401    447,000,000            --
46  47th     A+      Naim Darrechi    2,634   25,300,000        527  2,200,000,000            --
47  48th     A+      Josh Richards    1,899   24,900,000      9,847  1,600,000,000            --
48  49th     A+             Q Park      231   24,800,000          3    294,100,000            --
49  50th     A+       TikTok_India      186   24,500,000        191     40,100,000            --
and saves the result to data.csv.
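Every value in the printed frame is a string, including counts such as "113,600,000" and the "--" placeholder. If numeric columns are needed for sorting or arithmetic, one possible follow-up is sketched below. The two-row sample frame and the helper name to_number are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Hand-made sample with the same string formatting as the scraped columns.
df = pd.DataFrame(
    {
        "Username": ["TikTok", "BTS"],
        "Uploads": ["139", "39"],
        "Followers": ["52,900,000", "32,000,000"],
        "Following": ["495", "--"],
    }
)


def to_number(column: pd.Series) -> pd.Series:
    """Strip thousands separators; values that are not numbers (such as "--") become NaN."""
    cleaned = column.str.replace(",", "", regex=False)
    return pd.to_numeric(cleaned, errors="coerce")


numeric_columns = ["Uploads", "Followers", "Following"]
df[numeric_columns] = df[numeric_columns].apply(to_number)
print(df.dtypes)
```

A column that contains a "--" entry ends up as floating point, because NaN has no integer representation in the default dtype.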
Edit: to also get the username from the URL:
import pandas as pd
from bs4 import BeautifulSoup

with open("page.html", "r", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

data = []
for div in soup.select('div[style*="fafafa"], div[style*="f8f8f8"]'):
    data.append(
        [
            d.get_text(strip=True)
            for d in div.find_all("div", recursive=False)[:8]
        ]
        # The last path segment of the row's first link is the URL username.
        + [div.a["href"].split("/")[-1]]
    )

df = pd.DataFrame(
    data,
    columns=[
        "Rank",
        "Grade",
        "Username",
        "Uploads",
        "Followers",
        "Following",
        "Likes",
        "Interactions",
        "URL username",
    ],
)
print(df)
df.to_csv("data.csv", index=False)
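To see what the selectors are doing without the saved page, the same approach can be tried on a small inline snippet. The markup below is hypothetical: it only mimics the structure the selectors imply (row divs with an alternating background color in their inline style, one child div per cell, a link in the username cell), and the names and hrefs are made up. The sketch also guards against a row without a link, which the code above does not.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page may differ in detail.
html = """
<div id="wrapper">
  <div style="background: #fafafa;">
    <div>1st</div><div>A++</div>
    <div><a href="/tiktok/user/example-one">Example One</a></div>
  </div>
  <div style="background: #f8f8f8;">
    <div>2nd</div><div>A+</div>
    <div>No Link Here</div>
  </div>
  <div style="background: #ffffff;"><div>not a data row</div></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
# style*="..." matches any div whose style attribute contains the substring.
for div in soup.select('div[style*="fafafa"], div[style*="f8f8f8"]'):
    # recursive=False keeps only direct children, so nested divs are not counted as cells.
    cells = [d.get_text(strip=True) for d in div.find_all("div", recursive=False)]
    link = div.find("a", href=True)
    slug = link["href"].rstrip("/").split("/")[-1] if link else None
    rows.append(cells + [slug])

print(rows)
```

The white-background div is skipped because its style contains neither color substring, and the second row gets None instead of raising an error on the missing link.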