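python - Scraping a web page table with Beautiful Soup in Python
Question
I am trying to scrape a table and its contents from an Apple Wikipedia page. I am using Beautiful Soup to extract the data. I have the following code: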
import requests
from bs4 import BeautifulSoup

appleurl = "https://en.m.wikipedia.org/wiki/Timeline_of_Apple_Inc._products"

_content = requests.get(appleurl)
soup = BeautifulSoup(_content.content)
_table = soup.findChildren('table')
# Walk the rows of the first table and print every <td> cell
rows = _table[0].findChildren(['th', 'tr'])
for row in rows:
    cells = row.findChildren('td')
    for cell in cells:
        value = cell.string
        print("The value in this cell is %s" % value)
I get the following values:
The value in this cell is 1976
The value in this cell is April 11
The value in this cell is Apple I
The value in this cell is Apple I
The value in this cell is September 1, 1977
The value in this cell is 1977
The value in this cell is April 1
The value in this cell is Apple II
The value in this cell is Apple II
The value in this cell is June 1, 1979
The value in this cell is 1978
The value in this cell is June 1
The value in this cell is Disk II
The value in this cell is Drives
The value in this cell is May 1, 1984
The value in this cell is 1979
The value in this cell is June 1
The value in this cell is Apple II Plus
The value in this cell is Apple II series
The value in this cell is December 1, 1982
The value in this cell is None
The value in this cell is None
The value in this cell is None
The value in this cell is Bell & Howell Disk II
The value in this cell is None
The value in this cell is Apple SilenType
The value in this cell is Printers
The value in this cell is October 1, 1982
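The problem is that the year 1979 has several models, and in my case they are not extracted. I need all the models for 1979. My code extracts the data fine when there is only one row per year. What should I do when a year spans multiple rows in the first table of the page I linked? The values I need are the year, the release date, and the model; the other two columns can be dropped. I would greatly appreciate any help.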
Solution
You can do this simply with pandas, using pd.read_html():
import pandas as pd

# read_html returns a list of DataFrames, one per table found on the page
df = pd.read_html('https://en.m.wikipedia.org/wiki/Timeline_of_Apple_Inc._products')[0]
print(df[['Year', 'Release Date', 'Model']])
Output:
Year Release Date Model
0 1976 April 11 Apple I
1 1977 April 1 Apple II
2 1978 June 1 Disk II
3 1979 June 1 Apple II Plus
4 1979 June 1 Apple II EuroPlus
5 1979 June 1 Apple II J-Plus
6 1979 June 1 Bell & Howell
7 1979 June 1 Bell & Howell Disk II
8 1979 June 1 Apple SilenType
Update: to process all the tables.
import pandas as pd

dfs = pd.read_html('https://en.m.wikipedia.org/wiki/Timeline_of_Apple_Inc._products')
# Print the three columns of interest for every table
for df in dfs:
    print(df[['Year', 'Release Date', 'Model']])
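The loop above indexes the same three columns in every table, so it raises a KeyError as soon as a table lacks one of them. The following is a minimal sketch of one way to guard against that, not part of the original answer. The helper name pick_columns, the stand-in frames, and the "Family" column are made up for illustration; in practice the list would come from pd.read_html().

```python
import pandas as pd

# Column names taken from the answer; adjust if the page uses other headers.
WANTED = ["Year", "Release Date", "Model"]


def pick_columns(frames, wanted=WANTED):
    """Return the wanted columns from each frame that contains all of them.

    Hypothetical helper: frames missing any wanted column are skipped
    instead of raising KeyError.
    """
    selected = []
    for frame in frames:
        if set(wanted).issubset(frame.columns):
            selected.append(frame[wanted])
    return selected


# Stand-in frames used in place of a live pd.read_html() call.
tables = [
    pd.DataFrame(
        {
            "Year": [1976],
            "Release Date": ["April 11"],
            "Model": ["Apple I"],
            "Family": ["Apple I"],  # made-up extra column
        }
    ),
    pd.DataFrame({"Note": ["a table without product data"]}),
]

for frame in pick_columns(tables):
    print(frame)
```

Only the first stand-in frame passes the check, so only it is printed, reduced to the three wanted columns.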
If you want everything in a single DataFrame, use this code.
import pandas as pd

dfs = pd.read_html('https://en.m.wikipedia.org/wiki/Timeline_of_Apple_Inc._products')
# Collect the three columns from every table, then concatenate once.
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.)
parts = [df[['Year', 'Release Date', 'Model']] for df in dfs]
dffinal = pd.concat(parts, ignore_index=True)
print(dffinal)
Output:
Year Release Date Model
0 1976 April 11 Apple I
1 1977 April 1 Apple II
2 1978 June 1 Disk II
3 1979 June 1 Apple II Plus
4 1979 June 1 Apple II EuroPlus
5 1979 June 1 Apple II J-Plus
6 1979 June 1 Bell & Howell
7 1979 June 1 Bell & Howell Disk II
8 1979 June 1 Apple SilenType
9 1980 September 1 Apple III
10 1980 September 1 Modem IIB (Novation CAT)
11 1980 September 1 Printer IIA (Centronics 779)
12 1980 September 1 Monitor III
13 1980 September 1 Monitor II (various third party)
14 1980 September 1 Disk III
15 1981 September 1 Apple ProFile
16 1981 December 1 Apple III Revised[1]
17 1982 October 1 Apple Dot Matrix Printer
18 1982 October 1 Apple Daisy Wheel Printer
19 1983 January 1 Apple IIe
20 1983 January 1 Apple Lisa[2]
21 1983 December 1 Apple III Plus
22 1983 December 1 Apple ImageWriter
23 1984 January 1 Apple Lisa 2
24 1984 January 24 Macintosh (128K)
25 1984 January 24 Macintosh External Disk Drive (400K)
26 1984 January 24 Apple Modem 300
27 1984 January 24 Apple Modem 1200
28 1984 April 1 Apple IIc
29 1984 April 1 Apple Scribe Printer
.. ... ... ...
606 2019 March 18 iPad Mini (5th gen)
607 2019 March 19 iMac with Retina 4K display (21.5") (Early 2019)
608 2019 March 19 iMac with Retina 5K display (27") (Early 2019)
609 2019 March 20 AirPods (2nd gen)
610 2019 May 21 MacBook Pro with Touch Bar (4th gen) (13") (Mi...
611 2019 May 21 MacBook Pro with Touch Bar (4th gen) (15") (Mi...
612 2019 May 28 iPod Touch (7th gen)
613 2019 July 9 MacBook Air (13") (2019)
614 2019 July 9 Macbook Pro with Touch Bar (4th gen) (13") (Mi...
615 2019 September 20 Apple Watch Series 5
616 2019 September 20 Apple Watch Hermès Series 5
617 2019 September 20 Apple Watch Nike Series 5
618 2019 September 20 Apple Watch Edition Series 5
619 2019 September 20 iPhone 8 (128 GB)
620 2019 September 20 iPhone 8 Plus (128 GB)
621 2019 September 20 iPhone 11
622 2019 September 20 iPhone 11 Pro
623 2019 September 20 iPhone 11 Pro Max
624 2019 September 25 iPad (2019)
625 2019 October 30 AirPods Pro
626 2019 November 13 MacBook Pro with Touch Bar (16")
627 2019 December 10 Mac Pro (Late 2019)
628 2019 December 10 Pro Display XDR
629 2020 March 18 NaN
630 2020 March 18 iPad Pro (11") (2nd gen)
631 2020 March 18 iPad Pro (12.9") (4th gen)
632 2020 March 18 Magic Keyboard for iPad Pro
633 2020 March 18 MacBook Air (Early 2020)
634 2020 April 24 iPhone SE (2nd gen)
635 2020 May 4 MacBook Pro with Magic Keyboard (Mid 2020)
[636 rows x 3 columns]
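For readers who would rather stay with Beautiful Soup, the missing 1979 rows in the question come from cells that carry a rowspan attribute: the year appears only in the first row of its group, so later rows contain fewer td elements. The None values most likely appear because cell.string returns None when a cell has more than one child node; get_text() avoids that. Below is a minimal sketch of carrying rowspan cells down into the rows they cover. The function name expand_rowspans and the small inline table are assumptions for illustration; the sketch ignores colspan and nested tables.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the Wikipedia table; the real page has more columns.
HTML = """
<table>
  <tr><th>Year</th><th>Release Date</th><th>Model</th></tr>
  <tr><td>1978</td><td>June 1</td><td>Disk II</td></tr>
  <tr><td rowspan="3">1979</td><td rowspan="3">June 1</td><td>Apple II Plus</td></tr>
  <tr><td>Apple II EuroPlus</td></tr>
  <tr><td>Apple SilenType</td></tr>
</table>
"""


def expand_rowspans(table):
    """Return the table as a list of rows, repeating rowspan cells downward."""
    grid = []
    pending = {}  # column index -> (text, number of further rows to fill)
    for tr in table.find_all("tr"):
        cells = iter(tr.find_all(["td", "th"]))
        row, col = [], 0
        while True:
            if col in pending:
                # This column is still covered by a cell from an earlier row.
                text, left = pending[col]
                row.append(text)
                if left > 1:
                    pending[col] = (text, left - 1)
                else:
                    del pending[col]
                col += 1
                continue
            cell = next(cells, None)
            if cell is None:
                break
            text = cell.get_text(" ", strip=True)
            span = int(cell.get("rowspan", 1))
            row.append(text)
            if span > 1:
                pending[col] = (text, span - 1)
            col += 1
        grid.append(row)
    return grid


soup = BeautifulSoup(HTML, "html.parser")
rows = expand_rowspans(soup.find("table"))
for row in rows:
    print(row)
```

Every 1979 row now carries its own year and release date, which is the same filling-in that pd.read_html() performs in the answer above.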