python - Web scraping tables from multiple pages with BeautifulSoup
Problem description
I am trying to scrape tables for different weeks from multiple pages, but I keep getting the results from this one URL, https://www.boxofficemojo.com/weekly/2018W52/. This is the code I am using:
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from time import sleep
from random import randint
import re

pages = np.arange(2015, 2016)
week = ['01','02','03','04','05','06','07','08','09']
week1 = np.arange(10, 11)
for x in week1:
    week.append(x)

all_rows = []
for page in pages:
    for x in week:
        url = requests.get('https://www.boxofficemojo.com/weekly/' + str(page) + 'W' + str(x) + '/')
        soup = BeautifulSoup(url.text, 'lxml')
        mov = soup.find_all("table", attrs={"class": "a-bordered"})
        print("Number of tables on site: ", len(mov))
        table1 = mov[0]
        body = table1.find_all("tr")
        head = body[0]
        body_rows = body[1:]
        sleep(randint(2, 10))
        for row_num in range(len(body_rows)):
            row = []
            for row_item in body_rows[row_num].find_all("td"):
                aa = re.sub("(\xa0)|(\n)|,", "", row_item.text)
                row.append(aa)
            all_rows.append(row)
        print('Page', page, x)
Solution
Assuming you want all 52 weeks per year, why not generate the links up front, retrieve each table with pandas, build a list of those DataFrames, and concatenate them into one final DataFrame?
import pandas as pd

def get_table(url):
    # The year and week number sit at fixed offsets in these URLs.
    year = int(url[37:41])
    week_yr = int(url[42:44])
    df = pd.read_html(url)[0]
    df['year'] = year
    df['week_yr'] = week_yr
    return df

years = ['2015', '2016']
weeks = [str(i).zfill(2) for i in range(1, 53)]
base = 'https://www.boxofficemojo.com/weekly'
urls = [f'{base}/{year}W{week}' for week in weeks for year in years]
results = pd.concat([get_table(url) for url in urls])
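As a side note, the fixed character offsets `url[37:41]` and `url[42:44]` break if the base URL ever changes length. A more robust sketch parses the last path segment instead; `parse_year_week` is a hypothetical helper name, not part of the answer above:

```python
import re
from urllib.parse import urlsplit

def parse_year_week(url):
    # Pull "2015W07" out of the last path segment and split it into
    # (2015, 7), rather than relying on fixed character positions.
    segment = urlsplit(url).path.rstrip('/').split('/')[-1]
    m = re.fullmatch(r'(\d{4})W(\d{2})', segment)
    if m is None:
        raise ValueError(f'unexpected URL format: {url}')
    return int(m.group(1)), int(m.group(2))

print(parse_year_week('https://www.boxofficemojo.com/weekly/2015W07'))  # (2015, 7)
```

This also tolerates a trailing slash, which the fixed-offset version does not.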
You might then consider ways of speeding this up, for example:
from multiprocessing import Pool, cpu_count
import pandas as pd

def get_table(url):
    year = int(url[37:41])
    week_yr = int(url[42:44])
    df = pd.read_html(url)[0]
    df['year'] = year
    df['week_yr'] = week_yr
    return df

if __name__ == '__main__':
    years = ['2015', '2016']
    weeks = [str(i).zfill(2) for i in range(1, 53)]
    base = 'https://www.boxofficemojo.com/weekly'
    urls = [f'{base}/{year}W{week}' for week in weeks for year in years]

    with Pool(cpu_count() - 1) as p:
        results = p.map(get_table, urls)
    final = pd.concat(results)
    print(final)
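Since each `get_table` call spends most of its time waiting on the network rather than computing, a thread pool is a lighter-weight alternative to worker processes (no pickling, no `__main__` guard needed). A minimal sketch, using a stand-in `fetch_len` function (hypothetical, so the example runs offline) in place of the real `get_table`:

```python
from multiprocessing.pool import ThreadPool

def fetch_len(url):
    # Placeholder for the I/O-bound work; in real use, map
    # get_table over the URLs instead.
    return len(url)

urls = [f'https://www.boxofficemojo.com/weekly/2015W{w:02d}' for w in range(1, 5)]
with ThreadPool(4) as pool:
    results = pool.map(fetch_len, urls)
print(results)
```

`ThreadPool` exposes the same `map` API as `Pool`, so swapping between threads and processes is a one-line change.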