python - Scraping YouTube comments with Selenium on Google Colab is slow
Problem description
I am scraping video comments from YouTube using Selenium on Google Colab. Whether the video has 1000 comments or 38 comments, the whole scrape takes about an hour. What can I do to improve my code and speed it up? Thanks!
Thanks to the following resources, which helped me build the code. 1: https://colab.research.google.com/drive/1GFJKhpOju_WLAgiVPCzCGTBVGMkyAjtk#scrollTo=4Ylzd_l6fXGv 2: https://www.tfzx.net/article/2719742.html 3: https://towardsdatascience.com/web-scraping-using-selenium-python-8a60f4cf40ab
Output #1:
Completed scraping 1000 comments in 3089.1585 seconds from YouTube Entertainment Tonight channel.
Output #2:
Completed scraping 38 comments in 3011.5525 seconds from YouTube Anne Schmidt channel.
Input:
!apt-get update
!apt install chromium-chromedriver
%pip install selenium
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
wd = webdriver.Chrome('chromedriver',options=chrome_options)
import time
from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
def scrapecomments(url):
    tic = time.perf_counter()
    wait = WebDriverWait(wd, 15)
    wd.get(url)
    data1 = []  # comment authors
    data2 = []  # comment texts
    data3 = []  # like counts
    # Scroll to the bottom of the page 200 times so YouTube lazy-loads comments
    for item in range(200):
        wait.until(EC.visibility_of_element_located((By.TAG_NAME, "body"))).send_keys(Keys.END)
        time.sleep(15)
    for author in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#author-text"))):
        if len(data1) == 1000:
            break
        else:
            data1.append(author.text)
    for comment in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#content-text"))):
        data2.append(comment.text)
    for likes in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#vote-count-middle"))):
        data3.append(likes.text)
    def merge(list1, list2, list3):
        merged_list = [(list1[i], list2[i], list3[i]) for i in range(0, len(list1))]
        return merged_list
    alldata = merge(data1, data2, data3)
    comments = pd.DataFrame(alldata, columns=['user_id', 'comment', 'likes'])
    comments['rank'] = comments.reset_index().index + 1
    # find_element_by_id is deprecated; use find_element(By.ID, ...) instead
    channel_name = wd.find_element(By.ID, 'channel-name').text
    comments['source'] = channel_name
    toc = time.perf_counter()
    print(f"Completed scraping {len(data1)} comments in {toc - tic:0.4f} seconds from YouTube {channel_name} channel.")
    return comments
Solution
Part of the overhead may be that you reinstall chromedriver and selenium every time you run the code. More importantly, the scroll loop always runs `range(200)` iterations with a `time.sleep(15)` in each one, so the function sleeps for roughly 200 × 15 = 3000 seconds no matter how many comments the video has — which is why both runs (38 and 1000 comments) took about the same ~3000+ seconds.
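A minimal sketch of one way to cut the waiting time: stop scrolling as soon as the number of loaded comments stops growing, instead of always sleeping 200 × 15 seconds. The function name and pause value are my own; the loop takes two callables so it can be exercised without a browser, and the commented-out wiring at the bottom assumes the `wd`, `By`, and `Keys` objects from the question.

```python
import time

def scroll_until_stable(scroll_once, get_count, pause=2.0, max_rounds=200):
    """Scroll repeatedly and re-count loaded comments; stop once the count
    stops growing (i.e. all comments are loaded) rather than after a fixed
    number of 15-second sleeps."""
    last_count = -1
    rounds = 0
    while rounds < max_rounds:
        scroll_once()
        time.sleep(pause)          # give the page time to load the next batch
        count = get_count()
        rounds += 1
        if count == last_count:    # nothing new appeared -> we are done
            break
        last_count = count
    return rounds

# Hypothetical wiring into the Selenium code from the question:
# scroll_until_stable(
#     scroll_once=lambda: wd.find_element(By.TAG_NAME, "body").send_keys(Keys.END),
#     get_count=lambda: len(wd.find_elements(By.CSS_SELECTOR, "#content-text")),
#     pause=2,
# )
```

With this shape, a video that only has 38 comments stops after a handful of scrolls, while a long comment thread still scrolls as far as needed (bounded by `max_rounds`). A shorter pause combined with the count check is usually enough, since the loop simply retries until nothing new loads.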