Scraping YouTube comments with Selenium on Google Colab is slow

Problem description

I am using Selenium on Google Colab to scrape video comments from YouTube. Whether I scrape 1000 comments or 38, the whole run takes about an hour. What can I change in my code to make it faster? Thanks!

Thanks to the following resources that helped me build the code: 1: https://colab.research.google.com/drive/1GFJKhpOju_WLAgiVPCzCGTBVGMkyAjtk#scrollTo=4Ylzd_l6fXGv 2: https://www.tfzx.net/article/2719742.html 3: https://towardsdatascience.com/web-scraping-using-selenium-python-8a60f4cf40ab

Output #1:

Completed scraping 1000 comments in 3089.1585 seconds from YouTube Entertainment Tonight channel.

Output #2:

Completed scraping 38 comments in 3011.5525 seconds from YouTube Anne Schmidt channel.

Input:

!apt-get update
!apt install chromium-chromedriver
%pip install selenium
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
wd = webdriver.Chrome('chromedriver',options=chrome_options)
import time
from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import time

def scrapecomments(url):
  tic = time.perf_counter()
  wait = WebDriverWait(wd,15)
  wd.get(url)
  data1=[]
  data2=[]
  data3=[]
  for item in range(200):
    wait.until(EC.visibility_of_element_located((By.TAG_NAME, "body"))).send_keys(Keys.END)
    time.sleep(15)
  for author in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#author-text"))):
    if len(data1) == 1000:
      break
    else:
      data1.append(author.text)
  for comment in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#content-text"))):
    data2.append(comment.text)
  for likes in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#vote-count-middle"))):
    data3.append(likes.text)

  def merge(list1, list2, list3):
    merged_list = [(list1[i], list2[i], list3[i]) for i in range(0, len(list1))] 
    return merged_list
  
  alldata = merge(data1,data2,data3)
  comments = pd.DataFrame(alldata,columns=['user_id','comment','likes'])
  comments['rank'] = comments.reset_index().index +1
  channel_name = wd.find_element_by_id('channel-name').text
  comments['source'] = channel_name
  toc = time.perf_counter()
  print(f"Completed scraping {len(data1)} comments in {toc - tic:0.4f} seconds from YouTube {channel_name} channel.")
  return comments

Tags: python, selenium, youtube, google-colaboratory

Solution


Part of the overhead may also be that you reinstall chromedriver and selenium (the `apt-get`/`pip` cells at the top) every time you run the code.
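The timings in the question also point at the scroll loop: the code always does 200 scroll iterations with a 15-second sleep each, i.e. roughly 3000 seconds no matter how many comments the video has, which matches both reported runs (~3089 s and ~3011 s). A minimal sketch of an alternative (the helper name and the `pause`/`max_rounds` parameters are my own; it assumes the same `wd` driver object as in the question) that exits early once the page height stops growing:

```python
import time

def scroll_until_stable(wd, pause=2, max_rounds=200):
    """Scroll the page to the bottom repeatedly, stopping as soon as
    the document height stops growing (i.e. no new comments loaded),
    instead of always sleeping 200 x 15 seconds."""
    last_height = wd.execute_script("return document.documentElement.scrollHeight")
    for _ in range(max_rounds):
        # Jump to the bottom so YouTube lazy-loads the next batch of comments.
        wd.execute_script("window.scrollTo(0, document.documentElement.scrollHeight)")
        time.sleep(pause)  # give the new batch time to render
        new_height = wd.execute_script("return document.documentElement.scrollHeight")
        if new_height == last_height:
            break  # height stable: assume all comments are loaded
        last_height = new_height
```

Inside `scrapecomments`, the `for item in range(200): ... time.sleep(15)` loop could then be replaced by `scroll_until_stable(wd)`, so a video with only 38 comments finishes scrolling in seconds rather than ~50 minutes.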

