How to extract metadata for more than 20,000 videos from a channel with the YouTube Data API v3?

Problem Description

I want to use the YouTube Data API v3 to extract video metadata (in particular the title and publish date) for all videos in a channel. Currently, I can only extract the details of the last 20,000 videos using the playlistItems() endpoint. Is there a way to extract metadata for more than 20,000 videos from a single channel?

Here is the Python code I use to extract the metadata of those 20,000 videos.

from googleapiclient.discovery import build

youtube = build('youtube', 'v3', developerKey="YOUTUBE_API_KEY")
channelId = "CHANNEL_ID"

# Resolve the channel's "uploads" playlist, which lists every public upload.
contentdata = youtube.channels().list(id=channelId, part='contentDetails').execute()
playlist_id = contentdata['items'][0]['contentDetails']['relatedPlaylists']['uploads']
videos = []
next_page_token = None

# Page through the uploads playlist, 50 items at a time.
while True:
    res = youtube.playlistItems().list(playlistId=playlist_id, part='snippet',
                                       maxResults=50, pageToken=next_page_token).execute()
    videos += res['items']
    next_page_token = res.get('nextPageToken')
    if next_page_token is None:
        break

# Extract the video id from each playlist item.
video_ids = [v['snippet']['resourceId']['videoId'] for v in videos]
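
To then turn those ids into the titles and publish dates I am after, here is a minimal follow-up sketch, assuming the same youtube client as above (videos().list accepts up to 50 comma-separated ids per call):

# A sketch, not part of the code above: fetch title and publish date
# for each collected id, 50 ids per request (the videos().list maximum).
metadata = []
for i in range(0, len(video_ids), 50):
    batch = video_ids[i:i + 50]
    res = youtube.videos().list(id=','.join(batch), part='snippet').execute()
    for item in res['items']:
        metadata.append((item['snippet']['title'], item['snippet']['publishedAt']))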

A solution to this problem could be either forcing the API to extract metadata for more than 20,000 videos from the channel, or being able to specify the time period in which the videos were uploaded. That way, the code could be run over multiple time periods, one after another, to extract the metadata of all videos.
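
As a sketch of that time-window idea: publishedAfter / publishedBefore are only available on the search().list endpoint, which takes RFC 3339 timestamps (the dates below are placeholders), costs far more quota than playlistItems(), and is itself capped at roughly 500 results per query:

# A sketch of the time-window idea (placeholder dates, not from the question).
res = youtube.search().list(
    channelId=channelId,
    part='id',
    type='video',
    order='date',
    publishedAfter='2020-01-01T00:00:00Z',   # placeholder window start
    publishedBefore='2021-01-01T00:00:00Z',  # placeholder window end
    maxResults=50,
).execute()
window_ids = [item['id']['videoId'] for item in res['items']]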

Tags: python, youtube, youtube-api, youtube-data-api

Solution


"A solution to this problem could be either forcing the API to extract metadata for more than 20,000 videos from the channel, or being able to specify the time period in which the videos were uploaded. That way, the code could be run over multiple time periods, one after another, to extract the metadata of all videos."

I tried this without success.

My workaround for this YouTube API limitation is the following Python script:

It consists of forging the requests that are made while browsing the "Videos" tab of a YouTube channel.

import urllib.request, json, subprocess
from urllib.error import HTTPError

# Fetch a URL and return the response body, even for HTTP error statuses.
def getURL(url):
    res = ""
    try:
        res = urllib.request.urlopen(url).read()
    except HTTPError as e:
        res = e.read()
    return res.decode('utf-8')

# Run a shell command and return its standard output.
def execCmd(cmd):
    return subprocess.check_output(cmd, shell=True)

youtuberId = 'CHANNEL_ID'  # channel id to scrape
videosIds = []             # all video ids discovered so far
errorsCount = 0            # number of retried continuation requests

# Collect every 11-character video id found in a raw YouTube response.
def retrieveVideosFromContent(content):
    global videosIds
    wantedPattern = '"videoId":"'
    # Normalize the different quoting styles YouTube uses around "videoId".
    content = content.replace('"videoId": "', wantedPattern).replace("'videoId': '", wantedPattern)
    for contentPart in content.split(wantedPattern):
        videoId = contentPart.split('"')[0]
        if videoId not in videosIds and len(videoId) == 11:
            videosIds += [videoId]

# Request one page of the channel's videos through YouTube's internal
# "browse" endpoint, exactly as the web client does, and return the
# continuation token for the next page ('' when there is none).
def scrape(token):
    global errorsCount
    # YOUR_KEY can be obtained by browsing a channel's videos section (like https://www.youtube.com/c/BenjaminLoison/videos) while checking your "Network" tab, opened for instance with Ctrl+Shift+E
    cmd = 'curl -s \'https://www.youtube.com/youtubei/v1/browse?key=YOUR_KEY\' -H \'Content-Type: application/json\' --data-raw \'{"context":{"client":{"clientName":"WEB","clientVersion":"2.20210903.05.01"}},"continuation":"' + token + '"}\''
    cmd = cmd.replace('"', '\\"').replace("\'", '"')
    content = execCmd(cmd).decode('utf-8')

    retrieveVideosFromContent(content)

    data = json.loads(content)
    # YouTube occasionally answers without results; retry with the same token.
    if not 'onResponseReceivedActions' in data:
        print("no token found, let's try again")
        errorsCount += 1
        return scrape(token)
    entry = data['onResponseReceivedActions'][0]['appendContinuationItemsAction']['continuationItems'][-1]
    # On the final page the last item is no longer a continuation item.
    if not 'continuationItemRenderer' in entry:
        return ''
    newToken = entry['continuationItemRenderer']['continuationEndpoint']['continuationCommand']['token']
    return newToken

# Load the channel's Videos page and pull the embedded initial data blob.
url = 'https://www.youtube.com/channel/' + youtuberId + '/videos'
content = getURL(url)
content = content.split('var ytInitialData = ')[1].split(";</script>")[0]
dataFirst = json.loads(content)

retrieveVideosFromContent(content)

# The last grid item of the Videos tab holds the first continuation token.
token = dataFirst['contents']['twoColumnBrowseResultsRenderer']['tabs'][1]['tabRenderer']['content']['sectionListRenderer']['contents'][0]['itemSectionRenderer']['contents'][0]['gridRenderer']['items'][-1]['continuationItemRenderer']['continuationEndpoint']['continuationCommand']['token']

# Follow continuation tokens until the channel is exhausted.
while True:
    print(len(videosIds), token)
    if token == '':
        break
    token = scrape(token)

print(len(videosIds), videosIds)

Remember to change the values of "CHANNEL_ID" and "YOUR_KEY". Note also that the script runs the curl command through your shell.
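
If shelling out to curl is a problem, the same browse request could be issued directly from Python instead. A minimal sketch using urllib.request, not part of the original script (YOUR_KEY is again the key captured from the "Network" tab):

import urllib.request, json

# A sketch, not the original script's method: issue the same internal
# browse request without curl. YOUR_KEY is the captured API key.
def browse(token):
    payload = json.dumps({
        'context': {'client': {'clientName': 'WEB', 'clientVersion': '2.20210903.05.01'}},
        'continuation': token,
    }).encode('utf-8')
    req = urllib.request.Request(
        'https://www.youtube.com/youtubei/v1/browse?key=YOUR_KEY',
        data=payload,
        headers={'Content-Type': 'application/json'})
    with urllib.request.urlopen(req) as res:
        return res.read().decode('utf-8')

Its return value could then replace the execCmd(cmd) call inside scrape(), which avoids the shell quoting differences between platforms.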

