"NoneType" error when scraping YouTube metadata (new to Python)

Question

I am parsing the metadata of YouTube videos, and I get this error:

{'title': 'Ethics in the age of technology | Juan Enriquez | TEDxBerlin', 'view': 66458, 'tags': 'TEDxTalks, English, Technology, Business, Ethics, International Affairs, Social Science, Society'}
Traceback (most recent call last):
  File "youtubescraper.py", line 38, in <module>
    list_output.append(get_video_metadata("https://www.youtube.com/watch?v=" + data[i][0])) # index the contents of the ith member of the list of lists
  File "youtubescraper.py", line 27, in get_video_metadata
    video_meta["view"] = int(''.join([ c for c in soup.find("span", attrs={"class": "view-count"}).text if c.isdigit() ]))
AttributeError: 'NoneType' object has no attribute 'text'

However, it only happens after a varying number of entries have been output, for example:

{'title': 'Ethics in the age of technology | Juan Enriquez | TEDxBerlin', 'view': 66454, 'tags': 'TEDxTalks, English, Technology, Business, Ethics, International Affairs, Social Science, Society'}
{'title': 'What is Technology Ethics?', 'view': 13905, 'tags': 'Santa Clara University, Markkula Center for Applied Ethics, Silicon Valley Ethics, technology ethics, human enhancements, artificial intelligence, synthetic biology'}
{'title': 'The ethical dilemma we face on AI and autonomous tech | Christine Fox | TEDxMidAtlantic', 'view': 63516, 'tags': 'TEDxTalks, English, United States, Technology, AI, Big Data, Big problems, Decision making, Government, Hack, Morality, Policy, Progress, Public Policy, Robots'}
{'title': 'Does Technology Need to Be Ethical?', 'view': 27446, 'tags': 'anil dash, technology, tech, ethics, facebook, mark zuckerberg, information, passwords, politics. aspen ideas, entrepreneur'}
Traceback (most recent call last):
  File "youtubescraper.py", line 39, in <module>
    list_output.append(get_video_metadata("https://www.youtube.com/watch?v=" + data[i][0])) # index the contents of the ith member of the list of lists
  File "youtubescraper.py", line 28, in get_video_metadata
    video_meta["view"] = int(''.join([ c for c in soup.find("span", attrs={"class": "view-count"}).text if c.isdigit() ]))
AttributeError: 'NoneType' object has no attribute 'text'

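As a beginner, I'm curious about two things: (1) why this NoneType error happens, and in particular why I don't get it on the same entry every time, and (2) how I can fix it.

Here is my full code. First, the list of entries in my CSV file (these are video ids; I've included only 50 entries for brevity, but note that this program is really slow):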

[['iiAirfn-lBI'], ['UISZx6K9enQ'], ['3oE88_6jAwc'], ['RoZ-WF5Z_1E'], ['RZB9PtUHfBE'], ['WMfbqHlrtEQ'], ['CR9kb6lvBmk'], ['duVoVZnWB2w'], ['1LyacmzB1Og'], ['X1mBUO8O654'], ['q-nhktqMoT4'], ['5YeK72q2CRQ'], ['AOqIiofqp3E'], ['IjRm6rxWyns'], ['phEuB6aYOho'], ['bZn0IfOb61U'], ['2SdpzTZTznw'], ['k1a2larfMIA'], ['S8a1DascnZg'], ['ixIoDYVfKA0'], ['X5WXSK_wm6s'], ['IFKhlxgoU58'], ['tzSoC_3y09s'], ['rTVta3BZfHU'], ['UbQlS6Rer6w'], ['EmzEnjrMB1Y'], ['Ji4Eu30VRoc'], ['pE5sF9SrnSI'], ['LYRKqnLeIDo'], ['p9VUBKiVM-k'], ['BxsJYEElcXY'], ['dBmUf5lQR98'], ['VYexAg2J6Kc'], ['eK_rhC25GAg'], ['cOCxbLaIu48'], ['_awIE_9vkV8'], ['P0fwUtChkd0'], ['cRx4ezY5KaY'], ['Hq--Sbdo9ls'], ['luabqeFCxzI'], ['mX0CpKWbAXU'], ['DkbXhhipcis'], ['LhzF-Y8xXBc'], ['5KZx81crb48'], ['KFgrns8dsis'], ['V3qVKndb7wA'], ['PoFAQi_DWsE'], ['3uYrPrn8Bzo'], ['YBIz8ouOMGk'], ['etr6sUHILKY']]

And the Python:

from requests_html import HTMLSession
from bs4 import BeautifulSoup as bs
import pandas as pd
import csv

# load my csv into memory
with open('ethcsvtest.csv', newline='') as f:
    reader = csv.reader(f)
    data = list(reader) # data is a list of single-member lists with video ids

# get video metadata from beautifulsoup
session = HTMLSession()

def get_video_metadata(url): # defining the function to get video metadata
    response = session.get(url)
    # exe javascript
    response.html.render(sleep=1)

    soup = bs(response.html.html, "html.parser")

    video_meta = {}

    # get titles
    video_meta["title"] = soup.find("h1").text.strip()

    # Video Views
    video_meta["view"] = int(''.join([ c for c in soup.find("span", attrs={"class": "view-count"}).text if c.isdigit() ]))

    # Video Tags
    video_meta["tags"] = ', '.join([ meta.attrs.get("content") for meta in soup.find_all("meta", {"property": "og:video:tag"}) ])

    print(video_meta)

# loop through the array of ids, put the metadata dictionaries into a list, then turn the list of dictionaries into a dataframe
i = 0
list_output = []
while i < len(data):
    list_output.append(get_video_metadata("https://www.youtube.com/watch?v=" + data[i][0])) # index the contents of the ith member of the list of lists
    i += 1
df = pd.DataFrame(list_output) # turn list of dictionaries into dataframe
print(df)

Tags: python, beautifulsoup, python-requests

Answer


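For some reason, your soup can't find any span with class == "view-count", so soup.find(...) returns None, and accessing .text on None raises the AttributeError. You can debug this by, for example, printing soup each time it fails, or by using a visual debugger to see what is going on. I don't have your exact data, so I can't debug it myself :).

Here is a solution with a try/except block. It tries to find that span, but if it can't, we don't let the program exit; instead we report it and move on: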
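To narrow down which lookup is failing on a given page, a small check like the one below may help. It is only a sketch: the function name missing_elements and the labels are made up for illustration, and it assumes soup is the BeautifulSoup object built in the question. One plausible reason the failing entry changes between runs (an assumption, not something verified here) is that the page's JavaScript has not always finished rendering within the sleep=1 passed to render().

```python
def missing_elements(soup):
    """Return labels of the expected elements that soup.find() could not locate.

    `soup` is assumed to be a BeautifulSoup object, as in the question.
    """
    # The tag/attribute pairs are copied from the question's code;
    # the labels on the left are hypothetical names used only for reporting.
    lookups = {
        "title": ("h1", {}),
        "view-count": ("span", {"class": "view-count"}),
    }
    missing = []
    for label, (tag, attrs) in lookups.items():
        # find() returns None when nothing matches, which is what later
        # produces "'NoneType' object has no attribute 'text'".
        if soup.find(tag, attrs=attrs) is None:
            missing.append(label)
    return missing
```

Calling print(url, missing_elements(soup)) right after the soup is built would show, per URL, whether the view-count span was present at parse time.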

...
try:
    video_meta["view"] = int(''.join(c for c in soup.find("span", attrs={"class": "view-count"}).text if c.isdigit()))
except AttributeError:
    print(f"couldn't find the view count thing for this url: {url}")
    return  # exit the function, not the entire program via erroring as before!
...
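An alternative to catching the exception is to test for None explicitly, which avoids hiding unrelated AttributeErrors. The sketch below is one way this could look; extract_view_count and collect_metadata are hypothetical helper names, and fetch stands in for the question's get_video_metadata. Note also that the function as posted in the question prints video_meta but never returns it, so list_output would be filled with None even when no error occurs.

```python
def extract_view_count(soup):
    """Return the view count as an int, or None if the span is missing."""
    span = soup.find("span", attrs={"class": "view-count"})
    if span is None:
        # Element not present in the parsed HTML: report "no value"
        # instead of raising AttributeError on span.text.
        return None
    digits = "".join(c for c in span.text if c.isdigit())
    return int(digits) if digits else None


def collect_metadata(ids, fetch):
    """Call fetch(url) for each video id and keep only non-None results.

    `fetch` is assumed to return a dict, or None when scraping failed.
    """
    results = []
    for video_id in ids:
        meta = fetch("https://www.youtube.com/watch?v=" + video_id)
        if meta is not None:
            results.append(meta)
    return results
```

Inside get_video_metadata this would be used as video_meta["view"] = extract_view_count(soup), with return video_meta as the last line of the function, so that the DataFrame is built from real dictionaries.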
