python - "NoneType" error when scraping YouTube metadata (new to Python)
Question
I am parsing the metadata of YouTube videos. I get this error:
{'title': 'Ethics in the age of technology | Juan Enriquez | TEDxBerlin', 'view': 66458, 'tags': 'TEDxTalks, English, Technology, Business, Ethics, International Affairs, Social Science, Society'}
Traceback (most recent call last):
  File "youtubescraper.py", line 38, in <module>
    list_output.append(get_video_metadata("https://www.youtube.com/watch?v=" + data[i][0])) # index the contents of the ith member of the list of lists
  File "youtubescraper.py", line 27, in get_video_metadata
    video_meta["view"] = int(''.join([ c for c in soup.find("span", attrs={"class": "view-count"}).text if c.isdigit() ]))
AttributeError: 'NoneType' object has no attribute 'text'
However, the error only appears after a varying number of entries have been printed, as in this run:
{'title': 'Ethics in the age of technology | Juan Enriquez | TEDxBerlin', 'view': 66454, 'tags': 'TEDxTalks, English, Technology, Business, Ethics, International Affairs, Social Science, Society'}
{'title': 'What is Technology Ethics?', 'view': 13905, 'tags': 'Santa Clara University, Markkula Center for Applied Ethics, Silicon Valley Ethics, technology ethics, human enhancements, artificial intelligence, synthetic biology'}
{'title': 'The ethical dilemma we face on AI and autonomous tech | Christine Fox | TEDxMidAtlantic', 'view': 63516, 'tags': 'TEDxTalks, English, United States, Technology, AI, Big Data, Big problems, Decision making, Government, Hack, Morality, Policy, Progress, Public Policy, Robots'}
{'title': 'Does Technology Need to Be Ethical?', 'view': 27446, 'tags': 'anil dash, technology, tech, ethics, facebook, mark zuckerberg, information, passwords, politics. aspen ideas, entrepreneur'}
Traceback (most recent call last):
  File "youtubescraper.py", line 39, in <module>
    list_output.append(get_video_metadata("https://www.youtube.com/watch?v=" + data[i][0])) # index the contents of the ith member of the list of lists
  File "youtubescraper.py", line 28, in get_video_metadata
    video_meta["view"] = int(''.join([ c for c in soup.find("span", attrs={"class": "view-count"}).text if c.isdigit() ]))
AttributeError: 'NoneType' object has no attribute 'text'
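As a newbie, I'm curious about two things: 1. why this NoneType error happens (and in particular why I don't get it on the same entry every time), and 2. how I can fix it.
Here is my full code. First, the list of entries from my CSV file (these are video IDs; I've included only 50 entries for brevity, but note that this program is really slow):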
[['iiAirfn-lBI'], ['UISZx6K9enQ'], ['3oE88_6jAwc'], ['RoZ-WF5Z_1E'], ['RZB9PtUHfBE'], ['WMfbqHlrtEQ'], ['CR9kb6lvBmk'], ['duVoVZnWB2w'], ['1LyacmzB1Og'], ['X1mBUO8O654'], ['q-nhktqMoT4'], ['5YeK72q2CRQ'], ['AOqIiofqp3E'], ['IjRm6rxWyns'], ['phEuB6aYOho'], ['bZn0IfOb61U'], ['2SdpzTZTznw'], ['k1a2larfMIA'], ['S8a1DascnZg'], ['ixIoDYVfKA0'], ['X5WXSK_wm6s'], ['IFKhlxgoU58'], ['tzSoC_3y09s'], ['rTVta3BZfHU'], ['UbQlS6Rer6w'], ['EmzEnjrMB1Y'], ['Ji4Eu30VRoc'], ['pE5sF9SrnSI'], ['LYRKqnLeIDo'], ['p9VUBKiVM-k'], ['BxsJYEElcXY'], ['dBmUf5lQR98'], ['VYexAg2J6Kc'], ['eK_rhC25GAg'], ['cOCxbLaIu48'], ['_awIE_9vkV8'], ['P0fwUtChkd0'], ['cRx4ezY5KaY'], ['Hq--Sbdo9ls'], ['luabqeFCxzI'], ['mX0CpKWbAXU'], ['DkbXhhipcis'], ['LhzF-Y8xXBc'], ['5KZx81crb48'], ['KFgrns8dsis'], ['V3qVKndb7wA'], ['PoFAQi_DWsE'], ['3uYrPrn8Bzo'], ['YBIz8ouOMGk'], ['etr6sUHILKY']]
And the Python:
from requests_html import HTMLSession
from bs4 import BeautifulSoup as bs
import pandas as pd
import csv
# load my csv into memory
with open('ethcsvtest.csv', newline='') as f:
    reader = csv.reader(f)
    data = list(reader) # data is a list of single-member lists with video ids
# get video metadata from beautifulsoup
session = HTMLSession()
def get_video_metadata(url): # defining the function to get video metadata
    response = session.get(url)
    # execute the page's JavaScript
    response.html.render(sleep=1)
    soup = bs(response.html.html, "html.parser")
    video_meta = {}
    # get titles
    video_meta["title"] = soup.find("h1").text.strip()
    # Video Views
    video_meta["view"] = int(''.join([ c for c in soup.find("span", attrs={"class": "view-count"}).text if c.isdigit() ]))
    # Video Tags
    video_meta["tags"] = ', '.join([ meta.attrs.get("content") for meta in soup.find_all("meta", {"property": "og:video:tag"}) ])
    print(video_meta)
    return video_meta # return the dictionary so the caller can collect it
# loop through the array of ids, put the metadata dictionaries into a list, then turn the list of dictionaries into a dataframe
i = 0
list_output = []
while i < len(data):
    list_output.append(get_video_metadata("https://www.youtube.com/watch?v=" + data[i][0])) # index the contents of the ith member of the list of lists
    i += 1
df = pd.DataFrame(list_output) # turn list of dictionaries into dataframe
print(df)
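Solution
For some reason, your soup can't find any span with class == "view-count" on that page. When nothing matches, soup.find(...) returns None, and accessing .text on None raises the AttributeError you see. You can debug this by, for example, printing soup each time, or by using a visual debugger to see what is going on. I don't have the exact data, so I can't debug it myself :).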
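To make the None case concrete, here is a minimal sketch of a guard you could put around the value returned by soup.find(...). The helper name parse_view_count is hypothetical (it is not part of BeautifulSoup), and SimpleNamespace objects stand in for real tags so the example runs without fetching any page:

```python
from types import SimpleNamespace

def parse_view_count(element):
    """Return the digits in element.text as an int, or None if unavailable.

    `element` is whatever soup.find(...) returned: a tag-like object with a
    .text attribute, or None when no matching tag was found.
    """
    if element is None:  # no matching span on this page
        return None
    digits = "".join(c for c in element.text if c.isdigit())
    return int(digits) if digits else None  # text with no digits also yields None

# Stand-ins for the two possible results of soup.find(...)
found = SimpleNamespace(text="66,458 views")
missing = None

print(parse_view_count(found))
print(parse_view_count(missing))
```

In the scraper this would be called as parse_view_count(soup.find("span", attrs={"class": "view-count"})), leaving the "view" entry as None for pages where the span is absent instead of crashing.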
Here is a solution using a try/except block. It tries to find that span, but if it can't, we don't let the program exit; instead we report the problem and continue:
...
try:
    video_meta["view"] = int(''.join(c for c in soup.find("span", attrs={"class": "view-count"}).text if c.isdigit()))
except AttributeError:
    print(f"couldn't find the view count thing for this url: {url}")
    return # exit the function, not the entire program via erroring as before!
...
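Note that the bare return above makes the function return None for a failed page, so the calling loop needs to cope with that. Since the failure hits a different video on each run, it may be transient (for instance, the page not having finished rendering within sleep=1; that is an assumption, not something confirmed here). Under that assumption, a rough sketch of retrying and then dropping failures could look like this. call_with_retries and flaky_fetch are hypothetical names; flaky_fetch merely simulates a get_video_metadata call that fails twice before succeeding:

```python
import time

def call_with_retries(func, attempts=3, delay=0.0):
    """Call func() up to `attempts` times; return the first non-None result."""
    for attempt in range(attempts):
        result = func()
        if result is not None:
            return result
        if attempt < attempts - 1:
            time.sleep(delay)  # wait a little before trying again
    return None  # every attempt failed

# Hypothetical stand-in for get_video_metadata: fails twice, then succeeds.
calls = {"count": 0}

def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        return None
    return {"title": "example", "view": 123}

# One page that eventually works, and one None for a page that never did.
raw_results = [call_with_retries(flaky_fetch), None]

# Drop the failures before building the DataFrame.
list_output = [r for r in raw_results if r is not None]
print(list_output)
```

In the real script, func would be something like lambda: get_video_metadata(url), and a non-zero delay would give the page more time between attempts.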