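python - Error thrown when putting a string into a list
Problem description
I am currently building a YouTube web scraper to collect comments.
I want to scrape the comments and put them into a dataframe. My code can print the text, but I cannot get the text into a dataframe. When I check the type of the output, it is <class 'str'>. I can get the text with this code: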
try:
    # Extract the elements storing the usernames and comments.
    username_elems = driver.find_elements_by_xpath('//*[@id="author-text"]')
    comment_elems = driver.find_elements_by_xpath('//*[@id="content-text"]')
except exceptions.NoSuchElementException:
    error = "Error: Double check selector OR "
    error += "element may not yet be on the screen at the time of the find operation"
    print(error)
for com_text in comment_elems:
    print(com_text.text)
If I check the type of the text at the end of the function with this code:
for com_text in comment_elems:
    print(type(com_text.text))
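then the result is <class 'str'>. But I still cannot put it into a dataframe.
When I try to put this <class 'str'> object into the dataframe, I get the error: TypeError: 'WebElement' object does not support item assignment
This is the code I use when I try to put the text into the dataframe: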
for username, comment in zip(username_elems, comment_elems):
    comment_section['comment'] = comment.text
    data.append(comment_section)
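I am hoping there is a way to convert the <class 'str'> object into a regular string type, or some other step I can take to extract the text from the object.
This is my full code: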
def gitscrape(url):
    # Note: replace argument with absolute path to the driver executable.
    driver = webdriver.Chrome('chromedriver/windows/chromedriver.exe')
    # Navigates to the URL, maximizes the current window, and
    # then suspends execution for (at least) 5 seconds (this gives time for the page to load).
    driver.get(url)
    driver.maximize_window()
    time.sleep(5)
    #empty subjects
    comment_section =[]
    comment_data = []
    try:
        # Extract the elements storing the video title and
        # comment section.
        title = driver.find_element_by_xpath('//*[@id="container"]/h1/yt-formatted-string').text
        comment_section = driver.find_element_by_xpath('//*[@id="comments"]')
    except exceptions.NoSuchElementException:
        # Note: Youtube may have changed their HTML layouts for videos, so raise an error for sanity sake in case the
        # elements provided cannot be found anymore.
        error = "Error: Double check selector OR "
        error += "element may not yet be on the screen at the time of the find operation"
        print(error)
    # Scroll into view the comment section, then allow some time
    # for everything to be loaded as necessary.
    driver.execute_script("arguments[0].scrollIntoView();", comment_section)
    time.sleep(7)
    # Scroll all the way down to the bottom in order to get all the
    # elements loaded (since Youtube dynamically loads them).
    last_height = driver.execute_script("return document.documentElement.scrollHeight")
    while True:
        # Scroll down 'til "next load".
        driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")
        # Wait to load everything thus far.
        time.sleep(2)
        # Calculate new scroll height and compare with last scroll height.
        new_height = driver.execute_script("return document.documentElement.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
    # One last scroll just in case.
    driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")
    try:
        # Extract the elements storing the usernames and comments.
        username_elems = driver.find_elements_by_xpath('//*[@id="author-text"]')
        comment_elems = driver.find_elements_by_xpath('//*[@id="content-text"]')
    except exceptions.NoSuchElementException:
        error = "Error: Double check selector OR "
        error += "element may not yet be on the screen at the time of the find operation"
        print(error)
    # for com_text in comment_elems:
    #     print(type(com_text.text)
    #     data.append(comment_section)
    for username, comment in zip(username_elems, comment_elems):
        comment_section['comment'] = comment.text
        data.append(comment_section)
    video1_comments = pd.DataFrame(data)
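Solution
Your error occurs on the line comment_section['comment'] = comment.text. In your text you write that you get the error when trying to put a string into a dataframe, but neither comment_section nor comment is a dataframe. In your title you write that adding a string to a list throws the error, but comment_section is not a list either (if it were, the syntax would not make any sense). Code is very sensitive to what you are actually doing, so whether you have a dataframe or a list makes a big difference.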
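What type is comment_section, actually? If you scroll up in your code, the last assignment to it is: comment_section = driver.find_element_by_xpath('//*[@id="comments"]')
So comment_section is in fact neither a dataframe nor a list, but a WebElement! Now the error you get also makes sense. It says TypeError: 'WebElement' object does not support item assignment, and indeed you are trying to assign comment.text to the 'comment' key of the WebElement comment_section, which a WebElement does not support. The string itself is not the problem: comment.text is already an ordinary Python str.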
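The rebinding can be reproduced without a browser. The sketch below is only an illustration: FakeWebElement is a hypothetical stand-in, not Selenium's class, and it is assumed to behave like a WebElement only in that it has a .text attribute and no support for item assignment.

```python
class FakeWebElement:
    """Hypothetical stand-in for a Selenium WebElement (assumption, not the real class)."""

    def __init__(self, text):
        self.text = text


# The name starts out bound to a list, as in the question...
comment_section = []
# ...but this later assignment rebinds it to an element, so the list is gone.
comment_section = FakeWebElement("all comments")

# The text pulled from an element is already a plain Python string.
text_is_plain_str = isinstance(comment_section.text, str)

try:
    # Same shape as the failing line in the question.
    comment_section['comment'] = "some comment text"
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(text_is_plain_str)
print(error_message)
```

The TypeError names the type of the object on the left-hand side of the assignment, which is why the traceback in the question mentions WebElement rather than str.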
You can fix this by not overwriting comment_section and using a different name for the element instead.
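One way this could look is sketched below, under a few assumptions: FakeElement is a hypothetical stand-in for the elements returned by find_elements_by_xpath (only .text is used), collect_comments and the column names "username" and "comment" are made up for the illustration, and a fresh dict is built per comment so the rows do not share one object. Note also that the question's code appends to data, while the list it defines is called comment_data.

```python
from dataclasses import dataclass


@dataclass
class FakeElement:
    """Hypothetical stand-in for a scraped element; only .text is used here."""
    text: str


def collect_comments(username_elems, comment_elems):
    """Build one dict per comment instead of assigning into a single shared object."""
    rows = []
    for username, comment in zip(username_elems, comment_elems):
        rows.append({
            "username": username.text.strip(),  # strip surrounding whitespace
            "comment": comment.text,
        })
    return rows


# Example data standing in for the two lists found on the page.
username_elems = [FakeElement(" alice "), FakeElement("bob")]
comment_elems = [FakeElement("Nice video"), FakeElement("Thanks!")]

rows = collect_comments(username_elems, comment_elems)
print(rows)
```

In the original function, the element found by the '//*[@id="comments"]' XPath would be stored under a separate name (for example comments_container, a name chosen here) and passed to scrollIntoView, and the resulting list of dicts could then be handed to pd.DataFrame(rows), which creates one column per dict key.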