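python-3.x - How to make f.write() write NA when BeautifulSoup finds no data?
Problem description
My goal is to scrape some specific data from multiple Khan Academy profile pages and put the data in a CSV file.
Here is the code that scrapes one specific profile page and writes the result to a CSV file: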
from bs4 import BeautifulSoup
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://www.khanacademy.org/profile/DFletcher1990/')
r.html.render(sleep=5)
soup=BeautifulSoup(r.html.html,'html.parser')
user_info_table=soup.find('table', class_='user-statistics-table')
dates,points,videos=[tr.find_all('td')[1].text for tr in user_info_table.find_all('tr')]
user_socio_table=soup.find_all('div', class_='discussion-stat')
data = {}
for gettext in user_socio_table:
    category = gettext.find('span')
    category_text = category.text.strip()
    number = category.previousSibling.strip()
    data[category_text] = number
filename = "khanscraptry1.csv"
f = open(filename, "w")
headers = "date, points, videos, questions, votes, answers, flags, project_request, project_replies, comments, tips_thx\n"
f.write(headers)
f.write(dates + "," + points.replace("," , "") + "," + videos + "," + data['questions'] + "," + data['votes'] + "," + data['answers'] + "," + data['flags raised'] + "," + data['project help requests'] + "," + data['project help replies'] + "," + data['comments'] + "," + data['tips and thanks'] + "\n")
f.close()
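This code works fine with this specific link ('https://www.khanacademy.org/profile/DFletcher1990/').
Now, when I change the link to another Khan Academy profile, for example 'https://www.khanacademy.org/profile/Kkasparas/', I get this error:
KeyError: 'project help requests'
This is expected, because the profile "https://www.khanacademy.org/profile/Kkasparas/" has no project help requests value (and no project help replies either).
So data['project help requests'] and data['project help replies'] do not exist and cannot be written to the CSV file.
My goal is to run this script on many profile pages. So I would like to know how to put NA wherever a variable has no data, and then write that NA to the CSV file.
In other words: I want my script to work for any kind of user profile page.
Thanks a lot for your contributions :)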
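Solution
You can define a list containing all possible keys and, before writing to the file, set the value of every key that is missing from data to 'NA'.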
full_data_keys=['questions','votes','answers','flags raised','project help requests','project help replies','comments','tips and thanks']
for header_value in full_data_keys:
    if header_value not in data.keys():
        data[header_value]='NA'
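The same idea can be sketched a little more compactly with dict.get, which returns a default when a key is absent. This is only an illustration: the helper name fill_missing and the sample values are hypothetical and not part of the original code.

```python
# Keys we expect on every profile, in the order they appear in the CSV
FULL_DATA_KEYS = [
    'questions', 'votes', 'answers', 'flags raised',
    'project help requests', 'project help replies',
    'comments', 'tips and thanks',
]


def fill_missing(data, keys=FULL_DATA_KEYS, placeholder='NA'):
    """Return a new dict holding every key in `keys`.

    Keys that are absent from `data` get `placeholder` instead of
    raising KeyError later on. (Hypothetical helper, for illustration.)
    """
    return {key: data.get(key, placeholder) for key in keys}


# Made-up values standing in for a scraped profile with missing stats
scraped = {'questions': '25', 'votes': '100'}
row = fill_missing(scraped)
print(row['project help requests'])
```

Unlike the loop above, this builds a new dictionary instead of modifying data in place, and it also fixes the column order.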
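Also, a gentle reminder to provide complete working code in your question. user_socio_table was not defined in the question; I had to look at your previous question to find it.
The full code is: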
from bs4 import BeautifulSoup
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://www.khanacademy.org/profile/Kkasparas/')
r.html.render(sleep=5)
soup=BeautifulSoup(r.html.html,'html.parser')
user_info_table=soup.find('table', class_='user-statistics-table')
dates,points,videos=[tr.find_all('td')[1].text for tr in user_info_table.find_all('tr')]
data = {}
user_socio_table=soup.find_all('div', class_='discussion-stat')
for gettext in user_socio_table:
    category = gettext.find('span')
    category_text = category.text.strip()
    number = category.previousSibling.strip()
    data[category_text] = number
full_data_keys=['questions','votes','answers','flags raised','project help requests','project help replies','comments','tips and thanks']
for header_value in full_data_keys:
    if header_value not in data.keys():
        data[header_value]='NA'
filename = "khanscraptry1.csv"
f = open(filename, "w")
headers = "date, points, videos, questions, votes, answers, flags, project_request, project_replies, comments, tips_thx\n"
f.write(headers)
f.write(dates + "," + points.replace("," , "") + "," + videos + "," + data['questions'] + "," + data['votes'] + "," + data['answers'] + "," + data['flags raised'] + "," + data['project help requests'] + "," + data['project help replies'] + "," + data['comments'] + "," + data['tips and thanks'] + "\n")
f.close()
Output - khanscraptry1.csv
date, points, videos, questions, votes, answers, flags, project_request, project_replies, comments, tips_thx
6 years ago,1527829,1123,25,100,2,0,NA,NA,0,0
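As an alternative to concatenating strings by hand, the standard library's csv.DictWriter can fill missing columns itself through its restval argument. The following is a minimal sketch under assumptions: the column names reuse the scraped keys rather than the shortened headers above, the function name write_rows is hypothetical, and the sample row is made up for illustration.

```python
import csv
import io

# Column order for the CSV; the stat names match the keys scraped into `data`
FIELDNAMES = [
    'date', 'points', 'videos',
    'questions', 'votes', 'answers', 'flags raised',
    'project help requests', 'project help replies',
    'comments', 'tips and thanks',
]


def write_rows(rows, out):
    """Write dict rows to `out`; any missing column becomes 'NA'."""
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES, restval='NA')
    writer.writeheader()
    writer.writerows(rows)


# Illustrative row without the two "project help" entries
sample = {
    'date': '6 years ago', 'points': '1,527,829', 'videos': '1123',
    'questions': '25', 'votes': '100', 'answers': '2',
    'flags raised': '0', 'comments': '0', 'tips and thanks': '0',
}

buffer = io.StringIO()  # stands in for a real file in this sketch
write_rows([sample], buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

With a real file, open it with newline='' as the csv documentation recommends, for example open(filename, 'w', newline=''). The writer also quotes any field that contains a comma, so the thousands separators in points do not have to be stripped.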
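If user_info_table might not exist on a page (soup.find returns None in that case), replace the line that unpacks dates, points and videos with the following: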
if user_info_table is not None:
    dates,points,videos=[tr.find_all('td')[1].text for tr in user_info_table.find_all('tr')]
else:
    dates=points=videos='NA'
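The check above covers a missing table. A table that exists but has fewer than three rows would still fail when unpacking into dates, points and videos. One possible way to cover both cases is to pad the list before unpacking; this is a sketch, and the helper name unpack_stats is hypothetical.

```python
def unpack_stats(values, expected=3, placeholder='NA'):
    """Return exactly `expected` items.

    `values` may be None (table not found) or a list that is shorter
    or longer than expected. Missing positions are filled with
    `placeholder`; extra items are dropped.
    """
    values = list(values or [])
    # A negative repeat count yields an empty list, so long inputs are fine
    return values[:expected] + [placeholder] * (expected - len(values))


# Table missing entirely
dates, points, videos = unpack_stats(None)
print(dates, points, videos)

# Table present but with only two rows (made-up values)
dates, points, videos = unpack_stats(['6 years ago', '1,527,829'])
print(dates, points, videos)
```

In the scraper, the list comprehension over user_info_table.find_all('tr') would be passed in when the table exists, and None otherwise.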