python - How do I save a pandas DataFrame built from praw data to xlsx?
Problem description
I'm trying to save a DataFrame to xlsx in Colab. I fetch the data with praw:
sm = reddit.submission(url="https://www.reddit.com/r/AskReddit/comments/1irtkq/taxi_drivers_whats_the_deepest_secret_youve/")
sm.comments.replace_more(limit=0)
data = []
for top_level_comment in sm.comments.list():
    data.append([top_level_comment.body,
                 top_level_comment.author,
                 top_level_comment.score,
                 top_level_comment.created_utc,
                 top_level_comment.depth,
                 top_level_comment.id,
                 top_level_comment.parent_id])
df = pd.DataFrame(data, columns=['body', 'author', 'score', 'created_utc', 'depth', 'id', 'parent_id'])
df
Everything looks fine and I get all the data. But when I save it, I get an error from inside the praw library:
directory = '/content/downloads'
file_path = posixpath.join(directory, 'reddit.xlsx')
if not os.path.exists(directory):
    os.makedirs(directory)
with pd.ExcelWriter(file_path, engine='xlsxwriter') as writer:
    df.to_excel(writer, sheet_name='Sheet1', index=False)
    worksheet = writer.sheets['Sheet1']
    writer.save()
---------------------------------------------------------------------------
NotFound Traceback (most recent call last)
<ipython-input-9-b8157734da77> in <module>()
5
6 with pd.ExcelWriter(file_path, engine='xlsxwriter') as writer:
----> 7 df.to_excel(writer, sheet_name='Sheet1', index=False)
8 worksheet = writer.sheets['Sheet1']
10 frames
/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py in to_excel(self, excel_writer, sheet_name, na_rep, float_format, columns, header, index, index_label, startrow, startcol, engine, merge_cells, encoding, inf_rep, verbose, freeze_panes)
2254 startcol=startcol,
2255 freeze_panes=freeze_panes,
-> 2256 engine=engine,
2257 )
2258
/usr/local/lib/python3.6/dist-packages/pandas/io/formats/excel.py in write(self, writer, sheet_name, startrow, startcol, freeze_panes, engine)
737 startrow=startrow,
738 startcol=startcol,
--> 739 freeze_panes=freeze_panes,
740 )
741 if need_save:
/usr/local/lib/python3.6/dist-packages/pandas/io/excel/_xlsxwriter.py in write_cells(self, cells, sheet_name, startrow, startcol, freeze_panes)
212 wks.freeze_panes(*(freeze_panes))
213
--> 214 for cell in cells:
215 val, fmt = self._value_with_fmt(cell.val)
216
/usr/local/lib/python3.6/dist-packages/pandas/io/formats/excel.py in get_formatted_cells(self)
685 def get_formatted_cells(self):
686 for cell in itertools.chain(self._format_header(), self._format_body()):
--> 687 cell.val = self._format_value(cell.val)
688 yield cell
689
/usr/local/lib/python3.6/dist-packages/pandas/io/formats/excel.py in _format_value(self, val)
433 elif self.float_format is not None:
434 val = float(self.float_format % val)
--> 435 if getattr(val, "tzinfo", None) is not None:
436 raise ValueError(
437 "Excel does not support datetimes with "
/usr/local/lib/python3.6/dist-packages/praw/models/reddit/base.py in __getattr__(self, attribute)
31 """Return the value of `attribute`."""
32 if not attribute.startswith("_") and not self._fetched:
---> 33 self._fetch()
34 return getattr(self, attribute)
35 raise AttributeError(
/usr/local/lib/python3.6/dist-packages/praw/models/reddit/redditor.py in _fetch(self)
173
174 def _fetch(self):
--> 175 data = self._fetch_data()
176 data = data["data"]
177 other = type(self)(self._reddit, _data=data)
/usr/local/lib/python3.6/dist-packages/praw/models/reddit/redditor.py in _fetch_data(self)
170 name, fields, params = self._fetch_info()
171 path = API_PATH[name].format(**fields)
--> 172 return self._reddit.request("GET", path, params)
173
174 def _fetch(self):
/usr/local/lib/python3.6/dist-packages/praw/reddit.py in request(self, method, path, params, data, files)
630 """
631 return self._core.request(
--> 632 method, path, data=data, files=files, params=params
633 )
634
/usr/local/lib/python3.6/dist-packages/prawcore/sessions.py in request(self, method, path, data, files, json, params)
183 return self._request_with_retries(
184 data=data, files=files, json=json, method=method,
--> 185 params=params, url=url)
186
187
/usr/local/lib/python3.6/dist-packages/prawcore/sessions.py in _request_with_retries(self, data, files, json, method, params, url, retries)
128 retries, saved_exception, url)
129 elif response.status_code in self.STATUS_EXCEPTIONS:
--> 130 raise self.STATUS_EXCEPTIONS[response.status_code](response)
131 elif response.status_code == codes['no_content']:
132 return
NotFound: received 404 HTTP response
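I'm confused: I already have the data, so I shouldn't need any more HTTP requests.
I noticed that the last pandas frame in the traceback is about timezones. What is going wrong?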
raise ValueError(
"Excel does not support datetimes with "
"timezones. Please ensure that datetimes "
"are timezone unaware before writing to Excel."
)
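Solution
As you mentioned in the comments, the problem is that top_level_comment.author is of type Redditor, which pandas and/or the Excel format does not support. The traceback shows where the HTTP request comes from: while formatting each cell, pandas calls getattr(val, "tzinfo", None). A Redditor object loads its attributes lazily, so that lookup ends up in Redditor.__getattr__, which calls _fetch(), and that request comes back with a 404.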
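The same behaviour can be reproduced without praw or any network access. This is only a minimal sketch: LazyAuthor and SimulatedNotFound are hypothetical stand-ins written for illustration, not praw classes. It shows that the default passed to getattr() only covers AttributeError, so any other exception raised during a lazy lookup reaches the caller.

```python
class SimulatedNotFound(Exception):
    """Hypothetical stand-in for the NotFound error seen in the traceback."""


class LazyAuthor:
    """Hypothetical stand-in for a lazily loaded object such as a Redditor."""

    def __init__(self, name):
        self.name = name

    def __getattr__(self, attribute):
        # Python calls this only when normal lookup fails, e.g. for "tzinfo".
        if attribute.startswith("_"):
            raise AttributeError(attribute)
        # A real lazy object would perform an HTTP request here; we simulate
        # that request failing.
        raise SimulatedNotFound("received 404 HTTP response")

    def __str__(self):
        return self.name


author = LazyAuthor("some_user")

try:
    # Same kind of probe that pandas performs on every cell value.
    getattr(author, "tzinfo", None)
    outcome = "no error"
except SimulatedNotFound as exc:
    outcome = f"raised: {exc}"

# Converting to str first leaves nothing to load lazily.
safe_value = str(author)
print(outcome)
print(getattr(safe_value, "tzinfo", None))
```

A plain string has no lazy loading, so the same tzinfo probe on safe_value simply returns the default.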
To fix this, change top_level_comment.author to str(top_level_comment.author), which converts it to a string containing the author's username.
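Applied to the loop from the question, the change could look roughly like this. It is a sketch rather than a tested praw script: comment_to_row, COLUMNS, FakeAuthor and fake_comment are hypothetical names introduced here, and the fake objects only stand in for real praw comments so the snippet runs on its own.

```python
from types import SimpleNamespace

# Same column order as the DataFrame in the question.
COLUMNS = ['body', 'author', 'score', 'created_utc', 'depth', 'id', 'parent_id']


def comment_to_row(comment):
    """Return one comment as a list of plain values, in COLUMNS order."""
    return [
        comment.body,
        str(comment.author),  # username string instead of a Redditor object
        comment.score,
        comment.created_utc,
        comment.depth,
        comment.id,
        comment.parent_id,
    ]


class FakeAuthor:
    """Hypothetical stand-in for an author object; str() gives the name."""

    def __init__(self, name):
        self.name = name

    def __str__(self):
        return self.name


# Hypothetical stand-in for a praw comment, with made-up values.
fake_comment = SimpleNamespace(
    body="example text",
    author=FakeAuthor("some_user"),
    score=1,
    created_utc=0.0,
    depth=0,
    id="abc",
    parent_id="t3_xyz",
)

row = comment_to_row(fake_comment)
print(row)
```

With real data, the rows would be built as data = [comment_to_row(c) for c in sm.comments.list()] and passed to pd.DataFrame(data, columns=COLUMNS) as before. As far as I know, praw reports the author of a comment from a deleted account as None, in which case str() yields the text "None"; handle that case separately if it matters for your spreadsheet.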