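python-3.x - Reading a .txt file into a pandas DataFrame with newlines as the delimiter
Problem description
I want to extract some data from a text file into a DataFrame.
The text file looks like this: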
URL: http://www.nytimes.com/2016/06/30/sports/baseball/washington-nationals-max-scherzer-baffles-mets-completing-a-sweep.html

WASHINGTON — Stellar .... stretched thin.
“We were going t......e do anything.”
Wednesday’s ... starter.
“We’re n... work.”
The Mets did not scor....their 40-37 record.

URL: http://www.nytimes.com/2016/06/30/nyregion/mayor-de-blasios-counsel-to-leave-next-month-to-lead-police-review-board.html

Mayor Bill de .... Department.
The move.... April.
A civil ... conversations.
More... administration.

URL: http://www.nytimes.com/2016/06/30/nyregion/three-men-charged-in-killing-of-cuomo-administration-lawyer.html

In the early..., the Folk Nation.
As hundreds ... wounds.
For some...residents.
On Wednesd...killing.
One ...murder.
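It contains URLs and text from New York Times articles. I want to create a 2-column DataFrame: the first column holds the URL and the second holds the text.
The problem is that I can't handle the delimiter: there are two newlines between a URL and its corresponding text, but the text itself also contains single newlines.
I tried the code below, but instead of a 2-column DataFrame I got a single column with a new row for every newline, so the text is also split into separate paragraphs. I am using dask, by the way: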
df_csv = dd.read_csv(filename,sep="\n\n",header=None,engine='python')
Solution
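import pandas as pd

# read the whole file into one string
with open('ny.txt', encoding="utf8") as f:
    file = f.read()
url = []
text = []
# split the text at every blank line (two consecutive newlines);
# strip() avoids an empty trailing element if the file ends with blank lines
# elements at even indices (0, 2, ...) are the URLs
# elements at odd indices (1, 3, ...) are the text/content
for ind, line in enumerate(file.strip().split('\n\n')):
    if ind % 2 == 0:
        url.append(line)
    else:
        text.append(line)
# save to a dataframe
df = pd.DataFrame({'url': url, 'text': text})
df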
url text
0 URL: http://www.nytimes.com/2016/06/30/sports/... WASHINGTON — Stellar .... stretched thin.\n“We...
1 URL: http://www.nytimes.com/2016/06/30/nyregio... Mayor Bill de .... Department.\nThe move.... A...
2 URL: http://www.nytimes.com/2016/06/30/nyregio... In the early..., the Folk Nation.\nAs hundreds...
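The same alternating split can be tried without a file on disk. What follows is only a minimal sketch, assuming the blocks strictly alternate URL, text, URL, text. The `sample` string, its example.com URLs, and the `split_blocks` helper are hypothetical names made up for illustration and are not part of the original answer.

```python
import pandas as pd

# Hypothetical stand-in for the contents of ny.txt (shortened, made-up URLs).
sample = (
    "URL: http://example.com/a\n\n"
    "First paragraph.\nSecond paragraph.\n\n"
    "URL: http://example.com/b\n\n"
    "Only paragraph.\n"
)


def split_blocks(raw):
    """Split on blank lines and pair up alternating URL/text blocks."""
    blocks = raw.strip().split("\n\n")
    if len(blocks) % 2 != 0:
        # An odd count means a URL without text, or text with a blank line in it.
        raise ValueError("expected alternating URL/text blocks")
    # Even indices hold the URL lines, odd indices hold the article text.
    return pd.DataFrame({"url": blocks[0::2], "text": blocks[1::2]})


df = split_blocks(sample)
print(df.shape)
```

The length check is a design choice: it fails loudly when the blocks stop alternating, where a silent split would pair a URL with the wrong text.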
# Optional: remove the leading 'URL: ' prefix from each URL
df['url'] = df['url'].str.replace('URL: ', '', regex=False)
df
url text
0 http://www.nytimes.com/2016/06/30/sports/baseb... WASHINGTON — Stellar .... stretched thin.\n“We...
1 http://www.nytimes.com/2016/06/30/nyregion/may... Mayor Bill de .... Department.\nThe move.... A...
2 http://www.nytimes.com/2016/06/30/nyregion/thr... In the early..., the Folk Nation.\nAs hundreds...
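Splitting on every blank line breaks if an article's text itself contains a blank line. One possible alternative, shown here only as a sketch and not as the original answer's method, is to treat each line starting with `URL: ` as the start of a new record. The `sample` string, the `RECORD` pattern, and the `parse_records` helper are hypothetical names introduced for illustration, and the sketch assumes a URL never contains whitespace.

```python
import re

import pandas as pd

# Hypothetical sample in which the second article contains a blank line.
sample = (
    "URL: http://example.com/a\n\n"
    "First paragraph.\nSecond paragraph.\n\n"
    "URL: http://example.com/b\n\n"
    "Paragraph one.\n\nParagraph two.\n"
)

# A line starting with "URL: " opens a record; everything up to the next
# such line (or the end of the input) is that record's text.
RECORD = re.compile(
    r"^URL: (?P<url>\S+)\s*\n(?P<text>.*?)(?=^URL: |\Z)",
    re.MULTILINE | re.DOTALL,
)


def parse_records(raw):
    """Return a DataFrame with one row per 'URL: ...' record found in raw."""
    rows = [
        {"url": m.group("url"), "text": m.group("text").strip()}
        for m in RECORD.finditer(raw)
    ]
    return pd.DataFrame(rows, columns=["url", "text"])


df = parse_records(sample)
print(df)
```

Since the question mentions dask: if a dask DataFrame is needed afterwards, `dd.from_pandas(df, npartitions=1)` should convert the result. The partition count here is an arbitrary choice.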