Reading a .txt file into a pandas DataFrame, with newlines as the delimiter

Problem description

I want to extract some data from a text file into a DataFrame.

The text file looks like this:

URL: http://www.nytimes.com/2016/06/30/sports/baseball/washington-nationals-max-scherzer-baffles-mets-completing-a-sweep.html

WASHINGTON — Stellar .... stretched thin.
“We were going t......e do anything.”
Wednesday’s ... starter.
“We’re n... work.”
The Mets did not scor....their 40-37 record.

URL: http://www.nytimes.com/2016/06/30/nyregion/mayor-de-blasios-counsel-to-leave-next-month-to-lead-police-review-board.html

Mayor Bill de .... Department.
The move.... April.
A civil ... conversations.
More... administration.

URL: http://www.nytimes.com/2016/06/30/nyregion/three-men-charged-in-killing-of-cuomo-administration-lawyer.html

In the early..., the Folk Nation.
As hundreds ... wounds.
For some...residents.
On Wednesd...killing.
One ...murder.

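It contains URLs and text from New York Times articles. I want to create a two-column DataFrame, with the URL in the first column and the text in the second.

The problem is the delimiter: each URL is separated from its text by two newlines (a blank line), but the text itself also contains single newlines.

I tried the code below (I am using dask, by the way). Instead of a two-column DataFrame I got a single column with a new row at every newline, so the text was also split into separate paragraphs: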

df_csv = dd.read_csv(filename,sep="\n\n",header=None,engine='python')
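
One plausible reason for the single column, stated here as an assumption rather than confirmed dask behavior: CSV readers usually cut the input into rows at each newline before they look for `sep`, so a separator that itself contains `\n` never gets a chance to match. The sketch below uses plain string methods on a shortened, made-up sample (the example.com URLs are placeholders) to show the difference between a line-wise view and a split on the blank line.

```python
# Shortened stand-in for the file contents (illustrative sample only).
sample = (
    "URL: http://example.com/a\n"
    "\n"
    "First paragraph.\n"
    "Second paragraph.\n"
    "\n"
    "URL: http://example.com/b\n"
    "\n"
    "Only paragraph.\n"
)

# Line-wise view: roughly what a row-oriented reader works with.
# Every line, including the blank ones, is its own row, so the
# two-newline sequence is never visible as a separator.
rows = sample.splitlines()

# Block-wise view: splitting the whole string on the blank line keeps
# each article body, with its single newlines, in one piece.
blocks = sample.strip().split("\n\n")

print(len(rows))    # 8
print(len(blocks))  # 4
```

This suggests reading the file as one string and splitting it manually rather than relying on the `sep` argument.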

Tags: python-3.x, pandas, dask-dataframe

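Solution

Read the whole file into a single string, split it on blank lines, and assign the alternating blocks to the URL and text columns.
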
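import pandas as pd

# Read the whole file into one string
with open('ny.txt', encoding='utf8') as f:
    content = f.read()

url = []
text = []

# Split at every blank line (two consecutive newlines).
# Blocks at even indices (0, 2, ...) are the URLs;
# blocks at odd indices (1, 3, ...) are the article text.
# strip() drops leading/trailing newlines that would add an empty block.
for ind, block in enumerate(content.strip().split('\n\n')):
    if ind % 2 == 0:
        url.append(block)
    else:
        text.append(block)

# Build the DataFrame
df = pd.DataFrame({'url': url, 'text': text})
df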
    url                                                 text
0   URL: http://www.nytimes.com/2016/06/30/sports/...   WASHINGTON — Stellar .... stretched thin.\n“We...
1   URL: http://www.nytimes.com/2016/06/30/nyregio...   Mayor Bill de .... Department.\nThe move.... A...
2   URL: http://www.nytimes.com/2016/06/30/nyregio...   In the early..., the Folk Nation.\nAs hundreds...

# Optional: remove the literal 'URL: ' prefix from the url column
df['url'] = df['url'].str.replace('URL: ', '', regex=False)
df
    url                                                 text
0   http://www.nytimes.com/2016/06/30/sports/baseb...   WASHINGTON — Stellar .... stretched thin.\n“We...
1   http://www.nytimes.com/2016/06/30/nyregion/may...   Mayor Bill de .... Department.\nThe move.... A...
2   http://www.nytimes.com/2016/06/30/nyregion/thr...   In the early..., the Folk Nation.\nAs hundreds...
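
The index-parity loop above depends on the file alternating strictly between URL blocks and text blocks; a stray blank line inside an article body would shift every later row. As a hedged alternative, the sketch below starts a new record whenever a line begins with `URL: ` and collects everything up to the next such line. The function name `parse_articles` and the inline sample are hypothetical, introduced only for illustration, and the code assumes no body line starts with `URL: `.

```python
def parse_articles(raw):
    """Return a list of (url, text) tuples from 'URL: ...' delimited text."""
    prefix = "URL: "
    records = []
    url, body = None, []
    for line in raw.splitlines():
        if line.startswith(prefix):
            # A new article begins: store the previous one, if any.
            if url is not None:
                records.append((url, "\n".join(body).strip()))
            url, body = line[len(prefix):].strip(), []
        elif url is not None:
            # Keep body lines (blank ones too) until the next URL line.
            body.append(line)
    # Store the final article after the loop ends.
    if url is not None:
        records.append((url, "\n".join(body).strip()))
    return records


# Made-up sample with a blank line inside the first article body.
sample = (
    "URL: http://example.com/a\n"
    "\n"
    "First paragraph.\n"
    "\n"
    "Second paragraph after a stray blank line.\n"
    "\n"
    "URL: http://example.com/b\n"
    "\n"
    "Only paragraph.\n"
)

records = parse_articles(sample)
for url, text in records:
    print(url, repr(text))
```

The resulting list can be passed to `pd.DataFrame(records, columns=['url', 'text'])`. Because the prefix is stripped while parsing, the separate `str.replace` step is not needed in this variant.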
