Classifying text in a Pandas Series

Problem description

I have the following dataframe, built by parsing a raw text file into a list and then loading that list into a dataframe.

                                                                       Content
0                                                                     POLITICS
1               A Renewed Push in New York to Open Police Disciplinary Records
2                                                                  11:59 PM ET
3                                                                  CORRECTIONS
4                                                 Corrections & Amplifications
5                                                                  11:25 PM ET
6                                                                     NEW YORK
7   New York City to Have Curfew as Protests Over George Floyds Death Continue
8                                                                  10:20 PM ET
9                                                                         U.S.
10              Fresh Data Shows Heavy Coronavirus Death Toll in Nursing Homes
11                                                                  8:49 PM ET
12                                                                    BUSINESS
13    Reports of Violence Against Journalists Mount as U.S. Protests Intensify
14                                                                  8:05 PM ET
15                                                           MEDIA & MARKETING
16                      Music Labels Suspend Work in Support of Demonstrations
17                                                                  7:32 PM ET
18                                                            REVIEW & OUTLOOK
19                                                     Dont Call in the Troops
20                                                                  7:31 PM ET
21                                                                    NEW YORK
22                       Manhattan Stores Prepare for Another Night of Looting
23                                                                  7:31 PM ET
24                                                                     OPINION
25                                                 Dave Patrick Underwood, RIP
26                                                                  7:30 PM ET
27                                                            REVIEW & OUTLOOK
28                                       Courts Arent Financial Clearinghouses
29                                                                  7:27 PM ET

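I would like to know whether there is any way to split this column into 3 columns like ['Topic', 'Headline', 'Time']. Each row holds the data for one of these columns, and I want to split them without any manual work. I don't think the whole dataframe follows the Topic, Headline, Time pattern: the raw data was created by hand, so the pattern changes at some point. It would therefore be great if the rows could be classified with a regex, or with anything else that preserves the time-series structure.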

Tags: python, pandas

Solution


Addressing: at some point the pattern changes

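  • Use list comprehensions to find the data for each column
    • The order in which the lists are created matters: time, top, then head
    • The time pattern must be consistent: hh:mm or h:mm, then AM or PM, then a 2-character time zone
    • The top pattern must consist entirely of uppercase characters, and the value must not be in time
    • head is anything that is not in time or top
  • The implementation below uses fairly simple matching
    • No doubt a more sophisticated regex could be applied.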
import re
import pandas as pd

# given your dataframe as df, work on the column as a plain list
cont = df.Content.tolist()

# find components for each list; the order matters: time, top, then head
time = [v for v in cont if len(v) in (10, 11) and ':' in v]  # 'h:mm PM ET' is 10 characters, 'hh:mm PM ET' is 11
top = [v for v in cont if ''.join(re.findall(r'\w', v)).isupper() and v not in time]  # topic characters must be all uppercase
head = [v for v in cont if v not in time + top]  # anything not in the other two lists

# create the dataframe; this raises a ValueError unless the three lists have the same length
df = pd.DataFrame({'Time': time, 'Topic': top, 'Headline': head})
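
The question also asks to keep the time-series structure, which three independent lists do not guarantee once a record is missing a row. One possible sketch, not part of the original answer, labels every row with a regex and starts a new record at each topic row, so a record that lacks a headline or a time gets NaN instead of shifting later rows. The names TIME_RE, classify and to_records are hypothetical, and the time pattern is an assumption based on the sample data.

```python
import re
import pandas as pd

# Assumed time format, e.g. '8:49 PM ET' or '11:59 PM ET'; adjust to the real data.
TIME_RE = re.compile(r"^\d{1,2}:\d{2} (AM|PM) [A-Z]{2}$")

def classify(line):
    """Label one row as 'Time', 'Topic' or 'Headline'."""
    if TIME_RE.match(line):
        return "Time"
    letters = re.sub(r"[^A-Za-z]", "", line)  # ignore spaces, '&', '.', digits
    if letters and letters.isupper():
        return "Topic"
    return "Headline"

def to_records(lines):
    """Group rows into records; a new record starts at every Topic row."""
    values = pd.Series(lines)
    kind = values.map(classify)
    record = (kind == "Topic").cumsum()  # record id grows by 1 at each topic
    long = pd.DataFrame({"record": record, "kind": kind, "value": values})
    wide = (
        long.drop_duplicates(["record", "kind"])  # keep the first row of each kind per record
        .pivot(index="record", columns="kind", values="value")
        .reindex(columns=["Topic", "Headline", "Time"])
    )
    return wide.reset_index(drop=True)

# Sample rows taken from the question, with the second headline deliberately left out.
sample = [
    "POLITICS",
    "A Renewed Push in New York to Open Police Disciplinary Records",
    "11:59 PM ET",
    "CORRECTIONS",
    "11:25 PM ET",
    "U.S.",
    "Fresh Data Shows Heavy Coronavirus Death Toll in Nursing Homes",
    "8:49 PM ET",
]
result = to_records(sample)
```

With your own data, to_records(df.Content.tolist()) would be the equivalent call. Note that a headline written entirely in uppercase would be labelled as a topic under these rules, so the patterns may need tightening.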

Addressing: the column keeps a consistent pattern

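  • I would convert the column to a list and use list slicing
  • This only works when the pattern is consistent
    • It does not address "at some point the pattern changes, because the raw data was created by hand"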
# given your dataframe as df
content = df.Content.tolist()

# create a new dataframe with 3 columns
df_new = pd.DataFrame(columns=['cat', 'desc', 'time'])

# select data for columns: every third row, starting at offsets 0, 1 and 2
df_new['cat'] = content[0::3]
df_new['desc'] = content[1::3]
df_new['time'] = content[2::3]

# display(df_new)
                 cat                                                                        desc         time
0           POLITICS              A Renewed Push in New York to Open Police Disciplinary Records  11:59 PM ET
1        CORRECTIONS                                                Corrections & Amplifications  11:25 PM ET
2           NEW YORK  New York City to Have Curfew as Protests Over George Floyds Death Continue  10:20 PM ET
3               U.S.              Fresh Data Shows Heavy Coronavirus Death Toll in Nursing Homes   8:49 PM ET
4           BUSINESS    Reports of Violence Against Journalists Mount as U.S. Protests Intensify   8:05 PM ET
5  MEDIA & MARKETING                      Music Labels Suspend Work in Support of Demonstrations   7:32 PM ET
6   REVIEW & OUTLOOK                                                     Dont Call in the Troops   7:31 PM ET
7           NEW YORK                       Manhattan Stores Prepare for Another Night of Looting   7:31 PM ET
8            OPINION                                                 Dave Patrick Underwood, RIP   7:30 PM ET
9   REVIEW & OUTLOOK                                       Courts Arent Financial Clearinghouses   7:27 PM ET

Using a loop

df_new = pd.DataFrame()

for i, col in enumerate(['Topic','Headline','Time']):
    df_new[col] = df.Content.tolist()[i::3]
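
Both slicing variants quietly misalign the columns if a single row is missing or duplicated. A minimal sketch of a guard, assuming the same time format as in the sample data, checks the pattern before slicing and fails loudly otherwise. The names TIME_RE and split_fixed_pattern are hypothetical and not part of the original answer.

```python
import re
import pandas as pd

# Assumed time format, e.g. '8:49 PM ET' or '11:59 PM ET'.
TIME_RE = re.compile(r"^\d{1,2}:\d{2} (AM|PM) [A-Z]{2}$")

def split_fixed_pattern(lines):
    """Slice rows into 3 columns, refusing input that breaks the Topic/Headline/Time cycle."""
    if len(lines) % 3 != 0:
        raise ValueError(f"expected a multiple of 3 rows, got {len(lines)}")
    # Every third row, starting at offset 2, should look like a time value.
    bad = [i for i, v in enumerate(lines[2::3]) if not TIME_RE.match(v)]
    if bad:
        raise ValueError(f"record {bad[0]} does not end with a time value")
    return pd.DataFrame({
        "Topic": lines[0::3],
        "Headline": lines[1::3],
        "Time": lines[2::3],
    })

good = [
    "POLITICS",
    "A Renewed Push in New York to Open Police Disciplinary Records",
    "11:59 PM ET",
    "OPINION",
    "Dave Patrick Underwood, RIP",
    "7:30 PM ET",
]
checked = split_fixed_pattern(good)
```

Called as split_fixed_pattern(df.Content.tolist()), this reports the first record where the cycle breaks, which is a useful starting point for finding where the hand-made data changes pattern.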
