Get the count of unique timestamps from a text column in a Python DataFrame

Problem description

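I have a DataFrame with about 110 columns and roughly 2 million rows. For each row, I want to count the unique dates that appear in a column named Comments. The Comments column looks like this: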

------------------------------------------------------------------------
ID       Comments
------------------------------------------------------------------------
1        Log Type: customer chat
         chat history:
            xxxxxxxxx
            xxxxxxx
            xxxxxxxxxxxxxxx
            May 10 2020 23:34:57 +GMT 05:30
            --------------------------------------------
            log type: Phone call
            issue type: xxxxxx
            issue:
             qqqqqqqqqqqq
             qqqqqqqqqqqqqqqqqqqqqqq
             qqqqqqqqqqqqqqq
             May 11 2020 08:54:54 + GMT 05:30
             ----------------------------------------------
             log type: phone call
             issue:
              eeeeeeeeeeeeee
              eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
              eeeeeeee
              eeeeeeeeeee
              eeeeeeeeeeee
              eeeeeeeeeeeeeeeeeee
              May 11 2020 14:58:54 + GMT 05:30
            ----------------------------------
----------------------------------------------------------------------------
2           Log Type: Phone call
            issue:
            xxxxxxxxx
            xxxxxxx
            xxxxxxxxxxxxxxx
            May 10 2020 23:34:57 +GMT 05:30
            --------------------------------------------
            log type: Phone call
            issue type: xxxxxx
            issue:
             qqqqqqqqqqqq
             qqqqqqqqqqqqqqqqqqqqqqq
             qqqqqqqqqqqqqqq
             May 11 2020 08:54:54 + GMT 05:30
             ----------------------------------------------
             log type: phone call
             issue:
               eeeeeeeeeeeeee
               eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
               eeeeeeee
               eeeeeeeeeee
               eeeeeeeeeeee
               eeeeeeeeeeeeeeeeeee
             5/12/2020 14:58:54 + GMT 05:30
            ----------------------------------------------

The desired output looks like this:

ID Count
1   2
2   3

Can anyone help?

Tags: python, pandas, dataframe, nlp

Solution


Answer edited based on the comments:

1. First, extract all the dates. Note that the regular expression passed to str.findall matches dates written as "May 20 2020", "5/12/2020" or "05/12/2020".

s = df['Comments'].str.findall(r'[\w\s\.]*(\w{3}\s\d{2}\s\d{4}|\d?\d/\d?\d/\d{4})[\w\s\.]*')
print(s)
0    [May 10 2020, May 11 2020, May 11 2020]
1      [May 10 2020, May 11 2020, 5/12/2020]
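
To see what the date pattern picks up on its own, it can be tried on a plain string with re.findall. This is only a minimal sketch: the `sample` text is a shortened, hypothetical stand-in for one Comments cell, and the pattern keeps just the alternation from the call above, without the surrounding `[\w\s\.]*` context.

```python
import re

# Month-name form ("May 11 2020") or slash form ("5/12/2020", "05/12/2020")
DATE_RE = re.compile(r'\w{3}\s\d{2}\s\d{4}|\d?\d/\d?\d/\d{4}')

# Hypothetical, shortened stand-in for one Comments cell
sample = (
    "issue: qqqq\n"
    "May 11 2020 08:54:54 + GMT 05:30\n"
    "-----\n"
    "issue: eeee\n"
    "5/12/2020 14:58:54 + GMT 05:30\n"
)

dates = DATE_RE.findall(sample)
print(dates)  # ['May 11 2020', '5/12/2020']
```

Because `\w{3}` accepts any three word characters, not just month abbreviations, text such as "abc 12 2020" would also match. If that is a concern for the real data, the first alternative could be narrowed to an explicit list of month names.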

2. The step above returns a list of date strings for each row. Next, the dates have to be normalized to a single format.

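import re
import datetime

def conv(dates):
    # Build a new list rather than removing from the list being iterated over
    out = []
    for val in dates:
        if re.match(r"\d?\d/\d?\d/\d{4}", val) is not None:
            # Rewrite m/d/Y dates in the 'May 12 2020' form
            val = datetime.datetime.strptime(val, '%m/%d/%Y').strftime('%b %d %Y')
        out.append(val)
    return out

s = s.apply(conv)
print(s)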
0    [May 10 2020, May 11 2020, May 11 2020]
1    [May 10 2020, May 11 2020, May 12 2020]
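
Another way to normalize, sketched here under assumptions, is to parse each string into a datetime.date so that the written format no longer matters when comparing. The helper name `to_date` and the `FORMATS` tuple are hypothetical and cover only the two formats that appear in the question.

```python
import datetime

# Formats seen in the question's sample; extend the tuple if others occur
FORMATS = ('%b %d %Y', '%m/%d/%Y')

def to_date(text):
    """Return a datetime.date for the first format that parses, else None."""
    for fmt in FORMATS:
        try:
            return datetime.datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None

# Hypothetical row: the last entry repeats May 11 in the slash format
row = ['May 10 2020', 'May 11 2020', '5/12/2020', '05/11/2020']
unique_days = {to_date(d) for d in row}
print(len(unique_days))  # 3
```

Comparing date objects instead of strings means "May 11 2020" and "05/11/2020" collapse into one entry without reformatting either of them.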

Now we can take the number of unique dates per row from the Series and add it to the original df as a "count" column.

df['count'] = s.transform(set).str.len()
print(df)
   ID                                           Comments  count
0   1  Log Type: customer chat chat history: xxxxxxxx...      2
1   2  Log Type: Phone call issue: xxxxxxxxx xxxxxxx ...      3
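
The whole procedure can also be folded into a single per-row function. The sketch below is one possible arrangement, not the answer's exact code: `count_unique_dates` is a hypothetical helper, the Comments strings are shortened stand-ins for the question's data, and the regex drops the surrounding context classes.

```python
import re
import datetime
import pandas as pd

DATE_RE = re.compile(r'\w{3}\s\d{2}\s\d{4}|\d?\d/\d?\d/\d{4}')
FORMATS = ('%b %d %Y', '%m/%d/%Y')  # assumed to be the only formats present

def count_unique_dates(text):
    """Count distinct calendar days mentioned in one Comments cell."""
    days = set()
    for raw in DATE_RE.findall(text):
        for fmt in FORMATS:
            try:
                days.add(datetime.datetime.strptime(raw, fmt).date())
                break
            except ValueError:
                continue  # try the next format; unparseable matches are skipped
    return len(days)

# Shortened, hypothetical versions of the two rows in the question
df = pd.DataFrame({
    'ID': [1, 2],
    'Comments': [
        'chat history: xxx May 10 2020 23:34:57 +GMT 05:30 ---- '
        'issue: qqq May 11 2020 08:54:54 + GMT 05:30 ---- '
        'issue: eee May 11 2020 14:58:54 + GMT 05:30',
        'issue: xxx May 10 2020 23:34:57 +GMT 05:30 ---- '
        'issue: qqq May 11 2020 08:54:54 + GMT 05:30 ---- '
        'issue: eee 5/12/2020 14:58:54 + GMT 05:30',
    ],
})

df['count'] = df['Comments'].apply(count_unique_dates)
print(df[['ID', 'count']])
```

On this sample the counts are 2 and 3, matching the desired output in the question. Only the Comments column is touched, so the other columns of a wide DataFrame do not affect the work done per row.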
