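python - Get the count of unique timestamps from a text column in a Python dataframe
Question
I have a dataframe with about 110 columns and roughly 2 million rows. For each row, I want to count the unique dates that appear in a column named Comments. The Comments column looks like this: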
------------------------------------------------------------------------
ID Comments
------------------------------------------------------------------------
1 Log Type: customer chat
chat history:
xxxxxxxxx
xxxxxxx
xxxxxxxxxxxxxxx
May 10 2020 23:34:57 +GMT 05:30
--------------------------------------------
log type: Phone call
issue type: xxxxxx
issue:
qqqqqqqqqqqq
qqqqqqqqqqqqqqqqqqqqqqq
qqqqqqqqqqqqqqq
May 11 2020 08:54:54 + GMT 05:30
----------------------------------------------
log type: phone call
issue:
eeeeeeeeeeeeee
eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
eeeeeeee
eeeeeeeeeee
eeeeeeeeeeee
eeeeeeeeeeeeeeeeeee
May 11 2020 14:58:54 + GMT 05:30
----------------------------------
----------------------------------------------------------------------------
2 Log Type: Phone call
issue:
xxxxxxxxx
xxxxxxx
xxxxxxxxxxxxxxx
May 10 2020 23:34:57 +GMT 05:30
--------------------------------------------
log type: Phone call
issue type: xxxxxx
issue:
qqqqqqqqqqqq
qqqqqqqqqqqqqqqqqqqqqqq
qqqqqqqqqqqqqqq
May 11 2020 08:54:54 + GMT 05:30
----------------------------------------------
log type: phone call
issue:
eeeeeeeeeeeeee
eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee
eeeeeeee
eeeeeeeeeee
eeeeeeeeeeee
eeeeeeeeeeeeeeeeeee
5/12/2020 14:58:54 + GMT 05:30
----------------------------------------------
The desired output looks like this:
ID Count
1 2
2 3
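Can anyone help?
Answer
Answer edited based on the comments:
1. First, get all the dates. Note that the regular expression passed to str.findall matches dates in the form "May 20 2020", "5/12/2020", or "05/12/2020".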
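# Alternation only: the greedy [\w\s\.]* runs around the date are not needed
# and could swallow a second date that sits in the same run of text
s = df['Comments'].str.findall(r'\w{3}\s\d{2}\s\d{4}|\d?\d/\d?\d/\d{4}')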
print(s)
0 [May 10 2020, May 11 2020, May 11 2020]
1 [May 10 2020, May 11 2020, 5/12/2020]
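To see what the pattern picks up outside of pandas, here is a minimal sketch using the standard `re` module on one made-up comment string. The pattern is a slightly stricter variant chosen for illustration (word boundaries, letters only for the month, one- or two-digit days), and the names `DATE_RE` and `sample` are hypothetical, not part of the original answer.

```python
import re

# Illustrative pattern: 'Mon D YYYY' / 'Mon DD YYYY' or 'M/D/YYYY' style dates.
DATE_RE = re.compile(
    r"\b[A-Za-z]{3}\s\d{1,2}\s\d{4}\b"   # e.g. May 10 2020
    r"|\b\d{1,2}/\d{1,2}/\d{4}\b"        # e.g. 5/12/2020 or 05/12/2020
)

# Made-up comment text shaped like the sample in the question.
sample = (
    "issue:\nxxxxxxx\nMay 10 2020 23:34:57 +GMT 05:30\n"
    "--------\n"
    "issue:\neeeeee\n5/12/2020 14:58:54 + GMT 05:30\n"
)

found = DATE_RE.findall(sample)
print(found)
```

The time and the GMT offset are skipped because neither has the shape month, day, four-digit year, so only the date part of each timestamp is returned.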
2. The above returns a list for each row. Next, we have to normalize the dates to a single standard format.
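import re
import datetime

def conv(dates):
    # Build a new list: removing items from a list while iterating over it can skip elements
    out = []
    for val in dates:
        if re.fullmatch(r"\d?\d/\d?\d/\d{4}", val):
            # Rewrite m/d/YYYY as 'Mon DD YYYY' so both styles compare equal
            val = datetime.datetime.strptime(val, '%m/%d/%Y').strftime('%b %d %Y')
        out.append(val)
    return out

s = s.apply(conv)
print(s)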
0 [May 10 2020, May 11 2020, May 11 2020]
1 [May 10 2020, May 11 2020, May 12 2020]
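An alternative to rewriting strings is to parse every match into a `datetime.date`, so that equality no longer depends on spelling or zero padding. The sketch below assumes only the two layouts seen in the sample comments and an English locale for `%b`; `KNOWN_FORMATS` and `to_date` are hypothetical names introduced for illustration.

```python
from datetime import date, datetime

# Formats assumed from the sample comments; extend the tuple for other layouts.
KNOWN_FORMATS = ("%b %d %Y", "%m/%d/%Y")

def to_date(text: str) -> date:
    """Parse a date string with the first known format that fits."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # try the next layout
    raise ValueError(f"unrecognized date format: {text!r}")

dates = [to_date(t) for t in ["May 10 2020", "May 11 2020", "5/12/2020"]]
unique_days = len(set(dates))
print(unique_days)
```

Because the set holds `date` objects, "May 12 2020", "5/12/2020", and "05/12/2020" all collapse to the same element.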
Now we can take the number of unique dates in each list and add it to the original df as a "count" column.
df['count'] = s.transform(set).str.len()
print(df)
ID Comments count
0 1 Log Type: customer chat chat history: xxxxxxxx... 2
1 2 Log Type: Phone call issue: xxxxxxxxx xxxxxxx ... 3
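With around 2 million rows, a per-row Python function may be slow. As a rough sketch of a more vectorized route (not the method used in the answer above), the dates can be pulled out with `Series.str.extractall`, parsed with `pd.to_datetime`, and counted with a group-wise `nunique`. The tiny `df` here is made-up data shaped like the question's, and the pattern is an illustrative assumption.

```python
import pandas as pd

# Made-up two-row frame standing in for the real data.
df = pd.DataFrame({
    "ID": [1, 2],
    "Comments": [
        "May 10 2020 23:34:57\n---\nMay 11 2020 08:54:54\n---\nMay 11 2020 14:58:54",
        "May 10 2020 23:34:57\n---\nMay 11 2020 08:54:54\n---\n5/12/2020 14:58:54",
    ],
})

# One capture group, as extractall requires; same two layouts as the question.
pattern = r"(\b[A-Za-z]{3}\s\d{1,2}\s\d{4}\b|\b\d{1,2}/\d{1,2}/\d{4}\b)"

# One row per match, indexed by (original row label, match number).
matches = df["Comments"].str.extractall(pattern)[0]

# Try each layout in turn; values that do not fit a layout become NaT.
parsed = pd.to_datetime(matches, format="%b %d %Y", errors="coerce").fillna(
    pd.to_datetime(matches, format="%m/%d/%Y", errors="coerce")
)

# Distinct dates per original row; rows without any match get 0.
counts = parsed.groupby(level=0).nunique()
df["count"] = counts.reindex(df.index, fill_value=0)
print(df[["ID", "count"]])
```

For the two sample rows this gives counts of 2 and 3, matching the desired output in the question. Rows whose Comments value is missing produce no matches and end up with a count of 0.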