Python: eliminating near-duplicate rows in a CSV

Problem description

I have a large CSV file containing data like this:

192.168.107.87,4662,69.192.30.179,80,"other"
192.168.107.87,4662,69.192.30.179,80,"infection"
192.168.177.85,4662,69.192.30.179,80,"infection"
192.168.177.85,4662,69.192.30.179,80,"other"

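I have been able to eliminate the true duplicates, but I also need to remove the "other" rows for connections that are marked "infection" as well, and I don't know how to do that. Below is my code: the first part removes duplicate lines, and the second removes duplicate (reversed) connections and drops every message type except the three I need. When I remove the redundant "other" rows, I need to keep track of whether the connection was an "infection" or a "cnc", plus a basic count of each.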

# Pass 1: drop exact duplicate lines.
with open('alerts.csv', 'r') as in_file, open('alertsfix.csv', 'w') as out_file:
    seen = set()  # set for fast O(1) amortized lookup
    for line in in_file:
        if line in seen:
            continue  # skip duplicate

        seen.add(line)
        out_file.write(line)
# No explicit close() calls are needed: the with statement closes both files.


# Pass 2: write a new file without cross connections (source and
# destination swapped) and without message types other than the three below.
s1 = '"other"'
s2 = '"infection"'
s3 = '"cnc"'

with open('alertsfix.csv', 'r') as in_file, open('alertsfixmore.csv', 'w') as out_file:
    seen = set()
    for line in in_file:
        lines = line.strip()
        if len(lines) > 0:
            src_ip, src_port, dst_ip, dst_port, msg = lines.split(',')
            src = '{}:{}'.format(src_ip, src_port)
            dst = '{}:{}'.format(dst_ip, dst_port)
            # The inner frozenset makes A->B and B->A compare equal.
            key = frozenset([
                frozenset([src, dst]),
                msg,
            ])

            if key not in seen:
                seen.add(key)  # remember this connection/message pair
                if msg in (s1, s2, s3):  # eliminate any other types
                    out_file.write(line)  # write the original line to the new file
# The with statement closes both files, so no close() calls are needed.

Tags: python, csv

Solution


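Sort the rows on index 0, then group by index 0. For each group, filter out all the "other" rows, examine what is left and count the "infection" and "cnc" rows, then add the remaining rows to a new container.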

import io, csv, itertools

f = io.StringIO('''192.168.107.87,4662,69.192.30.179,80,"other"
192.168.107.87,4662,69.192.30.179,80,"infection"
192.168.177.85,4662,69.192.30.179,80,"infection"
192.168.177.85,4662,69.192.30.179,80,"other"
192.168.177.111,4662,69.192.30.179,80,"cnc"
192.168.177.111,4662,69.192.30.179,80,"other"
192.168.177.222,4662,69.192.30.179,80,"infection"
192.168.177.222,4662,69.192.30.179,80,"cnc"
192.168.177.222,4662,69.192.30.179,80,"other"''')

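reader = csv.reader(f)  # csv.reader strips the quotes around the message
data = list(reader)
# groupby only merges adjacent rows, so sort on the grouping key first
data.sort(key=lambda item: item[0])
groups = itertools.groupby(data, lambda item: item[0])
newdata = []
infection, cnc = 0, 0
for key, group in groups:
    # drop every "other" row in this group
    group = [row for row in group if row[-1] != "other"]
    # count what is left by message type
    infection += sum(row[-1] == "infection" for row in group)
    cnc += sum(row[-1] == "cnc" for row in group)
    newdata.extend(group)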

In [18]: cnc
Out[18]: 2

In [19]: infection
Out[19]: 3

In [20]: newdata
Out[20]: 
[['192.168.107.87', '4662', '69.192.30.179', '80', 'infection'],
 ['192.168.177.111', '4662', '69.192.30.179', '80', 'cnc'],
 ['192.168.177.222', '4662', '69.192.30.179', '80', 'infection'],
 ['192.168.177.222', '4662', '69.192.30.179', '80', 'cnc'],
 ['192.168.177.85', '4662', '69.192.30.179', '80', 'infection']]
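
The session above works on an in-memory sample. A minimal sketch of the same steps applied to file objects might look like the following. The function name `drop_other` and the output file name in the usage comment are hypothetical and are not part of the original answer. Note that `csv.writer` with its default settings writes the message column without the surrounding quotes.

```python
import csv
import io
import itertools
from collections import Counter


def drop_other(in_file, out_file):
    """Copy CSV rows from in_file to out_file, dropping every "other" row.

    Returns a Counter of the message types that were kept.
    (Hypothetical helper, sketched from the steps described above.)
    """
    # Skip blank lines, then sort so groupby sees equal keys next to each other.
    rows = sorted((row for row in csv.reader(in_file) if row),
                  key=lambda row: row[0])
    writer = csv.writer(out_file)
    counts = Counter()
    for _, group in itertools.groupby(rows, key=lambda row: row[0]):
        kept = [row for row in group if row[-1] != "other"]
        counts.update(row[-1] for row in kept)
        writer.writerows(kept)
    return counts


# Small in-memory demonstration using the rows from the question.
demo_in = io.StringIO(
    '192.168.107.87,4662,69.192.30.179,80,"other"\n'
    '192.168.107.87,4662,69.192.30.179,80,"infection"\n'
    '192.168.177.85,4662,69.192.30.179,80,"infection"\n'
    '192.168.177.85,4662,69.192.30.179,80,"other"\n'
)
demo_out = io.StringIO()
demo_counts = drop_other(demo_in, demo_out)
print(demo_counts)
print(demo_out.getvalue(), end="")

# With real files (output name is an assumption):
# with open("alertsfixmore.csv", newline="") as src, \
#         open("alerts_no_other.csv", "w", newline="") as dst:
#     print(drop_other(src, dst))
```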

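Depending on what you are actually trying to do, you may need to sort and group by more than one column. For the sample data, it looks like lambda item: item[:-1] would work as a key too.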
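
As a rough sketch of that multi-column variant, the example below groups on every column except the message. It also reads the question in one particular way, which is an assumption rather than something the answer states: an "other" row is dropped only when the same connection also carries a more specific label, and is kept otherwise. The helper name `connection` and the `10.0.0.5` row are made up for illustration.

```python
import csv
import io
import itertools
from collections import Counter

# Sample rows; the last one (10.0.0.5) is invented to show an "other"-only connection.
SAMPLE = '''192.168.107.87,4662,69.192.30.179,80,"other"
192.168.107.87,4662,69.192.30.179,80,"infection"
192.168.177.85,4662,69.192.30.179,80,"cnc"
192.168.177.85,4662,69.192.30.179,80,"other"
10.0.0.5,4662,69.192.30.179,80,"other"
'''


def connection(row):
    """Everything except the trailing message column identifies a connection."""
    return tuple(row[:-1])


# Sort on the full connection key so groupby sees matching rows side by side.
rows = sorted((row for row in csv.reader(io.StringIO(SAMPLE)) if row),
              key=connection)

kept = []
counts = Counter()
for _, group in itertools.groupby(rows, key=connection):
    group = list(group)
    labelled = [row for row in group if row[-1] != "other"]
    # Assumption: keep "other" rows only when nothing more specific exists.
    survivors = labelled or group
    kept.extend(survivors)
    counts.update(row[-1] for row in survivors)

for row in kept:
    print(row)
print(counts)
```

Using `Counter` instead of one variable per label means additional message types are counted without changing the loop.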

