python - Eliminating near-duplicate rows from a CSV in Python
Problem description
I have a large CSV file containing data like this:
192.168.107.87,4662,69.192.30.179,80,"other"
192.168.107.87,4662,69.192.30.179,80,"infection"
192.168.177.85,4662,69.192.30.179,80,"infection"
192.168.177.85,4662,69.192.30.179,80,"other"
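I have been able to eliminate the true duplicates, but I also need to drop the "other" rows for connections that are also labeled "infection", and I don't know how to do that. Below is my code for removing duplicate rows and duplicate connections, and for dropping every message other than the three I need. When I remove a redundant "other", I need to keep track of whether that connection was "infection" or "cnc", and I also want a basic count of each.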
with open(r'alerts.csv', 'r') as in_file, open('alertsfix.csv', 'w') as out_file:
    seen = set()  # set for fast O(1) amortized lookup
    for line in in_file:
        if line in seen:
            continue  # skip duplicate
        seen.add(line)
        out_file.write(line)
# the with statement closes both files, so no explicit close() is needed
'''
Writes a new file, eliminating cross connections (source and destination swapped)
'''
s1 = '"other"'
s2 = '"infection"'
s3 = '"cnc"'
with open('alertsfix.csv', 'r') as in_file, open('alertsfixmore.csv', 'w') as out_file:
    seen = set()
    for line in in_file:
        lines = line.strip()
        if len(lines) > 0:
            src_ip, src_port, dst_ip, dst_port, msg = lines.split(',')
            src = '{}:{}'.format(src_ip, src_port)
            dst = '{}:{}'.format(dst_ip, dst_port)
            # the inner frozenset makes A->B and B->A compare equal
            key = frozenset([
                frozenset([src, dst]),
                msg,
            ])
            if key not in seen:
                seen.add(key)  # remember this connection/message pair
                s4 = msg
                if s4 in (s1, s2, s3):  # eliminate any other message types
                    out_file.write(line)  # write the line to the new file
# the with statement closes both files, so no explicit close() is needed
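Solution
Sort the rows on index 0; then group by index 0; for each group, filter out all the "other" rows; inspect what remains and count the "infection" and "cnc" rows; add the remaining rows to a new container.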
import io, csv, itertools
f = io.StringIO('''192.168.107.87,4662,69.192.30.179,80,"other"
192.168.107.87,4662,69.192.30.179,80,"infection"
192.168.177.85,4662,69.192.30.179,80,"infection"
192.168.177.85,4662,69.192.30.179,80,"other"
192.168.177.111,4662,69.192.30.179,80,"cnc"
192.168.177.111,4662,69.192.30.179,80,"other"
192.168.177.222,4662,69.192.30.179,80,"infection"
192.168.177.222,4662,69.192.30.179,80,"cnc"
192.168.177.222,4662,69.192.30.179,80,"other"''')
reader = csv.reader(f)  # csv.reader strips the quotes around the label
data = list(reader)
# groupby only merges adjacent rows, so sort on the same key first
data.sort(key=lambda item: item[0])
groups = itertools.groupby(data, lambda item: item[0])
newdata = []
infection, cnc = 0, 0
for key, group in groups:
    # drop the "other" rows, then count what is left in this group
    group = [row for row in group if row[-1] != "other"]
    infection += sum(row[-1] == "infection" for row in group)
    cnc += sum(row[-1] == "cnc" for row in group)
    newdata.extend(group)
In [18]: cnc
Out[18]: 2
In [19]: infection
Out[19]: 3
In [20]: newdata
Out[20]:
[['192.168.107.87', '4662', '69.192.30.179', '80', 'infection'],
['192.168.177.111', '4662', '69.192.30.179', '80', 'cnc'],
['192.168.177.222', '4662', '69.192.30.179', '80', 'infection'],
['192.168.177.222', '4662', '69.192.30.179', '80', 'cnc'],
['192.168.177.85', '4662', '69.192.30.179', '80', 'infection']]
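Depending on what you are actually trying to do, you may need to sort and group on more than one column; for the sample data it looks like lambda item: item[:-1] would also work as the key.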
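As a rough sketch of that multi-column variant: the version below groups on every column except the label, and drops an "other" row only when the same connection also carries "infection" or "cnc". It uses a dict instead of sort plus groupby, which is a swapped-in technique and not what the answer above does. The helper name collapse, the SPECIFIC set, and the 10.0.0.5 sample row are made up for illustration; keeping a lone "other" row is an assumption about what the question wants.

```python
import csv
import io
from collections import Counter, defaultdict

# Labels that outrank "other" (names taken from the question).
SPECIFIC = {"infection", "cnc"}


def collapse(rows):
    """Hypothetical helper: group rows by all columns except the label.

    "other" is kept only for connections that have no specific label;
    labels outside the three known ones are dropped.
    """
    labels_by_conn = defaultdict(list)
    for row in rows:
        if not row:  # skip blank lines
            continue
        *conn, label = row
        seen_labels = labels_by_conn[tuple(conn)]
        if label not in seen_labels:  # ignore exact duplicates
            seen_labels.append(label)

    kept, counts = [], Counter()
    for conn, labels in labels_by_conn.items():
        specific = [lab for lab in labels if lab in SPECIFIC]
        chosen = specific or [lab for lab in labels if lab == "other"]
        for label in chosen:
            kept.append([*conn, label])
            counts[label] += 1
    return kept, counts


# The last row is an invented example of a connection labeled only "other".
sample = io.StringIO('''192.168.107.87,4662,69.192.30.179,80,"other"
192.168.107.87,4662,69.192.30.179,80,"infection"
192.168.177.85,4662,69.192.30.179,80,"infection"
192.168.177.85,4662,69.192.30.179,80,"other"
192.168.177.111,4662,69.192.30.179,80,"cnc"
10.0.0.5,4662,69.192.30.179,80,"other"
''')

kept, counts = collapse(csv.reader(sample))
for row in kept:
    print(row)
print(dict(counts))
```

Because the dict keeps insertion order, the surviving rows come out in the order each connection was first seen, so no sort is required. If "other" should always be discarded, as in the answer's code, replacing the chosen line with chosen = specific should give that behavior.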