What is the correct way to read and extract specific strings and then put them in a csv file?

Problem description

I have this sample_vsdt.txt file, which contains SHA-1 hashes and descriptions:

Scanning samples_extracted\02b809d4edee752d9286677ea30e8a76114aa324->(Microsoft RTF 6008-0)
->Found Virus [Possible_SCRDL]

Scanning samples_extracted\0349e0101d8458b6d05860fbee2b4a6d7fa2038d->(Adobe Portable Document Format(PDF) 6015-0)
->Found Virus [TROJ_FRS.VSN11I18]

Example:

SHA-1: 02b809d4edee752d9286677ea30e8a76114aa324
Description:(Microsoft RTF 6008-0)

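Problem:

My task is to extract those SHA-1 hashes and descriptions from my txt file and list them in a csv file. I was able to do that using a regular expression, a prefix, and a delimiter. However, this entry gives me trouble: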

Scanning samples_extracted\0191a23ee122bdb0c69008971e365ec530bf03f5
     - Invoice_No_94497.doc->Found Virus [Trojan.4FEC5F36]->(MIME 6010-0)

     - Found 1/3 Viruses in samples_extracted\0191a23ee122bdb0c69008971e365ec530bf03f5

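This entry is laid out differently: I only want to take the SHA-1 from the first line, not from the 4th line, and take the description from the second line.

Output:

The output is wrong because the description (MIME 6010-0) ends up in the SHA-1 column.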

0191a23ee122bdb0c69008971e365ec530bf03f5    
(MIME 6010-0)   
02b809d4edee752d9286677ea30e8a76114aa324    (Microsoft RTF 6008-0)
0349e0101d8458b6d05860fbee2b4a6d7fa2038d    (Adobe Portable Document Format(PDF) 6015-0)
035a7afca8b72cf1c05f6062814836ee31091559    (Adobe Portable Document Format(PDF) 6015-0)

Code

import csv
import re

INPUTFILE = 'samples_vsdt.txt'
OUTPUTFILE = 'output.csv'
PREFIX = '\\'
DELIMITER = '->'
DELIMITER2 = ']->'
PREFIX2 = ' - '
def read_text_file(inputfile):
    data = []
    with open(inputfile, 'r') as f:
        lines = f.readlines()

    for line in lines:
        line = line.rstrip('\n')
        if re.search(r'[a-zA-Z0-9]{40}', line) and not "Found" in line: # <----
            line = line.split(PREFIX, 1)[-1]
            parts = line.split(DELIMITER)
            data.append(parts)

        else:

            if "->(" in line and "Found" in line :
                matched_words=(re.search(r'\(.*?\)',line))
                sha =(re.search(r'[a-zA-Z0-9]{40}',line))

                if matched_words!=None:
                    matched_words=matched_words.group()
                    matched_words=matched_words.split("]->")

                    data.append(matched_words)
    #data.append(parts)                  
    return data

def write_csv_file(data, outputfile):
    with open(outputfile, 'wb') as csvfile:
        csvwriter = csv.writer(csvfile, delimiter=',', quotechar='"')
        for row in data:
            csvwriter.writerow(row)

def main():
    data = read_text_file(INPUTFILE)
    write_csv_file(data, OUTPUTFILE)

if __name__ == '__main__':
    main()

This is the full content of my text file: sample_vsdt.txt

Tags: python

Solution


If the description is not found on the line that contains the SHA-1, look for it on the following lines.

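import re

# SHA-1 is taken only from lines that start with "Scanning samples_extracted\"
patsha = re.compile(r'Scanning samples_extracted\\([a-z0-9]{40})')
# The description is the parenthesized text that directly follows "->"
patdesc = re.compile(r'->(\(.*\))')

with open("samples_vsdt.txt") as f:
    sha = None  # pending SHA-1 match that has no description yet
    for line in f:
        if sha is None:
            sha = patsha.match(line)
        if sha:
            # Look on the same line first; otherwise keep the SHA-1
            # and try again on the next line.
            desc = patdesc.search(line)
            if desc:
                print(sha.group(1) + ',' + desc.group(1))
                sha = None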
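
The snippet above prints the pairs rather than writing them to a file. As a rough sketch of how the same idea could feed the csv module, the following reuses the two patterns; the names extract_rows and write_csv are hypothetical and not part of the original answer. It assumes Python 3 (text mode with newline=''), and, unlike the snippet above, it starts a new record on every "Scanning" line, so a scanned file with no description does not carry its SHA-1 over to the next entry.

```python
import csv
import re

# Same patterns as in the answer; the constant names are illustrative.
SHA_PATTERN = re.compile(r'Scanning samples_extracted\\([a-z0-9]{40})')
DESC_PATTERN = re.compile(r'->(\(.*\))')


def extract_rows(lines):
    """Return [sha, description] pairs from an iterable of log lines."""
    rows = []
    sha = None
    for line in lines:
        match = SHA_PATTERN.match(line)
        if match:
            # Assumption: every "Scanning" line starts a new record.
            sha = match.group(1)
        if sha is None:
            continue
        # The description may be on the SHA-1 line or on a later line.
        desc = DESC_PATTERN.search(line)
        if desc:
            rows.append([sha, desc.group(1)])
            sha = None
    return rows


def write_csv(rows, outputfile):
    # newline='' lets the csv module control line endings itself.
    with open(outputfile, 'w', newline='') as csvfile:
        csv.writer(csvfile).writerows(rows)
```

A possible call, assuming the input file name from the question: open samples_vsdt.txt, pass the file object to extract_rows, and pass the result to write_csv together with 'output.csv'.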
