How to load multi-line records into a DataFrame

Problem description

I am trying to load multiple log files that share a similar format, but a single record can span several lines. I have some working code, but reading every line separately takes too much time; sample code is below.

Please help me.

import gzip
import pandas as pd

df = pd.DataFrame( {'eventtime':[],
                    'FileType':[],
                    'msg_type':[],
                    'thred_num':[],
                    'msg_lyr':[], 
                    'message':[],
                    'colorcode':[]
                    })
def write_line(record,file_type):
    if record=='':
        return

    split1= record.split("[")
    split2 = split1[0].split(" ")
    split3 = split1[1].split(" ")
    split4 = split1[2]

    s_time = split2[1].split(",")
    str_dateime  = split2[0] + ' ' + s_time[0] + "."+s_time[1]

    df.loc[len(df)] = pd.Series( {'eventtime':str_dateime,
                    'FileType':file_type,
                    'msg_type':split2[2],
                    'thred_num':split3[0][:-1],
                    'msg_lyr':split3[1], 
                    'message':split4,
                    'colorcode':""
                    })

for si_file in si_Files:
    f= gzip.open(si_file,'rt') 
    file_content = f.read()
    f.close()
    for record in file_content.split("]~~\n"):
        write_line(record, 'silogs')

Is there a better way to load the data into a DataFrame? Reading the files line by line and then loading them into the DataFrame takes too many resources on the application server that also runs the main application.

A sample log file is below:

2019-12-02 08:53:26,261 INFO  [18] CTL.CONF     - [Loading Configurations]~~<br>
2019-12-02 08:53:26,273 DEBUG [18] trg.sm.cs.client.CfgInterface - [Start: |User:default|ClientMachine:xxx.xxx.xxx.xxx]~~<br/>
2019-12-02 08:53:26,274 DEBUG [18] trg.sm.cs.client.CfgInterface - [Start: Waiting for connection with configuration server: xxx.xxx.xxx.xxx:X000]~~<br/>
2019-12-02 08:53:26,328 INFO  [19] GSI.Comms.SC - [Connecting|xxx.xxx.xx.xxx:x000]~~<br/>
2019-12-02 08:53:26,329 WARN  [19] GSI.Comms.SC - [Fast Loopback enabled]~~<br/>
2019-12-02 08:53:27,334 ERROR [19] GSI.Comms.SC - [Failed to connect with <br/>server|Endpoint:xxx.xxx.xx.xxx:x000|Error:No connection could be made because the target <br/>machine actively refused it xxx.xxx.xx.xxx:x000(ConnectionRefused:10061)]~~<br/>
2019-12-02 08:53:30,340 INFO  [19] GSI.Comms.SC - [Connecting|xxx.xxx.xx.xxx:x000]~~<br/>
2019-12-02 08:53:30,341 WARN  [19] GSI.Comms.SC - [Fast Loopback enabled]~~<br/>
2019-12-02 08:53:30,393 WARN  [19] sm.cs.client.CfgInterface - [Start: No QueryReload performed, EngineId missing.]~~<br/>
2019-12-02 08:53:30,393 DEBUG [19] sm.cs.client.CfgInterface - [ClientConnection |  Connected to the Server : Primary]~~<br/>
2019-12-02 08:53:30,393 DEBUG [18] sm.cs.client.CfgInterface - [Start: done.]~~<br/>
2019-12-02 08:53:30,512 DEBUG [13] CTL.CONF     - [ReloadResponse: Submitted]~~<br/>
**2019-12-02 08:53:31,791 INFO  [18] CTL.CONF     - [GroupSettingsXml|<Groups><br/>
xml tages EventProcessingLagThreshold 100 /EventProcessingLagThreshold <br/>
  XML tages QueueLengthThreshold 100 /QueueLengthThreshold <br/>
xml tags/Groups]~~<br/>**
2019-12-02 08:53:31,803 INFO  [18] CTL.CONF     - [EventProcessingLagThreshold:100]~~<br/>
2019-12-02 08:53:32,122 INFO  [18] SM.ENG.SatmapEngineCommonLibrary.Emailingclient.EmailAlerts - [Alerts Initialized with SMTPserver: 10.80.10.141, SMTPport: , Recipients: ssadaasd,  Sender Address: noreply-aasdlabaasd]~~<br/>
2019-12-02 08:53:41,856 INFO  [35] PRO.0.Q-AS-R - [QueryStatusResponse|ASC|AID:70032|DN:20029|CS:client_MODE_UNKNOWN=>client_MODE_READY|TS:TALK_STATE_UNKNOWN=>TALK_STATE_AVAILABLE]~~<br/>
2019-12-02 08:53:41,863 INFO  [31] CTL          - [HandleclientStatusEvent(client_MODE_READY)|AID:70032|KeepFreeclientOrder:False|ER:SUCCESSFULL]~~<br/>
2019-12-02 08:53:41,871 DEBUG [27] GSI.Comms.CM - [Tx|0|clientFree|MsgId:22|D-CID:1]~~<br/>
2019-12-02 08:53:41,899 DEBUG [24] TSP.EF       - [Rx|QUERY_TOD_RESPONSE|INV-ID:20|TOD:12/02/2019 11:53:41]~~<br/>
2019-12-02 08:53:41,899 DEBUG [35] PRO.0        - [Pr|QUERY_TOD_RESPONSE]~~<br/>
2019-12-02 08:53:41,899 DEBUG [35] PRO.0        - [RR|QueryTodRequest|INV-ID:20|08:53:41 808,08:53:41 817,08:53:41 899|RTT:90.98ms|PT:8.97]~~<br/>
2019-12-02 08:53:41,899 INFO  [35] PRO.0        - [CR|QueryTodRequest(ToCheckIfSkillMonitorCompleted)|S:80029|Pass:3]~~<br/>
2019-12-02 08:53:41,899 DEBUG [24] TSP.EF       - [Rx|QUERY_TOD_RESPONSE|INV-ID:21|TOD:12/02/2019 11:53:41]~~<br/>
2019-12-02 08:53:41,899 DEBUG [35] PRO.0        - [Pr|QUERY_TOD_RESPONSE]~~<br/>
2019-12-02 08:53:41,899 DEBUG [35] PRO.0        - [RR|QueryTodRequest|INV-ID:21|08:53:41 814,08:53:41 820,08:53:41 899|RTT:85.01ms|PT:6]~~<br/>
2019-12-02 08:53:41,899 INFO  [36] TSP.0.1      - [Tx|QueryTime]~~<br/>
2019-12-02 08:53:41,900 INFO  [36] PRO.0.RM     - [RP|QueryTodRequest|INV-ID:22|ICheckRequest:False|SkillId:80029(Pass:3)]~~<br/>
2019-12-02 08:53:41,903 INFO  [36] PRO.0        - [ST|OR Count:0|PR Count:1]~~
2019-12-02 08:53:41,950 DEBUG [24] TSP.EF       - [Rx|QUERY_TOD_RESPONSE|INV-ID:22|TOD:12/02/2019 11:53:41]~~<br/>
2019-12-02 08:53:41,950 DEBUG [35] PRO.0        - [Pr|QUERY_TOD_RESPONSE]~~<br/>
2019-12-02 08:53:41,950 DEBUG [35] PRO.0        - [RR|QueryTodRequest|INV-ID:22|08:53:41 899,08:53:41 899,08:53:41 950|RTT:51ms|PT:]~~<br/>
2019-12-02 08:53:41,950 INFO  [35] PRO.0.Q-STOD - [SkillMonitored, Marked for Route Register|SK:80029]~~<br/>
2019-12-02 08:53:41,952 INFO  [35] PRO.0.SMGR.1 - [RegisterSkill|SK:80029]~~<br/>

Tags: python, pandas, delimiter, line-breaks

Solution


I assume the two occurrences of "**" in your sample data are a mistake and they should not be there. Perhaps you wanted to mark those lines as bold.

I also assume that `<br/>` in your sample actually represents a line break.

To accomplish your task, start with `import re`; it will be needed shortly.

Then define the following generator:

def myReader(file_name):
    outRow = ''
    for row in open(file_name, 'r'):
        row = row.strip()
        if row[-2:] == '~~':
            outRow += row[0:-2]
            yield outRow
            outRow = ''
        else:
            outRow += row

It concatenates "continuation" lines (lines that do not end with "~~") onto the previous line and returns (yields) complete rows.

Note: if you want to keep line breaks after the "non-terminated" lines, change the last line to:

outRow += row + '\n'
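As a quick sanity check, the generator can be exercised on a throwaway temporary file holding an abridged version of the sample log (the temp-file plumbing is only for this demo); the multi-line ERROR record comes back rejoined as one logical row:

```python
import os
import tempfile

def myReader(file_name):
    # join physical lines until one ends with the '~~' record terminator
    outRow = ''
    for row in open(file_name, 'r'):
        row = row.strip()
        if row[-2:] == '~~':
            outRow += row[0:-2]
            yield outRow
            outRow = ''
        else:
            outRow += row

sample = (
    '2019-12-02 08:53:26,261 INFO  [18] CTL.CONF     - [Loading Configurations]~~\n'
    '2019-12-02 08:53:27,334 ERROR [19] GSI.Comms.SC - [Failed to connect with \n'
    'server|Error:ConnectionRefused:10061]~~\n'
)
with tempfile.NamedTemporaryFile('w', suffix='.log', delete=False) as f:
    f.write(sample)
rows = list(myReader(f.name))
os.remove(f.name)
print(len(rows))  # 2 logical records from 3 physical lines
print(rows[1])    # the ERROR record, rejoined
```

Note how `with ` and `server` come back glued together as `withserver`, because `strip()` drops the trailing space; that is exactly what the `outRow += row + '\n'` variant above avoids.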

Finally, assuming that you have a single input file, to create and load the DataFrame you should run:

df = pd.DataFrame(columns=['eventtime', 'FileType', 'msg_type',
    'thread_num', 'msg_lyr', 'message', 'colorcode'])
pat = re.compile(r'(?P<Date>[\d-]+ [\d:]+,\d+) (?P<msg_type>\w+) +'
    r'\[(?P<thread_num>\d+)\] (?P<msg_lyr>[\w.]+) +\- \[(?P<message>[^\]]+)')
myGen = myReader('log.txt')
for row in myGen:
    mtch = re.match(pat, row)
    if mtch:
        dat = mtch.group('Date').replace(',', '.')
        df.loc[len(df)] = pd.Series({'eventtime': dat, 'FileType': 'silogs',
            'msg_type': mtch.group('msg_type'), 'thread_num': mtch.group('thread_num'),
            'msg_lyr': mtch.group('msg_lyr'), 'message': mtch.group('message'),
            'colorcode': ''})
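To see what the pattern extracts, here is a quick check against the first sample line (with the trailing `~~` already removed, as the generator does):

```python
import re

pat = re.compile(r'(?P<Date>[\d-]+ [\d:]+,\d+) (?P<msg_type>\w+) +'
    r'\[(?P<thread_num>\d+)\] (?P<msg_lyr>[\w.]+) +\- \[(?P<message>[^\]]+)')

line = '2019-12-02 08:53:26,261 INFO  [18] CTL.CONF     - [Loading Configurations]'
m = pat.match(line)
print(m.group('Date'))        # 2019-12-02 08:53:26,261
print(m.group('msg_type'))    # INFO
print(m.group('thread_num'))  # 18
print(m.group('msg_lyr'))     # CTL.CONF
print(m.group('message'))     # Loading Configurations
```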

To process more files, move the above code into a function, change the constant file name into a parameter, and call this function for each source file name.

Of course, the creation of the empty DataFrame and the compilation of the regex should both be performed only once, before this loop.
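Since the question reads gzip-compressed files, that refactoring could look roughly like this (the names `myGzipReader` and `load_file` are my own; adapt them to your layout):

```python
import gzip
import re

import pandas as pd

pat = re.compile(r'(?P<Date>[\d-]+ [\d:]+,\d+) (?P<msg_type>\w+) +'
    r'\[(?P<thread_num>\d+)\] (?P<msg_lyr>[\w.]+) +\- \[(?P<message>[^\]]+)')

def myGzipReader(file_name):
    # same joining logic as myReader, but reads the gzip file in text mode
    outRow = ''
    with gzip.open(file_name, 'rt') as f:
        for row in f:
            row = row.strip()
            if row[-2:] == '~~':
                outRow += row[0:-2]
                yield outRow
                outRow = ''
            else:
                outRow += row

def load_file(df, file_name, file_type):
    # parse one file, appending its rows to df in place
    for row in myGzipReader(file_name):
        mtch = pat.match(row)
        if mtch:
            dat = mtch.group('Date').replace(',', '.')
            df.loc[len(df)] = pd.Series({'eventtime': dat,
                'FileType': file_type,
                'msg_type': mtch.group('msg_type'),
                'thread_num': mtch.group('thread_num'),
                'msg_lyr': mtch.group('msg_lyr'),
                'message': mtch.group('message'),
                'colorcode': ''})

df = pd.DataFrame(columns=['eventtime', 'FileType', 'msg_type',
    'thread_num', 'msg_lyr', 'message', 'colorcode'])
# usage, with si_Files being the list of .gz paths from the question:
#     for si_file in si_Files:
#         load_file(df, si_file, 'silogs')
```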

Edit following the comments

The reason your code runs slowly is that for each source line:

  • A Series is created.
  • It is then appended to the previously created DataFrame.

A faster solution is to:

  • Build a list of rows (one dict per source line).
  • Append the whole list to the DataFrame in a single call.

So change the code to:

df = pd.DataFrame(columns=['eventtime', 'FileType', 'msg_type',
    'thread_num', 'msg_lyr', 'message', 'colorcode'])
pat = re.compile(r'(?P<Date>[\d-]+ [\d:]+,\d+) (?P<msg_type>\w+) +'
    r'\[(?P<thread_num>\d+)\] (?P<msg_lyr>[\w.]+) +\- \[(?P<message>[^\]]+)')
tbl = []
myGen = myReader('log.txt')
for row in myGen:
    mtch = re.match(pat, row)
    if mtch:
        dat = mtch.group('Date').replace(',', '.')
        tbl.append({'eventtime': dat, 'FileType': 'silogs',
            'msg_type': mtch.group('msg_type'), 'thread_num': mtch.group('thread_num'),
            'msg_lyr': mtch.group('msg_lyr'), 'message': mtch.group('message'),
            'colorcode': ''})
# note: DataFrame.append was removed in pandas 2.0; build a DataFrame
# from the list and concatenate instead
df = pd.concat([df, pd.DataFrame(tbl)], ignore_index=True)

I measured the execution times of both variants, and the one above runs almost 5 times faster. For bigger source files the difference should be even greater.
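The gap is easy to reproduce in isolation; here is a minimal sketch with synthetic rows (exact timings will vary by machine) comparing row-wise appends against building the list first:

```python
import time

import pandas as pd

rows = [{'a': i, 'b': str(i)} for i in range(1000)]

# variant 1: append one row at a time via df.loc
t0 = time.perf_counter()
df1 = pd.DataFrame(columns=['a', 'b'])
for r in rows:
    df1.loc[len(df1)] = pd.Series(r)
t1 = time.perf_counter()

# variant 2: collect everything into a list, build the DataFrame once
t2 = time.perf_counter()
df2 = pd.DataFrame(rows, columns=['a', 'b'])
t3 = time.perf_counter()

print(f'row-wise: {t1 - t0:.3f}s  list-based: {t3 - t2:.3f}s')
```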
