Unable to append to df when iterating with try/except

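Problem description

I am trying to iterate over PDFs to extract information from emails. My individual regex statements work when I try them on a single example, but when I put all the code in a for loop to iterate over multiple PDFs at once, I cannot append to my aggregate df (which I currently just create as an empty df). I need try/except because not all emails have all fields (for example, some have no "Attachments" field). Here is the code I have written so far: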

import os
import re
import pandas as pd
pd.options.display.max_rows=999
import numpy
from numpy import NaN
from tika import parser

root = r"my_dir"

agg_df = pd.DataFrame()

for directory, subdirectory, files in os.walk(root):
    for file in files:
        filepath = os.path.join(directory, file)
        print(file)
        raw = parser.from_file(filepath)
        img = raw['content']
        img = img.replace('\n', '')

        try:
            from_field = re.search(r'From:(.*?)Sent:', img).group(1)
        except:
            pass
        try:
            sent_field = re.search(r'Sent:(.*?)To:', img).group(1)
        except:
            pass
        try:    
            to_field = re.search(r'To:(.*?)Cc:', img).group(1)
        except:
            pass
        try:    
            cc_field = re.search(r'Cc:(.*?)Subject:', img).group(1)
        except:
            pass
        try:   
            subject_field = re.search(r'Subject:(.*?)Attachments:', img).group(1)
        except:
            pass
        try:
            attachments_field = re.search(r'Attachments:(.*?)NOTICE', img).group(1)
        except:
            pass

        img_df = pd.DataFrame(columns=['From', 'Sent', 'To', 
                                       'Cc', 'Subject', 'Attachments'])
        img_df['From'] = from_field
        img_df['Sent'] = sent_field
        img_df['To'] = to_field
        img_df['Cc'] = cc_field
        img_df['Subject'] = subject_field
        img_df['Attachments'] = attachments_field

        agg_df = agg_df.append(img_df)

Tags: python

Solution


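Two things:

  1. When you don't get a match, you shouldn't just pass on the exception. You should use a default value.
  2. Don't append to your dataframe after each loop iteration. That is slow. Save everything in a dictionary and then construct the dataframe at the end.

For example: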

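import os
import re
from collections import defaultdict

import pandas as pd
from tika import parser

root = r"my_dir"

# One list per field; the DataFrame is built once, after the loop.
data = defaultdict(list)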

for directory, _, files in os.walk(root):
    for file in files:
        filepath = os.path.join(directory, file)
        print(file)
        raw = parser.from_file(filepath)
        # Fall back to an empty string in case no text was extracted.
        img = raw['content'] or ''
        img = img.replace('\n', '')

        from_match = re.search(r'From:(.*?)Sent:', img)
        if not from_match:
            sent_by = None
        else:
            sent_by = from_match.group(1)
        data["from"].append(sent_by)

        # This pattern captures the "Sent:" field (the text between Sent: and To:).
        sent_match = re.search(r'Sent:(.*?)To:', img)
        if not sent_match:
            sent_on = None
        else:
            sent_on = sent_match.group(1)
        data["sent"].append(sent_on)

        # All your other regexes

df = pd.DataFrame(data)
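
To cover the remaining fields without repeating the if/else block, the same idea can be written as a small helper driven by a table of patterns. The following is only a sketch: `first_group`, `FIELD_PATTERNS`, and the two sample strings are hypothetical names and data introduced for illustration (the strings stand in for the text extracted from each PDF), while the patterns themselves are the ones from the question.

```python
import re
from collections import defaultdict

import pandas as pd

# Patterns copied from the question, keyed by the column they fill.
FIELD_PATTERNS = {
    "From": r'From:(.*?)Sent:',
    "Sent": r'Sent:(.*?)To:',
    "To": r'To:(.*?)Cc:',
    "Cc": r'Cc:(.*?)Subject:',
    "Subject": r'Subject:(.*?)Attachments:',
    "Attachments": r'Attachments:(.*?)NOTICE',
}


def first_group(pattern, text, default=None):
    """Return the first captured group, or `default` when there is no match."""
    match = re.search(pattern, text)
    return match.group(1).strip() if match else default


# Hypothetical sample texts standing in for the extracted PDF content.
samples = [
    "From: Alice Sent: Monday To: Bob Cc: Carol Subject: Hi Attachments: a.pdf NOTICE",
    "From: Dave Sent: Tuesday To: Erin Cc: Frank Subject: Hello NOTICE",
]

data = defaultdict(list)
for text in samples:
    # Every field gets exactly one entry per file, so all lists stay aligned.
    for field, pattern in FIELD_PATTERNS.items():
        data[field].append(first_group(pattern, text))

df = pd.DataFrame(data)
print(df)
```

Note that with these patterns an email without an "Attachments:" field also yields no Subject, because the Subject pattern expects "Attachments:" as its end marker. If that matters, the end marker would need to be made optional.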

Also, if you are doing this for a lot of files, you should consider using compiled expressions.
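
A minimal sketch of what compiled expressions could look like here, assuming the same patterns as in the question. The names `FROM_RE`, `SENT_RE`, `extract`, and the sample `texts` are hypothetical and only serve the illustration. Python's `re` module already caches recently used patterns, so the speed-up may be modest; compiling mainly keeps the patterns in one place and out of the loop body.

```python
import re

# Compile each pattern once, outside the loop over files.
FROM_RE = re.compile(r'From:(.*?)Sent:')
SENT_RE = re.compile(r'Sent:(.*?)To:')


def extract(compiled, text):
    """Return the first captured group of a compiled pattern, or None."""
    match = compiled.search(text)
    return match.group(1).strip() if match else None


# Hypothetical sample texts standing in for the extracted PDF content.
texts = [
    "From: Alice Sent: Monday To: Bob",
    "Sent: Tuesday To: Erin",
]

rows = [
    {"from": extract(FROM_RE, text), "sent": extract(SENT_RE, text)}
    for text in texts
]
print(rows)
```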

