Merging text-format emails into a single csv file for machine learning

Problem description

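I am using the Enron dataset for a machine learning problem. I want to merge all the spam files into one csv file and all the ham files into another csv file for further analysis. I am using the dataset listed here: https://github.com/crossedbanana/Enron-Email-Classification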

I used the code below to merge the emails, and the merge itself works. However, when I try to read the csv file and load it into pandas, I get the following error: ParserError: Error tokenizing data. C error: Expected 1 fields in line 8, saw 2

Code to merge the email txt files into a csv:

import glob
import os

# Append every spam .txt file to one output file (relies on the Unix `cat` command).
for f in glob.glob("./dataset_temp/spam/*.txt"):
    os.system("cat "+f+" >> OutFile1.csv")

Code to load into pandas:

```
# reading the csv into pandas
import pandas as pd

emails = pd.read_csv('OutFile1.csv')
print(emails.shape)
```

1. How can I get rid of the parser error? I think it occurs because of the commas present in the email messages.
2. How can I load each email message into pandas with just the email body?

This is what the email format looks like (an example of a text file in the spam folder).
The commas in line 3 cause the problem when loading into pandas.


*Subject: your prescription is ready . . oxwq s f e
low cost prescription medications
soma , ultram , adipex , vicodin many more
prescribed online and shipped
overnight to your door ! !
one of our us licensed physicians will write an
fda approved prescription for you and ship your
order overnight via a us licensed pharmacy direct
to your doorstep . . . . fast and secure ! !
click here !
no thanks , please take me off your list
ogrg z
lqlokeolnq
lnu* 


Thanks for any help. 

Tags: python-3.x, pandas, csv, merging-data

Solution


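Instead of reading and writing the data as a CSV file, you can use an Excel file. That way the ',' (comma) characters in the emails no longer cause errors. Just replace csv with excel.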
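To see why the commas matter, it may help to count how many comma-separated fields each line of the concatenated text contains. The sketch below uses Python's standard `csv` module on the first three lines of the sample email from the question; the variable names `sample` and `field_counts` are only illustrative. pandas takes the expected number of fields from the first line, so a later line with more fields triggers the "Expected 1 fields ... saw 2" kind of error.

```python
import csv
import io

# First three lines of the sample spam email from the question.
sample = (
    "Subject: your prescription is ready . . oxwq s f e\n"
    "low cost prescription medications\n"
    "soma , ultram , adipex , vicodin many more\n"
)

# Parse the raw text as if it were CSV and count the fields on each line.
field_counts = [len(row) for row in csv.reader(io.StringIO(sample))]
print(field_counts)  # [1, 1, 4]
```

The first line yields a single field while the third yields four, which is the kind of mismatch the parser rejects.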

Here is an example of the Excel approach:

    import os
    import pandas as pd

    # Build a list holding the full text of every email in a folder.
    def create_email_list(folder_path):
        email_list = []
        # Pass the folder path; if the folder is in the current directory, its name is enough.
        for txt in os.listdir(folder_path):
            file_name = os.path.join(folder_path, txt)
            # Read the email, ignoring bytes that are not valid UTF-8.
            with open(file_name, 'r', encoding='utf-8', errors='ignore') as f:
                email_list.append(f.read())
        return email_list

    # Note: to_excel needs an Excel writer package such as openpyxl installed.
    spam_list = create_email_list('spam')  # read the spam emails
    spam_df = pd.DataFrame(spam_list)      # one row per spam email
    spam_df.to_excel('spam.xlsx')          # write the spam Excel file

    ham_list = create_email_list('ham')    # read the ham emails
    ham_df = pd.DataFrame(ham_list)        # one row per ham email
    ham_df.to_excel('ham.xlsx')            # write the ham Excel file

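You only need to pass the folder path to the function (just the folder name if the folder is in the same directory as the script). This code will create the Excel files.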
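If a csv file is still preferred, one possible alternative (not part of the answer above) is to let a CSV writer quote each email so that commas and line breaks stay inside a single field. The following is a minimal sketch using the standard `csv` module. The helper names `read_body` and `write_emails_csv` and the column name `body` are hypothetical, and the sketch assumes that each file starts with a `Subject:` line and that everything after it is the body, as in the sample email from the question.

```python
import csv
import glob
import os

def read_body(path):
    """Return the email text without its leading 'Subject:' line (assumed layout)."""
    with open(path, encoding="utf-8", errors="ignore") as f:
        lines = f.read().splitlines()
    # Assumption: the first line is the subject and the rest is the body.
    if lines and lines[0].lower().startswith("subject:"):
        lines = lines[1:]
    return "\n".join(lines).strip()

def write_emails_csv(folder_path, out_path):
    """Write one quoted CSV row per .txt file in folder_path; return the row count."""
    paths = sorted(glob.glob(os.path.join(folder_path, "*.txt")))
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        # QUOTE_ALL wraps every field in quotes, so embedded commas
        # and newlines do not split an email across fields or rows.
        writer = csv.writer(out, quoting=csv.QUOTE_ALL)
        writer.writerow(["body"])  # header row with a single column
        for path in paths:
            writer.writerow([read_body(path)])
    return len(paths)
```

Calling `write_emails_csv("./dataset_temp/spam", "spam.csv")` and then `pd.read_csv("spam.csv")` should give a DataFrame with one `body` row per email, since pandas treats a quoted field as a single value even when it contains commas or line breaks.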

