Formatting unstructured csv in pandas

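Problem description

I am having trouble reading accurate information from archived 4chan comments. Because the thread structure of 4chan threads does not (it seems) translate well into a rectangular dataframe, I am having trouble getting each thread's comments into their own row in pandas.

To make matters worse, the dataset is 54GB, and I asked a similar question about how to read the data into a pandas dataframe (the solution to that question is what made me aware of this problem), which makes diagnosing each issue tedious.

The code I use to read part of the data is as follows: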

import pandas as pd

def Four_pleb_chunker():
    """
    :return: 4pleb data is over 54 GB so this chunks it into something manageable
    """
    with open('pol.csv') as f:
        with open('pol_part.csv', 'w') as g:
            # copy only the first 1000 physical lines of the file
            for i in range(1000):
                g.write(f.readline())

    name_cols = ['num', 'subnum', 'thread_num', 'op', 'timestamp', 'timestamp_expired', 'preview_orig', 'preview_w', 'preview_h',
            'media_filename', 'media_w', 'media_h', 'media_size', 'media_hash', 'media_orig', 'spoiler', 'deleted', 'capcode',
            'email', 'name', 'trip', 'title', 'comment', 'sticky', 'locked', 'poster_hash', 'poster_country', 'exif']

    cols = ['num','timestamp', 'email', 'name', 'title', 'comment', 'poster_country']

    df_chunk = pd.read_csv('pol_part.csv',
                           names=name_cols,
                           delimiter=None,
                           usecols=cols,
                           skip_blank_lines=True,
                           engine='python',
                           # error_bad_lines was removed in pandas 2.0; use on_bad_lines='skip' there
                           error_bad_lines=False)

    df_chunk = df_chunk.rename(columns={"comment": "Comments"})
    df_chunk = df_chunk.dropna(subset=['Comments'])
    df_chunk['Comments'] = df_chunk['Comments'].str.replace('[^0-9a-zA-Z]+', ' ', regex=True)

    df_chunk.to_csv('pol_part_df.csv')

    return df_chunk

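This code works, but because of the structure of each thread, the parser I wrote sometimes returns nonsensical results. In csv format, this is what the first few rows of the dataset look like (please excuse the screenshot; it is extremely difficult to actually write out all of these rows using this UI).

[Screenshot of the data]

As can be seen, the comments in each thread are split by "\", but each comment does not get its own row. My goal is to at least get each comment onto its own row so that I can parse it properly. However, the function I use to parse the data cuts off after 1000 iterations, regardless of whether it is at a new row.

Fundamentally, my question is: how can I structure this data so that the comments are read accurately, and so that I can read in a complete sample dataframe rather than a truncated one? As for solutions I have tried: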

df_chunk = pd.read_csv('pol_part.csv',
                               names=name_cols,
                               delimiter='',
                               usecols=cols,
                               skip_blank_lines=True,
                               engine='python',
                               error_bad_lines=False)

If I remove or change the delimiter parameter, I get this error:

Skipping line 31473: ',' expected after '"'

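This makes sense, because the data is not delimited by ",", so it skips every line that does not meet that condition, which in this case is the entire dataframe. Passing \ as the argument gives me a syntax error. I am at a bit of a loss, so if anyone has experience dealing with this kind of problem, you would be a lifesaver. If there is anything I have not included here, please let me know and I will update the post.

Update: here are some sample rows from the CSV for testing: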

2   23594708    1385716767  \N  Anonymous   \N  Example: not identifying the fundamental scarcity of resources which underlies the entire global power structure, or the huge, documented suppression of any threats to that via National Security Orders. Or that EVERY left/right ideology would be horrible in comparison to ANY in which energy scarcity and the hierarchical power structures dependent upon it had been addressed.
3   23594754    1385716903  \N  Anonymous   \N  ">>23594701\
                                                 \
                                                  No, /pol/ is bait. That's the point."
4   23594773    1385716983  \N  Anonymous   \N  ">>23594754
                                                 \
                                                 Being a non-bait among baits is equal to being a bait among non-baits."
5   23594795    1385717052  \N  Anonymous   \N  Don't forget how heavily censored this board is! And nobody has any issues with that.
6   23594812    1385717101  \N  Anonymous   \N  ">>23594773\
                                                 \
                                                 Clever. The effect is similar. But there are minds on /pol/ who don't WANT to be bait, at least."

Tags: python, python-3.x, pandas, csv

Solution

Here is an example script that converts your csv so that each comment gets its own row:

import csv

# open the input file and the output file; newline='' on the output
# lets the csv module control line endings itself
with open('test.csv') as f, open('out.csv', 'w', newline='') as f_out:
    r = csv.reader(f, delimiter='\t')
    w = csv.writer(f_out)
    for l in r:
        # skip empty lines
        if not l:
            continue
        # split the last field (the comment) on backslash + newline
        # and loop over each resulting string
        for s in l[-1].split('\\\n'):
            # copy all fields except the last one
            output = l[:-1]
            # add a single comment
            output.append(s)
            w.writerow(output)
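
To try the same splitting idea without touching the 54GB file, here is a minimal sketch that runs on an in-memory sample. It assumes the fields are tab-separated and that comment fragments are separated by a backslash followed by a newline, as in the sample rows from the question. The names SAMPLE, split_comment_rows and limit are made up for illustration. Dropping empty fragments is an extra choice here and is not part of the script above.

```python
import csv
import io
from itertools import islice

# Hypothetical sample modelled on the rows shown in the question:
# tab-separated fields, quoted comment containing backslash + newline.
SAMPLE = (
    '3\t23594754\t1385716903\t\\N\tAnonymous\t\\N\t">>23594701\\\n\\\nNo, /pol/ is bait."\n'
    "5\t23594795\t1385717052\t\\N\tAnonymous\t\\N\tDon't forget how heavily censored this board is!\n"
)

def split_comment_rows(stream, limit=None):
    """Yield one row per comment fragment from a tab-delimited stream.

    `limit` caps the number of CSV records read (blank ones included),
    not the number of physical lines.
    """
    reader = csv.reader(stream, delimiter='\t')
    for fields in islice(reader, limit):
        # skip empty records
        if not fields:
            continue
        # split the last field on backslash + newline
        for part in fields[-1].split('\\\n'):
            # assumption: fragments that are empty after splitting are dropped
            if part.strip():
                yield fields[:-1] + [part]

rows = list(split_comment_rows(io.StringIO(SAMPLE)))
first_only = list(split_comment_rows(io.StringIO(SAMPLE), limit=1))

for row in rows:
    print(row)
```

Because csv.reader yields whole records, a quoted comment that spans several physical lines stays in one piece. Limiting by records with islice therefore should not cut a comment in half, which counting lines with readline() can do.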
