Adding new columns to a CSV file with values from different dictionary comprehensions

Problem description

Here is my code below. I want to write new columns into my original CSV containing the values of each dictionary created during my code; for the last dictionary, which holds 3 values per key, I want each value written into its own column. The code that writes to the CSV is at the end, but maybe there is a way to write the values each time a new dictionary is generated.

My CSV writing code: I can't figure out how to append to the file without erasing the contents of the original.


# -*- coding: UTF-8 -*-
import codecs 
import re
import os
import sys, argparse
import subprocess
import pprint
import csv
from itertools import islice
import pickle
import nltk
from nltk import tokenize
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import pandas as pd

try:
    import treetaggerwrapper
    from treetaggerwrapper import TreeTagger, make_tags
    print("import TreeTagger OK")
except:
    print("Import TreeTagger pas Ok")

from itertools import islice
from collections import defaultdict

# load the pickled sentiment lexicon
pickle_in = open("dict_pickle", "rb")
dico_lexique = pickle.load(pickle_in)


# extract the verbatim column
d_verbatim = {}

with open(sys.argv[1], 'r', encoding='cp1252') as csv_file:
    csv_file.readline()
    for line in csv_file:
        token = line.split(';')
        try:
            d_verbatim[token[0]] = token[1]
        except:
            print(line)

#print(d_verbatim)

#Using treetagger   
tagger = treetaggerwrapper.TreeTagger(TAGLANG='fr')
d_tag = {}
for key, val in d_verbatim.items(): 
        newvalues = tagger.tag_text(val)
        d_tag[key] = newvalues
#print(d_tag)


#lemmatisation  
d_lemma = defaultdict(list)
for k, v in d_tag.items():
    for p in v:
        parts = p.split('\t')
        try:
            if parts[2] == '':
                d_lemma[k].append(parts[0])
            else:
                d_lemma[k].append(parts[2]) 
        except:
            print(parts)
#print(d_lemma) 


stopWords = set(stopwords.words('french'))          
d_filtered_words = {k: [w for w in l if w not in stopWords and w.isalpha()] for k, l in d_lemma.items()}

print(d_filtered_words)     

d_score = {k: [0, 0, 0] for k in d_filtered_words.keys()}
for k, v in d_filtered_words.items():
    for word in v:
        if word in dico_lexique:
            print(word, dico_lexique[word])

Tags: python, csv, list-comprehension, writer, dictionary-comprehension

Solution


Your edits seem to have made things worse; you've ended up deleting a lot of relevant context. I think I've pieced together what you're trying to do, and at its core it appears to be a routine that performs sentiment analysis on some text.

I'd start by creating a class that keeps track of this, e.g.:

class Sentiment:
    __slots__ = ('positive', 'neutral', 'negative')

    def __init__(self, positive=0, neutral=0, negative=0):
        self.positive = positive
        self.neutral = neutral
        self.negative = negative

    def __repr__(self):
        return f'<Sentiment {self.positive} {self.neutral} {self.negative}>'

    def __add__(self, other):
        return Sentiment(
            self.positive + other.positive,
            self.neutral + other.neutral,
            self.negative + other.negative,
        )

This lets you replace the convoluted [a + b for a, b in zip(map(int, dico_lexique[word]), d_score[k])] with score += sentiment in the function below, and lets us refer to the various values by name.
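
As a quick illustration of that addition (hypothetical words and values, and assuming dico_lexique has been rebuilt to map words to Sentiment objects as shown in the next step):

score = Sentiment()
for word in ('bon', 'mauvais'):            # hypothetical lemmas, not from your data
    sentiment = dico_lexique.get(word)     # a Sentiment, or None if the word is unknown
    if sentiment is not None:
        score += sentiment                 # replaces the zip/map/int comprehension
print(score)                               # e.g. <Sentiment 1 0 1>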

Then I'd suggest preprocessing your pickled data so you don't have to convert things to int in the middle of unrelated code, e.g.:

with open("dict_pickle", "rb") as fd:
    dico_lexique = {}
    for word, (pos, neu, neg) in pickle.load(fd).items():
        dico_lexique[word] = Sentiment(int(pos), int(neu), int(neg))

This puts them straight into the class above, and it seems to match the other constraints in your code, but I don't have your data so I can't check.
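
For reference, this assumes the pickle holds a mapping from each word to three score strings (positive, neutral, negative). Purely as an illustration, such a file could have been produced along these lines; the words and values here are made up:

import pickle

lexicon = {
    'bon': ('1', '0', '0'),        # hypothetical entries: word -> (pos, neu, neg)
    'mauvais': ('0', '0', '1'),
}
with open("dict_pickle", "wb") as fd:
    pickle.dump(lexicon, fd)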

After pulling apart all of your comprehensions and loops, we're left with a nice routine for processing a single piece of text:

def process_text(text):
    """process the specified text
    returns (words, filtered words, total sentiment score)
    """
    words = []
    filtered = []
    score = Sentiment()

    for tag in make_tags(tagger.tag_text(text)):
        word = tag.lemma
        words.append(word)

        if word not in stopWords and word.isalpha():
            filtered.append(word)

        sentiment = dico_lexique.get(word)
        if sentiment is not None:
            score += sentiment

    return words, filtered, score
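
Used on its own, and assuming tagger, stopWords and dico_lexique have been set up as above, it would be called along these lines (the sentence and the printed score are only illustrative):

words, filtered, score = process_text("Le service était vraiment excellent.")
print(words)      # every lemma returned by TreeTagger
print(filtered)   # lemmas with stop words and non-alphabetic tokens removed
print(score)      # e.g. <Sentiment 1 0 0>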

We can put this into a loop that reads rows from the input and sends them to an output file:

filename = sys.argv[1]
tempname = filename + '~'

with open(filename) as fdin, open(tempname, 'w') as fdout:
    inp = csv.reader(fdin, delimiter=';')
    out = csv.writer(fdout, delimiter=';')

    # get the header, and blindly append our column names
    header = next(inp)
    out.writerow(header + [
        'd_lemma', 'd_filtered_words', 'Positive Score', 'Neutral Score', 'Negative Score',
    ])

    for row in inp:
        # assume that second item contains the text we want to process
        words, filtered, score = process_text(row[1])
        extra_values = [
            words, filtered,
            score.positive, score.neutral, score.negative,
        ]
        # add the values and write out
        assert len(row) == len(header), "code needed to pad the columns out"
        out.writerow(row + extra_values)

# only replace if everything succeeds
os.rename(tempname, filename)

We write out to a different file and only rename it on success, which means that if the code crashes it won't leave a partially written file behind. That said, I'd discourage working like this and tend to have my scripts filter from stdin to stdout instead. That way I can run:

$ python script.py < input.csv > output.csv

when everything is working, and it also lets me run:

$ head input.csv | python script.py

if I just want to test with the first few lines of input, or:

$ python script.py < input.csv | less

if I want to inspect the output as it's being generated.
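
A minimal sketch of that stdin/stdout variant (untested, and assuming the same process_text routine and semicolon-delimited CSV as above) could look like:

import csv
import sys

inp = csv.reader(sys.stdin, delimiter=';')
out = csv.writer(sys.stdout, delimiter=';')

# copy the header across, adding our new column names
header = next(inp)
out.writerow(header + [
    'd_lemma', 'd_filtered_words', 'Positive Score', 'Neutral Score', 'Negative Score',
])

# process each row and append the derived values
for row in inp:
    words, filtered, score = process_text(row[1])
    out.writerow(row + [words, filtered,
                        score.positive, score.neutral, score.negative])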

Note that none of this code has been run, so there may well be bugs in it, but at least I can now see what the code is trying to do. Comprehensions and "functional"-style code are great, but they can easily become hard to read if you're not careful.

