Comparing and Updating CSV Files Using Lists

Problem Description

I'm writing something that requires two CSVs: #1 is a list of email addresses with the number of emails received by each, and #2 is a running log of every email address recorded and how many emails it received in each reporting period, with the date of each period noted in the column header.

import csv
from datetime import datetime

datestring = datetime.strftime(datetime.now(), '%m-%d')
storedEmails = []
newEmails = []
sortedList = []
holderList = []

with open('working.csv', 'r') as newLines, open('archive.csv', 'r') as oldLines:   #readers to make lists
    f1 = csv.reader(newLines, delimiter=',')
    f2 = csv.reader(oldLines, delimiter=',')

    print ('Processing new data...')
    for row in f2:
        storedEmails.append(list(row))                                   #add archived data to a list
    storedEmails[0].append(datestring)                                   #append header row with new date column
    for col in f1:
        if col[1] == 'email' and col[2] == 'To Address':                 #new list containing new email data
            newEmails.append(list(col))
    counter = len(newEmails)
    n = len(storedEmails[0])                                             #using header row len to fill zeros if no email received
    print(storedEmails[0])
    print (n)
    print ('Updating email lists and tallies, this could take a minute...')

with open ('archive.csv', 'w', newline='') as toWrite:                   #writer to overwrite old csv
    writer = csv.writer(toWrite, delimiter=',')
    for i in newEmails:
        del i[:3]                   #strip useless identifiers from data
        if int(i[1]) > 30:          #only keep emails with sufficient traffic
            sortedList.append(i)    #add these emails to new sorted list
    for i in storedEmails:
        for entry in sortedList:    #compare stored emails with the new emails, on match append row with new # of emails
            if i[0] == entry[0]:
                i.append(entry[1])
                counter -=1
            else:
                holderList.append(entry)    #if no match, it is a new email that meets criteria to land itself on the list
        break                               #break inner loop after iteration of outer email, to move to next email and avoid multiple entries
    storedEmails = storedEmails + holderList    #combine lists for archived csv rewrite
    for i in storedEmails:
         if len(i) < n:
             i.append('0')                      #if email on list but didnt have any activity this period, append with 0 to keep records intact
         writer.writerow(i)

    print('SortedList', sortedList)
    print (len(sortedList))
    print('storedEmails', storedEmails)
    print(len(storedEmails))
    print('holderList',holderList)
    print(len(holderList))

    print ('There are', counter, 'new emails being added to the list.')

    print ('All done!')

The CSVs look something like this.

working.csv:

1,asdf@email.com,'to address',31
2,fsda@email.com,'to address',19
3,zxcv@email.com,'to address',117
4,qwer@gmail.com,'to address',92
5,uiop@fmail.com,'to address',11

archive.csv:

date,01-sep
asdf@email.com,154
fsda@email.com,128
qwer@gmail.com,77
ffff@xmail.com,63

What I want after processing:

date,01-sep,27-sep
asdf@email.com,154,31
fsda@email.com,128,19
qwer@gmail.com,77,92
ffff@xmail.com,63,0
zxcv@email.com,0,117

I'm not sure where I'm going wrong, but it keeps producing duplicate entries. Some of the functionality is there, but I've been working on this for so long that I have tunnel vision trying to figure out what my loops are doing wrong.

I also know the zero-fill section at the end is wrong, because it appends the zero to the end of a newly created record instead of padding zeros in front of its first occurrence.

I'm sure there are more efficient ways to do this. I'm new to programming, so it's probably overcomplicated and messy. Originally I tried to compare the CSVs directly and realized that wasn't possible, since you can't read and write at the same time, so I switched to using lists. I also know this won't work forever once the lists get large, because of memory constraints.
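
For reference, here is a minimal sketch of one way the matching loop could be restructured so that a candidate email lands in holderList only once, after every stored row has been checked (same variable names as the code above; it does not address the zero-fill issue, only the duplicate appends):

# Minimal sketch: append an email to holderList only after the inner loop
# has checked every stored row, using Python's for/else.
for entry in sortedList:
    for i in storedEmails[1:]:             # skip the header row
        if i[0] == entry[0]:
            i.append(entry[1])             # match: record this period's count
            counter -= 1
            break                          # stop searching once matched
    else:
        holderList.append(entry)           # no stored row matched: genuinely new email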

-EDIT- Using Trenton's pandas solution:

I run a script on working.csv first, so it now produces the following:

asdf@email.com,1000
bsdf@gmail.com,500
xyz@fmail.com,9999

I've modified your solution to reflect this change:

import pandas as pd
from datetime import datetime
import csv

# get the date string
datestring = datetime.strftime(datetime.now(), '%d-%b')

# filter original list to grab only emails of interest
with open ('working.csv', 'r') as fr, open ('writer.csv', 'w', newline='') as fw:
    reader = csv.reader(fr, delimiter=',')
    writer = csv.writer(fw, delimiter=',')
    for row in reader:
        if row[1] == 'Email' and row[2] == 'To Address':
            writer.writerow([row[3], row[4]])

# read archive
arch = pd.read_csv('archive.csv')

# rename columns
arch.rename(columns={'email': 'date'}, inplace=True)

# read working, but only the two columns that are needed
working = pd.read_csv('writer.csv', header=None, usecols=[0, 1]) # I assume usecols isn't necessary anymore, but I'm not sure

# rename columns
working.rename(columns={0: 'email', 1: datestring}, inplace=True)

# only emails greater than 30 or already in arch
working = working[(working[datestring] > 30) | (working.email.isin(arch.email))]

# merge
arch_updated = pd.merge(arch, working, on='email', how='outer').fillna(0)

# save to csv
arch_updated.to_csv('archive.csv', index=False)

I clearly still don't understand how this works, because I now get:

Traceback (most recent call last):
  File "---/agsdga.py", line 29, in <module>
    working = working[(working[datestring] > 30) | (working.email.isin(arch.email))]
  File "---\Python\Python38-32\lib\site-packages\pandas\core\generic.py", line 5130, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'email'

Process finished with exit code 1
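
That AttributeError just means the DataFrame has no column literally named email at the moment the filter runs, so attribute-style access like arch.email fails. A quick sanity check (a small debugging sketch, assuming arch and working from the code above) is to print the column names just before the filter:

print(arch.columns.tolist())       # the column names actually read from archive.csv
print(working.columns.tolist())    # arch.email / working.email only work if 'email'
                                   # appears in the corresponding list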

-UPDATE- It now works as follows:

import pandas as pd
from datetime import datetime
import csv

# get the date string
datestring = datetime.strftime(datetime.now(), '%d-%b')

with open ('working.csv', 'r') as fr, open ('writer.csv', 'w', newline='') as fw:
    reader = csv.reader(fr, delimiter=',')
    writer = csv.writer(fw, delimiter=',')
    for row in reader:
        if row[1] == 'Email' and row[2] == 'To Address':
            writer.writerow([row[3], row[4]])

# read archive
arch = pd.read_csv('archive.csv')

# rename columns
arch.rename(columns={'date': 'email'}, inplace=True)

# read working, but only the two columns that are needed
working = pd.read_csv('writer.csv', header=None, usecols=[0, 1])

# rename columns
working.rename(columns={0: 'email', 1: datestring}, inplace=True)

# only emails greater than 30 or already in arch
working = working[(working[datestring] > 30) | (working.email.isin(arch.email))]

# merge
arch_updated = pd.merge(arch, working, on='email', how='outer').fillna(0)

# save to csv
arch_updated.to_csv('archive.csv', index=False)

The error above was because I had changed

arch.rename(columns={'date': 'email'}, inplace=True)

to

arch.rename(columns={'email': 'date'}, inplace=True)

I ran into further complications because I had deleted the header row from my test archive, thinking the header didn't matter, and even with header=None I was still having problems. I'm still not clear on why the header matters so much when we assign our own column names for the DataFrame, but it works now. Thanks for all the help!
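
As an aside, the header behaviour comes down to how read_csv is told to treat the first row. A small sketch of the difference, using a hypothetical two-column file emails.csv:

# Sketch of the header difference (hypothetical file 'emails.csv'):
df_named = pd.read_csv('emails.csv')            # default: the first row becomes the column names
df_raw = pd.read_csv('emails.csv',
                     header=None,               # no header row in the file,
                     names=['email', 'count'])  # so assign our own column names instead

If the file really has no header row but read_csv is left at its default, the first data row gets promoted to column names, and later rename or attribute access by name won't find the columns you expect.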

Tags: python, csv

Solution


So, the problem is that you have two sets of data. Both store their data under a "key" entry (the email address) along with additional data that you want to condense into a single store. Recognizing that the two data sets share a common "key" simplifies this considerably.

Think of each key as the name of a bucket. Each bucket needs two pieces of information: one from one CSV and one from the other.

Now, I have to take a small detour to explain dictionaries in Python. Here is a definition stolen from here:

A dictionary is a collection which is unordered, changeable and indexed.

A collection is a container, similar to a list, that holds data. Unordered and indexed mean that a dictionary is not accessed the way a list is, where data is retrieved by its position. Instead, a dictionary is accessed by key, and a key can be just about anything, such as a string or a number (technically, a key must be hashable, but that goes too deep). Finally, changeable means the dictionary can actually change the data it stores (again, an oversimplification).

Example:

dictionary = dict()
key = "Something like a string or a number!"
dictionary[key] = "any kind of value can be stored here! Even lists and other dictionaries!"
print(dictionary[key])  # Would print the above string

This is the structure I suggest you use in place of most of your lists:

dictionary[email] = [item1, item2]

This way you avoid juggling multiple lists and greatly simplify your code. If you still have questions about using dictionaries, there are plenty of articles and videos on the subject. Good luck!
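
To make that concrete, here is a rough sketch of how the two files from the question could be combined with a single dictionary keyed by email. It follows the sample CSVs shown above (the filenames and the 30-email threshold come from the question; the exact column positions in working.csv are an assumption based on the sample):

import csv
from datetime import datetime

datestring = datetime.now().strftime('%d-%b')
buckets = {}                                      # email -> list of counts, one per period

# load the archive: header row plus one history row per email
with open('archive.csv', 'r') as f:
    rows = [row for row in csv.reader(f) if row]
header = rows[0] + [datestring]                   # extend the header with the new date column
for row in rows[1:]:
    buckets[row[0]] = row[1:]

# fold in this period's counts from working.csv
with open('working.csv', 'r') as f:
    for row in csv.reader(f):
        if not row:
            continue
        email, count = row[1], row[3]             # assumed layout from the sample working.csv above
        if email in buckets or int(count) > 30:   # keep known emails, or new ones with enough traffic
            history = buckets.setdefault(email, ['0'] * (len(header) - 2))
            history.append(count)

# rewrite the archive, zero-filling any email with no activity this period
with open('archive.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    for email, history in buckets.items():
        history += ['0'] * (len(header) - 1 - len(history))
        writer.writerow([email] + history)

Since dictionaries preserve insertion order in Python 3.7+, the archived emails come out first, followed by any new ones, which matches the desired output shown in the question.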

