How to replace quotes and tabs in a CSV file for Pandas preprocessing?

Problem description

I have a dataset that was downloaded as a tab-delimited file, with every column wrapped in quotes.

Pandas cannot interpret the file correctly.

I need to preprocess the file to remove the quotes and replace the tabs with a different delimiter that pandas can understand.

The dataset looks like this:

"Period: 1/12/2018 1:00:00 - 8/12/2018 0:59:59"

"Date"  "Power AC (IS01_I1) [W]"    "Power AC (IS01_I2) [W]"    "Power AC (IS01_I3) [W]"    "Power AC (IS02_I1) [W]"    "Power AC (IS02_I2) [W]"    "Power AC (IS02_I3) [W]"    "Power AC (IS03_I1) [W]"    "Power AC (IS03_I2) [W]"    "Power AC (IS03_I3) [W]"    "Power AC (IS04_I1) [W]"    "Power AC (IS04_I2) [W]"    "Power AC (IS04_I3) [W]"    "Power AC (IS05_I1) [W]"    "Power AC (IS05_I2) [W]"    "Power AC (IS05_I3) [W]"    "Power AC (IS06_I1) [W]"    "Power AC (IS06_I2) [W]"    "Power AC (IS06_I3) [W]"    "Power AC (IS07_I1) [W]"    "Power AC (IS07_I2) [W]"    "Power AC (IS07_I3) [W]"    "Power AC (IS08_I1) [W]"    "Power AC (IS08_I2) [W]"    "Power AC (IS08_I3) [W]"    "Power AC (IS09_I1) [W]"    "Power AC (IS09_I2) [W]"    "Power AC (IS09_I3) [W]"    "Power AC (IS10_I1) [W]"    "Power AC (IS10_I2) [W]"    "Power AC (IS10_I3) [W]"
"1/12 1:00" "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"
"1/12 2:00" "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"
"1/12 3:00" "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"
"1/12 4:00" "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"  "0.00"

I have tried various different search terms in .replace(), but without success:

df = 'C:\\path\\'
text = open(df+"\\Energy 1.csv", "r")
text = ''.join([i for i in text]) \
    .replace('\t',",")
x = open(df+"\\E1 Clean.csv","w")
x.writelines(text)
x.close()

The resulting file either still contains the quotes or is corrupted (it shows up as Chinese characters when opened).

I have also tried different kinds of regular expressions, but with no luck:

import re
text = open(df+"\\Energy 1.csv", "r")
text = ''.join([i for i in text])
text = re.sub('\"\"', ",", text)
x = open(df+"\\E1 Clean.csv","w")
x.writelines(text)
x.close()

I am sure the solution must be a simple one, but it eludes me.

If I try something like:

pd.read_csv(df+"\\Energy 1.csv",sep='"\t"',skiprows=2)

I get this:

0                                                     
1     " D a t u m " \t " E n e r g y   g e n e r a ...
2                                                     
3                " 1 / 1 2   1 : 0 0 " \t " 0 . 0 0 " 
4                                                     
5                " 1 / 1 2   2 : 0 0 " \t " 0 . 0 0 " 
6                                                     
7                " 1 / 1 2   3 : 0 0 " \t " 0 . 0 0 " 
8                                                     
9                " 1 / 1 2   4 : 0 0 " \t " 0 . 0 0 " 
10                                                    
11               " 1 / 1 2   5 : 0 0 " \t " 0 . 0 0 " 
12                                                    
13               " 1 / 1 2   6 : 0 0 " \t " 0 . 0 0 " 
14                                                    
15               " 1 / 1 2   7 : 0 0 " \t " 0 . 0 0 " 
16                                                    
17               " 1 / 1 2   8 : 0 0 " \t " 0 . 0 0 " 
18                                                    
19             " 1 / 1 2   9 : 0 0 " \t " 1 3 . 1 3 " 

If instead I try:

pd.read_csv(df+"\\Energy 1.csv",sep='\t',skiprows=2)

I get:

Error: line contains NULL byte

Expected result:

A new dataset with the following format:

Period: 1/12/2018 1:00:00 - 8/12/2018 0:59:59

Date,Power AC (IS01_I1) [W],Power AC (IS01_I2) [W],Power AC (IS01_I3) [W],Power AC (IS02_I1) [W],Power AC (IS02_I2) [W],Power AC (IS02_I3) [W],Power AC (IS03_I1) [W],Power AC (IS03_I2) [W],Power AC (IS03_I3) [W],Power AC (IS04_I1) [W],Power AC (IS04_I2) [W],Power AC (IS04_I3) [W],Power AC (IS05_I1) [W],Power AC (IS05_I2) [W],Power AC (IS05_I3) [W],Power AC (IS06_I1) [W],Power AC (IS06_I2) [W],Power AC (IS06_I3) [W],Power AC (IS07_I1) [W],Power AC (IS07_I2) [W],Power AC (IS07_I3) [W],Power AC (IS08_I1) [W],Power AC (IS08_I2) [W],Power AC (IS08_I3) [W],Power AC (IS09_I1) [W],Power AC (IS09_I2) [W],Power AC (IS09_I3) [W],Power AC (IS10_I1) [W],Power AC (IS10_I2) [W],Power AC (IS10_I3) [W]
1/12 1:00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
1/12 2:00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
1/12 3:00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
1/12 4:00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00

Edit:

Using the following code:

text = open(df+"\\Energy 2.csv", "r")
for i in text:
    i = re.sub('\t', ',',i)
    i = re.sub('"', '',i)
    print(i)
text.close()

I have been able to read each line, replace the \t characters with commas, and remove the quotes. However, if I use re.sub("\s+","",i), the whitespace is not removed.

I have noticed that these files seem to have a "ÿþ" character at the very beginning.
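As far as I can tell, 'ÿþ' is how the bytes FF FE, the UTF-16 little-endian byte-order mark, render when the file is read with a single-byte encoding, which would also explain the '\x00' bytes and the extra spaces between characters in the output above. A quick check (just a sketch, reusing the path and file name from the first snippet) would be to look at the raw bytes and then re-open the file with the matching encoding:

df = 'C:\\path\\'

# Peek at the first few raw bytes; b'\xff\xfe' is the UTF-16 LE byte-order mark
with open(df+"\\Energy 1.csv", "rb") as raw:
    print(raw.read(4))

# If the BOM is there, decoding the file as UTF-16 removes both the BOM and the NUL bytes
with open(df+"\\Energy 1.csv", "r", encoding="utf-16") as text:
    print(text.readline())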

Edit:

After some digging around, I realised that some of the characters in the original file are invisible. The following code seems to remove most of them, but it still does not produce a usable dataframe:

a = pd.Series()
text = open(df+"\\Energy 1.csv", "r")
for i in text:
    i = re.sub('\t', ";",i)
    i = re.sub('ÿþ','',i)
    i = re.sub('"', "",i)
    i = re.sub('\x00','',i)
    i = re.sub('\n','',i)
    i = pd.Series(i)
    a = a.append(i)
text.close()

Tags: python, pandas, csv

Solution


As far as I can tell, the problem is that the dataset contains some invisible/strange characters ('ÿþ', '\x00').

The workaround I used was developed through trial and error and is shown below.

It works by saving the cleaned data to a temp file and re-importing it. I am sure this is neither good practice nor "pythonic", but it does work.

    import os
    import re
    import pandas as pd

    df = 'C:\\Users\\Desktop\\AutoDownloadData'
    filelist = os.listdir(df)
    ResInv = pd.DataFrame()
    ResNrg = pd.DataFrame()
    ResSens = pd.DataFrame()
    ResCurr = pd.DataFrame()

    for f in filelist:
        print(f)
        # Series that will hold the cleaned input lines
        a = pd.Series()
        # Open the input file in read mode
        text = open(df+"\\{}".format(f), "r")
        # Read every line of the input file, strip the unwanted characters,
        # replace the tabs with ';' and collect the cleaned lines in the series a
        for i in text:
            i = re.sub('\t', ';',i)
            i = re.sub('ÿþ','',i)
            i = re.sub('"', '',i)
            i = re.sub('\x00','',i)
            i = re.sub('\n','',i)
            i = pd.Series(i)
            a = a.append(i)
        text.close()
        # Export the series a to a temporary file
        a.to_csv(df+'\\tempCSV.csv')
        # Re-import the series a into a series b
        b = pd.read_csv(df+'\\tempCSV.csv')
        b.columns = list(range(0,len(b.columns)))
        # Keep only the rows of b that contain values
        b = b[1].loc[b[1].notna()]
        # Export b to a temporary file
        b.to_csv(df+'\\tempCSV.csv')
        # Re-import b into a dataframe called c, using ';' as the delimiter
        c = pd.read_csv(df+'\\tempCSV.csv', sep=';')
        c.columns = list(range(0,len(c.columns)))
        # Go through every row in column 0 of dataframe c and drop everything
        # before the comma
        for i in range(0,len(c[0])):
            cln = str(c[0].loc[i])
            cln = re.sub(r'.*,', '', cln)
            c[0].loc[i] = cln
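
As a side note, since the root cause appears to be the UTF-16 encoding (the 'ÿþ' byte-order mark and the '\x00' bytes), it may be possible to skip the temp-file round trip and let pandas decode the original file directly. This is only a sketch based on the sample shown above (folder and file name taken from the snippets in the question), not something the workaround above depends on:

    import pandas as pd

    df = 'C:\\Users\\Desktop\\AutoDownloadData'
    # Decode the UTF-16 file directly; skiprows=1 drops the "Period: ..." line and,
    # once the encoding is right, the default quoting rules strip the quotes
    # around each tab-separated field
    c = pd.read_csv(df+'\\Energy 1.csv', sep='\t', encoding='utf-16', skiprows=1)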
