首页 > 解决方案 > 如何使用python从csv文件的所有列中过滤字符串

问题描述

csv 文件示例我有一个 csv 文件,我需要检查所有列才能找到?在 csv 文件中并删除这些行。

下面是一个例子

Column1 Column 2 Column 3
1 ? 3
2 ?.. 1
? 2 ?.
? 4 4

我尝试了以下但它不起作用

data = readData(“text.csv”)
print(data)

def Filter(string, substr):
return [str for str in string if
any(sub not in str for sub in substr)]

string = data
substr = [’?’,’?.’,’? ‘,’? ']
filter_data=Filter(string, substr)

我的代码在下面获取元组中的输出。

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

def readData(filename) :
    data = pd.read_csv(filename, skipinitialspace=True)
    return [d for d in data.itertuples(index=False, name=None)]

data = readData("problem2.csv")
print(data)

[('18.0', 8, '307.0 ', '130.0 ', '3504.', '12.0', 70, 1, 'chevrolet chevelle malibu'), ('15.0', 8, '350.0 ', '165.0 ', '3693.', '11.5', 70, 1, 'buick skylark 320'), ('18.0', 8, '318.0 ', '150.0 ', '?.', '11.0', 70, 1, 'plymouth satellite'), ('16.0', 8, '304.0 ', '150.0 ', '3433.', '12.0', 70, 1, 'amc rebel sst'), ('17.0', 8, '302.0 ', '140.0 ', '3449.', '10.5', 70, 1, 'ford torino'), ('15.0', 8, '429.0 ', '198.0 ', '4341.', '10.0', 70, 1, 'ford galaxie 500'), ('14.0', 8, '454.0 ', '220.0 ', '4354.', '9.0', 70, 1, 'chevrolet impala'), ('14.0', 8, '440.0 ', '215.0 ', '4312.', '8.5', 70, 1, 'plymouth fury iii'),

接下来要删除带有 '?; 的行 从所有列中以元组形式提供相同的输出。

标签: python-3.x

解决方案


我的输入文件如下:

mpg,cylinder,displace,horsepower,weight,accelerate,year,origin,name
18,8,307,130,3504,12,70,1,chevy malibu
18,8,308,140,?.,14,70,1,plymoth satellite
18,8,309,150,?,15,70,1,ford torino
18,8,310,150,? ,16,70,1,ford galaxy
18,8,310,150, ?,17,70,1,pontiac catalina
18,8,310,150,3505,18,70,1,ford maverick

替换以下任何情况的代码['?','?.',' ?','? ']如下:

import csv
qs = ['?','?.',' ?','? ']
with open('abc.txt') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    for row in csv_reader:
        row = ['' if r in qs else r for r in row]
        print (row)

其输出如下:

['mpg', 'cylinder', 'displace', 'horsepower', 'weight', 'accelerate', 'year', 'origin', 'name']
['18', '8', '307', '130', '3504', '12', '70', '1', 'chevy malibu']
['18', '8', '308', '140', '', '14', '70', '1', 'plymoth satellite']
['18', '8', '309', '150', '', '15', '70', '1', 'ford torino']
['18', '8', '310', '150', '', '16', '70', '1', 'ford galaxy']
['18', '8', '310', '150', '', '17', '70', '1', 'pontiac catalina']
['18', '8', '310', '150', '3505', '18', '70', '1', 'ford maverick']

如您所见,第 3 行到第 6 行的值已替换为''.

用另外一个样本数据集运行:

mpg,cylinder,displace,horsepower,weight,accelerate,year,origin,name
18,8,307,130,3504,12,70,1,chevy malibu
18,8,308,140,?.,14,70,1,plymoth satellite
18,8,309,?,3506,15,70,1,ford torino
18,8,310,160,? ,16,70,1,ford galaxy
18,8,311,170,3508, ?,70,1,pontiac catalina
18,8,312,180,3509,18,70,1,ford maverick

输出是:

['mpg', 'cylinder', 'displace', 'horsepower', 'weight', 'accelerate', 'year', 'origin', 'name']
['18', '8', '307', '130', '3504', '12', '70', '1', 'chevy malibu']
['18', '8', '308', '140', '', '14', '70', '1', 'plymoth satellite']
['18', '8', '309', '', '3506', '15', '70', '1', 'ford torino']
['18', '8', '310', '160', '', '16', '70', '1', 'ford galaxy']
['18', '8', '311', '170', '3508', '', '70', '1', 'pontiac catalina']
['18', '8', '312', '180', '3509', '18', '70', '1', 'ford maverick']

在这种情况下,?它位于各个列上。它仍然解决了这个问题。

如果您要一次性查找所有行,您可以将所有行读入一个变量并进行处理。

qs = {'?.':'',' ?':'','? ':'','?':''}
with open('abc.txt') as csv_file:
    lines = csv_file.readlines()
    for i,text in enumerate(lines):
        [text := text.replace(a,b) for a,b in qs.items()]
        lines[i] = text
    print (lines)

您的输出数据将如下所示:

['mpg,cylinder,displace,horsepower,weight,accelerate,year,origin,name\n', '18,8,307,130,3504,12,70,1,chevy malibu\n', '18,8,308,140,,14,70,1,plymoth satellite\n', '18,8,309,,3506,15,70,1,ford torino\n', '18,8,310,160,,16,70,1,ford galaxy\n', '18,8,311,170,3508,,70,1,pontiac catalina\n', '18,8,312,180,3509,18,70,1,ford maverick\n']

元组输出

看起来您期望元组作为输出。

这是执行此操作的代码:

import csv
qs = {'?.':'',' ?':'','? ':'','?':''}
final_list = []

with open('abc.txt') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    for row in csv_reader:
        row = ['' if r in qs else r for r in row]
        final_list.append(tuple(row))

print (final_list)

输出如下:

[('mpg', 'cylinder', 'displace', 'horsepower', 'weight', 'accelerate', 'year', 'origin', 'name'), ('18', '8', '307', '130', '3504', '12', '70', '1', 'chevy malibu'), ('18', '8', '308', '140', '', '14', '70', '1', 'plymoth satellite'), ('18', '8', '309', '', '3506', '15', '70', '1', 'ford torino'), ('18', '8', '310', '160', '', '16', '70', '1', 'ford galaxy'), ('18', '8', '311', '170', '3508', '', '70', '1', 'pontiac catalina'), ('18', '8', '312', '180', '3509', '18', '70', '1', 'ford maverick')]

推荐阅读