Creating filtered datasets from multiple data frames

Problem description

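I want to create filtered datasets based on multiple data frames (the data frames differ from one another because they cover different topics). For each data frame, I need to filter rows on some key words. For example, for the first data frame I only need the rows that contain certain words (e.g. Michael and Andrew); for the second data frame I only need the rows that contain the word Laura, and so on.

Example of the original data frames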

df["0"]

Names    Surnames
Michael  Connelly
John     Smith
Andrew   Star
Laura    Parker

df["1"]

Names   Surnames
Laura   Bistro
Lisa    Roberts
Luke    Gary
Norman  Loren

To do this, I wrote the following:

for i in range(0,1): # I have more than 50 data frames, but I am considering only two for this example
    key_words = [] 

    while True:
        key_word = input("Key word : ")

        if key_word!='0':
            list_key_words.append(key_word)
            dataframe[str(i)].Filter= dataframe[str(i)]..str.contains('|'.join(key_word), case=False, regex=True) # Creates a new column where with boolean values
            dataframe[str(i)].loc[dataframe[str(i)].Filter != False]

            filtered=dataframe[str(i)][dataframe[str(i)]. Filter != False] # Create a dataframe/dataset with only filtered rows
            filtered_surnames=filtered['Names'].tolist() # this should select only the column called Names, existing in each dataframe, just for analysing them

Expected output:

df["0"]

Names    Surnames  Filter
Michael  Connelly  1
John     Smith     0
Andrew   Star      1
Laura    Parker    0

df["1"]

Names   Surnames  Filter
Laura   Bistro    1
Lisa    Roberts   0
Luke    Gary      0
Norman  Loren     0

The filtered datasets should then have 2 rows and 1 row, respectively.

filtered["0"]

Names    Surnames  Filter
Michael  Connelly  1
Andrew   Star      1


filtered["1"]

Names  Surnames  Filter
Laura  Bistro    1

However, the filtering lines in my code seem to be wrong. Could you take a look at them and let me know where the error is?

Tags: python, pandas

Solution


# Assumes `dataframe` is a dict of pandas DataFrames keyed by "0", "1", ...

# BUG 5: You need a way to save the results or they will be overwritten each time
filtered = {}

# BUG 1: range(first index included, last index excluded), to get 1 you need range(0, 2)
for i in range(0, 2):  # I have more than 50 data frames, but I am considering only two for this example
    df = dataframe[str(i)]
    list_key_words = []  # start a fresh list of key words for each data frame

    while True:
        key_word = input("Key word : ")

        if key_word != '0':
            list_key_words.append(key_word)

        # BUG 6: you need to actually leave the "while True" loop at some point
        else:
            break

    # BUG 2.1: you can't apply ".str.contains" to an entire row, you need to indicate the column by name, e.g. "Names".
    # If you want to test all the columns, you need multiple filter columns which you OR at the end
    # BUG 2.2: You can't create a column using ".Filter", it needs to be ["Filter"]
    # BUG 4: '|'.join(key_word) joins the characters of one string; join the list of key words instead.
    # Note that an empty list gives the pattern '', which matches every row.
    df["Filter"] = df["Names"].str.contains('|'.join(list_key_words), case=False, regex=True)  # Creates a new column with boolean values

    # BUG 3: the bare ".loc[...]" line did nothing because its result was never stored, so it is removed

    filtered[str(i)] = df[df["Filter"]]  # Create a data frame with only the filtered rows
    filtered_names = filtered[str(i)]['Names'].tolist()  # Select only the column called Names, existing in each data frame, for analysis
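
To check the filtering logic without typing at the prompt, here is a minimal non-interactive sketch built on the sample data from the question. The `keywords` dict and the `filter_by_keywords` helper are hypothetical names introduced only for illustration. Two things differ from the loop above and are assumptions, not part of the original answer: the key words are passed through `re.escape`, and `Filter` is cast to integers so that it prints as 1/0 like the expected output.

```python
import re
import pandas as pd

# Sample data copied from the question.
dataframe = {
    "0": pd.DataFrame({
        "Names": ["Michael", "John", "Andrew", "Laura"],
        "Surnames": ["Connelly", "Smith", "Star", "Parker"],
    }),
    "1": pd.DataFrame({
        "Names": ["Laura", "Lisa", "Luke", "Norman"],
        "Surnames": ["Bistro", "Roberts", "Gary", "Loren"],
    }),
}

# Hypothetical stand-in for the values typed at the input() prompt.
keywords = {"0": ["Michael", "Andrew"], "1": ["Laura"]}


def filter_by_keywords(df, words, column="Names"):
    """Return (flagged copy of df, rows whose `column` contains any of `words`)."""
    # re.escape keeps characters such as "." or "+" from being read as regex syntax.
    pattern = "|".join(re.escape(w) for w in words)
    flagged = df.copy()
    flagged["Filter"] = (
        flagged[column].str.contains(pattern, case=False, regex=True).astype(int)
    )
    return flagged, flagged[flagged["Filter"] == 1]


filtered = {}
for key, df in dataframe.items():
    # Replace each data frame with its flagged copy and keep the matching rows by key.
    dataframe[key], filtered[key] = filter_by_keywords(df, keywords[key])

print(filtered["0"])
print(filtered["1"])
```

Keeping the results in a dict keyed like `dataframe` means `filtered["0"]` and `filtered["1"]` stay available after the loop, which is the point of bug 5.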

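The fixes are annotated in the BUG comments in the code. The biggest problem is bug 2.1: you cannot apply a regular expression to all fields of a row at once. If you want to check all fields, you can create a new filter column for each field and recombine them at the end with boolean logic, e.g. df["Filter 1"] | df["Filter 2"] ...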
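
The per-column approach described above might look like the following sketch. The sample frame comes from the question; the pattern "Michael|Parker" is a hypothetical choice, picked so that one match comes from Names and the other from Surnames.

```python
import pandas as pd

df = pd.DataFrame({
    "Names": ["Michael", "John", "Andrew", "Laura"],
    "Surnames": ["Connelly", "Smith", "Star", "Parker"],
})

# Hypothetical key words, chosen so that each column contributes one match.
pattern = "Michael|Parker"

# One boolean filter column per field...
df["Filter 1"] = df["Names"].str.contains(pattern, case=False, regex=True)
df["Filter 2"] = df["Surnames"].str.contains(pattern, case=False, regex=True)

# ...recombined at the end with an element-wise boolean OR.
df["Filter"] = df["Filter 1"] | df["Filter 2"]

# Keep only the rows where at least one field matched.
filtered = df[df["Filter"]]
print(filtered)
```

With more than a couple of fields, the same idea can be written as a loop that ORs each column's result into a running boolean Series.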

