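python - Creating filtered datasets from multiple data frames
Problem description
I want to create filtered datasets from several data frames (the data frames differ from one another because they cover different topics). For each data frame, I need to filter rows by some keywords. For example, for the first data frame I only need the rows that contain certain words (e.g. Michael and Andrew); for the second data frame I only need the rows that contain the word Laura, and so on.
Example of the original data frames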
df["0"]
Names Surnames
Michael Connelly
John Smith
Andrew Star
Laura Parker
df["1"]
Names Surnames
Laura Bistro
Lisa Roberts
Luke Gary
Norman Loren
To do this, I wrote the following:
for i in range(0,1): # I have more than 50 data frames, but I am considering only two for this example
    key_words = []
    while True:
        key_word = input("Key word : ")
        if key_word!='0':
            list_key_words.append(key_word)
            dataframe[str(i)].Filter= dataframe[str(i)]..str.contains('|'.join(key_word), case=False, regex=True) # Creates a new column where with boolean values
            dataframe[str(i)].loc[dataframe[str(i)].Filter != False]
            filtered=dataframe[str(i)][dataframe[str(i)]. Filter != False] # Create a dataframe/dataset with only filtered rows
            filtered_surnames=filtered['Names'].tolist() # this should select only the column called Names, existing in each dataframe, just for analysing them
Expected output:
df["0"]
Names Surnames Filter
Michael Connelly 1
John Smith 0
Andrew Star 1
Laura Parker 0
df["1"]
Names Surnames Filter
Laura Bistro 1
Lisa Roberts 0
Luke Gary 0
Norman Loren 0
The filtered datasets should then have 2 rows and 1 row, respectively.
filtered["0"]
Names Surnames Filter
Michael Connelly 1
Andrew Star 1
filtered["1"]
Names Surnames Filter
Laura Bistro 1
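However, the lines in my code that do the filtering seem to be wrong. Could you take a look at them and let me know where the errors are?
Solution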
# BUG 4: the filtered results need to be stored per data frame, or they are overwritten on every iteration
filtered = {}
# BUG 1: range() includes the first index and excludes the last, so reaching index 1 requires range(0, 2)
for i in range(0, 2): # I have more than 50 data frames, but I am considering only two for this example
    list_key_words = [] # reset the keyword list for each data frame
    while True:
        key_word = input("Key word : ")
        # BUG 5: the "while True" loop has to end at some point; entering '0' leaves it
        if key_word == '0':
            break
        list_key_words.append(key_word)
    # BUG 2.1: you can't apply ".str.contains" to an entire row, you need to indicate the column by name, e.g. "Names".
    # If you want to test all the columns, you need multiple filter columns which you OR at the end
    # BUG 2.2: you can't create a column using ".Filter", it needs to be ["Filter"]
    # Note: join the whole keyword list, not the last keyword (joining a single string splits it into characters)
    dataframe[str(i)]["Filter"] = dataframe[str(i)]["Names"].str.contains('|'.join(list_key_words), case=False, regex=True) # Creates a new column with boolean values
    # BUG 3: the bare "dataframe[str(i)].loc[...]" line did nothing because its result was discarded, so it is dropped
    filtered[str(i)] = dataframe[str(i)][dataframe[str(i)]["Filter"]] # Keep only the rows where Filter is True
    filtered_names = filtered[str(i)]['Names'].tolist() # Only the column called Names, which exists in each data frame
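Comments on the fixes are in the code. The biggest problem is bug 2.1: you cannot apply a regular expression to all fields of a row at once. If you want to check every field, you can create a new filter column for each field and recombine them at the end with boolean logic, e.g. df["Filter 1"] | df["Filter 2"] ...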
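To make the per-field idea concrete, here is a minimal, non-interactive sketch. It rebuilds the two sample data frames from the question and ORs one boolean mask per column into a single "Filter" column. The helper name add_filter, the hard-coded key_words dictionary (standing in for the input() loop), and the use of re.escape are assumptions made for illustration, not part of the original answer.

```python
import re

import pandas as pd

# Sample data copied from the question.
dataframe = {
    "0": pd.DataFrame({
        "Names": ["Michael", "John", "Andrew", "Laura"],
        "Surnames": ["Connelly", "Smith", "Star", "Parker"],
    }),
    "1": pd.DataFrame({
        "Names": ["Laura", "Lisa", "Luke", "Norman"],
        "Surnames": ["Bistro", "Roberts", "Gary", "Loren"],
    }),
}

# Hypothetical keyword lists, hard-coded here instead of read with input().
key_words = {"0": ["Michael", "Andrew"], "1": ["Laura"]}


def add_filter(df, words, columns):
    """Return a copy of df with a boolean "Filter" column.

    A row is flagged when any of the given columns contains any of the words
    (case-insensitive). add_filter is an illustrative helper, not a pandas API.
    """
    out = df.copy()
    mask = pd.Series(False, index=out.index)
    if words:  # an empty pattern would match every row, so skip it
        # Escape the keywords so characters like "." are matched literally.
        pattern = "|".join(re.escape(w) for w in words)
        for col in columns:
            # One mask per column, combined with boolean OR.
            mask = mask | out[col].str.contains(
                pattern, case=False, regex=True, na=False
            )
    out["Filter"] = mask
    return out


filtered = {}
for key, df in dataframe.items():
    dataframe[key] = add_filter(df, key_words[key], ["Names", "Surnames"])
    # Boolean indexing keeps only the flagged rows.
    filtered[key] = dataframe[key][dataframe[key]["Filter"]]

print(filtered["0"])
print(filtered["1"])
```

Storing the results in a dictionary keyed like the input keeps one filtered data frame per source, which matters once there are 50 or more of them. The "Filter" column holds True/False rather than 1/0; if the integer form from the expected output is wanted, .astype(int) on that column converts it.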