How to iterate until all items are in a given column?

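Problem description

I am trying to apply a while statement to my code so that it keeps running until all the elements of the lists below (in the Check column) are in the Source column.

My code so far is: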

while set_condition:  # to set the condition
    newCol = pd.Series(list(set(df['Check']) - set(df['Source'])))  # elements of Check that are not yet in the Source column
    newList1 = newCol.apply(lambda x: my_function(x))  # generates the Check lists for the new elements; this is why I need a while statement
    df = pd.concat([df, pd.DataFrame({'Source': newCol, 'Check': newList1})], ignore_index=True)  # append the results as new rows
    df = df.explode('Check')

Here is an example of the process and of how my_function works. Suppose my initial dataset is:

Source       Check
mouse   [dog, horse, cat]   
horse   [mouse, elephant]   
tiger   []  
elephant [horse, bird]

After exploding the Check column and appending the results to Source, I will have:

Source       Check
mouse   [dog, horse, cat]   
horse   [mouse, elephant]   
tiger   []  
elephant [horse, bird]
dog     [] # this will be filled in after applying the function
cat     [] # this will be filled in after applying the function
bird    [] # this will be filled in after applying the function
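
The explode-and-append step can be sketched as follows. This is only an illustration on the sample data from the question; the names `exploded`, `missing`, `placeholders`, and `extended` are made up for the sketch, and the new rows get empty lists as placeholders until my_function is applied.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Source": ["mouse", "horse", "tiger", "elephant"],
    "Check": [["dog", "horse", "cat"], ["mouse", "elephant"], [], ["horse", "bird"]],
})

# explode() yields one row per list element; an empty list becomes NaN, so drop it
exploded = df.explode("Check")
missing = sorted(set(exploded["Check"].dropna()) - set(df["Source"]))

# Placeholder rows (illustrative): empty lists to be filled in by my_function later
placeholders = pd.DataFrame({"Source": missing, "Check": [[] for _ in missing]})
extended = pd.concat([df, placeholders], ignore_index=True)
```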

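Every element of the lists should be added to the Source column before the function is applied. When I apply the function, it fills in the lists for those new elements; so, for example, I could have: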

Source       Check
mouse   [dog, horse, cat]   
horse   [mouse, elephant]   
tiger   []  
elephant [horse, bird]
dog     [mouse, fish]  # they are filled in
cat     [mouse]
bird    [elephant, penguin]
fish    [dog]

Since fish and penguin are not in Source, I will need to run the code again to get the expected output (every element of the lists is already in the Source column):

Source       Check
mouse   [dog, horse, cat]   
horse   [mouse, elephant]   
tiger   []  
elephant [horse, bird]
dog     [mouse, fish] 
cat     [mouse]
bird    [elephant, penguin]
fish    [dog]
penguin [bird]

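Because both dog and bird are already in Source, I don't need to apply the function again: every list now contains only elements that are already in the Source column. The code can stop running.

What I would like to do is stop the loop once all the elements of the lists are in the Source column and the function has been applied to fill in all the lists.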
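
The stopping condition described here can be written as a small check. This is a sketch, not part of the original code: the helper name `all_checked_in_source` and the two tiny frames are invented for illustration, and it assumes the Check column holds lists.

```python
import pandas as pd

def all_checked_in_source(df):
    """Return True when every element of every Check list appears in Source."""
    # Series.explode() flattens the lists; empty lists become NaN and are dropped
    checked = set(df["Check"].explode().dropna())
    return checked <= set(df["Source"])

# Illustrative frames: "mouse" is missing from Source in the first one
incomplete = pd.DataFrame({"Source": ["dog", "fish"],
                           "Check": [["mouse", "fish"], ["dog"]]})
complete = pd.DataFrame({"Source": ["dog", "fish", "tiger"],
                         "Check": [["fish"], ["dog"], []]})
```

A loop could then run `while not all_checked_in_source(df): ...`, although the set difference is needed inside the loop body anyway.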

Thanks for any help you can provide.

Tags: python, pandas, while-loop

Solution


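Repeating the loop until there are no more rows to add to the DataFrame is the same as saying that all elements of df['Check'] are in df['Source']. You have to compute that difference on every iteration anyway, so why not use it to break out of the loop?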

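import pandas as pd

# df and my_function come from the question; df['Check'] is assumed to be exploded already
while True:  # loop forever!
    diff = set(df['Check'].dropna()) - set(df['Source'])  # dropna(): exploded empty lists are NaN
    if len(diff) == 0:
        break  # done!
    newCol = pd.Series(list(diff))
    newList1 = newCol.apply(lambda x: my_function(x))
    # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
    df = pd.concat([df, pd.DataFrame({'Source': newCol, 'Check': newList1})], ignore_index=True)
    df = df.explode('Check')  # NOTE: I will use this to my advantage in the next suggested solution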
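
For a runnable end-to-end version of this loop, here is a sketch on the question's sample data. The real my_function is not shown in the question, so the `LOOKUP` table and the my_function below are hypothetical stand-ins that simply return the lists from the example. The sketch uses pd.concat because DataFrame.append was removed in pandas 2.0.

```python
import pandas as pd

# Hypothetical stand-in for my_function: returns the lists from the question's example
LOOKUP = {
    "dog": ["mouse", "fish"],
    "cat": ["mouse"],
    "bird": ["elephant", "penguin"],
    "fish": ["dog"],
    "penguin": ["bird"],
}

def my_function(name):
    return LOOKUP.get(name, [])

df = pd.DataFrame({
    "Source": ["mouse", "horse", "tiger", "elephant"],
    "Check": [["dog", "horse", "cat"], ["mouse", "elephant"], [], ["horse", "bird"]],
}).explode("Check").reset_index(drop=True)

while True:
    # Exploded empty lists show up as NaN, so drop them before comparing
    diff = set(df["Check"].dropna()) - set(df["Source"])
    if not diff:
        break  # every Check element is already a Source
    new_sources = sorted(diff)  # sorted only to make the row order reproducible
    new_rows = pd.DataFrame({
        "Source": new_sources,
        "Check": [my_function(x) for x in new_sources],
    })
    df = pd.concat([df, new_rows], ignore_index=True)
    df = df.explode("Check").reset_index(drop=True)
```

With this stand-in the loop stops after two rounds of additions: first dog, cat, and bird, then fish and penguin.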

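Because repeatedly appending to a DataFrame is memory-hungry, you may want to build the columns first and then construct the DataFrame once, outside the loop. df['Check'] ends up exploded anyway, so start by exploding and build on those lists: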

df = df.explode('Check')
check = df['Check'].tolist()       # Plain list; extend it as we iterate
source = df['Source'].tolist()     # Plain list; extend it as we iterate
unique_source = set(source)
diff = set(df['Check'].dropna()) - unique_source  # Iterate until this is empty (NaN = exploded empty list)
while len(diff) > 0:
    new_sources = list(diff)       # Fix the order so sources and their lists stay aligned
    new_check = [my_function(x) for x in new_sources]  # a list of lists
    check.extend(new_check)        # Add the lists as-is, one per new source; explode later
    source.extend(new_sources)     # Keep track of the new sources for the DataFrame...
    unique_source.update(new_sources)  # and keep track of the unique sources for efficiency
    flat_check = set(x for sublist in new_check for x in sublist)
    diff = flat_check - unique_source  # We only have to check the new elements!

df = pd.DataFrame({"Check": check, "Source": source}).explode("Check")  # build the entire DataFrame at once
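
Stripped of pandas, the loop above is a worklist iteration: keep fetching lists for names that have been seen but not yet fetched, until no such names remain. The sketch below shows that core with plain dicts and sets. The function name `expand_sources`, the `fetch` callback (playing the role of my_function), and the `lookup` table are all hypothetical.

```python
def expand_sources(initial, fetch):
    """Grow a name -> list mapping until every listed name is also a key.

    initial: dict mapping each known source to its list of checked names
    fetch:   callable returning the list for a new name (stand-in for my_function)
    """
    result = {name: list(items) for name, items in initial.items()}
    pending = {x for items in result.values() for x in items} - set(result)
    while pending:
        # Fetch each pending name exactly once; sorted() keeps the order reproducible
        fetched = {name: list(fetch(name)) for name in sorted(pending)}
        result.update(fetched)
        # Only the freshly fetched lists can introduce unseen names
        pending = {x for items in fetched.values() for x in items} - set(result)
    return result

# Sample data and lookup table taken from the question's example
initial = {
    "mouse": ["dog", "horse", "cat"],
    "horse": ["mouse", "elephant"],
    "tiger": [],
    "elephant": ["horse", "bird"],
}
lookup = {
    "dog": ["mouse", "fish"],
    "cat": ["mouse"],
    "bird": ["elephant", "penguin"],
    "fish": ["dog"],
    "penguin": ["bird"],
}
closed = expand_sources(initial, lambda name: lookup.get(name, []))
```

The resulting dict can be turned into a DataFrame in one step at the end, which is the point of building the columns outside pandas.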

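There are many ways to use this structure to get the DataFrame layout you want. For example, if you don't want the exploded form, keep the original df (with its unexploded df['Check']) from the start of this example and append the new data to it: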

new_df = df.explode('Check')
unique_source = set(new_df['Source'])
diff = set(new_df['Check'].dropna()) - unique_source  # dropna(): exploded empty lists are NaN
source = []  # append to empty lists
check = []   # append to empty lists
while len(diff) > 0:
    # ...

# pd.append does not exist; pd.concat joins the original rows with the new ones
df = pd.concat([df, pd.DataFrame({"Check": check, "Source": source})], ignore_index=True)  # keep the unexploded columns
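
A complete version of this unexploded variant might look like the sketch below. The loop body is an assumption, modeled on the previous example since the original leaves it elided, and my_function is again a hypothetical stand-in backed by the `LOOKUP` table from the question's example.

```python
import pandas as pd

# Hypothetical stand-in for my_function, using the lists from the question's example
LOOKUP = {
    "dog": ["mouse", "fish"],
    "cat": ["mouse"],
    "bird": ["elephant", "penguin"],
    "fish": ["dog"],
    "penguin": ["bird"],
}

def my_function(name):
    return LOOKUP.get(name, [])

df = pd.DataFrame({
    "Source": ["mouse", "horse", "tiger", "elephant"],
    "Check": [["dog", "horse", "cat"], ["mouse", "elephant"], [], ["horse", "bird"]],
})

unique_source = set(df["Source"])
diff = set(df["Check"].explode().dropna()) - unique_source
source, check = [], []  # only the NEW rows are collected here
while diff:
    batch = sorted(diff)  # fixed order keeps sources and lists aligned
    results = [my_function(x) for x in batch]
    source.extend(batch)
    check.extend(results)
    unique_source.update(batch)
    diff = {x for sub in results for x in sub} - unique_source

# One concat at the end; the Check column keeps its list values
df = pd.concat([df, pd.DataFrame({"Source": source, "Check": check})], ignore_index=True)
```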
