首页 > 解决方案 > 根据另一个数据框查找重新排序对象内的字符串

问题描述

这是我的第一个数据框(df1)。

在此处输入图像描述

我必须根据第二个数据帧(df2)的最小计数对序列列(我的 df1)中的元素重新排序

在此处输入图像描述

所以我的最终结果应该是这样的

在此处输入图像描述

标签: python

解决方案


我认为下面的代码会做你想要的。我已经评论了内联并放置了进一步阅读的链接......

我知道它可以被压缩成更短的代码,但我希望步骤清晰。

import pandas as pd
from pprint import pprint


data1 = {'id': ['A1234', 'A2345'],
         'Sequence': ['16 31 17', '51 59 43']}
df1 = pd.DataFrame(data1)
 

# I assumed the label en count columns are integers
data2 = {'label': [10, 11, 12, 13, 16, 17, 21, 24, 31, 43, 44, 51, 59, 60],
         'count': [214, 128, 135, 37, 184, 68, 267, 264, 231, 13, 82, 100, 99, 92]}
df2 = pd.DataFrame(data2)


def seq_value_sort(seq_df, label_df):
    new_sequence_list = []
    for value in seq_df['Sequence'].values:
        print(f'{"":-<40}') # prints a line
        # convert string to list of integers
        # https://www.geeksforgeeks.org/python-converting-all-strings-in-list-to-integers/
        sequence = [int(i) for i in value.split()]
        
        # generate an unsorted list of dict items based on Sequence
        data = []
        for index, row in label_df.T.iteritems():
            if int(row['label']) in sequence:
                data.append({'label': int(row['label']),
                             'count': int(row['count'])})
        pprint(data)

        # now sort the unsorted list based on key 'count'
        # https://stackoverflow.com/a/73050/9267296
        data = sorted(data, key=lambda k: k['count'])
        pprint(data)

        # list comprehension to make list of strings out
        # of the list of dict
        # https://stackoverflow.com/a/7271523/9267296
        sequence_sorted = [ str(item['label']) for item in data ]
        pprint(sequence_sorted)

        # create the final sequence string from the list
        new_sequence_list.append(' '.join(sequence_sorted))
    
    # create return data
    return_data = {'id': list(seq_df['id'].values),
                   'Sequence': new_sequence_list}
    pprint(return_data)
    
    # finally return a new df
    return pd.DataFrame(return_data)


df3 = seq_value_sort(df1, df2)
print(f'{"":-<40}')
print(df3)

编辑:

忘记输出:

----------------------------------------
[{'count': 184, 'label': 16},
 {'count': 68, 'label': 17},
 {'count': 231, 'label': 31}]
[{'count': 68, 'label': 17},
 {'count': 184, 'label': 16},
 {'count': 231, 'label': 31}]
['17', '16', '31']
----------------------------------------
[{'count': 13, 'label': 43},
 {'count': 100, 'label': 51},
 {'count': 99, 'label': 59}]
[{'count': 13, 'label': 43},
 {'count': 99, 'label': 59},
 {'count': 100, 'label': 51}]
['43', '59', '51']
{'Sequence': ['17 16 31', '43 59 51'], 'id': ['A1234', 'A2345']}
----------------------------------------
      id  Sequence
0  A1234  17 16 31
1  A2345  43 59 51

推荐阅读