首页 > 解决方案 > 如何在 Python 的 pandas 数据框中的新列中将所有名称与常见的 3 字母集映射?

问题描述

我有一个熊猫数据框df,如下所示:

Unnamed: 0  Characters Length   Characters Split    A   B   C   D   Names with common 3-letters
0   FROKDUWJU   9   [FRO, KDU, WJU] FRO KDU WJU NaN 
1   IDJWPZSUR   9   [IDJ, WPZ, SUR] IDJ WPZ SUR NaN 
2   UCFURKIRODCQ    12  [UCF, URK, IRO, DCQ]    UCF URK IRO DCQ 
3   ORI 3   [ORI]   ORI NaN NaN NaN 
4   PROIRKIQARTIBPO 15  [PRO, IRK, IQA, RTI, BPO]   PRO IRK IQA RTI 
5   QAZWREDCQIBR    12  [QAZ, WRE, DCQ, IBR]    QAZ WRE DCQ IBR 
6   PLPRUFSWURKI    12  [PLP, RUF, SWU, RKI]    PLP RUF SWU RKI 
7   FROIEUSKIKIR    12  [FRO, IEU, SKI, KIR]    FRO IEU SKI KIR 
8   ORIUWJZSRFRO    12  [ORI, UWJ, ZSR, FRO]    ORI UWJ ZSR FRO 
9   URKIFJVUR   9   [URK, IFJ, VUR] URK IFJ VUR NaN 
10  RUFOFR  6   [RUF, OFR]  RUF OFR NaN NaN 
11  IEU 3   [IEU]   IEU NaN NaN NaN 
12  PIMIEU  6   [PIM, IEU]  PIM IEU NaN NaN 

在最后一列,具有常见 3 个字母的名称中,我想列出第一列中在其名称中设置了常见 3 个字母的所有名称。例如,在第一行,我想列出所有名称中包含 FRO、KRU 和 WJU 的名称。这些名称的 3 个字母拆分也可以在“字符拆分”或 A、B、C 和 D 列中找到以供参考。

为了逐步进行讨论,我需要扫描给定行中名称中存在的 3 字母集是否也存在于其余行中的任何名称中。如果它存在,我需要将另一行的相应名称作为列表添加到“具有常见 3 个字母的名称”列中。例如,在随附的屏幕截图中,在 C 列中,黄色突出显示的单元格的名称具有共同的 3 字母集,并且名称在同一行中。在此处输入图像描述

实现此目的的适当方法是什么?我应该使用函数还是循环语句?

注意:df.to_dict()看起来如下:

{'Unnamed: 0': {0: 'FROKDUWJU',
  1: 'IDJWPZSUR',
  2: 'UCFURKIRODCQ',
  3: 'ORI',
  4: 'PROIRKIQARTIBPO',
  5: 'QAZWREDCQIBR',
  6: 'PLPRUFSWURKI',
  7: 'FROIEUSKIKIR',
  8: 'ORIUWJZSRFRO',
  9: 'URKIFJVUR',
  10: 'RUFOFR',
  11: 'IEU',
  12: 'PIMIEU'},
 'Characters Length': {0: 9,
  1: 9,
  2: 12,
  3: 3,
  4: 15,
  5: 12,
  6: 12,
  7: 12,
  8: 12,
  9: 9,
  10: 6,
  11: 3,
  12: 6},
 'Characters Split': {0: ['FRO', 'KDU', 'WJU'],
  1: ['IDJ', 'WPZ', 'SUR'],
  2: ['UCF', 'URK', 'IRO', 'DCQ'],
  3: ['ORI'],
  4: ['PRO', 'IRK', 'IQA', 'RTI', 'BPO'],
  5: ['QAZ', 'WRE', 'DCQ', 'IBR'],
  6: ['PLP', 'RUF', 'SWU', 'RKI'],
  7: ['FRO', 'IEU', 'SKI', 'KIR'],
  8: ['ORI', 'UWJ', 'ZSR', 'FRO'],
  9: ['URK', 'IFJ', 'VUR'],
  10: ['RUF', 'OFR'],
  11: ['IEU'],
  12: ['PIM', 'IEU']},
 'A': {0: 'FRO',
  1: 'IDJ',
  2: 'UCF',
  3: 'ORI',
  4: 'PRO',
  5: 'QAZ',
  6: 'PLP',
  7: 'FRO',
  8: 'ORI',
  9: 'URK',
  10: 'RUF',
  11: 'IEU',
  12: 'PIM'},
 'B': {0: 'KDU',
  1: 'WPZ',
  2: 'URK',
  3: nan,
  4: 'IRK',
  5: 'WRE',
  6: 'RUF',
  7: 'IEU',
  8: 'UWJ',
  9: 'IFJ',
  10: 'OFR',
  11: nan,
  12: 'IEU'},
 'C': {0: 'WJU',
  1: 'SUR',
  2: 'IRO',
  3: nan,
  4: 'IQA',
  5: 'DCQ',
  6: 'SWU',
  7: 'SKI',
  8: 'ZSR',
  9: 'VUR',
  10: nan,
  11: nan,
  12: nan},
 'D': {0: nan,
  1: nan,
  2: 'DCQ',
  3: nan,
  4: 'RTI',
  5: 'IBR',
  6: 'RKI',
  7: 'KIR',
  8: 'FRO',
  9: nan,
  10: nan,
  11: nan,
  12: nan},
 'Names with common 3-letters': {0: '',
  1: '',
  2: '',
  3: '',
  4: '',
  5: '',
  6: '',
  7: '',
  8: '',
  9: '',
  10: '',
  11: '',
  12: ''}}

标签: pythonpython-3.xpandasdataframedictionary

解决方案


可能有一种更快的方法来搜索和创建列表,但这有效:

# create a different temporary, column (you can't search the Characters Split column directly as the 3 letter combinations aren't honored
df['patrn'] = df.apply( lambda x: '|'.join(x['Characters Split']), axis=1)
    
def find_matches(x):
    # print(x.name) # index number
    new_df = df[~df.index.isin([x.name])] # all rows except current index
    return set(new_df.loc[df['patrn'].str.contains(x['patrn'], case=False)]['Unnamed: 0'].tolist())
    
df['Names with common 3-letters'] = df.apply(lambda x: find_matches(x), axis=1)
df

输出

         Unnamed: 0  Characters Length           Characters Split    A    B    C    D             Names with common 3-letters                patrn
0         FROKDUWJU                  9            [FRO, KDU, WJU]  FRO  KDU  WJU  NaN            {FROIEUSKIKIR, ORIUWJZSRFRO}          FRO|KDU|WJU
1         IDJWPZSUR                  9            [IDJ, WPZ, SUR]  IDJ  WPZ  SUR  NaN                                      {}          IDJ|WPZ|SUR
2      UCFURKIRODCQ                 12       [UCF, URK, IRO, DCQ]  UCF  URK  IRO  DCQ               {URKIFJVUR, QAZWREDCQIBR}      UCF|URK|IRO|DCQ
3               ORI                  3                      [ORI]  ORI  NaN  NaN  NaN                          {ORIUWJZSRFRO}                  ORI
4   PROIRKIQARTIBPO                 15  [PRO, IRK, IQA, RTI, BPO]  PRO  IRK  IQA  RTI                                      {}  PRO|IRK|IQA|RTI|BPO
5      QAZWREDCQIBR                 12       [QAZ, WRE, DCQ, IBR]  QAZ  WRE  DCQ  IBR                          {UCFURKIRODCQ}      QAZ|WRE|DCQ|IBR
6      PLPRUFSWURKI                 12       [PLP, RUF, SWU, RKI]  PLP  RUF  SWU  RKI                                {RUFOFR}      PLP|RUF|SWU|RKI
7      FROIEUSKIKIR                 12       [FRO, IEU, SKI, KIR]  FRO  IEU  SKI  KIR  {PIMIEU, FROKDUWJU, ORIUWJZSRFRO, IEU}      FRO|IEU|SKI|KIR
8      ORIUWJZSRFRO                 12       [ORI, UWJ, ZSR, FRO]  ORI  UWJ  ZSR  FRO          {FROKDUWJU, FROIEUSKIKIR, ORI}      ORI|UWJ|ZSR|FRO
9         URKIFJVUR                  9            [URK, IFJ, VUR]  URK  IFJ  VUR  NaN                          {UCFURKIRODCQ}          URK|IFJ|VUR
10           RUFOFR                  6                 [RUF, OFR]  RUF  OFR  NaN  NaN                          {PLPRUFSWURKI}              RUF|OFR
11              IEU                  3                      [IEU]  IEU  NaN  NaN  NaN                  {PIMIEU, FROIEUSKIKIR}                  IEU
12           PIMIEU                  6                 [PIM, IEU]  PIM  IEU  NaN  NaN                     {FROIEUSKIKIR, IEU}              PIM|IEU

推荐阅读