Inner join 2 tables on their index column and a non-index column

Problem description

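Question:
I have two pandas DataFrames of roughly 3 GB each. I need to join them on 1) postcode and 2) every combination of the house_identifier variables, until a row finds a match (in a for loop) or every variable-to-variable combination has failed for that row.

Once a row has been joined on both columns inside the loop, it is appended to a separate list and removed from the DataFrame. postcode is a non-unique index.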

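Table columns (non-hierarchical)
dataset_1 has the following variables:
postcode, house_identifier_1, house_identifier_2, house_identifier_3, id
dataset_2 has the following variables:
postcode, house_identifier_a, house_identifier_b, id_2

Column combinations for the join loop: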

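# Column names follow the table listing above
table_1_variables = ['house_identifier_1', 'house_identifier_2', 'house_identifier_3']
table_2_variables = ['house_identifier_a', 'house_identifier_b']
for i in table_1_variables:
    for j in table_2_variables:
        ...  # try to join on postcode plus the column pair (i, j)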

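To join the tables efficiently, one strategy seems to be to join on the index (postcode) first and then on the non-index column. However, this appears to create a very large intermediate join that exceeds the 8 GB of available memory, and the syntax for combining the relevant arguments is also unclear (left_index=True, right_index=True, left_on=, right_on=).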

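At the same time, setting or resetting the index and then sorting it inside the loop seems very inefficient. Is there a better way to join or merge these efficiently?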
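On the syntax point, here is a minimal sketch, assuming a reasonably recent pandas (0.23 or later, as far as I know), where the names passed to left_on/right_on may refer either to ordinary columns or to named index levels. Under that assumption, left_index/right_index are not needed, and both keys are matched in a single pass rather than joining on postcode first. The tiny frames reuse three rows of the example data from this question; everything else about the setup is an assumption.

```python
import pandas as pd

# Three rows from the question's example data, indexed by postcode
# (a non-unique index in the real data).
dataset_1 = pd.DataFrame(
    {
        "postcode": ["lu72la", "bh14rn", "ub24af"],
        "house_identifier_1": ["76", "flat7", "76"],
        "id": [
            "{582D0636-8DEF-8F22-E053-6C04A8C01BAC}",
            "{64342BFD-FD07-422C-E053-6C04A8C0FB8A}",
            "{29A8E769-8A10-4477-9494-FF55EF5FAE4B}",
        ],
    }
).set_index("postcode")

dataset_2 = pd.DataFrame(
    {
        "postcode": ["lu72la", "bh14rn", "ub24af"],
        "house_identifier_b": ["76", "flat7", "76"],
        "id_2": [10002277489, 10022903998, 12015624],
    }
).set_index("postcode")

# "postcode" names the index level on both sides; the house identifier
# names are ordinary columns. Both keys are matched together, so no
# postcode-only intermediate result is materialised.
joined = dataset_1.merge(
    dataset_2,
    how="inner",
    left_on=["postcode", "house_identifier_1"],
    right_on=["postcode", "house_identifier_b"],
)
print(joined[["id", "id_2"]])
```

Two of the rows share the house number '76' but sit in different postcodes, so matching on the pair of keys keeps them apart, whereas a join on the house identifier alone would cross-match them.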

Example of the intersection:

{'id': {27: '{582D0636-8DEF-8F22-E053-6C04A8C01BAC}',
  41: '{D9E869FE-7B55-4C36-AC43-695B9033A13B}',
  33: '{93E6821E-554E-40FD-E053-6B04A8C0C1DF}',
  1: '{288DCE29-0589-E510-E050-A8C06205480E}',
  48: '{3A23DDD5-A0E8-41D2-A514-5B09385C301F}',
  52: '{CEB16957-F7FA-4D1B-B45F-A390214735BC}',
  13: '{404A5AF3-9B20-CD2B-E050-A8C063055C7B}',
  16: '{64342BFD-FD07-422C-E053-6C04A8C0FB8A}',
  57: '{29A8E769-8A10-4477-9494-FF55EF5FAE4B}',
  10: '{404A5AF3-0B58-CD2B-E050-A8C063055C7B}',
  21: '{55BDCAE6-0C10-521D-E053-6B04A8C0DD7A}',
  31: '{5C676A02-1781-4152-950C-6E5CA2CBC487}',
  7: '{68FEB20B-142E-38DA-E053-6C04A8C051AE}',
  45: '{8F1B26BD-673F-53DB-E053-6C04A8C03649}',
  12: '{2F115F7A-8F81-4124-9FD4-FB76E742B2C1}',
  36: '{344AB2D7-4B59-4AB4-8F52-75B29BE8C509}',
  20: '{965B6D91-D4B6-95E4-E053-6C04A8C07729}',
  56: '{59872FD9-F39D-4BB9-95F6-91E002D948B1}',
  22: '{6141DFF0-973F-4FEC-A582-7F310B566031}'},
 'id_2': {27: 10002277489,
  41: 64023255,
  33: 10007367447,
  1: 22229221,
  48: 10033235735,
  52: 100062162615,
  13: 50103744,
  16: 10022903998,
  57: 12015624,
  10: 12154940,
  21: 10024247587,
  31: 100041193990,
  7: 10008230730,
  45: 10091640210,
  12: 202107394,
  36: 5062293,
  20: 48114659,
  56: 10001311242,
  22: 10000443154},
 'postcode': {27: 'lu72la',
  41: 'cf626nt',
  33: 'hr40aq',
  1: 'bn32pd',
  48: 'sg13ae',
  52: 'gu97jx',
  13: 'ct202ef',
  16: 'bh14rn',
  57: 'ub24af',
  10: 'w55bu',
  21: 'po302dp',
  31: 'tq148aq',
  7: 'e82ag',
  45: 'ch47ew',
  12: 'ha90ae',
  36: 'nw34tt',
  20: 'sw192rw',
  56: 'so143hw',
  22: 'se218hp'},
 'house_identifier_1': {27: '76',
  41: 'flat6',
  33: '49',
  1: 'flat10',
  48: '145',
  52: '31',
  13: 'flat19',
  16: 'flat7',
  57: '76',
  10: 'flat1',
  21: 'flat1',
  31: 'flat43',
  7: 'flata',
  45: '8',
  12: '42',
  36: 'flat9',
  20: 'flat43',
  56: 'flat156',
  22: 'flat2'},
 'house_identifier_2': {27: 'eastdock',
  41: 'courtlands',
  33: 'watkinscourt',
  1: 'ascothouse',
  48: 'monumentcourt',
  52: 'sumnercourt',
  13: '22-24',
  16: '77',
  57: 'osterleyviews',
  10: '55-59',
  21: '138',
  31: 'leandercourt',
  7: '130',
  45: 'greenbankhall',
  12: 'danescourt',
  36: 'holmefieldcourt',
  20: 'bennetscourtyard',
  56: 'oceanaboulevard',
  22: '124f'},
 'house_identifier_3': {27: 'eastdock',
  41: 'courtlands',
  33: 'watkinscourt',
  1: 'ascothouse',
  48: 'monumentcourt',
  52: 'sumnercourt',
  13: None,
  16: None,
  57: 'osterleyviews',
  10: None,
  21: None,
  31: 'leandercourt',
  7: None,
  45: 'greenbankhall',
  12: 'danescourt',
  36: 'holmefieldcourt',
  20: 'bennetscourtyard',
  56: 'oceanaboulevard',
  22: None},
 'house_identifier_a': {27: None,
  41: None,
  33: None,
  1: '18-20',
  48: None,
  52: None,
  13: '22-24',
  16: '77',
  57: None,
  10: '55-59',
  21: '138',
  31: None,
  7: '130',
  45: None,
  12: None,
  36: None,
  20: None,
  56: None,
  22: '124f'},
 'house_identifier_b': {27: '76',
  41: 'flat6',
  33: '49',
  1: 'flat10',
  48: '145',
  52: '31',
  13: 'flat19',
  16: 'flat7',
  57: '76',
  10: 'flat1',
  21: 'flat1',
  31: 'flat43',
  7: 'flata',
  45: '8',
  12: '42',
  36: 'flat9',
  20: 'flat43',
  56: 'flat156',
  22: 'flat2'}}

Tags: python, pandas, join, indexing, combinatorics

Solution
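No answer text survives on this page. What follows is only a hedged sketch of one possible approach, not a confirmed solution. It assumes that postcode is an ordinary column (call reset_index() first if it is the index), that id and id_2 uniquely identify rows, and that a matched row should be removed from both tables. The function name cascade_match and the matched_on column are hypothetical. Each pass merges on postcode and one identifier pair at the same time, using only the columns that pass needs, which should keep intermediate results far smaller than a postcode-only join.

```python
import itertools
import pandas as pd


def cascade_match(left, right, left_cols, right_cols,
                  key="postcode", left_id="id", right_id="id_2"):
    """Try each (left column, right column) pair in turn; rows matched
    in one pass are removed before the next pass (hypothetical helper)."""
    matches = []
    for lcol, rcol in itertools.product(left_cols, right_cols):
        if left.empty or right.empty:
            break
        # Narrow frames keep the merge small; missing identifiers are
        # dropped because pandas would otherwise match NaN with NaN.
        l = left[[key, lcol, left_id]].dropna(subset=[lcol])
        r = right[[key, rcol, right_id]].dropna(subset=[rcol])
        hit = l.merge(r, how="inner", left_on=[key, lcol], right_on=[key, rcol])
        if hit.empty:
            continue
        hit = hit[[left_id, right_id]].assign(matched_on=f"{lcol}={rcol}")
        matches.append(hit)
        # Remove matched rows so later passes only see unmatched ones.
        left = left[~left[left_id].isin(hit[left_id])]
        right = right[~right[right_id].isin(hit[right_id])]
    if matches:
        matched = pd.concat(matches, ignore_index=True)
    else:
        matched = pd.DataFrame(columns=[left_id, right_id, "matched_on"])
    return matched, left, right


# Demo rows come from the question's example data, except that
# house_identifier_b of the second dataset_2 row is blanked on purpose
# to exercise a later pass; the third dataset_2 row has no partner here.
dataset_1 = pd.DataFrame({
    "postcode": ["lu72la", "bh14rn"],
    "house_identifier_1": ["76", "flat7"],
    "house_identifier_2": ["eastdock", "77"],
    "house_identifier_3": ["eastdock", None],
    "id": ["{582D0636-8DEF-8F22-E053-6C04A8C01BAC}",
           "{64342BFD-FD07-422C-E053-6C04A8C0FB8A}"],
})
dataset_2 = pd.DataFrame({
    "postcode": ["lu72la", "bh14rn", "ub24af"],
    "house_identifier_a": [None, "77", None],
    "house_identifier_b": ["76", None, "76"],
    "id_2": [10002277489, 10022903998, 12015624],
})

matched, rest_1, rest_2 = cascade_match(
    dataset_1, dataset_2,
    ["house_identifier_1", "house_identifier_2", "house_identifier_3"],
    ["house_identifier_a", "house_identifier_b"],
)
print(matched)
```

If one row can match several rows on the other side within a single pass, every pair is recorded; deduplicating on id or id_2 afterwards would be a separate design decision. For frames of around 3 GB, converting postcode and the identifier columns to the category dtype before matching may reduce memory further, though that has not been measured here.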

