python - 在它们的索引列和它们的非索引列上内部连接 2 个表
问题描述
问题:
我有两个大约 3GB 的 pandas 数据帧,要加入 1)邮政编码和 2)house_identifier 变量的每个组合,直到一行找到一个连接(在 for 循环中)或所有变量到变量的组合都失败了为行。
在两列加入循环中的一行后,该行将附加到一个单独的列表并从数据帧中删除。邮政编码是一个非唯一索引。
表列(非分层)
dataset_1 具有以下变量:
postcode、house_identifier_1、house_identifier_2、house_identifier_3、id
dataset_2 具有以下变量:
postcode、house_identifier_a、house_identifier_b、id_2
加入循环的列组合:
table_1_variables = ['number_x', 'number_y', 'number_z']
table_2_variables = ['number_a', 'number_b']
for i in table_1_variables:
for j in table_2_variables:
为了有效地连接表,一种策略似乎是首先连接索引(邮政编码),然后连接非索引列。但是,这似乎会创建一个非常大的中间连接,这会将 8GB 内存超出限制,并且组合之间的语法也不清楚 (left_index=True, right_index=True, left_on=, right_on=)
同时,在循环内建立索引/重新索引然后排序索引似乎非常低效。有没有更好的方法来有效地加入或合并这些?
相交示例:
{'id': {27: '{582D0636-8DEF-8F22-E053-6C04A8C01BAC}',
41: '{D9E869FE-7B55-4C36-AC43-695B9033A13B}',
33: '{93E6821E-554E-40FD-E053-6B04A8C0C1DF}',
1: '{288DCE29-0589-E510-E050-A8C06205480E}',
48: '{3A23DDD5-A0E8-41D2-A514-5B09385C301F}',
52: '{CEB16957-F7FA-4D1B-B45F-A390214735BC}',
13: '{404A5AF3-9B20-CD2B-E050-A8C063055C7B}',
16: '{64342BFD-FD07-422C-E053-6C04A8C0FB8A}',
57: '{29A8E769-8A10-4477-9494-FF55EF5FAE4B}',
10: '{404A5AF3-0B58-CD2B-E050-A8C063055C7B}',
21: '{55BDCAE6-0C10-521D-E053-6B04A8C0DD7A}',
31: '{5C676A02-1781-4152-950C-6E5CA2CBC487}',
7: '{68FEB20B-142E-38DA-E053-6C04A8C051AE}',
45: '{8F1B26BD-673F-53DB-E053-6C04A8C03649}',
12: '{2F115F7A-8F81-4124-9FD4-FB76E742B2C1}',
36: '{344AB2D7-4B59-4AB4-8F52-75B29BE8C509}',
20: '{965B6D91-D4B6-95E4-E053-6C04A8C07729}',
56: '{59872FD9-F39D-4BB9-95F6-91E002D948B1}',
22: '{6141DFF0-973F-4FEC-A582-7F310B566031}'},
'id_2': {27: 10002277489,
41: 64023255,
33: 10007367447,
1: 22229221,
48: 10033235735,
52: 100062162615,
13: 50103744,
16: 10022903998,
57: 12015624,
10: 12154940,
21: 10024247587,
31: 100041193990,
7: 10008230730,
45: 10091640210,
12: 202107394,
36: 5062293,
20: 48114659,
56: 10001311242,
22: 10000443154},
'postcode': {27: 'lu72la',
41: 'cf626nt',
33: 'hr40aq',
1: 'bn32pd',
48: 'sg13ae',
52: 'gu97jx',
13: 'ct202ef',
16: 'bh14rn',
57: 'ub24af',
10: 'w55bu',
21: 'po302dp',
31: 'tq148aq',
7: 'e82ag',
45: 'ch47ew',
12: 'ha90ae',
36: 'nw34tt',
20: 'sw192rw',
56: 'so143hw',
22: 'se218hp'},
'house_identifier_1': {27: '76',
41: 'flat6',
33: '49',
1: 'flat10',
48: '145',
52: '31',
13: 'flat19',
16: 'flat7',
57: '76',
10: 'flat1',
21: 'flat1',
31: 'flat43',
7: 'flata',
45: '8',
12: '42',
36: 'flat9',
20: 'flat43',
56: 'flat156',
22: 'flat2'},
'house_identifier_2': {27: 'eastdock',
41: 'courtlands',
33: 'watkinscourt',
1: 'ascothouse',
48: 'monumentcourt',
52: 'sumnercourt',
13: '22-24',
16: '77',
57: 'osterleyviews',
10: '55-59',
21: '138',
31: 'leandercourt',
7: '130',
45: 'greenbankhall',
12: 'danescourt',
36: 'holmefieldcourt',
20: 'bennetscourtyard',
56: 'oceanaboulevard',
22: '124f'},
'house_identifier_3': {27: 'eastdock',
41: 'courtlands',
33: 'watkinscourt',
1: 'ascothouse',
48: 'monumentcourt',
52: 'sumnercourt',
13: None,
16: None,
57: 'osterleyviews',
10: None,
21: None,
31: 'leandercourt',
7: None,
45: 'greenbankhall',
12: 'danescourt',
36: 'holmefieldcourt',
20: 'bennetscourtyard',
56: 'oceanaboulevard',
22: None},
'house_identifier_a': {27: None,
41: None,
33: None,
1: '18-20',
48: None,
52: None,
13: '22-24',
16: '77',
57: None,
10: '55-59',
21: '138',
31: None,
7: '130',
45: None,
12: None,
36: None,
20: None,
56: None,
22: '124f'},
'house_identifier_b': {27: '76',
41: 'flat6',
33: '49',
1: 'flat10',
48: '145',
52: '31',
13: 'flat19',
16: 'flat7',
57: '76',
10: 'flat1',
21: 'flat1',
31: 'flat43',
7: 'flata',
45: '8',
12: '42',
36: 'flat9',
20: 'flat43',
56: 'flat156',
22: 'flat2'}}
解决方案
推荐阅读
- javascript - 使用 jQuery Datepicker 使日期在加载时无法选择
- python - pyqt5 设计器第二个窗口按钮不起作用
- r - 我在数据框的下标分配中不允许缺失值有错误
- azure - Azure:无法将存档 blob 从一个存储帐户复制到另一个?
- java - 如何调用字符串类型的方法?
- amazon-web-services - 无法使用 s3 bitbucket 在 ec2 中部署代码
- tfs - SpecFlow:关联自动化未在 Visual Studio 2017 中显示测试
- saxon - 处理 DITA 时在 Saxon-HE 9.9.1 中出现错误:DTD 上的 I/O 错误
- java - 本体插件不可用
- firebase - 如何使用 Firebase OAuth 处理回调 URL 的请求?