Optimizing a slow aggregation mongo query

Problem description

I have two collections, forms4InsTrader_final (2 million documents) and TradeData (13 million documents). I'm having trouble understanding why $out does not save my aggregation results.

The aggregation below has the following stages:

  1. Stage 1: $match dates within a specific range: {'pdOfRpt': {'$gte': '2004-01-01', '$lte': '2020-12-31' }}

  2. Stage 2: join ($lookup) forms4InsTrader_final to TradeData: {'from': 'aprl_test_Trade', 'localField': 'issuertradingsymbol', 'foreignField': 'ticker', 'as': 'string'}

  3. Stage 3: $unwind the 'string' array from above

  4. Stage 4: then match dates within the same document: {'$expr': {'$eq': [ '$pdOfRpt', '$string.Date_unmodified']}}

  5. Stage 5: $unwind

  6. Stage 6: $project the few fields I need for my analysis

  7. Stage 7: save the result with $out
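For readers unfamiliar with these stages, here is a minimal plain-JavaScript sketch of what stages 2 through 4 compute: a join on ticker symbol, one output row per joined trade, then an equality filter on the date. The sample documents and values are invented for illustration only.

```javascript
// Tiny in-memory stand-ins for the two collections (hypothetical data).
const forms = [
  { pdOfRpt: '2020-01-02', issuertradingsymbol: 'AAPL' },
  { pdOfRpt: '2020-01-03', issuertradingsymbol: 'AAPL' },
];
const trades = [
  { ticker: 'AAPL', Date_unmodified: '2020-01-02', Close: 75.0 },
  { ticker: 'AAPL', Date_unmodified: '2020-01-09', Close: 77.5 },
];

const joined = forms
  // $lookup: attach every trade whose ticker matches the form's symbol
  .map(f => ({ ...f, string: trades.filter(t => t.ticker === f.issuertradingsymbol) }))
  // $unwind: emit one document per joined trade
  .flatMap(f => f.string.map(s => ({ ...f, string: s })))
  // $match with $expr: keep only rows where the two dates line up
  .filter(f => f.pdOfRpt === f.string.Date_unmodified);

console.log(joined.length); // only the form/trade pair with matching dates survives
```

Note that this join logic multiplies documents before filtering them back down, which is part of why the full pipeline is expensive on 13 million trades.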

All of the steps above work as expected on these two collections, except stage 7. I want to save this result as a separate collection. The pipeline has been running for more than three hours with the result limited to about 1 million documents, but I don't see the result saved in a different collection. Interestingly, when I run this query with $limit set to 20000 documents, it saves in under a minute. I don't understand why saving a result of about 1 million documents with $out takes so long. What am I missing here?

Note that I tried this locally both with the visual query builder in Compass and from the terminal.

The full pipeline:

```
db.forms4InsTrader_final.aggregate([
    { '$match': {
        'pdOfRpt': { '$gte': '2004-01-01', '$lte': '2020-12-31' }
    } },
    { '$lookup': {
        'from': 'TradeData',
        'localField': 'issuertradingsymbol',
        'foreignField': 'ticker',
        'as': 'string'
    } },
    { '$unwind': {
        'path': '$string',
        'includeArrayIndex': 'Date_unmodified'
    } },
    { '$match': {
        '$expr': { '$eq': [ '$pdOfRpt', '$string.Date_unmodified' ] }
    } },
    { '$project': {
        'string.Adj Close': 1,
        'string.Volume': 1,
        'string.Close': 1,
        'string.avg_Week_Vol': 1,
        'string.avg_Week_Adj_Close_Price': 1,
        'string.Date_unmodified': 1,
        'pdOfRpt': 1,
        'issuercik': 1,
        'issuertradingsymbol': 1,
        'reportingownerid_rptownercik': 1,
        'reportingowneraddress_rptownerzipcode': 1,
        'reportingownerrelationship_isdirector': 1,
        'reportingownerrelationship_isofficer': 1,
        'reportingownerrelationship_istenpercentowner': 1,
        'reportingownerrelationship_isother': 1,
        'nonderivativetransaction_securitytitle_value': 1,
        'nonderivativetransaction_transactionamounts_transactionshares_value': 1,
        'nonderivativetransaction_transactionamounts_transactionpricepershare_value': 1,
        'nonderivativetransaction_transactionamounts_transactionacquireddisposedcode_value': 1,
        'nonderivativetransaction_posttransactionamounts_sharesownedfollowingtransaction_value': 1,
        'derivativetransaction_securitytitle_value': 1,
        'derivativetransaction_transactionamounts_transactionshares_value': 1,
        'derivativetransaction_transactionamounts_transactionpricepershare_value': 1,
        'derivativetransaction_transactionamounts_transactionacquireddisposedcode_value': 1,
        'derivativetransaction_ownershipnature_directorindirectownership_value': 1,
        'derivativetransaction_underlyingsecurity_underlyingsecuritytitle_value': 1,
        'derivativetransaction_underlyingsecurity_underlyingsecurityshares_value': 1,
        'derivativetransaction_posttransactionamounts_sharesownedfollowingtransaction_value': 1
    } },
    { '$limit': 1000000 },
    { '$out': 'TestAPril20' }
])
```
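When a long `$out` pipeline appears stalled, it can help to confirm from a second shell session whether the aggregation is still in progress. A hedged sketch (the filter shape is generic; the collection name is taken from the question) using `db.currentOp()`:

```
// In a second mongosh session: list in-progress aggregate
// commands running against this collection.
db.currentOp({
    'command.aggregate': 'forms4InsTrader_final'
})
```

If the operation still appears here, the pipeline is working rather than hung; `$out` only replaces the target collection once the whole pipeline finishes, so no partial results are visible along the way.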

Hacky workaround: Compass generates a temporary collection, which you can export from and then re-import as a separate collection. Very inefficient, but it lets me do this manually until I find another solution.
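The manual export/re-import described above can also be scripted with the standard mongoexport/mongoimport tools. A hedged sketch, where the database name and the temporary collection name are placeholders that depend on what Compass actually created:

```
# Export the temporary collection Compass generated (names are placeholders),
# then re-import the file as a separate, permanent collection.
mongoexport --db=test --collection=tmp_compass_results --out=results.json
mongoimport --db=test --collection=TestAPril20 --file=results.json
```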

Tags: mongodb, mongodb-query, aggregation-framework

Solution


I vaguely remember this was the solution when it happened to me; let me know if allowDiskUse solves your problem.

From the MongoDB documentation:

Pipeline stages have a limit of 100 megabytes of RAM. If a stage exceeds this limit, MongoDB will produce an error. To allow for the handling of large datasets, use the allowDiskUse option to enable aggregation pipeline stages to write data to temporary files.

It's an option you pass to the aggregate call; this is a sample of how to use it:

db.stocks.aggregate( [
      { $project : { cusip: 1, date: 1, price: 1, _id: 0 } },
      { $sort : { cusip : 1, date: 1 } }
   ],
   { allowDiskUse: true }
)
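Applied to the pipeline from the question, nothing changes except the options document passed as the second argument to aggregate (stages abbreviated here; use the full pipeline from the question):

```
db.forms4InsTrader_final.aggregate(
   [
      { '$match': { 'pdOfRpt': { '$gte': '2004-01-01', '$lte': '2020-12-31' } } },
      /* ...$lookup, $unwind, $match, $project, $limit as in the question... */
      { '$out': 'TestAPril20' }
   ],
   { allowDiskUse: true }   // let stages spill to temporary files on disk
)
```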
