首页 > 解决方案 > 使用聚合和 $group 查询时出现内存溢出错误

问题描述

我有一个名为“allvoice”的集合,其结构如下:

{
    "_id" : ObjectId("612599bb1cff80e6fc5cbf38"),
    "subscriber_id" : "e3365edb9c781a561107242a81c1a92b4269ef9a",
    "callednumber" : "559198e6f8814773551a457e53a614d603f9deab",
    "originaltimestamp" : "20200113033803",
    "duration" : "13",
    "maincost" : "255.6",
    "type" : "Internal",
    "type_network" : "local_network",
    "month" : "202001"
}

字段“originaltimestamp”为当天数据的交易周期,为了方便查询当天的大数据,我使用按日期查询“聚合”和“分组”来获取订阅者的每日交易(每日交易),然后我写到一个名为“dailyvoice”的集合中。从现在开始按日期查询交易数据,我直接在集合“dailyvoice”中查询。这是我的查询。

db.getCollection('allvoice').aggregate(
  [ 
  { 
      "$project": {
       "date": { "$toDate": "$originaltimestamp" },
       "subscriber_id":1,
       "callednumber":1,
       "originaltimestamp":1,
       "duration": 1,
       "maincost": 1,
       "type": 1,
       "type_network": 1,
       "month":1
  }},
 
  { "$group": {
    "_id": { "$dateToString": { "format": "%Y-%m-%d", "date": "$date" } },
    "data": { $push: "$$ROOT" } ,
    "count": { "$sum": 1 }
  }},
  {
    '$out': 'dailyvoice' 
  }
],  { allowDiskUse: true }
)

而“dailyvoice”集合的输出如下:

{
    "_id" : "2020-01-13",
    "data" : [ 
        {
            "_id" : ObjectId("612599bb1cff80e6fc5cbf38"),
            "subscriber_id" : "e3365edb9c781a561107242a81c1a92b4269ef9a",
            "callednumber" : "559198e6f8814773551a457e53a614d603f9deab",
            "originaltimestamp" : "20200113033803",
            "duration" : "13",
            "maincost" : "255.6",
            "type" : "trong nuoc",
            "type_network" : "local_network",
            "month" : "202001",
            "date" : ISODate("2020-01-13T03:38:03.000Z")
        }, 
        {
            "_id" : ObjectId("612599bb1cff80e6fc5cbf39"),
            "subscriber_id" : "6cf5d711bfa12160eefe62b8bc9c914370eebd70",
            "callednumber" : "0241052d42e5491b0529733716fb6fb04804248f",
            "originaltimestamp" : "20200113041608",
            "duration" : "28",
            "maincost" : "644.0",
            "type" : "trong nuoc",
            "type_network" : "global_network",
            "month" : "202001",
            "date" : ISODate("2020-01-13T04:16:08.000Z")
        }, 
        {
            "_id" : ObjectId("612599bb1cff80e6fc5cbf3a"),
            "subscriber_id" : "3e554a5a920c469da9faf7375c5265c5cf6fb696",
            "callednumber" : "307219a71c028931a4b74f8f5f014ffa16005ee9",
            "originaltimestamp" : "20200113051416",
            "duration" : "202",
            "maincost" : "2961.4",
            "type" : "trong nuoc",
            "type_network" : "local_network",
            "month" : "202001",
            "date" : ISODate("2020-01-13T05:14:16.000Z")
        }
    ],
    "count" : 3.0
}

我这里的问题是,如果集合“allvoice”有一个小数据集,查询语句工作正常,但是当集合“allvoice”有一个大数据集时,大约 114513872 条记录(文档)查询遇到的语句溢出(“聚合期间的 PlanExcutor 错误”)。有没有比增加服务器配置更好的解决方案?请看一下为我找到优化查询的方法!非常感谢

在我优化查询逐月查询后,结果还是报错:“PlanExcutor error during aggreation::cased by::BSONObj size..”

db.getCollection('allvoice').aggregate(
    [

        { $match: { month: "202001" } },

        {
            "$group": {

                "_id": {
                    "$dateToString": {
                        "format": "%Y-%m-%d", "date": { "$toDate": "$originaltimestamp" }
                    }
                },

                "data": {
                    $push: {
                        "subscriber_id": "$subscriber_id",
                        "type": "$type",
                        // "originaltimestamp":"$originaltimestamp"
                        "date": { "$toDate": "$originaltimestamp" },
                        "month": "$month"
                    }
                },

                "count": { "$sum": 1 }
            }
        },
        
        {
            '$out': 'dailyvoice_202001'
        }
    ], { allowDiskUse: true }
)

标签: mongodbmongooseaggregation-frameworkbigdataquery-optimization

解决方案


一些想法:

您不需要$project查询中的第一阶段。并且,您可以{ "$toDate": "$originaltimestamp" }$groupstage中包含_id,如下所示:

"_id": { 
    "$dateToString": { 
        "format": "%Y-%m-%d", "date": { "$toDate": "$originaltimestamp" } 
    } 
}

关于$push: "$$ROOT"- 而不是$$ROOT,仅捕获您最需要(或重要)的字段。这是为了减少内存使用。例如:

"data": { 
    $push: { 
        "subscriber_id": "$subscriber_id",
        "type": "$type",
        // other required fields...
    } 
}

最后,您可以考虑一次将查询限制为一组日期。这将需要针对不同的日期范围多次运行查询 - 但我认为总体上可能会更好。例如,一次匹配该month字段的月份。而且,这month可以为性能编制索引。这将需要$match在查询的开头(第一阶段)包含一个阶段,例如:

{ $match: { month: "202001" } }

并且,这将查询 2020 年 1 月的数据。


推荐阅读