Spark Check if there is At Least n element in dataset

Problem description

I am using Spark (2.3.1) to do some processing on datasets. For some reason, I would like to know if there is enough data in my Dataset before doing my computation.

The basic solution to do that is the following:

long count = myDataset.count();  // count() returns a long
int threshold = 100;

if (count > threshold) {
    // compute
} else {
    System.out.println("Not enough data to do computation");
}

But it is really inefficient. Another solution that is a bit more efficient is to use the countApprox() function.

long count = (long) myDataset.rdd().countApprox(1000, 0.90).getFinalValue().mean();

But in my case, it should be possible to do this far more efficiently.

What is the best way to solve this problem?

Note:

Tags: apache-spark, apache-spark-dataset

Solution


If you do myDataset.count(), it will scan the full data and may be slow.

To speed this up, you can apply limit(threshold + 1) to your dataset. This returns another Dataset containing at most threshold + 1 rows, on which you can then call .count():

    int threshold = 100;
    long totalRowsAfterLimit = myDataset.limit(threshold + 1).count();

    if (totalRowsAfterLimit > threshold) {
        // compute
    } else {
        System.out.println("Not enough data to do computation");
    }

limit(threshold + 1) ensures that the underlying job only reads a limited number of records, so it finishes much faster.
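The short-circuiting idea behind limit(threshold + 1).count() can be illustrated with plain Java streams, without a Spark cluster. This is only a local analogy: LongStream stands in for the distributed Dataset, and the class and method names here are illustrative, not part of any Spark API.

```java
import java.util.stream.LongStream;

public class AtLeastN {
    // Returns true if the stream has more than `threshold` elements,
    // consuming at most threshold + 1 of them — the same idea as
    // myDataset.limit(threshold + 1).count() in Spark.
    static boolean hasMoreThan(LongStream stream, int threshold) {
        return stream.limit(threshold + 1L).count() > threshold;
    }

    public static void main(String[] args) {
        // An effectively unbounded source: a full count() would never
        // finish, but the limited count stops after 101 elements.
        System.out.println(hasMoreThan(LongStream.iterate(0, i -> i + 1), 100)); // true
        System.out.println(hasMoreThan(LongStream.range(0, 50), 100));           // false
    }
}
```

The key point carries over to Spark: the limited count is bounded by the threshold rather than by the size of the data, so its cost no longer grows with the dataset.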
