apache-spark - Spark Check if there is At Least n element in dataset
Question
I am using Spark (2.3.1) to do some processing on datasets. For some reason, I would like to know if there is enough data in my Dataset before doing my computation.
The basic solution to do that is the following:
long count = myDataset.count(); // Dataset.count() returns a long
int threshold = 100;
if (count > threshold) {
    // compute
} else {
    System.out.println("Not enough data to do computation");
}
But it is really inefficient, since it counts every row. Another solution that is a bit more efficient is to use the countApprox() function:
long count = (long) myDataset.rdd().countApprox(1000, 0.90).getFinalValue().mean();
But in my case, this could still be made much more efficient.
What is the best way to solve this problem ?
Note:
- I was thinking of iterating over my data, manually counting my rows and stopping when I reach the threshold, but I am not sure it is the best solution.
Solution
If you do myDataset.count(), it will scan the full data and may be slow.
To speed this up, you can apply limit(threshold+1) to your Dataset. This returns another Dataset containing at most threshold+1 rows, on which you can then call .count():
int threshold = 100;
long totalRowsAfterLimit = myDataset.limit(threshold + 1).count();
if (totalRowsAfterLimit > threshold) {
    // compute
} else {
    System.out.println("Not enough data to do computation");
}
limit(threshold+1) ensures that the underlying job reads only a limited number of records, so it finishes much faster.
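The same short-circuit idea can be illustrated without a Spark cluster using plain Java streams: limit stops pulling elements once enough have been seen, so counting the limited stream never scans the full source. A minimal, Spark-free sketch (the hasAtLeast helper name is my own, by analogy with myDataset.limit(threshold + 1).count() > threshold):

```java
import java.util.stream.LongStream;
import java.util.stream.Stream;

public class AtLeastDemo {
    // Returns true if the stream yields more than `threshold` elements,
    // while reading at most threshold + 1 of them. This mirrors
    // myDataset.limit(threshold + 1).count() > threshold in Spark.
    static boolean hasAtLeast(Stream<?> stream, long threshold) {
        return stream.limit(threshold + 1).count() > threshold;
    }

    public static void main(String[] args) {
        // An effectively unbounded source: a full count() would never
        // finish, but the limited count returns after 101 elements.
        Stream<Long> unbounded = LongStream.iterate(0, i -> i + 1).boxed();
        System.out.println(hasAtLeast(unbounded, 100)); // prints "true"

        // A small source: fewer than 100 rows, so the check fails.
        Stream<Long> small = LongStream.range(0, 10).boxed();
        System.out.println(hasAtLeast(small, 100)); // prints "false"
    }
}
```

In Spark the benefit is even larger than in this sketch, because limit lets executors stop reading partitions early instead of merely truncating an in-memory stream.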