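apache-spark - Spark RAPIDS - Operation not replaced with GPU version
Question
I am new to RAPIDS and am having trouble understanding which operations are supported.
I have data in the following format: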
+------------+----------+
|        kmer|source_seq|
+------------+----------+
|TGTCGGTTTAA$|         4|
|ACCACCACCAC$|         8|
|GCATAATTTCC$|         1|
|CCGTCAAAGCG$|         7|
|CCGTCCCGTGG$|         6|
|GCGCTGTTATG$|         2|
|GAGCATAGGTG$|         5|
|CGGCGGATTCT$|         0|
|GGCGCGAGGGT$|         3|
|CCACCACCAC$A|         8|
|CACCACCAC$AA|         8|
|CCCAAAAAAAAA|         0|
|AAGAAAAAAAAA|         5|
|AAGAAAAAAAAA|         0|
|TGTAAAAAAAAA|         0|
|CCACAAAAAAAA|         8|
|AGACAAAAAAAA|         7|
|CCCCAAAAAAAA|         0|
|CAAGAAAAAAAA|         5|
|TAAGAAAAAAAA|         0|
+------------+----------+
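I am trying to find out which "kmer" values have which "source_seq" values, using the following code:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.collect_list

// Collect every source_seq that shares the same kmer into one list per row
val w = Window.partitionBy("kmer")
x.withColumn("source_seqs", collect_list("source_seq").over(w))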
// Result is something like this:
+------------+----------+-----------+
|        kmer|source_seq|source_seqs|
+------------+----------+-----------+
|AAAACAAGACCA|         2|        [2]|
|AAAACAAGCAGC|         4|        [4]|
|AAAACCACGAGC|         3|        [3]|
|AAAACCGCCAAA|         7|        [7]|
|AAAACCGGTGTG|         1|        [1]|
|AAAACCTATATC|         5|        [5]|
|AAAACGACTTCT|         6|        [6]|
|AAAACGCGCAAG|         3|        [3]|
|AAAAGGCCTATT|         7|        [7]|
|AAAAGGCGTTCG|         3|        [3]|
|AAAAGGCTGTGA|         1|        [1]|
|AAAAGGTCTACC|         2|        [2]|
|AAAAGTCGAGCA|         7|     [7, 0]|
|AAAAGTCGAGCA|         0|     [7, 0]|
|AAAATCCGATCA|         0|        [0]|
|AAAATCGAGCGG|         0|        [0]|
|AAAATCGTTGAA|         7|        [7]|
|AAAATGGACAAG|         1|        [1]|
|AAAATTGCACCA|         3|        [3]|
|AAACACCGCCGT|         3|        [3]|
+------------+----------+-----------+
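The Spark RAPIDS supported-operators documentation says that collect_list is only supported in window operations, which, as far as I can tell, is exactly what my code does.
However, looking at the query plan, it is easy to see that collect_list is not executed on the GPU: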
scala> x.withColumn("source_seqs", collect_list("source_seq").over(w)).explain
== Physical Plan ==
Window [collect_list(source_seq#302L, 0, 0) windowspecdefinition(kmer#301, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS max_source#658], [kmer#301]
+- GpuColumnarToRow false
   +- GpuSort [kmer#301 ASC NULLS FIRST], false, RequireSingleBatch, 0
      +- GpuCoalesceBatches RequireSingleBatch
         +- GpuShuffleCoalesce 2147483647
            +- GpuColumnarExchange gpuhashpartitioning(kmer#301, 200), ENSURE_REQUIREMENTS, [id=#1496]
               +- GpuFileGpuScan csv [kmer#301,source_seq#302L] Batched: true, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/home/cloud-user/phase1/example/1620833755/part-00000], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<kmer:string,source_seq:bigint>
By contrast, in a similar query that uses a different function (min), the window is executed on the GPU:
scala> x.withColumn("min_source", min("source_seq").over(w)).explain
== Physical Plan ==
GpuColumnarToRow false
+- GpuWindow [gpumin(source_seq#302L) gpuwindowspecdefinition(kmer#301, gpuspecifiedwindowframe(RowFrame, gpuspecialframeboundary(unboundedpreceding$()), gpuspecialframeboundary(unboundedfollowing$()))) AS max_source#648L], [kmer#301], false
   +- GpuSort [kmer#301 ASC NULLS FIRST], false, RequireSingleBatch, 0
      +- GpuCoalesceBatches RequireSingleBatch
         +- GpuShuffleCoalesce 2147483647
            +- GpuColumnarExchange gpuhashpartitioning(kmer#301, 200), ENSURE_REQUIREMENTS, [id=#1431]
               +- GpuFileGpuScan csv [kmer#301,source_seq#302L] Batched: true, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/home/cloud-user/phase1/example/1620833755/part-00000], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<kmer:string,source_seq:bigint>
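Have I somehow misunderstood the supported-operations documentation, or have I written the code the wrong way? Any help with this would be greatly appreciated.
Answer
Yes, as Mithun mentioned, spark.rapids.sql.expression.CollectList is true by default starting with the 0.5 release. In the 0.4 release, however, it is false: https://github.com/NVIDIA/spark-rapids/blob/branch-0.4/docs/configs.md
Here is the plan from my test on a 0.5+ release: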
val w = Window.partitionBy("name")
val resultdf=dfread.withColumn("values", collect_list("value").over(w))
resultdf.explain
== Physical Plan ==
GpuColumnarToRow false
+- GpuWindow [collect_list(value#134L, 0, 0) gpuwindowspecdefinition(name#133, gpuspecifiedwindowframe(RowFrame, gpuspecialframeboundary(unboundedpreceding$()), gpuspecialframeboundary(unboundedfollowing$()))) AS values#138], [name#133], false
   +- GpuCoalesceBatches RequireSingleBatch
      +- GpuSort [name#133 ASC NULLS FIRST], false, com.nvidia.spark.rapids.OutOfCoreSort$@28e73bd1
         +- GpuShuffleCoalesce 2147483647
            +- GpuColumnarExchange gpuhashpartitioning(name#133, 200), ENSURE_REQUIREMENTS, [id=#563]
               +- GpuFileGpuScan csv [name#133,value#134L] Batched: true, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/tmp/df], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<name:string,value:bigint>
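If upgrading is not an option, the configuration key named above can probably be switched on by hand. The following is only a rough sketch, not something verified against a 0.4 cluster: it assumes a session that was already started with the RAPIDS Accelerator plugin loaded, that spark.rapids.sql.expression.CollectList can be set at runtime, and that spark.rapids.sql.explain (which, as I understand it, logs why an operator stayed on the CPU) is available in your release. The three-row DataFrame is a made-up stand-in for the question's data, and the string check on the plan is just a convenience, not an official API for this purpose.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.collect_list

// Assumption: the session was launched with the RAPIDS plugin on the classpath
// and enabled; inside spark-shell this is simply the existing `spark` value.
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Per the answer, this expression is disabled by default on 0.4, so enable it explicitly.
spark.conf.set("spark.rapids.sql.expression.CollectList", "true")
// Assumption: asks the plugin to log the reason whenever something is kept on the CPU.
spark.conf.set("spark.rapids.sql.explain", "NOT_ON_GPU")

// Hypothetical stand-in for the question's data.
val x = Seq(("AAAAGTCGAGCA", 7L), ("AAAAGTCGAGCA", 0L), ("AAAATCCGATCA", 0L))
  .toDF("kmer", "source_seq")

val w = Window.partitionBy("kmer")
val result = x.withColumn("source_seqs", collect_list("source_seq").over(w))

// Crude check: does the physical plan contain a GPU window operator?
val onGpu = result.queryExecution.executedPlan.toString.contains("GpuWindow")
println(s"window planned on GPU: $onGpu")
```

If the check still prints false, the driver log output produced by the explain setting should state which expression or data type kept the window on the CPU.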