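azure-databricks - Azure Databricks: counting rows in all tables - is there a better way?
Problem description
I am trying to find the best way to get the row count of every Databricks table. This is what I came up with: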
# List every database, then collect the matching tables from each one
dfdbs = sqlContext.sql("show databases")
dftbls = None
for row in dfdbs.collect():
    tmp = "show tables from " + row['databaseName'] + " like 'xxx*'"
    tbls = sqlContext.sql(tmp)
    # Start from the first result and union the rest onto it
    dftbls = tbls if dftbls is None else dftbls.union(tbls)

# Read each table in full and count its rows
tmplist = []
for row in dftbls.collect():
    tmp = 'select * from ' + row['database'] + '.' + row['tableName']
    tmpdf = sqlContext.sql(tmp)
    tmplist.append((row['database'], row['tableName'], tmpdf.count()))

columns = ['database', 'tableName', 'rowCount']
df = spark.createDataFrame(tmplist, columns)
display(df)
Solution
I found this to be noticeably faster...
# List every database, then collect the tables from each one
dfdbs = sqlContext.sql("show databases")
dftbls = None
for row in dfdbs.collect():
    tmp = "show tables from " + row['databaseName']
    tbls = sqlContext.sql(tmp)
    # Start from the first result and union the rest onto it
    dftbls = tbls if dftbls is None else dftbls.union(tbls)

# Run count(*) per table; record -1 for tables that cannot be queried
tmplist = []
for row in dftbls.collect():
    try:
        tmp = 'select count(*) myrowcnt from ' + row['database'] + '.' + row['tableName']
        tmpdf = sqlContext.sql(tmp)
        myrowcnt = tmpdf.collect()[0]['myrowcnt']
        tmplist.append((row['database'], row['tableName'], myrowcnt))
    except Exception:
        tmplist.append((row['database'], row['tableName'], -1))

columns = ['database', 'tableName', 'rowCount']
df = spark.createDataFrame(tmplist, columns)
display(df)
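As a possible variation on the same loop, the sketch below walks the metastore through the spark.catalog API instead of building "show databases" / "show tables" strings. It is only a sketch under a few assumptions: the function name count_rows_per_table is hypothetical, spark is the SparkSession that a Databricks notebook already provides, and temporary views are skipped because they do not belong to a database. It is not claimed to be faster than the count(*) version; it simply avoids concatenating SQL and depending on the column names returned by the SHOW commands.

```python
def count_rows_per_table(spark):
    """Return a list of (database, tableName, rowCount) tuples.

    `spark` is assumed to be an existing SparkSession (hypothetical helper,
    not part of the original post). rowCount is -1 for tables that could
    not be read, mirroring the try/except in the loop above.
    """
    results = []
    for db in spark.catalog.listDatabases():
        for tbl in spark.catalog.listTables(db.name):
            if tbl.isTemporary:
                # Temporary views are not tied to a database; skip them.
                continue
            try:
                # Backticks guard against unusual characters in the names.
                row_count = spark.table(f"`{db.name}`.`{tbl.name}`").count()
            except Exception:
                row_count = -1
            results.append((db.name, tbl.name, row_count))
    return results
```

In a notebook, the result can be displayed the same way as before: df = spark.createDataFrame(count_rows_per_table(spark), ['database', 'tableName', 'rowCount']) followed by display(df).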