Azure Databricks: counting rows in all tables - is there a better way?

Question

I'm trying to find the best way to get the row counts for all of my Databricks tables. This is what I came up with:

dfdbs = sqlContext.sql("show databases")
dftbls = None
for row in dfdbs.rdd.collect():
    # List the tables in this database whose names match the pattern
    tmp = "show tables from " + row['databaseName'] + " like 'xxx*'"
    if dftbls is None:
        dftbls = sqlContext.sql(tmp)
    else:
        dftbls = dftbls.union(sqlContext.sql(tmp))
tmplist = []
for row in dftbls.rdd.collect():
    # Select every row of the table, then count them
    tmp = 'select * from ' + row['database'] + '.' + row['tableName']
    tmpdf = sqlContext.sql(tmp)
    tmplist.append((row['database'], row['tableName'], tmpdf.count()))
columns = ['database', 'tableName', 'rowCount']
df = spark.createDataFrame(tmplist, columns)
display(df)

Tags: azure-databricks

Answer


I found this to be significantly faster:

dfdbs = sqlContext.sql("show databases")
dftbls = None
for row in dfdbs.rdd.collect():
    # List every table in this database
    tmp = "show tables from " + row['databaseName']
    if dftbls is None:
        dftbls = sqlContext.sql(tmp)
    else:
        dftbls = dftbls.union(sqlContext.sql(tmp))
tmplist = []
for row in dftbls.rdd.collect():
    try:
        # Let Spark compute the count and bring back only that one value
        tmp = 'select count(*) myrowcnt from ' + row['database'] + '.' + row['tableName']
        tmpdf = sqlContext.sql(tmp)
        myrowcnt = tmpdf.collect()[0]['myrowcnt']
        tmplist.append((row['database'], row['tableName'], myrowcnt))
    except Exception:
        # Record -1 for any table that cannot be queried
        tmplist.append((row['database'], row['tableName'], -1))

columns = ['database', 'tableName', 'rowCount']
df = spark.createDataFrame(tmplist, columns)
display(df)
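
The loop above builds each query by concatenating the database and table names, and records -1 when a count fails. As a rough sketch only, the same pattern can be split into small helpers so that names containing unusual characters are backtick-quoted and the per-table logic can be tried without a cluster. The names `quote_ident`, `count_query`, `collect_counts`, and `run_count` are hypothetical and not part of any Spark API; `run_count` stands for whatever callable actually executes the SQL.

```python
def quote_ident(name):
    """Backtick-quote an identifier for Spark SQL, doubling embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def count_query(database, table):
    """Build the COUNT(*) statement for one table."""
    return "SELECT COUNT(*) AS myrowcnt FROM {}.{}".format(
        quote_ident(database), quote_ident(table))


def collect_counts(tables, run_count):
    """Return (database, tableName, rowCount) tuples; -1 marks a failed count.

    tables    -- iterable of (database, tableName) pairs
    run_count -- hypothetical callable: takes a SQL string, returns an int
    """
    results = []
    for database, table in tables:
        try:
            count = run_count(count_query(database, table))
            results.append((database, table, count))
        except Exception:
            # Same sentinel as the answer: tables that cannot be queried get -1
            results.append((database, table, -1))
    return results
```

In a notebook, assuming `dftbls` has been built as in the answer, this might be called as `collect_counts([(r['database'], r['tableName']) for r in dftbls.collect()], lambda q: spark.sql(q).collect()[0][0])`, and the resulting list passed to `spark.createDataFrame` as before.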
