How do I put calculated data into an INSERT query in spark.sql code?

Problem description

I am trying to insert 4 values into a table. For the insert I am using the code below:

spark.sql("INSERT INTO countDirectorySize VALUES (`" +str(root) + "`, `" +str(dirSize) + "`, `" +str(today1) + "`, `" +str(today2)+ "`)")

Let me put the whole code here for better understanding:

import os
from datetime import date

size = 0
root = "/dbfs/mnt/datalake/.../"
today1 = date.today()
for path, subdirs, files in os.walk(root):
    for name in files:
        size += os.path.getsize(os.path.join(path, name))
dirSize = size/(1024*1024)
today2 = date.today()

spark.sql("INSERT INTO countDirectorySize VALUES (`" +str(root) + "`, `" +str(dirSize) + "`, `" +str(today1) + "`, `" +str(today2)+ "`)")

This code throws an error on the insert. What exactly am I doing wrong?

Error:

cannot resolve '`/dbfs/mnt/datalake/.../`' given input columns: []; line 1 pos 39;
'InsertIntoTable 'UnresolvedRelation `countDirectorySize`, false, false
+- 'UnresolvedInlineTable [col1, col2, col3, col4], [List('/dbfs/mnt/datalake/.../, '`9.256933212280273`, '2020-06-01, '2020-06-01)]

Tags: python, apache-spark, pyspark, apache-spark-sql

Solution

Instead of a plain INSERT statement, I would suggest building a DataFrame and registering it as a table.
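
As an aside, the original statement fails because backticks in Spark SQL quote identifiers (column names), not string literals, so Spark tries to resolve `/dbfs/mnt/datalake/.../` as a column name. A minimal sketch of the direct fix, assuming the countDirectorySize table already exists and its date columns accept ISO-8601 strings, is to use single quotes for the string values instead:

# Hedged sketch: in Spark SQL, single quotes mark string literals,
# while backticks are reserved for identifiers such as column names.
spark.sql(
    "INSERT INTO countDirectorySize VALUES ('{}', {}, '{}', '{}')".format(
        root, dirSize, today1, today2
    )
)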

Here is what I did with your code:

import os
from datetime import date

data = []
size = 0
root = "/Users/hbohra/Desktop/"
today1 = date.today()
# Walk the directory tree and accumulate the total size of all files.
for path, subdirs, files in os.walk(root):
    for name in files:
        size += os.path.getsize(os.path.join(path, name))
dirSize = size/(1024*1024)  # total size in MB
today2 = date.today()
data.append((root, dirSize, today1, today2))

# Build a single-row DataFrame and register it as a temp view.
df = spark.createDataFrame(data, schema=['dir_name', 'dir_size', 'created_date', 'modified_date'])
df.createOrReplaceTempView('countDirectorySize')  # registerTempTable is deprecated

# spark.sql("INSERT INTO countDirectorySize VALUES (`" +str(root) + "`, `" +str(dirSize) + "`, `" +str(today1) + "`, `" +str(today2)+ "`)")

df2 = spark.sql('select * from countDirectorySize')

Output:

>>> df2.show()
+--------------------+-----------------+------------+-------------+
|            dir_name|         dir_size|created_date|modified_date|
+--------------------+-----------------+------------+-------------+
|/Users/hbohra/Des...|428.3244695663452|  2020-06-01|   2020-06-01|
+--------------------+-----------------+------------+-------------+

>>>
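
By the way, if countDirectorySize is a pre-existing table that you want to append to, as in the original question, you can also write the DataFrame into it directly instead of going through a temp view. A minimal sketch, assuming the table already exists and its column order matches the DataFrame:

# Hedged sketch: append the DataFrame rows to an existing table.
# Assumes countDirectorySize already exists and its column order matches df.
df.write.insertInto('countDirectorySize')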

Let me know if this helps.

