json - Explicitly providing a schema as JSON in Spark / MongoDB integration
Question
When integrating Spark and MongoDB, it is possible to provide a sample schema in the form of an object, as described here: https://docs.mongodb.com/spark-connector/master/scala/datasets-and-sql/#sql-declare-schema
As a shortcut, here is the sample code showing how to give the MongoDB Spark connector a sample schema:
import com.mongodb.spark.MongoSpark

// The case class declares the schema explicitly.
case class Character(name: String, age: Int)
val explicitDF = MongoSpark.load[Character](sparkSession)
explicitDF.printSchema()
I have a collection with a constant document structure. I can provide a sample JSON document, but creating a sample object by hand is impractical (30k properties per document, 1.5 MB average size). Is there a way for Spark to infer the schema from just that JSON, bypassing the MongoDB connector's initial sampling, which is quite exhaustive?
Answer
Spark is able to infer the schema, especially from sources such as MongoDB. For an RDBMS, for instance, it executes a simple query that returns no rows, only the table's columns with their types (SELECT * FROM $table WHERE 1=0).
During sampling it will read all documents unless you set the configuration option samplingRatio, like this:
sparkSession.read.option("samplingRatio", 0.1)
With the setting above, Spark reads only 10% of the data. You can of course set any value you want. Be careful, though: if your documents have inconsistent schemas (e.g. 50% have a field called "A" and the others do not), the schema deduced by Spark may be incomplete and you may end up missing some data.
Some time ago I wrote a post about schema projection if you're interested: http://www.waitingforcode.com/apache-spark-sql/schema-projection/read
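To address the question more directly, here is a minimal sketch of deriving the schema from a single sample JSON document and handing it to the reader explicitly. It assumes Spark 2.2 or later (for spark.read.json on a Dataset[String]); the object name SchemaFromSampleJson, the helper names, and the two-field sample document are hypothetical, made up for illustration. The commented-out MongoDB read is also an assumption: the format name and the URI depend on your connector version and deployment, and it is left commented because it needs the connector on the classpath and a running server.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DataType, StructType}

object SchemaFromSampleJson {
  // Infer a StructType from one representative JSON document,
  // without reading anything from MongoDB.
  def inferSchema(spark: SparkSession, sampleJson: String): StructType = {
    import spark.implicits._
    spark.read.json(Seq(sampleJson).toDS()).schema
  }

  // Round-trip the schema through its JSON representation, e.g. to store it
  // in a file once and reload it in later jobs instead of inferring again.
  def roundTrip(schema: StructType): StructType =
    DataType.fromJson(schema.json).asInstanceOf[StructType]

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("schema-from-sample-json")
      .getOrCreate()

    // Hypothetical stand-in for the real 30k-property sample document.
    val sampleJson = """{"name": "Bilbo Baggins", "age": 50}"""
    val schema = inferSchema(spark, sampleJson)
    schema.printTreeString()

    // Assumption: connector 2.x format name and a hypothetical URI.
    // Passing .schema(...) supplies the schema explicitly to the reader.
    // val df = spark.read
    //   .format("com.mongodb.spark.sql.DefaultSource")
    //   .option("uri", "mongodb://localhost/test.characters")
    //   .schema(schema)
    //   .load()

    spark.stop()
  }
}
```

Note that JSON inference picks Spark's own types (whole numbers become long, for example), so the result may differ from what the connector would infer from BSON values. It is worth checking a few fields before relying on it.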