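java - Partitioning and storing a Dataset with a String column whose values look numeric: when read back, the data is still a "string" but has lost its leading zeros
Problem description
With Spark 3.0.2, I am writing a Dataset to a Parquet file. The code that writes it ends like this: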
etablissements = etablissements.repartition(col("codeDepartement"));
etablissements = etablissements.sortWithinPartitions(col("siret"));
etablissements = etablissements.persist();
// Write it in a file named with the year of data, selections, and sorting in its name.
// Underlying statement writing the parquet file is :
// ds.write().partitionBy(colonnesPartionnement /* = codeDepartement */)
saveToStore(etablissements, new String[] {"codeDepartement"},
"{0}_{1,number,#0}_{2}_{3}", "etablissements", anneeSIRENE, actifsSeulement,
communesValides);
codeDepartement is a StringType, because a French department code is a code of up to three characters.
# schema() :
|-- codeDepartement: string (nullable = true)
It is visible in the last third of this show() output (three columns before the upper-case city names), where it has the value "01":
+---------+-----+--------------+----------------------------+-------------------------+----------------------+---------------------------+---------------------------------+---------------------+------------------+---------------------------+--------------------------+----------+----------------+----------+-------------------------+----------+-------------------+--------------------+-----------+-----+---------------------+----------------+---------------+---------------------------+--------------------+--------------------------+--------------------+---------------------+--------------------+--------------------+-----------------------------+------------------------------+---------------------+---------------+----------------------+--------------------------+-------------------------+----------------------+------------------------------+-------------------+---------+---------+-------------------------+------------------+------------------------------+-------------------------------+------+----------------------------+------------------+-----+--------------+-----------+----------+---------------+--------------+------------------+------------------------+------------------------+------------------------+----------+------------------+-------------+------------+----------------+--------------------+---------------------+------------------+---------+----------------------------------+---------------------------------------------------------------------------------------------+
|siren |nic |siret |statutDiffusionEtablissement|dateCreationEtablissement|trancheEffectifSalarie|anneeEffectifsEtablissement|activiteArtisanRegistreDesMetiers|dateDernierTraitement|etablissementSiege|nombrePeriodesEtablissement|complementAdresse |numeroVoie|indiceRepetition|typeDeVoie|libelleVoie |codePostal|nomCommuneEtrangere|distributionSpeciale|codeCommune|cedex|libelleCedex |codePaysEtranger|nomPaysEtranger|complementAdresseSecondaire|numeroVoieSecondaire|indiceRepetitionSecondaire|typeDeVoieSecondaire|libelleVoieSecondaire|codePostalSecondaire|nomCommuneSecondaire|nomCommuneEtrangereSecondaire|distributionSpecialeSecondaire|codeCommuneSecondaire|cedexSecondaire|libelleCedexSecondaire|codePaysEtrangerSecondaire|nomPaysEtrangerSecondaire|dateDebutHistorisation|etatAdministratifEtablissement|enseigne1 |enseigne2|enseigne3|denominationEtablissement|activitePrincipale|nomenclatureActivitePrincipale|caractereEmployeurEtablissement|active|anneeValiditeEffectifSalarie|caractereEmployeur|siege|nombrePeriodes|typeCommune|codeRegion|codeDepartement|arrondissement|typeNomEtCharniere|nomMajuscules |nomCommune |libelle |codeCanton|codeCommuneParente|strateCommune|sirenCommune|populationTotale|populationMunicipale|populationCompteApart|sirenCommuneMembre|codeEPCI |nomEPCI |libelleNAF |
+---------+-----+--------------+----------------------------+-------------------------+----------------------+---------------------------+---------------------------------+---------------------+------------------+---------------------------+--------------------------+----------+----------------+----------+-------------------------+----------+-------------------+--------------------+-----------+-----+---------------------+----------------+---------------+---------------------------+--------------------+--------------------------+--------------------+---------------------+--------------------+--------------------+-----------------------------+------------------------------+---------------------+---------------+----------------------+--------------------------+-------------------------+----------------------+------------------------------+-------------------+---------+---------+-------------------------+------------------+------------------------------+-------------------------------+------+----------------------------+------------------+-----+--------------+-----------+----------+---------------+--------------+------------------+------------------------+------------------------+------------------------+----------+------------------+-------------+------------+----------------+--------------------+---------------------+------------------+---------+----------------------------------+---------------------------------------------------------------------------------------------+
|015850944|00024|01585094400024|O |2007-04-01 |11 |2017 |null |2019-11-14T14:00:12 |false |2 |ZONE INDUSTRIELLE |null |null |CHE |DE THIL |01700 |null |null |01376 |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |2008-01-01 |A |null |null |null |null |25.73B |NAFRev2 |O |true |2017 |true |false|2 |COM |84 |01 |012 |0 |SAINT MAURICE DE BEYNOST|Saint-Maurice-de-Beynost|Saint-Maurice-de-Beynost|0113 |null |5 |210103768 |4006 |3967 |39 |210103768 |240100800|CC de Miribel et du Plateau |Fabrication d'autres outillages |
|015851793|00479|01585179300479|O |2005-01-01 |11 |2017 |null |2019-06-24T13:04:28 |false |2 |null |null |null |null |ZONE INDUST LA FONTAINE |01290 |null |null |01134 |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |2008-01-01 |A |null |null |null |null |46.73A |NAFRev2 |O |true |2017 |true |false|2 |COM |84 |01 |012 |0 |CROTTET |Crottet |Crottet |0123 |null |3 |210101341 |1777 |1734 |43 |210101341 |200070555|CC de la Veyle |Commerce de gros (commerce interentreprises) de bois et de matériaux de construction |
|015851793|00743|01585179300743|O |2012-09-01 |02 |2017 |null |2019-06-24T13:04:28 |false |1 |ZA ACTIPARC |null |null |null |PRE LION |01190 |null |null |01057 |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |2012-09-01 |A |null |null |null |DORAS |46.73A |NAFRev2 |O |true |2017 |true |false|1 |COM |84 |01 |012 |0 |BOZ |Boz |Boz |0117 |null |3 |210100574 |519 |512 |7 |210100574 |200071371|CC Bresse et Saône |Commerce de gros (commerce interentreprises) de bois et de matériaux de construction |
|015851793|00917|01585179300917|O |2020-01-01 |null |null |null |2020-01-31T16:13:25 |false |1 |null |28 |null |AV |DE MARBOZ |01000 |null |null |01053 |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |2020-01-01 |A |CLEAU |null |null |null |46.73A |NAFRev2 |O |true |null |true |false|1 |COM |84 |01 |012 |0 |BOURG EN BRESSE |Bourg-en-Bresse |Bourg-en-Bresse |0199 |null |8 |210100533 |43306 |41527 |1779 |210100533 |200071751|CA du Bassin de Bourg-en-Bresse |Commerce de gros (commerce interentreprises) de bois et de matériaux de construction |
The folders under my Parquet output look fine:
codeDepartement=01
codeDepartement=2A
codeDepartement=75
codeDepartement=971
Note: because of values such as 2A (for Corse), department codes can never be converted to numeric values.
The snappy.parquet chunks are stored in their respective folders, for example /data/tmp/etablissements_2020_true_true/codeDepartement=01. So far, so good.
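On the reading side, I then read from that store, searching for cities whose city code (which in France begins with the department code) starts with "01". The expected Parquet file and chunk are read: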
2021-03-24 07:14:33.825 INFO 13860 --- [er for task 106] o.a.s.s.e.datasources.FileScanRDD : Reading File path: file:/data/tmp/etablissements_2020_true_true/codeDepartement=01/part-00024-f7d33eea-6d79-4f1a-bf35-0666dcc5e0f5.c000.snappy.parquet, range: 0-5246504, partition values: [1]
But when the department is displayed (it is now the last column of the dataset's show() output), its value is "1" instead of "01":
+---------+-----+--------------+----------------------------+-------------------------+----------------------+---------------------------+---------------------------------+---------------------+------------------+---------------------------+--------------------------+----------+----------------+----------+-------------------------+----------+-------------------+--------------------+-----------+-----+---------------------+----------------+---------------+---------------------------+--------------------+--------------------------+--------------------+---------------------+--------------------+--------------------+-----------------------------+------------------------------+---------------------+---------------+----------------------+--------------------------+-------------------------+----------------------+------------------------------+-------------------+---------+---------+-------------------------+------------------+------------------------------+-------------------------------+------+----------------------------+------------------+-----+--------------+-----------+----------+--------------+------------------+------------------------+------------------------+------------------------+----------+------------------+-------------+------------+----------------+--------------------+---------------------+------------------+---------+----------------------------------+---------------------------------------------------------------------------------------------+---------------+
|siren |nic |siret |statutDiffusionEtablissement|dateCreationEtablissement|trancheEffectifSalarie|anneeEffectifsEtablissement|activiteArtisanRegistreDesMetiers|dateDernierTraitement|etablissementSiege|nombrePeriodesEtablissement|complementAdresse |numeroVoie|indiceRepetition|typeDeVoie|libelleVoie |codePostal|nomCommuneEtrangere|distributionSpeciale|codeCommune|cedex|libelleCedex |codePaysEtranger|nomPaysEtranger|complementAdresseSecondaire|numeroVoieSecondaire|indiceRepetitionSecondaire|typeDeVoieSecondaire|libelleVoieSecondaire|codePostalSecondaire|nomCommuneSecondaire|nomCommuneEtrangereSecondaire|distributionSpecialeSecondaire|codeCommuneSecondaire|cedexSecondaire|libelleCedexSecondaire|codePaysEtrangerSecondaire|nomPaysEtrangerSecondaire|dateDebutHistorisation|etatAdministratifEtablissement|enseigne1 |enseigne2|enseigne3|denominationEtablissement|activitePrincipale|nomenclatureActivitePrincipale|caractereEmployeurEtablissement|active|anneeValiditeEffectifSalarie|caractereEmployeur|siege|nombrePeriodes|typeCommune|codeRegion|arrondissement|typeNomEtCharniere|nomMajuscules |nomCommune |libelle |codeCanton|codeCommuneParente|strateCommune|sirenCommune|populationTotale|populationMunicipale|populationCompteApart|sirenCommuneMembre|codeEPCI |nomEPCI |libelleNAF |codeDepartement|
+---------+-----+--------------+----------------------------+-------------------------+----------------------+---------------------------+---------------------------------+---------------------+------------------+---------------------------+--------------------------+----------+----------------+----------+-------------------------+----------+-------------------+--------------------+-----------+-----+---------------------+----------------+---------------+---------------------------+--------------------+--------------------------+--------------------+---------------------+--------------------+--------------------+-----------------------------+------------------------------+---------------------+---------------+----------------------+--------------------------+-------------------------+----------------------+------------------------------+-------------------+---------+---------+-------------------------+------------------+------------------------------+-------------------------------+------+----------------------------+------------------+-----+--------------+-----------+----------+--------------+------------------+------------------------+------------------------+------------------------+----------+------------------+-------------+------------+----------------+--------------------+---------------------+------------------+---------+----------------------------------+---------------------------------------------------------------------------------------------+---------------+
|015850944|00024|01585094400024|O |2007-04-01 |11 |2017 |null |2019-11-14T14:00:12 |false |2 |ZONE INDUSTRIELLE |null |null |CHE |DE THIL |01700 |null |null |01376 |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |2008-01-01 |A |null |null |null |null |25.73B |NAFRev2 |O |true |2017 |true |false|2 |COM |84 |012 |0 |SAINT MAURICE DE BEYNOST|Saint-Maurice-de-Beynost|Saint-Maurice-de-Beynost|0113 |null |5 |210103768 |4006 |3967 |39 |210103768 |240100800|CC de Miribel et du Plateau |Fabrication d'autres outillages |1 |
|015851793|00479|01585179300479|O |2005-01-01 |11 |2017 |null |2019-06-24T13:04:28 |false |2 |null |null |null |null |ZONE INDUST LA FONTAINE |01290 |null |null |01134 |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |2008-01-01 |A |null |null |null |null |46.73A |NAFRev2 |O |true |2017 |true |false|2 |COM |84 |012 |0 |CROTTET |Crottet |Crottet |0123 |null |3 |210101341 |1777 |1734 |43 |210101341 |200070555|CC de la Veyle |Commerce de gros (commerce interentreprises) de bois et de matériaux de construction |1 |
|015851793|00743|01585179300743|O |2012-09-01 |02 |2017 |null |2019-06-24T13:04:28 |false |1 |ZA ACTIPARC |null |null |null |PRE LION |01190 |null |null |01057 |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |2012-09-01 |A |null |null |null |DORAS |46.73A |NAFRev2 |O |true |2017 |true |false|1 |COM |84 |012 |0 |BOZ |Boz |Boz |0117 |null |3 |210100574 |519 |512 |7 |210100574 |200071371|CC Bresse et Saône |Commerce de gros (commerce interentreprises) de bois et de matériaux de construction |1 |
|015851793|00917|01585179300917|O |2020-01-01 |null |null |null |2020-01-31T16:13:25 |false |1 |null |28 |null |AV |DE MARBOZ |01000 |null |null |01053 |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |null |2020-01-01 |A |CLEAU |null |null |null |46.73A |NAFRev2 |O |true |null |true |false|1 |COM |84 |012 |0 |BOURG EN BRESSE |Bourg-en-Bresse |Bourg-en-Bresse |0199 |null |8 |210100533 |43306 |41527 |1779 |210100533 |200071751|CA du Bassin de Bourg-en-Bresse |Commerce de gros (commerce interentreprises) de bois et de matériaux de construction |1 |
Yet the Parquet file still declares it as StringType:
|-- codeDepartement: string (nullable = true)
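What happened?
I am inclined to blame the repartition() statement for this mess, but I cannot see how it would cause it. If that command were misleading and data could not be partitioned by string values, how would programs partition their data by colors written out in letters, such as red, blue and yellow?
I don't understand the overall behavior (issue?) I am facing.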
Solution
I was able to reproduce the issue.
spark.sql("select '01' key, 123 val union all select 'ab', 456").show()
+---+---+
|key|val|
+---+---+
| 01|123|
| ab|456|
+---+---+
spark.sql("select '01' key, 123 val union all select 'ab', 456").write().partitionBy("key").parquet("test")
spark.read().parquet("test").show()
+---+---+
|val|key|
+---+---+
|456| ab|
|123| 1|
+---+---+
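The output above is consistent with Spark's partition discovery, which by default infers a type for each partition value from the directory name (here key=01 and key=ab) instead of reading it from the Parquet files. The following is a minimal plain-Java sketch of that idea, not Spark's actual code: the class and method names are hypothetical, and it only models the mixed case where some directory values parse as integers and others do not.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of partition-value type inference; NOT Spark's actual code.
class PartitionInferenceSketch {

    // Returns the integer a directory value parses to, or null when it is not an integer.
    static Integer tryParseInt(String raw) {
        try {
            return Integer.valueOf(raw);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    // Models a read without a user-supplied schema: each value is parsed on its own,
    // then every value is rendered as text because the column as a whole is a string.
    static List<String> discoverWithInference(List<String> rawValues) {
        List<String> result = new ArrayList<>();
        for (String raw : rawValues) {
            Integer parsed = tryParseInt(raw);
            // "01" parses to 1, so converting it back to text yields "1"
            result.add(parsed != null ? String.valueOf(parsed) : raw);
        }
        return result;
    }

    // Models a read with an explicit string schema: raw directory values are kept untouched.
    static List<String> discoverAsString(List<String> rawValues) {
        return new ArrayList<>(rawValues);
    }

    public static void main(String[] args) {
        List<String> dirs = List.of("01", "2A", "75", "971");
        System.out.println(discoverWithInference(dirs)); // [1, 2A, 75, 971]
        System.out.println(discoverAsString(dirs));      // [01, 2A, 75, 971]
    }
}
```

If this is indeed the cause, repartition() is not to blame: the data files are written correctly and the zero is lost only when the folder names are interpreted on read. Besides supplying a schema, the Spark SQL documentation describes the setting spark.sql.sources.partitionColumnTypeInference.enabled, which, when set to false, makes partition columns be read as strings.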
To work around this, you can supply the schema when reading:
spark.read().schema(spark.read().parquet("test").schema()).parquet("test").show()
+---+---+
|val|key|
+---+---+
|456| ab|
|123| 01|
+---+---+
(Tested in PySpark; hopefully it works in Java as well.)