Parsing strings with Regex in Scala to create objects

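Problem description

I have a list of string inputs that I want to convert into a list of objects using regular expressions. In the code below, for simplicity, I print them to stdout instead of creating objects.

I can handle some of the input strings, but not the whole list. Can someone point out what I am doing wrong?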

  lazy val TIMESTAMP_PATTERN: Regex = """(year|month|day|hour)\(([a-zA-Z_]+)[,]?([a-zA-Z_]*)\)""".r
  lazy val BUCKET_PATTERN: Regex = """(bucket)\((.+)(,)(.+)[,]?(.*)\)""".r

  Seq(
    "year(timestamp)",
    "year(timestamp, _MY_YEAR)",
    "month(timestamp)",
    "month(timestamp, _MY_MONTH)",
    "day(timestamp)",
    "day(timestamp, _MY_DAY)",
    "hour(timestamp)",
    "hour(timestamp, _MY_HOUR)",
    "bucket(id, 32)",
    "bucket(id, 32, _MY_BUCKET)",
  ).foreach { input => input match {
      case TIMESTAMP_PATTERN(transform, sourceColumn, targetColumn) => println(s"$transform ::: $sourceColumn :::- $targetColumn")
      case BUCKET_PATTERN(sourceColumn, numBuckets) => println(s"bucket ::: $sourceColumn ::: $numBuckets")
      case BUCKET_PATTERN(sourceColumn, numBuckets, targetColumn) => println(s"bucket ::: $sourceColumn ::: $numBuckets ::: $targetColumn")
      case z => println(s"Unexpected match: $z")
    }
  }

Output

year ::: timestamp :::- 
Unexpected match: year(timestamp, _MY_YEAR)
month ::: timestamp :::- 
Unexpected match: month(timestamp, _MY_MONTH)
day ::: timestamp :::- 
Unexpected match: day(timestamp, _MY_DAY)
hour ::: timestamp :::- 
Unexpected match: hour(timestamp, _MY_HOUR)
Unexpected match: bucket(id, 32)
Unexpected match: bucket(id, 32, _MY_BUCKET)

Tags: regex, scala

Solution

I made a few fixes to your regular expressions and to the match:

  import scala.util.matching.Regex

  lazy val TIMESTAMP_PATTERN: Regex = """(year|month|day|hour)\((\w+)(?:,\s+)?(\w*)\)""".r
  lazy val BUCKET_PATTERN: Regex = """bucket\((\w+),(?:\s+)?(\w+)(?:,\s+)?(\w*)\)""".r

  Seq(
    "year(timestamp)",
    "year(timestamp, _MY_YEAR)",
    "month(timestamp)",
    "month(timestamp, _MY_MONTH)",
    "day(timestamp)",
    "day(timestamp, _MY_DAY)",
    "hour(timestamp)",
    "hour(timestamp, _MY_HOUR)",
    "bucket(id, 32)",
    "bucket(id, 32, _MY_BUCKET)",
  ).foreach {
    case TIMESTAMP_PATTERN(transform, sourceColumn, "") => println(s"$transform ::: $sourceColumn")
    case TIMESTAMP_PATTERN(transform, sourceColumn, targetColumn) => println(s"$transform ::: $sourceColumn :::- $targetColumn")
    case BUCKET_PATTERN(sourceColumn, numBuckets, "") => println(s"bucket ::: $sourceColumn ::: $numBuckets")
    case BUCKET_PATTERN(sourceColumn, numBuckets, targetColumn) => println(s"bucket ::: $sourceColumn ::: $numBuckets ::: $targetColumn")
    case z => println(s"Unexpected match: $z")
  }

The output is now:

year ::: timestamp
year ::: timestamp :::- _MY_YEAR
month ::: timestamp
month ::: timestamp :::- _MY_MONTH
day ::: timestamp
day ::: timestamp :::- _MY_DAY
hour ::: timestamp
hour ::: timestamp :::- _MY_HOUR
bucket ::: id ::: 32
bucket ::: id ::: 32 ::: _MY_BUCKET

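Here are the changes I made:

  • Added ?: to the groups made up of a comma and whitespace, so that those groups are non-capturing. The comma and whitespace remain optional, but they no longer affect the captured groups.
  • Removed the () around bucket, so it is no longer a capture group.
  • Because the last group is optional and can be empty, changed the cases with fewer items to match that situation explicitly. Note that when the last column is absent, its capture group is the empty string.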
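
Since the original goal was to build objects rather than print, here is a minimal sketch of how the same two patterns could feed case classes. The names TransformParser, Transform, TimeTransform, BucketTransform and parse are hypothetical and do not come from the question; the sketch also assumes that a missing target column should become None and that a non-numeric bucket count should be rejected. A Regex extractor only matches when the number of variables in the case equals the number of capture groups in the pattern, which is why each case below binds exactly three names.

```scala
import scala.util.Try
import scala.util.matching.Regex

object TransformParser {
  // Hypothetical result types, introduced only for illustration.
  sealed trait Transform
  final case class TimeTransform(unit: String, sourceColumn: String, targetColumn: Option[String]) extends Transform
  final case class BucketTransform(sourceColumn: String, numBuckets: Int, targetColumn: Option[String]) extends Transform

  // Same patterns as in the answer: three capture groups each.
  private val TimestampPattern: Regex = """(year|month|day|hour)\((\w+)(?:,\s+)?(\w*)\)""".r
  private val BucketPattern: Regex = """bucket\((\w+),(?:\s+)?(\w+)(?:,\s+)?(\w*)\)""".r

  // The trailing group captures the empty string when no target column is given,
  // so an empty capture is mapped to None.
  private def optional(s: String): Option[String] = Option(s).filter(_.nonEmpty)

  // Returns None for input that matches neither pattern,
  // or whose bucket count is not a valid Int.
  def parse(input: String): Option[Transform] = input match {
    case TimestampPattern(unit, source, target) =>
      Some(TimeTransform(unit, source, optional(target)))
    case BucketPattern(source, buckets, target) =>
      // \w+ also accepts letters and underscores, so the conversion is guarded.
      Try(buckets.toInt).toOption.map(n => BucketTransform(source, n, optional(target)))
    case _ =>
      None
  }
}
```

With this in place, something like inputs.flatMap(TransformParser.parse) would turn the list of strings into a list of objects and silently drop unparseable entries; keeping the Option (or switching to Either with an error message) is probably the better choice if invalid input needs to be reported.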
