regex - REGEXP_EXTRACT with URL in Hive
Problem description
I want to extract the word between '/bla-bla-bla/' and '.a12345' in the URL, which is "this-is-the-word",
using regexp_extract in Hive.
INPUT: www.website.com/bla-bla-bla/this-is-the-word.a12345.anotherword.blabla
DESIRED OUTPUT: this-is-the-word
I've tried the two calls below, but neither worked. What regex will produce the desired output from this input?
regexp_extract(URL,'^.*[/]bla[-]bla[-]bla[/]([a-z]+)\\.(a([0-9]+))*$',1)
regexp_extract(URL,'^.*[/]bla-bla-bla[/]([a-z]*)[.]a([0-9]+)*$',1)
Solution
You may use
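regexp_extract(URL,'^.*/bla-bla-bla/([^/.]+)\\.a[0-9].*$', 1)
The backslash is doubled because Hive treats a backslash inside a string literal as an escape character; the regex engine then receives \. (a literal dot).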
It matches:
- ^ : start of string
- .* : any 0+ chars other than line break chars, as many as possible
- /bla-bla-bla/ : a literal /bla-bla-bla/ substring
- ([^/.]+) : Group 1 (what you will get, since the next argument is 1): 1 or more chars other than / and .
- \.a : a literal .a substring
- [0-9] : a digit
- .*$ : the rest of the string to its end.
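The breakdown above can be checked outside Hive with a small sketch. Python's re module is used here only as a stand-in for the Java regex engine behind Hive; for this particular pattern the two should behave the same. The helper regexp_extract below is hypothetical and written for illustration, and it assumes that a failed match yields an empty string. Because the patterns are Python raw strings, a single backslash is enough, whereas a Hive string literal typically needs it doubled.

```python
import re

# Pattern from the answer, as a Python raw string (single backslash here).
ANSWER = r'^.*/bla-bla-bla/([^/.]+)\.a[0-9].*$'
# First attempt from the question, kept for comparison.
ATTEMPT = r'^.*[/]bla[-]bla[-]bla[/]([a-z]+)\.(a([0-9]+))*$'


def regexp_extract(subject, pattern, index):
    """Hypothetical stand-in for Hive's regexp_extract.

    Assumption: return '' when the pattern does not match.
    """
    m = re.search(pattern, subject)
    return m.group(index) if m else ''


url = 'www.website.com/bla-bla-bla/this-is-the-word.a12345.anotherword.blabla'

print(regexp_extract(url, ANSWER, 1))         # this-is-the-word
print(repr(regexp_extract(url, ATTEMPT, 1)))  # ''
```

The attempt fails for two reasons: [a-z]+ cannot cross the hyphens in "this-is-the-word", and the $ placed right after the digits leaves no room for the trailing ".anotherword.blabla". The answer's pattern avoids both by using [^/.]+ for the word and .*$ for the tail.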