REGEXP_EXTRACT with URL in Hive

Problem description

I want to use regexp_extract in Hive to extract the word between '/bla-bla-bla/' and 'a12345' in a URL. In the example below, that word is "this-is-the-word".

INPUT: www.website.com/bla-bla-bla/this-is-the-word.a12345.anotherword.blabla

DESIRED OUTPUT: this-is-the-word

I've tried the two expressions below, but neither worked. What regex will produce the desired output from this input?

regexp_extract(URL,'^.*[/]bla[-]bla[-]bla[/]([a-z]+)\\.(a([0-9]+))*$',1)
regexp_extract(URL,'^.*[/]bla-bla-bla[/]([a-z]*)[.]a([0-9]+)*$',1)

Tags: regex, hive

Solution


You may use

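regexp_extract(URL,'^.*/bla-bla-bla/([^/.]+)\\.a[0-9].*$', 1)

The backslash is doubled because Hive string literals treat \ as an escape character, so the regex engine receives \. (a literal dot). Writing [.] instead avoids the escaping question altogether.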

It matches

  • ^ - start of string
  • .* - any 0+ chars other than line break chars, as many as possible
  • /bla-bla-bla/ - a literal /bla-bla-bla/ substring
  • ([^/.]+) - Group 1 (what you will get since the next argument is 1): 1 or more chars other than / and .
  • \.a - a .a substring
  • [0-9] - a digit
  • .*$ - the rest of the string to its end.
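
The breakdown above can be checked outside Hive with a small script. This is only a sketch: it uses Python's re module as a stand-in for the Java regex engine behind regexp_extract, on the assumption that the two behave the same for the simple syntax used here. The helper name extract_word is hypothetical, and returning None on no match is a choice made for this sketch, not a statement about what Hive returns.

```python
import re

# Regex as the engine sees it. Inside a Hive string literal the
# backslash would be written twice ('\\.').
PATTERN = re.compile(r"^.*/bla-bla-bla/([^/.]+)\.a[0-9].*$")

# First attempt from the question, kept for comparison.
FIRST_ATTEMPT = re.compile(r"^.*[/]bla[-]bla[-]bla[/]([a-z]+)\.(a([0-9]+))*$")


def extract_word(url):
    """Return capture group 1 of PATTERN, or None when there is no match."""
    match = PATTERN.search(url)
    return match.group(1) if match else None


url = "www.website.com/bla-bla-bla/this-is-the-word.a12345.anotherword.blabla"

print(extract_word(url))          # this-is-the-word
print(FIRST_ATTEMPT.search(url))  # None
```

The first attempt from the question finds no match because [a-z]+ cannot cross the hyphens in "this-is-the-word", and because the pattern expects the string to end right after the a12345 part. The suggested pattern handles both: [^/.]+ accepts hyphens, and the trailing .* consumes the rest of the URL.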
