Oracle SQL: Extract multiple texts between two characters

Problem description

I have a table like the following:

|-------------|---------------------------------------------------------------------------|
|ID.          | CONTENT                                                                   |
|-------------|---------------------------------------------------------------------------|
|1            |<TITLE> <SUB-TITLE-1> Content <SUB-TITLE-2>Content.                        |
|2            |<TITLE> <SUB-TITLE-1> Content <SUB-TITLE-2>Content.                        |
|3            |<TITLE> <SUB-TITLE-1> Content <SUB-TITLE-2>Content. <SUB-TITLE-3> Content  |
|-------------|---------------------------------------------------------------------------|

I want to extract all of the text between < and >, one match per row, so the result would look like this:

|-------------|-------------------------------------------------|
|ID.          | CONTENT                                         |
|-------------|-------------------------------------------------|
|1            |TITLE                                            |
|1            |SUB-TITLE-1                                      |
|1            |SUB-TITLE-2                                      |
|2            |TITLE                                            |
|2            |SUB-TITLE-1                                      |
|2            |SUB-TITLE-2                                      |
|3            |TITLE                                            |
|3            |SUB-TITLE-1                                      |
|3            |SUB-TITLE-2                                      | 
|3            |SUB-TITLE-3                                      |
|-------------|-------------------------------------------------|

How can I do this? I am trying to do it with a regular expression, but I think I'm lost. My Oracle version is 18c, if that helps.

Tags: sql, regex, oracle

Solution


You can use the fourth parameter of REGEXP_SUBSTR to specify which occurrence of the match to return.
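As a minimal sketch of how the occurrence argument behaves, the query below runs the same pattern against a made-up literal string (not the data from the question). The sixth argument, 1, asks for the first capture group, so the angle brackets are left out of the result; when there is no such occurrence, REGEXP_SUBSTR returns NULL.

```sql
-- Illustrative input only: the string '<A> x <B> y' is an assumption, not from the question.
SELECT REGEXP_SUBSTR('<A> x <B> y', '<(.*?)>', 1, 1, NULL, 1) AS first_match,   -- 'A'
       REGEXP_SUBSTR('<A> x <B> y', '<(.*?)>', 1, 2, NULL, 1) AS second_match,  -- 'B'
       REGEXP_SUBSTR('<A> x <B> y', '<(.*?)>', 1, 3, NULL, 1) AS third_match    -- NULL, no third match
FROM dual;
```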

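To get one row each for the 1st, 2nd, and 3rd occurrences (and so on), you can cross join against a subquery on dual that lists the occurrence numbers.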

WITH test_data AS (
    SELECT 1 AS content_id, '<TITLE> <SUB-TITLE-1> Content<SUB-TITLE-2>Content.<A third sub-title>' AS content_data FROM dual UNION
    SELECT 2 AS content_id, '<TITLE> <SUB-TITLE-1> Content<SUB-TITLE-2>Content.' AS content_data FROM dual
)

SELECT t.content_id,
       REGEXP_SUBSTR(t.content_data, '<(.*?)>', 1, s.match_occurrence, 'i', 1) AS content_match
FROM test_data t
CROSS JOIN (
    SELECT 1 AS match_occurrence FROM dual UNION
    SELECT 2 AS match_occurrence FROM dual UNION
    SELECT 3 AS match_occurrence FROM dual UNION
    SELECT 4 AS match_occurrence FROM dual
    /* ... etc, with the number of rows equal to the maximum number of matches that can appear */
) s
WHERE REGEXP_SUBSTR(t.content_data, '<.*?>', 1, s.match_occurrence) IS NOT NULL /* Only return records that have a match for the given occurrence */
ORDER BY t.content_id, s.match_occurrence

Borrowing the CONNECT BY LEVEL technique from Barbaros's excellent answer, you can do this more concisely:

WITH test_data AS (
    SELECT 1 AS content_id, '<TITLE> <SUB-TITLE-1> Content<SUB-TITLE-2>Content.<A third sub-title>' AS content_data FROM dual UNION
    SELECT 2 AS content_id, '<TITLE> <SUB-TITLE-1> Content<SUB-TITLE-2>Content.' AS content_data FROM dual
)

SELECT t.content_id,
       REGEXP_SUBSTR(t.content_data, '<(.*?)>', 1, LEVEL, 'i', 1) AS content_match
FROM test_data t
CONNECT BY 
   LEVEL <= REGEXP_COUNT(t.content_data, '<.*?>')
   AND PRIOR sys_guid() IS NOT NULL
   AND PRIOR content_id = content_id
ORDER BY t.content_id, LEVEL

Note that the CONNECT BY LEVEL approach may be slower on large datasets, so I would avoid it if performance is a concern.
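If you want to avoid both the hard-coded list of occurrence numbers and CONNECT BY, one possible alternative is a recursive subquery factoring clause (a recursive WITH). This is only a sketch against the same test data, and it is not part of the original answer: the names matches and occurrence are made up for illustration, and whether it actually performs better on your data would need to be measured.

```sql
WITH test_data AS (
    SELECT 1 AS content_id, '<TITLE> <SUB-TITLE-1> Content<SUB-TITLE-2>Content.<A third sub-title>' AS content_data FROM dual UNION ALL
    SELECT 2 AS content_id, '<TITLE> <SUB-TITLE-1> Content<SUB-TITLE-2>Content.' AS content_data FROM dual
),
/* Hypothetical helper CTE: one row per match, built one occurrence at a time */
matches (content_id, content_data, occurrence, content_match) AS (
    /* Anchor member: the first match of every row that has at least one match */
    SELECT content_id,
           content_data,
           1,
           REGEXP_SUBSTR(content_data, '<(.*?)>', 1, 1, NULL, 1)
    FROM test_data
    WHERE REGEXP_SUBSTR(content_data, '<.*?>', 1, 1) IS NOT NULL
    UNION ALL
    /* Recursive member: move to the next occurrence while one still exists */
    SELECT content_id,
           content_data,
           occurrence + 1,
           REGEXP_SUBSTR(content_data, '<(.*?)>', 1, occurrence + 1, NULL, 1)
    FROM matches
    WHERE REGEXP_SUBSTR(content_data, '<.*?>', 1, occurrence + 1) IS NOT NULL
)
SELECT content_id, content_match
FROM matches
ORDER BY content_id, occurrence;
```

The recursion stops for each row as soon as REGEXP_SUBSTR finds no further occurrence, so rows with fewer matches do not produce NULL padding rows.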

