首页 > 解决方案 > 如何在 Redshift 中使用两个不同的分隔符从两列中拆分数据?

问题描述

我有一个CTE有这样的数据。它几乎遵循两种格式,counts并且process_ids将有这两种类型的数据。

client_id      day              counts      process_ids
--------------------------------------------------------------------------------------------
abc1          Feb-01-2021        3        C1,C2 | C3,C4,C5 | C6,C7
abc2          Feb-05-2021       2, 3      C10,C11,C12 | C13,C14 # C15,C16 | C17,C18

现在我想CTE在将其拆分后从上面得到下面counts的输出process_ids-

client_id      day              counts      process_ids
--------------------------------------------------------
abc1           Feb-01-2021        3           C1
abc1           Feb-01-2021        3           C2
abc1           Feb-01-2021        3           C3
abc1           Feb-01-2021        3           C4
abc1           Feb-01-2021        3           C5
abc1           Feb-01-2021        3           C6
abc1           Feb-01-2021        3           C7
abc2           Feb-05-2021        2           C10
abc2           Feb-05-2021        2           C11
abc2           Feb-05-2021        2           C12
abc2           Feb-05-2021        2           C13
abc2           Feb-05-2021        2           C14
abc2           Feb-05-2021        3           C15
abc2           Feb-05-2021        3           C16
abc2           Feb-05-2021        3           C17
abc2           Feb-05-2021        3           C18

基本上,如果它们遵循任何这些格式,这个想法是基于以下两个用例进行拆分counts和基础。process_ids

用例 1

如果counts列只有一位数并且process_ids列有|分隔符。

在此处输入图像描述

用例 2

如果counts列只有两位数,由,分隔符分隔,并且process_ids列有#分隔符和pipe.

在此处输入图像描述

我在Amazon Redshift这里工作,我很困惑如何根据需要将它们分开。

这有可能做到吗?

标签: sqlamazon-web-servicescsvamazon-redshiftdelimiter

解决方案


乍一看,这可能看起来有点毛茸茸,但它是由扎实的技术建立起来的,并给出了预期的结果......

SQL

WITH seq_0_9 AS (
  SELECT 0 AS d
  UNION ALL SELECT 1 AS d
  UNION ALL SELECT 2 AS d
  UNION ALL SELECT 3 AS d
  UNION ALL SELECT 4 AS d
  UNION ALL SELECT 5 AS d
  UNION ALL SELECT 6 AS d
  UNION ALL SELECT 7 AS d
  UNION ALL SELECT 8 AS d
  UNION ALL SELECT 9 AS d
),
numbers AS (
  SELECT a.d + b.d * 10 + c.d * 100 + 1 AS n
  FROM seq_0_9 a, seq_0_9 b, seq_0_9 c
),
processed AS
  (SELECT client_id,
          day,
          REPLACE(counts, ' ', '') AS counts,
          REPLACE(REPLACE(process_ids, ' ', ''), '|', ',') AS process_ids
   FROM tbl),
split_pids AS
  (SELECT
     client_id, 
     day,
     counts,
     split_part(process_ids, '#', n) AS process_ids,
     n AS n1
   FROM processed
   CROSS JOIN numbers
   WHERE 
     split_part(process_ids, '#', n) IS NOT NULL
     AND split_part(process_ids, '#', n) != ''),
split_counts AS
  (SELECT
     client_id, 
     day,
     split_part(counts, ',', n) AS counts,
     process_ids,
     n1,
     n AS n2
   FROM split_pids
   CROSS JOIN numbers
   WHERE
     split_part(counts, ',', n) IS NOT NULL
     and split_part(counts, ',', n) != ''),
matched_up AS
  (SELECT * FROM split_counts WHERE n1 = n2)
SELECT
  client_id, 
  day,
  counts,
  split_part(process_ids, ',', n) AS process_ids
FROM
  matched_up
CROSS JOIN
  numbers
WHERE
  split_part(process_ids, ',', n) IS NOT NULL
  AND split_part(process_ids, ',', n) != '';

演示

在线 rextester 演示(使用 PostgreSQL,但应与 Redshift 兼容):https ://rextester.com/FNA16497

简要说明

此技术用于生成数字表(从 1 到 1000 包括在内)。然后将该技术与多个公用表表达式一起使用多次,以在单个 SQL 语句中实现它。


推荐阅读