How to check whether a string in one table contains strings specified in a different table

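Problem description

I have been racking my brain over the following problem. I have two BigQuery tables. One of them contains about 3 million search terms: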

search_term
number of people named joe
how to paint a table black
top 100 pop songs
lovely horses
..

The other is a list of 8,000 "keywords". Each row in the keywords table consists of a few words.

keyword
name joe
horses 
baby kitten
song top 100
..

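For each string in the search terms table, I want to check whether it contains any of the words in the keywords table. The order in which the keyword words appear in the search term does not matter, and there can be other words in between. This is the result table I am looking for: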

search term                    contains_keywords
number of people named joe     TRUE
how to paint a table black     FALSE
top 100 pop songs              TRUE
lovely horses                  TRUE
..  

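I came up with the following code, which checks whether each search term (as a whole) exactly matches a keyword, but I don't know how to split the keywords and then check whether each search term contains those words. This is my code so far; I would greatly appreciate any help or a pointer in the right direction.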

SELECT 
  *
, CASE 
    WHEN search_term IN (
      SELECT
        keyword
      FROM 
        keywords)
    THEN true
    ELSE false
  END AS contains_keyword
FROM search_terms

Edit: I have provided the two sample tables above as code, to make it easier to try your own queries.


WITH 

search_terms AS (
  SELECT 'number of people named joe' AS search_term UNION ALL
  SELECT 'how to paint a table black' AS search_term UNION ALL
  SELECT 'top 100 pop songs'          AS search_term UNION ALL
  SELECT 'lovely horses'              AS search_term

  )

,keywords AS (
  SELECT 'name joe'         AS keyword UNION ALL
  SELECT 'horses lovely'    AS keyword UNION ALL  
  SELECT 'baby kitten'      AS keyword UNION ALL
  SELECT 'song top 100'     AS keyword
  )  
SELECT * FROM search_terms

Tags: sql, google-bigquery

Solution


The following is for BigQuery Standard SQL.

A regex pattern built from 8000+ words can be quite a resource hog, so the query below is a workaround that compares individual words instead.

#standardSQL
SELECT search_term,
  -- TRUE when at least one word of the search term appears in the keyword word list
  ( SELECT COUNT(1)
    FROM UNNEST(SPLIT(search_term, ' ')) word
    JOIN UNNEST(keywords) word
    USING(word)
  ) > 0 AS contains_keyword
FROM `project.dataset.search_terms`,
-- Build the distinct list of individual keyword words once, as a single array
UNNEST([STRUCT(ARRAY(
  SELECT DISTINCT keyword
  FROM `project.dataset.keywords`, UNNEST(SPLIT(keyword, ' ')) keyword
) AS keywords)])
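
To try the same idea without creating tables, here is a minimal self-contained sketch that reuses the sample CTEs from the question. The CTE name `keyword_words` and the aliases `words` and `term_word` are names introduced here for illustration. It uses `EXISTS` with `IN UNNEST` in place of the `COUNT(1) > 0` join, which is a stylistic variant rather than the query above verbatim.

```sql
#standardSQL
WITH search_terms AS (
  SELECT 'number of people named joe' AS search_term UNION ALL
  SELECT 'how to paint a table black' UNION ALL
  SELECT 'top 100 pop songs' UNION ALL
  SELECT 'lovely horses'
),
keywords AS (
  SELECT 'name joe' AS keyword UNION ALL
  SELECT 'horses lovely' UNION ALL
  SELECT 'baby kitten' UNION ALL
  SELECT 'song top 100'
),
-- Hypothetical helper CTE: one row holding every distinct keyword word as an array
keyword_words AS (
  SELECT ARRAY_AGG(DISTINCT word) AS words
  FROM keywords, UNNEST(SPLIT(keyword, ' ')) AS word
)
SELECT
  search_term,
  -- TRUE if any space-separated word of the search term is in the keyword word array
  EXISTS (
    SELECT 1
    FROM UNNEST(SPLIT(search_term, ' ')) AS term_word
    WHERE term_word IN UNNEST(words)
  ) AS contains_keyword
FROM search_terms
CROSS JOIN keyword_words
```

Matching here is on exact, case-sensitive tokens split on single spaces, so `joe` matches but `named` does not match `name`. On this sample data that should give TRUE, FALSE, TRUE, TRUE for the four search terms, in line with the result table in the question.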

When applied to the sample data from your question, the output is:

[Image: query output]
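
For comparison, the regex approach that this answer advises against might look roughly like the sketch below. It assumes the keyword words contain only letters and digits (no regex metacharacters, since nothing is escaped), and the CTE name `keyword_pattern` is hypothetical. With 8000+ words the generated pattern becomes very large, which is the cost concern raised above.

```sql
#standardSQL
WITH keyword_pattern AS (
  -- One alternation over every distinct keyword word, e.g. \b(?:name|joe|horses)\b
  SELECT CONCAT(r'\b(?:', STRING_AGG(DISTINCT word, '|'), r')\b') AS pattern
  FROM `project.dataset.keywords`, UNNEST(SPLIT(keyword, ' ')) AS word
  WHERE word != ''  -- an empty alternative would match every search term
)
SELECT
  search_term,
  REGEXP_CONTAINS(search_term, pattern) AS contains_keyword
FROM `project.dataset.search_terms`
CROSS JOIN keyword_pattern
```

The two approaches are not strictly equivalent: `\b` treats punctuation as a word boundary, so a search term such as `joe's car` would match `joe` here but not in the split-on-space version.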

