首页 > 解决方案 > Hive 删除重复类型的关系

问题描述

我是一个新的 SQL 用户,我需要你的帮助!我正在基于 SQL 数据库进行关系图分析,这是我可以利用的唯一资源,但无法保持两个人之间不同类型的关系

我有这张桌子:

+---------+----------------+--------+----------------+-----------+
| left_id | left_rel       | common | right_rel      | right_id  |
+---------+----------------+--------+----------------+-----------+
| Tom     | graduated_from | MIT    | worked_in      | Jerry     |
| Tom     | graduated_from | MIT    | graduated_from | Lee       |
| Lee     | graduated_from | MIT    | graduated_from | Tom       |
| Tom     | graduated_from | MIT    | graduated_from | Jerry     |
+---------+----------------+--------+----------------+-----------+

我想要一张表,其中两个人之间的相同关系被简化为一个单一的关系。所以,汤姆和李都毕业于麻省理工学院的关系有两条记录,我想删除那些重复的。我可以接受让 Tom 作为 Left_id 或 Right_id,前提是他的 id 和 rel 映射

希望这段创建表代码可以为您节省一些时间。

CREATE TABLE tmp (
      left_id varchar(20)
    , left_rel varchar(50)
    , common varchar(50)
    , right_rel varchar(50)
    , right_id varchar(20)
);
INSERT INTO tmp VALUES ("Tom", "graduated_from", "MIT", "worked_in", "Jerry");
INSERT INTO tmp VALUES ("Tom", "graduated_from", "MIT", "graduated_from", "Lee");
INSERT INTO tmp VALUES ("Lee", "graduated_from", "MIT", "graduated_from", "Tom");
INSERT INTO tmp VALUES ("Tom", "graduated_from", "MIT", "graduated_from", "Jerry");

到目前为止我尝试过的

CREATE TEMPORARY TABLE tmp1
AS 
SELECT 
        greatest(CONCAT_WS(':', left_id, left_rel), CONCAT_WS(':', right_id, right_rel)) AS Left_Part
      , Common
      , least(CONCAT_WS(':', left_id, left_rel), CONCAT_WS(':', right_id, right_rel)) AS Right_Part
  FROM tmp
;

CREATE TEMPORARY TABLE tmp2
AS 
SELECT 
        T1.Left_Part
      , T1.Common
      , T1.Right_Part
  FROM ( SELECT 
                 *
               , ROW_NUMBER() OVER(DISTRIBUTE BY Left_Part, Right_Part) AS rn
           FROM tmp1
        ) T1
 WHERE T1.rn = 1
;

CREATE TEMPORARY TABLE result
AS 
SELECT 
        SPLIT(Left_Part,':')[0] as left_id
      , SPLIT(Left_Part,':')[1] as left_rel
      , Common
      , SPLIT(Right_Part,':')[0] as right_id
      , SPLIT(Right_Part,':')[1] as right_rel
  FROM tmp2
;

给出结果

+---------+----------------+--------+----------------+-----------+
| left_id | left_rel       | common | right_rel      | right_id  |
+---------+----------------+--------+----------------+-----------+
| Tom     | graduated_from | MIT    | worked_in      | Jerry     |
| Tom     | graduated_from | MIT    | graduated_from | Lee       |
| Tom     | graduated_from | MIT    | graduated_from | Jerry     |
+---------+----------------+--------+----------------+-----------+

标签: sqlhive

解决方案


看来你需要:

常见的

id  name
1   MIT

关系类型

id rel_type
1  graduated_from 
2  worked_in

id  name
1   Tom
2   Lee
3   Jerry

关系

TypeId  PersonId CommonId
1       1        1
2       1        1
3       2        1

数据正在检索

SELECT p1.name, rt1.rel_type, c.name, rt2.rel_type, p2.name
FROM Relation r1
JOIN Relation r2 ON r1.CommonId = r2.CommonId 
                AND r1.PersonId != r2.PersonId  -- or maybe '<'
JOIN Common c ON r1.CommonId = c.id
JOIN RelationType rt1 ON r1.TypeId = rt1.id
JOIN RelationType rt2 ON r2.TypeId = rt2.id
JOIN Person p1 ON r1.PersonId = p1.id
JOIN Person p1 ON r2.PersonId = p2.id


推荐阅读