apache-pig - IN operator in Apache Pig
Question
Does Apache Pig have an equivalent of the SQL IN operator? I am currently using Apache Pig 0.10.0.
I would like to do something like this:
select count(distinct(o.order_id)), count(od.prod_id), count(od.prod_id)/count(distinct(o.order_id))
from orders o
inner join order_details od
    on od.order_id = o.order_id
where o.order_id in (
    select *
    from (
        select o.order_id
        from orders o
        inner join order_details od
            on od.order_id = o.order_id
        where (o.order_date between '2013-05-01' and '2013-05-31')
            and (od.prod_id = 1274348)
    ) as subq
);
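Solution
The following may be an equivalent script in Pig. You can create as many intermediate relations as you need to narrow the data down to only what is required before generating the counts. Note that I treat the dates as timestamps; you could instead use the built-in ToDate UDF, which converts a UNIX timestamp, or a date given as a chararray, into the native Pig DateTime type.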
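As a hedged illustration of the date handling, here is a minimal sketch for the case where order_date is stored as a date-only 'yyyy-MM-dd' string rather than a numeric timestamp. That storage format is an assumption, since the question does not say how the dates are stored, and the relation names o_str, o_may and o_may_dt are made up for this example. As far as I know, ToDate and the DateTime type only arrived in Pig 0.11, so on 0.10.0 only the plain string comparison in the FILTER would be available.

```pig
-- Assumption: order_date is a date-only string such as 2013-05-17
o_str = LOAD 'orders' USING PigStorage() AS (
    order_date:chararray,
    order_id:chararray
);

-- Dates in yyyy-MM-dd form sort lexicographically, so a plain chararray
-- comparison reproduces BETWEEN '2013-05-01' AND '2013-05-31'
o_may = FILTER o_str BY order_date >= '2013-05-01' AND order_date <= '2013-05-31';

-- Pig 0.11+ only: convert to the native datetime type for later date arithmetic
o_may_dt = FOREACH o_may GENERATE
    ToDate(order_date, 'yyyy-MM-dd') AS order_dt,
    order_id;
```

If the strings also carry a time of day, the upper bound would need adjusting, because '2013-05-31 10:00:00' sorts after '2013-05-31'.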
-- Load in all of your data
-- Replace with actual paths
-- You may need to supply a delimiter value
o = LOAD 'orders' USING PigStorage() AS (
    order_date:long,
    order_id:chararray
);
od = LOAD 'order_details' USING PigStorage() AS (
    order_id:chararray,
    prod_id:chararray
);
-- Filter like WHERE in SQL
-- Replace 1000 and 2000 with actual timestamps
o_filtered = FILTER o BY order_date <= 2000 AND order_date >= 1000;
od_filtered = FILTER od BY prod_id == '1274348';
-- Inner join - only needed once in Pig
subq = JOIN o_filtered BY order_id, od_filtered BY order_id;
-- Drop fields not needed for final counts
subq_renamed = FOREACH subq GENERATE
    o_filtered::order_id AS order_id,
    od_filtered::prod_id AS prod_id;
-- To do the counts, need to group the data
subq_counts = FOREACH (GROUP subq_renamed ALL) {
    dist_order_id = DISTINCT subq_renamed.order_id;
    GENERATE
        COUNT(dist_order_id) AS dist_order_id_count,
        COUNT(subq_renamed.prod_id) AS prod_id_count;
}
-- Calculate the ratio count(od.prod_id)/count(distinct(o.order_id))
final_counts = FOREACH subq_counts GENERATE *,
    (float)prod_id_count/dist_order_id_count AS prod_order_ratio;
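One caveat about the script above: because it joins against od_filtered, it counts only the order_details rows for product 1274348, whereas the SQL in the question counts every order_details row belonging to the orders selected by the IN subquery. If the SQL behaviour is what you want, the IN can be emulated with a join against a distinct list of keys. The sketch below reuses o_filtered, od_filtered and od from the script above; the names matched, matched_ids, in_list, od_in, od_in_renamed, in_counts and in_final are made up for illustration, and it assumes order_id is unique within orders. Later Pig releases (0.12 onward, as far as I know) do add an IN operator, but it takes a literal list of values rather than a subquery, so a join would still be needed for this case.

```pig
-- The "subquery": ids of the filtered orders that contain product 1274348
matched = JOIN o_filtered BY order_id, od_filtered BY order_id;
matched_ids = FOREACH matched GENERATE o_filtered::order_id AS order_id;
in_list = DISTINCT matched_ids;

-- The "IN": keep every order_details row whose order_id appears in in_list
od_in = JOIN od BY order_id, in_list BY order_id;
od_in_renamed = FOREACH od_in GENERATE
    od::order_id AS order_id,
    od::prod_id AS prod_id;

-- Same counting pattern as above, now over all rows of the matching orders
in_counts = FOREACH (GROUP od_in_renamed ALL) {
    dist_order_id = DISTINCT od_in_renamed.order_id;
    GENERATE
        COUNT(dist_order_id) AS dist_order_id_count,
        COUNT(od_in_renamed.prod_id) AS prod_id_count;
}
in_final = FOREACH in_counts GENERATE *,
    (float)prod_id_count/dist_order_id_count AS prod_order_ratio;
```

The DISTINCT step matters: without it, an order that lists the product on several rows would appear several times in in_list and inflate the counts after the second join.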