Create a flag which is updated on the basis of order date in Hive

Problem description

I want to create a flag column "order_type" that is set according to the date difference between a customer's consecutive orders. As the example input and output data below show, my requirement is that if a customer makes a purchase 12 months after their last purchase, they should be counted as a first-time purchaser again.

Input data:

+-------------+----------+------------+
| customer_id | order_id | order_date |
+-------------+----------+------------+
| 1234        | 1        | 2017-07-06 |
| 1234        | 2        | 2018-09-17 |
| 1234        | 3        | 2018-09-20 |
| 1234        | 4        | 2019-05-16 |
| 1234        | 5        | 2020-09-15 |
+-------------+----------+------------+

Output data:

+-------------+----------+------------+----------------------+
| customer_id | order_id | order_date | order_type           |
+-------------+----------+------------+----------------------+
| 1234        | 1        | 2017-07-06 | First purchase       |
| 1234        | 2        | 2018-09-17 | First purchase       |
| 1234        | 3        | 2018-09-20 | Second purchase      |
| 1234        | 4        | 2019-05-16 | Second plus purchase |
| 1234        | 5        | 2020-09-15 | First purchase       |
+-------------+----------+------------+----------------------+

That is:

2017-07-06 (first order made): First purchase.
2018-09-17 (second order made): First purchase, since the date difference between the first and second orders is over 12 months.
2018-09-20 (third order made): Second purchase, since the date difference between the second and third orders is less than 12 months.
2019-05-16 (fourth order made): Second plus purchase, since the date difference between the third and fourth orders is less than 12 months.
2020-09-15 (last order made): First purchase, since the date difference is over 12 months.

Tags: sql, hive, apache-spark-sql, databricks

Solution


Use lag() to get each customer's previous order_date. Then use a cumulative sum over a flag that marks a gap of 12 months or more to define the groups, and row_number() to enumerate the orders within each group:

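-- Outer query: number the orders within each (customer, group).
select t.*,
       row_number() over (partition by customer_id, grp order by order_date) as order_type
from (
      -- Middle query: running count of group starts. An order starts a new group
      -- when there is no previous order or the previous order is 12 or more months older.
      select t.*,
             sum(case when prev_order_date > add_months(order_date, -12)
                      then 0 else 1
                 end) over (partition by customer_id order by order_date) as grp
      from (
            -- Inner query: attach the previous order date (null for the first order).
            select t.*,
                   lag(order_date) over (partition by customer_id order by order_date) as prev_order_date
            from t
           ) t
      ) t;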

This returns order_type as a number (1, 2, 3, ...) rather than converting it to an English phrase.
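
If the English labels from the question are needed, one possible way is to map the row number through a case expression. The following is only a sketch under some assumptions: the CTE names (orders, with_prev, with_grp, numbered) and the seq column are hypothetical, the sample rows from the question are inlined in place of the real table t, and the date literals and from-less selects assume a reasonably recent Hive or Spark SQL version.

```sql
-- Sample rows from the question, inlined so the query can run on its own.
-- Assumption: in practice this CTE would be replaced by the real orders table.
with orders as (
    select 1234 as customer_id, 1 as order_id, date '2017-07-06' as order_date union all
    select 1234, 2, date '2018-09-17' union all
    select 1234, 3, date '2018-09-20' union all
    select 1234, 4, date '2019-05-16' union all
    select 1234, 5, date '2020-09-15'
),
-- Attach the previous order date per customer (null for the first order).
with_prev as (
    select o.*,
           lag(order_date) over (partition by customer_id order by order_date) as prev_order_date
    from orders o
),
-- Running count of group starts: a new group begins when there is no previous
-- order or the previous order is 12 or more months older.
with_grp as (
    select p.*,
           sum(case when prev_order_date > add_months(order_date, -12)
                    then 0 else 1
               end) over (partition by customer_id order by order_date) as grp
    from with_prev p
),
-- Position of each order within its group (hypothetical column name: seq).
numbered as (
    select g.*,
           row_number() over (partition by customer_id, grp order by order_date) as seq
    from with_grp g
)
-- Translate the position into the labels used in the question.
select customer_id,
       order_id,
       order_date,
       case seq
            when 1 then 'First purchase'
            when 2 then 'Second purchase'
            else 'Second plus purchase'
       end as order_type
from numbered
order by customer_id, order_date;
```

On the five sample rows this should give group starts at orders 1, 2 and 5, so the labels should match the expected output in the question. Note that the comparison treats a gap of exactly 12 months as the start of a new group; if that boundary should count as a repeat purchase instead, the > would need to become >=.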

