How can I do gang (group) analysis in HIVE?


For example, I have the following data:
| Caller | Callee | Contact time |
| A | B | 2022-01-05 00:00:00 |
| A | C | 2022-01-04 00:00:00 |
| C | D | 2022-01-03 00:01:00 |
| E | F | 2022-01-02 00:00:00 |
| F | T | 2022-01-01 00:01:00 |
| T | F | 2022-01-01 00:00:00 |

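The gang analysis rule is as follows:
If A and B have been in contact, A and B are considered members of the same gang, and the same reasoning applies along a chain of contacts. The result I want is:
A, B, C, D form gang 1;
E, F, T form gang 2.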
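
The rule above amounts to finding the connected components of the call graph. Here is a minimal sketch of that logic in plain Python using union-find. It is not HiveQL and not necessarily how you would run it at scale; the names `find` and `group_gangs` are made up for illustration, and the timestamp column is ignored because the rule does not use it.

```python
from collections import defaultdict

# Call records from the question as (caller, callee) pairs.
calls = [("A", "B"), ("A", "C"), ("C", "D"), ("E", "F"), ("F", "T"), ("T", "F")]


def find(parent, x):
    """Return the representative of x's group, shortening the path on the way."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x


def group_gangs(edges):
    """Group ids into gangs: two ids share a gang if a chain of calls links them."""
    parent = {}
    for a, b in edges:
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            # Keep the smaller id as the representative of the merged group.
            parent[max(ra, rb)] = min(ra, rb)
    groups = defaultdict(list)
    for node in parent:
        groups[find(parent, node)].append(node)
    return sorted(sorted(members) for members in groups.values())


gangs = group_gangs(calls)
for i, members in enumerate(gangs, start=1):
    print(f"gang {i}: {','.join(members)}")
```

On the sample data this prints `gang 1: A,B,C,D` and `gang 2: E,F,T`. Inside Hive itself the same idea is usually expressed as repeated self-joins that propagate the smallest id until nothing changes, which this sketch does not attempt.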

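Without any parameter tuning, a Group By runs in two phases, Map and Reduce. The Map phase scans the table and emits key-value pairs keyed on the grouping columns; the Reduce phase then aggregates the values that share the same key. With hive.map.aggr=true, as in the session below, the map side also pre-aggregates in a hash table before the shuffle. The execution plan looks like this: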

hive> set hive.map.aggr=true;
hive> explain select user_id, item_id, count(*) from tmp.log_user_behavior where date='20200831' group by user_id, item_id;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: log_user_behavior
            Statistics: Num rows: 34021 Data size: 6804393 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: user_id (type: string), item_id (type: string)
              outputColumnNames: user_id, item_id
              Statistics: Num rows: 34021 Data size: 6804393 Basic stats: COMPLETE Column stats: NONE
              Group By Operator   -- partial aggregation on the map side
                aggregations: count()
                keys: user_id (type: string), item_id (type: string)
                mode: hash    -- in-memory hash aggregation
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 34021 Data size: 6804393 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator  -- output of the map side
                  key expressions: _col0 (type: string), _col1 (type: string)
                  sort order: ++
                  Map-reduce partition columns: _col0 (type: string), _col1 (type: string)  -- rows are partitioned on both columns
                  Statistics: Num rows: 34021 Data size: 6804393 Basic stats: COMPLETE Column stats: NONE
                  value expressions: _col2 (type: bigint)
      Execution mode: vectorized
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(VALUE._col0)
          keys: KEY._col0 (type: string), KEY._col1 (type: string)
          mode: mergepartial   -- merges the partial aggregates from the map side
          outputColumnNames: _col0, _col1, _col2
          Statistics: Num rows: 17010 Data size: 3402096 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 17010 Data size: 3402096 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
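
To make the two phases in the plan concrete, here is a rough sketch in plain Python of a count grouped by (user_id, item_id): a hash-based partial aggregation per input split, a shuffle that routes each key to one reducer, and a merge of the partial counts. It only illustrates the idea and is not Hive's implementation; the function names and the sample rows are invented.

```python
from collections import Counter, defaultdict


def map_side_aggregate(rows):
    """Map phase: hash-aggregate one input split into partial counts per key."""
    return Counter((user_id, item_id) for user_id, item_id in rows)


def shuffle(partials, num_reducers):
    """Route every key's partial counts to a single reducer by hashing the key."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for partial in partials:
        for key, count in partial.items():
            buckets[hash(key) % num_reducers][key].append(count)
    return buckets


def reduce_side_merge(bucket):
    """Reduce phase: merge the partial counts of each key into the final count."""
    return {key: sum(counts) for key, counts in bucket.items()}


# Two hypothetical input splits of (user_id, item_id) rows.
splits = [
    [("u1", "i1"), ("u1", "i1"), ("u2", "i1")],
    [("u1", "i1"), ("u2", "i2")],
]
partials = [map_side_aggregate(split) for split in splits]
result = {}
for bucket in shuffle(partials, num_reducers=2):
    result.update(reduce_side_merge(bucket))
print(sorted(result.items()))
```

Because all partial counts for a key land on the same reducer, the final counts do not depend on how the rows were split across mappers.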


This is the classic common-friends problem; see https://blog.csdn.net/caojian107/article/details/106413573/