How can I do group (gang) analysis in Hive?
For example, I have the following data:
| Caller | Callee | Call time |
| --- | --- | --- |
| A | B | 2022-01-05 00:00:00 |
| A | C | 2022-01-04 00:00:00 |
| C | D | 2022-01-03 00:01:00 |
| E | F | 2022-01-02 00:00:00 |
| F | T | 2022-01-01 00:01:00 |
| T | F | 2022-01-01 00:00:00 |
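The grouping rule is as follows:
If A and B have been in contact, A and B are treated as members of the same group. Applying the same rule transitively, the expected result is:
A, B, C, D form group 1;
E, F, T form group 2.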
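With no parameter tuning, a Group By runs as a Map phase and a Reduce phase: the Map phase scans the table and emits K-V pairs keyed on the columns being grouped, and the Reduce phase runs the aggregation over rows that share the same key. Because hive.map.aggr is true by default, the Map side also pre-aggregates. The execution plan looks like this: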
hive> set hive.map.aggr=true;
hive> explain select user_id, item_id, count(*) from tmp.log_user_behavior where date='20200831' group by user_id, item_id;
OK
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: log_user_behavior
Statistics: Num rows: 34021 Data size: 6804393 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: user_id (type: string), item_id (type: string)
outputColumnNames: user_id, item_id
Statistics: Num rows: 34021 Data size: 6804393 Basic stats: COMPLETE Column stats: NONE
Group By Operator -- partial aggregation on the Map side
aggregations: count()
keys: user_id (type: string), item_id (type: string)
mode: hash -- hash
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 34021 Data size: 6804393 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator -- Map-side output
key expressions: _col0 (type: string), _col1 (type: string)
sort order: ++
Map-reduce partition columns: _col0 (type: string), _col1 (type: string) -- partitioned on both columns
Statistics: Num rows: 34021 Data size: 6804393 Basic stats: COMPLETE Column stats: NONE
value expressions: _col2 (type: bigint)
Execution mode: vectorized
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
keys: KEY._col0 (type: string), KEY._col1 (type: string)
mode: mergepartial -- merges the partial aggregates from the Map side
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 17010 Data size: 3402096 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 17010 Data size: 3402096 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
This is a classic common-friends problem; for reference, see https://blog.csdn.net/caojian107/article/details/106413573/
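The grouping in the question can also be read as finding connected components of an undirected call graph. Below is a minimal HiveQL sketch of one way to approach that, minimum-label propagation; it is not the method from the linked post. Everything named here is an assumption for illustration: the source table `call_log(caller, callee, call_time)` and the working tables `edges`, `labels_cur` and `labels_next` are hypothetical. Hive has no recursive queries, so the propagation round has to be repeated by an outside driver (a shell script or a scheduler) until no label changes.

```sql
-- Hypothetical input: call_log(caller STRING, callee STRING, call_time STRING)

-- 1. Undirected edge list. Self-loops are added so that every node
--    also "sees" its own label in the propagation join below.
CREATE TABLE edges AS
SELECT DISTINCT src, dst
FROM (
  SELECT caller AS src, callee AS dst FROM call_log
  UNION ALL
  SELECT callee AS src, caller AS dst FROM call_log
  UNION ALL
  SELECT caller AS src, caller AS dst FROM call_log
  UNION ALL
  SELECT callee AS src, callee AS dst FROM call_log
) t;

-- 2. Initial labels: every node starts as its own group.
CREATE TABLE labels_cur AS
SELECT DISTINCT src AS node, src AS label
FROM edges;

-- 3. One propagation round: each node takes the smallest label
--    found among itself and its direct neighbours.
DROP TABLE IF EXISTS labels_next;
CREATE TABLE labels_next AS
SELECT e.src AS node, MIN(l.label) AS label
FROM edges e
JOIN labels_cur l ON e.dst = l.node
GROUP BY e.src;

-- 4. Convergence check: stop when this returns 0.
SELECT COUNT(*) AS changed
FROM labels_next n
JOIN labels_cur c ON n.node = c.node
WHERE n.label <> c.label;

-- 5. If changed > 0, promote the new labels and rerun steps 3 and 4.
INSERT OVERWRITE TABLE labels_cur
SELECT node, label FROM labels_next;

-- 6. After convergence, nodes sharing a label belong to one group.
SELECT label AS group_id, collect_set(node) AS members
FROM labels_cur
GROUP BY label;
```

On the sample rows from the question, the labels should settle after two rounds, with A, B, C, D carrying the label A and E, F, T carrying the label E. The number of rounds grows with the longest chain of contacts inside a group, so for very large or very stringy graphs a graph engine may be a better fit than repeated Hive jobs.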