There is a Hive table records that stores vehicle passage information:
create table records(
  id string comment 'record id',
  indate string comment 'passage timestamp',
  plate_no string comment 'license plate number',
  device_id int comment 'id of the device the vehicle passed'
)
partitioned by (month string, day string)
row format delimited fields terminated by '\t'
stored as orc;
1. Using HQL, find the top 10 license plates (and their counts) that appear most often at night (22:00-06:00) within the last month.
2. Using a Spark RDD, deduplicate the table above so that only one record is kept per distinct combination of indate, plate_no, and device_id.
3. Given String A and String B, each made up of 4096 random 0s and 1s, e.g. String A = "01010111001....110010" and String B = "10101110....100101", and the formula double C = (sum of the bitwise AND of A and B) / (count of 1s in A * count of 1s in B), implement a method that computes this formula in Java or Scala.
HQL for the top 10 license plates with the most night-time (22:00-06:00) records within the last month:
SELECT plate_no, COUNT(*) AS cnt
FROM records
-- Parentheses are required here: AND binds tighter than OR
WHERE (hour(indate) >= 22 OR hour(indate) < 6)
  -- "last month" as the past 30 days, assuming indate is a
  -- 'yyyy-MM-dd HH:mm:ss' timestamp; filtering on the month/day
  -- partition columns instead would let Hive prune the scan
  AND to_date(indate) >= date_sub(current_date(), 30)
GROUP BY plate_no
ORDER BY cnt DESC
LIMIT 10;
Spark RDD dedup that keeps one record per distinct (indate, plate_no, device_id):
// Read the raw table data (tab-delimited: id, indate, plate_no, device_id)
val records = sc.textFile("path_to_records_file")

val result = records
  .map { line =>
    val fields = line.split("\t")
    // Key on (indate, plate_no, device_id); a tuple key avoids the
    // collisions a "_"-joined string key could produce
    ((fields(1), fields(2), fields(3)), line)
  }
  .reduceByKey((a, _) => a) // keep the first record seen for each duplicate key
  .map(_._2)

result.saveAsTextFile("path_to_output_file")
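The question asks for an RDD solution; for reference, the same dedup is shorter through the DataFrame API. A minimal sketch, assuming the job reads the table through a Hive-enabled SparkSession (the spark variable and the records_dedup output table name are hypothetical):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dedup-records")
  .enableHiveSupport()
  .getOrCreate()

// dropDuplicates keeps one row per distinct (indate, plate_no, device_id)
spark.table("records")
  .dropDuplicates("indate", "plate_no", "device_id")
  .write
  .mode("overwrite")
  .saveAsTable("records_dedup") // hypothetical output table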
Java implementation of the formula for C:
public double calculateC(String a, String b) {
    int aCount = 0;   // number of 1s in A
    int bCount = 0;   // number of 1s in B
    int andCount = 0; // sum of the bitwise AND of A and B
    // Both strings are 4096 characters per the problem statement
    for (int i = 0; i < a.length(); i++) {
        if (a.charAt(i) == '1') aCount++;
        if (b.charAt(i) == '1') bCount++;
        if (a.charAt(i) == '1' && b.charAt(i) == '1') andCount++;
    }
    // Guard against division by zero when either string contains no 1s
    if (aCount == 0 || bCount == 0) return 0.0;
    return (double) andCount / ((long) aCount * bCount);
}
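The question accepts Java or Scala; for completeness, here is an equivalent Scala method with the same division-by-zero guard:

def calculateC(a: String, b: String): Double = {
  require(a.length == b.length, "A and B must be the same length")
  val aCount = a.count(_ == '1') // number of 1s in A
  val bCount = b.count(_ == '1') // number of 1s in B
  // Sum of the bitwise AND: positions where both strings have '1'
  val andCount = a.zip(b).count { case (x, y) => x == '1' && y == '1' }
  if (aCount == 0 || bCount == 0) 0.0 // avoid division by zero
  else andCount.toDouble / (aCount.toLong * bCount)
}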