I have a table with two columns: `type` and `time`. For each record, I want to find how many times the same `type` appeared in the 24 hours before that record.
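To pin down the expected semantics before reaching for a framework, here is a plain-Python sketch on hypothetical toy data. It treats "the past 24 hours" as the 24-hour window ending at each record's own timestamp, counting the record itself:

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical toy data: (type, time)
records = [
    ("a", datetime(2022, 12, 8, 10, 0)),
    ("a", datetime(2022, 12, 8, 20, 0)),
    ("a", datetime(2022, 12, 9, 12, 0)),
    ("b", datetime(2022, 12, 9, 12, 0)),
]

# Keep timestamps per type sorted, so the window edges can be binary-searched
times_by_type = defaultdict(list)
for typ, ts in sorted(records, key=lambda r: r[1]):
    times_by_type[typ].append(ts)

def count_past_24h(typ, ts):
    """Count records of `typ` in the 24 hours up to and including `ts`."""
    times = times_by_type[typ]
    lo = bisect_left(times, ts - timedelta(hours=24))
    hi = bisect_right(times, ts)
    return hi - lo

# ("a", Dec 8 20:00) sees the Dec 8 10:00 record plus itself -> 2
# ("a", Dec 9 12:00) sees the Dec 8 20:00 record plus itself -> 2
```

Whether the record itself is counted (and whether the window is open or closed at its left edge) is a convention you may want to adjust.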
With PySpark, one implementation looks like this:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, unix_timestamp
from pyspark.sql.window import Window

# Create a SparkSession
spark = SparkSession.builder \
    .appName("CountTypeInPast24Hours") \
    .getOrCreate()

# Read the table data
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("/path/to/table.csv")

# Convert time to epoch seconds so a range-based window can be used
df = df.withColumn("ts", unix_timestamp(col("time")))

# For each record, count rows of the same type whose time falls within
# the 24 hours up to and including that record's own time
w = Window.partitionBy("type").orderBy("ts").rangeBetween(-24 * 60 * 60, 0)
df = df.withColumn("count_in_24_hours", count("*").over(w))

# Show the result
df.show()
```
The Scala version:
```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, unix_timestamp}

// Convert time to epoch seconds so a range-based window can be used
val withTs = inputData.withColumn("ts", unix_timestamp(col("time")))

// For each record, count same-type rows in the 24 hours up to and
// including that record's own time
val w = Window.partitionBy("type").orderBy("ts").rangeBetween(-24L * 3600, 0)
val result = withTs.withColumn("count_in_24_hours", count("*").over(w))
```
Like this?
```sql
with t1 as (select 1 type , '20221208 12:01:35' time union
select 1 type , '20221208 12:45:36' time union
select 1 type , '20221208 11:09:32' time union
select 1 type , '20221208 10:36:36' time union
select 1 type , '20221209 18:12:36' time union
select 2 type , '20221209 11:58:36' time union
select 2 type , '20221209 08:45:36' time union
select 2 type , '20221209 00:22:36' time union
select 3 type , '20221210 19:33:36' time union
select 3 type , '20221210 18:59:36' time ) -- test data
select type, substr(time,1,8) "day", max(ro) "occurrences"
from (select type, time, row_number() over (partition by substr(time,1,8), type order by time) ro
      from t1) t1
group by type, substr(time,1,8)
```
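Note that this query groups by calendar day (`substr(time,1,8)`) rather than by a rolling 24-hour window ending at each record, so it answers a slightly different question. A quick plain-Python check (a sketch, not part of the original answer) of what the day-grouping computes on the same test rows:

```python
from collections import Counter

# The test rows from the SQL CTE above: (type, 'YYYYMMDD HH:MM:SS')
rows = [
    (1, "20221208 12:01:35"), (1, "20221208 12:45:36"),
    (1, "20221208 11:09:32"), (1, "20221208 10:36:36"),
    (1, "20221209 18:12:36"),
    (2, "20221209 11:58:36"), (2, "20221209 08:45:36"),
    (2, "20221209 00:22:36"),
    (3, "20221210 19:33:36"), (3, "20221210 18:59:36"),
]

# Group by (type, calendar day) -- what the SQL's substr(time,1,8) does
daily = Counter((typ, ts[:8]) for typ, ts in rows)

# e.g. type 1 has 4 rows on 20221208 but only 1 on 20221209
```

Under day-grouping, a record at 00:30 ignores same-type records from 23:00 the previous day, which a per-record rolling window would include.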