I have a query which, using two JOINs, returns a list in this format:
unique_id | non_unique_id | timestamp
The full list is big (thousands of rows), but the result of the query is only a few dozen rows, as the query has WHERE timestamp >= 'some timestamp in the past'.
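For reference, the query is roughly of this shape (table names and join conditions here are purely made up; the real query is more involved):

-- Purely illustrative; the real tables and JOIN conditions are different.
SELECT a.unique_id, b.non_unique_id, a.timestamp
FROM   table_a AS a
       JOIN table_b AS b ON b.a_id = a.id
       JOIN table_c AS c ON c.b_id = b.id
WHERE  a.timestamp >= @some_timestamp_in_the_past
ORDER BY a.timestamp DESC;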
So now I have a list like this:
89 | 286 | 1406219705
87 | 286 | 1406219518
79 | 922 | 1406216949
78 | 228 | 1406216871
77 | 126 | 1406216748
76 | 939 | 1406216722
74 | 126 | 1406216352
64 | 939 | 1406212540
63 | 126 | 1406212522
49 | 228 | 1406205715
48 | 228 | 1406204851
37 | 228 | 1406196435
32 | 228 | 1406190209
23 | 126 | 1406182577 <- 'limiting timestamp'
18 | 871 | 1406181991
10 | 922 | 1406178816
9 | 764 | 1406178778
7 | 609 | 1406178699
5 | 126 | 1406177398
4 | 871 | 1406177379 <- 'some timestamp in the past'
So now I only need to select rows between the 'limiting timestamp' and the end of the list ('some timestamp in the past'). I could have specified the 'limiting timestamp' in the WHERE condition of the original query, but the problem is: I need the resulting set to contain no records whose non_unique_id has already appeared in the list above the 'limiting timestamp'. This is what the result of the query should look like:
<- 'limiting timestamp'
18 | 871 | 1406181991
9 | 764 | 1406178778
7 | 609 | 1406178699
<- 'some timestamp in the past'
So the result returns 3 rows, all of which have a non_unique_id that did not appear in the results above. But if a non_unique_id appears more than once in the list between the 'limiting timestamp' and 'some timestamp in the past', then only the first occurrence should be kept. Note: this last condition is optional, as it is pretty easy to remove the duplicates from the final list.
So far the only solution I was able to come up with is a JOIN between the list >= 'some timestamp in the past' and the list > 'limiting timestamp'. This way I can see whether any non_unique_id from the top list occurs in the bottom list (see the sketch below). However, it can be assumed that the query is complex, and the time needed to produce its results shouldn't be doubled by running it again with a slightly different condition.
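In rough SQL terms, what I mean is something like this (just a sketch; query_result stands in for the output of the real two-JOIN query, and the @-variables are the two timestamps):

-- Sketch only: query_result stands in for the output of the real two-JOIN
-- query filtered with timestamp >= 'some timestamp in the past'; the second
-- reference to it is effectively the same expensive query run again.
SELECT bottom.unique_id, bottom.non_unique_id, bottom.timestamp
FROM   query_result AS bottom
       LEFT JOIN query_result AS top
              ON  top.non_unique_id = bottom.non_unique_id
              AND top.timestamp > @limiting_timestamp     -- rows above the limit
WHERE  bottom.timestamp <= @limiting_timestamp            -- rows at or below the limit
  AND  top.non_unique_id IS NULL;                         -- id never appears above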
You can try this if you are on SQL Server 2008+ (sqlfiddle: http://sqlfiddle.com/#!3/0bc33/3):
WITH cteOrdered AS
(
    SELECT ROW_NUMBER() OVER (PARTITION BY t1.Non_Unique_ID ORDER BY t1.Timestamp) AS RID,
           t1.*
    FROM   Table1 t1
           LEFT JOIN (SELECT Non_Unique_ID
                      FROM   Table1
                      WHERE  Timestamp < 1406177379 OR
                             Timestamp > 1406182577) t2
                  ON t1.Non_Unique_ID = t2.Non_Unique_ID
    WHERE  t2.Non_Unique_ID IS NULL AND
           t1.Timestamp > 1406177379 AND
           t1.Timestamp < 1406182577
)
SELECT Unique_ID,
       Non_Unique_ID,
       Timestamp
FROM   cteOrdered
WHERE  RID = 1;
I've added another row into the data,
(18, 871, 1406181990),
to see if the query was producing what you want. You said that if there are duplicate non_unique_ids within the search range, only the "first" occurrence should be kept. I take it this is the one with the EARLIEST timestamp? If it's the opposite, you can change this line
SELECT ROW_NUMBER() OVER (PARTITION BY t1.Non_Unique_ID ORDER BY t1.Timestamp) AS RID,
to
SELECT ROW_NUMBER() OVER (PARTITION BY t1.Non_Unique_ID ORDER BY t1.Timestamp DESC) AS RID,
and that will flip the order to retain the LATEST timestamp for the duplicates.
Try this:
SELECT *
FROM   my_table AS t1
WHERE  timestamp < @limiting_timestamp
  AND  timestamp > @some_timestamp_in_the_past
  AND  NOT EXISTS (SELECT 1
                   FROM   my_table AS t2
                   WHERE  timestamp > @limiting_timestamp
                     AND  t1.non_unique_id = t2.non_unique_id)
That will get you the records between your timestamps whose non_unique_id doesn't occur after the limiting timestamp. Note that if you want records that are equal to the filtering timestamps as well, you can use the BETWEEN keyword instead of < and >.
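For example, the inclusive version of the range filter would look like this (same placeholder variables as above):

-- Inclusive variant: BETWEEN keeps rows equal to either boundary timestamp.
SELECT *
FROM   my_table AS t1
WHERE  timestamp BETWEEN @some_timestamp_in_the_past AND @limiting_timestamp
  AND  NOT EXISTS (SELECT 1
                   FROM   my_table AS t2
                   WHERE  timestamp > @limiting_timestamp
                     AND  t1.non_unique_id = t2.non_unique_id)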
To eliminate the duplicates you can use another subquery:
AND t1.timestamp = (SELECT MAX(timestamp)
                    FROM   my_table AS t2
                    WHERE  timestamp < @limiting_timestamp
                      AND  timestamp > @some_timestamp_in_the_past
                      AND  t2.non_unique_id = t1.non_unique_id)
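Putting both conditions together, the whole thing would look roughly like this (this keeps the latest timestamp per non_unique_id within the range; swap MAX for MIN if you want the earliest occurrence instead):

SELECT *
FROM   my_table AS t1
WHERE  timestamp < @limiting_timestamp
  AND  timestamp > @some_timestamp_in_the_past
  -- the non_unique_id must not appear after the limiting timestamp
  AND  NOT EXISTS (SELECT 1
                   FROM   my_table AS t2
                   WHERE  timestamp > @limiting_timestamp
                     AND  t1.non_unique_id = t2.non_unique_id)
  -- keep only one row per non_unique_id within the range
  AND  t1.timestamp = (SELECT MAX(timestamp)
                       FROM   my_table AS t2
                       WHERE  timestamp < @limiting_timestamp
                         AND  timestamp > @some_timestamp_in_the_past
                         AND  t2.non_unique_id = t1.non_unique_id)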