Please consider the following "tweets" table:
tweet_id  user_id  text
----------------------------
1         1        lorem ipsum
2         1        lorem ipsum
3         2        pear
4         1        dolor
5         3        foo
6         1        dolor
7         1        dolor
8         3        bar
9         3        baz
10        4        happy
11        4        happy
12        2        apple
13        3        foo
14        4        happy
In reality, the table contains millions of tweets from about 80,000 users. Many of these users are spam accounts, but they are hard to identify by hand. As a rule of thumb, spam accounts post the same message at least 3 times. That's why I want to fill the following tables, "duplicates" on the left and "duplicates_tweets" on the right:
duplicate_id  user_id  text         cnt      duplicate_id  tweet_id
---------------------------------------      ----------------------
1             1        lorem ipsum  2        1             1
2             1        dolor        3        1             2
3             2        pear         1        2             4
4             2        apple        1        2             6
5             3        foo          2        2             7
6             3        bar          1        3             3
7             3        baz          1        4             12
8             4        happy        3        5             5
                                             5             13
                                             6             8
                                             7             9
                                             8             10
                                             8             11
                                             8             14
I can now very easily sort on cnt, for instance, and see which users post the most duplicate messages. My question, however, is how to go about this most efficiently. In other words: which query would fill these tables most efficiently? And is it possible with SQL alone, or should I use PHP as an intermediary, for instance a script that takes a tweet from the "tweets" table, scans for duplicates, fills the tables, and moves on to the next tweet? I'm afraid that would take ages to finish, so any help is greatly appreciated!
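You can use MySQL's REPLACE statement, which inserts a row or, if a row with the same PRIMARY KEY or UNIQUE index value already exists, deletes the old row and inserts the new one. The query below assumes a unique key on (user_id, text) in the duplicates table:
REPLACE INTO duplicates (user_id, text, cnt)
SELECT user_id, text, COUNT(*) AS cnt
FROM tweets
GROUP BY user_id, text
HAVING COUNT(*) > 2;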
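To make the grouped insert concrete, here is a minimal, runnable sketch on the sample rows from the question. It is only an illustration under stated assumptions: Python's built-in sqlite3 module stands in for MySQL, a plain INSERT ... SELECT replaces REPLACE because the summary table starts out empty, and the column types and the UNIQUE constraint are guesses rather than the asker's real schema.

```python
import sqlite3

# Sample rows copied from the question's "tweets" table.
TWEETS = [
    (1, 1, "lorem ipsum"), (2, 1, "lorem ipsum"), (3, 2, "pear"),
    (4, 1, "dolor"), (5, 3, "foo"), (6, 1, "dolor"), (7, 1, "dolor"),
    (8, 3, "bar"), (9, 3, "baz"), (10, 4, "happy"), (11, 4, "happy"),
    (12, 2, "apple"), (13, 3, "foo"), (14, 4, "happy"),
]

conn = sqlite3.connect(":memory:")  # assumption: SQLite stands in for MySQL
conn.executescript("""
    CREATE TABLE tweets (tweet_id INTEGER PRIMARY KEY, user_id INTEGER, text TEXT);
    CREATE TABLE duplicates (
        duplicate_id INTEGER PRIMARY KEY AUTOINCREMENT,
        user_id INTEGER, text TEXT, cnt INTEGER,
        UNIQUE (user_id, text)  -- assumed key, one row per user and message
    );
""")
conn.executemany("INSERT INTO tweets VALUES (?, ?, ?)", TWEETS)

# One grouped INSERT ... SELECT fills the summary table in a single statement.
conn.execute("""
    INSERT INTO duplicates (user_id, text, cnt)
    SELECT user_id, text, COUNT(*)
    FROM tweets
    GROUP BY user_id, text
    HAVING COUNT(*) > 2
""")

suspects = conn.execute(
    "SELECT user_id, text, cnt FROM duplicates ORDER BY user_id"
).fetchall()
print(suspects)  # [(1, 'dolor', 3), (4, 'happy', 3)]
```

Dropping the HAVING clause would store every (user_id, text) group, including those with cnt 1 and 2, as in the table the question asks for.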
Before you insert a new tweet, check whether the tweets table already contains the same text from the same user. If it does, insert the tweet and also record it in the duplicates and duplicates_tweets tables. Alternatively, use an insert trigger on the tweets table to do this automatically.
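The trigger idea could look roughly like the sketch below. It is written against SQLite through Python's sqlite3 module so that it runs anywhere; MySQL's trigger syntax differs (for example, it needs FOR EACH ROW), and the trigger name track_duplicates, the column types, and the UNIQUE constraint are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: SQLite stands in for MySQL
conn.executescript("""
    CREATE TABLE tweets (tweet_id INTEGER PRIMARY KEY, user_id INTEGER, text TEXT);
    CREATE TABLE duplicates (
        duplicate_id INTEGER PRIMARY KEY AUTOINCREMENT,
        user_id INTEGER, text TEXT, cnt INTEGER,
        UNIQUE (user_id, text)
    );
    CREATE TABLE duplicates_tweets (duplicate_id INTEGER, tweet_id INTEGER);

    -- Hypothetical trigger name; the body runs for every row inserted into tweets.
    CREATE TRIGGER track_duplicates AFTER INSERT ON tweets
    BEGIN
        -- Create the (user_id, text) group the first time it is seen.
        INSERT INTO duplicates (user_id, text, cnt)
        SELECT NEW.user_id, NEW.text, 0
        WHERE NOT EXISTS (
            SELECT 1 FROM duplicates
            WHERE user_id = NEW.user_id AND text = NEW.text
        );
        -- Count the new tweet.
        UPDATE duplicates SET cnt = cnt + 1
        WHERE user_id = NEW.user_id AND text = NEW.text;
        -- Link the tweet to its group.
        INSERT INTO duplicates_tweets (duplicate_id, tweet_id)
        SELECT duplicate_id, NEW.tweet_id FROM duplicates
        WHERE user_id = NEW.user_id AND text = NEW.text;
    END;
""")

# Inserting tweets is now enough; the trigger maintains both helper tables.
conn.executemany("INSERT INTO tweets VALUES (?, ?, ?)", [
    (4, 1, "dolor"), (5, 3, "foo"), (6, 1, "dolor"), (7, 1, "dolor"),
])

counts = conn.execute(
    "SELECT user_id, text, cnt FROM duplicates ORDER BY duplicate_id"
).fetchall()
links = conn.execute(
    "SELECT duplicate_id, tweet_id FROM duplicates_tweets ORDER BY tweet_id"
).fetchall()
print(counts)  # [(1, 'dolor', 3), (3, 'foo', 1)]
print(links)   # [(1, 4), (2, 5), (1, 6), (1, 7)]
```

A trigger keeps the counts current for new tweets only, so the millions of existing rows would still need a one-off bulk fill.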
Do you just want to pull out a list of possible spam tweets? Try this:
SELECT
user_id,
text,
COUNT(DISTINCT tweet_id)
FROM
tweets
GROUP BY
user_id,
text
HAVING
COUNT(DISTINCT tweet_id) >= 3
You can then use PHP to iterate over the result and INSERT/UPDATE a duplicates_tweets table (although, as Chris K mentioned, do you really need a duplicates_tweets table when you can just use this query?).
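If the link table is wanted after all, the per-row loop in PHP can probably be avoided by letting the database do the join. The following is a hedged sketch on a subset of the sample rows, again using Python's sqlite3 module in place of MySQL; the schema details are assumptions, and the ORDER BY in the first insert is there only to make the generated duplicate_id values predictable in this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: SQLite stands in for MySQL
conn.executescript("""
    CREATE TABLE tweets (tweet_id INTEGER PRIMARY KEY, user_id INTEGER, text TEXT);
    CREATE TABLE duplicates (
        duplicate_id INTEGER PRIMARY KEY AUTOINCREMENT,
        user_id INTEGER, text TEXT, cnt INTEGER
    );
    CREATE TABLE duplicates_tweets (duplicate_id INTEGER, tweet_id INTEGER);
""")
# A subset of the question's sample rows.
conn.executemany("INSERT INTO tweets VALUES (?, ?, ?)", [
    (3, 2, "pear"), (4, 1, "dolor"), (6, 1, "dolor"), (7, 1, "dolor"),
    (10, 4, "happy"), (11, 4, "happy"), (12, 2, "apple"), (14, 4, "happy"),
])

# Step 1: one row per (user_id, text) group that occurs at least 3 times.
conn.execute("""
    INSERT INTO duplicates (user_id, text, cnt)
    SELECT user_id, text, COUNT(DISTINCT tweet_id)
    FROM tweets
    GROUP BY user_id, text
    HAVING COUNT(DISTINCT tweet_id) >= 3
    ORDER BY user_id, text
""")

# Step 2: link every tweet to its group with a join instead of a per-row loop.
conn.execute("""
    INSERT INTO duplicates_tweets (duplicate_id, tweet_id)
    SELECT d.duplicate_id, t.tweet_id
    FROM duplicates AS d
    JOIN tweets AS t ON t.user_id = d.user_id AND t.text = d.text
""")

links = conn.execute(
    "SELECT duplicate_id, tweet_id FROM duplicates_tweets ORDER BY tweet_id"
).fetchall()
print(links)
```

Both statements are set-based, so the work stays inside the database and no result set has to travel to the application and back.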
I agree with what @MichaelRushton and @Kosta answered, but I wonder whether you need another table at all. If you build the right query, the tweets table itself can give you the information you are looking for. I especially like the trigger idea.
Alternatively, you could sort the "tweets" table by user_id and then by text:
SELECT * FROM tweets ORDER BY user_id DESC, text DESC
Afterwards you can iterate over the results in PHP:
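<?php
// ... run the ORDER BY query above; $result is assumed to be a mysqli_result
$lastuser = null;
$lasttext = null;
$ids = array();
while ($row = mysqli_fetch_assoc($result)) {
    // A different (user_id, text) pair starts a new run of identical tweets.
    if ($row['user_id'] !== $lastuser || $row['text'] !== $lasttext) {
        $ids = array();
    }
    // Collect the tweet ids of the current run.
    $ids[] = $row['tweet_id'];
    if (count($ids) >= 3) {
        // flag the tweets in $ids as spam
    }
    $lastuser = $row['user_id'];
    $lasttext = $row['text'];
}
?>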
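The PHP loop itself is a single linear pass over the rows, so the cost is dominated by the sort, which takes roughly N*log(N) for N tweets. An index on (user_id, text) can let MySQL return the rows in order without a separate sort step.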
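The same sort-then-scan idea can be sketched in a few lines of Python, shown here on the sample rows from the question. The function name find_spam_groups and the threshold parameter are made up for this illustration, and the sort happens in memory here, whereas in practice the database would do it.

```python
from itertools import groupby

# (tweet_id, user_id, text) rows copied from the question's sample table.
TWEETS = [
    (1, 1, "lorem ipsum"), (2, 1, "lorem ipsum"), (3, 2, "pear"),
    (4, 1, "dolor"), (5, 3, "foo"), (6, 1, "dolor"), (7, 1, "dolor"),
    (8, 3, "bar"), (9, 3, "baz"), (10, 4, "happy"), (11, 4, "happy"),
    (12, 2, "apple"), (13, 3, "foo"), (14, 4, "happy"),
]


def find_spam_groups(rows, threshold=3):
    """Map (user_id, text) to its tweet ids for groups of at least `threshold` rows."""
    def key(row):
        return (row[1], row[2])

    # Sorting puts identical (user_id, text) pairs next to each other: O(N log N).
    ordered = sorted(rows, key=key)
    flagged = {}
    # groupby then walks the sorted rows exactly once: O(N).
    for pair, group in groupby(ordered, key=key):
        ids = [tweet_id for tweet_id, _, _ in group]
        if len(ids) >= threshold:
            flagged[pair] = ids
    return flagged


spam = find_spam_groups(TWEETS)
print(spam)  # {(1, 'dolor'): [4, 6, 7], (4, 'happy'): [10, 11, 14]}
```

Because Python's sort is stable, the tweet ids inside each group keep their original order.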