I have a table of words used in the title of articles. I want to find which words which are used the least in the set or article titles.
Example:
Titles:
"Congressman Joey of Texas does not sign bill C1234."
"The pretty blue bird flies at night in Texas."
"Congressman Bob of Arizona is the signs bill C1234."
The table would contain the following.
Table WORDS_LIST
----------------------------------------------------
| INDEX ID | WORD | ARTICLE ID |
----------------------------------------------------
| 1 | CONGRESSMAN | 1234 |
| 2 | JOEY | 1234 |
| 3 | SIGN | 1234 |
| 4 | BILL | 1234 |
| 5 | C1234 | 1234 |
| 6 | TEXAS | 1234 |
| 7 | PRETTY | 1235 |
| 8 | BLUE | 1245 |
| 9 | BIRD | 1245 |
| 10 | FLIES | 1245 |
| 11 | NIGHT | 1245 |
| 12 | TEXAS | 1245 |
| 13 | CONGRESSMAN | 1246 |
| 14 | BOB | 1246 |
| 15 | ARIZONA | 1246 |
| 16 | SIGNS | 1246 |
| 17 | BILL | 1246 |
| 18 | C1234 | 1246 |
----------------------------------------------------
In this case, the words "pretty,blue, flies, night" would be the used in the least number of articles.
I would appreciate any ideas on how to best create this query. So far below is what I started with. I can also write something in PHP but figured a query would be faster.
SELECT distinct a1.`word`, count(a1.`word`)
FROM mmdb.words_list a1
JOIN mmdb.words_list b1
ON a1.id = b1.id AND
upper(a1.word) = upper(b1.word)
where date(a1.`publish_date`) = '2017-06-09'
group by `word`
order by count(a1.`word`);
Try this. It's a bit more simple and should return the correct results:
SELECT `WORD`,
COUNT(*) as `num_articles`
FROM `WORDS_LIST`
WHERE date(`publish_date`) = '2017-06-09'
GROUP BY `WORD`
ORDER BY COUNT(*) ASC;
I don't see why a self-join is necessary. Just do something like this:
select wl.word, count(*)
from mmdb.words_list wl
where date(wl.`publish_date`) = '2017-06-09'
group by wl.word
order by count(*);
You can add a limit
to get a fixed number of words. If publish_date
is already a date, you should do the comparison as:
where publish_date = '2017-06-09'
If it has a time component:
where publish_date >= '2017-06-09' and publish_date < '2017-06-10'
This expression allows MySQL to use an index.