----------------------------------------
ColumnA | ColumnB | ColumnC |
----------------------------------------
Cat | Shirt | Pencil |
Dog | Shirt | Eraser |
Worm | Dress | Pen |
Cow | Shirt | Pen |
Cat | Shirt | Pen |
Cat | Jacket | Pen |
Cow | Shirt | Pen |
Cat | Shirt | Pen |
Cat | Jacket | Pen |
Cow | Shirt | Pen |
Cat | Shirt | Pen |
Cat | Jacket | Pen |
With the example data above I am trying to find the most re-occuring combinations which are a pair of 2 or greater.
For example
Shirt,Pen 6
Cat,Pen 6
Cat,Shirt 4
Jacket, Pen 3
Pen,Cow 3
Cat,Shirt,Pen 3
Cat,Jacket,Pen 3
Cow,Shirt,Pen 3
I need this for up to 10 columns of data.
Cat,Shirt
is the same as Shirt,Cat
.
What is the best algorithm to use? Preferably in SQL but I could also try PHP?
You can do this in SQL by identifying each row and adding an "empty" element. Note: this assumes that the values are different in each column -- or at least fungible (it doesn't matter which column one is in).
Let me also assume that each row has a unique id:
with t as (
select id, col
from data d outer apply
(values (col1), (col2), (col3), (NULL)) v(col)
)
select t1.col, t2.col, t3.col, count(*)
from t t1 join
t t2
on t1.id = t2.id and (t2.col > t1.col or t2.col is null) join
t t3
on t1.id = t3.id and (t3.col > t2.col or (t2.col is null and t3.col is null))
group by t1.col, t2.col, t3.col
order by count(*) desc;
One way may be would this
SELECT c1, c2, c3, count(*) FROM (
SELECT ColumnA AS c1, ColumnB AS c2, NULL AS c3 FROM your_table
UNION ALL
SELECT ColumnA AS c1, ColumnC AS c2, NULL AS c3 FROM your_table
UNION ALL
SELECT ColumnB AS c1, ColumnC AS c2, NULL AS c3 FROM your_table
UNION ALL
SELECT ColumnA AS c1, ColumnB AS c2, ColumnC AS c3 FROM your_table
) tt
group by c1, c2, c3
order by count(*) desc