What I'm looking for is the clubs with the highest ratings average based on number of votes and obviously the highest ratings.
What I have done is the following:
club weighted median
based on subtracting one standard deviation to weed clubs with low votesThe problem is I cannot determine why my data is not showing correctly. I imagine there is an issue with my calcuations. I am getting numbers in the 10s and being negative when I should be receiving a value from 0-5 (as is the rating scores).
I'm not really quite sure where my logic fails.
Here is my ratings code logic:
$sql="SELECT SUM(rating) AS sumrating,COUNT(reviews.rating) AS countrating FROM reviews";
$rs=mysqli_fetch_array(mysqli_query($scx_dbh,$sql));
// get the total summation of ratings against all reviews
$ratingssum=(int)$rs['sumrating'];
// get the total number of ratings against all reviews
$ratingscount=(int)$rs['countrating'];
// get the population mediam
$mean = $ratingssum / $ratingscount;
// determine the variance of the population
$variance = 0;
$sql="SELECT rating AS score FROM reviews";
$rs=mysqli_query($scx_dbh,$sql);
while($row=mysqli_fetch_array($rs)){
$score = (int)$row['score'];
$variance += pow(($score-$mean),2);
}
$variance = $variance/$ratingscount;
// loop through all clubs and implement new rating
$scores=array();
$sql="SELECT locid,COUNT(reviewid) AS locationrecordcount,AVG(rating) AS locationmedian FROM reviews GROUP BY locid";
$rs=mysqli_query($scx_dbh,$sql);
/// begin loop
while($row=mysqli_fetch_array($rs)){
// get the number of review votes for this club
$numvotes=(int)$row['locationrecordcount'];
// get the location id
$locId = (int)$row['locid'];
// find the standard deviation for this club (total variance * numclubvotes)
$standarddev=sqrt($variance*$numvotes);
// create the new rating for this club with 1 standard deviation less
$oldRating=$row['locationmedian'];
$newRating=$oldRating-$standarddev;
$scores[$locId] = array(
'numvotes'=>$numvotes,
'standard-deviation'=>$standarddev,
'original-rating'=> $oldRating,
'weighted-rating'=>$newRating
);
}
usort($scores,function($a,$b){
return $a['weighted-rating']-$b['weighted-rating'];
});
And here are my results:
top 10
[0] => Array
(
[numvotes] => 1121
[standard-deviation] => 68.898321138853
[original-rating] => 4.415700267618207
[weighted-rating] => -64.482620871235
)
[1] => Array
(
[numvotes] => 909
[standard-deviation] => 62.042283630954
[original-rating] => 3.1290979097910174
[weighted-rating] => -58.913185721163
)
[2] => Array
(
[numvotes] => 594
[standard-deviation] => 50.153247058093
[original-rating] => 4.414225589225589
[weighted-rating] => -45.739021468868
)
[3] => Array
(
[numvotes] => 505
[standard-deviation] => 46.243587892712
[original-rating] => 4.090099009900985
[weighted-rating] => -42.153488882811
)
[4] => Array
(
[numvotes] => 517
[standard-deviation] => 46.78979093937
[original-rating] => 4.661025145067699
[weighted-rating] => -42.128765794302
)
[5] => Array
(
[numvotes] => 505
[standard-deviation] => 46.243587892712
[original-rating] => 3.2117821782178173
[weighted-rating] => -43.031805714494
)
[6] => Array
(
[numvotes] => 398
[standard-deviation] => 41.053233483774
[original-rating] => 4.231155778894469
[weighted-rating] => -36.822077704879
)
[7] => Array
(
[numvotes] => 340
[standard-deviation] => 37.944190471069
[original-rating] => 3.9102941176470547
[weighted-rating] => -34.033896353422
)
[8] => Array
(
[numvotes] => 323
[standard-deviation] => 36.983422110177
[original-rating] => 3.261145510835913
[weighted-rating] => -33.722276599341
)
[9] => Array
(
[numvotes] => 280
[standard-deviation] => 34.433791770728
[original-rating] => 3.36767857142857
[weighted-rating] => -31.066113199299
)
[10] => Array
(
[numvotes] => 254
[standard-deviation] => 32.796136967109
[original-rating] => 3.1411417322834665
[weighted-rating] => -29.654995234825
)
Worst 10
[232] => Array
(
[numvotes] => 2
[standard-deviation] => 2.9101865621466
[original-rating] => 4.95
[weighted-rating] => 2.0398134378534
)
[233] => Array
(
[numvotes] => 2
[standard-deviation] => 2.9101865621466
[original-rating] => 5
[weighted-rating] => 2.0898134378534
)
[234] => Array
(
[numvotes] => 1
[standard-deviation] => 2.0578126526118
[original-rating] => 4
[weighted-rating] => 1.9421873473882
)
[235] => Array
(
[numvotes] => 2
[standard-deviation] => 2.9101865621466
[original-rating] => 4.8
[weighted-rating] => 1.8898134378534
)
[236] => Array
(
[numvotes] => 1
[standard-deviation] => 2.0578126526118
[original-rating] => 3.25
[weighted-rating] => 1.1921873473882
)
[237] => Array
(
[numvotes] => 1
[standard-deviation] => 2.0578126526118
[original-rating] => 5
[weighted-rating] => 2.9421873473882
)
[238] => Array
(
[numvotes] => 1
[standard-deviation] => 2.0578126526118
[original-rating] => 5
[weighted-rating] => 2.9421873473882
)
[239] => Array
(
[numvotes] => 1
[standard-deviation] => 2.0578126526118
[original-rating] => 4.1
[weighted-rating] => 2.0421873473882
)
[240] => Array
(
[numvotes] => 1
[standard-deviation] => 2.0578126526118
[original-rating] => 5
[weighted-rating] => 2.9421873473882
)
[241] => Array
(
[numvotes] => 2
[standard-deviation] => 2.9101865621466
[original-rating] => 5
[weighted-rating] => 2.0898134378534
)
)
UPDATE
OK so I recalculated the standard deviation
against the entire population. It is 2.0578126526118
.
Here is my current code:
$sql="SELECT SUM(reviews.rating) AS sumrating,COUNT(reviews.rating) AS countrating FROM reviews";
$rs=mysqli_fetch_array(mysqli_query($scx_dbh,$sql));
$ratingssum=(int)$rs['sumrating'];
$ratingscount=(int)$rs['countrating'];
$mean = $ratingssum / $ratingscount;
$variance = 0;
$sql="SELECT rating AS score FROM reviews";
$rs=mysqli_query($scx_dbh,$sql);
while($row=mysqli_fetch_array($rs)){
$score = (int)$row['score'];
$variance += pow(($score-$mean),2);
}
$variance = $variance/$ratingscount;
$standarddev=sqrt($variance);
$scores=array();
$sql="SELECT locid,COUNT(reviewid) AS locationrecordcount,AVG(rating) AS locationmedian FROM reviews GROUP BY locid";
$rs=mysqli_query($scx_dbh,$sql);
while($row=mysqli_fetch_array($rs)){
$numvotes=(int)$row['locationrecordcount'];
$locId = (int)$row['locid'];
$oldRating=$row['locationmedian'];
$newRating=$oldRating-$standarddev;
$scores[$locId] = array(
'numvotes'=>$numvotes,
'standard-deviation'=>$standarddev,
'original-rating'=> $oldRating,
'weighted-rating'=>$newRating
);
}
usort($scores,function($a,$b){
return (int)($a['weighted-rating']-$b['weighted-rating']);
});
1./ I think my sorting function is incorrect. After sorting using my sorting function, these are the top 5:
[0] => Array
(
[numvotes] => 1
[standard-deviation] => 2.0578126526118
[original-rating] => 0.2
[weighted-rating] => -1.8578126526118
)
[1] => Array
(
[numvotes] => 1
[standard-deviation] => 2.0578126526118
[original-rating] => 0.05
[weighted-rating] => -2.0078126526118
)
[2] => Array
(
[numvotes] => 4
[standard-deviation] => 2.0578126526118
[original-rating] => 0.7625
[weighted-rating] => -1.2953126526118
)
[3] => Array
(
[numvotes] => 1
[standard-deviation] => 2.0578126526118
[original-rating] => 0.1
[weighted-rating] => -1.9578126526118
)
[4] => Array
(
[numvotes] => 1
[standard-deviation] => 2.0578126526118
[original-rating] => 0.4
[weighted-rating] => -1.6578126526118
)
As you can see, besides the fact that they are producing negative numbers, it looks like the weighted-average
of position 1 (index 0) is -1.85
and position 2 (index 1) is -2.00
. I imagine there's an issue with either my algorithm or my sorting function in my code or else why are there negative numbers being sorted as first
.
Also, i am getting clubs at positions 1 when they have 1 vote. The purpose of this algorithm was to weed these clubs out so I can focus on the clubs that have 1000's of votes.
Here are the bottom 5:
[237] => Array
(
[numvotes] => 29
[standard-deviation] => 2.0578126526118
[original-rating] => 4.112068965517241
[weighted-rating] => 2.0542563129054
)
[238] => Array
(
[numvotes] => 5
[standard-deviation] => 2.0578126526118
[original-rating] => 3.8800000000000003
[weighted-rating] => 1.8221873473882
)
[239] => Array
(
[numvotes] => 31
[standard-deviation] => 2.0578126526118
[original-rating] => 3.7499999999999996
[weighted-rating] => 1.6921873473882
)
[240] => Array
(
[numvotes] => 1
[standard-deviation] => 2.0578126526118
[original-rating] => 5
[weighted-rating] => 2.9421873473882
)
[241] => Array
(
[numvotes] => 1
[standard-deviation] => 2.0578126526118
[original-rating] => 4.45
[weighted-rating] => 2.3921873473882
)
The same behavior exhibits on the bottom 5. I've got a weighted-average
of 2.39
for position 5 (index 241) and a weighted-average
of 2.94
for position 4 (index 240)
The standard deviation is calculated by the square root of the variation, not the square root of the variation multiplied by population (numvotes):
// find the standard deviation for this club (total variance)
$standarddev=sqrt($variance);
If you want to weigh each club by themselves then you need to calculate the variation (and standard deviation) for each club. To do this you need to sum only the votes for each club, not all votes, and then calculate the variation and standard deviation. Then the variation and standard deviation for all votes seems unnecessary.
Update:
What you are trying to accomplish (weed out clubs with few votes) cannot be done using standard deviation (σ).
Concider the following:
5/1=5, (5-5)^2 / 1=0, sqrt(0)=0
1/1=1, (1-1)^2 / 1=0, sqrt(0)=0
10/2=5, ((5-5)^2 + (5-5)^2) / 2=0, sqrt(0)=0
Now you'd think that you can just weed out the clubs with a low σ.
6/2=3, ((1-3)^2 + (5-3)^2) / 2=8, sqrt(8)=2.83
As you can see there is nothing here that says "hey, this club got lots of votes". The only thing σ says is how much spread on the votes there is. If there is no or a small spread (variation) then σ will be 0 or small and vice versa.
What you could try is to look at the difference between a clubs σ (Cσ) and the total σ (Tσ). If that value is near 0 (to a limit, lets say 0.1) then you'd know that there is a similar variation in that club as in the whole population. But this will still not guarantee that there are atleast x amount of votes. This calculation would be something like abs(Cσ - Tσ) < 0.1
.
Regarding your sorting function:
usort
expects that the returned integer is -1, 0 or 1 to function properly. When you start to subtract negative numbers you'll get rather strange results. The correct sorting function should look something like this:
usort($scores, function cmp($a, $b)
{
if ($a['weighted-rating'] == $b['weighted-rating']) {
return 0;
}
return ($a['weighted-rating'] < $b['weighted-rating']) ? -1 : 1;
}
$standarddev=sqrt($variance*$numvotes);
should be
$standarddev=sqrt($variance);
edit
Your problem is not being able to find errors in your logic. The reason for that is that you have one large complex function. You should look into test driven development and split your code up into small, easily testable units of work. For each unit of work you can test the expected output for different input values. This way it will be easier to exclude parts of your code, e.g. the stdCalculator, because that part is covered by a range of testcases.