在尝试获得平均值时排除异常值

I'm using cakephp for a website that allows users to log their substance use.

What I want to do is get the average dose of a specific substance. I've done this with setting a virtualfield

    $this->RecordDrugUnit->virtualFields['sum'] ='AVG(RecordDrugUnit.dose)';

The problem though is, if a user messed up and has a messed up value like 100000 grams of alcohol taken, then this will screw up the average. So I want to exclude to outliers, or somehow figure out a better way to gather the average.

Anyone have any input on this?

You can do this by excluding variables which are too far from the standard deviation, where 3 * standard deviation from the average is commonly considered an 'outlier'. If you only want to exclude the very far off the average, you can just increase the amount that stddev is multiplied by. Here's a very simplified, unoptimized approach you can use as a starting point for whatever virtualField you choose to use:

mysql> select * from test;
+------+
| a    |
+------+
|    1 |
|    2 |
|    3 |
|    4 |
|    5 |
| 1000 |
|    1 |
|    2 |
|    3 |
|    4 |
|    5 |
+------+
11 rows in set (0.00 sec)

mysql> select * from test where (ABS(test.a - (select avg(a) from test)) < 3*(select stddev(a) from test));
+------+
| a    |
+------+
|    1 |
|    2 |
|    3 |
|    4 |
|    5 |
|    1 |
|    2 |
|    3 |
|    4 |
|    5 |
+------+
10 rows in set (0.00 sec)

I believe if a virtualfield contains a select, that select is simply run, so you can use this directly. My quick, untested attempt at a virtualfield:

$this->RecordDrugUnit->virtualFields['sum'] = 'select AVG(rdu.dose) from RecordDrugUnit rdu where (ABS(rdu.dose - (select avg(dose) from rdu)) < 3*(select stddev(dose) from rdu))';

You can use the trimmed mean function for that. If you're interested I can add the procedure later today.

-- Add trimmed mean user defined function

-- ------------------------------ START FUNCTION --------------------------------
drop function if exists trimmed_mean;

delimiter //

create function trimmed_mean(
    -- data: comma separated list of numeric values, left-to-right sorted from low to high
    -- (you can use GROUP_CONCAT(x ORDER BY x) to let MySQL generate suchs lists for you
    data   text
    -- p: the percentage of the data points to trim.
,   p      tinyint
)
returns double
begin
    -- n: number of observations
    declare n int default 1 + length(data) - length(replace(data, ',', ''));
    -- m: number of observations to remove on both ends of the data set
    declare m int default n * p / 2 / 100;
    -- t: trimmed dataset
    declare t text default substring_index(substring_index(data, ',', n-m), ',', -(n-m-m));
    -- current character (for parsing numbers out of the dataset)
    declare c varchar(1);
    -- x: integer part of the data point, y: decimal part of the data point
    declare x, y varchar(32);
    -- z: number of decimals
    declare z int unsigned default 0;
    -- number of characters in the (trimmed) data set
    declare l int unsigned default length(t);
    -- i: current position in the data set, j: marks start of data point
    declare i, j int unsigned default 1;
    -- the sum of the integer parts of the data points
    declare v int default 0;
    -- d: the sum of the decimal parts of the data parts (as scaled integer), s: scaling factor
    declare d, s int unsigned default 0;
    repeat
        -- get the current character from the trimmed data set
        set c = substring(t, i, 1);
        -- check if current position is a data point separator (',') or end of data terminator ('')
        if substring(t, i, 1) in (',', '') then
            -- parse out a data point (from j up to i) into x; advance j and look for a decimal separator ('.')
            set x = substring(t, j, i - j),
                j = i + 1,
                d = instr(x, '.')
            ;
            -- if we have no decimals, then parse data point as integer and update our sum v with it.
            if d = 0 then
                set v = v + cast(x as signed);
            else
            -- we have decimals. Parse up to the decimal separator ('.') as integer and update our sum v.
            -- parse out the part after the decimal separator into y. Update the total number of decimals to keep track of in z
            -- Finally, pad our decimal parts sum with the number of decimals and prepend a 1 to not lose leading zeroes
                set v = v + cast(substring_index(x, '.', 1) as signed)
                ,   y = substring_index(x, '.', -1)
                ,   z = greatest(z, length(y))
                ,   d = cast(rpad(cast(d as char), '0', z + 1) as unsigned) + cast(rpad(concat('1', y), '0', z + 1) as unsigned)
                ;
            end if;
        end if;
        -- advance position to look at next character in the dataset.
        set i = i + 1;
    -- stop scanning when we ran trhough the dataset.
    until c = '' end repeat;
    -- compute the scaling factor s (1 followed by number of zeroes equal to max number of decimals)
    -- update n to the number of original observations minus the specified percentage.
    set s = cast(rpad('1', '0', z + 1) as unsigned)
    ,   n = (n - 2 * m)
    ;
    -- add sum of integer parts to the (downscaled) sum of decimal parts, and divide to get the mean.
    return (v + case d when 0 then 0 else (d - n * s) / s end) / n;
end
//

delimiter ;
-- ------------------------------- END FUNCTION ---------------------------------