Using mathematical methods to spot anomalies

theta · Mar 22, 2011

Hi all. If I had a list of numbers (1,000 rows) and wanted to find those which would appear to be anomalies, given the rest of population...what methods could I use?

15
19
65
87
4
23
100
53
14
7

Distance from the mean?
Normal (Gaussian) distribution?
Bi-nomial distribution?
Quantitative methods...

What would be the easist way to 'mark' these items with some sort of index (standard z-score) and perhaps graph it?

Would there be a way to 'tune' this to make it more/less senstitive...

energman58 · Mar 22, 2011

I would use the standard deviation of the population and flag anything that is outside say 2 or 3 SD from the mean. You could then tune this by adjusting the number of SDs to include or exclude more as you want - you can use the normal distribution to give you how many this will exclude if the population is normally distributed to give you an idea of how many "should" get thrown up (assuming you are sampling from a normal population of course

).

theta · Mar 22, 2011

Thanks for the quick reply!

How would I acheive this using formula if I had a column of 1,000 numbers (say column A1:A1000). No problem using helper columns etc.

New to stats! Calc the standard deviation and then the normal distribution

Would like to be able to visualise this in a graph (but that can wait until I resolve the formulas).

energman58 · Mar 23, 2011

STDEV(A1:A100) calculates teh standard deviation but it can only cope with 254 values. If your data is random that should not be too much of a problem for you - simply take an arbitrary 254 values and use the SD of that.

If its not random or you want a better value you can calculate the SD manually easily enough.

The SD is the square root of the variance and variance is:

((x-mean of x)^2 Summed for all values of x)/(n-1)

Cant remeber why its n-1 (it was a long time ago)

If you are coding it you can use a rearrangement of this such that teh variance is

1/(n-1)*(sum of x^2 for all values of x - (sum of all values of x)^2/n)

So the SD can be coded as:

Code:

=SQRT((1/(COUNT(A1:A1000)-1))*(SUMPRODUCT(A1:A1000,A1:A1000)-((SUM(A1:A1000)^2)/(COUNT(A1:A1000)))))

Hope that helps

pgc01 · Mar 23, 2011

Hi

To calculate the standard deviation you can also use directly:

=STDEV(A1:A1000)

energman58 · Mar 23, 2011

pgc01

STDEV is limited to a maximum sample of 254 according to the help unless a single array only counts as one.

pgc01 · Mar 23, 2011

energman58 said:
pgc01

STDEV is limited to a maximum sample of 254 according to the help unless a single array only counts as one.

Hi energman58

I guess you are referring to the maximum number of arguments of the function in the latest excel versions, not the number of values.

For ex., in this case I'm using the function with 2 arguments but with 20 values:

=STDEV(E1:E10,F1:F10)

mirabeau · Mar 23, 2011

theta said:
If I had a list of numbers (1,000 rows) and wanted to find those which would appear to be anomalies, given the rest of population...what methods could I use?

There's lots of interesting possibilities. But your results must depend on the criteria you want to use.

Years ago Emil Gumbel wrote a significant book on "The Statistics of Extremes" after many years of earlier research, http://en.wikipedia.org/wiki/Extreme_value_theory which has a bearing on this sort of problem.

Sometime later Paul Samuelson (who won a Nobel prize, but not for that) considered the interesting and related question of "How Deviant Can You Be?" http://en.wikipedia.org/wiki/Samuelson's_inequality, that also generated a significant literature.

There's lots of other background available on this sort of topic/problem. Some of it I guess could be coded into Excel if there proved to be sufficient interest.

Maybe also worth pointing out that some distributions don't even have standard distributions, so while Excel's StDev may well produce a number, it doesn't necessarily mean anything unless you have adequate confirming additional information.

And even for distributions that do have standard deviations, most of these are unlikely to be normal ... although you can of course assume whatever you like.

theta · Mar 23, 2011

Thanks guys this is helping alot.

Couldn't understand why things were being squared then rooted. Makes sense now...converts all differences into positive number, or 'true' difference...then mean and root reduces it back to the correct scale.

Keep it coming, this info is great

Hoping to look at a list of transactions and find those that appear unusual when compared to the rest of the population

theta · Mar 23, 2011

Hi guys, produced my first model using standard deviation. I can't help but notice that if you have an erroneous item, it is still contributing to the mean so may decrease the accuracy.

You could have 3 erroneous items, one of these being exceptionally large...enough to push the mean past the point where the other 2 aren't highlighted?

Using mathematical methods to spot anomalies

theta

Well-known Member

Excel Facts

energman58

Well-known Member

theta

Well-known Member

energman58

Well-known Member

pgc01

MrExcel MVP

energman58

Well-known Member

pgc01

MrExcel MVP

mirabeau

Banned user

theta

Well-known Member

theta

Well-known Member

Forum statistics

Share this page

Using mathematical methods to spot anomalies

Well-known Member

Excel Facts

Well-known Member

Well-known Member

Well-known Member

MrExcel MVP

Well-known Member

MrExcel MVP

Banned user

Well-known Member

Well-known Member

Forum statistics

Share this page

We've detected that you are using an adblocker.

Which adblocker are you using?

Disable AdBlock

Disable AdBlock Plus

Disable uBlock Origin

Disable uBlock