Using mathematical methods to spot anomalies

theta

Well-known Member
Joined
Jun 9, 2009
Messages
960
Hi all. If I had a list of numbers (1,000 rows) and wanted to find those which would appear to be anomalies, given the rest of population...what methods could I use?

15
19
65
87
4
23
100
53
14
7

Distance from the mean?
Normal (Gaussian) distribution?
Bi-nomial distribution?
Quantitative methods...

What would be the easist way to 'mark' these items with some sort of index (standard z-score) and perhaps graph it?

Would there be a way to 'tune' this to make it more/less senstitive...
 

Some videos you may like

Excel Facts

How to change case of text in Excel?
Use =UPPER() for upper case, =LOWER() for lower case, and =PROPER() for proper case. PROPER won't capitalize second c in Mccartney

energman58

Well-known Member
Joined
Oct 25, 2010
Messages
553
I would use the standard deviation of the population and flag anything that is outside say 2 or 3 SD from the mean. You could then tune this by adjusting the number of SDs to include or exclude more as you want - you can use the normal distribution to give you how many this will exclude if the population is normally distributed to give you an idea of how many "should" get thrown up (assuming you are sampling from a normal population of course;)).
 

theta

Well-known Member
Joined
Jun 9, 2009
Messages
960
Thanks for the quick reply!

How would I acheive this using formula if I had a column of 1,000 numbers (say column A1:A1000). No problem using helper columns etc.

New to stats! Calc the standard deviation and then the normal distribution

Would like to be able to visualise this in a graph (but that can wait until I resolve the formulas).
 

energman58

Well-known Member
Joined
Oct 25, 2010
Messages
553
STDEV(A1:A100) calculates teh standard deviation but it can only cope with 254 values. If your data is random that should not be too much of a problem for you - simply take an arbitrary 254 values and use the SD of that.

If its not random or you want a better value you can calculate the SD manually easily enough.

The SD is the square root of the variance and variance is:

((x-mean of x)^2 Summed for all values of x)/(n-1)

Cant remeber why its n-1 (it was a long time ago)

If you are coding it you can use a rearrangement of this such that teh variance is

1/(n-1)*(sum of x^2 for all values of x - (sum of all values of x)^2/n)

So the SD can be coded as:

Code:
=SQRT((1/(COUNT(A1:A1000)-1))*(SUMPRODUCT(A1:A1000,A1:A1000)-((SUM(A1:A1000)^2)/(COUNT(A1:A1000)))))


Hope that helps
 

pgc01

MrExcel MVP
Joined
Apr 25, 2006
Messages
19,851

ADVERTISEMENT

Hi

To calculate the standard deviation you can also use directly:

=STDEV(A1:A1000)
 

energman58

Well-known Member
Joined
Oct 25, 2010
Messages
553
pgc01

STDEV is limited to a maximum sample of 254 according to the help unless a single array only counts as one.
 

pgc01

MrExcel MVP
Joined
Apr 25, 2006
Messages
19,851

ADVERTISEMENT

pgc01

STDEV is limited to a maximum sample of 254 according to the help unless a single array only counts as one.

Hi energman58

I guess you are referring to the maximum number of arguments of the function in the latest excel versions, not the number of values.

For ex., in this case I'm using the function with 2 arguments but with 20 values:

=STDEV(E1:E10,F1:F10)
 

mirabeau

Banned user
Joined
Nov 4, 2010
Messages
2,075
If I had a list of numbers (1,000 rows) and wanted to find those which would appear to be anomalies, given the rest of population...what methods could I use?
There's lots of interesting possibilities. But your results must depend on the criteria you want to use.

Years ago Emil Gumbel wrote a significant book on "The Statistics of Extremes" after many years of earlier research, http://en.wikipedia.org/wiki/Extreme_value_theory which has a bearing on this sort of problem.

Sometime later Paul Samuelson (who won a Nobel prize, but not for that) considered the interesting and related question of "How Deviant Can You Be?" http://en.wikipedia.org/wiki/Samuelson's_inequality, that also generated a significant literature.

There's lots of other background available on this sort of topic/problem. Some of it I guess could be coded into Excel if there proved to be sufficient interest.

Maybe also worth pointing out that some distributions don't even have standard distributions, so while Excel's StDev may well produce a number, it doesn't necessarily mean anything unless you have adequate confirming additional information.

And even for distributions that do have standard deviations, most of these are unlikely to be normal ... although you can of course assume whatever you like.
 

theta

Well-known Member
Joined
Jun 9, 2009
Messages
960
Thanks guys this is helping alot.

Couldn't understand why things were being squared then rooted. Makes sense now...converts all differences into positive number, or 'true' difference...then mean and root reduces it back to the correct scale.

Keep it coming, this info is great

Hoping to look at a list of transactions and find those that appear unusual when compared to the rest of the population
 

theta

Well-known Member
Joined
Jun 9, 2009
Messages
960
Hi guys, produced my first model using standard deviation. I can't help but notice that if you have an erroneous item, it is still contributing to the mean so may decrease the accuracy.

You could have 3 erroneous items, one of these being exceptionally large...enough to push the mean past the point where the other 2 aren't highlighted?
 

Watch MrExcel Video

Forum statistics

Threads
1,109,456
Messages
5,528,871
Members
409,842
Latest member
mfernandezcean
Top