probability of a range

DDT~123

Board Regular
Joined
May 31, 2012
Messages
220
I use Google Analytics to see the age groups of visitors to my website, then use facebook ads to target those age groups. With facebook, I'm able to narrow down to specific ages rather than age groups. According to Google Analytics, my visitors are in these ranges:
Age GroupCount of Visitors
18-2497
25-34279
35-44194

Obviously, a huge percentage of my visitors are between the ages of 25-34 but I'm trying to narrow down the other age groups. Let's say for the 18-24 age group, I'm more likely to have more visitors who are 23 and 24 years of age than I am 18 and 19 year olds. How would I determine probability of individual ages within the age groups?
 
Last edited:

Excel Facts

Copy PDF to Excel
Select data in PDF. Paste to Microsoft Word. Copy from Word and paste to Excel.
I'm more likely to have more visitors who are 23 and 24 years of age than I am 18 and 19 year olds. How would I determine probability of individual ages within the age groups?
I don't know that you can make that conclusion with any sort of certainty from the data provided. It does not give that level of detail.
You would have to make some assumptions, and then your probability would have a margin of error (and it would really be nothing more than a "guess").
If you had a few more points on data, you might be able to make a "graph", which would be a little more accurate.
But still, since your range are in increments of 10 units, and you are trying to get down to a band of 2 units, it would be a very imprecise result, not one you could have very much confidence in.
 
Upvote 0
I'm more likely to have more visitors who are 23 and 24 years of age than I am 18 and 19 year olds

Based on what data?!

If Google Analytics can break down the visitors by age (18, 19, etc) instead of age groups, the discrete probabilities for each age would be the number for that age divided by the total number.

For example, your data shows a total of 570 visitors. If 7 visitors were age 18, the estimated probability is 7/570 = 1.23%.

Of course, that is based on a "sample" of data. There is no way to know if that is a "representative" sample. But over time, the cumulative "sample" is likely to become more representative.

-----

Beyond that, there really is not sufficient data to develop a reliable probability distribution.

We have no idea of what probability distribution to expect, other than your vague assertion about the 23-24 age group ("more likely" and "more").

Nevertheless, we can have some "fun" with the existing data, relying on (huge!) assumptions.

This is not unlike assumptions that US pollsters use to predict election outcomes. And based on their predictions for the 2020 presidential election, we can see just how "reliable" that is (not!). ;)

Caveat: Possible TMI ahead. Proceed at your own risk.

-----

For a first-level approximation, we might assume that the frequency within each age group is uniformly distributed.

(Temporarily ignoring your assertion to the contrary for the 18-24 age group.)

This is demonstrated as follows.

probability of grouped data.xlsx
ABCDEF
1age range#ageavg #ageP(range)P(age)
218249713.85714317.02%2.43%
3253427927.90000048.95%4.89%
4354419419.40000034.04%3.40%
5570n100.00%total
631.457018wgtd avg
min 18-24
Rich (BB code):
Formulas:
D2: =C2 / (B2-A2+1)
E2: =C2 / $C$5
F2: =E2 / (B2-A2+1)
E5: =SUM(E2:E4)
C5: =SUM(C2:C4)
C6: =SUMPRODUCT((A2:A4 + B2:B4) / 2, C2:C4) / C5

Click on or hover the cursor over each cell to see formulas. Click the copy-to-clipboard icon in the upper-left under "f(x)", and paste into the indicated cells in an Excel worksheet.

P(range) in column E is the actual probability of each range. P(age) in column F is the guesstimated probability of each age in the range, again assuming a uniform distribution.

This is simplest assumption and calculation. It might be sufficient for some purposes. Only you can make that choice.

But remember: we have no reason to expect that distribution.

-----

For a second-level approximation , we might assume that the frequency for each age is normally distributed.

But remember: even though that is a common assumption, again we have no reason to expect that distribution.

A common approach is to assume that the extremes ages -- 18 and 45 (44+1) -- represent -4sd and +4sd respectively, where "sd" is the standard deviation (std dev).

But if we did that, we would get very unsatisfying results. The number of visitors would be less than 1 for ages 18-21 and 41-44. And the number of visitors would be 15, 470 and 85 for the age groups 18-24, 25-34 and 35-44, which is much different from the sample data in C2:C7.

Instead, we might assume that the age frequencies have a truncated normal distribution.

We use Solver [*] to determine that the limits of the truncated normal distribution are about +/-1.79sd, based on the goal number of visitors for the 18-24 age group (the smallest group) of 97, the same as the sample data.

[*] I do not know if we can determine the limits of the distribution algebraically. I do not have time to try. I use Solver for a quick solution.


This is demonstrated as follows, in addition to the formulas demonstrated above.

Rich (BB code):
Formulas:
B8: =C5 / (NORMSDIST(A38) - NORMSDIST(A11))
B9: =(A38-A11) / COUNT(B11:B37)
A11: -1.78977590994222 (derived by Solver)
C11: =$B$8 * (NORMSDIST(A12) - NORMSDIST(A11))
D11: =(NORMSDIST(A12) - NORMSDIST(A11)) / (NORMSDIST($A$38) - NORMSDIST($A$11))
E11: =ROUND(C11, 0)
F11: =E11 / $G$11
A12: =A11 + $B$9
E12: =ROUND(SUM($C$11:C12) - SUM($E$11:E11), 0)
A38: =-A11
Discrete distribution summary:
G11: =SUM(E11:E37)
G12: =SUMPRODUCT(B11:B37, E11:E37) / G11
G13: =SQRT(SUMPRODUCT((B11:B37 - G12)^2, E11:E37) / (G11 - 1))
G14: =SUMIFS($E$11:$E$37, $B$11:$B$37, ">="&A2, $B$11:$B$37, "<="&B2)
G17: =SUMIFS($F$11:$F$37, $B$11:$B$37, ">="&A2, $B$11:$B$37, "<="&B2)
I14: =G14/C2 - 1
Normal distribution summary:
J12: =SUMPRODUCT(B11:B37, D11:D37)
J13: =SQRT(SUMPRODUCT((B11:B37 - J12)^2, D11:D37))
J14: =SUMIFS($C$11:$C$37, $B$11:$B$37, ">="&A2, $B$11:$B$37, "<="&B2)
J17: =SUMIFS($D$11:$D$37, $B$11:$B$37, ">="&A2, $B$11:$B$37, "<="&B2)
Solver set-up:
Enter -4 into A11 to avoid Excel errors (#DIV/0) initially
Set objective: J14
To value: 97
By changing: A11
Deselect "Make unconstrained variables non-negative"
Solving method: GRG nonlinear

Notice the XL2BB scrollbar on the right.

It is difficult to know how much needs to be explained. You might understand the formulas well enough on your own. Feel free to ask specific questions.

The discrete probability distribution (column F) is based on the rounded age distribution in column E.

Note that the predicted number of visitors (G14:G16) is close to the sample data in C2:C4.

Also note that the number of visitors in the 23-24 age group -- SUM(E16:E17) = 40 -- is indeed more than the 18-19 age group -- SUM(E11:E12) = 17.

That said, remember: for all of the apparent precision of the method, this is merely an unreliable guess based on assumption after assumption after.... ;)

Again, if Google Analytics can provide statistics based on each age, instead of age groups, that is a better source for a discrete probability distribution.
 
Last edited:
Upvote 0
Errata....
J13: =SQRT(SUMPRODUCT((B11:B37 - J12)^2, D11:D37))

The formula in J13 should be:
Rich (BB code):
=SQRT(SUMPRODUCT((B11:B37 - J12)^2, C11:C37) / (G11 - 1))

I should also point out that the probabilities in column D and F are conditional probabilities. They apply only if the age is between 18 and 44 inclusively.

They cannot be used to estimate probabilities outside the sample age groups.
 
Upvote 0
joeu2004,

That is one detailed explanation. Nicely done!
 
Upvote 0
Note: Ignore my previous responses #3 and 4. They are completely wrong and misleading.

Nicely done!

Thanks. But apparently not! I made some horrible mistakes in my zeal to "make thing as simple as possible, and no simpler".

I have corrected the design, but I am having difficulty finding time to write a "replacement" response. I expect to post it later tonight (my time).

In the meantime, I did not want anyone to be misled by the incorrect design in my response #3 (modified by #4).
 
Upvote 0
Again, ignore my previous responses #3 and 4. Although the concepts are correct, there are major mistakes in the implementation. The following is intended to be a replacement.

I'm more likely to have more visitors who are 23 and 24 years of age than I am 18 and 19 year olds.

Based on what data?!

If Google Analytics can break down the visitors by age (18, 19, etc) instead of age groups, the number of visitors for each age divided by the total number of visitors would be the discrete probabilities for each age.

For example, your data shows a total of 570 visitors. If 7 visitors were age 18, the estimated probability would be 7/570 = 1.23%.

Of course, that is based on an arbitrary sample of data. There is no way to know if that is a "representative" sample. But over time, the cumulative sample is likely to become more representative.

Moreover, the discrete probability distribution can only be used to predict the number of visitors for each age within the range of data in the sample -- 18 to 44, in your example. In effect, they are conditional probabilities. For example, among visitors of ages 18 to 44, we can expect 1.23% to be 18, which is the probability that a visitor is age 18.

The discrete probability distribution provides no way to predict the probability of a visitor of age 45, for example.

-----

How would I determine probability of individual ages within the age groups?

There really is not sufficient data to develop a reliable probability distribution.

We have no idea what probability distribution to expect, other than your vague assertion about the 23-24 age group ("more likely" "more" than the 18-19 age group).

Nevertheless, we can have some "fun" with the existing data, relying on arbitrary assumptions (for no good reason). Hey, if it's good enough for POTUS, it's good enough for us. :rolleyes:

This is not unlike assumptions that US pollsters use to predict election outcomes. And based on their predictions for the 2020 presidential election, we can see just how "reliable" that is (not!). :rolleyes:

That said....

Caveat: Possible TMI ahead. Proceed at your own risk.

-----

For a first-level approximation, we might assume (for no good reason) that the frequency within each age group is uniformly distributed.

(Temporarily ignoring your assertion to the contrary for the 18-24 age group.)

This is demonstrated as follows.

probability of grouped data.xlsx
ABCDEF
1age rangegroup freqage freqP(group)P(age)
218249713.85714317.02%2.43%
3253427927.90000048.95%4.89%
4354419419.40000034.04%3.40%
min 18-24
Rich (BB code):
Formulas:
D2: =C2 / (B2-A2+1)
E2: =C2 / $B$6
F2: =E2 / (B2-A2+1)
B6: =SUM(C2:C4)

Click on or hover the cursor over each cell to see formulas. Click the copy-to-clipboard icon in the upper-left under "f(x)", and paste into the indicated cells in an Excel worksheet.

P(group) in column E is the actual probability for each age group. P(age) in column F is the probability for each age in the group, again assuming a uniform distribution within each age group.

This is simplest assumption and calculation. It might be sufficient for some purposes. Only you can make that choice.

But remember: we have no reason to expect that distribution.

-----

For a second-level approximation , we might assume (for no good reason) that the frequency for each age is normally distributed.

A common approach is to assume (for no good reason) that the extreme ages -- 18 and 45 (44+1) -- represent -4sd and +4sd respectively, where "sd" is the standard deviation (std dev).

But if we did that, we would be very dissatisfied with the results. The number of visitors would be less than 1 for ages 18-21 and 41-44. And the number of visitors would be 15, 470 and 85 for the age groups 18-24, 25-34 and 35-44, which is very different from the sample data in C2:C4.

Instead, we might assume that the known data (ages for 570 visitors) fits a subregion [**] of the normal distribution, such that the calculated frequencies meet some criteria.

For example, we can use Solver [*] to determine the limits of the subregion -- about +/-1.79sd -- such that the number of visitors for the 18-24 age group is 97, the same as the sample data.

This is demonstrated as follows, in addition to the formulas demonstrated above.

[*] I do not know if we can determine the limits of the subregion algebraically. I do not have time to try. I use Solver for a quick solution.


[**] Previously, I referred to the subregion as a truncated normal distribution. That was incorrect. However, if you have a mimimum and/or maximum age for visitors, that would require a truncated normal distribution, which affects the probability distribution. LMK.

Formulas:
Rich (BB code):
B7: =B6 / (NORMSDIST(A39) - NORMSDIST(A12))
B8: =AVERAGE(A2, B4+1)
B9: =(B12 - B8) / A12
C11: =$B$7 * D11
D11: =NORMSDIST(A12)
A12: -1.78977590994222 (derived by Solver; YMMV)
D12: =NORMSDIST(A13) - NORMSDIST(A12)
B39: 45 (formatted to display ">44")
D39: =1 - NORMSDIST(A39)
C40: =SUM(C11:C39)
C41: =SUM(C12:C38)
Distribution summary:
E11: =SUMPRODUCT(B12:B38, D12:D38) / SUM(D12:D38)
E12: =SQRT(SUMPRODUCT((B12:B38 - E11)^2, C12:C38) / (C41-1))
E13: =SUMIFS($C$12:$C$38, $B$12:$B$38, ">="&A2, $B$12:$B$38, "<="&B2)
E16: =E13 / $C$41
Solver set-up:
Enter -4 into A12 to avoid Excel errors (#DIV/0) initially
Set objective: E13
To value: 97 (from C2)
By changing: A12
Deselect "Make unconstrained variables non-negative"
Solving method: GRG nonlinear

Notice the XL2BB scrollbar on the right.

It is difficult to know how much to explain. You might understand the formulas well enough on your own. Feel free to ask specific questions.

Note that the predicted number of visitors in E13:E15 is close to the sample data in C2:C4.

Also note that the number of visitors in the 23-24 age group -- SUM(C17:C18) = 40 -- is indeed more than the 18-19 age group -- SUM(C12:C13) = 16.

But remember: for all of the apparent precision of the method, this is merely an unreliable guess based on arbitrary assumption after assumption after.... ;)

And again, if Google Analytics can provide statistics based on each age instead of age groups, the discrete probability distribution might be more reliable, since it is not based on assumptions.
 
Last edited by a moderator:
Upvote 0

Forum statistics

Threads
1,214,833
Messages
6,121,868
Members
449,053
Latest member
Mesh

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top