probability of a range

DDT~123 · Jan 2, 2021

I use Google Analytics to see the age groups of visitors to my website, then use facebook ads to target those age groups. With facebook, I'm able to narrow down to specific ages rather than age groups. According to Google Analytics, my visitors are in these ranges:

Age Group	Count of Visitors
18-24	97
25-34	279
35-44	194

Obviously, a huge percentage of my visitors are between the ages of 25-34 but I'm trying to narrow down the other age groups. Let's say for the 18-24 age group, I'm more likely to have more visitors who are 23 and 24 years of age than I am 18 and 19 year olds. How would I determine probability of individual ages within the age groups?

Joe4 · Jan 2, 2021

DDT~123 said:
I'm more likely to have more visitors who are 23 and 24 years of age than I am 18 and 19 year olds. How would I determine probability of individual ages within the age groups?

I don't know that you can make that conclusion with any sort of certainty from the data provided. It does not give that level of detail.
You would have to make some assumptions, and then your probability would have a margin of error (and it would really be nothing more than a "guess").
If you had a few more data points, you might be able to make a "graph", which would be a little more accurate.
But still, since your ranges are in increments of 10 units, and you are trying to get down to a band of 2 units, it would be a very imprecise result, not one you could have very much confidence in.

joeu2004 · Jan 3, 2021

DDT~123 said:
I'm more likely to have more visitors who are 23 and 24 years of age than I am 18 and 19 year olds

Based on what data?!

If Google Analytics can break down the visitors by age (18, 19, etc) instead of age groups, the discrete probabilities for each age would be the number for that age divided by the total number.

For example, your data shows a total of 570 visitors. If 7 visitors were age 18, the estimated probability is 7/570 = 1.23%.

Of course, that is based on a "sample" of data. There is no way to know if that is a "representative" sample. But over time, the cumulative "sample" is likely to become more representative.
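As a minimal sketch of that calculation outside Excel (Python is my choice here, not something from the workbook): the group counts are the ones posted in the question, and the count of 7 for age 18 is only the hypothetical figure used above, not real data.

```python
# Group counts from Google Analytics, as posted in the question.
group_counts = {"18-24": 97, "25-34": 279, "35-44": 194}
total = sum(group_counts.values())

# Hypothetical per-age count (the "7 visitors were age 18" example above).
age_18_count = 7

# Discrete probability = count for that age / total count.
p_age_18 = age_18_count / total
print(f"total = {total}, P(age 18) = {p_age_18:.2%}")
```

With real per-age counts, the same division would be repeated for every age.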

-----

Beyond that, there really is not sufficient data to develop a reliable probability distribution.

We have no idea of what probability distribution to expect, other than your vague assertion about the 23-24 age group ("more likely" and "more").

Nevertheless, we can have some "fun" with the existing data, relying on (huge!) assumptions.

This is not unlike assumptions that US pollsters use to predict election outcomes. And based on their predictions for the 2020 presidential election, we can see just how "reliable" that is (not!).

Caveat: Possible TMI ahead. Proceed at your own risk.

-----

For a first-level approximation, we might assume that the frequency within each age group is uniformly distributed.

(Temporarily ignoring your assertion to the contrary for the 18-24 age group.)

This is demonstrated as follows.

probability of grouped data.xlsx

	A	B	C	D	E	F
1	age range		#age	avg #age	P(range)	P(age)
2	18	24	97	13.857143	17.02%	2.43%
3	25	34	279	27.900000	48.95%	4.89%
4	35	44	194
5				n		total
6			31.457018	wgtd avg

min 18-24

Rich (BB code):

Formulas:
D2: =C2 / (B2-A2+1)
E2: =C2 / $C$5
F2: =E2 / (B2-A2+1)
E5: =SUM(E2:E4)
C5: =SUM(C2:C4)
C6: =SUMPRODUCT((A2:A4 + B2:B4) / 2, C2:C4) / C5

Click on or hover the cursor over each cell to see formulas. Click the copy-to-clipboard icon in the upper-left under "f(x)", and paste into the indicated cells in an Excel worksheet.

P(range) in column E is the actual probability of each range. P(age) in column F is the guesstimated probability of each age in the range, again assuming a uniform distribution.

This is the simplest assumption and calculation. It might be sufficient for some purposes. Only you can make that choice.

But remember: we have no reason to expect that distribution.
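For anyone who prefers to check the uniform-distribution arithmetic outside Excel, here is a rough Python sketch of the same idea as the D2 and F2 formulas. The function name `uniform_age_table` is mine, purely for illustration.

```python
# Age groups from the question: (low age, high age, visitor count).
GROUPS = [(18, 24, 97), (25, 34, 279), (35, 44, 194)]

def uniform_age_table(groups):
    """Spread each group's count evenly over the ages it covers.

    Returns {age: (expected visitors, probability)}, mirroring columns D and F.
    """
    total = sum(count for _, _, count in groups)
    table = {}
    for low, high, count in groups:
        n_ages = high - low + 1  # inclusive range, e.g. 18-24 covers 7 ages
        for age in range(low, high + 1):
            table[age] = (count / n_ages, count / total / n_ages)
    return table

table = uniform_age_table(GROUPS)
for age in (18, 25, 35):
    freq, prob = table[age]
    print(f"age {age}: {freq:.6f} visitors, P = {prob:.2%}")
```

Every age inside a group gets the same value, which is exactly the weakness of this assumption: it cannot show 23-24 year olds outnumbering 18-19 year olds.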

-----

For a second-level approximation, we might assume that the frequency for each age is normally distributed.

But remember: even though that is a common assumption, again we have no reason to expect that distribution.

A common approach is to assume that the extreme ages -- 18 and 45 (44+1) -- represent -4sd and +4sd respectively, where "sd" is the standard deviation (std dev).

But if we did that, we would get very unsatisfying results. The number of visitors would be less than 1 for ages 18-21 and 41-44. And the number of visitors would be 15, 470 and 85 for the age groups 18-24, 25-34 and 35-44, which is much different from the sample data in C2:C4.
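The +/-4sd result can be checked with a short Python sketch. It assumes, as the text does, that age 18 starts at -4sd, age 45 (44+1) ends at +4sd, each age covers an equal slice of z, and the 570 visitors all fall inside that span. The helper names are mine, and `norm_cdf` stands in for Excel's NORMSDIST.

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal cumulative distribution (same role as NORMSDIST)."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def age_frequencies(z_low, first_age=18, last_age=44, total=570):
    """Expected visitors per age when first_age..last_age+1 spans z_low..-z_low."""
    n_ages = last_age - first_age + 1
    step = -2.0 * z_low / n_ages  # width of one age, in z units
    scale = total / (norm_cdf(-z_low) - norm_cdf(z_low))
    freqs = {}
    for i in range(n_ages):
        start = z_low + i * step
        freqs[first_age + i] = scale * (norm_cdf(start + step) - norm_cdf(start))
    return freqs

freqs = age_frequencies(-4.0)
groups = [(18, 24), (25, 34), (35, 44)]
group_freqs = {
    (low, high): sum(freqs[age] for age in range(low, high + 1))
    for low, high in groups
}
for (low, high), value in group_freqs.items():
    print(f"{low}-{high}: {value:.1f}")
```

The group totals it prints land near the 15, 470 and 85 quoted above, far from the observed 97, 279 and 194.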

Instead, we might assume that the age frequencies have a truncated normal distribution.

We use Solver [*] to determine that the limits of the truncated normal distribution are about +/-1.79sd, based on the goal number of visitors for the 18-24 age group (the smallest group) of 97, the same as the sample data.

[*] I do not know if we can determine the limits of the distribution algebraically. I do not have time to try. I use Solver for a quick solution.

This is demonstrated as follows, in addition to the formulas demonstrated above.

probability of grouped data.xlsx

	A	B	C	D	E	F	G	H	I	J
8	norm n	615.211887
9	delta-z	0.132576
10	z	age	#age	norm prob	round #age	discrete prob	discrete distrib			norm distrib
11	-1.789776	18						n
12		19						mean
13		20						sd	%error
14		21						#18-24
15		22						#25-34
16		23						#35-44
17		24						%18-24
18		25						%25-34
19		26						%35-44
20		27
21		28
22		29
23		30
24		31
25		32
26		33
27		34
28		35
29		36
30		37
31		38
32		39
33		40
34		41
35		42
36		43
37		44
38

min 18-24

Rich (BB code):

Formulas:
B8: =C5 / (NORMSDIST(A38) - NORMSDIST(A11))
B9: =(A38-A11) / COUNT(B11:B37)
A11: -1.78977590994222 (derived by Solver)
C11: =$B$8 * (NORMSDIST(A12) - NORMSDIST(A11))
D11: =(NORMSDIST(A12) - NORMSDIST(A11)) / (NORMSDIST($A$38) - NORMSDIST($A$11))
E11: =ROUND(C11, 0)
F11: =E11 / $G$11
A12: =A11 + $B$9
E12: =ROUND(SUM($C$11:C12) - SUM($E$11:E11), 0)
A38: =-A11
Discrete distribution summary:
G11: =SUM(E11:E37)
G12: =SUMPRODUCT(B11:B37, E11:E37) / G11
G13: =SQRT(SUMPRODUCT((B11:B37 - G12)^2, E11:E37) / (G11 - 1))
G14: =SUMIFS($E$11:$E$37, $B$11:$B$37, ">="&A2, $B$11:$B$37, "<="&B2)
G17: =SUMIFS($F$11:$F$37, $B$11:$B$37, ">="&A2, $B$11:$B$37, "<="&B2)
I14: =G14/C2 - 1
Normal distribution summary:
J12: =SUMPRODUCT(B11:B37, D11:D37)
J13: =SQRT(SUMPRODUCT((B11:B37 - J12)^2, D11:D37))
J14: =SUMIFS($C$11:$C$37, $B$11:$B$37, ">="&A2, $B$11:$B$37, "<="&B2)
J17: =SUMIFS($D$11:$D$37, $B$11:$B$37, ">="&A2, $B$11:$B$37, "<="&B2)
Solver set-up:
Enter -4 into A11 to avoid Excel errors (#DIV/0) initially
Set objective: J14
To value: 97
By changing: A11
Deselect "Make unconstrained variables non-negative"
Solving method: GRG nonlinear

Notice the XL2BB scrollbar on the right.

It is difficult to know how much needs to be explained. You might understand the formulas well enough on your own. Feel free to ask specific questions.

The discrete probability distribution (column F) is based on the rounded age distribution in column E.

Note that the predicted number of visitors (G14:G16) is close to the sample data in C2:C4.

Also note that the number of visitors in the 23-24 age group -- SUM(E16:E17) = 40 -- is indeed more than the 18-19 age group -- SUM(E11:E12) = 17.

That said, remember: for all of the apparent precision of the method, this is merely an unreliable guess based on assumption after assumption after....

Again, if Google Analytics can provide statistics based on each age, instead of age groups, that is a better source for a discrete probability distribution.

joeu2004 · Jan 3, 2021

Errata....

joeu2004 said:
J13: =SQRT(SUMPRODUCT((B11:B37 - J12)^2, D11:D37))

The formula in J13 should be:

Rich (BB code):

=SQRT(SUMPRODUCT((B11:B37 - J12)^2, C11:C37) / (G11 - 1))

I should also point out that the probabilities in columns D and F are conditional probabilities. They apply only if the age is between 18 and 44 inclusive.

They cannot be used to estimate probabilities outside the sample age groups.
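To make the "conditional" point concrete, here is a small Python sketch. It assumes the Solver result quoted in the formulas above (limits of about +/-1.79sd) and 27 equal z-steps for ages 18 to 44. The variable names are mine, and `norm_cdf` plays the role of NORMSDIST.

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal cumulative distribution (same role as NORMSDIST)."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

Z_LOW = -1.78977590994222  # Solver-derived lower limit quoted above
N_AGES = 27                # ages 18 through 44
step = -2.0 * Z_LOW / N_AGES

# Share of the whole normal curve that falls between ages 18 and 44.
p_in_range = norm_cdf(-Z_LOW) - norm_cdf(Z_LOW)

# Probability of age 18 measured against the whole curve...
p_18 = norm_cdf(Z_LOW + step) - norm_cdf(Z_LOW)

# ...and measured only against visitors known to be 18-44 (conditional).
p_18_given_range = p_18 / p_in_range

print(f"P(18-44)           = {p_in_range:.2%}")
print(f"P(age 18)          = {p_18:.2%}")
print(f"P(age 18 | 18-44)  = {p_18_given_range:.2%}")
```

The conditional figure is the larger of the two because it divides by the in-range share rather than by the whole curve.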

Joe4 · Jan 3, 2021

joeu2004,

That is one detailed explanation. Nicely done!

joeu2004 · Jan 3, 2021

Note: Ignore my previous responses #3 and 4. They are completely wrong and misleading.

Joe4 said:
Nicely done!

Thanks. But apparently not! I made some horrible mistakes in my zeal to "make things as simple as possible, and no simpler".

I have corrected the design, but I am having difficulty finding time to write a "replacement" response. I expect to post it later tonight (my time).

In the meantime, I did not want anyone to be misled by the incorrect design in my response #3 (modified by #4).

joeu2004 · Jan 4, 2021

Again, ignore my previous responses #3 and 4. Although the concepts are correct, there are major mistakes in the implementation. The following is intended to be a replacement.

DDT~123 said:
I'm more likely to have more visitors who are 23 and 24 years of age than I am 18 and 19 year olds.

Based on what data?!

If Google Analytics can break down the visitors by age (18, 19, etc) instead of age groups, the number of visitors for each age divided by the total number of visitors would be the discrete probabilities for each age.

For example, your data shows a total of 570 visitors. If 7 visitors were age 18, the estimated probability would be 7/570 = 1.23%.

Of course, that is based on an arbitrary sample of data. There is no way to know if that is a "representative" sample. But over time, the cumulative sample is likely to become more representative.

Moreover, the discrete probability distribution can only be used to predict the number of visitors for each age within the range of data in the sample -- 18 to 44, in your example. In effect, they are conditional probabilities. For example, among visitors of ages 18 to 44, we can expect 1.23% to be 18, which is the probability that a visitor is age 18.

The discrete probability distribution provides no way to predict the probability of a visitor of age 45, for example.

-----

DDT~123 said:
How would I determine probability of individual ages within the age groups?

There really is not sufficient data to develop a reliable probability distribution.

We have no idea what probability distribution to expect, other than your vague assertion about the 23-24 age group ("more likely" "more" than the 18-19 age group).

Nevertheless, we can have some "fun" with the existing data, relying on arbitrary assumptions (for no good reason). Hey, if it's good enough for POTUS, it's good enough for us.

This is not unlike assumptions that US pollsters use to predict election outcomes. And based on their predictions for the 2020 presidential election, we can see just how "reliable" that is (not!).

That said....

Caveat: Possible TMI ahead. Proceed at your own risk.

-----

For a first-level approximation, we might assume (for no good reason) that the frequency within each age group is uniformly distributed.

(Temporarily ignoring your assertion to the contrary for the 18-24 age group.)

This is demonstrated as follows.

probability of grouped data.xlsx

	A	B	C	D	E	F
1	age range		group freq	age freq	P(group)	P(age)
2	18	24	97	13.857143	17.02%	2.43%
3	25	34	279	27.900000	48.95%	4.89%
4	35	44	194	19.400000	34.04%	3.40%

min 18-24

Rich (BB code):

Formulas:
D2: =C2 / (B2-A2+1)
E2: =C2 / $B$6
F2: =E2 / (B2-A2+1)
B6: =SUM(C2:C4)

Click on or hover the cursor over each cell to see formulas. Click the copy-to-clipboard icon in the upper-left under "f(x)", and paste into the indicated cells in an Excel worksheet.

P(group) in column E is the actual probability for each age group. P(age) in column F is the probability for each age in the group, again assuming a uniform distribution within each age group.

This is the simplest assumption and calculation. It might be sufficient for some purposes. Only you can make that choice.

But remember: we have no reason to expect that distribution.

-----

For a second-level approximation, we might assume (for no good reason) that the frequency for each age is normally distributed.

A common approach is to assume (for no good reason) that the extreme ages -- 18 and 45 (44+1) -- represent -4sd and +4sd respectively, where "sd" is the standard deviation (std dev).

But if we did that, we would be very dissatisfied with the results. The number of visitors would be less than 1 for ages 18-21 and 41-44. And the number of visitors would be 15, 470 and 85 for the age groups 18-24, 25-34 and 35-44, which is very different from the sample data in C2:C4.

Instead, we might assume that the known data (ages for 570 visitors) fits a subregion [**] of the normal distribution, such that the calculated frequencies meet some criteria.

For example, we can use Solver [*] to determine the limits of the subregion -- about +/-1.79sd -- such that the number of visitors for the 18-24 age group is 97, the same as the sample data.

This is demonstrated as follows, in addition to the formulas demonstrated above.

[*] I do not know if we can determine the limits of the subregion algebraically. I do not have time to try. I use Solver for a quick solution.

[**] Previously, I referred to the subregion as a truncated normal distribution. That was incorrect. However, if you have a minimum and/or maximum age for visitors, that would require a truncated normal distribution, which affects the probability distribution. LMK.

probability of grouped data.xlsx

	A	B	C	D	E	F
7	norm n	615.211887
8	norm mean	31.500000
9	norm sd	7.542844
10	z	age	freq	P(age)	summary
11		<18	22.605943	3.67%	31.000000	wgtd avg
12	-1.789776	18	7.379196	1.20%	6.270528	wgtd sd
13		19				18-24 freq
14		20				25-34 freq
15		21				35-44 freq
16		22				P(18-24)
17		23				P(25-34)
18		24				P(35-44)
19		25
20		26
21		27
22		28
23		29
24		30
25		31
26		32
27		33
28		34
29		35
30		36
31		37
32		38
33		39
34		40
35		41
36		42
37		43
38		44
39		>44
40		total	615.211887	100.00%
41		18-44	570.000000	92.65%

min 18-24

Formulas:

Rich (BB code):

B7: =B6 / (NORMSDIST(A39) - NORMSDIST(A12))
B8: =AVERAGE(A2, B4+1)
B9: =(B12 - B8) / A12
C11: =$B$7 * D11
D11: =NORMSDIST(A12)
A12: -1.78977590994222 (derived by Solver; YMMV)
D12: =NORMSDIST(A13) - NORMSDIST(A12)
B39: 45 (formatted to display ">44")
D39: =1 - NORMSDIST(A39)
C40: =SUM(C11:C39)
C41: =SUM(C12:C38)
Distribution summary:
E11: =SUMPRODUCT(B12:B38, D12:D38) / SUM(D12:D38)
E12: =SQRT(SUMPRODUCT((B12:B38 - E11)^2, C12:C38) / (C41-1))
E13: =SUMIFS($C$12:$C$38, $B$12:$B$38, ">="&A2, $B$12:$B$38, "<="&B2)
E16: =E13 / $C$41
Solver set-up:
Enter -4 into A12 to avoid Excel errors (#DIV/0) initially
Set objective: E13
To value: 97 (from C2)
By changing: A12
Deselect "Make unconstrained variables non-negative"
Solving method: GRG nonlinear

Notice the XL2BB scrollbar on the right.

It is difficult to know how much to explain. You might understand the formulas well enough on your own. Feel free to ask specific questions.

Note that the predicted number of visitors in E13:E15 is close to the sample data in C2:C4.

Also note that the number of visitors in the 23-24 age group -- SUM(C17:C18) = 40 -- is indeed more than the 18-19 age group -- SUM(C12:C13) = 16.
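If you do not have Solver handy, the same search can be sketched in Python. This is only my stand-in for Solver's GRG method: a plain bisection on the lower z limit, assuming ages 18 to 44 map onto 27 equal z-steps, the 570 visitors all fall inside the subregion, and the mean is 31.5 as in B8. The function names are mine.

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal cumulative distribution (same role as NORMSDIST)."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def group_freq(z_low, low=18, high=24, first_age=18, last_age=44, total=570):
    """Expected visitors aged low..high when the data spans z_low..-z_low."""
    n_ages = last_age - first_age + 1
    step = -2.0 * z_low / n_ages
    start = z_low + (low - first_age) * step
    end = z_low + (high + 1 - first_age) * step
    in_range = norm_cdf(-z_low) - norm_cdf(z_low)
    return total * (norm_cdf(end) - norm_cdf(start)) / in_range

def solve_z_low(target=97.0, left=-4.0, right=-0.01, tol=1e-10):
    """Bisection: find z_low where the 18-24 frequency equals the target.

    Assumes group_freq(left) < target < group_freq(right).
    """
    while right - left > tol:
        mid = (left + right) / 2.0
        if group_freq(mid) < target:
            left = mid
        else:
            right = mid
    return (left + right) / 2.0

z_low = solve_z_low()
norm_sd = (18 - 31.5) / z_low  # same arithmetic as the B9 formula
print(f"z_low = {z_low:.6f}, norm sd = {norm_sd:.6f}")
print(f"18-19: {group_freq(z_low, 18, 19):.2f}")
print(f"23-24: {group_freq(z_low, 23, 24):.2f}")
```

It should land on roughly the same lower limit that Solver found, and it shows the same ordering of the 23-24 and 18-19 groups.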

But remember: for all of the apparent precision of the method, this is merely an unreliable guess based on arbitrary assumption after assumption after....

And again, if Google Analytics can provide statistics based on each age instead of age groups, the discrete probability distribution might be more reliable, since it is not based on assumptions.
