Again, ignore my previous responses #3 and 4. Although the concepts are correct, there are major mistakes in the implementation. The following is intended to be a replacement.
I'm more likely to have more visitors who are 23 and 24 years of age than I am 18 and 19 year olds.
Based on what data?!
If Google Analytics can break down the visitors by age (18, 19, etc) instead of age groups, the
number of visitors for each age divided by the total number of visitors would be the
discrete probabilities for each age.
For example, your data shows a total of 570 visitors. If 7 visitors were age 18, the estimated probability would be 7/570 = 1.23%.
Of course, that is based on an arbitrary sample of data. There is no way to know if that is a "representative" sample. But over time, the
cumulative sample is likely to become more representative.
Moreover, the discrete probability distribution can only be used to predict the number of visitors for each age within the range of data in the sample -- 18 to 44, in your example. In effect, they are
conditional probabilities. For example,
among visitors of ages 18 to 44, we can expect 1.23% to be 18, which is the probability that a visitor is age 18.
The discrete probability distribution provides no way to predict the probability of a visitor of age 45, for example.
-----
How would I determine probability of individual ages within the age groups?
There really is not sufficient data to develop a reliable probability distribution.
We have no idea what probability distribution to expect, other than your vague assertion about the 23-24 age group ("more likely" "more" than the 18-19 age group).
Nevertheless, we can have some "fun" with the existing data, relying on
arbitrary assumptions (for no good reason). Hey, if it's good enough for POTUS, it's good enough for us.
This is not unlike assumptions that US pollsters use to predict election outcomes. And based on their predictions for the 2020 presidential election, we can see just how "reliable" that is (not!).
That said....
Caveat: Possible TMI ahead. Proceed at your own risk.
-----
For a first-level approximation, we might assume (for no good reason) that the frequency within each age group is
uniformly distributed.
(Temporarily ignoring your assertion to the contrary for the 18-24 age group.)
This is demonstrated as follows.
Rich (BB code):
Formulas:
D2: =C2 / (B2-A2+1)
E2: =C2 / $B$6
F2: =E2 / (B2-A2+1)
B6: =SUM(C2:C4)
Click on or hover the cursor over each cell to see formulas. Click the copy-to-clipboard icon in the upper-left under "f(x)", and paste into the indicated cells in an Excel worksheet.
P(group) in column E is the actual probability for each age group. P(age) in column F is the probability for each age in the group, again assuming a uniform distribution within each age group.
This is simplest assumption and calculation. It might be sufficient for some purposes. Only you can make that choice.
But remember: we have no reason to expect that distribution.
-----
For a second-level approximation , we might assume (for no good reason) that the frequency for each age is
normally distributed.
A common approach is to assume (for no good reason) that the extreme ages -- 18 and 45 (44+1) -- represent -4sd and +4sd respectively, where "sd" is the standard deviation (std dev).
But if we did that, we would be very dissatisfied with the results. The number of visitors would be less than 1 for ages 18-21 and 41-44. And the number of visitors would be 15, 470 and 85 for the age groups 18-24, 25-34 and 35-44, which is very different from the sample data in C2:C4.
Instead, we might assume that the known data (ages for 570 visitors) fits a subregion [**] of the normal distribution, such that the calculated frequencies meet some criteria.
For example, we can use Solver [*] to determine the limits of the subregion -- about +/-1.79sd -- such that the number of visitors for the 18-24 age group is 97, the same as the sample data.
This is demonstrated as follows, in addition to the formulas demonstrated above.
[*] I do not know if we can determine the limits of the subregion algebraically. I do not have time to try. I use Solver for a quick solution.
[**] Previously, I referred to the subregion as a truncated normal distribution. That was incorrect. However, if you have a mimimum and/or maximum age for visitors, that would require a truncated normal distribution, which affects the probability distribution. LMK.
Formulas:
Rich (BB code):
B7: =B6 / (NORMSDIST(A39) - NORMSDIST(A12))
B8: =AVERAGE(A2, B4+1)
B9: =(B12 - B8) / A12
C11: =$B$7 * D11
D11: =NORMSDIST(A12)
A12: -1.78977590994222 (derived by Solver; YMMV)
D12: =NORMSDIST(A13) - NORMSDIST(A12)
B39: 45 (formatted to display ">44")
D39: =1 - NORMSDIST(A39)
C40: =SUM(C11:C39)
C41: =SUM(C12:C38)
Distribution summary:
E11: =SUMPRODUCT(B12:B38, D12:D38) / SUM(D12:D38)
E12: =SQRT(SUMPRODUCT((B12:B38 - E11)^2, C12:C38) / (C41-1))
E13: =SUMIFS($C$12:$C$38, $B$12:$B$38, ">="&A2, $B$12:$B$38, "<="&B2)
E16: =E13 / $C$41
Solver set-up:
Enter -4 into A12 to avoid Excel errors (#DIV/0) initially
Set objective: E13
To value: 97 (from C2)
By changing: A12
Deselect "Make unconstrained variables non-negative"
Solving method: GRG nonlinear
Notice the XL2BB scrollbar on the right.
It is difficult to know how much to explain. You might understand the formulas well enough on your own. Feel free to ask specific questions.
Note that the predicted number of visitors in E13:E15 is close to the sample data in C2:C4.
Also note that the number of visitors in the 23-24 age group -- SUM(C17:C18) = 40 -- is indeed more than the 18-19 age group -- SUM(C12:C13) = 16.
But remember: for all of the apparent precision of the method, this is merely an unreliable guess based on arbitrary assumption after assumption after....
And again, if Google Analytics can provide statistics based on each age instead of age groups, the discrete probability distribution might be more reliable, since it is not based on assumptions.