Get random top n percent by group

Darren Bartrup

Well-known Member
Joined
Mar 13, 2006
Messages
1,296
Office Version
  1. 365
Platform
  1. Windows
Hi all,

I need to extract a percentage of records for each team member from an original raw data table.

My raw data doesn't have a natural key, but can add an auto-number if needed.

All the fields, except date, are text fields in a table called tbl_PrimaryData:
dDate, sProcess, sPropID, sPropAddress, sSurname and sForename.

The sSurname & sForename combination are pushed to a separate table called tbl_TeamMember which has these fields: ID, Surname, Forename and SampleSize. SampleSize is a double formatted as a percent and is manually entered - usually set at 5%.
The SQL for this is:
Code:
INSERT INTO tbl_TeamMember ( Surname, Forename )
SELECT DISTINCT sSurname, sForename
FROM tbl_PrimaryData
WHERE sSurname & sForename NOT IN (SELECT sSurname & sForename FROM tbl_TeamMember);

So now I want to extract the SampleSize of records from tbl_PrimaryData for each team member and place the results in a table called tbl_Audits.

Both team member names and process text will be pushed out to separate tables to normalise the data.

If anyone has any thoughts on this please - I'm not having much luck. :(

Edit:
SQL like the below will give me random ordering of the records grouped by team member, but I can't seem to restrict it to the first n percentage.
Code:
SELECT      dDate,
            sProcess,
            sPropID,
            sPropAddress,
            sSurname,
            sForename
FROM        tbl_PrimaryData
ORDER BY    sSurname, sForename, RND(dDate)
 
Last edited:

Excel Facts

Control Word Wrap
Press Alt+Enter to move to a new row in a cell. Lets you control where the words wrap.
In general, you can usually use a MOD function to get every k-th record where k is some constant:

Select A.* FROM
(
--your query--
) AS A
WHERE A mod K = 0

If the ordering within the subquery can be also randomized then much better (like by day of week, or last name, or third letter of backwards last name or anything else that randomizes best. K should be chosen to get the number of records you want in the results (larger K equals fewer records, smaller K equals more records).

Note:
So, that was pseudocode. So MOD is probably a function in Access-world, and probably an operator in other worlds. Use the correct syntax as needed. Also this is a generic answer to the problem of randomization. Please clarify as needed!!
 
Last edited:
Upvote 0
Thanks Xenou. I'll have a play around and post back if I get the solution - think the correct syntax is A MOD K as in SELECT 5 MOD 10 returns 5.
 
Upvote 0
Good. I think I forgot to mention that in some cases I might try to number the records in some way (i.e., if necessary insert into a temp table, then use an autonumbered field as the one to push into the mod function). I guess I've gotten used to this with SQL Server where I can create the numbering on the fly in the query ... starting to lose my old Access expertise. If there is a candidate field (with sufficient randomness) then possibly an existing field (or partof an existing field) can be used for that.
 
Upvote 0
Slowly getting there - the requirements have changed a little since I realised I was forgetting one small point:
I need to extract a percentage of records for each team member, for each week commencing date from an original raw data table.

My original raw data table (tbl_PrimaryData) has:
  • ID - AutoNumber
  • dDate - Date
  • sProcess - Text
  • sPropCoID - Text
  • sPropertyAddress - Text
  • sSurname - Text
  • sForename - Text

A couple of Append queries execute to push sProcess out to tbl_Process and sSurname & SForename to tbl_TeamMember.

I then use this query to create a secondary primary data table (tbl_PrimaryData_Step1):
Code:
SELECT      CLng(PD.ID) AS ID, 
            IIf(Weekday([dDate],2)=1,[dDate],[dDate]-Weekday([dDate],3)) AS dWeekCommencing, 
            PD.dDate, 
            CLng(PR.ID) AS ProcessID, 
            PD.sPropCoID, 
            PD.sPropertyAddress, 
            CLng(TM.ID) AS TeamMemberID 
INTO        tbl_PrimaryData_Step1
FROM        (tbl_PrimaryData AS PD LEFT JOIN tbl_Process AS PR ON PD.sProcess = PR.Process) 
                LEFT JOIN tbl_TeamMember AS TM ON (PD.sForename = TM.Forename) AND 
                                                  (PD.sSurname = TM.Surname)

Another query updates a table (tbl_TeamMemberSampleSize) holding the total number of records per week per team member, and the required sample size.
These columns are:
  • TMID - Team Member ID (L Integer)
  • WeekCommencing - date
  • TotalSample - (L Integer)
  • SampleSize - (Dbl Percentage)
  • AuditSize - (L Integer)

For example, the first record here is:
  • 1 = TMID
  • 24/08/2015 = WeekCommencing
  • 56 = TotalSample
  • 5.00% - SampleSize
  • 3 - AuditSize

My final query so far is:
Code:
SELECT         ID, 
            dWeekCommencing, 
            TeamMemberID
FROM         tbl_PrimaryData_Step1 AS PD INNER JOIN tbl_TeamMemberSampleSize AS TMSS 
                ON (PD.dWeekCommencing = TMSS.WeekCommencing) AND (PD.TeamMemberID = TMSS.TMID)
WHERE         ID MOD AuditSize = 0

This doesn't return the correct number of records (and I wasn't expecting it to).
Based on the example above it gives 3, 6, 9, 12, etc - i.e. where the audit size fits exactly into the ID number.

So this gives me a random (enough for my requirements) set of records which I think will always exceed the number of records required.

So now I just need to return the first x records per group - based on the audit size. I've seen a query somewhere which does a self-join and counts how many records to return, but can't find it now.
 
Upvote 0
first x should be easy (right?).

Just:

Code:
Select TOP 100 A.* 
FROM
    (
        your query here
    ) AS A
 
Upvote 0
I wish it was that easy - but can't use a parameter in the SQL to select the top x.

It would have to use something like SELECT TOP (SELECT AuditSize FROM tbl_TeamMemberSampleSize WHERE....).* FROM (......) which Access can't do.
I've seen the solution somewhere - just wish I could remember where.
 
Upvote 0
Finally figured it out. Found an example that worked here: SQL(DML) help: How to select TOP N rows from a table for each group?

I've had to update the previous query to it - ID MOD AuditSize = 0 wasn't returning enough records, so I've updated it to ID MOD 2 = 0 to return every other record - might still need to work on that bit.

So my main table for this query has 3 columns - ID, dWeekCommencing and TeamMemberID.
tbl_TeamMemberSampleSize tells me how many records to return for each team member per week so this is referenced in the HAVING clause.

I can now link the ID back to the rest of the data I need for each team member. :)

Code:
SELECT      T1.ID, 
			T1.dWeekCommencing, 
			T1.TeamMemberID
FROM        qry_AuditSample_Step1 T1 INNER JOIN qry_AuditSample_Step1 T2
                ON T1.dWeekCommencing = T2.dWeekCommencing AND
                   T1.TeamMemberID = T2.TeamMemberID AND
                   T1.ID >= T2.ID
GROUP BY    T1.ID,
            T1.dWeekCommencing,
            T1.TeamMemberID
HAVING      COUNT(*) <= (
                            SELECT  AuditSize
                            FROM    tbl_TeamMemberSampleSize
                            WHERE   TMID = T1.TeamMemberID AND
                                    WeekCommencing = T1.dWeekCommencing
                        )
ORDER BY    T1.ID, T1.dWeekCommencing, T1.TeamMemberID
 
Upvote 0
Sounds good. Overall, it also seems like using a little vba code could also smooth out the rough spots (if getting away from pure SQL is not a problem). Still, sounds like problem is getting solved quickly now!
 
Upvote 0

Forum statistics

Threads
1,213,565
Messages
6,114,337
Members
448,568
Latest member
Honeymonster123

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top