Identifying missing value for gender column based on previous known values in transformation phase

absherzad

New Member
Joined
Jun 19, 2017
Messages
42
Hi everyone,
Here is my question:
I have two columns "Name", and "Gender". There are duplicate names, for some, the value for gender is specified, but for some, the value is missing.
The goal is to identify the missing ones based on other occurrences of the same name where the gender value is available.

Marcelo Branco provided the solution in Excel, which is very appreciated (Excel solution)
I am curious to know if it is possible with power query during the transformation phase.

The screenshot for better clarification.
 

Excel Facts

Format cells as date
Select range and press Ctrl+Shift+3 to format cells as date. (Shift 3 is the # sign which sort of looks like a small calendar).
Yes it is: you can use Group By, with aggregation "All Rows" and then adjust the generated code to have "Gender" filled down and filled up.
In this solution I also used the table type of "Source" for the nested table, so the datatype (text) is taken over to the nested tables.

Code:
let
    Source = Table1,
    #"Grouped Rows" = Table.Group(Source, {"Name"}, {{"AllData", each Table.FillUp(Table.FillDown(_,{"Gender"}),{"Gender"}), Value.Type(Source)}}),
    #"Expanded AllData" = Table.ExpandTableColumn(#"Grouped Rows", "AllData", {"Gender"}, {"Gender"})
in
    #"Expanded AllData"
 
Upvote 0
Thank you in advance for your response and solution. It is working like a charm :)
Another question, is it also possible to do the steps using (only) the GUI, without altering the generated code?

Yes it is: you can use Group By, with aggregation "All Rows" and then adjust the generated code to have "Gender" filled down and filled up.
In this solution I also used the table type of "Source" for the nested table, so the datatype (text) is taken over to the nested tables.

Code:
let
    Source = Table1,
    #"Grouped Rows" = Table.Group(Source, {"Name"}, {{"AllData", each Table.FillUp(Table.FillDown(_,{"Gender"}),{"Gender"}), Value.Type(Source)}}),
    #"Expanded AllData" = Table.ExpandTableColumn(#"Grouped Rows", "AllData", {"Gender"}, {"Gender"})
in
    #"Expanded AllData"
 
Upvote 0
Only with a slightly different approach: group by "Name", taking the minimum (or maximum) Gender plus aggregation "All Rows" (this is only required to expand the table back to its original size). Now expand the Gender column from the nested table (it is automatically renamed to Gender.1) and remove this column.

Code:
let
    Source = Table1,
    #"Grouped Rows" = Table.Group(Source, {"Name"}, {{"Gender", each List.Min([Gender]), type text}, {"AllData", each _, type table}}),
    #"Expanded AllData" = Table.ExpandTableColumn(#"Grouped Rows", "AllData", {"Gender"}, {"Gender.1"}),
    #"Removed Columns" = Table.RemoveColumns(#"Expanded AllData",{"Gender.1"})
in
    #"Removed Columns"
 
Upvote 0
Hi MarcelBeug,
Following the previous discussion in this thread. Is this practical to auto fill (only) the missing value for the gender column based on the maximum existing values for that name? Please refer the attached screenshot for further clarification.

The following m code auto fill based on the maximum existing values for that name including both the missing values and the known values, as illustrated in the following attached screenshot:
Code:
let
    Source = Excel.CurrentWorkbook(){[Name="RawData"]}[Content],
    #"Grouped Rows" = Table.Group(Source, {"Name"}, {{"Gender", each List.Max([Gender]), type text}, {"AllData", each _, type table}}),
    #"Expanded AllData" = Table.ExpandTableColumn(#"Grouped Rows", "AllData", {"Gender"}, {"Gender.1"}),
    #"Removed Columns" = Table.RemoveColumns(#"Expanded AllData",{"Gender.1"})
in
    #"Removed Columns"

Regards,
 
Upvote 0
Probably you mean the maximum number of occurrences of existing values.

My suggestion would be to create a separate query that determines the value with the maximum number of occurrences and then merge that with the original table.

Query FillValues:
Code:
let
    Source = Table4,
    #"Grouped Rows" = Table.Group(Source, {"Name", "Gender"}, {{"Count", each Table.RowCount(_), type number}}),
    #"Filtered Rows" = Table.SelectRows(#"Grouped Rows", each ([Gender] <> null)),
    #"Grouped Rows1" = Table.Group(#"Filtered Rows", {"Name"}, {{"AllData", each Table.Max(_, each [Count]), type record}}),
    #"Expanded AllData" = Table.ExpandRecordColumn(#"Grouped Rows1", "AllData", {"Gender"}, {"Gender"})
in
    #"Expanded AllData"

I manually adjusted the code in step #"Grouped Rows1".

Merge query:

Code:
let
    Source = Table.NestedJoin(Table4,{"Name"},FillValues,{"Name"},"NewColumn",JoinKind.LeftOuter),
    #"Expanded NewColumn" = Table.ExpandTableColumn(Source, "NewColumn", {"Gender"}, {"NewColumn.Gender"}),
    #"Added Custom" = Table.AddColumn(#"Expanded NewColumn", "Custom", each if [Gender] = null then [NewColumn.Gender] else [Gender]),
    #"Removed Columns" = Table.RemoveColumns(#"Added Custom",{"Gender", "NewColumn.Gender"}),
    #"Renamed Columns" = Table.RenameColumns(#"Removed Columns",{{"Custom", "Gender"}})
in
    #"Renamed Columns"
 
Upvote 0
Thank you for your reply.
There are 100 thousands of instances in the dataset. But it seems the "Power Query" do not read all the instances to choose the names with the maximum number of occurrences and only sees the first 1,000 instances, I assume.

Any solution for the mentioned issue?

Probably you mean the maximum number of occurrences of existing values.

My suggestion would be to create a separate query that determines the value with the maximum number of occurrences and then merge that with the original table.

Query FillValues:
Code:
let
    Source = Table4,
    #"Grouped Rows" = Table.Group(Source, {"Name", "Gender"}, {{"Count", each Table.RowCount(_), type number}}),
    #"Filtered Rows" = Table.SelectRows(#"Grouped Rows", each ([Gender] <> null)),
    #"Grouped Rows1" = Table.Group(#"Filtered Rows", {"Name"}, {{"AllData", each Table.Max(_, each [Count]), type record}}),
    #"Expanded AllData" = Table.ExpandRecordColumn(#"Grouped Rows1", "AllData", {"Gender"}, {"Gender"})
in
    #"Expanded AllData"

I manually adjusted the code in step #"Grouped Rows1".

Merge query:

Code:
let
    Source = Table.NestedJoin(Table4,{"Name"},FillValues,{"Name"},"NewColumn",JoinKind.LeftOuter),
    #"Expanded NewColumn" = Table.ExpandTableColumn(Source, "NewColumn", {"Gender"}, {"NewColumn.Gender"}),
    #"Added Custom" = Table.AddColumn(#"Expanded NewColumn", "Custom", each if [Gender] = null then [NewColumn.Gender] else [Gender]),
    #"Removed Columns" = Table.RemoveColumns(#"Added Custom",{"Gender", "NewColumn.Gender"}),
    #"Renamed Columns" = Table.RenameColumns(#"Removed Columns",{{"Custom", "Gender"}})
in
    #"Renamed Columns"
 
Upvote 0

Forum statistics

Threads
1,214,915
Messages
6,122,217
Members
449,074
Latest member
cancansova

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top