Identifying missing value for gender column based on previous known values in transformation phase

absherzad · Jun 19, 2017

Hi everyone,
Here is my question:
I have two columns "Name", and "Gender". There are duplicate names, for some, the value for gender is specified, but for some, the value is missing.
The goal is to identify the missing ones based on other occurrences of the same name where the gender value is available.

Marcelo Branco provided the solution in Excel, which is very appreciated (Excel solution)
I am curious to know if it is possible with power query during the transformation phase.

The screenshot for better clarification.

MarcelBeug · Jun 19, 2017

Yes it is: you can use Group By, with aggregation "All Rows" and then adjust the generated code to have "Gender" filled down and filled up.
In this solution I also used the table type of "Source" for the nested table, so the datatype (text) is taken over to the nested tables.

Code:

let
    Source = Table1,
    #"Grouped Rows" = Table.Group(Source, {"Name"}, {{"AllData", each Table.FillUp(Table.FillDown(_,{"Gender"}),{"Gender"}), Value.Type(Source)}}),
    #"Expanded AllData" = Table.ExpandTableColumn(#"Grouped Rows", "AllData", {"Gender"}, {"Gender"})
in
    #"Expanded AllData"

absherzad · Jun 19, 2017

Thank you in advance for your response and solution. It is working like a charm

Another question, is it also possible to do the steps using (only) the GUI, without altering the generated code?

MarcelBeug said:
Yes it is: you can use Group By, with aggregation "All Rows" and then adjust the generated code to have "Gender" filled down and filled up.
In this solution I also used the table type of "Source" for the nested table, so the datatype (text) is taken over to the nested tables.

Code:

let Source = Table1, #"Grouped Rows" = Table.Group(Source, {"Name"}, {{"AllData", each Table.FillUp(Table.FillDown(_,{"Gender"}),{"Gender"}), Value.Type(Source)}}), #"Expanded AllData" = Table.ExpandTableColumn(#"Grouped Rows", "AllData", {"Gender"}, {"Gender"}) in #"Expanded AllData"

MarcelBeug · Jun 19, 2017

Only with a slightly different approach: group by "Name", taking the minimum (or maximum) Gender plus aggregation "All Rows" (this is only required to expand the table back to its original size). Now expand the Gender column from the nested table (it is automatically renamed to Gender.1) and remove this column.

Code:

let
    Source = Table1,
    #"Grouped Rows" = Table.Group(Source, {"Name"}, {{"Gender", each List.Min([Gender]), type text}, {"AllData", each _, type table}}),
    #"Expanded AllData" = Table.ExpandTableColumn(#"Grouped Rows", "AllData", {"Gender"}, {"Gender.1"}),
    #"Removed Columns" = Table.RemoveColumns(#"Expanded AllData",{"Gender.1"})
in
    #"Removed Columns"

absherzad · Jun 19, 2017

Thank you very much for another wonderful approach.

absherzad · Jul 14, 2017

Hi MarcelBeug,
Following the previous discussion in this thread. Is this practical to auto fill (only) the missing value for the gender column based on the maximum existing values for that name? Please refer the attached screenshot for further clarification.

The following m code auto fill based on the maximum existing values for that name including both the missing values and the known values, as illustrated in the following attached screenshot:

Code:

let
    Source = Excel.CurrentWorkbook(){[Name="RawData"]}[Content],
    #"Grouped Rows" = Table.Group(Source, {"Name"}, {{"Gender", each List.Max([Gender]), type text}, {"AllData", each _, type table}}),
    #"Expanded AllData" = Table.ExpandTableColumn(#"Grouped Rows", "AllData", {"Gender"}, {"Gender.1"}),
    #"Removed Columns" = Table.RemoveColumns(#"Expanded AllData",{"Gender.1"})
in
    #"Removed Columns"

Regards,

MarcelBeug · Jul 15, 2017

Probably you mean the maximum number of occurrences of existing values.

My suggestion would be to create a separate query that determines the value with the maximum number of occurrences and then merge that with the original table.

Query FillValues:

Code:

let
    Source = Table4,
    #"Grouped Rows" = Table.Group(Source, {"Name", "Gender"}, {{"Count", each Table.RowCount(_), type number}}),
    #"Filtered Rows" = Table.SelectRows(#"Grouped Rows", each ([Gender] <> null)),
    #"Grouped Rows1" = Table.Group(#"Filtered Rows", {"Name"}, {{"AllData", each Table.Max(_, each [Count]), type record}}),
    #"Expanded AllData" = Table.ExpandRecordColumn(#"Grouped Rows1", "AllData", {"Gender"}, {"Gender"})
in
    #"Expanded AllData"

I manually adjusted the code in step #"Grouped Rows1".

Merge query:

Code:

let
    Source = Table.NestedJoin(Table4,{"Name"},FillValues,{"Name"},"NewColumn",JoinKind.LeftOuter),
    #"Expanded NewColumn" = Table.ExpandTableColumn(Source, "NewColumn", {"Gender"}, {"NewColumn.Gender"}),
    #"Added Custom" = Table.AddColumn(#"Expanded NewColumn", "Custom", each if [Gender] = null then [NewColumn.Gender] else [Gender]),
    #"Removed Columns" = Table.RemoveColumns(#"Added Custom",{"Gender", "NewColumn.Gender"}),
    #"Renamed Columns" = Table.RenameColumns(#"Removed Columns",{{"Custom", "Gender"}})
in
    #"Renamed Columns"

absherzad · Jan 26, 2018

Thank you for your reply.
There are 100 thousands of instances in the dataset. But it seems the "Power Query" do not read all the instances to choose the names with the maximum number of occurrences and only sees the first 1,000 instances, I assume.

Any solution for the mentioned issue?

MarcelBeug said:

Probably you mean the maximum number of occurrences of existing values.

My suggestion would be to create a separate query that determines the value with the maximum number of occurrences and then merge that with the original table.

Query FillValues:

Code:

let
    Source = Table4,
    #"Grouped Rows" = Table.Group(Source, {"Name", "Gender"}, {{"Count", each Table.RowCount(_), type number}}),
    #"Filtered Rows" = Table.SelectRows(#"Grouped Rows", each ([Gender] <> null)),
    #"Grouped Rows1" = Table.Group(#"Filtered Rows", {"Name"}, {{"AllData", each Table.Max(_, each [Count]), type record}}),
    #"Expanded AllData" = Table.ExpandRecordColumn(#"Grouped Rows1", "AllData", {"Gender"}, {"Gender"})
in
    #"Expanded AllData"

I manually adjusted the code in step #"Grouped Rows1".

Merge query:

Code:

let
    Source = Table.NestedJoin(Table4,{"Name"},FillValues,{"Name"},"NewColumn",JoinKind.LeftOuter),
    #"Expanded NewColumn" = Table.ExpandTableColumn(Source, "NewColumn", {"Gender"}, {"NewColumn.Gender"}),
    #"Added Custom" = Table.AddColumn(#"Expanded NewColumn", "Custom", each if [Gender] = null then [NewColumn.Gender] else [Gender]),
    #"Removed Columns" = Table.RemoveColumns(#"Added Custom",{"Gender", "NewColumn.Gender"}),
    #"Renamed Columns" = Table.RenameColumns(#"Removed Columns",{{"Custom", "Gender"}})
in
    #"Renamed Columns"

Identifying missing value for gender column based on previous known values in transformation phase

absherzad

New Member

Excel Facts

MarcelBeug

Well-known Member

absherzad

New Member

MarcelBeug

Well-known Member

absherzad

New Member

absherzad

New Member

MarcelBeug

Well-known Member

absherzad

New Member

Similar threads

Forum statistics

Share this page

Identifying missing value for gender column based on previous known values in transformation phase

New Member

Excel Facts

Well-known Member

New Member

Well-known Member

New Member

New Member

Well-known Member

New Member

Similar threads

Forum statistics

Share this page

We've detected that you are using an adblocker.

Which adblocker are you using?

Disable AdBlock

Disable AdBlock Plus

Disable uBlock Origin

Disable uBlock