Power Query - Remove duplicates from Folder Download

pjmorris

Well-known Member
Joined
Aug 2, 2012
Messages
1,905
Office Version
2016
Platform
Windows
Hi,

I'm creating a personal account analysis system and using the 'From File, From Folder' process in Power Query. The issue I'm having is to identify where two downloads from a bank or credit card account cover the same period and hence have duplicate entries. In doing so I want to avoid removing genuine duplicate transactions. For example:

Download File 1 includes two transaction on 10th June for £12 at Weatherspoons.
Download File 2 covers the same period by accident and also has the two transactions for £12 at Weatherspoons on 10th June.

A simple removal of duplicates will result in only one transaction, which is incorrect. Including the file download file name will result in two entries, but will also result in two entries if I had only a single transaction that occurs in both downloads. I believe the definition of a duplicate is where:

The date, Amount and Payee are all identical, but the Download File Name is different. I know its counterintuitive, but...

I'd be grateful for any advice you give.

Many thanks.
 

Some videos you may like

Excel Facts

Why does 9 mean SUM in SUBTOTAL?
It is because Sum is the 9th alphabetically in Average, Count, CountA, Max, Min, Product, StDev.S, StDev.P, Sum, VAR.S, VAR.P.

Matt Allington

MrExcel MVP
Joined
Dec 18, 2014
Messages
1,222
At a guess.
1 Add a new Merged column using the 3 columns to make a unique key
2 branch the query and do a group by the key, distinct count of the file name
3 filter where count >1. This is a list of the duplicates
4 branch again from point 2, inner join with 3, remove duplicates. This is a list of duplicates, but 1 copy only
5 branch again from 2, left join, and remove matches. These are the non duplicated records
6 append 4 and 5

there may be a better way, but that is what comes to mind.
 

pjmorris

Well-known Member
Joined
Aug 2, 2012
Messages
1,905
Office Version
2016
Platform
Windows
Hi Matt,

Many thanks for this. I have a couple of queries:

1. I can follow the logic, just not sure what you mean by 'branch the query'? Do you mean 'Reference'? (that is what I've assumed.
2. The first four steps seem to remove the duplicates within a download, but these may be valid (buying two rounds at the pub will result in two exactly similar records as the download doesn't show time, only date). Its only those records in the second download that match something in the first that need to be removed.

I'll keep trying with your strategy as that occurred to me, but had no idea how to start.

Many thanks.

Peter
 

pjmorris

Well-known Member
Joined
Aug 2, 2012
Messages
1,905
Office Version
2016
Platform
Windows
Hi Matt,

I believe I've found the solution:

1. Group by Download File, Date, Payee and Amount - this collects any duplicate records on that day into a table.
2. Select Date, Payee and Amount and delete rows of duplicates. This leaves a single entry for each transaction including the table that contains the information about how may duplicates there are on one day.
3. Expand the table and delete any unnecessary columns, which is probably all the new column(s) you select.

Hope this helps someone.

Best Regards

Peter
 

Watch MrExcel Video

Forum statistics

Threads
1,102,642
Messages
5,488,071
Members
407,622
Latest member
plantaddict

This Week's Hot Topics

  • Timer in VBA - Stop, Start, Pause and Reset
    [CODE=vba][/CODE] Option Explicit Dim CmdStop As Boolean Dim Paused As Boolean Dim Start Dim TimerValue As Date Dim pausedTime As Date Sub...
  • how to updates multiple rows in muliselect listbox
    Hello everyone. I need help with below code. code is only chaning 1st row in mulitiselect list box. i know issue with code...
  • Delete Row from Table
    I am trying to delete a row from a table using VBA using a named range to find what I need to delete. My Range is finding the right cell. In the...
  • Assigning to a variable
    I have a for each block where I want to assign the value in column 5 of the found row to the variable Serv. [CODE=vba] For Each ws In...
  • Way to verify information
    Hi All, I don't know what to call this formula, and therefore can't search. I have a spreadsheet with information I want to reference...
  • Active Cell Address – Inactive Sheet
    How to use VBA to get the cell address of the active cell in an inactive worksheet and then place that cell address in a location on the current...
Top