Power Query - Remove duplicates from Folder Download

pjmorris

Well-known Member
Joined
Aug 2, 2012
Messages
2,158
Office Version
  1. 365
Platform
  1. Windows
Hi,

I'm creating a personal account analysis system and using the 'From File, From Folder' process in Power Query. The issue I'm having is to identify where two downloads from a bank or credit card account cover the same period and hence have duplicate entries. In doing so I want to avoid removing genuine duplicate transactions. For example:

Download File 1 includes two transaction on 10th June for £12 at Weatherspoons.
Download File 2 covers the same period by accident and also has the two transactions for £12 at Weatherspoons on 10th June.

A simple removal of duplicates will result in only one transaction, which is incorrect. Including the file download file name will result in two entries, but will also result in two entries if I had only a single transaction that occurs in both downloads. I believe the definition of a duplicate is where:

The date, Amount and Payee are all identical, but the Download File Name is different. I know its counterintuitive, but...

I'd be grateful for any advice you give.

Many thanks.
 

Excel Facts

Best way to learn Power Query?
Read M is for (Data) Monkey book by Ken Puls and Miguel Escobar. It is the complete guide to Power Query.
At a guess.
1 Add a new Merged column using the 3 columns to make a unique key
2 branch the query and do a group by the key, distinct count of the file name
3 filter where count >1. This is a list of the duplicates
4 branch again from point 2, inner join with 3, remove duplicates. This is a list of duplicates, but 1 copy only
5 branch again from 2, left join, and remove matches. These are the non duplicated records
6 append 4 and 5

there may be a better way, but that is what comes to mind.
 
Upvote 0
Hi Matt,

Many thanks for this. I have a couple of queries:

1. I can follow the logic, just not sure what you mean by 'branch the query'? Do you mean 'Reference'? (that is what I've assumed.
2. The first four steps seem to remove the duplicates within a download, but these may be valid (buying two rounds at the pub will result in two exactly similar records as the download doesn't show time, only date). Its only those records in the second download that match something in the first that need to be removed.

I'll keep trying with your strategy as that occurred to me, but had no idea how to start.

Many thanks.

Peter
 
Upvote 0
Hi Matt,

I believe I've found the solution:

1. Group by Download File, Date, Payee and Amount - this collects any duplicate records on that day into a table.
2. Select Date, Payee and Amount and delete rows of duplicates. This leaves a single entry for each transaction including the table that contains the information about how may duplicates there are on one day.
3. Expand the table and delete any unnecessary columns, which is probably all the new column(s) you select.

Hope this helps someone.

Best Regards

Peter
 
Upvote 0

Forum statistics

Threads
1,213,531
Messages
6,114,167
Members
448,554
Latest member
Gleisner2

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top