Complex Removing Duplicates Issue

jkupfer · Apr 1, 2009

Hi All,

I have a problem with duplicates in my database that I need to remove, but the problem is that some of the duplicates are OK and need to be left alone. Let me clarify....my database imports data that uses a "key" value so that every few days an individual key can get reused. Those are OK. The ones that need to be deleted are when two duplicate keys are used on the same day.

The duplicate query I build finds all duplicates, but what it really needs to do is find duplicates that occur on the day day. If anybody could modify this to do what I need, I would greatly appreciate it.

Code:

SELECT [Executions table].[Exec ID], [Executions table].[Exec date]
FROM [Executions table]
WHERE ((([Executions table].[Exec ID]) In (SELECT [Exec ID] FROM [Executions table] As Tmp GROUP BY [Exec ID] HAVING Count(*)>1 )))
ORDER BY [Executions table].[Exec ID];

And unfortunately I need to go back a few years and scrub data for these invalid duplicates, so I just cant start cleaning data going forward.

Thanks in advance.

jackd · Apr 1, 2009

jkupfer said:
Hi All,

I have a problem with duplicates in my database that I need to remove, but the problem is that some of the duplicates are OK and need to be left alone. Let me clarify....my database imports data that uses a "key" value so that every few days an individual key can get reused. Those are OK. The ones that need to be deleted are when two duplicate keys are used on the same day.

The duplicate query I build finds all duplicates, but what it really needs to do is find duplicates that occur on the day day. If anybody could modify this to do what I need, I would greatly appreciate it.

Code:

SELECT [Executions table].[Exec ID], [Executions table].[Exec date] FROM [Executions table] WHERE ((([Executions table].[Exec ID]) In (SELECT [Exec ID] FROM [Executions table] As Tmp GROUP BY [Exec ID] HAVING Count(*)>1 ))) ORDER BY [Executions table].[Exec ID];

And unfortunately I need to go back a few years and scrub data for these invalid duplicates, so I just cant start cleaning data going forward.

Thanks in advance.

A couple of comments/advice.

Why are you reusing your "keys"? The sooner you get a real key that remains persistent, the better off you will be.

Why is this cleanup/scrub required now? If you have to go back a few years, then this was not really important.

jackd · Apr 1, 2009

A couple of comments/advice.

Why are you reusing your "keys"? The sooner you get a real key that remains persistent, the better off you will be.

Why is this cleanup/scrub required now? If you have to go back a few years, then this was not really important.

Is it possible that you could have several duplicates for a given day?

Here's some code that should work. It will show those records where there are 1 or more duplicate(s).

Code:

select [exec id]
,[exec date] 
from 
[executions table]
group by [exec id], [exec date]
having count (*) >1
;

jkupfer · Apr 2, 2009

Hi Jack,

I guess the query was not nearly as complex I as made it out to be. My SQL skills are limited, but I spent an hour searching and playing around with what you did in about 30 seconds...still so much to learn apparently.

Anyway, I download this data from a 3rd party, so I have no control over them reusing the same keys. And I need to scrub the data now because I just started creating a new type of report using the data that has produced erroneous information because how these folks report transactions. For example, if transaction 1 occurs it uses key AAA. If there is an error with the transaction, they use key AAA again to correct transaction 1 using a negative number to zero out the transaction. Then they report the corrected data using a new BBB transaction.

When I remove the dupes that occurs on the same day it will be just like they never occurred, which is fine because the corrected data has already been entered in another transaction.

Thanks for your help.

James

jackd · Apr 2, 2009

jkupfer said:
Hi Jack,

I guess the query was not nearly as complex I as made it out to be. My SQL skills are limited, but I spent an hour searching and playing around with what you did in about 30 seconds...still so much to learn apparently.

Anyway, I download this data from a 3rd party, so I have no control over them reusing the same keys. And I need to scrub the data now because I just started creating a new type of report using the data that has produced erroneous information because how these folks report transactions. For example, if transaction 1 occurs it uses key AAA. If there is an error with the transaction, they use key AAA again to correct transaction 1 using a negative number to zero out the transaction. Then they report the corrected data using a new BBB transaction.

When I remove the dupes that occurs on the same day it will be just like they never occurred, which is fine because the corrected data has already been entered in another transaction.

Thanks for your help.

James

Instead of processing the data and then scrubbing the database to remove duplicates there must be a way to accomplish the intended result without the fear of missing something, doing something twice etc.

So if the third part sends trans 1, then determines (at their end) that something is wrong with it, they send a second trans1 with numbers to 0 out/nullify/delete the calculation done by the first trans 1.

This sounds very much like an intentional add, followed by a firm remove. Net result is 0.
Then the proper transaction is sent say trans 1000 or whatever.

My thinking is this

If you get all the transactions, and sort them by the trans identifier, duplicates will appear together. so, if you put these transactions in a separate table, saw Raw Data Table,you can sort them, then read through the records. If you get start with a funny number, like zaw22q for a variable called HoldKey, then compare the raw transaction to the hold key. The first one won't match, so move the key of the first transaction to Holdkey. Then read the next transaction, if the key matches hold key, then don't process the transaction. Go to the next transaction. This is just off the top of my head, and may very well need some work.

The point is, if you have all the raw transactions, remove the duplicates - just don't process them. Only accept the non duplicate transactions for processing.
Do your append transactions, then do your reports.

Ypu could move the duplicates with a date and time and batch identifier to an errors table just to have a record of what was rejected from processing.

Any way, bottom line is, fix a bad process and get a little more discipline into the transaction processing.

There is no need to be dealing with errors once they've been applied to your database.

Of course, if the duplicate transactions are truly a nullification/removal of the first transaction of a pair, then you should be able to process all transactions. The issue is that the second transaction record is not an append, but an update of the first.

You need a little more info about what might be in a transaction such that you would want to keep some data.

Send back some comments on this and we can go forward with a solution.

Complex Removing Duplicates Issue

jkupfer

New Member

Excel Facts

jackd

Well-known Member

jackd

Well-known Member

jkupfer

New Member

jackd

Well-known Member

Similar threads

Forum statistics

Share this page

Complex Removing Duplicates Issue

jkupfer

New Member

Excel Facts

jackd

Well-known Member

jackd

Well-known Member

jkupfer

New Member

jackd

Well-known Member

Similar threads

Forum statistics

Share this page

We've detected that you are using an adblocker.

Which adblocker are you using?

Disable AdBlock

Disable AdBlock Plus

Disable uBlock Origin

Disable uBlock