Excel to SQL - Handling Duplicate Values

mskusace · Jan 30, 2019

I have an Excel spreadsheet which I have been automating via VBA macros, but it appears the processing time has increased significantly because one worksheet has 222,465 rows of data. That worksheet will only accumulate more data, so I would like to convert this into an environment that can process the data quickly. Would SQL in Microsoft Access be ideal or is there a better solution? I have R, Python, Microsoft Access, and MySQL Workbench.

I have 3 workbooks I am pulling data from and I do perform some data manipulation.

Here are the formulas I currently have in my spreadsheet:

Column: OB

Code:

=SUBSTITUTE(SUBSTITUTE([@[OBN]]," ",""),"-","")

This cleans up the OBN column removing all hyphens and spaces to standardize it based on the data table.

Columb DO

Code:

=IF(OR([@[DON]]="NA",ISBLANK(TRIM([@[DON]])),[@[DON]]="N/A",ISNA([@[DON]]),ISBLANK([@[DON]])),"",[@[DON]])

This makes the DO column blank if there is no value there or if it's NA or N/A, as that is how the report displays it. Otherwise, use the value of DON as DO.

Column: OB+DO (This is the ID in the tables below)

Code:

=IF([@DO]="",[@OB],CONCATENATE([@OB],"-",[@DO]))

Concatenate columns OB and DO with a hyphen if DO contains a value that is not NA or N/A. Otherwise, just use OB.

Now, what I am trying to handle is in the post I made here https://www.mrexcel.com/forum/excel-questions/1085373-handling-duplicate-values.html.

I apologize for duplicating the thread, but the problem has expanded if I have to convert the entire process.

Lookup Table in Worksheet 1.

ID	CODE	REASSIGN(CODE)
W12-05	AA	ZZ
W12-05	BB	ZZ
W14-01	CC	CC
W14-03	AA	AA
W15-01	DD	DD
W15-01	DD	DD

ID and CODE are generated from the report
REASSIGN(CODE) is what I am trying to accomplish. If there are multiple IDs (ex. W12-05) that are the same with different CODE values, then assign it to "ZZ" . However, if there are multiple IDs and they all have the same CODE values (ex. W15-01), then assign it to the same code value.

There are 2 Worksheets (Worksheet 2 and Worksheet 3) that house the ID and different data associated with that ID.

Data Table:

ID	LOOKUP REASSIGN(CODE)
W12-05	ZZ
W14-01	CC
W14-03	AA
W15-01	DD

How can I accomplish the REASSIGN(CODE) column in the first table? Something along the lines of:
If IDs duplicate and multiple different CODEs, assign to ZZ.
If IDs duplicate and same codes, assign to same CODE.

There might be instances where if there certain multiple codes for duplicate IDs, it will need to assign to a specific CODE. "ZZ" is the default when I don't know where it should go and it can be handled by another department. Sometimes I know that code "EE" and code "FF" belong to one department and there is logic to assign it to "GG" for example.

I want to combine all 3 Worksheets based on the ID column and I want all the data from Worksheet 2 and Worksheet 3 to be there, even if there isn't a match in IDs. Worksheet 1 will be used just as a lookup to pull in the REASSIGN(CODE).

Please let me know if I need to explain further or if there are any questions. I am open to ideas and other solutions, since this is just the logic that came into my head. I am not sure if it's better to create another table for the REASSIGN(CODE) or if my logic is sound.

mskusace · Feb 11, 2019

I had to move to SQL since Excel was not handling the data very well. I have it fully functioning now in SQL. If anyone is curious on the solution, I have posted it below.

SELECT [Table1].[ID], [Table1].

Code:

, Count([Table1].[CODE]) AS [CODE Count]
FROM [Table1]
GROUP BY [Table1].[ID], [Table1].[CODE];


I first do a count of CODE and group by ID and CODE. This gets me the count of each CODE per ID. For my example above, W12-05 would be listed twice and would have a count of 1 and 1, results in the table below.

Table2
[TABLE="width: 209"]
<tbody>[TR]
[TD]ID[/TD]
[TD]CODE[/TD]
[TD]CODE Count[/TD]
[/TR]
[TR]
[TD]W12-05[/TD]
[TD]AA[/TD]
[TD="align: right"]1[/TD]
[/TR]
[TR]
[TD]W12-05[/TD]
[TD]BB[/TD]
[TD="align: right"]1[/TD]
[/TR]
[TR]
[TD]W14-01[/TD]
[TD]CC[/TD]
[TD="align: right"]1[/TD]
[/TR]
[TR]
[TD]W14-03[/TD]
[TD]AA[/TD]
[TD="align: right"]1[/TD]
[/TR]
[TR]
[TD]W15-01[/TD]
[TD]DD[/TD]
[TD="align: right"]2[/TD]
[/TR]
</tbody>[/TABLE]

Then, I do a count of ID's and grouped by ID.

SELECT [Table1].[ID], Count([Table1].[ID]) AS [ID Count]
FROM [Table1]
GROUP BY [Table1].[ID];

Table3
[TABLE="width: 500"]
<tbody>[TR]
[TD]ID[/TD]
[TD]ID Count[/TD]
[/TR]
[TR]
[TD]W12-05[/TD]
[TD]2[/TD]
[/TR]
[TR]
[TD]W14-01[/TD]
[TD]1[/TD]
[/TR]
[TR]
[TD]W14-03[/TD]
[TD]1[/TD]
[/TR]
[TR]
[TD]W15-01[/TD]
[TD]2[/TD]
[/TR]
</tbody>[/TABLE]

This will get me the total count of each ID.

Next, I compare the  ID count and CODE Count for each ID.

SELECT [Table2].[ID], [Table2].[CODE], [Table2].[CODE Count], [Table3].[Count ID]
FROM [Table2] LEFT JOIN [Table3] ON [Table2].[ID] = [Table3].[ID];


This yields the following table.

Table4
[TABLE="width: 500"]
<tbody>[TR]
[TD]ID[/TD]
[TD]CODE[/TD]
[TD]CODE Count[/TD]
[TD]ID Count[/TD]
[/TR]
[TR]
[TD]W12-05[/TD]
[TD]AA[/TD]
[TD]1[/TD]
[TD]2[/TD]
[/TR]
[TR]
[TD]W12-05[/TD]
[TD]BB[/TD]
[TD]1[/TD]
[TD]2[/TD]
[/TR]
[TR]
[TD]W14-01[/TD]
[TD]CC[/TD]
[TD]1[/TD]
[TD]1[/TD]
[/TR]
[TR]
[TD]W14-03[/TD]
[TD]AA[/TD]
[TD]1[/TD]
[TD]1[/TD]
[/TR]
[TR]
[TD]W15-01[/TD]
[TD]DD[/TD]
[TD]2[/TD]
[TD]2[/TD]
[/TR]
</tbody>[/TABLE]

The last step I do is compare the count of CODE and ID.

SELECT [Table4].[ID], [Table4].[CODE Count], [Table4].[Count ID], IIf([Table4].[CODE Count]<>[Table4].[Count ID],"ZZ", [Table4].CODE) AS CODE
FROM [Table4];


If the CODE Count and ID Count do not match, assign it to "ZZ".

Table5
[TABLE="width: 500"]
<tbody>[TR]
[TD]ID[/TD]
[TD]CODE Count[/TD]
[TD]ID Count[/TD]
[TD]CODE[/TD]
[/TR]
[TR]
[TD]W12-05[/TD]
[TD]1[/TD]
[TD]2[/TD]
[TD]ZZ[/TD]
[/TR]
[TR]
[TD]W12-05[/TD]
[TD]1[/TD]
[TD]2[/TD]
[TD]ZZ[/TD]
[/TR]
[TR]
[TD]W14-01[/TD]
[TD]1[/TD]
[TD]1[/TD]
[TD]CC[/TD]
[/TR]
[TR]
[TD]W14-03[/TD]
[TD]1[/TD]
[TD]1[/TD]
[TD]AA[/TD]
[/TR]
[TR]
[TD]W15-01[/TD]
[TD]2[/TD]
[TD]2[/TD]
[TD]DD[/TD]
[/TR]
</tbody>[/TABLE]

From here, you can just take the DISTINCT values.

SELECT DISTINCT [Table5].[ID], [Table5].CODE
FROM [Table5];

[TABLE="width: 500"]
<tbody>[TR]
[TD]ID[/TD]
[TD]CODE[/TD]
[/TR]
[TR]
[TD]W12-05[/TD]
[TD]ZZ[/TD]
[/TR]
[TR]
[TD]W14-01[/TD]
[TD]CC[/TD]
[/TR]
[TR]
[TD]W14-03[/TD]
[TD]AA[/TD]
[/TR]
[TR]
[TD]W15-01[/TD]
[TD]DD[/TD]
[/TR]
</tbody>[/TABLE]



I am not sure if this is the optimal way, but it is a working solution for me. I hope this helps!

*EDIT*
Removed the [CODE] tags because I use [CODE] in my SQL and it was not displaying the code properly in the code block.

Excel to SQL - Handling Duplicate Values

mskusace

New Member

Excel Facts

mskusace

New Member

Similar threads

Forum statistics

Share this page

Excel to SQL - Handling Duplicate Values

mskusace

New Member

Excel Facts

mskusace

New Member

Similar threads

Forum statistics

Share this page

We've detected that you are using an adblocker.

Which adblocker are you using?

Disable AdBlock

Disable AdBlock Plus

Disable uBlock Origin

Disable uBlock