PDF to Excel Date Conversion Result Challenge

RedMonkey

New Member
Joined
Apr 27, 2011
Messages
45
Hello to the board!

The thread title was hard to determine because this question could easily be a complete clanker.

I am polishing a project where I have taken some truly horrible pdf scans and converted them to excel using Able2Extract, (Awesome program given Adobe stinks at getting data into excel.) so changing/cleaning my source data is at the best level possible and will not get any better.

I am down to one obstacle, the conversion scan has done a real number on dates. I am looking for a formula/idea because this is a 800 page conversion and fixing a few lines like a statement is not possible.

The source document: 06/14/2011
What comes out can be: 0611412011 or 06/1412011 or 06114/2011
1's are being randomly substitued for /


I've fixed strings using IF-Search-Substitute tricks, LFT/RGT/MID methods and others for all my other challenges but this one has me stumped. Given the errors are random and I can't just target the 1's.
The saving grace is no other problems occur in the date field.

Any ideas?
 

Excel Facts

Select a hidden cell
Somehide hide payroll data in column G? Press F5. Type G1. Enter. Look in formula bar while you arrow down through G.
Hi,

If the only mistake the OCR is making is that sometimes '/' is being misparsed as '1', then something like this vba function might work for you. (see code for comments).

However, '7', 'i', 'I', 'l' (lower case L), '|' (that bar symbol over the backslash), and maybe even '(' and ')' tend to get thrown into that mix too if the scan is dirty enough. If it's something like that... ouch? Time to do some real brute force method.

Code:
Function TryOnesToSlashForDate(sAttempt As String) As Variant
    Dim i                               As Long
    Dim j                               As Long
    Dim sTemp                           As String
    Dim alOnesLocation()                As Long
    Dim dFoundOne                       As Date
    Dim bAlreadyFound                   As Boolean
    Dim lNbrSlashes                     As Long
 
    'if the input is a date, use it
    If IsDate(sAttempt) Then
        TryOnesToSlashForDate = CDate(sAttempt)
        Exit Function
    End If
    'if there aren't any '1's in the string, not much we can do.
    If (Len(sAttempt) - Len(Replace(sAttempt, "1", ""))) = 0 Then
        TryOnesToSlashForDate = "No date possible"
        Exit Function
    End If
    'Find out the number of slashes already in the string.
    'if there's 0 slashes, then we need to do a double loop
    'if there's 1 slash, then we need to do a single loop
    'if there's more than 1, then a simple
    'try a slash instead of a one won't work at all.
    lNbrSlashes = Len(sAttempt) - Len(Replace(sAttempt, "/", ""))
    Select Case lNbrSlashes
    Case 0
        'find the locations of the '1's in the string
        'don't bother looking at first and last character:
        'won't get a valid date turning it into a / anyway.
        j = 0
        For i = 2 To Len(sAttempt) - 1
            If Mid(sAttempt, i, 1) = "1" Then
                ReDim Preserve alOnesLocation(0 To j)
                alOnesLocation(j) = i
                j = j + 1
            End If
        Next i
        'Do a brute force replacement of the one's with slashes.
        'function just returns the first value found
        'WARNING: function will return "Multiple Possibilities" if
        'there's more than one possibility
        'for example 111112011 could be 1/11/2011 or 11/1/2011...
        For i = 0 To UBound(alOnesLocation(), 1) - 1
            For j = i + 1 To UBound(alOnesLocation(), 1)
                'replace out the 1's with slashes
                sTemp = Left(sAttempt, alOnesLocation(i) - 1) & "/" _
                        & Mid(sAttempt, alOnesLocation(i) + 1)
                sTemp = Left(sTemp, alOnesLocation(j) - 1) & "/" _
                        & Mid(sAttempt, alOnesLocation(j) + 1)
                If IsDate(sTemp) Then
                    If Not bAlreadyFound Then
                        dFoundOne = CDate(sTemp)
                        bAlreadyFound = True
                    Else
                        TryOnesToSlashForDate = "Multiple Possibilities"
                        Exit Function
                    End If
                End If
            Next j
        Next i
    Case 1
        'as before, the first and last character in your string
        'can't be a slash with a valid date
        'Once again, check every possibility until you find the second
        'potential date.
        For i = 2 To Len(sAttempt) - 1
            If Mid(sAttempt, i, 1) = "1" Then
                sTemp = Left(sAttempt, i - 1) & "/" & Mid(sAttempt, i + 1)
                If IsDate(sTemp) Then
                    If Not bAlreadyFound Then
                        dFoundOne = CDate(sTemp)
                        bAlreadyFound = True
                    Else
                        TryOnesToSlashForDate = "Multiple Possibilities"
                        Exit Function
                    End If
                End If
            End If
        Next i
    Case Else
    End Select
    If bAlreadyFound Then
        'Found a unique value
        TryOnesToSlashForDate = dFoundOne
    Else
        'Couldn't find a valid date by replacing 1's to /'s
        TryOnesToSlashForDate = "No date possible"
    End If
End Function
 
Upvote 0
If the date output is always 10 characters (which it appears to be based on your output), the text view of the date would be:

Code:
=LEFT(B2,2) & "/" & MID(B2,4,2) & "/" & MID(B2,7,4)

This assumes that you know where the dates are, and that you want to use a formula. Obviously substitute the cell of interest into the formula in place of my "B2" references.

In VBA, if you can find the dates you can use a similar approach (e.g. treating it like a string). You may be able to use the "10-character" characteristic to help identify cells that should contain dates. If I understand correctly, the cells have either 10 digits, or 10 characters that are either digits or slashes. An IF statement that looks for cells containing only digits and slashes might help find the cells.

I can add more specifics later...late to a meeting :)


Hope that helps,

Cindy
 
Upvote 0
I knew a fresh look would point out the obvious to me. Thanks Cindy and Chris! I was so fixated on previous work with IFERROR-IF-SEARCH strings I lost the basics again.

Since I am fortunate to always have 10 characters and the separator is always a 1 or / I'm going with the formula. Once we go through this process with all my vendors then stage 2 is to see what lends itself to VBA and what doesn't. The VBA ideas will sure help with the smaller vendors (This 800 page monster requires extra TLC after the dates are fixed.)

Anything you can share regarding a VBA version Cindy, when you have time, would be wonderful
 
Upvote 0
The following arbitrarily assumes the dates are all in column C. If you know what column(s) may contain dates, you can change the code for those columns. The column number is highlighted in red. Given the size of your report, it will take a while. I suggest you try it on a subset of your data first to see if it gets you close to what you need.
Cindy
Code:
Sub FixBadDates()
'assumes dates are in column C (represented by a "3" in the cells function), although the rows aren't known
For n = 1 To ActiveSheet.UsedRange.Rows.Count
    TestVal = ActiveSheet.Cells(n, [COLOR=red]3[/COLOR]).Value
    FixVal = TestVal
    If Len(TestVal) = 10 Then 'if 10 characters, then check for just numbers and "/"
        For i = 0 To 9
            TestVal = Replace(TestVal, i, "")
        Next i
        TestVal = Replace(TestVal, "/", "")
        If Len(TestVal) = 0 Then 'the string only contained numerals and "/" so apparently a date
            ActiveSheet.Cells(n, [COLOR=red]3[/COLOR]).Value = Left(FixVal, 2) & "/" & Mid(FixVal, 4, 2) & "/" & Mid(FixVal, 7, 4)
        End If
    End If
Next n
End Sub
 
Upvote 0
Thank you for the code Cindy, since I have the report broken into 6 sections this works as a base very nicely once I tested it on a mock-up report.

I'd show a copy of what I am dealing with but there are far too many confidentialty issues with that, suffice to say that the program I am using cannot spit out a conversion in any template form. What it does produce it does consistently so my formulas are working.
Once I convince others that a little more quality control while scanning this stuff makes sense then the VBA portion of my work will be possible.

Thanks again!
 
Upvote 0

Forum statistics

Threads
1,224,581
Messages
6,179,668
Members
452,936
Latest member
anamikabhargaw

We've detected that you are using an adblocker.

We have a great community of people providing Excel help here, but the hosting costs are enormous. You can help keep this site running by allowing ads on MrExcel.com.
Allow Ads at MrExcel

Which adblocker are you using?

Disable AdBlock

Follow these easy steps to disable AdBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the icon in the browser’s toolbar.
2)Click on the "Pause on this site" option.
Go back

Disable AdBlock Plus

Follow these easy steps to disable AdBlock Plus

1)Click on the icon in the browser’s toolbar.
2)Click on the toggle to disable it for "mrexcel.com".
Go back

Disable uBlock Origin

Follow these easy steps to disable uBlock Origin

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back

Disable uBlock

Follow these easy steps to disable uBlock

1)Click on the icon in the browser’s toolbar.
2)Click on the "Power" button.
3)Click on the "Refresh" button.
Go back
Back
Top