Automate ScanSnap OCR process on your Mac with AppleScript (Snow Leopard Edition)

Some time back I published an AppleScript that automatically runs OCR in the background on scanned files generated by your Fujitsu ScanSnap while you continue scanning more files. ScanSnap owners should all be familiar with the problem: the out-of-the-box configuration of ScanSnap Manager and ABBYY FineReader forces the scan and OCR stages to run in lockstep: scan 1…OCR 1…scan 2…OCR 2… and so on. That script let you keep scanning regardless of any OCR processing going on.

As it turns out, my original script does not work in Snow Leopard, and I promised that I would one day clean up and publish my new and improved version.

Chris posted a comment today as a gentle reminder, so here is the new and improved version without further delay…

The Details

Unfortunately, Snow Leopard came around and caused some indigestion. For starters, ScanSnap Manager didn’t work correctly and ABBYY FineReader would not process anything made by the ScanSnap. A couple of months later the vendors got everything straightened out and delivered new versions of each product.

The new version of ABBYY FineReader does not play well with my original script.

Since I cannot do without this important functionality, I rolled up my sleeves and rewrote most of the script. The new version works quite nicely in Snow Leopard, with one small annoyance: the new FineReader version keeps bouncing its Dock icon the whole time it is running, which is distracting enough that you really don’t want to use the machine for anything other than scanning or OCR while a job is in progress.

Fortunately, I really don’t need to use my machine for anything else while it is chewing on the docs; I just wanted to be able to continue scanning at the same time!

Note: Before going forward, you will need to upgrade ScanSnap Manager and ABBYY FineReader to their Snow Leopard versions. Get the files here.

Here is a link to the new script

And here’s the code itself:

(*

NOTE: This script was written for Snow Leopard. It may work
on Leopard, but I never tried it.

This is a folder listener script that will act as a queue, receiving
PDF files from the ScanSnap scanner and feeding them, one by one, to
the Abbyy FineReader OCR software.

This allows you to keep scanning while the OCR job runs in the background
on all of the unprocessed files.

Why do we want to do this?

The ScanSnap Manager software does not support this by default, so
when you scan in a file, it sends it to FineReader for OCR. You then
must wait until FineReader finishes its work before scanning in another
document.

This script allows you to keep scanning without waiting for OCR.

Installation:

o   Copy this script to:

    <home>/Library/Scripts/Folder Action Scripts

    You may have to create the "Folder Action Scripts" folder.

o   Open a Finder window and navigate to the parent folder
  of the scanned documents folder.

o Right click (control-click) the scanned documents folder and
  choose:

    Folder Actions Setup...

o At this point if folder actions are not enabled, you will
  likely have to enable them and add the script manually.
    - check "Enable Folder Actions"
    - Use the "+" buttons on the left and right sides to add the
      scan folder and then this script.
   
o Otherwise, a list of scripts will come up. Choose this script
  from the "Choose a Script to Attach" dialog.

o Close all windows.

Copyright (C) 2010 Tad Harrison
*)

property ocrFileSuffix : " processed by FineReader.pdf"
property ocrApplicationName : "Scan to Searchable PDF"
property ocrApplicationWindow : "Converting the document"
property ocrLockFileName : "OCR in Progress"
on adding folder items to this_folder after receiving added_items
    set lockFilePath to (POSIX path of (path to desktop folder as text)) & ocrLockFileName
    try
        logEvent("=== Run OCR on New Folder Items ===")
        -- Test for lockfile; exit if lockfile exists
        tell application "System Events" to set lockFileExists to exists file lockFilePath
        if lockFileExists then
            logEvent("Other script running. Exiting...")
            return
        else
            do shell script "/usr/bin/touch \"" & lockFilePath & "\""
        end if
        -- Main loop
        set moreWorkToDo to true
        repeat while moreWorkToDo
            set aFile to getNextFile(this_folder)
            if not aFile = "" then
                ocrFile(aFile)
            else
                set moreWorkToDo to false
            end if
        end repeat
        logEvent("No more work.")
        exitApp(ocrApplicationName)
    on error errorStr number errNum
        display dialog "Error " & errNum & " while running OCR: " & errorStr
    end try
    -- Get rid of the lockfile, ignoring any errors
    try
        do shell script "/bin/rm \"" & lockFilePath & "\""
    end try
end adding folder items to
(*
Name: ocrFile
Description: Runs OCR on the next un-OCR'd file
Parameters:
  aFile - the file to be OCR'd
*)

on ocrFile(aFile)
    set posixFilePath to POSIX path of aFile
    set posixOcrFilePath to getPosixOcrFilePath(posixFilePath)
    logEvent("OCR: " & posixFilePath)
    tell application ocrApplicationName to open aFile
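    --
    -- Now sit in a loop checking once per second for the OCR file.
    -- Give up after five minutes. A "with timeout" block only limits
    -- how long a single application command may take, so the elapsed
    -- time is counted here instead.
    --
    set secondsWaited to 0
    set ocrFileExists to false
    repeat until ocrFileExists
        set ocrFileExists to posixFileExists(posixOcrFilePath)
        if ocrFileExists then
            logEvent("OCR file generated.")
            -- Wait 5 even if the file was found, to let things settle
            delay 5
        else if secondsWaited >= 300 then
            error "Timed out waiting for OCR output: " & posixOcrFilePath
        else
            -- Wait a second before checking again
            delay 1
            set secondsWaited to secondsWaited + 1
        end if
    end repeat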
end ocrFile
(*
Name: appIsRunning
Description: Determines if a particular application is running.
Parameters:
    appName - the name of the application to be tested
Returns: True if the application is running; otherwise False
*)

on appIsRunning(appName)
    tell application "System Events" to (name of processes) contains appName
end appIsRunning
(*
Name: posixFileExists
Description: Determines if a particular file exists.
Parameters:
    posixFilePath - the POSIX path to the file
Returns: True if the file exists; otherwise False
*)

on posixFileExists(posixFilePath)
    tell application "System Events" to exists file posixFilePath
end posixFileExists
(*
Name: exitApp
Description: Exits the specified app if it is running.
Parameters:
    appName - the application name
*)

on exitApp(appName)
    if appIsRunning(appName) then
        tell application appName to quit
    end if
end exitApp
(*
Name: getPosixOcrFilePath
Description: Gets the OCR output filename for a given input filename.
Parameters:
    posixFilePath - the full path to the source file
Return: the POSIX path of the OCR output file
*)

on getPosixOcrFilePath(posixFilePath)
    set posixBaseName to do shell script ¬
        "filename=" & quoted form of posixFilePath & "; echo ${filename%\\.*}"
    set posixOcrFilePath to posixBaseName & ocrFileSuffix
    return posixOcrFilePath
end getPosixOcrFilePath
(*
Name: getNextFile
Description: Finds the next unprocessed ScanSnap PDF
Return: the file or ""
*)

on getNextFile(aFolder)
    logEvent("Getting next file...")
    set masterFileList to list folder aFolder ¬
        without invisibles
    set posixPath to POSIX path of aFolder
    repeat with i from 1 to count masterFileList
        set fileName to item i of masterFileList
        set posixFilePath to posixPath & fileName
        log posixFilePath
        --
        -- Construct a FineReader file name from our file
        --
        set posixOcrFilePath to getPosixOcrFilePath(posixFilePath)
        --
        -- See if the FineReader file we constructed exists
        --
        set ocrFileExists to posixFileExists(posixOcrFilePath)
        tell me to set fileCreator to getSpotlightInfo for "kMDItemCreator" from posixFilePath
        log ("Creator: " & fileCreator)
        if not ocrFileExists and fileCreator = "ScanSnap Manager" then
            return POSIX file posixFilePath
        end if
    end repeat
    return ""
end getNextFile
(*
Name: getSpotlightInfo
Description: Gets a named attribute from metadata for a specific file.
Parameters:
    for myattribute - the name of the attribute
    from myfile - the name of the file
Returns: the attribute value or "" if none found
*)

on getSpotlightInfo for myattribute from myfile
    try
        set this_kMDItemResult to ""
       
        tell application "Finder"
            set this_item to myfile as string
            set this_item to POSIX path of this_item
            set this_kMDItem to myattribute
            set theResult to words of (do shell script "/usr/bin/mdls -name " & this_kMDItem & " -raw -nullMarker None " & quoted form of this_item)
            log "Result: " & theResult as string
            repeat with j from 1 to number of items in theResult
                set this_kMDItemResult to this_kMDItemResult & item j of theResult as string
                if j < number of items in theResult then
                    set this_kMDItemResult to this_kMDItemResult & " "
                end if
            end repeat
        end tell
    on error
        set this_kMDItemResult to ""
    end try
    return this_kMDItemResult
end getSpotlightInfo
(*
Name: logEvent
Description: Write an event to an event log
Parameters:
    themessage - the message to write to the log
*)

on logEvent(themessage)
    set theLine to (do shell script ¬
        "date  +'%Y-%m-%d %H:%M:%S'" as string) ¬
        & " " & themessage
    do shell script "echo " & quoted form of theLine & ¬
        " >> ~/Library/Logs/AppleScript-events.log"
end logEvent
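
For anyone who thinks better in shell than in AppleScript, the queue logic in getNextFile boils down to roughly the following. This is only a sketch under a simplifying assumption: it recognizes FineReader output by its file name suffix rather than checking the Spotlight kMDItemCreator attribute as the script does. The function names ocr_path_for and next_unprocessed are made up for illustration.

```shell
#!/bin/sh
# Suffix FineReader appends to its output (same value as the ocrFileSuffix property).
OCR_SUFFIX=" processed by FineReader.pdf"

# Print the expected OCR output path for a scan, e.g.
# "/scans/bill.pdf" becomes "/scans/bill processed by FineReader.pdf".
ocr_path_for() {
    printf '%s%s\n' "${1%.*}" "$OCR_SUFFIX"
}

# Print the first PDF in a folder that has no OCR output yet.
# Prints nothing when every scan has already been processed.
next_unprocessed() {
    for f in "$1"/*.pdf; do
        [ -e "$f" ] || continue                         # folder holds no PDFs
        case "$f" in *"$OCR_SUFFIX") continue ;; esac   # skip OCR outputs themselves
        [ -e "$(ocr_path_for "$f")" ] || { printf '%s\n' "$f"; return 0; }
    done
}
```

Calling next_unprocessed on the scan folder repeatedly, and handing each result to the OCR application, is all the main loop of the script really does.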

Installation

  • Use the Script Editor to save this script as Run OCR on New Folder Items under User Home/Library/Scripts/Folder Action Scripts
    You may have to create the Folder Action Scripts folder.
  • Now open a Finder window and navigate to the parent folder of your scanned documents folder.
  • Right click (control-click) the scanned documents folder and choose Folder Actions Setup…
  • At this point if folder actions are not enabled, you will likely have to enable them and add the script manually.
    • Check Enable Folder Actions
    • Use the “+” buttons on the left and right sides to add the scan folder and then this script.
  • Otherwise, a list of scripts will come up. Choose this script from the Choose a Script to Attach dialog.
  • Close all windows.

That’s it! The script will be invoked automatically every time a new file appears in your scanned documents folder.
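
If you prefer the Terminal for the first installation step, something like the following should do the copy. It is a minimal sketch: install_folder_action is a name I made up, and the example file name in the comment is hypothetical. Attaching the script to the scan folder still has to be done through Folder Actions Setup as described above.

```shell
#!/bin/sh
# Copy a saved script into the per-user Folder Action Scripts folder,
# creating the folder first if it does not exist yet.
install_folder_action() {
    script_file=$1
    dest="$HOME/Library/Scripts/Folder Action Scripts"
    mkdir -p "$dest" && cp "$script_file" "$dest/"
}

# Example (hypothetical file name and location):
# install_folder_action "$HOME/Downloads/Run OCR on New Folder Items.scpt"
```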

Please let me know if you have any ideas that can improve this script. I’m not an AppleScript guru, so someone might just know how to keep that annoying FineReader icon from jumping.

25 Responses to “Automate ScanSnap OCR process on your Mac with AppleScript (Snow Leopard Edition)”

  1. Automate ScanSnap OCR process on your Mac with AppleScript | Paper Jammed writes:

    [...] Update: The script on this page works only with Leopard (10.5). Get the Snow Leopard version here [...]

  2. Rene Nederhand writes:

    Excellent script!

    I had to change the string “ScanSnap Manager” to “ScanSnap Manager S1500M” to get it working, but now it is working like a charm.

    The only thing I noticed is the quality of the PDFs that come out of FineReader. They don’t look as good as the original scan. Do you have any ideas how to improve that?

  3. Tad writes:

    Glad to hear it works for you!
    And yes, I would expect that a couple text strings might need to be tweaked to go from one ScanSnap to the other.

    As far as the quality goes, this is likely a consequence of settings in FineReader.
    I launched FineReader for ScanSnap Preferences and clicked on the Scan to Searchable PDF tab.

    On this tab I have selected the following:

    Save mode: Text under page image
    Quality: High (for printing)
    Format: Automatic

    I imagine your Quality setting might be on one of the lower settings right now.

  4. Tad writes:

    I guess I wasn’t searching for the right thing when I tried to stop the icon from bouncing.
    Anyway, many sites out there had the following instructions:

    To stop all dock bouncing forever, type the following in a terminal window:

    defaults write com.apple.dock no-bouncing -bool TRUE

    and then

    killall Dock

    To reverse the process, do the same thing, but use FALSE instead of TRUE.

  5. Tjebbe van Tijen writes:

    I am considering buying the ScanSnap for the Mac. It is bundled with ABBYY FineReader, and I am just testing that piece of software for another workflow, in which I want to use OCR on photographs of book pages and large screenshots of online digital facsimiles (like Google Books). As I have a 30 inch monitor my screenshots of book pages can be rather good. The whole process is managed by FileMaker Pro (10) with several plug-ins. OCR from screenshots with ABBYY works surprisingly well. So now I want to script the steps of the process. As ABBYY does not have AppleScript functionality, but seems to have some scripting engine (ABBYY FineReader Engine), I was wondering whether you have any experience with this… I also saw some mention of a Command Line Interface (CLI); maybe such commands could be called from AppleScript to run them in UNIX? Let me know if these are realistic tracks to follow… Thanks in advance

  6. Tad writes:

    Actually, for low throughput purposes, the AppleScript that I put together to send docs to the ABBYY app works just fine. You don’t have fine grained control over the OCR settings on a per doc basis, but it is definitely possible to have AppleScript feed documents to ABBYY.

    If you are looking into a more substantial solution, for a small organization or business, you might be interested in talking with ABBYY about licensing their engine. This is what we did at work for a particular job: we bought a license to use their engine on a single Linux server for a specific number of pages per year.

    What they provided was their SDK along with platform-specific binaries. We were able to compile a command line client to the SDK that worked wonders. We then used a bash script on a cron job to wrap the command line client, periodically checking a source directory for files and feeding them to the CLI.
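
A rough sketch of that kind of wrapper is below. It is not the script we used at work: ocr_cli is a made-up placeholder for whatever client you compile against the SDK, and the calling convention (input file, then output file) is an assumption as well.

```shell
#!/bin/sh
# OCR_CMD is a placeholder for your own command-line client;
# "ocr_cli" is a hypothetical name, not a real ABBYY tool.
OCR_CMD=${OCR_CMD:-ocr_cli}

# Run the OCR command on every PDF in src that has no counterpart in dst.
# Assumed calling convention: $OCR_CMD <input.pdf> <output.pdf>
process_new_scans() {
    src=$1
    dst=$2
    for in_file in "$src"/*.pdf; do
        [ -e "$in_file" ] || continue        # empty source folder
        out_file="$dst/$(basename "$in_file")"
        [ -e "$out_file" ] && continue       # handled on an earlier run
        # Write to a temporary name first so a failed run never
        # leaves a half-written result that looks finished.
        if "$OCR_CMD" "$in_file" "$out_file.tmp"; then
            mv "$out_file.tmp" "$out_file"
        else
            rm -f "$out_file.tmp"
            echo "OCR failed: $in_file" >&2
        fi
    done
}
```

Saved as a script with a final line such as process_new_scans "$1" "$2", it can be run from cron every few minutes; files that already have output are skipped, so repeated runs are harmless.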

  7. mark writes:

    Hi. Interesting post. I was wondering whether you then go on to name the files or just save them and subsequently find them using the OCR search. I’m trying to come up with a method whereby, after the file has been through the OCR process, it is automatically named based on whether certain terms appear in the file itself. For example, if a scan of a Citibank bank statement contains the words citibank + statement, then the file would be saved into a particular folder as citibank_statement_todays date, removing a lot of the tedious manual naming.
    I’d appreciate any thoughts you have on the subject.
    Thanks
    Mark

  8. Tad writes:

    Hi Mark,

    No, in my home life the docs are few enough that I name and file them by hand. I imagine that some of the heavy hitters in the document management field will automatically populate keywords based on content, but then they also have an annoying habit of squirreling your documents away into the deep recesses of their black box.

    With available command line tools it ought to be pretty easy to do some regular expression matches with the content to attempt to categorize and then name a document.
    My main concern would be the inherent slushiness of the OCR process. How do you write a simple app that looks for errors like “Citihank” and “Ciflbank” and correctly interprets them as “Citibank”?

    I guess if I were writing such an app, it would display thumbnails of documents along with their proposed names, and a list of “undetermined” docs. Then you could go down the list and approve or reject the filename changes. Certainly most would be correct, but one or two would be misnamed.
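
To make the idea concrete, here is a minimal sketch of the matching step only. It assumes the OCR text arrives on standard input, the pattern covers just the few letter confusions mentioned above, and propose_name is a made-up function name. It prints a proposed name and never renames anything, which leaves room for the approve-or-reject pass.

```shell
#!/bin/sh
# Propose a file name from OCR text read on stdin.
# The character classes absorb a few common OCR confusions
# (b/h, t/f, i/l/1), so "Citihank" and "Ciflbank" still match.
propose_name() {
    text=$(cat)
    if printf '%s\n' "$text" | grep -Eiq 'c[il1][tf][il1][bh]ank' &&
       printf '%s\n' "$text" | grep -Eiq 'statement'; then
        printf 'citibank_statement_%s.pdf\n' "$(date +%Y-%m-%d)"
    else
        printf 'undetermined\n'
    fi
}
```

If a text extractor such as pdftotext happens to be installed, a call along the lines of pdftotext scan.pdf - | propose_name would feed it. Each new bank or utility needs its own hand-built pattern, which is exactly where the slushiness bites.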

  9. M writes:

    Hi Tad,

    I am trying to make this work with ABBYY FineReader Express 8 for Mac. Making some progress in that I can hand over to the program the PDF I want to work on; then I still have to tell the program “do it”, afterwards “save”; then next file.

    Do you think there’s a way to remote control the program entirely? I always want to use the same settings, but unattended.

    Thanks,

    M

  10. Tad writes:

    Hi M,

    This sounds like a job for pure AppleScript, doesn’t it? Even if the apps don’t provide any explicit AppleScript support, you can still have it drive the keyboard for you and automate the process.

    The goal behind my own script was to allow FineReader to run unattended while I was sipping tea somewhere else, and it does that just swimmingly.

    One point of difference: you are using FineReader Express 8, while I am working with FineReader for ScanSnap. If anything, I would assume yours provides more functionality.

  11. M writes:

    Hi Tad,

    yes, that didn’t work out on my side. I’ve now bought the Linux engine of ABBYY (150 EUR for 12000 pages a year) and written a wrapper around it that recursively and autonomously iterates my directory structure and runs the engine on all documents that have not yet been treated. I’ve made it open source:

    http://www.mnsoft.org/547.0.html?&cHash=9120c122ed&tx_ttnewsbackPid=544&tx_ttnewstt_news=30

    and

    pdfocrwrapper.sourceforge.net.

    Tested on thousands of pages, works perfectly.

    HTH,

    M

  12. Tad writes:

    Glad to hear you found a solution. As I said further up, for one situation at work we ended up licensing the ABBYY engine for Linux and running everything in batch mode.

    There was no great coding work on our part: we simply built their command line example and then wrapped that with a bash script.

    Some stuff works fine in a desktop scripting model, but there is a threshold beyond which a more batch-level solution is best.

  13. PDFPen OCR Applescript To Automatically Make PDFs Searchable | DocumentSnap writes:

    [...] However, that ability is not what this post is about. PDFPen will also OCR PDFs to make them searchable, and I wanted a way to OCR a bunch of documents automatically with an Applescript, similar to what has been done with Adobe Acrobat and with ABBYY FineReader. [...]

  14. Thomas writes:

    The problem is that a digital encrypted signature will not display if someone views the PDF using software other than Adobe’s. If on the other hand you make a custom stamp with an image of your “wet ink” signature, then anyone can see it regardless of software. Of course it is less “safe” to distribute your document with a signature image that might be “stolen”.

  15. Chris writes:

    Hi,

    Thanks so much for this very useful guide. I’m new to all of this, and you seem very well versed, so maybe you could help with this question.

    My desired workflow is 1) scan, 2) OCR, 3) import to Yojimbo.

    Is there any way I could modify this script to do that?

  16. Carsten writes:

    Hi Tad,

    Thanks for updating the script to work under Snow Leopard with the latest ABBYY and Scan Manager. It turns out that your script has been obsoleted by ABBYY’s ability to queue scans in the latest version. I discovered this by accident when I dropped several PDFs generated via the ScanSnap S510M. It just works. Also, using the “Scan to PDF” feature of the ScanSnap Manager works as one would expect. One doesn’t need to wait for OCR of one document to be complete before starting the next scan. It’s a wonderful thing after all and now works as it should have from the very beginning.

  17. Matt writes:

    Carsten,
    Do you know the version number of the latest FineReader for ScanSnap (for Mac)? Any idea where we can find the update that allows the scan queueing you stumbled across? And is the update available to owners of the S510M? I ask because I believe that the most current version of the ScanSnap Manager software is only available to owners of the new S1500M.

  18. Andy writes:

    I was glad to find this script but it for some reason won’t work on my system. Checking the log it always quits the script with the event “Other script running. Exiting…”. Any suggestions on what I can do to fix this?
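
Judging from the script itself, that message is logged whenever a file named OCR in Progress exists on the Desktop, so one plausible cause is a lock file left behind by an earlier run that never finished cleaning up. The following is a minimal sketch of a cleanup, not a confirmed fix: clear_stale_lock is a made-up name, and the 30 minute threshold is an arbitrary assumption.

```shell
#!/bin/sh
# Remove the lock file only if it is older than max_minutes,
# so a run that is genuinely in progress is left alone.
clear_stale_lock() {
    lock_file=$1
    max_minutes=${2:-30}
    [ -e "$lock_file" ] || return 0
    # find prints the path only when the file was last modified
    # more than max_minutes ago
    if [ -n "$(find "$lock_file" -mmin +"$max_minutes" 2>/dev/null)" ]; then
        rm -f "$lock_file"
    fi
}

# Example: clear_stale_lock "$HOME/Desktop/OCR in Progress"
```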

  19. snapscan touch writes:

    Some time back I published an AppleScript that allows one to The new version of the Abbyy Finereader product does not play well with my original script. . do shell script “/usr/bin/touch \”" & lockFilePath & “\”" ..

  20. Dennis writes:

    Is there any way to make ABBYY FineReader for ScanSnap for Mac just do the first page of the PDF? When ScanSnap Manager calls FineReader it just OCRs the first page.

    Thanks

  21. AjEG writes:

    Is there any chance this script will work with Fuji ScanSnap S1500M, ABBYY FineReader 8 (bundled with scansnap) and FileMaker Pro 11 all on OSX Lion? I’m looking for a way to automate scanning business cards, performing OCR and adding the captured data to a FileMaker database. Just wondering. Thanks.

  22. Glen writes:

    I just wanted you to know that your script works perfectly with Mac OS Lion, running on a ScanSnap S500M after downloading the Lion update for the ScanSnap manager at:

    http://www.fujitsu.com/us/services/computing/peripherals/scanners/support/lion_download2.html

    Nicely done!

  23. Glen writes:

    I should add to the previous comment that I am running ABBYY FineReader for ScanSnap 4.1, which can be downloaded from:

    https://store.abbyyusa.com/cgi-bin/dlreg?t=99dZGGhhN5nFUozfAWWE&k=94273683

  24. Jared writes:

    Converting to a paperless office. Bought Neatworks – software blows… I’m going to use FileMaker instead. What would an OCR workflow look like with the above script and FileMaker? At what point can the file be inserted into FileMaker as a record? Can it be automated as an add-on to your script?

  25. Adam Aronson writes:

    Great stuff here! Thanks for the script Tad!

    Reading the comments above I see a few people looking for the ability to integrate ScanSnap with FileMaker. Our company has a solution that we’d like to have a few people test for us. If you’re interested please visit our site (http://www.fullcityconsulting.com) and fill out the contact us form. I’ll follow up with you via email. Thanks!
