Automate ScanSnap OCR process on your Mac with AppleScript (Snow Leopard Edition)
Monday, 4 January 2010
Some time back I published an AppleScript that automatically runs OCR in the background on scanned files generated by your Fujitsu ScanSnap while you continue scanning more files. ScanSnap owners should all be familiar with the problem: the out-of-the-box configuration of the ScanSnap Manager and Abbyy Finereader forces the scan and OCR stages to run in lockstep: scan 1…OCR 1…scan 2…OCR 2… and so on. The script let you keep scanning regardless of the OCR processing going on.
As it turns out, my original script does not work in Snow Leopard, and I promised that I would one day clean up and publish my new and improved version.
Chris posted a comment today as a gentle reminder, so here is the new and improved version without further delay…
The Details
Unfortunately, Snow Leopard came around and caused some indigestion. For starters, the ScanSnap Manager didn’t work correctly and Abbyy Finereader would not process anything made by the ScanSnap. A couple of months later they got everything straightened out and delivered new versions of each product.
The new version of the Abbyy Finereader product does not play well with my original script.
Since I cannot do without this important functionality, I rolled up my sleeves and rewrote most of the script. The new version works quite nicely in Snow Leopard, with one small annoyance: the new Finereader version keeps bouncing its darned Dock icon the whole time it is running, so you really don’t want to use the machine for anything other than scanning or OCR while a job is going.
Fortunately, I really don’t need to use my machine for anything else while it is chewing on the docs; I just wanted to be able to continue scanning at the same time!
Note: Before going any further, you will need to upgrade the ScanSnap Manager and Abbyy Finereader to their Snow Leopard versions. Get the files here.
Here is a link to the new script…
And here’s the code itself:
(*
NOTE: This script was written for Snow Leopard. It may work on Leopard,
but I never tried it.

This is a folder listener script that will act as a queue, receiving PDF
files from the ScanSnap scanner and feeding them, one by one, to the
Abbyy FineReader OCR software. This allows you to keep scanning while the
OCR job runs in the background on all of the unprocessed files.

Why do we want to do this?

The ScanSnap Manager software does not support this by default, so when
you scan in a file, it sends it to FineReader for OCR. You then must wait
until FineReader finishes its work before scanning in another document.
This script allows you to keep scanning without waiting for OCR.

Installation:

 o Copy this script to:
   <home>/Library/Scripts/Folder Action Scripts
   You may have to create the "Folder Action Scripts" folder.
 o Open a Finder window and navigate to the parent folder of the
   scanned documents folder.
 o Right click (control-click) the scanned documents folder and choose:
   Folder Actions Setup...
 o At this point if folder actions are not enabled, you will likely
   have to enable them and add the script manually.
   - check "Enable Folder Actions"
   - Use the "+" buttons on the left and right sides to add the
     scan folder and then this script.
 o Otherwise, a list of scripts will come up. Choose this script from
   the "Choose a Script to Attach" dialog.
 o Close all windows.

Copyright (C) 2010 Tad Harrison
*)

property ocrFileSuffix : " processed by FineReader.pdf"
property ocrApplicationName : "Scan to Searchable PDF"
property ocrApplicationWindow : "Converting the document"
property ocrLockFileName : "OCR in Progress"

on adding folder items to this_folder after receiving added_items
    set lockFilePath to (POSIX path of (path to desktop folder as text)) & ocrLockFileName
    try
        logEvent("=== Run OCR on New Folder Items ===")

        -- Test for lockfile; exit if lockfile exists
        tell application "System Events" to set lockFileExists to exists file lockFilePath
        if lockFileExists then
            logEvent("Other script running. Exiting...")
            return
        else
            do shell script "/usr/bin/touch \"" & lockFilePath & "\""
        end if

        -- Main loop
        set moreWorkToDo to true
        repeat while moreWorkToDo
            set aFile to getNextFile(this_folder)
            if not aFile = "" then
                ocrFile(aFile)
            else
                set moreWorkToDo to false
            end if
        end repeat
        logEvent("No more work.")
        exitApp(ocrApplicationName)
    on error errorStr number errNum
        display dialog "Error " & errNum & " while running OCR: " & errorStr
    end try

    -- Get rid of the lockfile, ignoring any errors
    try
        do shell script "/bin/rm \"" & lockFilePath & "\""
    end try
end adding folder items to

(*
Name: ocrFile
Description: Runs OCR on the next un-OCR'd file
Parameters:
    aFile - the file to be OCR'd
*)
on ocrFile(aFile)
    set posixFilePath to POSIX path of aFile
    set posixOcrFilePath to getPosixOcrFilePath(posixFilePath)
    logEvent("OCR: " & posixFilePath)
    tell application ocrApplicationName to open aFile
    --
    -- Now sit in a loop checking once per second for the OCR file
    -- Give up after five minutes
    --
    with timeout of 300 seconds
        set ocrFileExists to false
        repeat until ocrFileExists
            set ocrFileExists to posixFileExists(posixOcrFilePath)
            if ocrFileExists then
                logEvent("OCR file generated.")
                -- Wait 5 even if the file was found, to let things settle
                delay 5
            else
                -- Wait a second before checking again
                delay 1
            end if
        end repeat
    end timeout
end ocrFile

(*
Name: appIsRunning
Description: Determines if a particular application is running.
Parameters:
    appName - the name of the application to be tested
Returns: True if the application is running; otherwise False
*)
on appIsRunning(appName)
    tell application "System Events" to (name of processes) contains appName
end appIsRunning

(*
Name: posixFileExists
Description: Determines if a particular file exists.
Parameters:
    posixFilePath - the POSIX path to the file
Returns: True if the file exists; otherwise False
*)
on posixFileExists(posixFilePath)
    tell application "System Events" to exists file posixFilePath
end posixFileExists

(*
Name: exitApp
Description: Exits the specified app if it is running.
Parameters:
    appName - the application name
*)
on exitApp(appName)
    if appIsRunning(appName) then
        tell application appName to quit
    end if
end exitApp

(*
Name: getPosixOcrFilePath
Description: Gets the OCR output filename for a given input filename.
Parameters:
    posixFilePath - the full path to the source file
Return: the POSIX path of the OCR output file
*)
on getPosixOcrFilePath(posixFilePath)
    set posixBaseName to do shell script ¬
        "filename=" & quoted form of posixFilePath & "; echo ${filename%\\.*}"
    set posixOcrFilePath to posixBaseName & ocrFileSuffix
    return posixOcrFilePath
end getPosixOcrFilePath

(*
Name: getNextFile
Description: Finds the next unprocessed ScanSnap PDF
Return: the file or ""
*)
on getNextFile(aFolder)
    logEvent("Getting next file...")
    set masterFileList to list folder aFolder ¬
        without invisibles
    set posixPath to POSIX path of aFolder
    repeat with i from 1 to count masterFileList
        set fileName to item i of masterFileList
        set posixFilePath to posixPath & fileName
        log posixFilePath
        --
        -- Construct a FineReader file name from our file
        --
        set posixOcrFilePath to getPosixOcrFilePath(posixFilePath)
        --
        -- See if the FineReader file we constructed exists
        --
        set ocrFileExists to posixFileExists(posixOcrFilePath)
        tell me to set fileCreator to getSpotlightInfo for "kMDItemCreator" from posixFilePath
        log ("Creator: " & fileCreator)
        if not ocrFileExists and fileCreator = "ScanSnap Manager" then
            return POSIX file posixFilePath
        end if
    end repeat
    return ""
end getNextFile

(*
Name: getSpotlightInfo
Description: Gets a named attribute from metadata for a specific file.
Parameters:
    for myattribute - the name of the attribute
    from myfile - the name of the file
Returns: the attribute value or "" if none found
*)
on getSpotlightInfo for myattribute from myfile
    try
        set this_kMDItemResult to ""
        tell application "Finder"
            set this_item to myfile as string
            set this_item to POSIX path of this_item
            set this_kMDItem to myattribute
            set theResult to words of (do shell script "/usr/bin/mdls -name " & this_kMDItem & " -raw -nullMarker None " & quoted form of this_item)
            log "Result: " & theResult as string
            repeat with j from 1 to number of items in theResult
                set this_kMDItemResult to this_kMDItemResult & item j of theResult as string
                if j < number of items in theResult then
                    set this_kMDItemResult to this_kMDItemResult & " "
                end if
            end repeat
        end tell
    on error
        set this_kMDItemResult to ""
    end try
    return this_kMDItemResult
end getSpotlightInfo

(*
Name: logEvent
Description: Write an event to an event log
Parameters:
    themessage - the message to write to the log
*)
on logEvent(themessage)
    set theLine to (do shell script ¬
        "date +'%Y-%m-%d %H:%M:%S'" as string) ¬
        & " " & themessage
    -- Quote the line so file names with quotes, parentheses, etc. cannot break the shell command
    do shell script "echo " & quoted form of theLine & ¬
        " >> ~/Library/Logs/AppleScript-events.log"
end logEvent
Installation
- Use the Script Editor to save this script as Run OCR on New Folder Items under User Home/Library/Scripts/Folder Action Scripts
  You may have to create the Folder Action Scripts folder.
- Now open a Finder window and navigate to the parent folder of your scanned documents folder.
- Right click (control-click) the scanned documents folder and choose Folder Actions Setup…
- At this point if folder actions are not enabled, you will likely have to enable them and add the script manually.
- Check Enable Folder Actions
- Use the “+” buttons on the left and right sides to add the scan folder and then this script.
- Otherwise, a list of scripts will come up. Choose this script from the Choose a Script to Attach dialog.
- Close all windows.
That’s it! The script will be invoked automatically every time a new file appears in your scanned documents folder.
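If the queue ever seems stuck, two things the script leaves behind are worth checking from Terminal: the event log it appends to and the lock file it creates on the Desktop while it runs. Here is a minimal sketch, assuming the default names from the script (`OCR in Progress` on the Desktop and `~/Library/Logs/AppleScript-events.log`). The function names are made up for illustration and are not part of the script.

```shell
# Paths as used by the folder action script above; override them if you changed the script.
LOCK_FILE="${LOCK_FILE:-$HOME/Desktop/OCR in Progress}"
LOG_FILE="${LOG_FILE:-$HOME/Library/Logs/AppleScript-events.log}"

# Print the last few lines the script logged (default: 20 lines).
show_recent_events() {
  if [ -f "$LOG_FILE" ]; then
    tail -n "${1:-20}" "$LOG_FILE"
  else
    echo "no log yet: $LOG_FILE"
  fi
}

# Remove a lock file left behind by an interrupted run.
# Only do this when no OCR run is actually in progress.
clear_stale_lock() {
  if [ -e "$LOCK_FILE" ]; then
    rm -f -- "$LOCK_FILE"
    echo "removed: $LOCK_FILE"
  else
    echo "no lock file"
  fi
}
```

Run `show_recent_events` to see what the script last did, and `clear_stale_lock` if the log keeps reporting "Other script running. Exiting..." while nothing is actually being processed.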
Please let me know if you have any ideas that can improve this script. I’m not an AppleScript guru, so someone might just know how to keep that annoying Finereader icon from jumping.
No. 1 — January 4th, 2010 at 8:55 pm
[...] Update: The script on this page works only with Leopard (10.5). Get the Snow Leopard version here [...]
No. 2 — January 7th, 2010 at 2:32 am
Excellent script!
I had to change the string “ScanSnap Manager” to “ScanSnap Manager S1500M” to get it working, but now it is working like a charm.
The only thing I noticed is the quality of the PDFs that Finereader outputs. They don’t look as good as the original scan. Do you have any ideas how to improve that?
No. 3 — January 8th, 2010 at 5:10 pm
Glad to hear it works for you!
And yes, I would expect that a couple of text strings might need to be tweaked to go from one ScanSnap model to another.
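To find out which string your own scanner needs, you can ask Spotlight for the creator of one of your scans. The sketch below reuses the same `mdls` flags as the script. The `MDLS` variable is a hypothetical override hook added only so the function can be exercised without the real tool, and `creator_of` is an illustrative name.

```shell
# Print the Spotlight "creator" of a file, i.e. the value the script
# compares against "ScanSnap Manager".
# MDLS is a hypothetical override; by default the real /usr/bin/mdls is used.
creator_of() {
  "${MDLS:-/usr/bin/mdls}" -name kMDItemCreator -raw -nullMarker None "$1"
}
```

Call it as `creator_of /path/to/one-of-your-scans.pdf` (the path is a placeholder) and put whatever it prints into the `fileCreator = "ScanSnap Manager"` comparison in `getNextFile`.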
As far as the quality, this is likely a consequence of settings in Finereader.
I launched FineReader for ScanSnap Preferences and clicked on the Scan to Searchable PDF tab.
On this tab I have selected the following:
Save mode: Text under page image
Quality: High (for printing)
Format: Automatic
I imagine your Quality setting might be on one of the lower settings right now.
No. 4 — March 25th, 2010 at 6:28 pm
I guess I wasn’t searching for the right thing when I tried to stop the icon from bouncing.
Anyway, many sites out there had the following instructions:
To stop all dock bouncing forever, type the following in a terminal window:
defaults write com.apple.dock no-bouncing -bool TRUE
and then
killall Dock
To reverse the process, do the same thing, but use FALSE instead of TRUE.
No. 5 — April 2nd, 2010 at 11:18 am
I am considering buying the ScanSnap for the Mac. It is bundled with ABBYY FineReader, and I am just testing that piece of software for another workflow, in which I want to use OCR on photographs of book pages and large screenshots of online digital facsimiles (like Google Books). As I have a 30 inch monitor, my screenshots of book pages can be rather good. The whole process is managed by FileMaker Pro (10) with several plug-ins. OCR from screenshots with ABBYY works surprisingly well. So now I want to script the steps of the process. As ABBYY does not have any AppleScript functionality, but seems to have some scripting engine (ABBYY FineReader Engine), I was wondering whether you have any experience with this… I also saw some mention of a Command Line Interface (CLI); maybe such a command could be called from AppleScript to run it in UNIX? Let me know if these are realistic tracks to follow… Thanks in advance
No. 6 — April 2nd, 2010 at 12:29 pm
Actually, for low throughput purposes, the AppleScript that I put together to send docs to the ABBYY app works just fine. You don’t have fine grained control over the OCR settings on a per doc basis, but it is definitely possible to have AppleScript feed documents to ABBYY.
If you are looking into a more substantial solution, for a small organization or business, you might be interested in talking with ABBYY about licensing their engine. This is what we did at work for a particular job: we bought a license to use their engine on a single Linux server for a specific number of pages per year.
What they provided was their SDK along with platform-specific binaries. We were able to compile a command line client to the SDK that worked wonders. We then used a bash script on a cron job to wrap the command line client, periodically checking a source directory for files and feeding them to the CLI.
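A rough sketch of that kind of wrapper, not the actual script used: `ocr_cmd` stands in for the command line client (its real name and arguments are not shown here, so a simple `input output` calling convention is assumed), and the `.ocr.pdf` output suffix is made up for illustration.

```shell
# Feed every PDF in a source directory that has no OCR output yet to an OCR command.
# Hypothetical assumptions: the OCR command is called as "cmd INPUT OUTPUT",
# and output files are named "<name>.ocr.pdf" next to the input.
process_pending() {
  local src_dir=$1
  local ocr_cmd=$2
  local input output
  for input in "$src_dir"/*.pdf; do
    [ -e "$input" ] || continue               # directory had no PDFs at all
    case "$input" in
      *.ocr.pdf) continue ;;                   # skip files that are OCR output themselves
    esac
    output="${input%.pdf}.ocr.pdf"
    [ -e "$output" ] && continue               # already processed
    "$ocr_cmd" "$input" "$output" || echo "OCR failed: $input" >&2
  done
}
```

Saved in a script that ends with a call such as `process_pending /path/to/incoming /path/to/ocr-client` (both paths are placeholders), this could be run from a crontab entry like `*/5 * * * * /path/to/that-script.sh` to check the directory every five minutes.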
No. 7 — April 14th, 2010 at 4:43 pm
Hi. Interesting post. I was wondering whether you then go on to name the files or just save them and subsequently find them using the OCR search. I’m trying to come up with a method whereby, after the file has been through the OCR process, it is automatically named based on whether certain terms appear in the file itself. For example, if a scan of a Citibank bank statement contains the words citibank + statement, then the file would be saved into a particular folder as citibank_statement_today’s date, hence removing a lot of the tedious manual naming.
I’d appreciate any thoughts you have on the subject.
Thanks
Mark
No. 8 — April 15th, 2010 at 8:20 pm
Hi Mark,
No, in my home life the docs are few enough that I name and file them by hand. I imagine that some of the heavy hitters in the document management field will automatically populate keywords based on content, but then they also have an annoying habit of squirreling your documents away into the deep recesses of their black box.
With available command line tools it ought to be pretty easy to do some regular expression matches with the content to attempt to categorize and then name a document.
My main concern would be the inherent slushiness of the OCR process. How do you write a simple app that looks for errors like “Citihank” and “Ciflbank” and correctly interprets them as “Citibank”?
I guess if I were writing such an app, it would display thumbnails of documents along with their proposed names, and a list of “undetermined” docs. Then you could go down the list and approve or reject the filename changes. Certainly most would be correct, but one or two would be misnamed.
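As a rough illustration of the regular-expression idea, here is a sketch that tolerates the look-alike substitutions mentioned above (t/f, i/l/1, b/h). It assumes the OCR text has already been dumped to a plain text file, for example with a tool such as `pdftotext` (an assumption, not something used in this post). The pattern and the function name `propose_name` are illustrative only, and the function merely proposes a name so that a person can still approve or reject it.

```shell
# Tolerant pattern: accepts common OCR look-alikes, so "Citihank" and
# "Ciflbank" still count as "Citibank". Illustrative, not exhaustive.
CITIBANK_PATTERN='c[il1][tf][il1][bh]ank'

# Print a proposed base name for a document, or "undetermined".
# $1 is a plain text file holding the OCR text of the document.
propose_name() {
  local text_file=$1
  if grep -Eiq "$CITIBANK_PATTERN" "$text_file" && grep -Eiq 'statement' "$text_file"; then
    echo "citibank_statement_$(date +%Y-%m-%d)"
  else
    echo "undetermined"
  fi
}
```

Anything reported as `undetermined` would go on the list to be named by hand.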
No. 9 — June 25th, 2010 at 2:04 am
Hi Tad,
I am trying to make this work with ABBYY FineReader Express 8 for Mac. I am making some progress in that I can hand over to the program the PDF I want to work on; then I still have to tell the program “do it”, afterwards “save”; then on to the next file.
Do you think there’s a way to remote control the program entirely? I always want to use the same settings, but unattended.
Thanks,
M
No. 10 — June 29th, 2010 at 9:24 pm
Hi M,
This sounds like a job for pure AppleScript, doesn’t it? Even if the apps don’t provide any explicit AppleScript support, you can still have it drive the keyboard for you and automate the process.
The goal behind my own script was to allow FineReader to run unattended while I was sipping tea somewhere else, and it does that just swimmingly.
One point of difference: you are using FineReader Express 8, while I am working with FineReader for ScanSnap. If anything, I would assume yours provides more functionality.
No. 11 — July 1st, 2010 at 11:10 am
Hi Tad,
yes, that didn’t work out on my side. I’ve now bought the Linux engine of ABBYY (150 EUR for 12000 pages a year) and written a wrapper around it that recursively and autonomously iterates my directory structure and runs the engine on all documents that have not yet been treated. I’ve made it open source:
http://www.mnsoft.org/547.0.html?&cHash=9120c122ed&tx_ttnewsbackPid=544&tx_ttnewstt_news=30
and
pdfocrwrapper.sourceforge.net.
Tested on thousands of pages, works perfectly.
HTH,
M
No. 12 — July 1st, 2010 at 2:23 pm
Glad to hear you found a solution. As I said further up, for one situation at work we ended up licensing the ABBYY engine for Linux and running everything in batch mode.
There was no great coding work on our part: we simply built their command line example and then wrapped that with a bash script.
Some stuff works fine in a desktop scripting model, but there is a threshold beyond which a more batch-level solution is best.
No. 13 — September 28th, 2010 at 12:11 pm
[...] However, that ability is not what this post is about. PDFPen will also OCR PDFs to make them searchable, and I wanted a way to OCR a bunch of documents automatically with an Applescript, similar to what has been done with Adobe Acrobat and with ABBYY FineReader. [...]
No. 14 — October 1st, 2010 at 2:02 am
The problem is that a digital encrypted signature will not display if someone views the PDF using software other than Adobe’s. If, on the other hand, you make a custom stamp with an image of your “wet ink” signature, then anyone can see it regardless of software. Of course, it is less “safe” to distribute your document with a signature image that might be “stolen”.
No. 15 — October 1st, 2010 at 9:58 am
Hi,
Thanks so much for this very useful guide. I’m new to all of this, and you seem very well versed, so maybe you could help with this question.
My desired workflow is 1) scan, 2) OCR, 3) import to Yojimbo.
Is there any way I could modify this script to do that?
No. 16 — November 14th, 2010 at 10:52 pm
Hi Tad,
Thanks for updating the script to work under Snow Leopard with the latest ABBYY and ScanSnap Manager. It turns out that your script has been obsoleted by ABBYY’s ability to queue scans in the latest version. I discovered this by accident when I dropped several PDFs generated via the ScanSnap S510M. It just works. Also, using the “Scan to PDF” feature of the ScanSnap Manager works as one would expect. One doesn’t need to wait for OCR of one document to be complete before starting the next scan. It’s a wonderful thing after all and now works as it should have from the very beginning.
No. 17 — November 17th, 2010 at 7:45 pm
Carsten,
Do you know the version number of the latest FineReader for ScanSnap (for Mac)? Any idea where we can find the update that allows the scan queueing you stumbled across? And is the update available to owners of the S510M? I ask because I believe that the most current version of the ScanSnap Manager software is only available to owners of the new S1500M.
No. 18 — April 28th, 2011 at 2:53 pm
I was glad to find this script, but for some reason it won’t work on my system. Checking the log, it always quits the script with the event “Other script running. Exiting…”. Any suggestions on what I can do to fix this?
No. 20 — October 1st, 2011 at 3:07 pm
Is there any way to make ABBYY FineReader for ScanSnap for Mac just do the first page of the PDF? When ScanSnap Manager calls FineReader it just OCRs the first page.
Thanks
No. 21 — December 14th, 2011 at 2:03 pm
Is there any chance this script will work with Fuji ScanSnap S1500M, ABBYY FineReader 8 (bundled with scansnap) and FileMaker Pro 11 all on OSX Lion? I’m looking for a way to automate scanning business cards, performing OCR and adding the captured data to a FileMaker database. Just wondering. Thanks.
No. 22 — January 2nd, 2012 at 1:49 am
I just wanted you to know that your script works perfectly with Mac OS Lion, running on a ScanSnap S500M after downloading the Lion update for the ScanSnap manager at:
http://www.fujitsu.com/us/services/computing/peripherals/scanners/support/lion_download2.html
Nicely done!
No. 23 — January 2nd, 2012 at 1:56 am
I should add to the previous comment that I am running ABBYY FineReader for ScanSnap 4.1, which can be downloaded from:
https://store.abbyyusa.com/cgi-bin/dlreg?t=99dZGGhhN5nFUozfAWWE&k=94273683
No. 24 — January 16th, 2012 at 5:11 pm
Converting to a paperless office. Bought Neatworks – software blows… I’m going to use FileMaker instead. What would an OCR workflow look like with the above script and FileMaker? At what point can the file be inserted into FileMaker as a record? Can it be automated as an add-on to your script?
No. 25 — March 13th, 2012 at 9:55 am
Great stuff here! Thanks for the script Tad!
Reading the comments above I see a few people looking for the ability to integrate ScanSnap with FileMaker. Our company has a solution that we’d like to have a few people test for us. If you’re interested please visit our site (http://www.fullcityconsulting.com) and fill out the contact us form. I’ll follow up with you via email. Thanks!