High speed OCR and parallel processing

I've written a number of programs to read tabular data from images using MATLAB's ocr function. I clean up the image files before running OCR (binarize, etc.). However, it takes about 4 seconds per 100 rows of single-column data. Unfortunately I have hundreds of thousands of rows to work with, so I need a way to speed this up. Using an ROI or cropping the image into individual table cells didn't make much difference. Can someone help by pointing out some options?
  • Is there a way to make OCR run faster?
  • I have seen some documentation on parallel processing and was wondering if that could help. My computer has 4 cores. Should I explore the following?
  1. hyperthreading
  2. using more workers than the number of cores
  3. increasing the number of threads per worker
In essence, I want to split the hundreds of image files so that they are processed separately, and to maximise the speed.
Thank you.

Accepted Answer

Walter Roberson on 29 Jan 2017
To have any chance of making ocr() itself run faster, you would need to train a custom network. This assumes that the fonts and handwriting sloppiness are more restricted in your situation (e.g. one font at one size); if you have a general written-text OCR problem, then training your own network is not likely to speed anything up.
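Short of custom training, one way to test the "restricted problem" idea is to narrow what ocr() has to consider through its documented 'CharacterSet' and 'TextLayout' options. This is not part of the answer above; it is a minimal sketch that assumes the Computer Vision Toolbox is installed and that the column holds only digits, a decimal point, and a minus sign. The synthetic image is a stand-in for a real table cell, and whether the restricted call is actually faster has to be measured on your own images.

```matlab
% Sketch only: assumes the Computer Vision Toolbox (ocr, insertText).
% Build a small synthetic "table cell" so the example is self-contained.
I = 255 * ones(60, 240, 'uint8');                 % white background
I = insertText(I, [10 10], '12345.67', ...
    'FontSize', 28, 'BoxOpacity', 0, 'TextColor', 'black');
I = rgb2gray(I);                                  % insertText returns RGB

% Baseline: general-purpose recognition with default settings.
tic;
general = ocr(I);
tGeneral = toc;

% Restricted: numeric characters only, and treat the input as one line.
% The character set here is an assumption about the data, not a default.
tic;
restricted = ocr(I, 'CharacterSet', '0123456789.-', 'TextLayout', 'Line');
tRestricted = toc;

fprintf('default:    %.3f s, text = "%s"\n', tGeneral, strtrim(general.Text));
fprintf('restricted: %.3f s, text = "%s"\n', tRestricted, strtrim(restricted.Text));
```

A single call is a noisy measurement; repeating each variant over a few dozen real cells gives a more trustworthy comparison.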
The general task of OCR could, I suspect, be done faster using different algorithms. I say that thinking about the speed of the automatic mail sorters. On the other hand those do not have to deal with hundreds of rows.
You need to profile your code. Hyperthreading is an advantage if you are waiting on I/O. If you are busy with computations, then hyperthreading can slow things down. Assigning more threads than cores, or more workers than cores, leads to contention for resources unless the threads or workers typically spend a lot of time waiting for interrupts.
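As a starting point for the profiling step, the following sketch runs the MATLAB profiler around a small OCR loop and prints the functions with the largest total time. It assumes the Computer Vision Toolbox is available; the synthetic image and the repetition count are placeholders for your own processing code.

```matlab
% Sketch only: the image and loop below stand in for the real workload.
I = 255 * ones(60, 240, 'uint8');
I = insertText(I, [10 10], '12345.67', ...
    'FontSize', 28, 'BoxOpacity', 0, 'TextColor', 'black');
I = rgb2gray(I);

profile on
for k = 1:20
    result = ocr(I);            % the code being investigated
end
profile off

% Pull the collected statistics and list the most expensive functions.
stats = profile('info');
totalTime = [stats.FunctionTable.TotalTime];
[~, order] = sort(totalTime, 'descend');
for k = order(1:min(5, numel(order)))
    fprintf('%8.3f s  %s\n', stats.FunctionTable(k).TotalTime, ...
        stats.FunctionTable(k).FunctionName);
end
```

Running `profile viewer` afterwards opens the same data in the interactive report. If most of the time turns out to be in file reading rather than in ocr(), the advice about I/O-bound versus compute-bound work applies differently.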
parfor and spmd are not always more productive. They are most effective for low-I/O, high-computation work where the matrices involved are small or moderate and you do not do extensive operations such as eigenvalues or the \ operator. With larger matrices and vectorized code, especially code that does linear algebra, you would typically get better performance by leaving the code not explicitly parallel, so that it can use the multithreaded high-performance libraries (those have much lower overhead than creating workers).
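For the case in the question, where each image file can be processed independently, a parfor loop over files is the natural thing to try and to time against a plain for loop. The sketch below assumes the Computer Vision Toolbox and the Parallel Computing Toolbox; the temporary folder, file names, and image contents are made up for illustration. The pool is started with the default profile, which typically gives one worker per physical core rather than more workers than cores.

```matlab
% Sketch only: writes a few synthetic images to a temporary folder
% (hypothetical names) so the example runs without real data.
workDir = tempname;
mkdir(workDir);
numFiles = 8;
names = cell(numFiles, 1);
for k = 1:numFiles
    I = 255 * ones(60, 240, 'uint8');
    I = insertText(I, [10 10], sprintf('%d', 1000 + k), ...
        'FontSize', 28, 'BoxOpacity', 0, 'TextColor', 'black');
    names{k} = fullfile(workDir, sprintf('page_%02d.png', k));
    imwrite(rgb2gray(I), names{k});
end

% Start a pool up front so startup cost is not counted in the timing.
if isempty(gcp('nocreate'))
    parpool;
end

% Serial baseline.
serialTexts = cell(numFiles, 1);
tic;
for k = 1:numFiles
    r = ocr(imread(names{k}));
    serialTexts{k} = r.Text;
end
tSerial = toc;

% One file per iteration; iterations are independent, so parfor applies.
texts = cell(numFiles, 1);
tic;
parfor k = 1:numFiles
    r = ocr(imread(names{k}));
    texts{k} = r.Text;
end
tParallel = toc;

fprintf('serial: %.2f s, parfor: %.2f s\n', tSerial, tParallel);
```

With only a handful of tiny images the parallel version may well be slower because of communication overhead; the comparison only becomes meaningful on a realistic batch of files.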
