Multi-Frame Integration for Mobile Scanning & OCR

Reflections and bad lighting conditions can be a big issue when you are scanning a document with a mobile device. You tediously try to find an angle that captures all the relevant information at once, and even if you manage to scan the document, the result can be poor or incomplete. Looking at more than a single frame and integrating the redundant information of multiple frames, known as multi-frame integration for OCR, improves your mobile scanning experience.

Integration of Multiple Frames in OCR

Although the idea sounds simple, merging the information of multiple frames in OCR & mobile scanning technology is not trivial in practice. This process is often referred to as multiple-frame or temporal integration. For the task of OCR, the approaches fall into the following two categories:

  1. Perform OCR on each frame separately and merge the resulting text.
  2. Merge the frames into a single, better-suited image and perform OCR on that image. Depending on the situation, this could mean that the merged image should be free of reflections, obstacles or distortions. Even an increased resolution could be possible if the input image quality is insufficient.

In the following, I will consider a few problems that you have to handle in this context before performing the actual mobile OCR.

Plane Rectification & Text Detection

So let’s say you want to scan a business card (a planar rectangular surface) and you move your camera around it. Obviously you get a different perspective on it in each frame, and you do not get the perfect view from above that, for example, an office scanner gives you.

In other words, the business card in your image is perspectively distorted, which affects the scan and the OCR results. We need to warp the image so that the card is no longer geometrically distorted. A (slightly contrived) example:

business card
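To make the rectification step a bit more concrete, here is a minimal sketch in plain Python. It assumes the four corners of the card have already been found in the frame (the corner coordinates and the 850 x 550 target size below are made up for illustration) and estimates the 3x3 perspective transform (homography) that maps them onto an upright rectangle. The function names are my own and not part of any library.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography_from_corners(src, dst):
    """Estimate the homography mapping four src points onto four dst points.

    The bottom-right entry is fixed to 1, which leaves eight unknowns and
    two linear equations per point correspondence.
    """
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_homography(h, point):
    """Map a single (x, y) point through the homography h."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


# Hypothetical card corners in the camera frame (clockwise from top-left)
corners_in_frame = [(120, 80), (520, 130), (480, 400), (90, 330)]
# Upright target rectangle, e.g. an 85 x 55 mm card at 10 px per mm
card_rect = [(0, 0), (850, 0), (850, 550), (0, 550)]

H = homography_from_corners(corners_in_frame, card_rect)
```

Warping the whole image then means sending every pixel through this transform (in practice through its inverse, with interpolation), which an image processing library would normally do for you.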

Another big problem is finding the text you want to recognize. Since you need to move your camera around to get different lighting conditions, you cannot tell the user to hold the camera still so that the text sits within a viewfinder window or similar. Your software has to find the target text itself.

Given the high-resolution images produced by today’s mobile phones, this can be an expensive task. We will not go into details here, but there are several approaches, and this is a topic in OCR research as well. For those of you who are interested in more information regarding this, you’ll find some interesting resources and links at the end of this article.

Target Tracking & Actual Multi-Frame Integration

Since the rectification and the text search in general are quite expensive tasks, you may not want to do it on every single frame but only on a few keyframes. To not lose the information of the remaining frames you can track your target over time. This is generally faster than text detection in each frame separately.
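A rough sketch of this keyframe idea could look as follows. The `detect` and `track` callables are placeholders for whatever detector and tracker you use, and the interval of 10 frames is an arbitrary assumption, not a recommendation.

```python
def process_stream(frames, detect, track, keyframe_interval=10):
    """Run the expensive detector on keyframes only and track in between.

    detect(frame) -> region          (hypothetical, expensive)
    track(frame, region) -> region   (hypothetical, cheap; may return None)
    """
    region = None
    regions = []
    for index, frame in enumerate(frames):
        if region is None or index % keyframe_interval == 0:
            # Keyframe, or the target was lost: do the full detection.
            region = detect(frame)
        else:
            # In-between frame: just update the region found earlier.
            region = track(frame, region)
        regions.append(region)
    return regions
```

If the tracker loses the target and returns `None`, the next frame is treated like a keyframe, so the expensive detection only runs when it is really needed.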

There are tons of different approaches for tracking objects such as people or cars in videos. Nevertheless, like in our business card example, let us assume that the text is printed on a nearly planar surface (so, to be on the safe side, don’t create an origami bird out of your text document). This assumption makes things a lot easier. In simple terms, the math under the hood changes from nonlinear (pure evil) to linear (marvelous & wonderful).

Let’s keep things marvelous, wonderful & linear.

A simple approach could be based on salient features. That means nothing other than finding the same distinctive corners or blobs in consecutive frames. A visualization of it can look like this:

business card with green overlay
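As an illustration of "finding the same corners or blobs" in two frames, here is a small sketch that matches feature descriptors by nearest neighbour and keeps only unambiguous matches (a ratio test). It assumes each feature has already been described by a numeric vector; how those descriptors are computed, and the ratio of 0.75, are assumptions on my side.

```python
import math


def match_features(desc_a, desc_b, ratio=0.75):
    """Return (i, j) pairs where desc_a[i] clearly matches desc_b[j].

    A match is kept only if the best candidate is noticeably closer than
    the second-best one, which filters out ambiguous features.
    """
    matches = []
    for i, da in enumerate(desc_a):
        dists = sorted((math.dist(da, db), j) for j, db in enumerate(desc_b))
        if len(dists) >= 2 and dists[0][0] < ratio * dists[1][0]:
            matches.append((i, dists[0][1]))
    return matches


# Toy descriptors for two consecutive frames (made-up numbers)
frame1 = [(0.0, 1.0), (5.0, 5.0)]
frame2 = [(5.1, 4.9), (0.1, 1.0), (9.0, 0.0)]
pairs = match_features(frame1, frame2)
```

With enough matched points on a planar surface, the motion between two frames can again be described by a single perspective transform, which is what keeps the math linear.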

At this point, we have several frames and know where to find the required text in each of them. As already mentioned in the beginning, we can now perform OCR on each frame separately and simply compare the results. If the same text is recognized multiple times, it is likely that the recognition was successful.
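Comparing the per-frame results can be as simple as a majority vote. The following sketch assumes you already have one OCR string per frame; the 50% agreement threshold is an arbitrary choice for illustration.

```python
from collections import Counter


def vote_on_text(frame_results, min_share=0.5):
    """Pick the most frequent OCR result across frames.

    Returns (text, share). text is None if no frame produced text or if
    the winner's share of the non-empty results is below min_share.
    """
    # Normalize whitespace and drop frames where nothing was recognized.
    cleaned = [" ".join(r.split()) for r in frame_results if r and r.strip()]
    if not cleaned:
        return None, 0.0
    text, count = Counter(cleaned).most_common(1)[0]
    share = count / len(cleaned)
    return (text if share >= min_share else None), share


# Made-up OCR output of five frames showing the same line of text
results = ["Jane Doe", "Jane Doe", "Jane D0e", "", "Jane  Doe"]
best, confidence = vote_on_text(results)
```

Voting on whole strings is the crudest variant. Voting per word or per character would make better use of frames that are only partly correct, at the cost of having to align the strings first.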

However, to give an (even more contrived) example for the second category, the following figure illustrates the fusion of two overexposed frames using a simple (but regularly exploited) minimum operator (only applied to the overlapping region for presentation purposes):

business cards with and without reflection
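The minimum operator itself is tiny. The sketch below assumes two grayscale frames that are already rectified and aligned pixel by pixel, stored as nested lists of intensities from 0 (black) to 255 (white). The pixel values are made up.

```python
def fuse_min(frame_a, frame_b):
    """Pixel-wise minimum of two aligned grayscale frames of equal size."""
    if len(frame_a) != len(frame_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(frame_a, frame_b)
    ):
        raise ValueError("frames must be aligned and have the same size")
    return [
        [min(a, b) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(frame_a, frame_b)
    ]


# 255 marks a reflection; each frame has glare in a different spot
frame_a = [[30, 255, 255], [200, 40, 200]]
frame_b = [[30, 35, 200], [255, 255, 200]]
fused = fuse_min(frame_a, frame_b)
```

This works because a reflection makes pixels brighter, so taking the darker of the two values keeps the dark text wherever at least one frame shows it without glare. It does not help where both frames are overexposed at the same position.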

Of course, the given examples are quite simplified, but the concepts remain the same. As you can see, multi-frame integration (MFI) can remove annoying reflections and increase the certainty of your recognition result. However, a lot of components must work well together to obtain better results than single-frame approaches.

More Information on Multi-Frame Integration

These resources give you even more insights into the possibilities of Multi-Frame Integration in OCR. Text detection and recognition, especially in video imagery, is still a hot topic in current research, as you can see, and lots of awesome variations are published every year.
