Project Description

Multi Frame Integration for OCR

Reflections and bad lighting conditions can be a big issue when you are performing OCR on a document. You tediously try to find a proper angle that lets you retrieve all the relevant information at once. Even if you’re able to scan the document, the result could be poor or incomplete. Looking at more than a single frame and integrating the redundant information of multiple frames can enhance your OCR experience.


Although the idea itself is simple, merging the information is not. This process is often referred to as multiple frame integration (MFI) or temporal integration. For the task of OCR, the approaches fall into the following two categories:

  1. Perform OCR on each frame separately and merge the resulting text.
  2. Merge the frames to get a single “better suited” image to perform OCR on. Depending on the situation this could mean that the result should be free of reflections, obstacles or distortions. Even an increased resolution could be possible if the input image quality is insufficient.
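To make the first approach a bit more concrete, here is a minimal sketch of merging per-frame OCR results by a simple majority vote. The function name `merge_ocr_results` and the `min_votes` parameter are made up for this illustration, and the input strings stand in for whatever your OCR engine returns for the same text region in each frame.

```python
from collections import Counter


def merge_ocr_results(frame_texts, min_votes=2):
    """Return the reading seen most often across frames.

    Returns None if no reading was seen at least `min_votes` times.
    """
    # Collapse whitespace so "John  Doe" and "John Doe" count as the same reading,
    # and drop frames where nothing was recognized.
    normalized = [" ".join(t.split()) for t in frame_texts if t and t.strip()]
    if not normalized:
        return None
    text, votes = Counter(normalized).most_common(1)[0]
    return text if votes >= min_votes else None


# Hypothetical OCR output for the same line of text in five frames.
readings = ["John Doe", "John Doe", "J0hn Doe", "", "John  Doe"]
print(merge_ocr_results(readings))  # John Doe
```

Voting on whole strings is the crudest possible variant. Voting per word or per character would be more robust, but it needs an alignment step that is left out here.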

In the following I will consider a few problems which you have to handle in this context, before performing actual OCR.


So let’s say you want to scan a business card (a planar rectangular surface) and you move your camera around it. Obviously you get a different perspective on it in each frame, and you do not have the perfect view from above that you would get from, for example, an office scanner. In other words, the business card in your image is perspectively distorted. This may degrade your OCR results. We need to warp our image such that the card is not geometrically distorted anymore. You may know this procedure from those great apps turning your mobile phone camera into a document scanner. A (slightly contrived) example:

business card
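The rectification step can be sketched in a few lines if you assume that the four corners of the card have already been found in the image. The sketch below estimates the 3x3 perspective transform (a homography) that maps those corners onto an upright rectangle. All names and the corner coordinates are made up for this illustration, and the bottom-right matrix entry is fixed to 1, which is a common simplification that works for ordinary camera views.

```python
def solve_linear(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def homography_from_corners(src, dst):
    """Estimate H (with H[2][2] = 1) mapping four src points onto four dst points."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # Two linear equations per correspondence, eight unknowns in total.
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_homography(h, point):
    """Map a single (x, y) point through H, including the perspective division."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    u = (h[0][0] * x + h[0][1] * y + h[0][2]) / w
    v = (h[1][0] * x + h[1][1] * y + h[1][2]) / w
    return (u, v)


# Hypothetical card corners in the photo (clockwise, starting top left) ...
corners_in_image = [(120, 80), (520, 110), (560, 380), (90, 330)]
# ... and where they should end up: an 850 x 550 pixel upright rectangle.
corners_rectified = [(0, 0), (850, 0), (850, 550), (0, 550)]

H = homography_from_corners(corners_in_image, corners_rectified)
for corner in corners_in_image:
    u, v = apply_homography(H, corner)
    print(corner, "->", (round(u, 3), round(v, 3)))
```

To actually warp the image you would loop over the output pixels and look up the source pixel via the inverse mapping. In practice you would probably leave both steps to a library, for example OpenCV's `cv2.getPerspectiveTransform` and `cv2.warpPerspective`.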

Another big problem is to find the text you want to recognize. Since you need to move your camera around to get different lighting conditions, you cannot tell the user to hold the camera still so that the text stays within a viewfinder window or similar. Your software has to find the target text itself. Given the high-resolution images produced by today’s mobile phones, this can be an expensive task. I will not go into details here, but there are several approaches and this is a topic of current research. For those of you who are interested, you’ll find some useful links at the end of the post.


Since rectification and text search are in general quite expensive tasks, you may not want to do them on every single frame but only on a few keyframes. To not lose the information of the remaining frames you can track your target over time. This is generally faster than running text detection on each frame separately. There are tons of different approaches to track different objects in videos, like people or cars. Nevertheless, like in our business card example, let us assume that the text is printed on a nearly planar surface (so, to be on the safe side, don’t create an origami bird out of your text document). This assumption makes things a lot easier. In simple terms, the math under the hood changes from nonlinear (pure evil) to linear (marvelous and wonderful).

Let’s keep things marvelous, wonderful and linear.
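For the curious, this is what “linear” means here. If the target is planar, a point (x, y) in one frame and the corresponding point (x', y') in another frame are related by a 3x3 matrix H, usually called a homography, once both points are written in homogeneous coordinates. The symbols below are the standard textbook notation, not something specific to this post:

```latex
\lambda
\begin{pmatrix} x' \\ y' \\ 1 \end{pmatrix}
=
\begin{pmatrix}
h_{11} & h_{12} & h_{13} \\
h_{21} & h_{22} & h_{23} \\
h_{31} & h_{32} & h_{33}
\end{pmatrix}
\begin{pmatrix} x \\ y \\ 1 \end{pmatrix},
\qquad \lambda \neq 0
```

Since H is only defined up to scale, it has eight degrees of freedom, so four point correspondences (with no three points on a line) are enough to estimate it.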

A simple approach could be based on salient features. That means nothing other than finding the same distinctive corners or blobs in consecutive frames. A visualization of it can look like this:

business card with green overlay
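One common way to find “the same” corner or blob in two consecutive frames is to compare feature descriptors and keep only unambiguous nearest neighbours (often called a ratio test). The sketch below assumes that some feature detector has already produced one descriptor vector per keypoint. The function name `match_features`, the `ratio` threshold of 0.75 and the tiny three-dimensional descriptors are made up for this illustration, since real descriptors are much longer.

```python
import math


def match_features(desc_prev, desc_curr, ratio=0.75):
    """Match descriptors of the previous frame to those of the current frame.

    Returns (i, j) pairs meaning desc_prev[i] corresponds to desc_curr[j].
    A match is kept only if the best candidate is clearly closer than the
    second best one, which filters out ambiguous features.
    """
    if len(desc_curr) < 2:
        return []  # the ratio test needs at least two candidates
    matches = []
    for i, d in enumerate(desc_prev):
        candidates = sorted((math.dist(d, e), j) for j, e in enumerate(desc_curr))
        (best_dist, best_j), (second_dist, _) = candidates[0], candidates[1]
        if best_dist < ratio * second_dist:
            matches.append((i, best_j))
    return matches


# Hypothetical descriptors for three features in the previous frame ...
prev = [(0.0, 1.0, 0.0), (1.0, 0.0, 0.0), (0.5, 0.5, 0.5)]
# ... and four in the current frame. The last two are nearly identical,
# so the third feature of the previous frame has no unambiguous match.
curr = [(1.0, 0.1, 0.0), (0.0, 0.9, 0.1), (0.5, 0.5, 0.48), (0.52, 0.5, 0.5)]

print(match_features(prev, curr))  # [(0, 1), (1, 0)]
```

The matched point pairs are what you would then feed into the estimation of the frame-to-frame transform for the planar target.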

Finally, we have several frames and know where to find the required text in each frame. As already mentioned in the beginning, we can now perform OCR on each frame separately and simply compare the results. If the same text is recognized multiple times, it is likely that the recognition was successful. Alternatively, we can merge the frames first. To give an (even more contrived) example for this second category, the following figure illustrates the fusion of two overexposed frames using a simple (but regularly exploited) minimum operator (only applied to the overlapping region for presentation purposes):

business cards with and without reflection
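The minimum operator itself is almost embarrassingly simple. Here is a minimal sketch, assuming the frames are grayscale, have already been warped onto each other (so that the same pixel position shows the same spot on the card) and are stored as plain nested lists. The function name `fuse_minimum` and the pixel values are made up for this illustration.

```python
def fuse_minimum(frames):
    """Pixel-wise minimum over aligned, equally sized grayscale frames.

    A reflection makes a pixel brighter than it should be, so taking the
    darkest value seen at each position suppresses reflections as long as
    they do not hit the same spot in every frame.
    """
    if not frames:
        raise ValueError("need at least one frame")
    height, width = len(frames[0]), len(frames[0][0])
    for frame in frames:
        if len(frame) != height or any(len(row) != width for row in frame):
            raise ValueError("frames must share the same size")
    return [
        [min(frame[y][x] for frame in frames) for x in range(width)]
        for y in range(height)
    ]


# Two tiny hypothetical frames (0 = black, 255 = saturated white).
# Each one has a reflection (255) in a different place.
frame_a = [[40, 255, 255],
           [200, 35, 210]]
frame_b = [[42, 38, 205],
           [255, 255, 208]]

print(fuse_minimum([frame_a, frame_b]))  # [[40, 38, 205], [200, 35, 208]]
```

Keep in mind that the minimum also picks up every dark artifact, such as shadows or misaligned text edges, so it only works this nicely when the alignment is good.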

Of course the given examples are quite simplified; nevertheless, the concepts remain the same. As you can see, MFI can remove annoying reflections and improve the certainty of your recognition result. However, a lot of components must work well together to obtain results better than single frame approaches.


This should give you a brief insight into the possibilities of Multiple Frame Integration in the field of OCR. Text detection and recognition, especially in video imagery, is still a hot topic in current research, and a lot of awesome variations are published every year.


If you have questions, suggestions or feedback on this, please don’t hesitate to reach out to us via Facebook, Twitter or simply via [email protected]! Cheers!