MONTHLY COMPUTER VISION MEETUP ROUNDUP #5
Last month we hosted the 5th edition of our Monthly Computer Vision Meetup! It was the first time we had an external sponsor – big thanks to ViewAR, who kept the audience refreshed throughout the event! :) There were four talks on the agenda, so after everybody had arrived and grabbed a drink, ViewAR started with a short intro talk on their field of expertise.
Intro Talk by Markus Meixner from ViewAR
ViewAR is a Viennese augmented reality company that has established its expertise in AR business solutions. They specialize in product visualization and provide 3D visualization apps that help push sales and speed up the decision-making process of end clients. Objects can quickly be adapted in the app to the needs of the client, and businesses can use the apps to create virtual scenarios for their clients.
One use case is with Lufthansa, where workers can scan cargo loads to find unused space and decide how many additional items will fit.
The company is planning to roll out an SDK for developers soon, which will provide different templates for certain industries. This way developers can brand each scenario the way they want.
Understanding our surroundings: The missing key by Yacine Alami from Wikitude
Yacine was kind enough to sum up his whole talk for us in his own words: “Over the past few years, we have accomplished a lot in the field of AR. The concept of SLAM was a tremendous advance in the technology, and we can already say that 2D tracking is the stone age of AR. 3D tracking is the future: seamless tracking in an unknown environment (no targets) that does not depend on any initialization routine. Some of us fellow scientists have already achieved this, but it’s not yet out for the public; it’s mostly research. But the biggest challenge is yet to come. We want AR to be as integrated as possible into the user’s field of view, we want to have the perfect augmentation and to have a synergy between the real world and the virtual world.
Thus, the missing key is how we can understand our surroundings.
The SLAM concept (Simultaneous Localization and Mapping) is already a good starting point to be able to understand our surroundings since it helps us in two ways: Localizing the user and mapping the environment at the same time.
This approach usually yields a sparse point cloud that could help to reconstruct the environment that the camera sees. However, since it’s only a vision system, it has flaws like drift and error propagation that need to be corrected.
A possible solution is using sensor fusion; fusing data from different sensors to get a better overall result. It comes on top of the SLAM problem and helps in the localization of the user by giving a more accurate 6 DoF pose estimation of the camera.
To improve the mapping part we can use some depth sensors (like Kinect, or Google Tango) to get a better quality of the point cloud generated (better feature depth estimation).
Thereby, understanding our surroundings is only a few steps away. We need to be able to use what is around us to do AR, so we are not bound to a special place like a desk or a target.
Nowadays, only a few mobile devices are able to address those problems in an efficient way.
The Google Tango, with its RGB camera, depth sensor and IMU, is a powerful tablet able to solve the SLAM problem by doing sensor fusion and mapping the environment to use it in an AR experience.
The Microsoft HoloLens, which was just shipped to developers at the end of March, is a head-mounted display opening a lot of possibilities in what AR should be. By superimposing virtual objects onto real objects in a seamless way, it’s a great leap forward into the AR world. In a nutshell, the AR field still has a long way to go to be what we were dreaming of.” (Yacine Alami)
Peter started his talk by explaining the basics and a little history of the Microsoft Xbox Kinect. The Kinect is a motion-sensing input device for the Xbox and Windows computers, and applications for it can be written in C#, C++/CLI or VB.NET. The current Kinect can track 25 body joints per person for up to 6 people at a time.
Peter continued by explaining how depth sensing works in the first version of the Kinect: the IR emitter (E) projects a known infrared light pattern (“coded light”) into the scene, while the IR sensor (S) captures the light reflected from the environment. The depth map is then computed by triangulation using this light coding. Coded light, or structured light, means that the emitter projects a pseudo-random pattern of light (speckles) in which each “point” is unique to its position and can be recognized in the pattern, which makes it possible to compute depth through triangulation. Triangulation is used in stereo vision to reconstruct a point P in 3D space given two or more images; the positions, orientations and optical properties of the image sources must be known. The problem is that triangulation requires two or more images, but the Kinect has only one depth sensor. The trick is that the Kinect actually does use two images: the one captured by the sensor and a stored reference image of the pattern, against which the captured image is correlated.
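To make the correlation and triangulation idea a bit more concrete, here is a minimal Python sketch. It is not the Kinect's actual algorithm: it matches a single 1D row against a reference row using a sum of absolute differences, and then applies the plain rectified-stereo relation depth = focal length × baseline / disparity. The function names, the focal length of 580 px and the baseline of 0.075 m are illustrative assumptions, not values from the talk.

```python
import random


def find_shift(reference, observed, start, window, max_shift):
    """Find how far the observed window is shifted relative to the reference row.

    Uses the sum of absolute differences (SAD) as a simple matching cost.
    """
    patch = observed[start:start + window]
    best_shift, best_cost = 0, float("inf")
    # Never look further left than the beginning of the reference row.
    for shift in range(min(max_shift, start) + 1):
        candidate = reference[start - shift:start - shift + window]
        cost = sum(abs(a - b) for a, b in zip(patch, candidate))
        if cost < best_cost:
            best_shift, best_cost = shift, cost
    return best_shift


def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Depth in metres from the textbook relation Z = f * b / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_length_px * baseline_m / disparity_px


# Hypothetical pseudo-random speckle row; every window is effectively unique.
rng = random.Random(0)
reference_row = [rng.random() for _ in range(128)]

# Simulate the sensor seeing the same pattern shifted by 29 pixels.
true_shift = 29
observed_row = [0.0] * true_shift + reference_row[:-true_shift]

shift = find_shift(reference_row, observed_row, start=60, window=9, max_shift=40)
depth_m = depth_from_disparity(shift, focal_length_px=580.0, baseline_m=0.075)
print(shift, round(depth_m, 2))
```

Because every speckle window is unique, only the correct shift gives a matching cost of zero. A real sensor would do this in 2D, with sub-pixel precision, and relative to a calibrated reference plane.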
The second generation of the Kinect uses “Time of Flight” to measure depth, that is, an indirect measurement of the time it takes laser pulses to travel to a surface and back to the sensor. With Time of Flight, the sensor and the illuminator both alternate between an “OFF” and an “ON” state at a very high frequency (>1GHz). Light pulses from the illuminator are reflected by the environment back into the sensor, but the longer a pulse takes, the more likely it is to reach the sensor not within the same cycle but within one of the following cycles. The ratio of the total amount of rejected light to the total amount of absorbed light therefore carries information about the round-trip distance, which is used to measure depth.
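The ratio idea can be sketched with a simplified textbook model of pulsed time of flight. This is an assumption for illustration, not the Kinect's actual processing: a pulse of length `pulse_s` is emitted, one gate collects light while the illuminator is on, a second gate collects light directly afterwards, and the share of light that lands in the second gate is proportional to the round-trip delay. All function names and numbers are hypothetical.

```python
SPEED_OF_LIGHT = 299_792_458.0  # metres per second


def distance_from_round_trip(round_trip_s):
    """Distance in metres; the light travels to the surface and back."""
    return SPEED_OF_LIGHT * round_trip_s / 2.0


def simulate_gates(distance_m, pulse_s):
    """Return (light inside the 'on' gate, light inside the following gate).

    Assumes an ideal rectangular pulse and a delay shorter than the pulse.
    """
    delay_s = 2.0 * distance_m / SPEED_OF_LIGHT
    if not 0.0 <= delay_s <= pulse_s:
        raise ValueError("distance is outside the unambiguous range")
    late_fraction = delay_s / pulse_s
    return 1.0 - late_fraction, late_fraction


def distance_from_gates(q_on, q_after, pulse_s):
    """Recover the distance from the ratio of the two collected amounts."""
    total = q_on + q_after
    if total <= 0:
        raise ValueError("no light received")
    delay_s = pulse_s * q_after / total
    return distance_from_round_trip(delay_s)


pulse_s = 20e-9  # hypothetical 20 ns pulse
q_on, q_after = simulate_gates(2.0, pulse_s)
print(round(distance_from_gates(q_on, q_after, pulse_s), 3))
```

Using a ratio rather than an absolute amount of light has the advantage that the result does not depend on how reflective the surface is, since a darker surface scales both gates by the same factor.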
The last talk on the agenda was given by Alexander Chernykh on receipt detection and the experience he gained while implementing this app on Android and iOS.
Generally, the problems he faced fall into two classes: frame capture and computer vision. On iOS both are solved by a stable API across all devices and a CIDetector. On Android, however, the camera API is unstable and differs from device to device. Furthermore, the simple code for opening and configuring cameras rots quickly as cameras from more and more manufacturers have to be supported. For the computer vision part the problem is that there is no official library provided by Google, which leaves you with two options – OpenCV and BoofCV.
BoofCV has the advantage that it is a pure Java library, which you can use directly in your Android project. It has a small storage footprint on the device, but as soon as you want to process high-resolution images it consumes a lot of memory and can eventually hit VM heap size limits. OpenCV, on the other hand, is a mixed C++ and Java library, which is relatively large in storage cost but efficient and fast. The Java API is sufficient for simple tasks, but as soon as you want to do something more complex, you will need to use the C++ API and thus also get involved with the Java Native Interface (JNI), which honestly is not really fun :)
Alexander set himself a hard time limit of 30ms for each frame processing call, and he showed us the optimizations he made to stay within this bound. Most computer vision tasks do not need Full-HD resolution; something around 640×480 or less is usually sufficient. This was also the case in Alex’s project. He downsizes each frame to 640×480 and then applies a Gaussian filter to remove noise. He also defined specific detection areas in his Android UI, so that the receipt only fits into a rectangle in the vertical direction. This way he does not have to handle arbitrarily rotated receipts. The edges of the receipts are detected mainly with the help of the Hough transform. Since only vertical receipts have to be detected, he can compute just a small part of the Hough accumulator space and thus save a lot of computation time. Finally, another frequently used trick is that he used precomputed sine and cosine tables for fast look-up.
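The last two optimizations can be illustrated with a small pure-Python sketch. It is not Alexander's code, just one way the idea could look: the accumulator only covers angles close to vertical (here an assumed range of -10° to +10° for the line normal), and the sine and cosine values for those angles are computed once and reused for every edge point. All names and ranges are hypothetical.

```python
import math


def build_trig_tables(theta_degrees):
    """Precompute cos/sin once so the voting loop only does table look-ups."""
    cos_table = [math.cos(math.radians(t)) for t in theta_degrees]
    sin_table = [math.sin(math.radians(t)) for t in theta_degrees]
    return cos_table, sin_table


def hough_restricted(edge_points, cos_table, sin_table):
    """Vote for lines rho = x*cos(theta) + y*sin(theta) over a limited angle range.

    Returns a sparse accumulator mapping (theta index, rho) to a vote count.
    """
    accumulator = {}
    for x, y in edge_points:
        for i in range(len(cos_table)):
            rho = round(x * cos_table[i] + y * sin_table[i])
            key = (i, rho)
            accumulator[key] = accumulator.get(key, 0) + 1
    return accumulator


def strongest_line(accumulator, theta_degrees):
    """Return (theta in degrees, rho, votes) of the accumulator cell with most votes."""
    (i, rho), votes = max(accumulator.items(), key=lambda item: item[1])
    return theta_degrees[i], rho, votes


# Only 21 angles instead of the full 180 degree range.
thetas = list(range(-10, 11))
cos_table, sin_table = build_trig_tables(thetas)

# Hypothetical edge pixels of a vertical receipt border at x = 100.
edge_points = [(100, y) for y in range(200)]

accumulator = hough_restricted(edge_points, cos_table, sin_table)
print(strongest_line(accumulator, thetas))  # (0, 100, 200)
```

With 21 angles instead of 180, the voting loop does roughly a ninth of the work per edge pixel, and the table look-ups avoid calling the trigonometric functions inside the inner loop.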
At the end of his talk, Alexander mentioned current problems, which among others include on-device OCR. We at Anyline know very well that this is not such an easy task, and therefore we wish you good luck and a lot of fun with that, Alexander! ;-)
Hold a talk yourself!
Do you have a project or topic you would like to talk about, or do you know someone who would like to share their experience and knowledge at our Computer Vision Meetup? Please contact us!
Oh – and don’t forget to join our meetup group! ;)
“Bringing together computer vision amateurs and professionals is highly valuable for the community.
Looking forward to the next Meetup.” Yacine Alami
QUESTIONS? LET US KNOW!