Computer Vision Challenge 4: OCR

This is a challenge we’re working on in the Silicon Valley Computer Vision Meetup.  The challenge is to use OCR to read a receipt. Specifically, this receipt:


Receipt for OCR

We’ll be using an OCR engine called Tesseract. To get started with Tesseract:

1. Install Tesseract using the instructions. Be sure to install the appropriate language training data.

2. Download the full-size receipt image.

3. At the command line, enter:

tesseract IMG_2288.jpg out

4. Look at the file “out.txt”.  You should see (among other things) the text:

Red Restaurant and Bar

Congratulations, you’ve got Tesseract up and running!

Along with the text, you’ll see a lot of garbage.  The next step is to tune Tesseract so that it captures all of the text.

Computer Vision Challenge 3: Play Spot-It™

Our new challenge is to write a program that successfully plays the card game “Spot-It”.

The Game

There are several variations on the game, but the basic Spot-It mechanic is this:

  1. Two circular cards are turned over.
  2. Every pair of cards has precisely one symbol in common.
  3. The first player to point out the common symbol wins the round.

Here is a sample pair of Spot-It cards:

Two Spot-It Cards

In this example, the common symbol is a 4-leaf clover.

Suggested Setup

Assume the cards will be laid out side by side, like in the above photo.  Split the input image in half, assuming one card on the left side, and one card on the right.  That way you can use the above photo to develop your algorithm, and then test it with a camera pointed at two real cards.
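Under that assumption, the split is just index arithmetic. A minimal sketch in plain C++ (with OpenCV you’d instead take two cv::Mat ROIs, e.g. img(cv::Rect(0, 0, img.cols / 2, img.rows))):

```cpp
#include <utility>

// Column ranges [begin, end) for the left and right halves of a frame.
struct ColumnRange { int begin; int end; };

inline std::pair<ColumnRange, ColumnRange> splitColumns(int width) {
    int mid = width / 2;
    return { ColumnRange{0, mid}, ColumnRange{mid, width} };
}
```

Each half then goes through your symbol-matching pipeline independently.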

How to Match Symbols

There are a number of different ways to match symbols.

  • Identify and extract features for each card, and then find the areas on each card whose features match features on the other card.
  • Extract the contours for each symbol, compute the moments for each contour, and then find the contours with the closest moments. The OpenCV call matchShapes might come in handy.
  • ?
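For the contour route, matchShapes compares log-scaled Hu moments. Assuming the seven Hu moments for each contour have already been computed (e.g. with cv::HuMoments), a rough sketch of the I1 distance it uses looks like this (a sketch, not OpenCV’s exact code):

```cpp
#include <array>
#include <cmath>

// Sketch of the CONTOURS_MATCH_I1 distance between two shapes, given
// their seven Hu moments. Smaller distance means more similar shapes.
double matchShapesI1(const std::array<double, 7>& a,
                     const std::array<double, 7>& b) {
    double dist = 0.0;
    for (int i = 0; i < 7; ++i) {
        // Skip moments too close to zero, as OpenCV does.
        if (std::abs(a[i]) < 1e-20 || std::abs(b[i]) < 1e-20)
            continue;
        // Log-scale each moment, keeping its sign.
        double ma = (a[i] > 0 ? 1.0 : -1.0) * std::log10(std::abs(a[i]));
        double mb = (b[i] > 0 ? 1.0 : -1.0) * std::log10(std::abs(b[i]));
        dist += std::abs(1.0 / ma - 1.0 / mb);
    }
    return dist;
}
```

To find the common symbol, you’d compute this distance for every (left-card symbol, right-card symbol) pair and take the minimum.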

I’ll be focusing on the feature-based approach here.  I’ll post more here later as I work on my solution.

Update: Several members of the meetup have done some amazing things with this.

Soheil Eizadi has solved the problem for the sample image.  His code is available at:

JJ Stiff has gotten really nice outlines of the images. His code is available at:

Computer Vision Challenge 2: Object Tracking

This challenge is much more open-ended than the augmented reality challenge:

Given a somehow-designated object in a scene, track that object as it moves about the scene.

The object could be designated a number of different ways:

1. The largest object moving in the foreground.

2. The object is a different color than the rest of the scene.

3. Designated with some kind of GUI (e.g. click on the object to track).

4. Your idea here.

Similarly, “tracking” can mean a number of different things:

1. Overlay some kind of marker over the designated object in the scene, and move that marker as the object moves.

2. Move a camera to keep the object centered in the field of view.

3. Move a robot so that it follows the designated object without letting it get too close or too far away.

I suggest you start with either tracking the largest moving object in the foreground, or a uniquely colored object using a marker and a fixed camera.

Computer Vision Challenge 1: Augmented Reality

This is a challenge we’re working through in the Silicon Valley Computer Vision Meetup. Feel free to follow along on your own.

The source and supporting files are available in a GitHub repo.

The first challenge is to implement augmented reality (AR). This means synchronizing the world as seen through a camera and then superimposing computer-generated imagery on top of the real world.  You can see an example in the following photo.  The 3D graphic of a globe is superimposed over a real-world image.  In this case, the position of the globe is based on the position of the black square printed on the piece of paper.  (The white square is for orientation, and will only be needed later. To begin with we will use a target with only a solid black square.) The black square is an example of a fiducial marker.  Marker-based AR uses markers like these (or more sophisticated ones that encode data) to orient the camera with respect to the world.



Like most computer vision projects, this will require a pipeline.  The basic pipeline will do the following:

  • Run video from a camera through OpenCV and out to the screen.
  • Separate the black square from the rest of the image.
  • Find the contours (2d coordinates for the outline of the square).
  • Get the four vertices of the square from the contours.
  • Map the 2d coordinates of the corners to their coordinates in 3d.
  • Draw a 3d object relative to the square’s position in 3d.
  • Use the second smaller square to orient your 3d drawing.

Each of these steps will teach you something about computer vision and OpenCV. If you don’t complete the whole thing, don’t worry. However far you get, you’ll learn something.

Step 0: Install OpenCV

You can’t use OpenCV until you install it.  Unfortunately, OpenCV has a fairly complicated install process due to all the dependencies it includes. Here are some suggestions to make things easier:

Mac: Use your favorite package manager to install OpenCV.  For example, to install with MacPorts:

  • Install MacPorts from the MacPorts website.
  • Update MacPorts by issuing the command “sudo port selfupdate” from the command line.
  • Install OpenCV by entering the command “sudo port install opencv”.  This will take several hours.

Linux: Use your system package manager to install OpenCV.

Windows: No idea.  Try using the Windows installer on the OpenCV home page maybe?

Step 1: Video from Camera to Screen

This step will introduce you to OpenCV’s HighGUI module.  The HighGUI module provides a simple cross-platform way of drawing windows and reading from the camera, among other things.

Rather than walk you through this, I’ve written the code for you. Take a look at step1.cpp on GitHub. The line “cap >> image;” creates an OpenCV Mat object from the image capture device.  The line “cv::imshow(“image”, image);” puts up a window titled “image” and shows the image in it.

Virtually all of the code you write for this challenge will be between those two lines of code.  You capture the image with “cap >> image”, process it somehow, and show the result with  “cv::imshow”.

Important! In addition to calling “cap >> image” and “cv::imshow”, you also need to call “cv::waitKey” to give the system time to process events.  If you don’t call cv::waitKey, the image window may mysteriously stop updating under certain circumstances.

Before you go any further build and run the step1 program.  This will show that you’ve got everything hooked up properly.

Step 2: Separate the black square from the rest of the image.

Note: steps 2 through 6 use the solid square marker.  Click on the link and print it out.  You may want to scale your print so there’s plenty of white border around the black square.

In step 3, we’re going to use the findContours call to find the outline of the shape of the black square.  But findContours wants a single-channel binary image (anything that isn’t 0 is treated as a 1).  So we’ll have to modify the image before findContours does its magic.

First, we’ll need to convert the image from color to grayscale.  This is done with the cvtColor call.  Then we’ll need to threshold the image using either threshold or adaptiveThreshold.
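Per pixel, those two calls boil down to a weighted sum and a comparison. A plain-C++ sketch of the arithmetic (OpenCV’s real implementation is vectorized and uses fixed-point coefficients):

```cpp
#include <cstdint>

// What cvtColor(..., COLOR_BGR2GRAY) does to one pixel:
// a weighted sum of the blue, green, and red channels.
inline uint8_t toGray(uint8_t b, uint8_t g, uint8_t r) {
    return static_cast<uint8_t>(0.114 * b + 0.587 * g + 0.299 * r + 0.5);
}

// What threshold(..., THRESH_BINARY) does to one pixel.
inline uint8_t binarize(uint8_t gray, uint8_t thresh) {
    return gray > thresh ? 255 : 0;
}
```

Since the marker is a black square, THRESH_BINARY_INV (the same comparison, flipped) is probably what you want, so the square ends up as the nonzero region that findContours looks for.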

Step 3: Find the contours.

Now that we have a properly prepared image, we can find the contours.  This involves a single call to findContours.

Step 4: Find the four corners of the square.

findContours returns a vector of contours.  From that, we’ll need to extract the polygon for the square.  One approach to this is to use approxPolyDP to find polygon approximations of the contours. Then search for polygons with only four vertices.  See the squares example in your OpenCV samples directory for details.
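A vertex count of four isn’t quite enough on its own, since tiny noise polygons pass that test too. A simple extra filter is a minimum area, computed with the shoelace formula. A sketch in plain C++ (minArea is a tuning knob you’d pick for your setup):

```cpp
#include <vector>
#include <cmath>
#include <cstddef>

struct Pt { double x, y; };

// Shoelace formula: absolute area of a polygon.
double polygonArea(const std::vector<Pt>& p) {
    double sum = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i) {
        const Pt& a = p[i];
        const Pt& b = p[(i + 1) % p.size()];
        sum += a.x * b.y - b.x * a.y;
    }
    return std::abs(sum) / 2.0;
}

// Keep only polygons that look like our marker:
// exactly four vertices and a reasonable area.
bool isSquareCandidate(const std::vector<Pt>& poly, double minArea) {
    return poly.size() == 4 && polygonArea(poly) >= minArea;
}
```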

Step 5: Map the 2d coordinates to 3d.

For this step, we’ll use the call solvePnP to map the 2d points onto the 3d points in our model of the world.

Note that two of the parameters solvePnP takes are a camera matrix and a vector of distortion coefficients. To compute these, you’ll need to calibrate your camera, then pass in the computed values to solvePnP.

Step 6: Draw something.

Once you have computed the camera and object position with solvePnP, you can use the results to project 3d points onto the image using projectPoints.  Then you can use the OpenCV drawing functions to draw a line, for example, that moves with the image.
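Ignoring lens distortion, the projection step is just the pinhole camera model: divide by depth, then apply the camera matrix. A sketch for a single point already expressed in camera coordinates (projectPoints additionally applies the rotation, translation, and distortion coefficients):

```cpp
struct Point2 { double u, v; };
struct Point3 { double x, y, z; };

// Pinhole projection with intrinsics fx, fy (focal lengths in pixels)
// and cx, cy (principal point, usually near the image center).
Point2 project(const Point3& p, double fx, double fy, double cx, double cy) {
    return { fx * p.x / p.z + cx, fy * p.y / p.z + cy };
}
```

A point straight ahead of the camera lands on the principal point; points to the side shift in proportion to focal length over depth.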

Step 7: Orient your drawing with the small white square.

Up until now, we’ve been choosing the order of the square’s four vertices arbitrarily.  In real-world situations, you probably want to keep track of the marker’s orientation, so you can orient a complex drawing consistently as the marker rotates from the point of view of the camera.

To do this, we’ll switch from the solid square marker to the marker with the small white square.  Print out this marker like you did the solid one.

We start the same way we did before: find the large square.  However, once we’ve done that, we then find the small square.  We’ll designate the vertex of the large square closest to the small square as “first”.
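Finding that vertex is a nearest-point search. A sketch in plain C++ (the corners would come from step 4, the target from the small square’s center); once you have the index, std::rotate can reorder the vertex list so that corner comes first:

```cpp
#include <vector>
#include <cstddef>

struct Corner { double x, y; };

// Index of the large square's vertex closest to the small square's
// center; that vertex becomes "first", fixing the marker orientation.
std::size_t nearestVertex(const std::vector<Corner>& corners, Corner target) {
    std::size_t best = 0;
    double bestD2 = 1e300;
    for (std::size_t i = 0; i < corners.size(); ++i) {
        double dx = corners[i].x - target.x;
        double dy = corners[i].y - target.y;
        double d2 = dx * dx + dy * dy;  // squared distance; no sqrt needed
        if (d2 < bestD2) { bestD2 = d2; best = i; }
    }
    return best;
}
```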

That’s all there is to it!


This exercise was inspired in part by chapter 2 of Mastering OpenCV with Practical Computer Vision Projects from Packt Publishing. That chapter walks through building a more advanced AR pipeline.  The chapter is also available as an article on the Packt website.  If you’re not developing for iPhone, you can skip down to the section titled “Marker Detection”, after which point the code is all platform-independent C++.

There is also a full marker-based AR library built on top of OpenCV called ArUco that does almost all of the work for you.  I used ArUco to build my iPhone AR demo.

Streaming Video for Pebble

Back in 2012, I participated in the Kickstarter for Pebble, a smart watch that talks to your smart phone via Bluetooth. I was looking forward to writing apps for it. Unfortunately, my first Pebble had a display problem and by the time I got around to getting it exchanged, all the easy watch apps had been written.

I racked my brain for an application that hadn’t already been written. Then it hit me — streaming video! I could take a movie, dither it, and send it over Bluetooth from my iPhone to the Pebble. The only problem was: how would I get the video source?

Then I remembered, “Duh, I just wrote an app for that.” CVFunhouse was ideal for my purposes, since it converts video frames into easier-to-handle OpenCV image types, and then back to UIImages for display. All I had to do was process the incoming video into an image suitable for Pebble display, and then ship it across Bluetooth to the Pebble.

My first iteration just tried to send a buffer of data the size of the screen to the Pebble, and then have the Pebble copy the data to the screen. This failed fairly spectacularly. The hard part about debugging on the Pebble is that there’s no feedback. You build your app, copy it to the watch, and then run it. It either works or it doesn’t. (Internally, your code may receive an error code. But unless you do something to display it, you’ll never know about it.) Also, if your Pebble app crashes several times in rapid succession, it goes into “safe mode” and forces you to reinstall the Pebble OS from scratch. I had to do this several times during this process.

Eventually, I wrote a simple binary display routine, and lo and behold, I was getting errors. APP_MSG_BUFFER_OVERFLOW errors, to be exact, even though my buffer should have been more than sufficiently large to handle the data the watch was receiving. I discovered that there is a maximum allowed value for Bluetooth receive buffer size on Pebble, and if you exceed it, you’ll either get an error, or crash the watch entirely. I wanted to send 3360 bytes of data to the Pebble. I discovered empirically that the most I could send in one packet was 116 bytes. (AFAIK, this is still not documented anywhere.) Once I realized this, I was able to send image data to the Pebble in fairly short order, albeit only 5 scan lines at a time.
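The workaround amounts to chunking: cap each packet at the 116-byte limit, rounded down to whole 20-byte scan lines (168 lines × 20 bytes = 3360 bytes), which gives 100-byte payloads of 5 lines each. The packet math as a sketch:

```cpp
// Chunk a framebuffer into Bluetooth packets of at most maxPacket
// bytes, rounded down to whole scan lines so no row is ever split.
int packetCount(int totalBytes, int bytesPerLine, int maxPacket) {
    int linesPerPacket = maxPacket / bytesPerLine;  // 116 / 20 = 5 lines
    int payload = linesPerPacket * bytesPerLine;    // 5 * 20 = 100 bytes
    return (totalBytes + payload - 1) / payload;    // ceiling division
}
```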

All that remained was to dither the image on the iPhone side. From back in the monochrome Mac days, I remembered a name: Floyd-Steinberg dithering. I Googled it, and it turns out that the Wikipedia article includes the algorithm, and it’s all of 10 lines of code. Once I coded that, I had streaming video.
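Floyd-Steinberg really is that short: snap each pixel to black or white, then push the quantization error onto the not-yet-visited neighbors with weights 7/16, 3/16, 5/16, and 1/16. A sketch on an 8-bit grayscale buffer:

```cpp
#include <vector>

// Floyd-Steinberg dithering on a w-by-h grayscale buffer (values 0-255).
// Each pixel snaps to 0 or 255; its error diffuses to later neighbors.
void ditherFS(std::vector<int>& px, int w, int h) {
    auto spread = [&](int x, int y, int err, int num) {
        if (x >= 0 && x < w && y < h)
            px[y * w + x] += err * num / 16;
    };
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            int old = px[y * w + x];
            int neu = old < 128 ? 0 : 255;
            px[y * w + x] = neu;
            int err = old - neu;
            spread(x + 1, y,     err, 7);
            spread(x - 1, y + 1, err, 3);
            spread(x,     y + 1, err, 5);
            spread(x + 1, y + 1, err, 1);
        }
    }
}
```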

Unfortunately, the video only streamed at around 1 FPS on an iPhone 5. How I got it streaming faster is a tale for another day.

CVFunhouse, an iOS Framework for OpenCV

Ever since I took the free online Stanford AI class in fall of 2011, I’ve been fascinated by artificial intelligence, and in particular computer vision.

I’ve spent the past year and a half teaching myself computer vision, and in particular the open source computer vision library OpenCV. OpenCV is a cross-platform library that encapsulates a wide range of computer vision techniques, ranging from simple edge detection, all the way up to 3D scene reconstruction.

But developing primarily for iOS, there was an impedance mismatch. iOS deals with things like UIImages, CGImages and CVImageBuffers. OpenCV deals with things like IplImages and cv::Mats.

So I wrote a framework that takes care of all the iOS stuff, so you can focus on the computer vision stuff.

I call it CVFunhouse. (With apologies to Robert Smigel.)

As an app, CVFunhouse displays a number of different applications of computer vision. Behind the scenes, the framework is taking care of a lot of the work, so you can focus on the vision stuff.

To use CVFunhouse, you create a subclass of CVFImageProcessor. You override a single method, “processIplImage:” (or “processMat:” if you’re working in C++). This method will get called once for every frame of video the camera receives. Your method processes the video frame however you like, and outputs the processed image via a callback to imageReady: (or matReady: for C++).

The callback is important, because you’re getting the video frames on the camera thread, but you probably want to use the image in the main UI thread. The imageReady: and matReady: methods take care of getting you a UIImage on the main thread, and also take care of disposing of the pixels when you’re done with them, so you don’t leak image buffers. And you really don’t want to leak image buffers in an app that’s processing about 30 of them per second!

CVFunhouse is dead easy to use. The source is on GitHub. To get started, just run:

git clone

from the command line. Then open the project in Xcode, build and run.

I’ve now built numerous apps on top of CVFunhouse. It’s the framework I use in my day-to-day work, so it’s constantly getting improved. I hope you enjoy it too.

Your iPhone’s Seven Senses

Humans have five senses. Your iPhone has seven:

  • Touchscreen
  • Camera
  • Microphone
  • GPS (augmented by cell tower and WiFi location)
  • Accelerometer
  • Gyroscope
  • Magnetometer

(The magnetometer is normally used as a compass. But think for a moment — your iPhone can actually sense magnetic fields. That’s something only a few animals can do.)

Now here’s the sad part:

Most of the time we communicate with our iPhones via only one of those senses — touch. Virtually all of our interaction with our iPhones is via touching a screen the size of a business card. We talk with our iPhone like Anne Sullivan talked to Helen Keller.

But the iPhone isn’t blind or deaf. It can see and hear quite well, and it has a better sense of location and direction than most people.

But it’s very rare that apps take advantage of these senses. One of the few that does (other than navigation and photography apps) is the Apple Store app.

Note, I’m not talking about the App Store app, I’m talking about the app you use to purchase Macs and iPhones from Apple. The app that’s normally a friendly front end for the Apple Store website.

But when you run the app while you’re in (or near) an actual physical Apple retail store (like this one in Palo Alto), the Apple Store app gives you a bunch of new options. For example, it knows you’re in an Apple Store, so if you have a Genius Bar appointment there, it automatically checks you in for your appointment, and shows you a picture of the Genius who will be meeting you.

But the coolest thing you can do with the Apple Store app while at an actual Apple Store is self-checkout. You don’t need to find somebody in a blue shirt to help you with your purchase. Instead, you can just grab an item off the shelf, point your iPhone’s camera at its barcode, and enter your iTunes password. Your item is charged to the credit card associated with your iTunes account, and you’re free to walk out the door with it. It’s freaky weird the first time you do it, but also way cool.

And all this is done using just two of the iPhone’s senses — GPS and camera.

Imagine what you could do with all seven!

Shuttle Launch

So, it’s been a few months since I went to see the shuttle launch.

It was okay.

I guess you could sum up my feelings with that Peggy Lee song “Is That All There Is?” The reason that I went to see the shuttle launch is because of the essay Penn Jillette wrote about it in Penn and Teller’s “How to Play in Traffic”.

“It’s 3.7 miles away, and you’re looking at this flame and the flame is far away and it’s brighter than watching an arc welder from across a room[….] The fluffy smoke clouds of the angels of exploration spill out of your field of vision. They spill out of your peripheral vision.”

“You don’t exactly hear it at first, it almost knocks you over. It’s the loudest most wonderful sound you’ve ever heard. […] You can’t really hear it. It’s too loud to hear. It’s wonderful deep and low. It’s the bottom.”

“This is a real explosion and it’s controlled and it’s doing nothing but good and it makes your unbuttoned shirt flap around your arms. It’s beyond sound, it’s wind. It’s a man-made hurricane.”

The key point there being, “3.7 miles away”. In the VIP section. I was closer to 7 miles away, along the NASA Causeway, in the closest section open to the general public. From there, the Shuttle is a tiny speck without binoculars, and the sound of the launch, when it hits you, is reminiscent of the sound of distant thunder in the Midwest. And with the low clouds, the whole show was over in a matter of seconds. I could tell you more, but just watch the movie. That’s pretty much what I saw and heard, and I’m nowhere near as good at words as Penn.

Next time, I’m bringing binoculars.