Detecting an Open Freezer – Beating a CNN?

I don’t miss my college dorm room all too much. I met a lot of great guys and can’t complain about my time spent there, but I am much happier now living with my wife. Regardless, our freezer had the annoying habit of unsealing itself, which would let cold air out and potentially spoil our food.

Before fixing it like a normal person, I thought it could be a fun computer vision application. So I built a mount for my Raspberry Pi and camera, pointed it at the fridge, and gathered a 4,000-image dataset. Rather than wait for the freezer to open on its own (I was in college, after all), I used the common TITGD method. For those unfamiliar, this is the “Tyler Intervened to Gather Data” approach: I simply took 2,000 pictures of each freezer state, open and closed. I also made sure to shake my camera mount, change the lighting, and add other image noise.
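As a rough sketch of what that capture loop might look like on the Pi (using the picamera library; the paths, resolution, and timing here are illustrative assumptions, not the originals):

```python
# Hypothetical capture loop for gathering one class of images.
# Paths, resolution, and delay are illustrative assumptions.
from time import sleep

from picamera import PiCamera

camera = PiCamera(resolution=(160, 120))  # width x height
for i in range(2000):
    # Save each frame; repeat with 'data/open/' for the other class.
    camera.capture(f'data/closed/img_{i:04d}.jpg')
    sleep(1)  # time to jostle the mount or change the lighting
```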

From there I built a CNN to classify the images, which took about 15 minutes and resulted in a high-performing model with a 5-fold validation accuracy of 88.67%. However, I wanted to see if I could strip things down and feature engineer something with the same accuracy but fewer parameters. This was the real basis of what I felt made this project interesting.

Project Goal – Can I beat a vanilla CNN?

See what techniques or methods I could use to achieve similar performance to a CNN for image detection, with improvements in parameters, runtime, or both.

Basic Convolutional Neural Network

The basic neural net was fairly simple: just two convolutional layers with max pooling and a softmax output. While I could have refined the net with preprocessing steps, dropout, or a more aggressive convolution count, I wanted to stress test the vanilla network. Since neural nets are dubbed ‘universal function approximators,’ I just threw it the full (120, 160, 3) image, and it performed well with an accuracy of 88.67% and an F1 of 90.23%. My network had 13,880 parameters; the VGG classifier, a well-known CNN, is shown below for comparison.
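For reference, here is a minimal Keras sketch of an architecture like the one described; the filter counts and kernel sizes are my assumptions, so the parameter total will not land on 13,880 exactly:

```python
# Minimal sketch of the vanilla CNN: two conv layers with max pooling
# and a softmax output. Filter counts/kernel sizes are assumptions.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(8, (3, 3), activation="relu", input_shape=(120, 160, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(16, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(2, activation="softmax"),  # open vs. closed
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```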

Failure 1 – Mean Shifting

The first thing I attempted was to mean shift each image. My thought was that by finding the mean image for each class and shifting each sample by it, I could compare the shifted image against the open and closed class means and simply take whichever distance was smaller. This didn’t work for a few reasons, but the biggest seemed to be the ‘shaking’ I had introduced into the dataset, which meant the images did not share the same viewpoint. The result was extremely fuzzy mean images, which led to negligible performance.
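A minimal sketch of that nearest-mean idea in NumPy, where `open_imgs` and `closed_imgs` are assumed to be stacked arrays of training images:

```python
# Nearest-class-mean sketch; open_imgs/closed_imgs are assumed arrays
# of shape (n, 120, 160, 3).
import numpy as np

mean_open = open_imgs.mean(axis=0)
mean_closed = closed_imgs.mean(axis=0)

def classify(img):
    # Assign the class whose mean image is closer in L2 distance.
    d_open = np.linalg.norm(img - mean_open)
    d_closed = np.linalg.norm(img - mean_closed)
    return "open" if d_open < d_closed else "closed"
```

With the fuzzy means described above, the two distances end up nearly identical for most images, which is exactly the negligible performance observed.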

Failure 2 – Sobel Operator

The next thing I looked at was quantifying the strength of vertical gradients in the images. The hypothesis was that the open freezer has more vertical dark lines due to the crack, which could be a defining characteristic of the distribution. However, after using the Sobel operator to measure the strength of Y-gradients, there was again simply too much noise in the images, resulting in close to no predictive power. This was confirmed by an extremely low KL divergence between the two classes’ Y-gradient-strength distributions (a near-zero divergence means the distributions are nearly identical, so the feature cannot separate them).
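A sketch of that test with OpenCV and SciPy; I read the post’s ‘Y-gradients’ as the Sobel response that picks out vertical lines (dx=1), and `open_imgs` / `closed_imgs` are assumed grayscale arrays:

```python
# Measure vertical-edge strength per image, then compare the two
# class distributions with KL divergence.
import cv2
import numpy as np
from scipy.stats import entropy

def vertical_edge_strength(gray):
    # Sobel with dx=1 responds strongly to vertical lines like the crack.
    g = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
    return np.abs(g).mean()

open_scores = np.array([vertical_edge_strength(im) for im in open_imgs])
closed_scores = np.array([vertical_edge_strength(im) for im in closed_imgs])

# Histogram both score distributions on a shared grid; entropy(p, q)
# gives the KL divergence, and a near-zero value means the classes overlap.
bins = np.histogram_bin_edges(np.concatenate([open_scores, closed_scores]), 30)
p, _ = np.histogram(open_scores, bins=bins, density=True)
q, _ = np.histogram(closed_scores, bins=bins, density=True)
kl = entropy(p + 1e-9, q + 1e-9)  # epsilon avoids zero bins
```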

Failure 3 – Fourier Transforms

At this point, I was pretty frustrated. While I was sure I could beat the CNN benchmark by at least some margin, I can’t deny that as far as ease of use was concerned, the simple CNN had me beat. Nevertheless, I pressed on by considering whether the Fourier transforms of the images might yield any results. This followed the line of thinking in failures 1 and 2: if I could find some quantifiable metric for the images, the class distributions would have to differ in a way that yielded predictive power. However, after initial work, the Fourier coefficients were simply too noisy. I had originally hoped that the extra vertical lines would cause a difference in the real or imaginary parts of some of the coefficients, so I applied the Sobel operator before the Fourier transform, which gave me slightly better structure, but still mainly noise.
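A sketch of that combined attempt in NumPy and OpenCV, again with assumed image arrays:

```python
# Edge-filter first, then inspect the 2-D FFT of the response.
import cv2
import numpy as np

def fft_of_edges(gray):
    edges = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
    # Shift the zero-frequency component to the center for inspection.
    return np.fft.fftshift(np.fft.fft2(edges))

# Compare mean coefficients per class; in practice the difference
# was dominated by noise.
mean_open = np.mean([fft_of_edges(im) for im in open_imgs], axis=0)
mean_closed = np.mean([fft_of_edges(im) for im in closed_imgs], axis=0)
coefficient_gap = np.abs(mean_open - mean_closed)
```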

No luck so far…

After all these failures, I realized most of my methods had focused purely on finding a statistical measure with a clear delineation between the classes, without reducing the dimension of the images or doing any further cleanup. Once I started taking subsamples or downsampling the image before looking at statistical characteristics, I got closer to progress. Below is a simple example where I took a skewed downsample and then applied the Sobel operator to specific slices. Because I still felt the key was the additional vertical black line from an open freezer, my ‘skewed downsample’ simply averaged pixels over 4:1 rectangles rather than squares. This yielded a clustering scheme that looked promising, until I hit something big…
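A sketch of that skewed downsample in NumPy; the exact block size is my assumption, but the 4:1 tall-rectangle shape matches the description:

```python
# Average over tall 4:1 rectangles instead of squares, which compresses
# the image while preserving vertical-line structure.
import numpy as np

def skewed_downsample(gray, block_h=8, block_w=2):
    # block_h:block_w = 4:1; crop to a multiple of the block size.
    h = gray.shape[0] - gray.shape[0] % block_h
    w = gray.shape[1] - gray.shape[1] % block_w
    blocks = gray[:h, :w].reshape(h // block_h, block_h,
                                  w // block_w, block_w)
    return blocks.mean(axis=(1, 3))
```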

Breakthrough!

By finally focusing on a small (120, 30) grayscale slice of the image that encapsulated most of the freezer seal, I made some amazing progress. By downsampling in rectangular columns and capturing the median and variance of each column, I formed a much more effective lower-dimensional space. From there I performed PCA down to just two dimensions and plotted the results. You can see for yourself: the finer the partition (lower L score), the better the clustering became! This ultimately meant I could predict the state of the freezer door from a mere 32 values, far fewer than the CNN’s 13,880 parameters!
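A sketch of that pipeline with NumPy and scikit-learn; the 16-bin split is my assumption to make the counts work out (16 medians + 16 variances = 32 features), and `seal_slices` is an assumed list of (120, 30) crops:

```python
# DUH feature pipeline: column-bin the seal slice, keep the median and
# variance of each bin, then project everything to two dimensions.
import numpy as np
from sklearn.decomposition import PCA

def duh_features(seal_slice, n_bins=16):
    # seal_slice: (120, 30) grayscale crop around the freezer seal.
    bins = np.array_split(seal_slice, n_bins, axis=1)
    medians = [np.median(b) for b in bins]
    variances = [np.var(b) for b in bins]
    return np.array(medians + variances)  # 32 values per image

X = np.stack([duh_features(s) for s in seal_slices])
Z = PCA(n_components=2).fit_transform(X)  # the plotted 2-D space
```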

The ‘DUH’ Model

After many attempts, I finally had it: a model that produced a 5-fold validated accuracy of 92.75% using just 32 features derived from a (120, 30) grayscale slice! I decided to name it the Down-sampled Uncertainty Hypothesis model. The assumption of this model is that when we downsample the feature set from our subset of the image, the entropy or uncertainty in the first two dimensions of our space should be minimal if the fridge is closed. Thus, by simply checking that the norm of our two-dimensional vector is less than a specific threshold (0.06 proved optimal in training), we can assume the fridge is indeed closed. I was quite surprised that for this model the precision was better than the recall, which is clear from the graphic of the lower-dimensional space, yet the opposite of the CNN.
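The decision rule itself is then a one-liner; a sketch, with the 0.06 threshold taken from the training result above:

```python
# DUH decision rule: a small norm in the 2-D PCA space means low
# 'uncertainty', which we read as a closed freezer.
import numpy as np

def duh_predict(z, threshold=0.06):
    # z: one image's 2-D PCA projection from the pipeline above.
    return "closed" if np.linalg.norm(z) < threshold else "open"
```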

I will add that, as my roommate pointed out, my model has nothing on two strips of aluminum foil and a buzzer. As a final parting thought, despite the CNN technically being less ‘efficient,’ you could not beat the simplicity and ease with which it fit the data. While the DUH method is smaller and faster, this project reaffirmed to me that neural nets truly are “universal function approximators.”

Final Results

| Model | Accuracy | Recall | Precision | Input | Parameters | Calculation Time |
|---|---|---|---|---|---|---|
| Vanilla CNN | 88.68% | 99.00% | 82.88% | (120, 160, 3) | 13,880 | 3.8e-4 sec/img |
| DUH Method | 92.75% | 87.00% | 98.31% | (120, 30) | 32 | 5.6e-5 sec/img |
| Two strips of aluminum foil + alarm | 100% | 100% | 100% | 2 pieces of aluminum foil | 2(?) | Instant |

Input Size – 16x Improvement

Going from (120, 160, 3) to (120, 30) meant I was able to aggressively throw away extraneous information: 57,600 input values down to 3,600, a 16x reduction.

Parameters – 433x Improvement

Again, I know techniques like dropout and pruning exist precisely to rein in a CNN’s parameters, but I am proud that through optimization I knew exactly how far to push my manual feature-engineering process to get better results.

Runtime – 6.8x Speedup

Even though I am fortunate enough to have a GPU for my CNN, the lightweight feature engineering and simple two-dimensional norm of the DUH method simply cost less compute (3.8e-4 / 5.6e-5 ≈ 6.8).

Bonus Clip – Dimension Reduction

To find the optimal partition size at which to subsample my image, I simply did a brute-force search to see where the model performed best. Clearly the relationship would be positively correlated, with finer partitions leading to better performance, but I wanted to understand where the trade-off between parameters and performance lay. Below is the video of me slowly refining my partition until I got a clear distinction between the two groups of ‘open’ and ‘closed’ images, followed by a sketch of the sweep itself. I personally just think it looks really cool, and I enjoy seeing how the signs of the first two eigenvectors swap a few times in the animation.
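A sketch of that brute-force sweep, reusing the duh_features helper from earlier; the logistic-regression stand-in for scoring (rather than the actual threshold rule) and the `labels` array are my assumptions:

```python
# Sweep the column-bin count and keep the best 5-fold CV accuracy.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

best_acc, best_bins = 0.0, None
for n_bins in (2, 4, 8, 16, 32):
    X = np.stack([duh_features(s, n_bins=n_bins) for s in seal_slices])
    acc = cross_val_score(LogisticRegression(max_iter=1000),
                          X, labels, cv=5).mean()
    if acc > best_acc:
        best_acc, best_bins = acc, n_bins
```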