Image Classification

This is a project I worked on with my classmates Zach Beaver, Erin Boehmer, Nitin Kohli, and Kunal Shah, as the final project for the Scaling Up class that is part of the Master of Data Science program at UC Berkeley. GitHub code can be found here.

TinyTags: Image Tagging and Indexing at Scale


What does the project do

Our project provides a means to automatically tag images with categorical labels and make the image dataset searchable through a front-facing search interface. Specifically, we use a distributed Random Forests machine learning algorithm to automatically tag a large dataset of images using a model built from a tagged subset of images. Our project leverages the data contained in the Tiny Image Dataset, which is a public dataset consisting of almost 80 million 32×32 images, amounting to ~250GB of data. To train the Random Forests classifier, available through Spark’s MLlib, we use the tagged images included in the CIFAR-10 dataset.

Tiny Image Dataset:

CIFAR-10 Dataset:

The CIFAR-10 dataset contains 60,000 32×32 tagged color images, with 6000 images belonging to each of the 10 possible tag categories. Categories include: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. Because the available tags are limited to 10 categories and used to train our classifier, we limit the image search to the 10 possible tag names.

In order to complete this project, we needed to transfer both the CIFAR-10 and Tiny Images datasets to an HDFS in the cloud, deployed via Cloudera. We then used Spark and its associated MLlib to train and test models that would tag the images using inherent features. After finding a suitable model to tag the remaining images in the dataset, we indexed the images and created a Solr search engine with a web UI front-end that allows users to query for images.

What Problem Is Being Solved?

While the problem of text organization and search is one that is fairly well-understood with mature solutions, automated organization and search over images is an area of active and aggressive research; entire labs are dedicated to the endeavor at some of the most well-known industry and academic institutions.  The primary dilemma is that image annotation is typically a tedious, manual, and subjective process.  Current image search engines match user queries to user-annotated photos.  This process is time-consuming and dependent upon the uploader with the shortcoming that two people could describe a photo in completely different ways; this significantly affects search results. In light of these shortcomings, automated image recognition (via machine learning) holds the promise of eliminating subjective, user-dependent annotation and ultimately saving people time with both image annotation and search.

Given that machine learning can eventually deliver on this promise of automated annotation, there must be a Big Data pipeline and ecosystem of tools that can effectively handle automated annotation (even retroactively) of large image corpuses.  The goal of this project is to survey and utilize an appropriate set of Big Data tools necessary to ingest, automatically tag, index, and store large image corpuses.  As such, the value of this project is in creating an infrastructure necessary to support effective auto-annotation of images.  The applications of such a system include cloud-based organization of personal photo collections, standardizing image search criteria, reducing the amount of time needed to push image content via social media, and automating aerial recognition of disaster-related imagery.  Image recognition research has advanced rapidly in recent years, and while wholesale image recognition is not yet mature, we want to develop the skills, strategy, and pipeline necessary to support such advances.

Big Data Tools

In order to complete the assignment, we decided upon a suite of Big Data tools, which includes:

  • Apache Spark and Spark MLlib (deployed via Cloudera Manager)
  • Python (PySolr, Pydoop, Flask, Scikit-Learn)
  • IPython Notebook with PySpark support and YARN connection
  • HDFS with Hue as UI (Deploy via Cloudera Manager)
  • YARN as resource manager
  • Softlayer (Deploy a YARN cluster using Cloudera Manager; host web application)
  • Solr for index and search (Deploy via Cloudera Manager)

Image Recognition with Random Forests in Spark

The primary goal of this project is to automatically create tags (or labels) for images. Creating an accurate classifier is a crucial piece of this project, but also a very time-consuming one to do well. Achieving near state-of-the art performance in image recognition can take months, even for experienced researchers, as demonstrated by recent image recognition competitions on Kaggle. At the beginning of the project, our ambitious goal was to try to automatically tag all types of images in the Tiny Images data set; we eventually narrowed our scope such that (1) we would attempt to only provide tags to images whose labels are present in the pre-tagged CIFAR-10 dataset and (2) we would only tag Tiny Images and make them available for search when the algorithm predicted a tag above a certain accuracy threshold.

Having thus narrowed the scope, our first task was to choose a classifier that was (1) supported by Spark and (2) provided a modicum of accuracy. Training a ML algorithm is an extremely iterative process, marked by trial and error over the course of many experiments.  To facilitate this effort, we decided to separate our development and production environments.  We made algorithm decisions as a team (in parallel through a common GitHub repository) by using Scikit-Learn on our local machines; this was our development environment before transferring the final algorithm decisions to Spark.  We chose this ‘local dev environment’ approach for a few reasons:

  1. It allowed us to begin developing the algorithm while the Spark environment was being readied by another team member.
  2. The labeled training data is actually small (50,000 images), which can be further divided and does not necessitate distributed computing.  The real need for distributed computing comes with trying to automatically tag the 80 million Tiny Images data set.
  3. In development, iteration time is key.  Working with a library we were familiar with (Scikit-Learn) allowed us test multiple algorithms rapidly.
  4. We could work in parallel without worrying about resource constraints of running multiple Spark jobs at once.

After local development, we translated the hyper-parameters (with some tweaks due to memory limitations and hyper-parameter support in Spark) to Spark’s MLlib for implementation and trained on the entire CIFAR-10 data set (see Appendix “Local Development in Scikit-Learn).

We then completed the final code development in two phases

  1. Learning a model:

We imported train data (CIFAR-10) into Spark as RDDs. After the import, we run the RandomForestClassifier training step on the RDD. We trained it on 10 trees in the forest.

  1. Predicting using the model:
    1. The entire dataset was a ~250GB bin file. And even after splitting the file into 80 smaller bin files to transfer into HDFS, each individual bin file was still too big for the driver memory to process both transformation into two-dimensional array and RDD conversion. Therefore, we further split each bin file into 50 RDDs, which were much smaller. This solved our input data problem.
    2. We wanted to predict the accuracy of our results as well. i.e. how many of the trained trees agreed on the output label. This was difficult in Spark, especially in PySpark, as there is no currently no library supporting accuracy output. Therefore, we have to create a function for outputting accuracy.
    3. We want to only keep output with certain levels of accuracy. Therefore, from the local ML development, we determined that 40% is a good cutoff threshold for accuracy. When predicting on the test dataset, we chose not to write out an output if the accuracy falls below 40% (i.e. less than 40% of the trees in the Random Forest “voted” for the particular class). This reduces the amount of “contentious,” and therefore likely inaccurate, data we have to write out.
    4. We still had to choose a good output format for our results that would make it easily indexable. (NOTE: We could have sent the content of an RDD directly to Solr, but this would not be a good approach in case the process fails. Therefore, we separated indexing from output production). We decided to output a JSON dictionary with the RDD block number, feature vector, label, and the accuracy.

After these steps, we have an indexable JSON output file for each input RDD file.

Indexing the Processed Data Set

Once the RDDs have gone through the algorithm, it outputs a tuple of four fields for each image: the tag, which ranges from 1 to 10, accuracy, which ranges from 4 to 10, the RDD block number, which ranges from 0 to 3999, and the raw bytes. If the accuracy is less than 4, the image will be discarded and will not be written out. We then sent the data to Solr for indexing. We first created a new schema for the raw data as we don’t want it to be indexed. A new schema called “data” is created as a field that’s not indexed but is stored. Aside from the raw data that is stored as a non-indexed JSON field, we indexed three total fields: the tag, the accuracy, and the number of the RDD it comes from. The last one is mainly for if indexing runs into any errors. In such cases, we can tell which RDD it stops at, delete all the data in Solr from that RDD and restart from that RDD. Otherwise, because Solr randomly assigns index ID, if indexing runs into any error, restarting without knowing the stop point will be extremely difficult.

Because of the large amount of data that needs to be indexed, we ended up provisioning an additional node with a 750GB secondary disk. Non-indexed fields require a constant amount of drive space, but indexed fields require about 2x as much space on the hard drive during Solr processing. Therefore, a large amount of disk space was needed to process the image JSON. The Solr node was not provisioned as part of the cluster as it does not use the services from Cloudera, but rather communicates only with the node from the cluster that is sending the processed JSON image data.

Front End UI Landing Page


Demonstrating Search Capabilities


Showing Searched Results




%d bloggers like this: