Introduction:

Recognizing characters and digits in documents such as photographs captured at street level is an important task in modern-day map making; for example, detecting an address automatically and accurately from street view images of a building. Using this information, more precise maps can be built, and navigation services can be improved. Although classifying characters in printed documents is largely a solved problem in computer vision, recognizing digits or characters in natural scenes such as photographs is still much harder. The difficulty stems from non-contrasting backgrounds, low resolution, blurred images, font variation, lighting, etc.


Traditional approaches to classifying characters and digits in natural images were separated into two stages: first segmenting the image to extract isolated characters, then performing recognition on the extracted images. This can be done using hand-crafted features and template matching [1].

The main purpose of this project is to recognize street view house numbers using a deep convolutional neural network. For this work, I considered a digit classification dataset of house numbers extracted from street-level images [5]. This dataset is similar in flavor to the MNIST dataset but with more labeled data: it has more than 600,000 digit images, which contain color information and varied natural backgrounds and were collected from Google Street View images [5]. To achieve this goal, I built an application that detects the numbers from image pixels alone. A convolutional neural network with multiple layers is trained on the dataset to detect house numbers with high accuracy. I used a traditional convolutional architecture with different pooling methods and multistage features and finally obtained 92.9% accuracy.

Related Work:

Street view number detection falls into the category of natural scene text recognition, which is quite different from printed or handwritten character recognition. Research in this field started in the 1990s, but it is still considered an unsolved problem. As mentioned earlier, its difficulties are due to font variation, scale, rotation, low light, etc.

In earlier years, this problem was handled sequentially: character classification by a sliding window or connected components was mainly used [4]. Word prediction could then be done by applying the character classifier in a left-to-right manner. More recently, segmentation methods guided by a supervised classifier have been used, where words can be recognized through a sequential beam search [4]. But none of these approaches fully solves the street view recognition problem.

In recent work, convolutional neural networks have proven their capability to solve object recognition tasks more accurately [4]. Some research has applied CNNs to scene text recognition tasks [4]; those studies showed that CNNs can represent all types of character variation found in natural scenes and remain robust to this high variability. Work on convolutional neural networks started in the early 1980s, and they were successfully applied to handwritten digit recognition in the 1990s [4]. Since then, the increasing availability of computing resources, larger training sets, advanced algorithms, and dropout training have led to many successes with deep convolutional neural networks [3].

Previously, CNNs were used mainly to detect a single object in an input image, and it was quite difficult to isolate each character in an image and identify it. Goodfellow et al. [4] solved this problem by using a large, deep CNN to model the whole image directly, with a simple graphical model as the top inference layer.

The rest of the paper is organized as follows: Section III describes the convolutional neural network architecture; Section IV covers the experiment, results, and discussion; and Section V presents future work and the conclusion.

 

CNN Architecture:

Convolutional Neural Networks (CNNs) can handle complex, high-dimensional data and are structurally similar to ordinary neural networks: they consist of neurons that carry weights and biases. Each neuron applies a forward function to its inputs, and the convolutional structure greatly reduces the number of parameters in the network [16]. Generally, a CNN consists of several layers [1]. In the first layer, the input is convolved with a set of filters to produce the feature maps. Then, to reduce the spatial resolution of the feature maps, each convolution layer is followed by a sub-sampling (pooling) layer. Alternating convolutional and sub-sampling layers form a feature extractor that retrieves discriminative features from the raw images. These layers are followed by fully connected layers (FCLs) and the output layer; the output of each layer serves as the input to the following layer. CNN architectures may differ for different problems.
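The shrinking of the feature maps described above follows standard convolution arithmetic. A minimal sketch (the helper name is mine, not from any library):

```python
def conv_out_size(in_size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution (or pooling) layer:
    out = (in - kernel + 2*padding) // stride + 1."""
    return (in_size - kernel + 2 * padding) // stride + 1

# A 32x32 input convolved with 5x5 filters (stride 1, no padding):
conv_out_size(32, 5)     # -> 28
# followed by 2x2 pooling with stride 2:
conv_out_size(28, 2, 2)  # -> 14
```

The same formula applies to both convolution and pooling layers; only the kernel size and stride differ.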

 

Experiment, Result and Discussion:

Data

The main objective of my project is detecting and identifying house-number signs in street view images. The dataset I considered is the Street View House Numbers (SVHN) dataset, taken from (* dataset link), which is similar to the MNIST dataset. It has more than 600,000 labeled characters, and the images are in .png format. After downloading and extracting the dataset, I cropped the images; all images are now 32×32 pixels with three color channels. There are 10 classes, one for each digit: digit '1' is labeled as 1, '9' is labeled as 9, and '0' is labeled as 10 (http://ufldl.stanford.edu/housenumbers/). The dataset is divided into three subsets: a train set, a test set, and an extra set. The extra set is the largest subset, containing 531,131 relatively easy samples; the train set has 73,252 digits and the test set has 26,032. Fig. 2 shows examples of the original, variable-resolution, colored house-number images with character-level bounding boxes (http://ufldl.stanford.edu/housenumbers/).
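Because '0' is stored as class 10, labels need one small remapping before use. A sketch of this convention (the function name is mine):

```python
def svhn_label_to_digit(label):
    """Map an SVHN class label (1..10) to the printed digit (0..9).

    In SVHN, digits '1'..'9' are labeled 1..9 and '0' is labeled 10,
    so only label 10 needs remapping.
    """
    return 0 if label == 10 else label

svhn_label_to_digit(10)  # -> 0
svhn_label_to_digit(7)   # -> 7
```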

 

The bounding box information is stored in digitStruct.mat instead of being drawn directly on the images in the dataset. Each tar.gz file contains the original images in .png format together with a digitStruct.mat file, which can be loaded using Matlab. The digitStruct.mat file contains a struct called digitStruct with the same length as the number of original images. Each element in digitStruct has the following fields: name, a string containing the filename of the corresponding image, and bbox, a struct array that contains the position, size, and label of each digit bounding box in the image. For example, digitStruct(300).bbox(2).height gives the height of the 2nd digit bounding box in the 300th image.

 

The figure above clearly shows that most house numbers in the SVHN dataset are printed signs, which are easier to read. However, there is large variation in font, size, and color, which makes detection quite difficult. The variation in resolution is also large (median: 28 pixels; max: 403 pixels; min: 9 pixels). The following figure shows the large variation in character height, as measured by the height of the bounding box in the original street view data. This indicates that character size, placement, and resolution are not uniform across the dataset, which increases the difficulty of correct house number detection.

Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, Andrew Y. Ng. Reading Digits in Natural Images with Unsupervised Feature Learning. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

Experiment

In my experiment, I trained a multilayer CNN for street view house number recognition and checked its accuracy on the test data. The code is written in Python using TensorFlow, a powerful library for implementing and training deep neural networks. The central unit of data in TensorFlow is the tensor: a set of primitive values shaped into an array of any number of dimensions. A tensor's rank is its number of dimensions [20]. Along with TensorFlow, I used other libraries such as NumPy, Matplotlib, and SciPy.
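To illustrate rank as the number of dimensions, here is a tiny sketch using nested Python lists as a stand-in for tensors (the helper is mine, not part of TensorFlow):

```python
def rank(t):
    """Rank = number of dimensions, i.e. the nesting depth of the value."""
    r = 0
    while isinstance(t, list):
        r += 1
        t = t[0]
    return r

rank(3.0)                   # scalar -> rank 0
rank([1.0, 2.0, 3.0])       # vector -> rank 1
rank([[1.0, 2.0], [3.0, 4.0]])  # matrix -> rank 2
```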

First, because of limited computing resources, I performed my analysis using only the train and test sets, omitting the 2.7 GB extra set. Second, to simplify the analysis, I found and deleted all data points with more than 5 digits in the image. For the implementation, I randomly shuffled the valid dataset and used the pickle file svhn_multi, which I created by preprocessing the data from the original SVHN dataset. I then used this pickle file to train a 7-layer convolutional neural network. Finally, I used the test data to check the accuracy of the trained model in detecting numbers from street house number images.

In the first convolution layer I used 16 feature maps with 5×5 filters, producing 28×28×16 output. ReLU layers are added after each layer to introduce non-linearity into the decision-making process. After the first sub-sampling, the output size decreases to 14×14×16. The second convolution layer has 32 feature maps with 5×5 filters and produces 10×10×32 output; sub-sampling is then applied a second time, shrinking the output to 5×5×32. Finally, the third convolution has 2048 feature maps with the same filter size. Note that the stride size is 1 throughout my experiment, and zero padding is also used. I applied the dropout technique to reduce overfitting. The last layer is a softmax regression layer. Weights are initialized randomly using Xavier initialization, which keeps the weights in the right range by automatically scaling the initialization based on the number of input and output neurons. I then trained the network and logged the training accuracy, loss, and validation accuracy every 500 steps.
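The Xavier initialization mentioned above can be sketched as follows. This is the Glorot-uniform variant, which samples weights uniformly within a limit derived from the fan-in and fan-out; the function name is my own:

```python
import math
import random

def xavier_uniform(fan_in, fan_out):
    """Xavier/Glorot uniform initialization: draw each weight from
    U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)), so the
    variance of activations stays roughly constant across layers."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[random.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]

# Weight matrix for a layer with 800 inputs and 2048 outputs:
w = xavier_uniform(800, 2048)
```

TensorFlow provides this scheme built in (the `xavier_initializer` in `tf.contrib.layers` at the time, `GlorotUniform` in later versions), so in practice no hand-rolled version is needed.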

Initially, I used a static learning rate of 0.01 but later switched to an exponentially decaying learning rate with an initial rate of 0.05, decaying every 10,000 steps with a base of 0.95. To minimize the loss, I used the Adagrad optimizer. When I reached a satisfactory accuracy level on the test dataset, I stopped the learning and saved the parameters in the cnn_multi checkpoint file. When detection needs to be performed later, the checkpoint can be loaded without training the model again.
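The decay schedule above follows the standard exponential decay formula, lr = lr0 · rate^(step / decay_steps). A small sketch with the stated values (the function name is mine; TensorFlow's `tf.train.exponential_decay` computes the same quantity):

```python
def exp_decay_lr(step, lr0=0.05, decay_rate=0.95, decay_steps=10000):
    """Exponentially decaying learning rate (continuous, non-staircase):
    lr0 * decay_rate ** (step / decay_steps)."""
    return lr0 * decay_rate ** (step / decay_steps)

exp_decay_lr(0)      # -> 0.05
exp_decay_lr(10000)  # -> 0.0475  (one full decay: 0.05 * 0.95)
```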

Initially, the model produced an accuracy of 89% with just 15,000 steps. That is a good starting point, and after a few hours of training the accuracy would certainly reach my benchmark of 90%. However, I added some simple improvements to further increase the accuracy within a small number of learning steps. First, I added a dropout layer after the third convolution layer, just before the fully connected layer; this makes the network more robust and prevents overfitting. Second, I introduced exponential decay to the learning rate instead of keeping it constant; this helps the network take bigger steps at first, so that it learns fast, and then take smaller, noisier steps as it moves closer to the global minimum. With these changes, the model is able to produce an accuracy of 92.9% on the test set with 15,000 steps. Since the training set is large and there are about 13,068 images in the test set, there is a chance of further improvement if the model is trained for a longer duration. We run the model for a few thousand steps, checking the mini-batch accuracy and validation accuracy every 500 steps, and once training is complete we check the test set accuracy.
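The dropout layer used above can be sketched as follows. This is the inverted-dropout formulation, where survivors are scaled at training time so that nothing needs rescaling at test time; the function name is mine:

```python
import random

def dropout(x, keep_prob, train=True):
    """Inverted dropout: at training time, zero each unit with
    probability (1 - keep_prob) and scale survivors by 1/keep_prob so
    the expected activation is unchanged; at test time, pass through."""
    if not train or keep_prob >= 1.0:
        return list(x)
    return [v / keep_prob if random.random() < keep_prob else 0.0
            for v in x]

# With keep_prob=0.5, each surviving activation is doubled:
out = dropout([1.0] * 1000, 0.5)
```

TensorFlow's `tf.nn.dropout` implements the same inverted scheme, which is why the layer can simply be bypassed (or given keep_prob = 1.0) at detection time.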

Result and discussion

Finally, my model reached an accuracy of 92.9% on the test set, which is satisfactory to me. As mentioned in the earlier section, the parameters saved in a checkpoint file can be restored later to continue training or to detect new images. By using dropout, I ensured that the model is robust and performs well on most data. The model was tested over a wide range of inputs from the test dataset and generalizes well.

It can be seen from Fig. 4 that the model predicts correctly for most of the images. However, it still gives incorrect output when the images are blurry or otherwise noisy. Due to time limitations, I trained the model for a relatively short duration; I believe there is room to increase the accuracy further. Also, with better hardware and a GPU, the model could be trained and run faster.

In my experiment, I proposed a multi-layer deep convolutional neural network for street view house number recognition. I ran the experiment on more than 600,000 images and obtained almost 93% accuracy. From this accuracy it can clearly be seen that the model produces correct output for most images; however, detection fails if the image is blurry, noisy, etc. One interesting aspect of the project was finding out how well optimization tricks like dropout and exponential learning rate decay perform on real data. One difficult aspect was choosing an appropriate architecture for the problem: since the architecture can be implemented in many ways, it is very difficult to know in advance which architecture will work best for the data. The model implemented here is relatively simple but does the job well and is quite robust; however, it still requires a lot of work to perform as well as or better than a human operator. As future work, I will extend my experiment with different techniques and algorithms and try to find out which one achieves better accuracy with minimum cost and lower loss.