An Introduction to Computer Vision with Keras

Kara Combs
Air Force Institute of Technology

The visual detection and identification of objects is essential to daily life and comes naturally to humans. However, even robust artificial intelligence (AI) agents still struggle to do it quickly and accurately. This article gives a brief introduction to how computer vision models are created. As AI-driven innovations such as drones, self-driving cars, and other autonomous technologies spread, the field of computer vision is gaining importance (IBM, 2023). Computer vision, a key area within artificial intelligence, concerns a computer’s ability to interpret visual cues (IBM, 2023). The accuracy of these interpretations is vital, particularly in high-stakes applications. For example, in the automotive industry, and specifically for companies like Tesla, the precision of these predictions can mean the difference between life and death (Shepardson, 2022). A cornerstone of most computer vision systems is a type of artificial neural network known as a “convolutional” neural network (CNN), which plays a crucial role in processing these visual cues (IBM, 2023).


CNNs & Their Hyperparameters

Convolutional neural networks (CNNs) create feature maps, each of which highlights a different aspect of a picture, and are quite successful at processing images (Meel, 2023). CNNs started with the small five-layer LeNet, which was trained on the Modified National Institute of Standards and Technology (MNIST) dataset (see LeCun et al., 2010), consisting of handwritten digits ranging from zero to nine (LeCun et al., 1998). However, CNNs became prominent over a decade later due to a CNN called AlexNet (see Krizhevsky et al., 2017), which won the 2012 ImageNet Large Scale Visual Recognition Challenge (often abbreviated as “ILSVRC”) (Russakovsky et al., 2015). AlexNet, with five convolutional and three fully connected layers, was the first CNN to win the ILSVRC, which has been held since 2010 (Russakovsky et al., 2015). In subsequent years, the ILSVRC and, more generally, computer vision algorithms have been dominated by CNNs such as VGG-16, ResNet-50 (Residual Network), and PNASNet-5 (Progressive Neural Architecture Search Network) (Russakovsky et al., 2015).

Although CNNs are powerful and produce impressive results, their hyperparameters and parameters are typically fine-tuned for the specific problem at hand. It is important to note the difference between a hyperparameter and a parameter. Hyperparameters are set by the user and influence parameter values, whereas parameters are learned by the algorithm from the data passed through it (Simic, 2023). A CNN has many hyperparameters, such as the number of layers, the layer types, and the number of nodes. Because each hyperparameter affects the overall accuracy of the model, a design of experiments is often needed to determine the best settings. Researchers frequently do not have to create a CNN from scratch and can instead reuse the architecture of a previously trained neural network, which requires only some fine-tuning of the hyperparameters and parameters. This approach is called “transfer” learning since pre-trained “base” knowledge is “transferred” to a new but related knowledge domain (Hosna et al., 2022; Zhuang et al., 2020). Essential to any computer vision model is the data it is trained to classify, since this determines the model’s internal parameters and the resulting outcomes.
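To make the distinction between hyperparameters and parameters concrete, the short sketch below counts the learned parameters of a single convolutional layer from the hyperparameters a user chooses. The function name and the example values (3 x 3 kernels, 32 filters, a three-channel input) are assumptions picked for illustration, not settings prescribed by this article.

```python
def conv2d_param_count(kernel_height, kernel_width, in_channels, filters):
    """Number of learned parameters in a standard 2D convolutional layer."""
    weights_per_filter = kernel_height * kernel_width * in_channels
    # Each filter also learns one bias term.
    return (weights_per_filter + 1) * filters


# Hyperparameters (chosen by the user): 3x3 kernels and 32 filters.
# Input: a three-channel (RGB) image.
first_layer_params = conv2d_param_count(3, 3, in_channels=3, filters=32)
print(first_layer_params)  # 896
```

The kernel size and filter count are picked before training, while the 896 weights and biases they imply are what the optimizer adjusts during training.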

CIFAR-10 Dataset

A common dataset for testing computer vision models is the CIFAR-10 (Canadian Institute For Advanced Research) dataset. Originally part of the much larger 80 million tiny images dataset (see Torralba et al., 2008), the CIFAR-10 dataset consists of 60,000 images, with 50,000 images to be used for training and the remaining 10,000 images for testing. The images’ dimensions are 32 x 32 and they are evenly split between 10 classes (Krizhevsky, 2009). Along with the earlier mentioned MNIST dataset, CIFAR-10 is one of the most widely used computer vision datasets and is often used as an initial starting point. Since each individual image is small, it looks rather pixelated to the human eye when enlarged, which is clear in Figure 4. Since these images are multi-colored, they have three color channels, whereas monochromatic images would have only one color channel. This is an important piece of information when designing an architecture.
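A minimal sketch of loading the dataset through Keras’s built-in `keras.datasets.cifar10.load_data()` is given below. The helper names `scale_pixels` and `load_cifar10` are hypothetical, and rescaling pixel values to the range 0 to 1 is a common preprocessing choice assumed here rather than a step described in this article.

```python
import numpy as np


def scale_pixels(images):
    """Convert uint8 pixel values in [0, 255] to float32 values in [0, 1]."""
    return np.asarray(images).astype("float32") / 255.0


def load_cifar10():
    """Download CIFAR-10 through Keras and return scaled train/test splits."""
    # Imported here so that scale_pixels can be used without TensorFlow.
    from tensorflow import keras

    (x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()
    return (scale_pixels(x_train), y_train), (scale_pixels(x_test), y_test)
```

Calling `load_cifar10()` downloads the data on first use. The training images come back as an array of shape (50000, 32, 32, 3), where the last axis holds the three color channels, and the labels as integers from 0 to 9.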

Figure 4: Sample CIFAR-10 Images

Example Convolutional Neural Network

An example architecture based on Keras’s Convolutional Neural Network Tutorial is shown in Figure 5 (Chollet et al., 2015). Starting on the left, the input layer (the white left-most block in Figure 5) consists of the images themselves, which, in this case, have a height and width of 32 and three color channels. These images are then passed through three convolutional layers (blue) with max-pooling layers (green) in between. The third convolutional layer is followed by two fully connected layers (orange), with the final layer being the classification (or output) layer (purple). A classification layer is a specific type of fully connected layer where the number of nodes corresponds to the number of classes, which in our case is ten. Between the last hidden fully connected layer and the classification layer, there is a dropout layer, which randomly drops a fraction of the nodes (set by the dropout rate) passed from one layer to the next during training. Dropout can be placed at any point in the model to help reduce overfitting of the model to the data. The number of layers and their dimensions are all customizable, though there are some constraints due to the input image size and the number of classes.
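The architecture described above could be sketched in Keras roughly as follows. The filter counts (32, 64, 64), the 3 x 3 kernels, the 64-node dense layer, and the dropout rate of 0.5 are assumptions for illustration; the article itself fixes only the overall layout, the three-channel input, and the ten output classes.

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 10            # CIFAR-10 has ten classes
INPUT_SHAPE = (32, 32, 3)   # height, width, color channels
DROPOUT_RATE = 0.5          # assumed value; the article does not give one


def build_cnn():
    """Three conv layers with max pooling in between, then two dense layers."""
    return keras.Sequential([
        keras.Input(shape=INPUT_SHAPE),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dropout(DROPOUT_RATE),
        # Classification layer: one node per class, no softmax (raw logits).
        layers.Dense(NUM_CLASSES),
    ])


model = build_cnn()
model.summary()
```

The call to `model.summary()` prints each layer’s output shape and parameter count, which is a quick way to check that the chosen dimensions are compatible with the input image size.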

Figure 5: Example CNN Architecture 

After the architecture of the CNN is designed, two more hyperparameters are needed for Keras’s model.compile method: the optimizer and the loss function. The optimizer adjusts the CNN’s weights, which determine how much emphasis is placed on each piece of information passed through the network, to yield better predictions in the final output layer. Most standard problems use the popular Adam optimizer, a stochastic gradient-based optimization algorithm introduced by Kingma and Ba (2017) that is conveniently available as a built-in Keras class.

A popular loss function is the Sparse Categorical Crossentropy class, which is used for categorical outputs (as opposed to numerical ones). A loss function quantifies the difference (if any) between the CNN’s predicted value for an observation and the observation’s ground truth (or label). Additionally, setting the function’s from_logits parameter to True tells the loss that the model’s outputs are raw, unnormalized scores (logits) rather than probabilities, so the loss normalizes them itself; this is usually needed when the final fully connected layer does not have its activation argument set to softmax (Landup, 2022). Finally, a metric needs to be specified. Unless circumstances dictate otherwise, accuracy is the standard choice; in this example, it measures how often the predicted class of an image matches the image’s true class. Other built-in metrics include area under the curve (AUC), precision, and recall.
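As a worked illustration of what a sparse categorical cross-entropy loss computes when it is given raw scores, the NumPy sketch below applies a softmax to each row of logits and averages the negative log-probability of the true class. It is a simplified stand-in written for this explanation, not Keras’s own implementation, and the function name and sample values are hypothetical.

```python
import numpy as np


def sparse_categorical_crossentropy_from_logits(labels, logits):
    """Mean cross-entropy between integer labels and raw (unnormalized) scores."""
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.int64)
    # Softmax turns each row of logits into probabilities that sum to one.
    # Subtracting the row maximum first avoids overflow in exp().
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick out the log-probability assigned to each true class.
    true_class_log_probs = log_probs[np.arange(len(labels)), labels]
    return float(-true_class_log_probs.mean())


# Two images, three classes; the labels are class indices, not one-hot vectors.
logits = [[2.0, 0.5, 0.1], [0.0, 0.0, 0.0]]
labels = [0, 2]
print(sparse_categorical_crossentropy_from_logits(labels, logits))
```

In Keras, the corresponding configuration is to pass `optimizer="adam"`, `loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True)`, and `metrics=["accuracy"]` to `model.compile`.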

One of the last lines of code related to the Keras model is the call to the model.fit function, which takes what are perhaps the two most influential factors for a CNN outside of the architecture itself: the batch size and number of epochs hyperparameters.
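A sketch of such a training call, assuming it is made through Keras’s `model.fit`, is shown below. To keep it self-contained and quick to run, it uses a deliberately tiny hypothetical model and random placeholder arrays shaped like CIFAR-10 images; the `EPOCHS` value is likewise an arbitrary choice for the demonstration.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

BATCH_SIZE = 32   # Keras default
EPOCHS = 2        # kept tiny so the sketch runs in seconds

# Placeholder data with CIFAR-10's shape; random values, for illustration only.
rng = np.random.default_rng(0)
x_demo = rng.random((64, 32, 32, 3), dtype=np.float32)
y_demo = rng.integers(0, 10, size=(64,))

# Hypothetical small model; any compiled Keras model is trained the same way.
model = keras.Sequential([
    keras.Input(shape=(32, 32, 3)),
    layers.Conv2D(8, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(10),
])
model.compile(
    optimizer="adam",
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

history = model.fit(
    x_demo, y_demo,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    verbose=0,
)
print(len(history.history["loss"]))  # one recorded loss value per epoch
```

The returned `history` object records the loss and metric values after each epoch, which is useful for judging whether further epochs are still improving the model.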

The batch size is the number of input samples that the model evaluates before updating its weights; each such update is known as an iteration (Brownlee, 2022; Baeldung, 2022). Larger batch sizes decrease the amount of time it takes for a model to train since the model does not update itself after every sample, as it would with a batch_size of 1 (i.e., “stochastic gradient descent”) (Brownlee, 2022). Usually, batch sizes are chosen as powers of two (2^n) depending on the dataset size, and the default Keras batch_size is 32. A concept related to batch size and iteration is the number of epochs.

The number of epochs is the number of times the CNN cycles through all the training data (Brownlee, 2022; Baeldung, 2022). The number of iterations in an epoch equals the number of batches needed to pass through the training data once and, as mentioned earlier, the batch size is the number of training samples evaluated within one iteration (Baeldung, 2022). The number of epochs is a hyperparameter that the experimenter either fixes in advance or determines from performance, stopping when the model is no longer making significant improvements (Baeldung, 2023). More epochs usually correlate with higher accuracy; however, training takes more time.
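The relationship between batch size, iterations, and epochs can be checked with a little arithmetic. The sketch below uses the 50,000 CIFAR-10 training images and the Keras default batch size of 32; the choice of 10 epochs and the function names are assumptions made for the example.

```python
import math


def iterations_per_epoch(num_samples, batch_size):
    """Number of weight updates needed to see every training sample once."""
    # The last batch may be smaller than batch_size, hence the ceiling.
    return math.ceil(num_samples / batch_size)


def total_iterations(num_samples, batch_size, epochs):
    """Total weight updates over the whole training run."""
    return iterations_per_epoch(num_samples, batch_size) * epochs


# CIFAR-10 training set (50,000 images) with the Keras default batch size.
print(iterations_per_epoch(50_000, 32))         # 1563
print(total_iterations(50_000, 32, epochs=10))  # 15630
```

With a batch size of 1, the same training set would require 50,000 updates per epoch, which is why larger batches shorten training time.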

In conclusion, CNNs have had a profound impact on computer vision by making it possible to interpret visual data computationally. While state-of-the-art models seem complex, they are all constructed from the same basic building blocks introduced herein. Ongoing research continues to enhance their abilities within fields such as autonomous vehicles and medical image diagnostics. As a cornerstone of artificial intelligence and computer vision, CNNs have undeniably revolutionized our interaction with the visual world.



Baeldung, 2022. The difference between epoch and iteration in neural networks. networks-epoch-vs-iteration. Accessed: July 13, 2023.

Baeldung, 2023. Epochs in neural networks. Accessed: July 17, 2023.

Brownlee, J., 2022. Difference between a batch and an epoch in a neural network. difference-between-a-batch-and-an-epoch/. Accessed: July 10, 2023.

Chollet, F., et al., 2015. Keras.

Hosna, A., Merry, E., Gyalmo, J., Alom, Z., Aung, Z., Azim, M.A., 2022. Transfer learning: a friendly introduction. Journal of Big Data 9, 102.

IBM, 2023. What is computer vision? Accessed: 2023-07-06.

Kingma, D.P., Ba, J., 2017. Adam: A method for stochastic optimization. arXiv:1412.6980.

Krizhevsky, A., 2009. Learning multiple layers of features from tiny images. URL: 18268744.

Krizhevsky, A., Sutskever, I., Hinton, G.E., 2017. Imagenet classification with deep convolutional neural networks. Commun. ACM 60, 84–90. doi:10.1145/3065386.

Landup, D., 2022. What is ’from_logits=True’ in Keras TensorFlow loss functions? in-keras-tensorflow-loss-functions/. Accessed: July 10, 2023.

LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 2278–2324. doi:10.1109/5.726791.

LeCun, Y., Cortes, C., Burges, C., 2010. Mnist handwritten digit database. ATT Labs [Online]. Available:

Meel, V., 2023. ANN and CNN: Analyzing differences and similarities. Accessed: 2023-07-06.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L., 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV) 115, 211–252. doi:10.1007/s11263-015-0816-y.

Shepardson, D., 2022. U.S. opens special probe into fatal Tesla pedestrian crash in California. Reuters.

Simic, M., 2023. Parameters vs hyperparameters. Accessed: August 17, 2023.

Torralba, A., Fergus, R., Freeman, W.T., 2008. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Transactions on pattern analysis and machine intelligence 30, 1958–1970.

Zhuang, F., Qi, Z., Duan, K., Xi, D., Zhu, Y., Zhu, H., Xiong, H., He, Q., 2020. A comprehensive survey on transfer learning. Proceedings of the IEEE 109, 43–76.


Acknowledgments: We would like to thank Harsh Anand for taking time to review this article. Header photo provided by Placidplace.