Improving Convolutional Neural Network's (CNN) Accuracy using t-SNE

Prasun Biswas
6 min read · Jul 23, 2020

Written by Prasun Biswas and Chandan Durgia

Image by Munro on Unsplash

It is always a good feeling when you are able to weave a story. This article is an extension of our earlier piece, "Demystifying CNN", which aimed to take a generalized CNN to a more complex version and decode the model components and their contributions to overall performance. If you haven't read that article, quench your thirst first … here is the link.

Ideally, CNNs built with transfer learning or with tools like Teachable Machine are quite stable and widely accepted. However, these generalized or pre-trained models are sometimes not suitable for business use. The problem is, you probably guessed it, the number of layers and features and the complexity that comes with them.

So, the question is: is there another secret sauce needed for the near-perfect dish?

In the earlier post, we delved into how understanding the individual components and tweaking the layers or features could help explain the model and improve its performance. However, the problem starts when there are a significant number of layers and features. This causes two main issues:

1) Overfitting.

2) Tweaking features can be very time consuming, since execution times in deep learning are already significantly high.

Furthermore, drawing an analogy with regression analysis, a problem similar to multicollinearity can also arise.

Therefore, it is of utmost importance that dimensionality reduction is considered and applied appropriately to CNN models as well. When one thinks of dimensionality reduction, the first thought is usually to try Principal Component Analysis (PCA).

Credit should be given where it is due!! PCA usually does a pretty good job of reducing dimensions, but the big pitfall is that, since it is a linear algorithm, it cannot capture the complex polynomial relationships between features that are common in CNN models. Moreover, while projecting multi-dimensional points to a lower-dimensional space, PCA may place points that were close in the higher-dimensional space far apart in the lower-dimensional space, which ideally shouldn't happen.

Regardless, it's worth noting that PCA is still commonly used in industry for CNN models as well.
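
To make this concrete, here is a minimal sketch of how PCA is typically applied to flattened CNN feature vectors. It assumes scikit-learn and uses a randomly generated stand-in for the feature matrix rather than real CNN activations:

```python
# Minimal sketch: PCA on flattened CNN feature vectors.
# `features` is a stand-in (n_samples, n_features) array, e.g. what you would
# get from a CNN's penultimate layer after flattening.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
features = rng.normal(size=(1000, 512))      # placeholder for real CNN features

pca = PCA(n_components=50)                   # keep 50 linear components
reduced = pca.fit_transform(features)

print(reduced.shape)                         # (1000, 50)
print(pca.explained_variance_ratio_.sum())   # fraction of variance retained
```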

t-SNE to the rescue!!

The two limitations of PCA discussed above are well covered by the t-SNE (t-Distributed Stochastic Neighbor Embedding) method of dimensionality reduction. t-SNE is non-linear, so it can capture the complex polynomial relationships between features. In addition, points that are close in the high-dimensional space remain close in the low-dimensional space as well.

The gist is that t-SNE projects data into a lower-dimensional space in a manner that retains the clustering (proximity) of the data points.
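
For intuition, here is a minimal sketch using scikit-learn's t-SNE on the small digits dataset (chosen purely for illustration, not the image data from our earlier article); points belonging to the same class tend to land close together in the 2-D embedding:

```python
# Minimal sketch: project high-dimensional points to 2-D with t-SNE.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)          # 64-dimensional digit images

tsne = TSNE(n_components=2, perplexity=30, random_state=0)
embedding = tsne.fit_transform(X)            # shape (n_samples, 2)

print(embedding.shape)
# Scatter-plotting `embedding` coloured by `y` typically shows one cluster per digit.
```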

How t-SNE works

The t-SNE algorithm includes two main stages:

1) It creates a probability distribution over pairs of high-dimensional objects, using a Gaussian kernel around each point, in such a way that similar objects are assigned a high probability while dissimilar points are assigned a very small probability.

2) Then, it defines a similar probability distribution over the points in the low-dimensional space, this time using a Student t-distribution (which gives the method its name), and treats the mapping as an optimization problem: gradient descent is used to minimize the Kullback-Leibler divergence, a commonly used measure of the difference between two distributions, with respect to the locations of the points in that space (see the formulas sketched just below).
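
For reference, the quantities involved can be written down as follows. This is the standard t-SNE formulation from van der Maaten and Hinton, summarized here for convenience: a Gaussian-based similarity p in the high-dimensional space, a Student-t based similarity q in the low-dimensional map, and the KL divergence between them as the cost.

```latex
% High-dimensional similarities: Gaussian kernel centred on each point x_i.
% The bandwidth sigma_i is chosen so that the perplexity of p_{.|i} equals
% the user-set perplexity value.
p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}

% Low-dimensional similarities: Student t-distribution with one degree of freedom.
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

% Cost function minimised by gradient descent over the map points y.
C = \mathrm{KL}(P \parallel Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

Note that the Gaussian bandwidth is tuned per point so that the conditional distribution matches the perplexity the user specifies, which is exactly the hyperparameter discussed below.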

In other words, t-SNE takes points in a high-dimensional space and transforms them into a low-dimensional space while preserving the similarity (nearness) of points within a cluster and between clusters.

Image by Author

Without going into much depth, it is worth acknowledging that the algorithm is non-linear in nature and adjusts to the underlying data by performing different transformations on different regions. These region-specific transformations add considerable mathematical complexity, and appropriate hyperparameter tuning becomes imperative.

Optimizing t-SNE Hyperparameters

Below are the key hyperparameters which are used in the t-SNE algorithm:

Perplexity: It is a measure of the density of the data. If the points in a cluster are close together, the data is dense within the cluster; if the points are far apart, the density is low.

As a parameter, perplexity tries to balance attention between the local (density within a cluster) and global (density between clusters) aspects of the data. In other words, it is a guess about the number of close neighbors each point has. A higher perplexity means each point effectively considers more nearest neighbors (as a loose analogy with KNN, where a widely accepted rule of thumb is an optimal K on the order of N^(1/2)).

From an image recognition perspective, the perplexity value has a complex effect on the resulting plots. The original paper by van der Maaten and Hinton says, 'The performance of SNE is fairly robust to changes in the perplexity, and typical values are between 5 and 50.' There is a tendency towards clearer shapes as the perplexity value increases, but the most appropriate value depends on the density of your data. Loosely speaking, a larger/denser dataset requires a larger perplexity.
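
To see this effect in practice, one can fit the embedding at a few perplexity values and compare the resulting plots. A rough sketch, assuming scikit-learn and a randomly generated stand-in for the feature matrix, is below:

```python
# Sketch: compare t-SNE embeddings at different perplexity values.
# `features` is a stand-in (n_samples, n_dims) array of CNN features.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
features = rng.normal(size=(500, 128))       # placeholder for real data

for perplexity in (5, 30, 50):
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    emb = tsne.fit_transform(features)
    # In practice you would scatter-plot `emb` for each value and inspect how
    # local clusters versus the global layout change with perplexity.
    print(perplexity, emb.shape)
```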

Number of Iterations: In principle, the more iterations, the better the result. In practice this is not always feasible, since deep learning datasets are already huge. If too few iterations are used, the clusters might not be visible, and a huge clump of data points appears in the center of the t-SNE plot. A rule of thumb from the literature is that once the largest distance between data points in the embedding is on the order of ~100, the algorithm has reached convergence, and further increasing the number of iterations will only marginally change the plot.

Learning rate: The learning rate is a common tuning parameter that determines the step size at each iteration while moving toward a minimum of the loss function. If the learning rate is too high, the data may look like one widely spread cluster, with each point roughly equidistant from its nearest neighbors; many points also separate from their local clusters because overly large corrections are made too quickly. If the learning rate is too low, most map points may look compressed into a very dense cluster with few outliers and little clear separation. Since t-SNE is an iterative algorithm, it is also important to allow enough iterations for it to converge to a state where further changes are minute.
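
Both of these knobs are exposed directly in scikit-learn's t-SNE. A hedged sketch is below; note that the iteration budget is a constructor argument named n_iter in older scikit-learn releases and max_iter in newer ones, and that the fitted object exposes the final KL divergence, which gives a rough convergence check:

```python
# Sketch: tuning the learning rate and checking convergence.
# `features` is again a stand-in array of high-dimensional points.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
features = rng.normal(size=(500, 128))

tsne = TSNE(
    n_components=2,
    perplexity=30,
    learning_rate=200.0,   # too high -> scattered points; too low -> one dense clump
    random_state=0,
)
# The iteration budget can also be passed to the constructor (n_iter in older
# scikit-learn versions, max_iter in newer ones); the default of 1000 is a
# reasonable starting point.
emb = tsne.fit_transform(features)

# Final KL divergence between P and Q: if it barely changes when you increase
# the iteration budget, the embedding has effectively converged.
print(tsne.kl_divergence_)
```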

t-SNE for improving accuracy

As we know, it is possible to calculate variable importance in simpler methodologies, say a random forest, because of their less black-box architecture. For a CNN, however, it is very difficult to understand which variables/features affect the outcome most. This can only be done by selectively adding and/or removing variables, which, given the large number of variables in deep learning, is not recommended. That is where t-SNE becomes handy and, indeed, imperative.

Finally, connecting the dots from the earlier post.

In the previous article, we talked about the .h5 TensorFlow file, which is basically the model auto-generated by Teachable Machine for the given image classification problem, and we discussed how, by loading that model into Netron or TensorBoard, we can understand its individual components. Even though such a model is ready to be used, one can imagine it might still need some improvement in accuracy or overall stability, or adjustments to align with business requirements. Leveraging t-SNE, one can boost accuracy/stability by running the model on a much smaller number of extracted features, thereby making it a tad less complex.
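
As a rough sketch of that workflow, assuming TensorFlow/Keras and scikit-learn, and with the file name "model.h5", the layer choice, and the `images`/`labels` arrays all being placeholders rather than the actual artifacts from the earlier article: load the saved model, take the penultimate layer's activations as features, embed them with t-SNE, and fit a lightweight classifier on the low-dimensional representation.

```python
# Sketch of the workflow described above. File path, input shape, layer choice
# and data are placeholders; swap in your own model and images.
import numpy as np
import tensorflow as tf
from sklearn.manifold import TSNE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder data: replace with your real image tensor and labels
# (adjust the shape to match your model's expected input).
images = np.random.rand(200, 224, 224, 3).astype("float32")
labels = np.random.randint(0, 2, size=200)

model = tf.keras.models.load_model("model.h5")   # placeholder path to the saved model

# Use the penultimate layer's activations as the extracted features.
feature_extractor = tf.keras.Model(inputs=model.input,
                                   outputs=model.layers[-2].output)
features = feature_extractor.predict(images)

# Embed the features in 2-D with t-SNE.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)

# Fit a lightweight classifier on the low-dimensional representation.
scores = cross_val_score(LogisticRegression(max_iter=1000), embedding, labels, cv=5)
print("mean CV accuracy:", scores.mean())
```

One practical caveat: scikit-learn's TSNE does not provide a transform method for unseen samples, so the embedding has to be recomputed whenever new data arrives.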

Who doesn't like simplicity!!

Happy Learning!

