t-SNE (t-distributed Stochastic Neighbour Embedding)
Neighbour Embeddings
- Preserve the nearest neighbours instead of preserving distances
Stochastic Neighbour Embedding
- The loss function is the KL divergence between pairwise similarities in the high-dimensional and the low-dimensional spaces. Similarities are defined such that they sum to 1 (see the formula below)
- There is a high price for putting points that are close neighbours in high dimensions far away in the embedding
- The larger the distance between two points, the smaller their similarity, and vice versa
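In the usual notation (with $p_{ij}$ the high-dimensional and $q_{ij}$ the low-dimensional similarities), the loss is

$$
C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
$$

A large $p_{ij}$ paired with a small $q_{ij}$ makes the log ratio blow up, which is exactly the high price for separating close neighbours; the reverse mismatch is penalized only weakly.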
Calculating High-dimensional Similarity
- The kernel width is chosen per point such that roughly 30 of the similarities are large and the rest are close to zero (perplexity = 30 by default); see the formula below
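For reference, the standard definitions: each point $i$ gets a Gaussian kernel of width $\sigma_i$, and the conditional similarities are symmetrized,

$$
p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}, \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}
$$

where $\sigma_i$ is found by binary search so that the perplexity $2^{H(p_{\cdot \mid i})}$ of each conditional distribution matches the user-set value.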
Calculating Low-dimensional Similarity
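- In the low-dimensional space, t-SNE uses a kernel with no free bandwidth, normalized over all pairs:

$$
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}
$$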
Optimization
- We can use gradient descent to optimize the loss
- This works like a many-body simulation: close neighbours attract each other while all points repel each other (the gradient below makes this split explicit)
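The standard t-SNE gradient decomposes into these two forces:

$$
\frac{\partial C}{\partial y_i} = 4 \sum_{j \neq i} \left(p_{ij} - q_{ij}\right)\left(y_i - y_j\right)\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}
$$

The $p_{ij}$ term pulls $y_i$ towards its high-dimensional neighbours; the $q_{ij}$ term pushes all pairs apart.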
Early exaggeration
- During the early iterations, the attractive forces between nearby points are amplified to keep similar points together, making attraction dominate the repulsion between distant points. (If this is not done, the quality of the output has been observed to be poor; see the sketch below)
- After a few steps, early exaggeration is turned off, and repulsion then expands the clusters
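A minimal sketch, assuming scikit-learn is installed: early exaggeration is exposed as the `early_exaggeration` parameter and is only active during the first optimization phase before being switched off automatically.

```python
# Minimal sketch (assumes scikit-learn). early_exaggeration multiplies the
# p_ij values at the start of optimization, strengthening attraction so that
# tight clusters form before repulsion spreads the embedding out.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X = load_digits().data  # 1797 samples, 64 features

# scikit-learn's default is 12.0; it applies only in the early phase.
emb = TSNE(n_components=2, early_exaggeration=12.0, random_state=0).fit_transform(X)
print(emb.shape)  # (1797, 2)
```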
Speeding up t-SNE
- Vanilla t-SNE has quadratic, O(n²), complexity
- To speed it up we need to deal with both the attractive and the repulsive forces
Attractive Forces
- Use a small number of non-zero affinities, i.e. a sparse k-nearest-neighbour (kNN) graph (see the sketch below)
- Use approximate kNN search to build this graph quickly
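A sketch of the sparsification idea, assuming scikit-learn: the attractive term only needs each point's nearest neighbours (Barnes-Hut t-SNE uses roughly 3 × perplexity of them), so the affinities can be stored as a sparse kNN graph.

```python
# Sketch (assumes scikit-learn): store only ~3 * perplexity neighbours per
# point in a sparse graph instead of a dense n x n affinity matrix.
from sklearn.datasets import load_digits
from sklearn.neighbors import kneighbors_graph

X = load_digits().data
n_neighbors = 3 * 30  # the 3 * perplexity heuristic from Barnes-Hut t-SNE
knn = kneighbors_graph(X, n_neighbors=n_neighbors, mode="distance")
print(knn.shape, knn.nnz)  # (1797, 1797) shape, but only 1797 * 90 stored entries
```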
Repulsive Forces
- Use approximate repulsive forces (see the sketch after this list):
- Barnes-Hut t-SNE: O(n log n)
- FFT-accelerated interpolation-based t-SNE: O(n)
- Noise-contrastive estimation / negative sampling: O(n)
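A sketch of how these appear in practice, assuming scikit-learn (which implements Barnes-Hut); the FFT-accelerated variant is implemented in libraries such as openTSNE.

```python
# Sketch (assumes scikit-learn). method="barnes_hut" (the default) approximates
# repulsive forces with a space-partitioning tree in O(n log n); method="exact"
# computes all pairwise forces in O(n^2). angle trades accuracy for speed.
from sklearn.manifold import TSNE

fast = TSNE(method="barnes_hut", angle=0.5)  # O(n log n), the default
slow = TSNE(method="exact")                  # O(n^2), small datasets only
```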
Hyperparameters of t-SNE
Perplexity
- Perplexity is the 'effective' number of neighbours considered for similarity. The default is 30
- Much smaller values of perplexity are not useful
- Much larger values are computationally prohibitive (a small sweep is sketched below)
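A quick sketch, assuming scikit-learn, of sweeping perplexity to compare embeddings of the same data:

```python
# Sketch (assumes scikit-learn): the same data embedded at several perplexities.
# Higher perplexity means wider Gaussian kernels and noticeably longer runtimes.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X = load_digits().data
for perplexity in (5, 30, 100):
    emb = TSNE(perplexity=perplexity, random_state=0).fit_transform(X)
    print(f"perplexity={perplexity}: embedding shape {emb.shape}")
```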
Kernel
- Heavy-tailed kernels provide more fine-grained cluster results
- t-SNE uses the Cauchy kernel (a distribution with heavier tails than the Gaussian)
Role of initialization
- t-SNE preserves local structure (neighbours) but often struggles to preserve global structure. The loss function has many local minima, and initialization can play a large role
- We can use PCA as the initialization for t-SNE (see the sketch below)
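A sketch with scikit-learn: `init="pca"` seeds the embedding with the leading principal components, so the global layout starts out sensible; `init="random"` starts from Gaussian noise and relies entirely on the optimizer.

```python
# Sketch (assumes scikit-learn): PCA vs. random initialization.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X = load_digits().data
emb_pca = TSNE(init="pca", random_state=0).fit_transform(X)      # keeps more global structure
emb_rand = TSNE(init="random", random_state=0).fit_transform(X)  # layout varies run to run
```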
Exaggeration
- If the exaggeration is very high, the white space between clusters is reduced
- If there is no exaggeration, points in the same cluster can end up in different places due to repulsion