Unsupervised Machine Learning K-Means, PCA

Unsupervised Learning 

It uses unlabeled training dataset rather than a labeled one as used in supervised learning. In other words we don't have a the vector of expected results, we only have a dataset of features where we can find the structure. Clustering is good for market segmentation, social network analysis, organizing computer clusters, astronomical data analysis etc.

K-Means Algorithm

The K-Means Algorithm is a most popular and widely used algorithm for automatically grouping data into coherent subsets.

1. Randomly initialize k points in the dataset called the cluster centroids.

2. Cluster assignment: assign all examples into one of k groups based on which cluster centroid the example is closest to.

3. Move centroid: compute the averages for all the points inside each of the k cluster centroid groups, then move the cluster centroid points to those averages.

4. Re-run (2) and (3) until we have found our clusters.

If we have a cluster centroid with 0 points assigned to it, we can randomly re-initialize that centroid to a new point. we can also simply eliminate that cluster group.

Optimization Objective

Our optimization objective is to minimize all our parameters using the cost function. That is we are finding all the values in the sets c, representing all clusters and μ, representing all our centroids, that will minimize the average of the distances of every training example to its corresponding cluster centroid.

Randomized Initialization

There is one particular recommended method for randomly initializing cluster centroids. For this we have to make sure that the number of clusters is less than the number of our training examples. Randomly pick k training examples. These examples should be unique. K-means can get stuck in local optima. To decrease the chance of this happening, we can run the algorithm on many different random initializations.

Choosing the Number of Clusters

Choosing K is very arbitrary and ambiguous. The Elbow method is used to find the number of clusters. Plot the cost function J and the number of cluster K. The cost function should reduce as we increase the number of clusters and then flatten. Choose K at the point where the cost function starts to flatten. J will always decrease as k increases. The one exception is if k-means get stuck at local optimum. Another way to choose K is to observe how well k-means performs on a downstream purpose.

Dimensionality Reduction

We may want to reduce the dimension of our features if we have a lot of redundant data. To do this, we find two highly correlated features, plot them, and make a new line that seems to describe both features accurately. We place all the new features on this single line. Doing dimensionality reduction will reduce the total data we have to store in computer memory and will speed up our learning algorithm.

Principal Component Analysis

The most popular dimensionality reduction algorithm is Principal Component Analysis (PCA). Given two features, x1​ and x2​, we want to find a single line that effectively describes both features at once. We then map our old features onto this new line to get a new single feature. The same can be done with three features, where we map them to a plane. The goal of PCA is to reduce the average of all the distances of every feature to the projection line. This is the projection error. In PCA, we are minimizing the shortest distance, or shortest orthogonal distances, to our data points. Before we can apply PCA, there is a data pre-processing step w e must perform.

The Steps in PCA Algorithm. 

1. Compute "Covariance matrix Σ ". 
2. Compute "eigenvectors" of covariance matrix Σ. [U,S,V] = svd(Sigma); 
3. Take the first k columns of the U matrix and compute z.

Sigma = (1/m) * X' * X

[U,S,V] = svd(Sigma);

Ureduce = U(:,1:k);

Z = X * Ureduce;

The most common use of PCA is to speed up supervised learning. Trying to prevent overfitting, we might think that reducing the features with PCA would be an effective way to address overfitting. It might work, but is not recommended because it does not consider the values of our results y. Using just regularization will be at least as effective.