This is our first unsupervised model.

We just have a bunch of unlabeled points on our graph that visibly form clusters. We want to find some number K of clusters, and then when we are given a new, unclustered point we can assign it to one of them.

Real data may not be as clean as the graph above, and we may also be working in 20 dimensions, 40 dimensions, or more, where we can't just eyeball the clusters. The computer needs an algorithm.

The algorithm goes like this:

You give it a number of clusters K, and it places K centroids around the grid.

The centroids are the centres of the clusters, and at the start they are placed randomly. Then for each data point we ask: which centroid is nearest? The point joins that centroid's cluster.

(The drawing is not the most accurate.)

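Then we throw away each centroid's old position and move it to the centre (the mean) of all the points assigned to it, so that the centroid really is in the centre of its cluster.

But you see, the assignments are not necessarily valid anymore: now that the centroids have moved, some points may be closer to a different centroid.

So we do the same again: with the new centroids, we ask for every point, which centroid am I nearest to? And the assignments all update.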

Then the centroids move to the centre of their points again.

Then we ask again for all points: which centroid am I closest to? When nothing changes, the algorithm ends.
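The loop above can be sketched in a few lines of NumPy. This is only a rough sketch to make the steps concrete, not what scikit-learn actually does internally. The function name kmeans_sketch, the max_iters cap, the seed, and the choice to start the centroids on randomly picked data points are all my own assumptions.

```python
import numpy as np

def kmeans_sketch(points, k, max_iters=100, seed=0):
    """Toy K-means: assign points to the nearest centroid, re-centre, repeat."""
    rng = np.random.default_rng(seed)
    # Assumption: "randomly placed" here means starting on k distinct data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iters):
        # Distance from every point to every centroid, shape (n_points, k).
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        # Each point asks: which centroid am I nearest to?
        new_labels = distances.argmin(axis=1)
        # Nothing changed since the last round, so the algorithm ends.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Move each centroid to the mean of the points assigned to it.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # leave a centroid with no points where it is
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Two obvious blobs: one near (0, 0) and one near (10, 10).
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = kmeans_sketch(points, k=2)
print(labels)
```

The first three points should end up sharing one label and the last three the other. Which blob gets called 0 and which gets called 1 depends on the random start.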

Programming

We still use scikit-learn (sklearn).

from sklearn.cluster import KMeans

We also want to use scale from preprocessing to standardize our data, so that every feature ends up with a mean of 0 and a standard deviation of 1.

from sklearn.preprocessing import scale

The dataset we will be using is the digits dataset, which we get from load_digits (in sklearn.datasets).

This is a dataset of hand-written digits.

We don't want to tell the AI what classification each digit has. We just want it to spot patterns in the data, and then we want to give it new data to see which category it fits in. We give the model a bunch of pictures and the model tells us which pictures are similar and which are not.

OK, let's make the data now.

We load the digits, then we scale the data to standardize it.
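A minimal sketch of that step, assuming we only keep the pixel values and ignore the labels. The variable names digits and data are my own.

```python
from sklearn.datasets import load_digits
from sklearn.preprocessing import scale

# Load the hand-written digits; digits.data holds one row of 64 pixel values per image.
digits = load_digits()

# Standardize every pixel column to mean 0 and standard deviation 1.
# We deliberately do not touch digits.target: the model should never see the labels.
data = scale(digits.data)

print(data.shape)
```

Each row is one 8x8 image flattened into 64 numbers, so the model only ever sees points in 64 dimensions.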

Then we make the model.

n_clusters is the number of clusters we want.

init is the initialization method. We could pass an array or a callable in there, but we set it to "random" for now, which works.

n_init is the number of times the algorithm is run with different starting positions, that is, how many times our centroids get randomized. The best of those runs is kept.
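Putting those three parameters together might look like the sketch below. The values are my own assumptions: 10 clusters because there are ten digits, and 10 restarts as a reasonable number of tries.

```python
from sklearn.cluster import KMeans

model = KMeans(
    n_clusters=10,   # assumption: one cluster per digit, 0 through 9
    init="random",   # place the starting centroids randomly
    n_init=10,       # assumption: try 10 different random starts
)

print(model)
```

Nothing has been computed at this point. The model only stores its settings until we fit it.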

The model is fitted on the data alone, a single array. There are no classes, just points.

Now it is important to know that we cannot test this the way we test a supervised model. The model doesn't know there is a right or a wrong answer.

We could use predict, and it will tell us which cluster a bunch of pixels belongs to, but it will not know what digit it is.
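A rough end-to-end sketch of fitting and predicting, under the same assumptions as before (10 clusters, 10 random starts). The random_state is my own addition so the run is repeatable, and predicting on the first five training images is just for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.preprocessing import scale

digits = load_digits()
data = scale(digits.data)

# Assumed settings; random_state only makes the result repeatable.
model = KMeans(n_clusters=10, init="random", n_init=10, random_state=0)

# fit takes only the data. There is no second argument with the answers.
model.fit(data)

# predict returns a cluster number for each image, not the digit it shows.
cluster_ids = model.predict(data[:5])
print(cluster_ids)
```

The numbers that come out are cluster ids. Cluster 3 does not mean the digit 3, since the numbering of the clusters is arbitrary.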

pretend there is an image file there