K-Nearest Neighbours

This algorithm lets us classify unknown data using data that is already classified. As you can guess, that makes it a supervised model.

We have a scatterplot here with overweight people (fat but short) in red and underweight people (tall but skinny) in blue.

There are only 2 classes: overweight or underweight.

We have a grey dot that is unclassified, and we must classify it. To do that, we find the K closest neighbours of the dot. If k is 1, we check only the closest one; if k is 2, we check the 2 closest; if k is 3, the 3 closest. We look at the distances and the classes of those closest neighbours to determine our result.
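To make the steps above concrete, here is a minimal sketch in plain Python. The (height, weight) numbers, the grey_dot position and the knn_classify helper are all made up for illustration, and the sketch assumes straight-line (Euclidean) distance with a simple majority vote, which is one common way to do it.

```python
import math
from collections import Counter

# Hypothetical (height_cm, weight_kg) points; the numbers are invented for this sketch.
points = [
    ((150, 85), "overweight"),
    ((155, 90), "overweight"),
    ((160, 95), "overweight"),
    ((185, 60), "underweight"),
    ((190, 62), "underweight"),
    ((180, 55), "underweight"),
]

def knn_classify(query, labelled_points, k):
    """Return the majority class among the k points closest to query."""
    # Sort the labelled points by Euclidean distance to the query point.
    by_distance = sorted(labelled_points, key=lambda item: math.dist(query, item[0]))
    # Keep only the classes of the k nearest neighbours.
    nearest_classes = [label for _, label in by_distance[:k]]
    # The most common class among those neighbours wins the vote.
    return Counter(nearest_classes).most_common(1)[0][0]

grey_dot = (158, 88)  # the unclassified point
print(knn_classify(grey_dot, points, k=3))
```

With these made-up numbers the 3 closest points to the grey dot are all in the overweight group, so that is the class it gets.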

If we had a green segment in the middle, and if k = 4, then these could be the neighbours:

1 neighbour is green and the other 3 are red, so by majority vote the dot would be classified as red.

Good practice is to avoid making k divisible by the number of classes you have. If you have k = 3 neighbours and 3 classes, you can get cases where the neighbours are 1 of each class, so the vote is tied. With only 2 classes, this just means picking an odd k.
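A small sketch of that tie problem, assuming a plain majority vote. The vote_winners helper and the colour lists are hypothetical, just to show what the vote looks like.

```python
from collections import Counter

def vote_winners(neighbour_classes):
    """Return every class that shares the highest vote count, sorted by name."""
    votes = Counter(neighbour_classes)
    top_count = max(votes.values())
    return sorted(cls for cls, count in votes.items() if count == top_count)

# k = 3 with 3 classes: one neighbour of each class, so nobody wins.
print(vote_winners(["red", "blue", "green"]))

# k = 4 with 1 green and 3 red neighbours: red wins clearly.
print(vote_winners(["green", "red", "red", "red"]))
```

If the returned list has more than one class in it, the vote is tied and some extra rule (for example, preferring the closer neighbour) would be needed to break it.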

SKlearn datasets and data mining

Sklearn comes with a lot of datasets. We will import one that is for malignant or benign tumors.

It's all from sklearn.datasets.

When we load the data, what we get is actually a Bunch object.

A Bunch object is pretty much scikit-learn's version of a dictionary, except that the keys can also be accessed as attributes.

So we can get the keys.

data and target are the important ones. These 2 hold the actual data; the other keys are just there to describe the dataset.
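A minimal sketch of loading the dataset and looking at its keys. It assumes the tumor dataset is the one returned by load_breast_cancer, which matches the 569 instances and 30 attributes mentioned in the DESCR; the variable name cancer is my own choice.

```python
from sklearn.datasets import load_breast_cancer

# Load the dataset; the result is a Bunch, which behaves like a dictionary.
cancer = load_breast_cancer()

print(type(cancer))
print(cancer.keys())

# Dictionary-style and attribute-style access return the same object.
print(cancer["data"] is cancer.data)
```

The exact list of keys can differ a little between scikit-learn versions, but data and target should always be there.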

All of the scikit-learn datasets are divided into data and target.

Both lists, .data and .target, have a length of 569, so the two are mapped to each other row by row. I print data and I print target. Data holds the measurements; target tells us whether each row is a malignant tumor or a benign one.

The data list is a 2d list (really a NumPy array) of other lists, which hold quantities. It is too big to show in one print. There are 569 of these lists, y'know!

The target list is a list of 1s and 0s.

Like that. 0 means malignant, 1 means benign.
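To double-check which number means what, a short sketch like this can help. It again assumes the dataset comes from load_breast_cancer, and cancer and counts are just names I picked.

```python
from collections import Counter
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# target_names[0] is the meaning of label 0, target_names[1] the meaning of label 1.
print(cancer.target_names)

# Peek at the first few labels.
print(cancer.target[:10])

# Count how many rows fall into each class, using the readable names.
counts = Counter(str(cancer.target_names[label]) for label in cancer.target)
print(counts)
```

The position of each name in target_names is the number used for it in target, so this is the place to confirm the 0 and 1 mapping.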

To understand a bit more, go back and print the keys of the dataset. There are a few things of note.

Let's read DESCR, which is the description of this dataset.

So it says number of instances = 569. Ok ok, and also 30 attributes. These 30 attributes are the features (the measurements taken for each tumor), not the classes; the classes are still just malignant and benign. How does that work with 569 instances though? Well, the thing about this dataset is that it's for people, so an instance could be one person being tested. Remember how the data list is a 2d list with other lists? What if we find the length of one of those lists?
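Here is a quick sketch to check those lengths, assuming once more that the dataset is loaded with load_breast_cancer (cancer is my own variable name).

```python
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# Number of instances (rows).
print(len(cancer.data))

# Number of attributes in a single instance (one inner list).
print(len(cancer.data[0]))

# Both numbers at once, since data is a NumPy array.
print(cancer.data.shape)

# One name per attribute.
print(len(cancer.feature_names))
```

Each inner list has one value per attribute, so its length should line up with the 30 attributes from the DESCR, while the outer length lines up with the 569 instances.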