We have Mark, and we want to determine whether he will go outside today. What affects his decision is whether it is sunny, windy, or rainy, and whether he's alone.
We give the decision tree data on whether he went outside and what the conditions were each day, and the decision tree ends up with a classification: is he going outside or not?
The decision tree asks a series of questions which branch off into different paths. At the end of each branch is a value: yes, we are going outside, or no, we stay indoors. Some decisions the tree doesn't even need to include; it may be that two decisions lead to the same outcome, so the parent decision leads to a single outcome. Also, the decisions don't have to be binary: a node can have 3 branches, 4 branches, or even more.
How it's done: we give the tree a root node, and the root node asks the first question.
From that, several branches can follow. In our example, we ask whether it's sunny and get yes or no. So this is the dataset:
These are the bare minimum we need: sunny? rainy? windy? alone? And the conclusion value.
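The table itself isn't shown here, so here is a small made-up version of it in Python (the values are illustrative, not taken from the original):

    # Hypothetical version of Mark's dataset: one row per day,
    # the four conditions plus the conclusion (did he go outside?).
    import pandas as pd

    data = pd.DataFrame({
        "sunny":      [1, 1, 0, 0, 1, 0, 0, 1],
        "rainy":      [0, 0, 1, 1, 0, 1, 0, 0],
        "windy":      [0, 1, 0, 1, 1, 0, 1, 0],
        "alone":      [0, 0, 1, 0, 1, 1, 0, 1],
        "conclusion": [1, 1, 0, 0, 1, 0, 0, 1],  # 1 = goes outside, 0 = stays in
    })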
So at the root node, where we ask whether it's sunny, we compare the answer with the conclusion
to see if there is an immediate correlation between sunny and the conclusion. In our case, he always goes out when it's sunny. Count the number of times the conclusion results in yes and the number of times it results in no.
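Continuing from the toy dataset above, that counting could look like this (a sketch, not the tree's actual training code):

    # Count how often each conclusion appears for each value of "sunny".
    # In this made-up data, sunny == 1 always goes with conclusion == 1.
    counts = data.groupby("sunny")["conclusion"].value_counts()
    print(counts)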
Then we check the next node, the rain node:
if that branch always concludes in a NO, or always concludes in a YES, then we know there is a 100% correlation and we can end the branching there.
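A minimal sketch of that purity check, again on the toy dataset (real tree learners do this with an impurity measure like entropy or Gini, which this simple version skips):

    # A node can become a leaf as soon as it is "pure": every row
    # that reaches it has the same conclusion.
    def is_pure(labels):
        return len(set(labels)) == 1

    rainy_days = data[data["rainy"] == 1]["conclusion"]
    print(is_pure(rainy_days))  # True here: rain always means staying in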
Random forest classification
What your root node is matters. Every correlation we find depends on the root node. Sometimes starting at a different root node eliminates a lot of redundant checks. For example, if rain always means no outside, we wouldn't want to ask about rain several times deep in the tree as the third checked node; making rain the root means we never ask it more than once.
TLDR: Decision trees may vary wildly depending on the root node.
This is the problem that random forest classification solves.
It makes several decision trees and trains them on the same data.
Then we test the forest on testing data. The most voted-on answer, the democratic mood of the forest, is our answer. This reduces the risk of misclassification.
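A minimal sketch of that majority vote, with three imaginary trees voting on one sample (the predictions are made up for illustration):

    from collections import Counter

    # One prediction per tree in the forest.
    votes = ["yes", "no", "yes"]
    answer = Counter(votes).most_common(1)[0][0]
    print(answer)  # "yes" wins the vote 2-1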
Programming
So we will use 2 classifiers here: one for the tree and one for the forest. You don't need both, but we use both just to compare.
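Assuming scikit-learn (the section doesn't name the library, but train_test_split below points to it), the two classifiers are:

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier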
Then, for data creation, get the dataset and do the train_test_split.
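The section doesn't say which dataset is used, so as a stand-in this sketch loads scikit-learn's built-in iris data and splits it:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    # Hold out 25% of the rows for testing.
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42
    )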
OK, we've got the data. Let's make the classifiers.
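Continuing the sketch, one tree and one forest, trained on the same split (the hyperparameters are defaults, not taken from the source):

    tree = DecisionTreeClassifier(random_state=42)
    forest = RandomForestClassifier(n_estimators=100, random_state=42)

    tree.fit(X_train, y_train)
    forest.fit(X_train, y_train)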
Then we test both models and see the results.
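For example, scoring both models on the held-out test set:

    from sklearn.metrics import accuracy_score

    print("tree:  ", accuracy_score(y_test, tree.predict(X_test)))
    print("forest:", accuracy_score(y_test, forest.predict(X_test)))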
No surprise: the more, the merrier.
Still, most of the time support vector classifiers are the best way to go.