4. KNN
Idea
Classification
- For classification, KNN simply memorizes the entire training set.
- To classify a new instance: (1) find the k most similar training instances to the new instance; (2) get the labels of those training instances; (3) assign the majority class of the k nearest training instances to the new instance (see the sketch below)
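A minimal sketch of these three steps using scikit-learn's KNeighborsClassifier; the iris dataset and k=5 are illustrative choices, not from the notes:

```python
# Minimal KNN classification sketch (illustrative dataset and k).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Fitting" only memorizes the training set; no parameters are learned.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Each test point gets the majority class of its 5 nearest training points.
print(knn.predict(X_test[:3]))
print(knn.score(X_test, y_test))  # mean accuracy on the test set
```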
Regression
- The predicted value for a new instance is the mean of the target values of the k training instances whose x-values are closest to the new instance's x-value:
- (1) find the k training points whose x-values are closest to the new instance's x-value
- (2) the predicted y-value is the average target value of these k training points (see the sketch below)
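A minimal sketch of the same two steps with scikit-learn's KNeighborsRegressor; the synthetic 1-D sine data and k=3 are illustrative:

```python
# Minimal KNN regression sketch (illustrative synthetic data and k).
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.RandomState(0)
X_train = np.sort(rng.uniform(0, 5, size=40)).reshape(-1, 1)
y_train = np.sin(X_train).ravel() + rng.normal(scale=0.1, size=40)

# Prediction averages the targets of the k nearest training points.
reg = KNeighborsRegressor(n_neighbors=3)
reg.fit(X_train, y_train)

X_new = np.array([[2.5]])
print(reg.predict(X_new))  # mean y of the 3 training x-values closest to 2.5
```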
Model Complexity
- n_neighbors = 1 → high model complexity and overfitting: the model tries to get correct predictions for every single training point while ignoring the general trend between classes
- large n_neighbors → simpler model with a much smoother decision boundary that captures more of the global trend
- metric (p): the distance function between data points; in scikit-learn, p is the Minkowski power parameter (p=1 gives Manhattan distance, p=2 gives Euclidean distance). Both effects are illustrated below.
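A sketch of how n_neighbors controls complexity: k=1 memorizes the training set (near-perfect training accuracy, weaker test accuracy), while large k smooths the boundary. The breast-cancer dataset, the k values, and the p=1 example are illustrative choices:

```python
# Sweep n_neighbors and compare train vs. test accuracy.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in (1, 5, 25):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"k={k:2d}  train={knn.score(X_train, y_train):.3f}  "
          f"test={knn.score(X_test, y_test):.3f}")

# The distance function is set via `metric` and `p`, e.g. Manhattan distance:
knn_l1 = KNeighborsClassifier(n_neighbors=5, p=1).fit(X_train, y_train)
print("p=1 test accuracy:", knn_l1.score(X_test, y_test))
```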
Pros
- A k-nearest neighbor approach is a reasonable baseline to compare against more sophisticated methods.
- no explicit training phase: fitting just stores the training data (a "lazy learner")
Cons
- high computational cost at prediction time on large, high-dimensional datasets, especially when the data is sparse (each instance has many features, but most of them are zero); see the timing sketch below
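A small sketch of this cost profile, assuming random dense data (the sizes are illustrative): fitting is near-instant because it only stores the data, while prediction must search the whole training set for neighbors.

```python
# Compare fit time (stores data) vs. predict time (neighbor search).
import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X = rng.rand(50_000, 200)            # large, high-dimensional
y = rng.randint(0, 2, size=50_000)

knn = KNeighborsClassifier(n_neighbors=5)

t0 = time.perf_counter()
knn.fit(X, y)
print(f"fit:     {time.perf_counter() - t0:.3f}s")

t0 = time.perf_counter()
knn.predict(X[:500])                 # 500 queries against 50,000 points
print(f"predict: {time.perf_counter() - t0:.3f}s")
```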