This project explores a foundational classification task using the K-Nearest Neighbors (KNN) algorithm on the widely used Breast Cancer Wisconsin (Diagnostic) dataset bundled with scikit-learn.
The primary goal is to understand how KNN behaves when distinguishing malignant from benign tumors based on numerical features computed from digitized images of fine needle aspirates of breast masses.
The notebook follows a clean ML workflow:
- Data Loading: Load breast cancer data via `sklearn.datasets.load_breast_cancer()`.
- Exploratory Data Analysis:
  - View data shape, features, and class distributions.
  - Use `seaborn` for basic visualization.
- Train-Test Split:
  - Split dataset into training and test sets (80/20).
- Modeling with KNN:
  - Apply `KNeighborsClassifier` from `sklearn.neighbors`.
  - Test different `k` values (number of neighbors).
- Evaluation:
  - Accuracy evaluation.
  - Confusion matrix plotting.
  - Error rate plot vs. `k`.
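
The workflow above can be sketched in a few lines. This is a minimal illustration, not the notebook's exact code: `random_state=42`, `stratify=y`, and `n_neighbors=5` are assumptions chosen for reproducibility and are not stated in this README.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load features (X) and labels (y); label 0 = malignant, 1 = benign.
X, y = load_breast_cancer(return_X_y=True)

# 80/20 split. random_state and stratify are assumptions made here so the
# sketch is reproducible; the notebook may use different settings.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit KNN with an illustrative k (5 is scikit-learn's default n_neighbors).
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Score the model on the held-out test set.
y_pred = knn.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Train size: {X_train.shape[0]}, test size: {X_test.shape[0]}")
print(f"Test accuracy with k=5: {accuracy:.3f}")
```

With 569 samples, an 80/20 split leaves 455 rows for training and 114 for testing.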
- Python 3
- `numpy`, `pandas` – numerical computing and data manipulation
- `seaborn`, `matplotlib` – visualization
- `scikit-learn` – datasets, model training, evaluation
- Source: `sklearn.datasets.load_breast_cancer()`
- Features: 30 real-valued features (e.g., mean radius, texture, smoothness)
- Target Classes:
  - `0`: malignant
  - `1`: benign
- Samples: 569
- Goal: Predict whether a tumor is malignant or benign
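
The dataset facts listed above can be checked directly from the loader. The snippet below is a small sketch; the variable names (`data`, `class_counts`) are illustrative and not taken from the notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# The loader returns a Bunch with data, target, feature_names, target_names.
data = load_breast_cancer()

n_samples, n_features = data.data.shape

# Count samples per class; index 0 is malignant, index 1 is benign.
counts = np.bincount(data.target)
class_counts = {
    str(name): int(count) for name, count in zip(data.target_names, counts)
}

print(f"Samples: {n_samples}, features: {n_features}")
print(f"First features: {list(data.feature_names[:3])}")
print(f"Class counts: {class_counts}")
```

The class counts show a mild imbalance: benign cases outnumber malignant ones.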
- Download or clone the repository.
- Install required libraries: `pip install numpy pandas matplotlib seaborn scikit-learn`
- Run the notebook: `jupyter notebook knn.ipynb`
- Train/Test Split: 80/20
- KNN Accuracy: Varies with `k`, best value selected via error rate plot
- Visualization:
  - Target distribution via `seaborn.countplot`
  - Confusion matrix for classification performance
  - Error vs. `k` plot to optimize hyperparameters
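
One way the error-rate curve and confusion matrix could be computed is sketched below. The range of `k` (1 to 20), the `random_state`, and measuring error on the test split are all assumptions for illustration; the notebook may differ on each.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y  # assumed settings
)

# Error rate = fraction of misclassified test samples, for each candidate k.
k_values = list(range(1, 21))  # assumed search range
error_rates = []
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    error_rates.append(float(np.mean(model.predict(X_test) != y_test)))

# Pick the k with the lowest error (the first one in case of ties).
best_k = k_values[int(np.argmin(error_rates))]

# Confusion matrix for the selected k: rows are true labels, columns predicted.
best_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
cm = confusion_matrix(y_test, best_model.predict(X_test))

print(f"Best k: {best_k} (error rate {min(error_rates):.3f})")
print(cm)
```

Passing `k_values` and `error_rates` to `matplotlib.pyplot.plot` gives the error vs. `k` curve. Because `k` is chosen on the same test data used for reporting here, the resulting accuracy is likely somewhat optimistic.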
- 📈 Error rate vs. `k` visualization
- ✅ Confusion matrix
- 📊 Class distribution plots
- Demonstrates clear understanding of:
  - The importance of choosing the right `k`
  - Effects of class imbalance
  - Train-test split strategy
- A practical and well-scaffolded implementation of KNN in binary classification.
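
The interaction between class imbalance and split strategy can be illustrated by comparing a plain random split with a stratified one. This is a hedged sketch: whether the notebook stratifies its split is not stated, and `random_state=0` is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Share of benign samples (label 1) in the full dataset.
overall_benign = float(np.mean(y))

# Plain random 80/20 split: the class ratio in the test set can drift.
_, _, _, y_test_plain = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Stratified split: the class ratio is preserved as closely as possible.
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

plain_benign = float(np.mean(y_test_plain))
strat_benign = float(np.mean(y_test_strat))

print(f"Benign share overall:     {overall_benign:.3f}")
print(f"Benign share, plain:      {plain_benign:.3f}")
print(f"Benign share, stratified: {strat_benign:.3f}")
```

Stratifying keeps the test set representative of the full class balance, which makes accuracy and confusion-matrix figures easier to interpret.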
- Email: imehranasgari@gmail.com
- GitHub: https://github.com/imehranasgari
This project is licensed under the Apache 2.0 License – see the LICENSE file for details.
💡 Some interactive outputs (e.g., plots, widgets) may not display correctly on GitHub. If so, please view this notebook via nbviewer.org for full rendering.