This project provides an in-depth exploration of clustering techniques, focusing on the K-Means algorithm with various configurations, alongside comparisons with DBSCAN and GMM (Gaussian Mixture Model) on the Iris dataset. The analysis covers:
- K-Means Variants:
  - Implementation with Euclidean and Manhattan distances.
  - Two initialization strategies: random (from `init_random.txt`) and farthest (from `init_farthest.txt`).
  - Cost function (Sum of Squared Errors, SSE) analysis across iterations, including the percentage change at the 10th iteration (see the sketch after this list).
- Data Preprocessing:
  - Standardization of the Iris dataset features (removing the mean, scaling to unit variance).
- PCA Visualization:
  - Projecting the 4D Iris features into 2D using PCA for intuitive visualization.
  - Visual comparison of clustering results against the true species labels.
- Clustering Methods:
  - K-Means and GMM with 3 clusters.
  - DBSCAN with density-based clustering (no predefined cluster count).
- Performance Evaluation:
  - Adjusted Rand Score to quantitatively assess clustering accuracy.
  - Visual and numerical comparison of the clustering methods.
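The full implementation lives in `Clustering.ipynb`; as a rough illustration of the configurable pieces listed above, a minimal K-Means loop with a switchable distance metric and per-iteration SSE tracking might look like the sketch below (the function and parameter names are hypothetical, not taken from the notebook):

```python
import numpy as np

def kmeans(X, centroids, metric="euclidean", n_iter=20):
    """Minimal K-Means with a configurable distance metric.

    Returns final labels plus the SSE recorded at each iteration,
    assuming no cluster empties out along the way.
    """
    centroids = np.asarray(centroids, dtype=float)
    sse_history = []
    for _ in range(n_iter):
        diff = X[:, None, :] - centroids[None, :, :]  # (n_points, k, n_features)
        if metric == "euclidean":
            dist = np.sqrt((diff ** 2).sum(axis=2))
        else:  # "manhattan"
            dist = np.abs(diff).sum(axis=2)
        labels = dist.argmin(axis=1)
        # SSE measured as squared Euclidean error to the assigned centroid.
        sse_history.append(float(((X - centroids[labels]) ** 2).sum()))
        # The mean minimizes squared Euclidean error; the median is the
        # corresponding minimizer under Manhattan distance.
        update = np.mean if metric == "euclidean" else np.median
        centroids = np.array([update(X[labels == k], axis=0)
                              for k in range(len(centroids))])
    return labels, sse_history
```

With `sse_history` in hand, the percentage change at the 10th iteration is simply `100 * (sse_history[9] - sse_history[8]) / sse_history[8]`.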
Key findings include:
- GMM outperforms K-Means and DBSCAN in terms of Adjusted Rand Score (0.90 vs. 0.62 and 0.43), thanks to its ability to model overlapping, non-spherical cluster shapes.
- K-Means with farthest initialization generally converges faster and achieves lower SSE compared to random initialization.
- PCA visualizations reveal GMM's superior alignment with true Iris species labels.
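As a sketch of the standardization and PCA projection steps behind these visualizations (the plot styling here is illustrative, not the notebook's exact code):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)  # zero mean, unit variance
X_2d = PCA(n_components=2).fit_transform(X)    # 4D -> 2D for plotting

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=iris.target, cmap="viridis", s=20)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("Iris in PCA space, colored by true species")
plt.show()
```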
```
ClusteringTechniquesExploration/
│
├── Clustering.ipynb   # Main Jupyter Notebook with the clustering analysis
└── datasets/          # Folder containing initialization files and the Iris dataset
```
To run this project, ensure you have the following Python packages installed:

```
matplotlib==3.10.0
numpy==2.2.4
pandas==2.2.3
scikit-learn==1.6.1
ipykernel==6.29.5
```

You can install them using the following command:

```bash
pip install matplotlib==3.10.0 numpy==2.2.4 pandas==2.2.3 scikit-learn==1.6.1 ipykernel==6.29.5
```

Additionally, a Chinese font (`NotoSansCJK-Regular.ttc`) is used for visualization labels. Ensure this font is available on your system, or modify the font path in the notebook accordingly.
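If the font lives elsewhere on your machine, you can register it with matplotlib explicitly; a minimal sketch (the path below is an assumption, adjust it to your installation):

```python
import matplotlib.pyplot as plt
from matplotlib import font_manager

# Hypothetical location; point this at your NotoSansCJK-Regular.ttc.
font_path = "/usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc"
font_manager.fontManager.addfont(font_path)
plt.rcParams["font.family"] = font_manager.FontProperties(fname=font_path).get_name()
```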
- Clone the Repository:

  ```bash
  git clone https://github.com/<your-username>/Clustering_Techniques_Exploration.git
  cd Clustering_Techniques_Exploration
  ```
- Set Up the Environment:
  - Ensure the required packages are installed (see Requirements).
  - If using a virtual environment, activate it:

    ```bash
    source <your-venv>/bin/activate    # On macOS/Linux
    <your-venv>\Scripts\activate       # On Windows
    ```
- Prepare the Data:
  - The Iris dataset is typically loaded via `sklearn.datasets`. If you have a custom dataset in the `datasets/` folder, ensure it is properly referenced in the notebook.
  - Ensure `init_random.txt` and `init_farthest.txt` are placed in the `datasets/` folder for K-Means initialization (see the loading sketch after these steps).
- Run the Notebook:
  - Open `Clustering.ipynb` in Jupyter Notebook or JupyterLab:

    ```bash
    jupyter notebook Clustering.ipynb
    ```

  - Execute the cells to perform clustering, visualize the results, and view the analysis.
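The exact layout of the initialization files is defined by the notebook; assuming each file stores one centroid per row as whitespace-separated feature values, loading and using them might look like this sketch (not the notebook's actual code):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

# Assumption: one centroid per row, whitespace-separated values.
init_random = np.loadtxt("datasets/init_random.txt")

# scikit-learn accepts an explicit array of starting centroids via `init`;
# n_init=1 because the starting points are fixed.
km = KMeans(n_clusters=3, init=init_random, n_init=1).fit(X)
print(km.labels_[:10])
```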
- Euclidean vs. Manhattan Distance:
  - K-Means with Euclidean distance generally achieves lower SSE because it matches the algorithm's implicit assumption of spherical clusters.
  - Manhattan distance results in slightly higher SSE but can be more robust to outliers in certain scenarios.
- Initialization Strategies:
  - Farthest initialization (from `init_farthest.txt`) leads to faster convergence and lower SSE compared to random initialization (from `init_random.txt`).
  - At the 10th iteration, the SSE percentage change with farthest initialization is typically smaller, indicating better stability.
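For context, "farthest" seeds are commonly produced by a farthest-first traversal: each new centroid is the point farthest from all centroids chosen so far. A minimal sketch of the general technique (not necessarily how `init_farthest.txt` was generated):

```python
import numpy as np

def farthest_first_init(X, k, seed=0):
    """Pick k seeds: start from a random point, then repeatedly take
    the point whose distance to its nearest chosen seed is largest."""
    rng = np.random.default_rng(seed)
    seeds = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Distance of every point to the closest seed picked so far.
        dist = np.min([np.linalg.norm(X - s, axis=1) for s in seeds], axis=0)
        seeds.append(X[np.argmax(dist)])
    return np.array(seeds)
```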
- K-Means: Produces clear, hard boundaries but struggles with overlapping regions (Adjusted Rand Score: 0.62).
- DBSCAN: Density-based and sensitive to its parameters; here it marks some points as noise and clusters suboptimally (Adjusted Rand Score: 0.43).
- GMM: Excels at capturing overlapping regions and non-spherical cluster shapes, achieving the highest accuracy (Adjusted Rand Score: 0.90).
- The PCA scatter plots show that GMM aligns most closely with the true Iris species labels, particularly in handling overlapping regions between Iris-versicolor and Iris-virginica.
- K-Means performs well for distinct clusters like Iris-setosa but struggles with overlapping regions.
- DBSCAN's noise points disrupt the visual clarity of clusters.
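A condensed sketch of the three-way comparison behind these scores (DBSCAN's `eps` and `min_samples` below are placeholders, and the notebook's settings may differ, so the printed numbers will not necessarily reproduce 0.62/0.43/0.90):

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)

predictions = {
    "K-Means": KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X),
    "GMM": GaussianMixture(n_components=3, random_state=0).fit_predict(X),
    "DBSCAN": DBSCAN(eps=0.8, min_samples=5).fit_predict(X),  # placeholder params
}
for name, pred in predictions.items():
    print(f"{name}: ARI = {adjusted_rand_score(iris.target, pred):.2f}")
```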
Below are sample plots from the analysis:
Note: Replace these placeholders with actual screenshots of your plots for better presentation.
- Experiment with additional distance metrics (e.g., cosine similarity) for K-Means.
- Optimize DBSCAN parameters to reduce noise points and improve clustering accuracy (a k-distance sketch follows this list).
- Apply the analysis to other datasets to validate findings across different data distributions.
- Explore hierarchical clustering or other advanced methods for comparison.
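For the DBSCAN tuning suggested above, a common starting point is a k-distance plot: sort every point's distance to its k-th nearest neighbor and read a candidate `eps` off the "elbow". A sketch:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

k = 5  # match DBSCAN's min_samples
dist, _ = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
plt.plot(np.sort(dist[:, -1]))  # k-th neighbor distance, sorted ascending
plt.xlabel("Points sorted by distance")
plt.ylabel(f"Distance to {k}-th nearest neighbor")
plt.title("k-distance plot for choosing DBSCAN eps")
plt.show()
```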
This project is licensed under the MIT License. See the LICENSE file for details.
- The Iris dataset is sourced from `sklearn.datasets`.
- Thanks to the open-source community for providing the tools used in this project (scikit-learn, matplotlib, etc.).
- Font support for visualization provided by Noto Sans CJK.