wajason/Clustering_Techniques_Exploration

ClusteringTechniquesExploration

Overview

This project provides an in-depth exploration of clustering techniques, focusing on the K-Means algorithm with various configurations, alongside comparisons with DBSCAN and GMM (Gaussian Mixture Model) on the Iris dataset. The analysis covers:

  • K-Means Variants:
    • Implementation with Euclidean and Manhattan distances.
    • Two initialization strategies: random (from init_random.txt) and farthest (from init_farthest.txt).
    • Cost function (Sum of Squared Errors, SSE) analysis across iterations, including percentage change at the 10th iteration.
  • Data Preprocessing:
    • Standardization of the Iris dataset features (removing mean, scaling to unit variance).
  • PCA Visualization:
    • Projecting 4D Iris features into 2D using PCA for intuitive visualization.
    • Visual comparison of clustering results against true species labels.
  • Clustering Methods:
    • K-Means and GMM with 3 clusters.
    • DBSCAN with density-based clustering (no predefined cluster count).
  • Performance Evaluation:
    • Adjusted Rand Score to quantitatively assess clustering accuracy.
    • Visual and numerical comparison of clustering methods.
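The standardization step listed above can be sketched as follows (a minimal illustration using scikit-learn; the notebook's actual code may differ):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data                       # 150 samples x 4 features
X_std = StandardScaler().fit_transform(X)  # remove mean, scale to unit variance

# Each feature now has (approximately) zero mean and unit variance.
print(np.round(X_std.mean(axis=0), 6))
print(np.round(X_std.std(axis=0), 6))
```

Standardization matters here because K-Means and GMM are distance-based: without it, the features with the largest raw ranges would dominate the clustering.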

Key findings include:

  • GMM outperforms K-Means and DBSCAN in terms of Adjusted Rand Score (0.90 vs. 0.62 and 0.43), thanks to its soft assignments and full-covariance Gaussian components, which capture elongated, overlapping clusters.
  • K-Means with farthest initialization generally converges faster and achieves lower SSE compared to random initialization.
  • PCA visualizations reveal GMM's superior alignment with true Iris species labels.

Project Structure

ClusteringTechniquesExploration/
│
├── Clustering.ipynb         # Main Jupyter Notebook with the clustering analysis
└── datasets/                # Folder containing initialization files and Iris dataset

Requirements

To run this project, ensure you have the following Python packages installed:

  • matplotlib==3.10.0
  • numpy==2.2.4
  • pandas==2.2.3
  • scikit-learn==1.6.1
  • ipykernel==6.29.5

You can install them using the following command:

pip install matplotlib==3.10.0 numpy==2.2.4 pandas==2.2.3 scikit-learn==1.6.1 ipykernel==6.29.5

Additionally, a Chinese font (NotoSansCJK-Regular.ttc) is used for visualization labels. Ensure this font is available on your system, or modify the font path in the notebook accordingly.

How to Run

  1. Clone the Repository:

    git clone https://github.com/<your-username>/Clustering_Techniques_Exploration.git
    cd Clustering_Techniques_Exploration
  2. Set Up the Environment:

    • Ensure the required packages are installed (see Requirements).

    • If using a virtual environment, activate it:

      source <your-venv>/bin/activate  # On macOS/Linux
      <your-venv>\Scripts\activate     # On Windows
  3. Prepare the Data:

    • The Iris dataset is typically loaded via sklearn.datasets. If you have a custom dataset in the datasets/ folder, ensure it is properly referenced in the notebook.
    • Ensure init_random.txt and init_farthest.txt are placed in the datasets/ folder for K-Means initialization.
  4. Run the Notebook:

    • Open Clustering.ipynb in Jupyter Notebook or JupyterLab:

      jupyter notebook Clustering.ipynb
    • Execute the cells to perform clustering, visualize results, and view the analysis.

Key Findings

K-Means Analysis

  • Euclidean vs. Manhattan Distance:
    • K-Means with Euclidean distance generally achieves lower SSE, since its mean-based centroid update directly minimizes the squared-error objective and matches the algorithm's spherical-cluster assumption.
    • Manhattan distance (with median-based updates) yields slightly higher SSE but can be more robust to outliers in certain scenarios.
  • Initialization Strategies:
    • Farthest initialization (from init_farthest.txt) leads to faster convergence and lower SSE compared to random initialization (from init_random.txt).
    • At the 10th iteration, the SSE percentage change with farthest initialization is typically smaller, indicating better stability.
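The variants above can be sketched as a minimal K-Means loop with a pluggable distance metric. This is a simplified illustration, not the notebook's implementation; `init_centers` stands in for the rows read from init_random.txt / init_farthest.txt (whose file format is not shown here):

```python
import numpy as np
from sklearn.datasets import load_iris

def kmeans(X, init_centers, metric="euclidean", n_iter=10):
    """Minimal K-Means sketch supporting Euclidean and Manhattan distances."""
    centers = np.asarray(init_centers, dtype=float).copy()
    sse_history = []
    for _ in range(n_iter):
        # Pairwise distances: shape (n_samples, n_clusters)
        diff = X[:, None, :] - centers[None, :, :]
        if metric == "euclidean":
            d = np.linalg.norm(diff, axis=2)
        else:  # manhattan
            d = np.abs(diff).sum(axis=2)
        labels = d.argmin(axis=1)
        # SSE: sum of squared Euclidean distances to the assigned centers
        sse_history.append(float(((X - centers[labels]) ** 2).sum()))
        for k in range(len(centers)):
            pts = X[labels == k]
            if len(pts):
                # The mean minimizes squared error; the median minimizes L1 error.
                centers[k] = pts.mean(axis=0) if metric == "euclidean" else np.median(pts, axis=0)
    return labels, centers, sse_history

X = load_iris().data
labels_e, _, sse_e = kmeans(X, X[:3], "euclidean")   # X[:3] is a stand-in init
labels_m, _, sse_m = kmeans(X, X[:3], "manhattan")
```

With a 10-entry SSE history, the percentage change at the 10th iteration can then be computed as `100 * abs(sse_e[9] - sse_e[8]) / sse_e[8]`.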

Clustering Comparison

  • K-Means: Produces clear, hard boundaries but struggles with overlapping regions (Adjusted Rand Score: 0.62).
  • DBSCAN: Density-based clustering, sensitive to parameters, resulting in noise points and suboptimal clustering (Adjusted Rand Score: 0.43).
  • GMM: Excels at capturing elongated, overlapping clusters via soft assignments and full-covariance components, achieving the highest accuracy (Adjusted Rand Score: 0.90).
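A minimal sketch of this three-way comparison (parameter values such as `eps=0.8` and `random_state=0` are illustrative assumptions; exact scores depend on scaling, initialization, and DBSCAN's parameters):

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

X, y = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

preds = {
    "K-Means": KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_std),
    "GMM":     GaussianMixture(n_components=3, random_state=0).fit_predict(X_std),
    "DBSCAN":  DBSCAN(eps=0.8, min_samples=5).fit_predict(X_std),  # -1 marks noise
}
for name, pred in preds.items():
    print(f"{name}: ARI = {adjusted_rand_score(y, pred):.2f}")
```

Note that DBSCAN labels noise points as -1, which the Adjusted Rand Score treats as an extra cluster and therefore penalizes.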

PCA Visualization

  • The PCA scatter plots show that GMM aligns most closely with the true Iris species labels, particularly in handling overlapping regions between Iris-versicolor and Iris-virginica.
  • K-Means performs well for distinct clusters like Iris-setosa but struggles with overlapping regions.
  • DBSCAN's noise points disrupt the visual clarity of clusters.
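The projection behind those scatter plots reduces the four standardized features to two principal components. A minimal sketch (the plotting calls are commented out; variable names are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
pca = PCA(n_components=2)
X2 = pca.fit_transform(StandardScaler().fit_transform(X))

print(X2.shape)                               # (150, 2)
print(pca.explained_variance_ratio_.sum())    # fraction of variance kept in 2-D

# To reproduce a scatter plot colored by true species:
# import matplotlib.pyplot as plt
# plt.scatter(X2[:, 0], X2[:, 1], c=y)
# plt.xlabel("PC1"); plt.ylabel("PC2"); plt.show()
```

Because the first two components retain most of the variance of the standardized Iris data, the 2-D scatter is a reasonable proxy for the 4-D cluster structure.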

Visualizations

Below are sample plots from the analysis:

SSE vs. Iteration (Euclidean Distance)

SSE vs. Iteration (Manhattan Distance)

PCA Scatter Plots (Clustering Results)

Note: Replace these placeholders with actual screenshots of your plots for better presentation.

Future Improvements

  • Experiment with additional distance metrics (e.g., cosine similarity) for K-Means.
  • Optimize DBSCAN parameters to reduce noise points and improve clustering accuracy.
  • Apply the analysis to other datasets to validate findings across different data distributions.
  • Explore hierarchical clustering or other advanced methods for comparison.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Acknowledgments

  • The Iris dataset is sourced from sklearn.datasets.
  • Thanks to the open-source community for providing the tools used in this project (scikit-learn, matplotlib, etc.).
  • Font support for visualization provided by Noto Sans CJK.
