Image source: https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote01_MLsetup.html
Classical machine learning and optimization methods are truly fascinating. The motivation behind this project is to explore these foundational techniques in depth. The repository is currently under development, and given the vast breadth of machine learning algorithms and optimization methods, it is likely to remain a work in progress for a long time.
ML-Toolbox/
│
├── 📂 supervised-learning/            # Supervised Learning
│   ├── 📂 perceptron*
│   ├── 📂 knn*
│   ├── 📂 naive-bayes*
│   ├── 📂 logistic-regression*
│   ├── 📂 linear-regression*
│   ├── 📂 decision-trees*
│   ├── 📂 kd-ball-trees*
│   ├── 📂 svm*
│   └── 📂 gaussian-processes*
│
├── 📂 ensemble-learning/              # Ensemble Methods
│   ├── 📂 bagging*
│   ├── 📂 boosting*
│   └── 📂 random-forests*
│
├── 📂 kernel-methods/                 # Kernels
│   ├── 📂 perceptron*
│   ├── 📂 linear-regression*
│   └── 📂 svm*
│
├── 📂 unsupervised-learning/          # Unsupervised Learning
│   ├── 📂 k-means*
│   ├── 📂 gaussian-mixture-models*
│   ├── 📂 kernel-density-estimation*
│   ├── 📂 pca*
│   └── 📂 apriori-algorithm*
│
├── 📂 deep-learning/                  # Deep Learning
│   ├── 📂 neural-networks*
│   ├── 📂 cnn*
│   ├── 📂 rnn*
│   ├── 📂 autoencoders*
│   └── 📂 variational-autoencoders*
│
├── 📂 optimization-methods/           # Optimization Techniques
│   ├── 📂 unconstrained/
│   │   ├── gradient-descent*
│   │   ├── newtons-method*
│   │   ├── quasi-newton-method*
│   │   ├── coordinate-descent*
│   │   └── conjugate-gradient*
│   │
│   └── 📂 constrained/
│
├── 📂 assets/
│   ├── 📂 data/                       # Datasets
│   ├── 📂 img/                        # Images & visual assets
│   └── 📂 scripts/                    # Preprocessing scripts
│
└── 📄 README.md
Note: * indicates work in progress.
If we had complete knowledge about a problem, we could directly write down a formula or an algorithm to solve it; finding the shortest path between two points in a graph is one such example. On the other hand, if we had complete data, solving the problem would be as simple as looking up the answer: finding a place on a map is just a lookup task, and the main challenge is choosing the right way to organize the data. In such settings, problems are solved using data structures and algorithms (DSA-Toolbox).
Machine learning can be thought of as a hybrid approach that combines knowledge and data to solve problems; in this regime, we have neither complete knowledge nor complete data. Consider the task of prediction. If we know that the problem follows a linear trend, we can model it with a straight line y = mx + c, and then use data to find the unknown parameters m and c that make the line fit best. This is exactly linear regression.
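As a rough sketch of this idea (the dataset below is synthetic and purely illustrative), fitting m and c by least squares with NumPy might look like this:

```python
import numpy as np

# Synthetic 1-D dataset: noisy samples from a linear trend (unknown to the model).
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.5 * x + 1.0 + rng.normal(scale=1.0, size=x.shape)

# Knowledge: we assume the model y = m*x + c.
# Data: we estimate m and c by least squares (np.polyfit with degree 1).
m, c = np.polyfit(x, y, deg=1)
print(f"estimated m = {m:.2f}, c = {c:.2f}")  # should land close to 2.5 and 1.0
```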
Knowledge can be combined with data in many ways; broadly, there are three places where we can inject it.
Image source: https://gpss.cc/gpss24/slides/Ek2024.pdf
- Model & Design Choices
Given a problem, our goal is to find the optimal solution h*. Since we cannot search over all possible solutions, we restrict ourselves to a class of models through design choices. These choices include how we model the problem, what assumptions we make, how we formulate our loss function, etc. These choices reflect what we already believe about the problem or what we want from the solution.
For example, consider classification. If we only care about accuracy, a Support Vector Machine (SVM) can be a good choice. But if we want probability outputs instead of just labels, models like Naive Bayes or Logistic Regression are better. Between these two, if we have very little data, Naive Bayes is often preferred. It makes explicit assumptions about data distribution (e.g., Gaussian Naive Bayes), and if those assumptions are close to reality, it works well even with limited data. Logistic Regression makes fewer assumptions and is more flexible. However, because of this, it usually needs more data to perform well.
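To make this concrete, here is a small sketch (not code from this repository; the dataset and its size are made up) comparing Gaussian Naive Bayes and Logistic Regression with scikit-learn, both of which expose class probabilities via predict_proba:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Small synthetic dataset to mimic a low-data regime (sizes are arbitrary).
X, y = make_classification(n_samples=60, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

nb = GaussianNB().fit(X_tr, y_tr)                        # strong Gaussian assumption
lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)   # fewer assumptions

print("Naive Bayes accuracy:        ", nb.score(X_te, y_te))
print("Logistic Regression accuracy:", lr.score(X_te, y_te))
print("NB P(y|x), first test point: ", nb.predict_proba(X_te[:1]))
print("LR P(y|x), first test point: ", lr.predict_proba(X_te[:1]))
```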
Another example is regularization. L2 regularization is useful when we want smoother and simpler models. L1 regularization is helpful when we want the solution to be sparse.
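A quick illustration of this difference (again a sketch with made-up data, using scikit-learn's Ridge and Lasso as the L2- and L1-regularized linear models):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic regression problem where only 3 of 20 features actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
true_w = np.zeros(20)
true_w[:3] = [3.0, -2.0, 1.5]
y = X @ true_w + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks all weights smoothly
lasso = Lasso(alpha=0.1).fit(X, y)   # L1: pushes irrelevant weights to exactly zero

print("non-zero weights (Ridge):", int(np.sum(np.abs(ridge.coef_) > 1e-6)))  # close to 20
print("non-zero weights (Lasso):", int(np.sum(np.abs(lasso.coef_) > 1e-6)))  # close to 3
```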
All these choices define the gap between h* and h_opt, where h_opt is the best model that can be produced based on our design choices.
- Data
Data is the second place where we can inject knowledge. h_opt is the best solution we would get if we had perfect or unlimited data. But with limited or biased data, we end up with ĥ_opt instead. One way to improve this is to collect more data. Another way is to use knowledge about the problem itself.
For example, if all training images have bright backgrounds, the model may not work well on images with darker backgrounds. In this case, we can use data augmentation to include different lighting conditions during training. This can help reduce the gap between h_opt and ĥ_opt.
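A minimal sketch of such an augmentation (the brightness factors and image shapes are invented for illustration; a real pipeline would typically use a library such as torchvision or albumentations):

```python
import numpy as np

def augment_brightness(image: np.ndarray, factors=(0.5, 0.75, 1.0, 1.25)):
    """Return copies of `image` under different simulated lighting conditions."""
    return [np.clip(image * f, 0.0, 1.0) for f in factors]

# Hypothetical batch of bright-background images with pixel values in [0, 1].
batch = np.random.default_rng(0).uniform(0.6, 1.0, size=(4, 28, 28))
augmented = [aug for img in batch for aug in augment_brightness(img)]
print(len(augmented))  # 4 images x 4 lighting conditions = 16 training samples
```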
- Optimization
The third place where we can inject knowledge is the way we optimize the model. For some problems, like total variation denoising, the ADMM algorithm may perform better than gradient descent. In other cases, Newton's method may reach the solution in a few steps, while gradient descent takes much longer. Another optimization-side choice is the hyperparameters: the learning rate, batch size, and other constants. ĥ_opt represents the best solution achievable with ideal optimization, and ĥ represents the solution we actually obtain.
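The gap is easiest to see on a toy problem. The sketch below (a made-up quadratic objective, not one of the repository's examples) contrasts gradient descent with Newton's method, which exploits curvature through the Hessian:

```python
import numpy as np

# Toy quadratic f(w) = 0.5 * w^T A w - b^T w; gradient A w - b, Hessian A.
A = np.array([[10.0, 0.0], [0.0, 1.0]])   # mildly ill-conditioned problem
b = np.array([1.0, 1.0])
w_star = np.linalg.solve(A, b)            # exact minimizer, for reference

# Gradient descent: step size is limited by the largest eigenvalue of A.
w = np.zeros(2)
for _ in range(200):
    w -= 0.05 * (A @ w - b)
print("gradient descent error after 200 steps:", np.linalg.norm(w - w_star))

# Newton's method: rescales the gradient by the inverse Hessian, so a single
# step lands on the minimizer of a quadratic.
w = np.zeros(2)
w -= np.linalg.solve(A, A @ w - b)
print("Newton error after 1 step:", np.linalg.norm(w - w_star))
```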
However, with great power comes great responsibility. In all three components, if our assumptions do not match reality, they will increase the error instead of reducing it.
The datasets are either public datasets from libraries (such as Fashion-MNIST) or datasets downloaded from Kaggle. Details about each dataset used can be found in assets/data/README.md. The datasets can be downloaded from their original sources or from this link. Once downloaded, the datasets must be placed in the /assets/data folder.
- Cornell CS4780 Machine Learning for Intelligent Systems by Prof. Kilian Weinberger.
- CS7.403 Statistical Methods in Artificial Intelligence course by IIIT Hyderabad.
- Gaussian Process Summer School 2024.
- MIT 6.036 Machine Learning by Prof. Tamara Broderick.
- Bias-Variance Tradeoff by MIT OpenCourseWare and the Stanford NLP Group.
- Kernel Methods in Computer Vision by Prof. Christoph Lampert, and notes on Lagrange multipliers and KKT conditions.
- Neural Networks and Deep Learning Online Book by Michael Nielsen.
- Talk on Association Rule Mining by Prof. Ami Gates.