This project serves a dual purpose. Throughout undergrad, my computational training was applied (comp-bio/structural-bio research plus informatics and modelling coursework), with algorithmic implementations coming mainly from high-level modelling libraries. As I continue to grow into a data scientist, I am looking to build my experience in numerical computing. My two main goals at the moment are to:
- Implement algorithms from scratch using only scientific computing libraries for vectorization and dataframe-oriented frameworks like Pandas.
- Write out the derivations for the various models I have been using day-to-day and build on them in order to expand my theoretical statistical foundations.
The result should be a robust library of foundational biostatistical and machine learning methods derived in the docs, implemented in my backend, and used in the dashboard currently under development.
Outlined below are the different models I have implemented so far.
There are examples at the bottom of each Python module, inside the "main" block, that can be run to test each implementation if anyone perusing is curious. There are, of course, unlisted dependencies (i.e. numpy/scipy/scikit-learn), but this is not really meant for public use at the moment, so for now I leave it up to the user to pip/conda install their way to success.
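For anyone unfamiliar with the pattern, here is a minimal sketch of what a module with a runnable "main" block can look like. The function name `fit_univariate_ols` and the toy data are hypothetical and used only for illustration; they are not taken from the actual package.

```python
import numpy as np


def fit_univariate_ols(x, y):
    """Closed-form least-squares fit of y = intercept + slope * x.

    Hypothetical helper for illustration only, not the package's API.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    # slope = cov(x, y) / var(x), computed with vectorized dot products
    slope = (x_centered @ (y - y.mean())) / (x_centered @ x_centered)
    intercept = y.mean() - slope * x.mean()
    return intercept, slope


if __name__ == "__main__":
    # Noise-free toy data, so the fit should recover the line exactly.
    x = np.array([0.0, 1.0, 2.0, 3.0])
    y = 1.0 + 2.0 * x
    intercept, slope = fit_univariate_ols(x, y)
    print(f"intercept={intercept:.3f}, slope={slope:.3f}")
```

Running the file directly executes the example, while importing the module elsewhere skips it.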
The front end will eventually contain case studies I picked out of personal interest. They will all be done using my own package, which is a neat way to work on both the front and back end at once!
I also plan on adding some kind of TypeScript front-end GUI/chart displayer, mostly to get some practice using JavaScript/TypeScript.
There is actually already a univariate regression implemented in TypeScript. I wrote it before realizing that there weren't many good vectorized math packages for TypeScript on Node.js (aside from TensorFlow, which came with its own suite of problems), and it's really not meant for that anyway, but it has given me a solid foundation so far.
Perhaps I will also add some SQLite for database operations. Although we would just be moving CSVs around inside folders, it would be a proof of concept.
There are a number of models and algorithms I'm interested in implementing as I go along. Here is a short list of what I am aiming for in the near future:
- k-means
- PCA
- negative binomial regression
- Markov Models
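
As a rough idea of the from-scratch style these would follow, here is a minimal sketch of k-means (Lloyd's algorithm) using only numpy. This is not the planned implementation; the function name `kmeans`, its parameters, and the random initialization scheme are all assumptions made for illustration.

```python
import numpy as np


def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; a hypothetical sketch, not the package's API."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumed initialization: pick k distinct rows of X as starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    def assign(centers):
        # Squared Euclidean distances, shape (n_samples, k), via broadcasting.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return dists.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centroids)
        # Recompute each centroid; keep the old one if its cluster is empty.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    # Final assignment so the labels match the returned centroids.
    return centroids, assign(centroids)


if __name__ == "__main__":
    X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
    centroids, labels = kmeans(X, k=2)
    print(centroids)
    print(labels)
```

The distance step relies on broadcasting rather than a Python loop over samples, which is the kind of vectorization the first goal above is about.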