HAleaker/NLP-Corpus-Analysis

NLP Corpus Analysis (alpha stage)

This Docker image bundles tools such as spaCy, Textacy, and pyLDAvis to analyse a text corpus, such as the collection of all published documents or any other CSV file with a text column.

It provides a range of Machine Learning and Natural Language Processing algorithms that can be run over a whole corpus or a subset of it.

The project aims to provide these methods over a REST API when feasible.

Current features

Compose a text transformation pipeline to prepare a corpus

Upload a CSV file, then click "Create a corpus" to access the pipeline composition page.
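As a rough illustration of what composing a text transformation pipeline means, the sketch below chains a few cleaning steps into one callable. It is not the application's own code: the step functions (`lowercase`, `strip_punctuation`, `collapse_whitespace`) and the `compose` helper are hypothetical names chosen for this example, and the steps the application actually offers may differ.

```python
import re
from functools import reduce


def lowercase(text):
    """Convert the text to lower case."""
    return text.lower()


def strip_punctuation(text):
    """Remove every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", "", text)


def collapse_whitespace(text):
    """Replace runs of whitespace with one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def compose(*steps):
    """Return a function that applies the given steps from left to right."""
    def run(text):
        return reduce(lambda acc, step: step(acc), steps, text)
    return run


# Hypothetical pipeline: the order of the steps is the order of execution.
pipeline = compose(lowercase, strip_punctuation, collapse_whitespace)

documents = ["  Hello,   World! ", "Topic models ARE fun."]
cleaned = [pipeline(doc) for doc in documents]
```

Keeping each step a plain function from string to string means steps can be reordered or dropped without touching the others.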

Create and visualise topic models via pyLDAvis.

Topic modelling is used to discover the themes that run through a corpus. In machine learning and NLP, a topic model is a statistical model that identifies abstract "topics" in a document collection.
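To make the idea of a topic model concrete, here is a minimal sketch of Latent Dirichlet Allocation fitted by collapsed Gibbs sampling, written in plain Python. It is an illustration only and not how this project builds its models; the function name `fit_lda`, its parameters, and the toy documents are all assumptions made for the example.

```python
import random
from collections import Counter


def fit_lda(docs, num_topics=2, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit a toy LDA model to tokenised documents with collapsed Gibbs sampling.

    Returns (theta, top_words): theta[d][k] is the estimated share of topic k
    in document d, and top_words[k] lists up to three frequent words of topic k.
    """
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})

    doc_topic = [[0] * num_topics for _ in docs]          # topic counts per document
    topic_word = [Counter() for _ in range(num_topics)]   # word counts per topic
    topic_total = [0] * num_topics                        # total words per topic
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        topics = []
        for word in doc:
            z = rng.randrange(num_topics)
            topics.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word] += 1
            topic_total[z] += 1
        assignments.append(topics)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][word] -= 1
                topic_total[z] -= 1
                # Resample its topic in proportion to the conditional probability.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][word] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][word] += 1
                topic_total[z] += 1

    theta = [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    top_words = [
        [word for word, count in counts.most_common(3) if count > 0]
        for counts in topic_word
    ]
    return theta, top_words


# Toy corpus (made up for this example), already tokenised.
toy_docs = [
    ["budget", "tax", "finance", "tax"],
    ["river", "forest", "wildlife", "river"],
    ["tax", "finance", "budget"],
    ["forest", "wildlife", "river"],
]
theta, top_words = fit_lda(toy_docs, num_topics=2)
```

Each row of `theta` is a probability distribution over topics for one document, which is the kind of document-topic structure that a tool such as pyLDAvis visualises.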

Video demonstration

LDA visualisation example

How to run:

docker-compose build
docker-compose up -d

This builds the image and starts the containers in the background. Startup takes some time; once it has finished, the application server is available at localhost:8181.
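Because the server is not reachable immediately, a small polling helper can tell you when it is ready. The sketch below is a hypothetical convenience script, not part of this repository; it assumes the server answers plain HTTP requests on the root path of localhost:8181, and the function name `wait_for_server` and its parameters are made up for this example.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url, timeout=120.0, interval=2.0):
    """Poll `url` until the server responds or `timeout` seconds have passed.

    Returns True as soon as any HTTP response arrives (including an error
    status, which still proves the server is listening), otherwise False.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval):
                return True
        except urllib.error.HTTPError:
            # The server replied with a 4xx/5xx status, so it is up.
            return True
        except (urllib.error.URLError, OSError):
            # Connection refused or timed out: wait and try again.
            time.sleep(interval)
    return False


# Example usage (assumes the root path responds over HTTP):
#     if wait_for_server("http://localhost:8181/"):
#         print("Application server is up")
```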

Corpus Data

To produce the latest dataset, visit the global catalogue, select "See all results", and choose "download csv". Once the CSV file is downloaded, you can pass it to this application to be analysed. Make sure the "document text" to be analysed is in the first column; all other columns are treated as metadata.

For testing, you can download a large, already prepared corpus:
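The expected column layout can be checked before uploading. The following sketch reads a CSV with Python's standard `csv` module and separates the first column (document text) from the remaining metadata columns. The helper name `split_corpus_rows`, the sample data, and the assumption that the file starts with a header row are all illustrative; adjust `has_header` if your file has none.

```python
import csv
import io


def split_corpus_rows(csv_file, has_header=True):
    """Split CSV rows into document text (first column) and metadata (the rest).

    `csv_file` is any open text file or file-like object.
    Returns (header, texts, metadata); header is None when has_header is False.
    """
    reader = csv.reader(csv_file)
    header = next(reader, None) if has_header else None
    texts, metadata = [], []
    for row in reader:
        if not row:
            continue  # skip blank lines
        texts.append(row[0])
        metadata.append(row[1:])
    return header, texts, metadata


# Made-up sample data standing in for a downloaded CSV file.
sample = io.StringIO(
    "text,title,year\n"
    '"First document, with a comma.",Doc A,2019\n'
    "Second document.,Doc B,2020\n"
)
header, texts, metadata = split_corpus_rows(sample)
```

For a real file, open it with `open(path, newline="", encoding="utf-8")` and pass the file object to the same function.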

curl -L -o data.csv "https://www.dropbox.com/s/sihmoc4wwpl0kr2/data_all.csv?dl=1"
