Hi! Thanks for your interest in contributing to NLTK. :-) You'll be joining a long list of contributors. In this document, we'll try to summarize everything that you need to know to do a good job.
NLTK is in maintenance mode. We welcome bugfixes. We can consider minor enhancements if they are clearly documented in an NLTK issue and are supported by a team member who is willing to review a PR. (You are welcome to make a case for a major enhancement, but please note we have limited capacity to deal with it. Please enlist an NLTK team member before doing substantial coding.)
We use GitHub to host our code repositories and issues. The NLTK organization on GitHub has many repositories, so that we can better manage issues and development. The most important are:
- nltk/nltk, the main repository with code related to the library;
- nltk/nltk_data, repository with data related to corpora, taggers and other useful data that are not shipped by default with the library, which can be downloaded by nltk.downloader;
- nltk/nltk.github.com, NLTK website with information about the library, documentation, link for downloading NLTK Book etc.;
- nltk/nltk_book, source code for the NLTK Book.
NLTK consists of the functionality that the Python/NLP community is motivated to contribute. Some priority areas for development are listed in the NLTK Wiki.
We use Git as our version control system, so the best way to contribute is to learn how to use it and put your changes in a Git repository. There's plenty of documentation about Git -- you can start with the Pro Git book.
To set up your local development environment for contributing to the main repository nltk/nltk:
- Fork the nltk/nltk repository on GitHub to your account;
- Clone your forked repository locally (git clone https://github.com/<your-github-username>/nltk.git);
- Run cd nltk to get to the root directory of the nltk code base;
- Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install NLTK in editable mode with dependencies:
pip install -e .
pip install -r pip-req.txt
- Install the pre-commit hooks:
pip install pre-commit
pre-commit install
- Install the code formatters and linter used by the pre-commit hooks:
pip install black isort ruff pyupgrade
- Download the datasets for running tests (python -m nltk.downloader all);
- Create a remote link from your local repository to the upstream nltk/nltk on GitHub (git remote add upstream https://github.com/nltk/nltk.git) -- you will need to use this upstream link when updating your local repository with all the latest contributions.
NLTK uses pre-commit to run code quality checks before each commit. The hooks are configured in .pre-commit-config.yaml and include:
- pre-commit-hooks -- trailing whitespace, end-of-file fixer, YAML check
- pyupgrade -- upgrade syntax to Python 3.10+
- black -- code formatting
- isort -- import sorting
- ruff -- fast Python linter with auto-fix
You can run all hooks manually with:
pre-commit run --all-files
Or run the tools individually:
isort nltk/path/to/file.py
black nltk/path/to/file.py
ruff check nltk/path/to/file.py
We use gitflow to manage our branches.
Summary of our git branching model:
- Go to the develop branch (git checkout develop);
- Get all the latest work from the upstream nltk/nltk repository (git pull upstream develop);
- Create a new branch off of develop with a descriptive name (for example: feature/portuguese-sentiment-analysis, hotfix/bug-on-downloader). You can do it by switching to the develop branch (git checkout develop) and then creating a new branch (git checkout -b name-of-the-new-branch);
- Do many small commits on that branch locally (git add files-changed, git commit -m "Add some change");
- Run the tests to make sure nothing breaks (pytest nltk/test or tox -e py313 if you are on Python 3.13);
- Add your name to the AUTHORS.md file as a contributor;
- Push to your fork on GitHub (using the same name as your local branch: git push origin branch-name);
- Create a pull request using the GitHub Web interface (asking us to pull the changes from your new branch and add them to our develop branch);
- Wait for comments.
- Write helpful commit messages.
- Anything in the develop branch should be deployable (no failing tests).
- Never use git add .: it can add unwanted files;
- Avoid using git commit -a unless you know what you're doing;
- Check every change with git diff before adding it to the index (staging area) and with git diff --cached before committing;
- Make sure you add your name to our list of contributors;
- If you have push access to the main repository, please do not commit directly to develop: your access should be used only to accept pull requests; if you want to make a new feature, you should use the same process as other developers so your code will be reviewed.
- See RELEASE-HOWTO.txt to see everything you need before creating a new NLTK release.
- Use PEP8;
- Write tests for your new features (please see "Tests" topic below);
- Always remember that commented code is dead code;
- Name identifiers (variables, classes, functions, module names) with readable names (x is always wrong);
- When manipulating strings, we prefer either f-string formatting (f'{a} = {b}') or new-style formatting ('{} = {}'.format(a, b)), instead of the old-style formatting ('%s = %s' % (a, b));
- All #TODO comments should be turned into issues (use our GitHub issue system);
- Run all tests before pushing (just execute tox) so you will know if your changes broke something;
See also our developer's guide.
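As a small illustration of the string formatting guideline above, the sketch below builds the same string in all three styles. The variable names and values are hypothetical and chosen only for the example.

```python
# Hypothetical values, used only to illustrate the three formatting styles.
a, b = "answer", 42

# Preferred: f-string formatting.
preferred_fstring = f"{a} = {b}"

# Also acceptable: new-style formatting with str.format().
preferred_format = "{} = {}".format(a, b)

# Discouraged: old-style %-formatting.
discouraged_percent = "%s = %s" % (a, b)

print(preferred_fstring)
print(preferred_format)
print(discouraged_percent)
```

All three produce the same text; the preference is about readability, not behavior.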
You should write tests for every feature you add or bug you fix in the code. Having automated tests for every line of our code lets us make big changes without worries: there will always be tests to verify whether the changes introduced bugs or removed features. If we don't have tests we will be blind and every change will come with some fear of possibly breaking something.
For a better design of your code, we recommend using a technique called test-driven development, where you write your tests before writing the actual code that implements the desired feature.
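A minimal sketch of the test-first workflow, assuming a hypothetical helper called count_vowels (it is not part of NLTK): the test class is written first and describes the desired behavior, and the implementation is added afterwards to make the tests pass.

```python
import unittest


class TestCountVowels(unittest.TestCase):
    """Step 1: written first, before count_vowels exists."""

    def test_counts_lowercase_vowels(self):
        self.assertEqual(count_vowels("banana"), 3)

    def test_is_case_insensitive(self):
        self.assertEqual(count_vowels("AEIOU"), 5)

    def test_empty_string_has_no_vowels(self):
        self.assertEqual(count_vowels(""), 0)


def count_vowels(text):
    """Step 2: the simplest implementation that makes the tests pass.

    count_vowels is a hypothetical example function, not an NLTK API.
    """
    return sum(1 for char in text.lower() if char in "aeiou")


# Run the tests programmatically so the example works as a plain script.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestCountVowels)
result = unittest.TextTestRunner(verbosity=1).run(suite)
```

In a real contribution the test class would live under nltk/test/unit and the function in the module being changed; they are kept in one file here only to keep the sketch self-contained.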
You can use pytest to run your tests, no matter which type of test it is:
cd nltk/test
pytest util.doctest # doctest
pytest unit/translate/test_nist.py # unittest
pytest # all tests
If your PR only touches a single module, you can run just the relevant test file directly with python -m unittest without needing pytest:
# Run a specific test file
python -m unittest nltk.test.unit.test_tokenize
# Run a specific test class
python -m unittest nltk.test.unit.test_tokenize.TestTreebankWordDetokenizer
# Run a specific test method
python -m unittest nltk.test.unit.test_tokenize.TestTreebankWordDetokenizer.test_contractions
If your PR touches a module that has doctests (inline >>> examples in docstrings), you can run just those doctests with python -m doctest:
# Run doctests for a single module
python -m doctest nltk/metrics/distance.py
# Run with verbose output to see each test
python -m doctest -v nltk/metrics/distance.py
# Run a specific doctest file from the test suite
python -m doctest nltk/test/tokenize.doctest
These are faster than running the full test suite and useful for quick iteration during development.
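For readers new to doctests, here is a small sketch of what an inline >>> example looks like and how the standard doctest module checks it. The function adjacent_pairs is hypothetical and exists only for this illustration; it is not an NLTK function.

```python
import doctest


def adjacent_pairs(tokens):
    """Return each pair of neighbouring tokens as a list of tuples.

    adjacent_pairs is a hypothetical example function, not an NLTK API.

    >>> adjacent_pairs(["a", "b", "c"])
    [('a', 'b'), ('b', 'c')]
    >>> adjacent_pairs([])
    []
    """
    return list(zip(tokens, tokens[1:]))


# Collect the examples from the docstring and run them, roughly what
# `python -m doctest` does for every docstring in a module.
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(verbose=False)
for test in finder.find(
    adjacent_pairs, name="adjacent_pairs", globs={"adjacent_pairs": adjacent_pairs}
):
    runner.run(test)

results = runner.summarize(verbose=False)
print(f"{results.attempted} examples run, {results.failed} failed")
```

Each >>> line is executed and its output is compared with the text that follows it, so the docstring serves as documentation and as a test.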
NLTK uses GitHub Actions for continuous integration. See GitHub's documentation on GitHub Actions for details.
The .github/workflows/ci.yml file configures the CI:
- on: section
  - ensures that this CI is run on code pushes, pull requests, or through the GitHub website via workflow_dispatch.
- The pre-commit job
  - performs these steps:
    - Downloads the nltk source code.
    - Runs pre-commit on all files in the repository (black, isort, ruff, pyupgrade).
    - Fails if any hooks performed a change.
- The minimal_download_test job
  - verifies that nltk.download() works on all platforms (ubuntu, macos, windows).
- The test job
  - tests against supported Python versions (3.10, 3.11, 3.12, 3.13, 3.14).
  - tests on ubuntu-latest, macos-latest, and windows-latest.
  - performs these steps:
    - Downloads the nltk source code.
    - Sets up Python using whatever version is being checked in the current execution.
    - Installs dependencies via pip install -r pip-req.txt.
    - Downloads nltk_data.
    - Runs pytest --numprocesses auto -rsx --doctest-modules nltk.
Using pytest directly:
# Run all tests
pytest nltk/test
# Run a specific test file
pytest nltk/test/unit/test_tokenize.py
# Run tests in parallel
pip install pytest-xdist
pytest --numprocesses auto nltk/test
Using tox (to test against a specific Python version):
pip install tox
tox -e py313 # for Python 3.13
NLTK supports Python 3.10, 3.11, 3.12, 3.13, and 3.14.
See python_requires in setup.py.
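If you want a quick check that your local interpreter falls in the supported range before running the tests, something like the following sketch could be used. The names SUPPORTED_MINORS and is_supported are hypothetical, and this is not how setup.py enforces the requirement; it only mirrors the version list stated above.

```python
import sys

# Minor versions 10 through 14, mirroring the supported list above (assumption:
# the list in this document is current; setup.py remains the source of truth).
SUPPORTED_MINORS = range(10, 15)


def is_supported(version_info=sys.version_info):
    """Return True if version_info is a Python 3.10 to 3.14 interpreter."""
    major, minor = version_info[0], version_info[1]
    return major == 3 and minor in SUPPORTED_MINORS


if not is_supported():
    print("Warning: this Python version is outside the range listed above.")
```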
We have three mailing lists on Google Groups:
- nltk, for announcements only;
- nltk-users, for general discussion and user questions;
- nltk-dev, for people interested in NLTK development.
Please feel free to contact us through the nltk-dev mailing list if you have any questions or suggestions. Every contribution is very welcome!
Happy hacking! (;