Replace chardet with charset-normalizer as preferred encoding detection #590

Open
imsurat wants to merge 1 commit into Bachmann1234:main from imsurat:make-chardet-optional

imsurat commented Mar 28, 2026

Summary

  • charset-normalizer is now the preferred encoding detection library (same .detect() API as chardet)
  • chardet is supported as a fallback if charset-normalizer is not installed
  • If neither is available, detection falls back to utf-8 with a helpful warning message
  • chardet is available as an optional extra (pip install diff-cover[chardet])

Motivation

  • The chardet 7.0 relicensing dispute (chardet/chardet#327) creates legal uncertainty for downstream users
  • charset-normalizer is MIT-licensed, actively maintained, and already the preferred library for requests
  • This makes chardet fully optional while maintaining the same encoding detection capability

Changes

  • diff_cover/snippets.py: Try charset_normalizer.detect() first, fall back to chardet.detect(), then utf-8 with warning
  • pyproject.toml: charset-normalizer >= 2.0.0 as main dependency, chardet as optional extra
  • poetry.lock: Updated to reflect dependency change
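The fallback order described above can be sketched roughly as follows. This is a minimal illustration, not the actual code in diff_cover/snippets.py: the helper names `_load_detector` and `guess_encoding` and the warning text are hypothetical and chosen only for this example.

```python
import warnings


def _load_detector():
    """Return a detect(bytes) callable, or None if neither library is installed.

    Hypothetical helper: the real snippets.py may structure this differently.
    """
    try:
        from charset_normalizer import detect  # preferred library
        return detect
    except ImportError:
        pass
    try:
        from chardet import detect  # optional fallback
        return detect
    except ImportError:
        return None


def guess_encoding(raw: bytes, default: str = "utf-8") -> str:
    """Guess the encoding of raw bytes, falling back to `default`."""
    detect = _load_detector()
    if detect is None:
        # Neither library is importable: warn once per call site and assume utf-8.
        warnings.warn(
            "Neither charset-normalizer nor chardet is installed; "
            "assuming utf-8. Install one of them for encoding detection.",
            stacklevel=2,
        )
        return default
    result = detect(raw)
    # Both libraries may report encoding=None when detection fails.
    return result.get("encoding") or default
```

Because both libraries expose a `detect()` function that returns a dict with an `encoding` key, the calling code does not need to know which one was imported.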

Known behavior difference

test_latin_one_undeclared fails when charset-normalizer is used instead of chardet, because charset-normalizer detects the short latin-1 test string as CP932 (Japanese) rather than latin-1. This is a known limitation with very short, ambiguous text samples; chardet happens to get this one right.

This is exactly the edge case mentioned in #313 where a --src-encoding CLI option would provide a better solution than guessing. The test could be made conditional on which library is installed, or the CLI encoding option could be added as a follow-up.

For the vast majority of real source files, which are much longer than the 32-byte test fixture, both libraries should produce the same result.
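One way to make the test conditional on the installed library is sketched below. It assumes the same preference order as the change itself (charset_normalizer first, then chardet); the helper name `active_detector` is hypothetical and not part of diff-cover.

```python
import importlib.util


def active_detector() -> str:
    """Return the name of the detection library that would be picked up.

    Mirrors the preference order of the PR: charset_normalizer, then chardet.
    Returns "none" if neither is importable.
    """
    for module_name in ("charset_normalizer", "chardet"):
        if importlib.util.find_spec(module_name) is not None:
            return module_name
    return "none"


# With pytest, the latin-1 test could then be skipped unless chardet is in use:
#
#   @pytest.mark.skipif(
#       active_detector() != "chardet",
#       reason="short latin-1 sample is only detected reliably by chardet",
#   )
#   def test_latin_one_undeclared(...): ...
```

This keeps the suite green under either library while leaving the known limitation visible in the skip reason.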

Addresses

Closes #313

Test plan

  • Verified charset_normalizer.detect() returns same shape as chardet.detect()
  • 244/245 tests pass (1 known difference on short ambiguous latin-1 text)
  • Linter (ruff) passes
  • pyproject.toml valid (no duplicate sections)
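The "same shape" check in the test plan could look something like the sketch below. It assumes that both `detect()` functions return a dict with `encoding`, `confidence`, and `language` keys; the helper name `has_detect_shape` is hypothetical.

```python
# Keys assumed to be present in the dict returned by either library's detect().
EXPECTED_KEYS = {"encoding", "confidence", "language"}


def has_detect_shape(result) -> bool:
    """Return True if `result` looks like a chardet-style detect() result."""
    return isinstance(result, dict) and EXPECTED_KEYS <= result.keys()


# Example use against whichever libraries are installed:
#
#   import charset_normalizer, chardet
#   sample = "some source text".encode("utf-8")
#   assert has_detect_shape(charset_normalizer.detect(sample))
#   assert has_detect_shape(chardet.detect(sample))
```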

imsurat marked this pull request as draft March 28, 2026 01:30
charset-normalizer is now the default encoding detection library.
chardet is supported as a fallback if charset-normalizer is not installed.
If neither is available, falls back to utf-8 with a helpful warning.

Motivation:
- chardet 7.0 relicensing dispute (chardet/chardet#327) creates
  legal uncertainty for downstream users
- charset-normalizer has the same API, is MIT-licensed, actively
  maintained, and already preferred by requests
- Makes chardet fully optional — available as an extra for users
  who prefer it

Changes:
- snippets.py: try charset_normalizer first, fall back to chardet,
  then utf-8 with warning if neither installed
- pyproject.toml: charset-normalizer as main dep, chardet as optional extra

Closes Bachmann1234#313
imsurat force-pushed the make-chardet-optional branch from 06dea13 to 22d6bd2 March 28, 2026 01:42
imsurat marked this pull request as ready for review March 28, 2026 01:45
Development

Successfully merging this pull request may close these issues.

Make chardet a optional dependency
