Replace chardet with charset-normalizer as preferred encoding detection by imsurat · Pull Request #590 · Bachmann1234/diff_cover

imsurat · 2026-03-28T01:29:06Z

Summary

charset-normalizer is now the preferred encoding detection library (same .detect() API as chardet)
chardet is supported as a fallback if charset-normalizer is not installed
If neither is available, falls back to utf-8 with a helpful warning message
chardet available as an optional extra (pip install diff-cover[chardet])

Motivation

chardet 7.0 relicensing dispute (chardet/chardet#327) creates legal uncertainty for downstream users
charset-normalizer is MIT-licensed, actively maintained, and already the preferred library for requests
This makes chardet fully optional while maintaining the same encoding detection capability

Changes

diff_cover/snippets.py: Try charset_normalizer.detect() first, fall back to chardet.detect(), then utf-8 with warning
pyproject.toml: charset-normalizer >= 2.0.0 as main dependency, chardet as optional extra
poetry.lock: Updated to reflect dependency change

Known behavior difference

test_latin_one_undeclared will fail when running with charset-normalizer instead of chardet. This is because charset-normalizer detects the short latin-1 test string as CP932 (Japanese) rather than latin-1 — a known limitation with very short ambiguous text samples. chardet happens to get this one right.

This is exactly the edge case mentioned in #313 where a --src-encoding CLI option would provide a better solution than guessing. The test could be made conditional on which library is installed, or the CLI encoding option could be added as a follow-up.
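
To make the follow-up idea concrete, here is one way an explicit encoding override could take precedence over detection. This is a sketch under assumptions: diff_cover does not have a `--src-encoding` option as of this PR, and `build_parser`, `decode_source`, and the `guess` callback are hypothetical names used only for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser carrying only the proposed option."""
    parser = argparse.ArgumentParser(prog="diff-cover-sketch")
    parser.add_argument(
        "--src-encoding",
        default=None,
        help="Encoding of source files; skips detection when given.",
    )
    return parser


def decode_source(raw: bytes, src_encoding=None, guess=lambda _raw: "utf-8") -> str:
    """Decode raw bytes, preferring an explicit encoding over a guess.

    `guess` stands in for whatever detection function is in use.
    """
    encoding = src_encoding or guess(raw)
    # errors="replace" keeps snippet rendering from crashing on bad bytes.
    return raw.decode(encoding, errors="replace")


args = build_parser().parse_args(["--src-encoding", "latin-1"])
text = decode_source(b"caf\xe9", args.src_encoding)
```

With an explicit encoding supplied, short ambiguous inputs such as the latin-1 fixture never reach the detector at all.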

For the vast majority of real source files (which are much longer than the 32-byte test fixture), both libraries produce identical results.

Addresses

Closes #313

Test plan

Verified charset_normalizer.detect() returns same shape as chardet.detect()
244/245 tests pass (1 known difference on short ambiguous latin-1 text)
Linter (ruff) passes
pyproject.toml valid (no duplicate sections)
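
The "same shape" check in the test plan could be reproduced along these lines. This is an illustrative sketch: `has_detect_shape` and `check_installed` are hypothetical helpers, and the expected keys are the three that both libraries' `detect()` results are documented to carry.

```python
import importlib

# Keys present in the dict returned by detect() in both libraries.
EXPECTED_KEYS = {"encoding", "confidence", "language"}


def has_detect_shape(result) -> bool:
    """True if `result` looks like a chardet-style detect() result."""
    return isinstance(result, dict) and EXPECTED_KEYS <= result.keys()


def check_installed(sample: bytes = b"plain ascii sample text") -> dict:
    """Run the shape check against whichever libraries are importable."""
    report = {}
    for name in ("charset_normalizer", "chardet"):
        try:
            module = importlib.import_module(name)
        except ImportError:
            continue  # Library not installed; nothing to check.
        report[name] = has_detect_shape(module.detect(sample))
    return report
```

Calling `check_installed()` returns a mapping from library name to whether its result passed; libraries that are not installed are simply absent from the mapping.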

charset-normalizer is now the default encoding detection library. chardet is supported as a fallback if charset-normalizer is not installed. If neither is available, falls back to utf-8 with a helpful warning.

Motivation:
- chardet 7.0 relicensing dispute (chardet/chardet#327) creates legal uncertainty for downstream users
- charset-normalizer has the same API, is MIT-licensed, actively maintained, and already preferred by requests
- Makes chardet fully optional, available as an extra for users who prefer it

Changes:
- snippets.py: try charset_normalizer first, fall back to chardet, then utf-8 with warning if neither installed
- pyproject.toml: charset-normalizer as main dep, chardet as optional extra

Closes Bachmann1234#313

imsurat marked this pull request as draft March 28, 2026 01:30

imsurat force-pushed the make-chardet-optional branch from 06dea13 to 22d6bd2 Compare March 28, 2026 01:42

imsurat marked this pull request as ready for review March 28, 2026 01:45

imsurat wants to merge 1 commit into Bachmann1234:main from imsurat:make-chardet-optional
