Update chardet to 7.4.0.post1 by pyup-bot · Pull Request #732 · joaogarciadelima/libpythonpro

pyup-bot · 2026-03-27T01:07:40Z

This PR updates chardet from 3.0.4 to 7.4.0.post1.

Changelog

7.4.0

-------------------

**Performance:**

- Switched to dense zlib-compressed model format (v2): models are now
stored as contiguous ``memoryview`` slices of a single decompressed
blob, eliminating per-model ``struct.unpack`` overhead. Cold start
(import + first detect) dropped from ~75ms to ~13ms with mypyc.
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude,
`354 <https://github.com/chardet/chardet/pull/354>`_)
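
The idea of serving models as ``memoryview`` slices of one decompressed blob can be sketched with the standard library alone. The fixed-size layout, the model names, and the ``load_models`` helper below are assumptions made for illustration; they are not chardet's actual v2 format.

```python
import zlib

# Hypothetical layout (an assumption, not chardet's real format):
# every model is a fixed-size table, and tables are packed back to back.
MODEL_SIZE = 4
MODEL_NAMES = ["model_a", "model_b"]
raw = bytes([1, 2, 3, 4]) + bytes([5, 6, 7, 8])
compressed = zlib.compress(raw)


def load_models(blob: bytes, names: list, size: int) -> dict:
    """Decompress once, then hand out zero-copy slices of the result."""
    view = memoryview(zlib.decompress(blob))
    if len(view) != size * len(names):
        raise ValueError("blob length does not match the expected model count")
    # Slicing a memoryview does not copy the underlying bytes.
    return {name: view[i * size:(i + 1) * size] for i, name in enumerate(names)}


models = load_models(compressed, MODEL_NAMES, MODEL_SIZE)
```

Because each slice shares the decompressed buffer, no per-model unpacking or copying happens at load time.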

**Accuracy:**

- Accuracy improved from 98.6% to 99.3% (2499/2517 files) through
a combination of training and scoring improvements:

- Eliminated train/test data overlap by content-fingerprinting test
 suite articles and excluding them from training data
 (`351 <https://github.com/chardet/chardet/pull/351>`_)
- Added MADLAD-400 and Wikipedia as supplemental training sources to
 fill gaps left by exclusion filtering
 (`351 <https://github.com/chardet/chardet/pull/351>`_)
- Improved non-ASCII bigram scoring: high-byte bigrams are now
 preserved during training (instead of being crushed by global
 normalization), and weighted by per-bigram IDF so encoding-specific
 byte patterns contribute proportionally to how discriminative they
 are (`352 <https://github.com/chardet/chardet/pull/352>`_)
- Added encoding-aware substitution filtering: character substitutions
 during training now only apply for characters the target encoding
 cannot represent
- Increased training samples from 15K to 25K per language/encoding pair
 (`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude)
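
As a rough illustration of the per-bigram IDF weighting mentioned above, the sketch below computes a textbook smoothed IDF over byte bigrams. The formula, the helper names, and the two toy samples are assumptions; the changelog does not state which IDF variant chardet uses.

```python
import math
from collections import Counter


def bigrams(data: bytes) -> list:
    """Return every overlapping two-byte window in the input."""
    return [data[i:i + 2] for i in range(len(data) - 1)]


def idf_weights(samples: dict) -> dict:
    """Weight each bigram by log(N / document frequency) + 1."""
    n = len(samples)
    doc_freq = Counter()
    for sample in samples.values():
        doc_freq.update(set(bigrams(sample)))
    return {bg: math.log(n / df) + 1.0 for bg, df in doc_freq.items()}


# Toy training data: the ASCII bigram b"ab" occurs in both samples,
# while the high-byte bigrams are specific to one sample each.
samples = {"enc_a": b"ab\xe9\xe8", "enc_b": b"ab\xc3\xa9"}
weights = idf_weights(samples)
```

A bigram shared by every sample gets the minimum weight, while a bigram seen in only one sample gets a larger one, which is the sense in which rarer byte patterns are "more discriminative".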

**Bug Fixes:**

- Added dedicated structural analyzers for CP932, CP949, and
Big5-HKSCS: these superset encodings previously shared their base
encoding's byte-range analyzer, missing extended ranges unique to each
superset
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude,
`353 <https://github.com/chardet/chardet/pull/353>`_)

7.3.0

-------------------

**License:**

- **0BSD license** — the project license has been changed from MIT to
`0BSD <https://opensource.org/license/0bsd>`_, a maximally permissive
license with no attribution requirement. All prior 7.x releases
should also be considered 0BSD licensed as of this release.
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude)

**Features:**

- Added ``mime_type`` field to detection results — identifies file types
for both binary (via magic number matching) and text content. Returned
in all ``detect()``, ``detect_all()``, and ``UniversalDetector`` results.
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude,
`350 <https://github.com/chardet/chardet/pull/350>`_)
- New ``pipeline/magic.py`` module detects 40+ binary file formats
including images, audio/video, archives, documents, executables, and
fonts. ZIP-based formats (XLSX, DOCX, JAR, APK, EPUB, wheel,
OpenDocument) are distinguished by entry filenames.
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude,
`350 <https://github.com/chardet/chardet/pull/350>`_)
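
Magic-number matching and ZIP entry inspection can be sketched with the standard library as follows. The signature table, the ``sniff_mime`` and ``sniff_zip`` names, and the chosen entry filenames are illustrative assumptions and cover only a handful of formats; they do not reproduce ``pipeline/magic.py``.

```python
import io
import zipfile
from typing import Optional

# A few well-known file signatures (far fewer than the 40+ mentioned above).
MAGIC = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"%PDF-", "application/pdf"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
]

DOCX = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
XLSX = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"


def sniff_zip(data: bytes) -> str:
    """Distinguish ZIP-based formats by the entry names they contain."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            names = set(zf.namelist())
    except zipfile.BadZipFile:
        return "application/zip"
    if "word/document.xml" in names:
        return DOCX
    if "xl/workbook.xml" in names:
        return XLSX
    if "META-INF/MANIFEST.MF" in names:
        return "application/java-archive"
    return "application/zip"


def sniff_mime(data: bytes) -> Optional[str]:
    """Return a MIME type from leading magic bytes, or None if unknown."""
    for magic, mime in MAGIC:
        if data.startswith(magic):
            return mime
    if data.startswith(b"PK\x03\x04"):  # ZIP local file header
        return sniff_zip(data)
    return None


# Build a tiny in-memory archive that looks like a DOCX container.
_buf = io.BytesIO()
with zipfile.ZipFile(_buf, "w") as _zf:
    _zf.writestr("word/document.xml", "<w/>")
docx_like = _buf.getvalue()
```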

**Bug Fixes:**

- Fixed incorrect equivalence between UTF-16-LE and UTF-16-BE in
accuracy testing — these are distinct encodings with different byte
order, not interchangeable
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude)
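
A short standard-library check shows why the two byte orders cannot be treated as equivalent: the same text encodes to different bytes, and decoding with the wrong order silently produces different characters.

```python
text = "hi"
le = text.encode("utf-16-le")  # low byte first
be = text.encode("utf-16-be")  # high byte first

# Decoding little-endian bytes as big-endian does not fail here;
# it yields unrelated code points instead.
wrong = le.decode("utf-16-be")
```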

**Performance:**

- Added 4 new modules to mypyc compilation (orchestrator, confusion,
magic, ascii), bringing the total to 11 compiled modules
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude)
- Capped statistical scoring at 16 KB — bigram models converge quickly,
so large files no longer score the full 200 KB. Worst-case detection
time dropped from 62ms to 26ms with no accuracy loss.
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude)
- Replaced ``dataclasses.replace()`` with direct ``DetectionResult``
construction on hot paths, eliminating ~354k function calls per full
test suite run
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude)
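
The ``dataclasses.replace()`` change can be illustrated with a stand-in dataclass. The ``Result`` class and its fields are hypothetical and only approximate what a detection result might hold. Both forms produce equal objects; the direct constructor call simply skips the extra function call and field introspection that ``replace()`` performs.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Result:  # hypothetical stand-in for DetectionResult
    encoding: str
    confidence: float
    language: str


base = Result("utf-8", 0.5, "en")

# Convenient, but goes through dataclasses.replace() on every call.
via_replace = replace(base, confidence=0.9)

# Equivalent result built directly, as one might do on a hot path.
direct = Result(base.encoding, 0.9, base.language)
```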

**Build:**

- Added riscv64 to the mypyc wheel build matrix — prebuilt wheels are
now published for RISC-V Linux alongside existing architectures
(`Bruno Verachten <https://github.com/gounthar>`_,
`348 <https://github.com/chardet/chardet/pull/348>`_)

7.2.0

-------------------

**Features:**

- Added ``include_encodings`` and ``exclude_encodings`` parameters to
:func:`~chardet.detect`, :func:`~chardet.detect_all`, and
:class:`~chardet.UniversalDetector` — restrict or exclude specific
encodings from the candidate set, with corresponding
``-i``/``--include-encodings`` and ``-x``/``--exclude-encodings``
CLI flags
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude,
`343 <https://github.com/chardet/chardet/pull/343>`_)
- Added ``no_match_encoding`` (default ``"cp1252"``) and
``empty_input_encoding`` (default ``"utf-8"``) parameters — control
which encoding is returned when no candidate survives the pipeline or
the input is empty, with corresponding CLI flags
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude,
`343 <https://github.com/chardet/chardet/pull/343>`_)
- Added ``-l``/``--language`` flag to ``chardetect`` CLI — shows the
detected language (ISO 639-1 code and English name) alongside the encoding
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude,
`342 <https://github.com/chardet/chardet/pull/342>`_)
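
The behaviour described for the include/exclude and fallback parameters can be modelled in a few lines. This is a conceptual sketch only: ``filter_candidates`` and ``pick`` are hypothetical helpers, not chardet functions, and only the two default values (``"cp1252"`` and ``"utf-8"``) are taken from the changelog.

```python
from typing import Iterable, Optional


def filter_candidates(
    candidates: list,
    include: Optional[Iterable] = None,
    exclude: Optional[Iterable] = None,
) -> list:
    """Keep candidates that are included (if given) and not excluded."""
    include_set = {e.lower() for e in include} if include is not None else None
    exclude_set = {e.lower() for e in exclude} if exclude is not None else set()
    return [
        c for c in candidates
        if (include_set is None or c.lower() in include_set)
        and c.lower() not in exclude_set
    ]


def pick(
    data: bytes,
    survivors: list,
    no_match_encoding: str = "cp1252",
    empty_input_encoding: str = "utf-8",
) -> str:
    """Choose the first survivor, falling back as the changelog describes."""
    if not data:
        return empty_input_encoding
    return survivors[0] if survivors else no_match_encoding
```

For example, restricting ``["utf-8", "cp1252", "koi8-r"]`` to UTF-8 and KOI8-R while also excluding KOI8-R leaves only ``"utf-8"``, and an input for which nothing survives falls back to ``"cp1252"``.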

7.1.0

-------------------

**Features:**

- Added PEP 263 encoding declaration detection — ``# -*- coding: ... -*-``
and ``# coding=...`` declarations on lines 1–2 of Python source files are
now recognized with confidence 0.95
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude,
`249 <https://github.com/chardet/chardet/issues/249>`_)
- Added ``chardet.universaldetector`` backward-compatibility stub so that
``from chardet.universaldetector import UniversalDetector`` works with a
deprecation warning
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude,
`341 <https://github.com/chardet/chardet/issues/341>`_)
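
A minimal sketch of PEP 263 declaration detection, using the regular expression published in PEP 263 itself and applied to the first two lines only. The ``declared_encoding`` helper is a hypothetical name; how chardet implements this stage and assigns the 0.95 confidence is not shown here.

```python
import re
from typing import Optional

# Regular expression given in PEP 263, compiled for bytes input.
CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(source: bytes) -> Optional[str]:
    """Return the encoding declared on line 1 or 2, if any."""
    for line in source.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    return None


example = b"#!/usr/bin/env python\n# -*- coding: latin-1 -*-\nprint(1)\n"
```

A declaration on the third line or later is ignored, matching the "lines 1–2" rule.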

**Fixes:**

- Fixed false UTF-7 detection of ASCII text containing ``++`` or ``+word``
patterns
(`Dan Blanchard <https://github.com/dan-blanchard>`_,
`332 <https://github.com/chardet/chardet/issues/332>`_,
`335 <https://github.com/chardet/chardet/pull/335>`_)
- Fixed 0.5s startup cost on first ``detect()`` call — model norms are now
computed during loading instead of lazily iterating 21M entries
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude,
`333 <https://github.com/chardet/chardet/issues/333>`_,
`336 <https://github.com/chardet/chardet/pull/336>`_)
- Fixed undocumented encoding name changes between chardet 5.x and 7.0 —
``detect()`` now returns chardet 5.x-compatible names by default
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude,
`338 <https://github.com/chardet/chardet/pull/338>`_)
- Improved ISO-2022-JP family detection — recognizes ESC sequences for
ISO-2022-JP-2004 (JIS X 0213) and ISO-2022-JP-EXT (JIS X 0201 Kana)
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude)
- Fixed silent truncation of corrupt model data (``iter_unpack`` yielded
fewer tuples instead of raising)
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude)
- Fixed incorrect date in LICENSE
(`Dan Blanchard <https://github.com/dan-blanchard>`_)

**Performance:**

- 5.5x faster first-detect time (~0.42s → ~0.075s) by computing model
norms as a side-product of ``load_models()``
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude)
- ~40% faster model parsing via ``struct.iter_unpack`` for bulk entry
extraction (eliminates ~305K individual ``unpack`` calls)
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude)
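
Bulk parsing with ``struct.iter_unpack``, together with an explicit length check that guards against the silent-truncation problem listed under Fixes, might look like the sketch below. The three-byte entry layout and the ``parse_entries`` helper are assumptions for illustration, not chardet's model format.

```python
import struct

# Hypothetical entry layout: two bigram bytes followed by a one-byte weight.
ENTRY = struct.Struct("<BBB")


def parse_entries(payload: bytes, expected: int) -> list:
    """Unpack all entries in one pass, rejecting short or long payloads."""
    if len(payload) != expected * ENTRY.size:
        # Without this check, a payload cut at an entry boundary would
        # simply yield fewer tuples and the corruption would go unnoticed.
        raise ValueError(
            f"expected {expected * ENTRY.size} bytes, got {len(payload)}"
        )
    return list(ENTRY.iter_unpack(payload))


entries = parse_entries(bytes([1, 2, 3, 4, 5, 6]), expected=2)
```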

**New API parameters:**

- Added ``compat_names`` parameter (default ``True``) to
:func:`~chardet.detect`, :func:`~chardet.detect_all`, and
:class:`~chardet.UniversalDetector` — set to ``False`` to get raw Python
codec names instead of chardet 5.x/6.x compatible display names
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude)
- Added ``prefer_superset`` parameter (default ``False``) — remaps legacy
ISO/subset encodings to their modern Windows/CP superset equivalents
(e.g., ASCII → Windows-1252, ISO-8859-1 → Windows-1252).
**This will default to ``True`` in the next major version (8.0).**
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude)
- Deprecated ``should_rename_legacy`` in favor of ``prefer_superset`` —
a deprecation warning is emitted when used
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude)
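
The superset remapping and the deprecation path can be modelled conceptually as below. Everything here is hypothetical: ``finalize`` is not a chardet function, the table holds only the two pairs the changelog names, and treating ``should_rename_legacy`` as an alias for ``prefer_superset`` is an assumption about how the deprecated flag is honoured.

```python
import warnings

# Only the two mappings named in the changelog; the real table is larger.
SUPERSETS = {"ascii": "Windows-1252", "iso-8859-1": "Windows-1252"}


def finalize(encoding, prefer_superset=False, should_rename_legacy=None):
    """Optionally remap a detected encoding to its superset."""
    if should_rename_legacy is not None:
        warnings.warn(
            "should_rename_legacy is deprecated; use prefer_superset",
            DeprecationWarning,
            stacklevel=2,
        )
        # Assumption: the old flag behaves like the new one.
        prefer_superset = prefer_superset or should_rename_legacy
    if prefer_superset:
        return SUPERSETS.get(encoding.lower(), encoding)
    return encoding
```

Code that relies on getting ``ascii`` or ``ISO-8859-1`` back may want to pass ``prefer_superset=False`` explicitly, given the announced default change in 8.0.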

**Improvements:**

- Switched internal canonical encoding names to Python codec names
(e.g., ``"utf-8"`` instead of ``"UTF-8"``), with ``compat_names``
controlling the public output format.  See :doc:`usage` for the full
mapping table.
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude)
- Added ``lookup_encoding()`` to ``registry`` for case-insensitive
resolution of arbitrary encoding name input to canonical names
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude)
- Achieved 100% line coverage across all source modules (+31 tests)
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude)
- Updated benchmark numbers: 98.2% encoding accuracy, 95.2% language
accuracy on 2,510 test files
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude)
- Pinned test-data cloning to chardet release version tags for
reproducible builds
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude)
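
Case-insensitive resolution of an arbitrary encoding name to a Python codec name, as described for ``lookup_encoding()`` above, can be approximated with the standard ``codecs`` module. The ``canonical_name`` helper is a hypothetical stand-in; chardet's own registry may resolve names differently.

```python
import codecs
from typing import Optional


def canonical_name(name: str) -> Optional[str]:
    """Resolve any accepted spelling to the codec's canonical name."""
    try:
        # codecs.lookup() normalizes case and common aliases.
        return codecs.lookup(name.strip()).name
    except LookupError:
        return None
```

With this helper, ``"UTF-8"`` and ``"utf8"`` both resolve to ``"utf-8"``, and an unknown name resolves to ``None`` rather than raising.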

7.0.1

-------------------

**Fixes:**

- Fixed false UTF-7 detection of SHA-1 git hashes
(`Alex Rembish <https://github.com/rembish>`_,
`324 <https://github.com/chardet/chardet/pull/324>`_)
- Fixed ``_SINGLE_LANG_MAP`` missing aliases for single-language encoding
lookup (e.g., ``big5`` → ``big5hkscs``)
(`Dan Blanchard <https://github.com/dan-blanchard>`_)
- Fixed PyPy ``TypeError`` in UTF-7 codec handling
(`Dan Blanchard <https://github.com/dan-blanchard>`_)

**Improvements:**

- Retrained bigram models — 24 previously failing test cases now pass
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude)
- Updated language equivalences for mutual intelligibility (Slovak/Czech,
East Slavic + Bulgarian, Malay/Indonesian, Scandinavian languages)
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude)

7.0.0

-------------------

Ground-up, 0BSD-licensed rewrite of chardet
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude,
`322 <https://github.com/chardet/chardet/pull/322>`_). Same package name,
same public API — drop-in replacement for chardet 5.x/6.x.

**Highlights:**

- **0BSD license** (previous versions were LGPL)
- **96.8% accuracy** on 2,179 test files (+2.3pp vs chardet 6.0.0,
+7.7pp vs charset-normalizer)
- **41x faster** than chardet 6.0.0 with mypyc (**28x** pure Python),
**7.5x faster** than charset-normalizer
- **Language detection** for every result (90.5% accuracy across 49
languages)
- **99 encodings** across six eras (MODERN_WEB, LEGACY_ISO, LEGACY_MAC,
LEGACY_REGIONAL, DOS, MAINFRAME)
- **12-stage detection pipeline** — BOM, UTF-16/32 patterns, escape
sequences, binary detection, markup charset, ASCII, UTF-8 validation,
byte validity, CJK gating, structural probing, statistical scoring,
post-processing
- **Bigram frequency models** trained on CulturaX multilingual corpus
data for all supported language/encoding pairs
- **Optional mypyc compilation** — 1.49x additional speedup on CPython
- **Thread-safe** ``detect()`` and ``detect_all()`` with no measurable
overhead; scales on free-threaded Python 3.13t+
- **Negligible import memory** (96 B)
- **Zero runtime dependencies**

6.0.0.post1

-------------------------

- Fixed ``__version__`` not being set correctly in the package
(`Dan Blanchard <https://github.com/dan-blanchard>`_)

6.0.0

-------------------

**Features:**

- Unified single-byte charset detection with proper language-specific
bigram models for all single-byte encodings (replaces ``Latin1Prober``
and ``MacRomanProber`` heuristics)
(`Dan Blanchard <https://github.com/dan-blanchard>`_)
- 38 new languages: Arabic, Belarusian, Breton, Croatian, Czech, Danish,
Dutch, English, Esperanto, Estonian, Farsi, Finnish, French, German,
Icelandic, Indonesian, Irish, Italian, Kazakh, Latvian, Lithuanian,
Macedonian, Malay, Maltese, Norwegian, Polish, Portuguese, Romanian,
Scottish Gaelic, Serbian, Slovak, Slovene, Spanish, Swedish, Tajik,
Ukrainian, Vietnamese, Welsh
(`Dan Blanchard <https://github.com/dan-blanchard>`_)
- ``EncodingEra`` filtering via new ``encoding_era`` parameter
(`Dan Blanchard <https://github.com/dan-blanchard>`_)
- ``max_bytes`` and ``chunk_size`` parameters for ``detect()``,
``detect_all()``, and ``UniversalDetector``
(`Dan Blanchard <https://github.com/dan-blanchard>`_)
- ``-e``/``--encoding-era`` CLI flag
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude)
- EBCDIC detection (CP037, CP500)
(`Dan Blanchard <https://github.com/dan-blanchard>`_)
- Direct GB18030 support (replaces redundant GB2312 prober)
(`Dan Blanchard <https://github.com/dan-blanchard>`_)
- Binary file detection
(`Dan Blanchard <https://github.com/dan-blanchard>`_)
- Python 3.12, 3.13, and 3.14 support
(`Hugo van Kemenade <https://github.com/hugovk>`_,
`283 <https://github.com/chardet/chardet/pull/283>`_)
- GitHub Codespaces support
(`oxygen dioxide <https://github.com/oxygen-dioxide>`_,
`312 <https://github.com/chardet/chardet/pull/312>`_)

**Breaking changes:**

- Dropped Python 3.7, 3.8, and 3.9 (requires Python 3.10+)
- Removed ``Latin1Prober`` and ``MacRomanProber``
- Removed EUC-TW support
- Removed ``LanguageFilter.NONE``
- ``detect()`` default changed to ``encoding_era=EncodingEra.MODERN_WEB``

**Fixes:**

- Fixed CP949 state machine
(`nenw* <https://github.com/HelloWorld017>`_,
`268 <https://github.com/chardet/chardet/pull/268>`_)
- Fixed SJIS distribution analysis (second-byte range >= 0x80)
(`Kadir Can Ozden <https://github.com/bysiber>`_,
`315 <https://github.com/chardet/chardet/pull/315>`_)
- Fixed ``max_bytes`` not being passed to ``UniversalDetector``
(`Kadir Can Ozden <https://github.com/bysiber>`_,
`314 <https://github.com/chardet/chardet/pull/314>`_)
- Fixed UTF-16/32 detection for non-ASCII-heavy text
(`Dan Blanchard <https://github.com/dan-blanchard>`_)
- Fixed GB18030 ``char_len_table``
(`Dan Blanchard <https://github.com/dan-blanchard>`_)
- Fixed UTF-8 state machine
(`Dan Blanchard <https://github.com/dan-blanchard>`_)
- Fixed ``detect_all()`` returning inactive probers
(`Dan Blanchard <https://github.com/dan-blanchard>`_)
- Fixed early cutoff bug
(`Dan Blanchard <https://github.com/dan-blanchard>`_)
- Updated LGPLv2.1 license text for remote-only FSF address
(`Ben Beasley <https://github.com/musicinmybrain>`_,
`307 <https://github.com/chardet/chardet/pull/307>`_)

5.2.0

-------------------

- Added support for running the CLI via ``python -m chardet``
(`Dan Blanchard <https://github.com/dan-blanchard>`_)

5.1.0

-------------------

- Added ``should_rename_legacy`` argument to remap legacy encoding names
to modern equivalents
(`Dan Blanchard <https://github.com/dan-blanchard>`_,
`264 <https://github.com/chardet/chardet/pull/264>`_)
- Added MacRoman encoding prober
(`Elia Robyn Lake <https://github.com/rspeer>`_)
- Added ``--minimal`` flag to ``chardetect`` CLI
(`Dan Blanchard <https://github.com/dan-blanchard>`_,
`214 <https://github.com/chardet/chardet/pull/214>`_)
- Added type annotations and mypy CI
(`Jon Dufresne <https://github.com/jdufresne>`_,
`261 <https://github.com/chardet/chardet/pull/261>`_)
- Added support for Python 3.11
(`Hugo van Kemenade <https://github.com/hugovk>`_,
`274 <https://github.com/chardet/chardet/pull/274>`_)
- Added ISO-8859-15 capital letter sharp S handling
(`Simon Waldherr <https://github.com/SimonWaldherr>`_,
`222 <https://github.com/chardet/chardet/pull/222>`_)
- Clarified LGPL version in license trove classifier
(`Ben Beasley <https://github.com/musicinmybrain>`_,
`255 <https://github.com/chardet/chardet/pull/255>`_)
- Removed support for Python 3.6
(`Jon Dufresne <https://github.com/jdufresne>`_,
`260 <https://github.com/chardet/chardet/pull/260>`_)

5.0.0

-------------------

- Added Johab Korean prober
(`Dan Blanchard <https://github.com/dan-blanchard>`_,
`207 <https://github.com/chardet/chardet/pull/207>`_)
- Added UTF-16/32 BE/LE probers
(`Dan Blanchard <https://github.com/dan-blanchard>`_,
`206 <https://github.com/chardet/chardet/pull/206>`_)
- Added test data for Croatian, Czech, Hungarian, Polish, Slovak,
Slovene, Greek, Turkish
(`Dan Blanchard <https://github.com/dan-blanchard>`_)
- Improved XML tag filtering
(`Dan Blanchard <https://github.com/dan-blanchard>`_,
`208 <https://github.com/chardet/chardet/pull/208>`_)
- Made ``detect_all`` return child prober confidences
(`Dan Blanchard <https://github.com/dan-blanchard>`_,
`210 <https://github.com/chardet/chardet/pull/210>`_)
- Added support for Python 3.10
(`Hugo van Kemenade <https://github.com/hugovk>`_,
`232 <https://github.com/chardet/chardet/pull/232>`_)
- Slight performance increase
(`deedy5 <https://github.com/deedy5>`_,
`252 <https://github.com/chardet/chardet/pull/252>`_)
- Dropped Python 2.7, 3.4, 3.5 (requires Python 3.6+)

4.0.0

-------------------

- Added ``detect_all()`` function returning all candidate encodings
(`Damien <https://github.com/mdamien>`_,
`111 <https://github.com/chardet/chardet/pull/111>`_)
- Converted single-byte charset probers to nested dicts (performance)
(`Dan Blanchard <https://github.com/dan-blanchard>`_,
`121 <https://github.com/chardet/chardet/pull/121>`_)
- ``CharsetGroupProber`` now short-circuits on definite matches
(performance)
(`Dan Blanchard <https://github.com/dan-blanchard>`_,
`203 <https://github.com/chardet/chardet/pull/203>`_)
- Added ``language`` field to ``detect_all`` output
(`Dan Blanchard <https://github.com/dan-blanchard>`_)
- Switched from Travis to GitHub Actions
(`Dan Blanchard <https://github.com/dan-blanchard>`_,
`204 <https://github.com/chardet/chardet/pull/204>`_)
- Dropped Python 2.6, 3.4, 3.5

Links

PyPI: https://pypi.org/project/chardet
Changelog: https://data.safetycli.com/changelogs/chardet/

pyup-bot added 2 commits March 26, 2026 22:07

Update chardet from 3.0.4 to 7.4.0.post1

7f1cd45

Update chardet from 3.0.4 to 7.4.0.post1

8fcea76

pyup-bot mentioned this pull request Mar 27, 2026

Update chardet to 7.3.0 #730

Closed

pyup-bot wants to merge 2 commits into master from pyup-update-chardet-3.0.4-to-7.4.0.post1
