Update chardet to 7.4.0.post1 #732

Open
pyup-bot wants to merge 2 commits into master from pyup-update-chardet-3.0.4-to-7.4.0.post1
@pyup-bot
Collaborator

This PR updates chardet from 3.0.4 to 7.4.0.post1.

Changelog

7.4.0
-------------------

**Performance:**

- Switched to dense zlib-compressed model format (v2): models are now
stored as contiguous ``memoryview`` slices of a single decompressed
blob, eliminating per-model ``struct.unpack`` overhead. Cold start
(import + first detect) dropped from ~75ms to ~13ms with mypyc.
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude,
`354 <https://github.com/chardet/chardet/pull/354>`_)
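
The layout described above can be sketched with the standard library alone. This is a minimal illustration, not chardet's actual model format: the ``pack_models`` and ``load_models`` helpers, the index structure, and the sample payloads are all assumptions made for the example.

```python
import zlib


def pack_models(models: dict[str, bytes]) -> tuple[bytes, dict[str, tuple[int, int]]]:
    """Concatenate model payloads into one blob and compress it once."""
    index: dict[str, tuple[int, int]] = {}
    parts = []
    offset = 0
    for name, payload in models.items():
        index[name] = (offset, len(payload))  # (start, size) inside the blob
        parts.append(payload)
        offset += len(payload)
    return zlib.compress(b"".join(parts)), index


def load_models(compressed: bytes, index: dict[str, tuple[int, int]]) -> dict[str, memoryview]:
    """Decompress once, then hand out zero-copy slices of the same buffer."""
    blob = memoryview(zlib.decompress(compressed))
    return {name: blob[start:start + size] for name, (start, size) in index.items()}


# Hypothetical dense models: one weight byte per slot.
raw = {"latin": bytes(range(16)), "cyrillic": bytes(range(16, 48))}
compressed, index = pack_models(raw)
views = load_models(compressed, index)
```

Each entry in ``views`` is a slice over one shared decompressed buffer, so looking up a weight is plain indexing with no per-model ``struct.unpack`` step.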

**Accuracy:**

- Accuracy improved from 98.6% to 99.3% (2499/2517 files) through
  a combination of training and scoring improvements:

  - Eliminated train/test data overlap by content-fingerprinting test
    suite articles and excluding them from training data
    (`351 <https://github.com/chardet/chardet/pull/351>`_)
  - Added MADLAD-400 and Wikipedia as supplemental training sources to
    fill gaps left by exclusion filtering
    (`351 <https://github.com/chardet/chardet/pull/351>`_)
  - Improved non-ASCII bigram scoring: high-byte bigrams are now
    preserved during training (instead of being crushed by global
    normalization), and weighted by per-bigram IDF so encoding-specific
    byte patterns contribute proportionally to how discriminative they
    are (`352 <https://github.com/chardet/chardet/pull/352>`_)
  - Added encoding-aware substitution filtering: character substitutions
    during training now only apply for characters the target encoding
    cannot represent
  - Increased training samples from 15K to 25K per language/encoding pair
    (`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude)
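
As a rough illustration of the per-bigram IDF idea, the sketch below treats each encoding's training sample as one "document" and computes an inverse document frequency for every byte bigram. The ``bigram_idf`` helper, the tiny samples, and the choice of ``log(N / df)`` are assumptions for the example; the changelog does not specify chardet's exact formula.

```python
import math
from collections import Counter


def bigrams(data: bytes):
    """Yield consecutive byte pairs as (int, int) tuples."""
    return zip(data, data[1:])


def bigram_idf(samples_by_encoding: dict[str, bytes]) -> dict[tuple[int, int], float]:
    """IDF per bigram, counting in how many encodings' samples it occurs."""
    n = len(samples_by_encoding)
    document_frequency: Counter = Counter()
    for data in samples_by_encoding.values():
        document_frequency.update(set(bigrams(data)))
    return {bg: math.log(n / df) for bg, df in document_frequency.items()}


# Hypothetical one-line "corpora"; real training data would be far larger.
samples = {
    "cp1252": "café au lait".encode("cp1252"),
    "cp1251": "кофе au lait".encode("cp1251"),
    "cp1253": "καφές au lait".encode("cp1253"),
}
idf = bigram_idf(samples)
```

Bigrams shared by every sample (the ASCII pairs in ``au lait``) get weight zero, while a pair that appears in only one encoding's sample gets the maximum weight, which is the sense in which discriminative byte patterns count for more.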

**Bug Fixes:**

- Added dedicated structural analyzers for CP932, CP949, and
Big5-HKSCS: these superset encodings previously shared their base
encoding's byte-range analyzer, missing extended ranges unique to each
superset
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude,
`353 <https://github.com/chardet/chardet/pull/353>`_)
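
The superset relationship can be seen with Python's built-in codecs. The example below uses one byte pair from CP932's NEC extension row; the ``decodes_as`` helper is hypothetical and only checks decodability, which is much cruder than a structural analyzer.

```python
def decodes_as(data: bytes, codec: str) -> bool:
    """Return True if data decodes without error under the given codec."""
    try:
        data.decode(codec)
    except UnicodeDecodeError:
        return False
    return True


# 0x87 0x40 is CIRCLED DIGIT ONE (U+2460) in CP932; the byte pair is not
# assigned in Python's stricter shift_jis codec.
nec_circled_one = b"\x87\x40"

valid_in_superset = decodes_as(nec_circled_one, "cp932")
valid_in_base = decodes_as(nec_circled_one, "shift_jis")
```

A byte-range check written only for the base encoding would treat this pair as invalid, which is the kind of miss a dedicated superset analyzer is meant to avoid.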

7.3.0
-------------------

**License:**

- **0BSD license** — the project license has been changed from MIT to
`0BSD <https://opensource.org/license/0bsd>`_, a maximally permissive
license with no attribution requirement. All prior 7.x releases
should also be considered 0BSD licensed as of this release.
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude)

**Features:**

- Added ``mime_type`` field to detection results — identifies file types
for both binary (via magic number matching) and text content. Returned
in all ``detect()``, ``detect_all()``, and ``UniversalDetector`` results.
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude,
`350 <https://github.com/chardet/chardet/pull/350>`_)
- New ``pipeline/magic.py`` module detects 40+ binary file formats
including images, audio/video, archives, documents, executables, and
fonts. ZIP-based formats (XLSX, DOCX, JAR, APK, EPUB, wheel,
OpenDocument) are distinguished by entry filenames.
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude,
`350 <https://github.com/chardet/chardet/pull/350>`_)
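
A minimal sketch of magic-number matching, including the ZIP case where the container signature is identical and the entry names decide the type. The signature table, the marker prefixes, and the ``sniff_mime`` and ``make_zip`` helpers are illustrative assumptions, not the contents of ``pipeline/magic.py``.

```python
import io
import zipfile
from typing import Optional

# A handful of well-known leading signatures (illustrative, far from 40+).
_MAGIC = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"%PDF-", "application/pdf"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
]

# Entry-name prefixes used to tell ZIP-based formats apart (assumed markers).
_ZIP_MARKERS = [
    ("word/", "application/vnd.openxmlformats-officedocument.wordprocessingml.document"),
    ("xl/", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"),
    ("META-INF/MANIFEST.MF", "application/java-archive"),
]


def sniff_mime(data: bytes) -> Optional[str]:
    """Guess a MIME type from leading bytes, inspecting ZIP entry names."""
    for magic, mime in _MAGIC:
        if data.startswith(magic):
            return mime
    if data.startswith(b"PK\x03\x04"):
        try:
            names = zipfile.ZipFile(io.BytesIO(data)).namelist()
        except zipfile.BadZipFile:
            return "application/zip"
        for prefix, mime in _ZIP_MARKERS:
            if any(name.startswith(prefix) for name in names):
                return mime
        return "application/zip"
    return None


def make_zip(*names: str) -> bytes:
    """Build an in-memory ZIP with empty entries, for demonstration only."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, b"")
    return buf.getvalue()
```

For instance, ``sniff_mime(make_zip("xl/workbook.xml"))`` and ``sniff_mime(make_zip("notes.txt"))`` start with the same four bytes but resolve to different types.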

**Bug Fixes:**

- Fixed incorrect equivalence between UTF-16-LE and UTF-16-BE in
accuracy testing — these are distinct encodings with different byte
order, not interchangeable
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude)
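
A short standard-library demonstration of why the two byte orders cannot be treated as equivalent: the same text produces different bytes, and reading one as the other yields different characters rather than an error.

```python
text = "Hi"

le = text.encode("utf-16-le")  # low byte first
be = text.encode("utf-16-be")  # high byte first

# Decoding little-endian bytes as big-endian succeeds, but the code units
# are byte-swapped, so the result is unrelated text.
misread = le.decode("utf-16-be")
```

Because the misread succeeds silently, an accuracy test that accepts either answer would hide genuinely wrong detections.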

**Performance:**

- Added 4 new modules to mypyc compilation (orchestrator, confusion,
magic, ascii), bringing the total to 11 compiled modules
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude)
- Capped statistical scoring at 16 KB — bigram models converge quickly,
so large files no longer score the full 200 KB. Worst-case detection
time dropped from 62ms to 26ms with no accuracy loss.
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude)
- Replaced ``dataclasses.replace()`` with direct ``DetectionResult``
construction on hot paths, eliminating ~354k function calls per full
test suite run
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude)
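
The last item can be illustrated with a small dataclass. The ``DetectionResult`` below is a hypothetical stand-in with assumed fields, not chardet's class; the point is only that direct construction yields the same value as ``dataclasses.replace()`` while skipping its generic field handling.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class DetectionResult:  # hypothetical stand-in; field names are assumptions
    encoding: Optional[str]
    confidence: float
    language: Optional[str]


def scaled_generic(result: DetectionResult, factor: float) -> DetectionResult:
    """Convenient, but replace() walks the dataclass fields on every call."""
    return replace(result, confidence=result.confidence * factor)


def scaled_direct(result: DetectionResult, factor: float) -> DetectionResult:
    """Same outcome via a plain constructor call, suited to hot paths."""
    return DetectionResult(result.encoding, result.confidence * factor, result.language)


base = DetectionResult("utf-8", 0.8, "en")
```

The trade-off is maintenance: the direct form has to be updated by hand whenever a field is added.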

**Build:**

- Added riscv64 to the mypyc wheel build matrix — prebuilt wheels are
now published for RISC-V Linux alongside existing architectures
(`Bruno Verachten <https://github.com/gounthar>`_,
`348 <https://github.com/chardet/chardet/pull/348>`_)

7.2.0
-------------------

**Features:**

- Added ``include_encodings`` and ``exclude_encodings`` parameters to
:func:`~chardet.detect`, :func:`~chardet.detect_all`, and
:class:`~chardet.UniversalDetector` — restrict or exclude specific
encodings from the candidate set, with corresponding
``-i``/``--include-encodings`` and ``-x``/``--exclude-encodings``
CLI flags
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude,
`343 <https://github.com/chardet/chardet/pull/343>`_)
- Added ``no_match_encoding`` (default ``"cp1252"``) and
``empty_input_encoding`` (default ``"utf-8"``) parameters — control
which encoding is returned when no candidate survives the pipeline or
the input is empty, with corresponding CLI flags
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude,
`343 <https://github.com/chardet/chardet/pull/343>`_)
- Added ``-l``/``--language`` flag to ``chardetect`` CLI — shows the
detected language (ISO 639-1 code and English name) alongside the encoding
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude,
`342 <https://github.com/chardet/chardet/pull/342>`_)
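
To make the interplay of these parameters concrete, here is a pure-Python sketch of one plausible selection rule. The ``pick_encoding`` helper and its ``ranked_candidates`` argument are hypothetical, and the precedence shown (empty input first, then include/exclude filtering, then the no-match fallback) is an assumption based on the descriptions above; it does not call chardet.

```python
from typing import Iterable, Optional, Sequence


def pick_encoding(
    data: bytes,
    ranked_candidates: Sequence[str],
    include_encodings: Optional[Iterable[str]] = None,
    exclude_encodings: Optional[Iterable[str]] = None,
    no_match_encoding: str = "cp1252",
    empty_input_encoding: str = "utf-8",
) -> str:
    """Return the best surviving candidate, or a configured fallback."""
    if not data:
        return empty_input_encoding
    allowed = None if include_encodings is None else {n.lower() for n in include_encodings}
    blocked = {n.lower() for n in exclude_encodings or ()}
    for name in ranked_candidates:  # assumed to be sorted best-first
        key = name.lower()
        if (allowed is None or key in allowed) and key not in blocked:
            return name
    return no_match_encoding
```

With candidates ``["koi8-r", "cp1251"]``, excluding ``"KOI8-R"`` selects ``"cp1251"``, while restricting to ``["utf-8"]`` leaves no survivor and returns the ``no_match_encoding`` default.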

7.1.0
-------------------

**Features:**

- Added PEP 263 encoding declaration detection: ``# -*- coding: ... -*-``
and ``# coding=...`` declarations on lines 1–2 of Python source files are
now recognized with confidence 0.95
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude,
`249 <https://github.com/chardet/chardet/issues/249>`_)
- Added ``chardet.universaldetector`` backward-compatibility stub so that
``from chardet.universaldetector import UniversalDetector`` works with a
deprecation warning
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude,
`341 <https://github.com/chardet/chardet/issues/341>`_)
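
A minimal sketch of PEP 263 declaration matching on raw bytes, using the regular expression given in the PEP. The ``declared_encoding`` helper is hypothetical and deliberately simplified: it checks the first two lines independently and does not reproduce the interpreter's rule that line 2 only counts when line 1 is blank or a comment.

```python
import re
from typing import Optional

# Regular expression from PEP 263, compiled for bytes input.
_CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(source: bytes) -> Optional[str]:
    """Return the encoding named on line 1 or 2 of Python source, if any."""
    for line in source.splitlines()[:2]:
        match = _CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    return None
```

For example, ``declared_encoding(b"#!/usr/bin/env python\n# -*- coding: latin-1 -*-\n")`` picks up the declaration on the second line, while a declaration on line 3 is ignored.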

**Fixes:**

- Fixed false UTF-7 detection of ASCII text containing ``++`` or ``+word``
patterns
(`Dan Blanchard <https://github.com/dan-blanchard>`_,
`332 <https://github.com/chardet/chardet/issues/332>`_,
`335 <https://github.com/chardet/chardet/pull/335>`_)
- Fixed 0.5s startup cost on first ``detect()`` call — model norms are now
computed during loading instead of lazily iterating 21M entries
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude,
`333 <https://github.com/chardet/chardet/issues/333>`_,
`336 <https://github.com/chardet/chardet/pull/336>`_)
- Fixed undocumented encoding name changes between chardet 5.x and 7.0 —
``detect()`` now returns chardet 5.x-compatible names by default
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude,
`338 <https://github.com/chardet/chardet/pull/338>`_)
- Improved ISO-2022-JP family detection — recognizes ESC sequences for
ISO-2022-JP-2004 (JIS X 0213) and ISO-2022-JP-EXT (JIS X 0201 Kana)
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude)
- Fixed silent truncation of corrupt model data (``iter_unpack`` yielded
fewer tuples instead of raising)
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude)
- Fixed incorrect date in LICENSE
(`Dan Blanchard <https://github.com/dan-blanchard>`_)

**Performance:**

- 5.5x faster first-detect time (~0.42s → ~0.075s) by computing model
norms as a side-product of ``load_models()``
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude)
- ~40% faster model parsing via ``struct.iter_unpack`` for bulk entry
extraction (eliminates ~305K individual ``unpack`` calls)
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude)
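
Both items can be sketched together: bulk-parse fixed-size records with ``struct.Struct.iter_unpack`` and accumulate the norm in the same pass. The three-byte record layout, the ``load_entries`` helper, and the use of a Euclidean norm are assumptions for illustration, not chardet's model format.

```python
import math
import struct

# Hypothetical record: first byte, second byte, weight.
ENTRY = struct.Struct("<BBB")

payload = b"".join(
    ENTRY.pack(a, b, w)
    for a, b, w in [(0x82, 0xA0, 200), (0x82, 0xA2, 150), (0x83, 0x41, 90)]
)

# One unpack call per record versus a single bulk iterator.
one_by_one = [ENTRY.unpack_from(payload, off) for off in range(0, len(payload), ENTRY.size)]
bulk = list(ENTRY.iter_unpack(payload))


def load_entries(data: bytes) -> tuple[dict[tuple[int, int], int], float]:
    """Build the bigram table and its norm in a single pass over the data."""
    table: dict[tuple[int, int], int] = {}
    sum_of_squares = 0
    for first, second, weight in ENTRY.iter_unpack(data):
        table[(first, second)] = weight
        sum_of_squares += weight * weight
    return table, math.sqrt(sum_of_squares)


table, norm = load_entries(payload)
```

Computing the norm while the entries stream past avoids a second full iteration the first time a detection needs it.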

**New API parameters:**

- Added ``compat_names`` parameter (default ``True``) to
:func:`~chardet.detect`, :func:`~chardet.detect_all`, and
:class:`~chardet.UniversalDetector` — set to ``False`` to get raw Python
codec names instead of chardet 5.x/6.x compatible display names
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude)
- Added ``prefer_superset`` parameter (default ``False``) — remaps legacy
ISO/subset encodings to their modern Windows/CP superset equivalents
(e.g., ASCII → Windows-1252, ISO-8859-1 → Windows-1252).
**This will default to ``True`` in the next major version (8.0).**
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude)
- Deprecated ``should_rename_legacy`` in favor of ``prefer_superset`` —
a deprecation warning is emitted when used
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude)
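
The motivation for preferring the Windows superset can be shown with the standard codecs, without calling chardet. Pure ASCII bytes decode identically either way, while bytes in the 0x80 to 0x9F range are printable punctuation in Windows-1252 but C1 control characters in ISO-8859-1.

```python
ascii_bytes = b"plain text"
smart_quotes = b"\x93quoted\x94"

# ASCII content is unaffected by the remapping.
same_for_ascii = ascii_bytes.decode("ascii") == ascii_bytes.decode("cp1252")

# The same two bytes become curly quotes under cp1252 ...
as_cp1252 = smart_quotes.decode("cp1252")
# ... and invisible control characters under ISO-8859-1.
as_latin1 = smart_quotes.decode("iso-8859-1")
```

"Superset" is used here in the practical sense: reporting Windows-1252 for such input gives a decode that matches what most producers of those bytes intended.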

**Improvements:**

- Switched internal canonical encoding names to Python codec names
(e.g., ``"utf-8"`` instead of ``"UTF-8"``), with ``compat_names``
controlling the public output format.  See :doc:`usage` for the full
mapping table.
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude)
- Added ``lookup_encoding()`` to ``registry`` for case-insensitive
resolution of arbitrary encoding name input to canonical names
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude)
- Achieved 100% line coverage across all source modules (+31 tests)
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude)
- Updated benchmark numbers: 98.2% encoding accuracy, 95.2% language
accuracy on 2,510 test files
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude)
- Pinned test-data cloning to chardet release version tags for
reproducible builds
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude)
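
For comparison, the standard library already offers a similar case-insensitive resolution of encoding names to Python codec names through ``codecs.lookup``. The ``canonical_name`` wrapper below is a hypothetical analogue and is not chardet's ``lookup_encoding()``, whose signature and return values are not described here.

```python
import codecs
from typing import Optional


def canonical_name(name: str) -> Optional[str]:
    """Resolve an arbitrary encoding label to Python's codec name, if known."""
    try:
        return codecs.lookup(name.strip()).name
    except LookupError:
        return None
```

This shows the kind of names the switch to Python codec names implies: ``canonical_name("UTF-8")`` gives ``"utf-8"`` and ``canonical_name("Windows-1252")`` gives ``"cp1252"``.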

7.0.1
-------------------

**Fixes:**

- Fixed false UTF-7 detection of SHA-1 git hashes
(`Alex Rembish <https://github.com/rembish>`_,
`324 <https://github.com/chardet/chardet/pull/324>`_)
- Fixed ``_SINGLE_LANG_MAP`` missing aliases for single-language encoding
lookup (e.g., ``big5`` → ``big5hkscs``)
(`Dan Blanchard <https://github.com/dan-blanchard>`_)
- Fixed PyPy ``TypeError`` in UTF-7 codec handling
(`Dan Blanchard <https://github.com/dan-blanchard>`_)

**Improvements:**

- Retrained bigram models — 24 previously failing test cases now pass
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude)
- Updated language equivalences for mutual intelligibility (Slovak/Czech,
East Slavic + Bulgarian, Malay/Indonesian, Scandinavian languages)
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude)

7.0.0
-------------------

Ground-up, 0BSD-licensed rewrite of chardet
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude,
`322 <https://github.com/chardet/chardet/pull/322>`_). Same package name,
same public API — drop-in replacement for chardet 5.x/6.x.

**Highlights:**

- **0BSD license** (previous versions were LGPL)
- **96.8% accuracy** on 2,179 test files (+2.3pp vs chardet 6.0.0,
+7.7pp vs charset-normalizer)
- **41x faster** than chardet 6.0.0 with mypyc (**28x** pure Python),
**7.5x faster** than charset-normalizer
- **Language detection** for every result (90.5% accuracy across 49
languages)
- **99 encodings** across six eras (MODERN_WEB, LEGACY_ISO, LEGACY_MAC,
LEGACY_REGIONAL, DOS, MAINFRAME)
- **12-stage detection pipeline** — BOM, UTF-16/32 patterns, escape
sequences, binary detection, markup charset, ASCII, UTF-8 validation,
byte validity, CJK gating, structural probing, statistical scoring,
post-processing
- **Bigram frequency models** trained on CulturaX multilingual corpus
data for all supported language/encoding pairs
- **Optional mypyc compilation** — 1.49x additional speedup on CPython
- **Thread-safe** ``detect()`` and ``detect_all()`` with no measurable
overhead; scales on free-threaded Python 3.13t+
- **Negligible import memory** (96 B)
- **Zero runtime dependencies**
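
As an illustration of the first pipeline stage, a BOM check can be written with the constants in the standard ``codecs`` module. The ``detect_bom`` helper and the returned names are assumptions for the sketch; the only subtle point is ordering, since the UTF-32-LE BOM begins with the same two bytes as the UTF-16-LE BOM.

```python
import codecs
from typing import Optional

# Longest signatures first so UTF-32 is not mistaken for UTF-16.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes) -> Optional[str]:
    """Return an encoding name if data starts with a known byte order mark."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

When no BOM is present the function returns ``None``, which in a staged design simply hands the input to the next stage.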

6.0.0.post1
-------------------------

- Fixed ``__version__`` not being set correctly in the package
(`Dan Blanchard <https://github.com/dan-blanchard>`_)

6.0.0
-------------------

**Features:**

- Unified single-byte charset detection with proper language-specific
bigram models for all single-byte encodings (replaces ``Latin1Prober``
and ``MacRomanProber`` heuristics)
(`Dan Blanchard <https://github.com/dan-blanchard>`_)
- 38 new languages: Arabic, Belarusian, Breton, Croatian, Czech, Danish,
Dutch, English, Esperanto, Estonian, Farsi, Finnish, French, German,
Icelandic, Indonesian, Irish, Italian, Kazakh, Latvian, Lithuanian,
Macedonian, Malay, Maltese, Norwegian, Polish, Portuguese, Romanian,
Scottish Gaelic, Serbian, Slovak, Slovene, Spanish, Swedish, Tajik,
Ukrainian, Vietnamese, Welsh
(`Dan Blanchard <https://github.com/dan-blanchard>`_)
- ``EncodingEra`` filtering via new ``encoding_era`` parameter
(`Dan Blanchard <https://github.com/dan-blanchard>`_)
- ``max_bytes`` and ``chunk_size`` parameters for ``detect()``,
``detect_all()``, and ``UniversalDetector``
(`Dan Blanchard <https://github.com/dan-blanchard>`_)
- ``-e``/``--encoding-era`` CLI flag
(`Dan Blanchard <https://github.com/dan-blanchard>`_ via Claude)
- EBCDIC detection (CP037, CP500)
(`Dan Blanchard <https://github.com/dan-blanchard>`_)
- Direct GB18030 support (replaces redundant GB2312 prober)
(`Dan Blanchard <https://github.com/dan-blanchard>`_)
- Binary file detection
(`Dan Blanchard <https://github.com/dan-blanchard>`_)
- Python 3.12, 3.13, and 3.14 support
(`Hugo van Kemenade <https://github.com/hugovk>`_,
`283 <https://github.com/chardet/chardet/pull/283>`_)
- GitHub Codespaces support
(`oxygen dioxide <https://github.com/oxygen-dioxide>`_,
`312 <https://github.com/chardet/chardet/pull/312>`_)

**Breaking changes:**

- Dropped Python 3.7, 3.8, and 3.9 (requires Python 3.10+)
- Removed ``Latin1Prober`` and ``MacRomanProber``
- Removed EUC-TW support
- Removed ``LanguageFilter.NONE``
- ``detect()`` default changed to ``encoding_era=EncodingEra.MODERN_WEB``

**Fixes:**

- Fixed CP949 state machine
(`nenw* <https://github.com/HelloWorld017>`_,
`268 <https://github.com/chardet/chardet/pull/268>`_)
- Fixed SJIS distribution analysis (second-byte range >= 0x80)
(`Kadir Can Ozden <https://github.com/bysiber>`_,
`315 <https://github.com/chardet/chardet/pull/315>`_)
- Fixed ``max_bytes`` not being passed to ``UniversalDetector``
(`Kadir Can Ozden <https://github.com/bysiber>`_,
`314 <https://github.com/chardet/chardet/pull/314>`_)
- Fixed UTF-16/32 detection for non-ASCII-heavy text
(`Dan Blanchard <https://github.com/dan-blanchard>`_)
- Fixed GB18030 ``char_len_table``
(`Dan Blanchard <https://github.com/dan-blanchard>`_)
- Fixed UTF-8 state machine
(`Dan Blanchard <https://github.com/dan-blanchard>`_)
- Fixed ``detect_all()`` returning inactive probers
(`Dan Blanchard <https://github.com/dan-blanchard>`_)
- Fixed early cutoff bug
(`Dan Blanchard <https://github.com/dan-blanchard>`_)
- Updated LGPLv2.1 license text for remote-only FSF address
(`Ben Beasley <https://github.com/musicinmybrain>`_,
`307 <https://github.com/chardet/chardet/pull/307>`_)

5.2.0
-------------------

- Added support for running the CLI via ``python -m chardet``
(`Dan Blanchard <https://github.com/dan-blanchard>`_)

5.1.0
-------------------

- Added ``should_rename_legacy`` argument to remap legacy encoding names
to modern equivalents
(`Dan Blanchard <https://github.com/dan-blanchard>`_,
`264 <https://github.com/chardet/chardet/pull/264>`_)
- Added MacRoman encoding prober
(`Elia Robyn Lake <https://github.com/rspeer>`_)
- Added ``--minimal`` flag to ``chardetect`` CLI
(`Dan Blanchard <https://github.com/dan-blanchard>`_,
`214 <https://github.com/chardet/chardet/pull/214>`_)
- Added type annotations and mypy CI
(`Jon Dufresne <https://github.com/jdufresne>`_,
`261 <https://github.com/chardet/chardet/pull/261>`_)
- Added support for Python 3.11
(`Hugo van Kemenade <https://github.com/hugovk>`_,
`274 <https://github.com/chardet/chardet/pull/274>`_)
- Added ISO-8859-15 capital letter sharp S handling
(`Simon Waldherr <https://github.com/SimonWaldherr>`_,
`222 <https://github.com/chardet/chardet/pull/222>`_)
- Clarified LGPL version in license trove classifier
(`Ben Beasley <https://github.com/musicinmybrain>`_,
`255 <https://github.com/chardet/chardet/pull/255>`_)
- Removed support for Python 3.6
(`Jon Dufresne <https://github.com/jdufresne>`_,
`260 <https://github.com/chardet/chardet/pull/260>`_)

5.0.0
-------------------

- Added Johab Korean prober
(`Dan Blanchard <https://github.com/dan-blanchard>`_,
`207 <https://github.com/chardet/chardet/pull/207>`_)
- Added UTF-16/32 BE/LE probers
(`Dan Blanchard <https://github.com/dan-blanchard>`_,
`206 <https://github.com/chardet/chardet/pull/206>`_)
- Added test data for Croatian, Czech, Hungarian, Polish, Slovak,
Slovene, Greek, Turkish
(`Dan Blanchard <https://github.com/dan-blanchard>`_)
- Improved XML tag filtering
(`Dan Blanchard <https://github.com/dan-blanchard>`_,
`208 <https://github.com/chardet/chardet/pull/208>`_)
- Made ``detect_all`` return child prober confidences
(`Dan Blanchard <https://github.com/dan-blanchard>`_,
`210 <https://github.com/chardet/chardet/pull/210>`_)
- Added support for Python 3.10
(`Hugo van Kemenade <https://github.com/hugovk>`_,
`232 <https://github.com/chardet/chardet/pull/232>`_)
- Slight performance increase
(`deedy5 <https://github.com/deedy5>`_,
`252 <https://github.com/chardet/chardet/pull/252>`_)
- Dropped Python 2.7, 3.4, 3.5 (requires Python 3.6+)

4.0.0
-------------------

- Added ``detect_all()`` function returning all candidate encodings
(`Damien <https://github.com/mdamien>`_,
`111 <https://github.com/chardet/chardet/pull/111>`_)
- Converted single-byte charset probers to nested dicts (performance)
(`Dan Blanchard <https://github.com/dan-blanchard>`_,
`121 <https://github.com/chardet/chardet/pull/121>`_)
- ``CharsetGroupProber`` now short-circuits on definite matches
(performance)
(`Dan Blanchard <https://github.com/dan-blanchard>`_,
`203 <https://github.com/chardet/chardet/pull/203>`_)
- Added ``language`` field to ``detect_all`` output
(`Dan Blanchard <https://github.com/dan-blanchard>`_)
- Switched from Travis to GitHub Actions
(`Dan Blanchard <https://github.com/dan-blanchard>`_,
`204 <https://github.com/chardet/chardet/pull/204>`_)
- Dropped Python 2.6, 3.4, 3.5

@pyup-bot pyup-bot mentioned this pull request Mar 27, 2026