Skip to content

Fix ExifTool test failures: regex, eval, and bytecode interpreter bugs#250

Open
fglock wants to merge 10 commits intomasterfrom
fix-bytecode-my-extraction
Open

Fix ExifTool test failures: regex, eval, and bytecode interpreter bugs#250
fglock wants to merge 10 commits intomasterfrom
fix-bytecode-my-extraction

Conversation

@fglock
Copy link
Owner

@fglock fglock commented Feb 27, 2026

Summary

Fixes multiple bugs discovered while running the ExifTool test suite, improving pass rate from ~78% to 93% (294/316 individual tests, 102/113 test files fully passing).

Relates to #93

Bugs Fixed

  • Fix \G regex anchorRuntimeRegex.useGAssertion was hardcoded to false (private final boolean), so the \G assertion was never enforced. Fixed by making it mutable and setting from regexFlags.useGAssertion(). This fixed XISF/XMP parsing and RTF title extraction.

  • Fix grep scalar context in bytecode interpreterListOperators.grep() already wraps the match count in scalar context, but BytecodeInterpreter was re-counting the wrapper list (always returning 1). Removed the double-wrapping.

  • Fix regex octal escapes inside character classes[\0-\4] was passing single-digit octals as-is, but Java regex requires \04. Prepended leading zero for short octals in character classes.

  • Fix eval STRING leaking captured variable aliases — eval STRING pre-compilation created global aliases for captured my variables but never cleaned them up, causing variables to not re-initialize in loops.

  • Dynamic scoping for regex match variables$1, $&, @-, @+ etc. now properly save/restore across eval and subroutine boundaries.

  • Fix sysread/read type corruption on tied scalars — byte-string type was being lost.

  • Fix stat _ / lstat _ — cached stat buffer now works correctly with proper context handling.

  • Fix scalar localtime/gmtime — use Perl ctime format; fix for large epoch values.

  • Fix list slice with range indices(expr)[0..2] now works correctly.

  • Fix bytecode compiler my-extraction — short-circuit ops and AUTOLOAD with forward declarations.

  • Other fixes: BYTE_STRING type preservation, (@) prototype parsing, signature error messages, Time::Local update to 1.35.

Test plan

  • make passes (all unit tests)
  • ExifTool test suite: 294/316 tests pass (93%)
  • 102/113 ExifTool test files fully pass

Generated with Devin

fglock and others added 10 commits February 27, 2026 17:28
In Perl, \1-\7 inside character classes are always octal escapes, not
backreferences. Java regex requires a leading zero for octal sequences.
The preprocessor was passing single/double digit octals through as-is,
causing Java Pattern.compile to reject them.

Fixes 11 ExifTool test failures caused by MakerNoteHP2 condition
regex [\0-\4].

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <noreply@cognition.ai>
Two bugs fixed:

1. \G regex anchor was never active: RuntimeRegex had a hardcoded
   `private final boolean useGAssertion = false` field that shadowed
   the actual flag from RegexFlags. Now useGAssertion is set from
   regexFlags.useGAssertion() after compilation and in all copy paths.

2. grep in scalar context always returned 1 in bytecode interpreter:
   ListOperators.grep() already wraps the count in scalar context,
   but BytecodeInterpreter was re-wrapping with elements.size() which
   is always 1. Removed the double-wrapping to match the MAP handler.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <noreply@cognition.ai>
ScalarSpecialVariable (regex captures like $1, $&) computes its value
dynamically via getValueAsScalar(), but its type/value fields stay as
UNDEF/null. The RuntimeScalar copy constructor and set() method directly
accessed these fields, resulting in UNDEF when $1 was used in array ref
construction ([$1] -> empty), hash construction, or any copy context.

Added instanceof checks to resolve ScalarSpecialVariable before copying.
Fixes RTF Unicode parsing (Recompose([$2])) and MIFF metadata.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <noreply@cognition.ai>
In Perl, pos() only affects /g matches. Non-/g matches always start
from position 0 regardless of pos(). PerlOnJava was incorrectly using
pos() for all matches, causing regex failures when a prior /g match
had set pos() on a string.

This fixed ExifTool Text_5 test where IsUTF8() used /g matches on a
string, leaving pos() set, and a subsequent non-/g match on the same
string via $$dataPt !~ /[\x80-\x9f]/ would incorrectly start from the
pos() position instead of 0.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <noreply@cognition.ai>
- Fix not() operator for ScalarSpecialVariable ($1, $2, etc.) which
  always returned true because it checked the type field (always UNDEF)
  instead of delegating to getBoolean(). This caused DICOM parsing to
  incorrectly treat explicit VR as implicit VR.

- Fix Digest::MD5 and Digest::SHA add() methods to use ISO_8859_1
  instead of UTF_8 for byte string conversion, matching the addfile()
  methods. Fixes wrong IPTC digest computation for data with bytes >= 0x80.

- Fix lengthBytes (use bytes; length) to return character count for
  BYTE_STRING type instead of UTF-8 byte count. Bytes >= 0x80 were
  being double-counted as 2 UTF-8 bytes.

- Fix string concatenation to preserve BYTE_STRING type when either
  operand is BYTE_STRING and all characters are <= 255. Previously,
  concatenating a BYTE_STRING with a STRING would always upgrade to
  STRING, causing downstream binary data corruption.

- Fix unpack float/double handlers to use code point path for UTF-8
  flagged data, matching the existing short/long handler pattern. Bytes
  >= 0x80 were being UTF-8 expanded before float interpretation,
  corrupting values like 1.0 into -1.78e-38.

Fixes ExifTool tests: DICOM_2, PDF_2,3, PhotoMechanic_2

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <noreply@cognition.ai>
The is_utf8() function was checking type == BYTE_STRING when it should
check type != BYTE_STRING. In Perl 5, is_utf8() returns true when the
string has the internal UTF-8 flag (character string), and false for
byte strings. PerlOnJava BYTE_STRING type represents byte strings (no
UTF-8 flag), so is_utf8 should return false for them.

This caused ExifTool to incorrectly UTF-8 encode binary image data
when reading from scalar references, because it uses is_utf8() to
decide whether to re-encode the data.

Fixes ExifTool tests: PPM_3, GIF_2, IPTC_4, IPTC_5

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <noreply@cognition.ai>
…re, substr lvalue

- joinInternal: use OR logic for BYTE_STRING (any byte element makes result bytes)
- stringConcat/stringConcatWarnUninitialized: preserve BYTE_STRING in fallthrough
  when one operand is UNDEF/INTEGER
- ParseMapGrepSort: detect sort SUBNAME LIST pattern before trying block parse
- EmitForeach: save/restore pre-existing lexical variables used as loop variables
- RuntimeSubstrLvalue.set(): preserve BYTE_STRING type after substr assignment

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <noreply@cognition.ai>
tryWholeBlockRefactoring wraps large blocks in anonymous subs,
breaking return semantics (return exits the anon sub, not the
enclosing function). The interpreter fallback handles "Method too
large" safely. Also adds return detection to ControlFlowDetectorVisitor
as defense-in-depth, implements goto LABEL in bytecode compiler,
and caches overload method resolution.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <noreply@cognition.ai>
HASH_SET was casting value register to RuntimeScalar without checking
type, causing ClassCastException when a RuntimeList was in the
register. NOT opcode similarly hardened. Also improved bytecode
context display to show full 16-bit opcode values.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <noreply@cognition.ai>
…opcode

Three interpreter bugs fixed plus a runtime fix:

1. undef $scalar in interpreter: was creating a new undef value and
   ignoring the operand. Now emits UNDEFINE_SCALAR opcode that calls
   .undefine() on the variable register in-place.

2. HASH_SET opcode stored values directly in hash, allowing
   RuntimeScalarReadOnly cached integers to be stored. Later +=
   modifications failed with read-only error. Now wraps values in
   new RuntimeScalar() to ensure mutability.

3. print/say/printf return value: was returning byte count instead
   of 1 on success (Perl spec). Fixed in CustomFileChannel and
   PipeOutputChannel write() methods.

4. MY_SCALAR superinstruction (opcode 352): combines LOAD_UNDEF +
   SET_SCALAR for lexical scalar assignment. Prevents aliasing to
   read-only cached values.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <noreply@cognition.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant