Add German umlaut/eszett to default character replacement presets (#1667) #1809

Closed
CryptoJones wants to merge 3 commits into rmcrackan:master from CryptoJones:feat/1667-umlaut-defaults
@CryptoJones

Summary

  • Adds 7 new character mappings to all 6 default replacement presets (HiFi NTFS/Other, LoFi NTFS/Other, Barebones NTFS/Other):
    • ä → ae, ö → oe, ü → ue
    • Ä → Ae, Ö → Oe, Ü → Ue
    • ß → ss
  • Users can remove or remap any of these via the existing character replacement editor — no new settings or UI needed
  • Each mapping has its own static factory method on Replacement (same pattern as Colon(), Pipe(), etc.) so they can be referenced by name in tests and future presets

Motivation

Umlauts and eszett are valid filename characters on all modern filesystems but cause real compatibility problems when files are shared across systems or uploaded to services that normalize Unicode. This has been a long-standing pain point for German-language Audible content (#1667).

Test plan

  • UmlautReplacements test class passes (13 cases across Default, LoFi, and Barebones presets on both platforms)
  • Existing GetSafePath, GetSafeFileName, and GetValidFilename tests still pass (no regressions)
  • A book with an umlaut in its title (e.g. "Märchen") downloads with ae in the filename instead of ä being stripped to _

Closes #1667

🤖 Generated with Claude Code

Adds German umlaut (ä→ae, ö→oe, ü→ue, Ä→Ae, Ö→Oe, Ü→Ue) and eszett
(ß→ss) mappings to all six default replacement presets. Users can remove
or customize them via the character replacement editor. Covered by unit
tests across Default, LoFi, and Barebones presets.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@rmcrackan
Owner

This PR does not compile. Please test AI-generated code locally before submitting.

@Jo-Be-Co
Contributor

Jo-Be-Co commented May 14, 2026

Overall, however, I feel this falls a little short. The problem of special characters isn’t limited to German-speaking countries. French and Spanish characters immediately spring to mind.
The field is broad, covering not only titles in specific languages, but also the names of the people involved. For example, an Icelandic author might have written a book that has been translated into French.

Would Libation's existing replacement mechanism be the right approach here, or would a custom mapping suffice?
This issue isn't new; for example, the Humanizer library offers a simple Deaccent() function. A more complex but more accurate option would be the ICU standard, i.e. the CLDR Latin-ASCII transliteration available through ICU4N.Text.

@Jo-Be-Co
Contributor

BTW: German authorities have a standard that specifies which characters are permitted in names. This applies, for example, to the data recorded on identity documents. There are almost 700 possible characters and accents.

Take a look at this regular expression, which validates strings that comply with this standard:

( |'|[,-\.]|[A-Z]|[`-z]|~|¨|´|·|[À-Ö]|[Ø-ö]|[ø-ž]|[Ƈ-ƈ]|Ə|Ɨ|[Ơ-ơ]|[Ư-ư]|Ʒ|[Ǎ-ǜ]|[Ǟ-ǟ]|[Ǣ-ǰ]|[Ǵ-ǵ]|[Ǹ-ǿ]|[Ȓ-ȓ]|[Ș-ț]|[Ȟ-ȟ]|[ȧ-ȳ]|ə|ɨ|ʒ|[ʹ-ʺ]|[ʾ-ʿ]|ˈ|ˌ|[Ḃ-ḃ]|[Ḇ-ḇ]|[Ḋ-ḑ]|ḗ|[Ḝ-ḫ]|[ḯ-ḷ]|[Ḻ-ḻ]|[Ṁ-ṉ]|[Ṓ-ṛ]|[Ṟ-ṣ]|[Ṫ-ṯ]|[Ẁ-ẇ]|[Ẍ-ẗ]|ẞ|[Ạ-ỹ]|’|‡|A̋|C(̀|̄|̆|̈|̕|̣|̦|̨̆)|D̂|F(̀|̄)|G̀|H(̄|̦|̱)|J(́|̌)|K(̀|̂|̄|̇|̕|̛|̦|͟H|͟h)|L(̂|̥|̥̄|̦)|M(̀|̂|̆|̐)|N(̂|̄|̆|̦)|P(̀|̄|̕|̣)|R(̆|̥|̥̄)|S(̀|̄|̛̄|̱)|T(̀|̄|̈|̕|̛)|U̇|Z(̀|̄|̆|̈|̧)|a̋|c(̀|̄|̆|̈|̕|̣|̦|̨̆)|d̂|f(̀|̄)|g̀|h(̄|̦)|j́|k(̀|̂|̄|̇|̕|̛|̦|͟h)|l(̂|̥|̥̄|̦)|m(̀|̂|̆|̐)|n(̂|̄|̆|̦)|p(̀|̄|̕|̣)|r(̆|̥|̥̄)|s(̀|̄|̛̄|̱)|t(̀|̄|̕|̛)|u̇|z(̀|̄|̆|̈|̧)|Ç̆|Û̄|ç̆|û̄|ÿ́|Č(̕|̣)|č(̕|̣)|ē̍|Ī́|ī́|ō̍|Ž(̦|̧)|ž(̦|̧)|Ḳ̄|ḳ̄|Ṣ̄|ṣ̄|Ṭ̄|ṭ̄|Ạ̈|ạ̈|Ọ̈|ọ̈|Ụ(̄|̈)|ụ(̄|̈))*

@rmcrackan
Owner

There are almost 700 possible characters and accents

Holy cow. That's way more than I would have guessed!

@Jo-Be-Co
Contributor

To see them, take a look at this file: latin_letters_1.3.txt
500 letters, nearly 150 sequences, ...

AK Clark and others added 2 commits May 15, 2026 01:01
…st data

The #Defaults region of ReplacementCharacters.cs had been auto-formatted
with U+201C/U+201D (curly quotation marks) used as C# string delimiters,
which is not valid syntax and broke every build target. Restore ASCII
straight quotes as delimiters while keeping the intended Unicode
replacement characters (e.g. "“", "”", "＂") as string content.

Also fix the three uppercase-umlaut test rows whose expected values did
not match the A_umlaut/O_umlaut/U_umlaut definitions ("Ae"/"Oe"/"Ue").
The function does not change context outside the umlaut itself, so
all-caps inputs like "ÄRGER" produced "AeRGER", not the expected
"AErger". Switch the rows to mixed-case inputs to test the realistic
case meaningfully.

Verified locally:
- FileManager.Tests UmlautReplacements: 12/12 pass
- LibationFileManager.Tests: no regressions
- (Pre-existing failures on master, unrelated: 2 ConditionalTag
  catastrophic-backtracking tests, 1 Tag_culture_test de-CH locale)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A test that pins the actual Unicode codepoints used as OpenQuote
(U+201C), CloseQuote (U+201D), and OtherQuote (U+FF02) replacements.
Catches re-corruption of the #Defaults block by auto-formatters that
substitute curly quotes for ASCII quotes — even when the file still
compiles, the replacement output would change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@CryptoJones
Author

Apologies again, same story as #1810. The #Defaults block got auto-formatted with U+201C/U+201D used as C# string delimiters — a fairly impressive way to break every build target at once. Pushed 1df1b7ae (restored ASCII delimiters, kept the Unicode chars as string content, fixed three test rows whose expected output didn't match the A_umlaut("Ae") definitions) and 3029d5d9 (regression test pinning the actual codepoints so a future re-corruption is caught even if the file still compiles). Both FileManager and LibationFileManager suites green locally before pushing.

xkcd 2347 applies here too. Sorry for the noise on a Wednesday.

@CryptoJones
Author

Thanks @Jo-Be-Co — fair critique. The German-only framing is arbitrary; Humanizer's Deaccent() or ICU4N's Latin-ASCII would be more principled.

One possible compromise: use Deaccent() for the BareBones preset (where users want stripped ASCII anyway) and keep the named umlaut/eszett replacements only for HiFi/LoFi where the German typographic convention (Ä → Ae, not A) is meaningful to preserve. That gives correct Latin-Extended coverage by default without losing the cultural nicety. Happy to rework along those lines if there's appetite.

The DIN 91379 pointer is wild — 500 letters plus 150 sequences. The combining-diacritic sequences are exactly why Unicode-normalization-based approaches (which Humanizer does) win over hand-rolled mappings. Great reference, thank you.

@Jo-Be-Co
Contributor

Jo-Be-Co commented May 15, 2026

The list contains a sequence only where no single-character alternative exists.

In fact, most of the 500 characters can also be represented in Unicode as a sequence (base letter plus combining accent). Accent removal often relies on exactly this decomposition: keep the letter and remove the accent.

But there are other two-letter replacements in addition to the German ones:

| Character | Transliteration | Example Author | Example Title |
|-----------|-----------------|----------------|---------------|
| Ä / ä | AE | Märta Tikkanen | Män som hatar kvinnor |
| Ö / ö | OE | Dörte Hansen | Erlösung |
| Ü / ü | UE | Günter Grass | Über Menschen |
| ẞ / ß | SS | Peter Weiß | Die Straße |
| Æ / æ | AE | - | - |
| Ø / ø | OE | Peter Høeg | Frøken Smillas fornemmelse for sne |
| Å / å | AA | Åsa Larsson | Till offer åt Molok |
| Þ / þ | TH | Þórarinn Eldjárn | - |
| Ĳ / ĳ | IJ | - | - |
| Œ / œ | OE | - | Cœur de pirate |

@rmcrackan
Owner

Thank you much for the code clean-up. Since I don't know any German beyond "guten tag" and @Jo-Be-Co actually is German, I'm deferring to him about the language rules. I'll just do a technical code review:

  • Public API naming (C# style)
    New factories are a_umlaut, o_umlaut, ... while the rest of Replacement uses PascalCase (OpenAngleBracket, Pipe, ...). For consistency with the type and typical .NET conventions, consider names like LowerAUmlaut / UpperAUmlaut / Eszett (or similar), unless you deliberately want snake_case for grep-ability.

  • HiFi presets: behavior change worth documenting
    HiFi_* is mostly about substituting filesystem-forbidden characters with fancy Unicode stand-ins. Umlauts are valid on NTFS (and typical POSIX filesystems). After this PR, Default (HiFi) will now transliterate umlauts to ASCII, so users who relied on Unicode filenames staying Unicode will see a change unless they edit presets. That is consistent with the PR motivation (sharing / normalization), but it is a product decision: a short note in release notes or docs (“Default now transliterates German umlauts; remove rules to keep Unicode”) would reduce surprise.

  • Test breadth
    Default covers eight cases; LoFi and Barebones only two each. Enough to prove wiring; you could add one row each for ö/ü/Ä if you want parity without much cost.

@Jo-Be-Co
Contributor

A few thoughts on this:

  • If you consider the motivation behind this change, it really concerns every character that can appear because Unicode is permitted. Beyond the (German) umlauts, this already applies to the upper half of ISO 8859-1, whose 256 characters are also the first 256 code points of Unicode. In other words, any character outside the US-ASCII range is potentially problematic.
  • As @rmcrackan pointed out, the replacements are primarily intended to prevent problems with filenames. That is why the implementation also takes into account the environment in which Libation is currently running.
  • A few of the predefined replacements themselves use Unicode characters outside the ASCII range.
  • Personally, I would prefer to keep file names as close to the original as possible. So, at first glance, I think it would be a mistake for these ‘special character’ adjustments to be applied automatically with a Libation update. I would simply disable the feature straight away.
  • The fields in question are always text fields, and all of them can now be output with additional formatting. Currently that formatting covers case and length. I therefore find the idea of adding a transliterate-to-ASCII option there appealing.
  • On the other hand, if you want these replacements for one field, you would probably want them for all fields or for the entire path. I therefore ultimately consider a global option to be the right approach.

@Jo-Be-Co
Contributor

Jo-Be-Co commented May 16, 2026

As already mentioned, non-ASCII characters do not appear exclusively in the German language. In addition to the actual locations of the Audible shops, there are undoubtedly other languages in which the books have been written, and likely even more countries from which the authors and narrators originate.

The original poster of #1667 also states that they are only affected by German special characters, but assumes that other languages, such as French, are affected as well.

I therefore believe the solution to this issue is the "Convert filenames and path to ASCII" option next to the replacements. This should then enable the following:

  • Replacement of special characters and accented letters that have a replacement consisting of multiple letters:
    • Æ Ǟ Ǣ Ǽ Ä → AE or Ae
    • æ ǟ ǣ ǽ ä → ae
    • Ǻ Å → AA or Aa
    • ǻ å → aa
    • Ø Œ Ǿ Ȫ Ö → OE or Oe
    • ø œ ǿ ȫ ö → oe
    • ẞ → SS or Ss
    • ß → ss
    • Þ → TH or Th
    • þ → th
    • Ĳ → IJ or Ij
    • ĳ → ij
    • Ǖ Ǘ Ǚ Ǜ Ü → UE or Ue
    • ǖ ǘ ǚ ǜ ü → ue
  • Together with the other useful replacements, a map (dictionary) might be useful here.
  • Apply Unicode Decomposition NFD
  • Delete all characters that are not included in the ASCII set, i.e. those with a codepoint of \x80 or higher.

However, these mappings and the code for cleaning the output do need to be maintained. There may be a library that could be used instead.

Furthermore, using this option should ensure that the standard replacements remain within the ASCII character set.
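
The steps listed above (multi-letter map, NFD decomposition, dropping everything outside ASCII) can be sketched in a few lines of plain .NET. This is only an illustration under stated assumptions: the class name `AsciiFoldSketch`, the method `ToAscii`, and the deliberately incomplete map are hypothetical and not part of Libation or of the PR.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class AsciiFoldSketch
{
    // Hypothetical, incomplete map of characters whose preferred
    // replacement is more than one letter (subset of the list above).
    private static readonly Dictionary<char, string> MultiLetter = new()
    {
        ['Ä'] = "Ae", ['ä'] = "ae",
        ['Ö'] = "Oe", ['ö'] = "oe",
        ['Ü'] = "Ue", ['ü'] = "ue",
        ['ẞ'] = "SS", ['ß'] = "ss",
        ['Æ'] = "AE", ['æ'] = "ae",
        ['Ø'] = "Oe", ['ø'] = "oe",
        ['Å'] = "Aa", ['å'] = "aa",
        ['Þ'] = "Th", ['þ'] = "th",
    };

    public static string ToAscii(string input)
    {
        // Step 1: explicit multi-letter replacements. Compose to NFC first
        // so that "a + combining diaeresis" also matches the 'ä' key.
        var mapped = new StringBuilder(input.Length);
        foreach (char c in input.Normalize(NormalizationForm.FormC))
            mapped.Append(MultiLetter.TryGetValue(c, out var replacement) ? replacement : c.ToString());

        // Step 2: NFD splits remaining accented letters into base letter + combining marks.
        string decomposed = mapped.ToString().Normalize(NormalizationForm.FormD);

        // Step 3: drop every UTF-16 code unit at or above 0x80 (this removes the combining marks).
        var result = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
            if (c < 0x80)
                result.Append(c);
        return result.ToString();
    }
}
```

With this sketch, a letter such as œ that has neither a map entry nor a canonical decomposition is deleted outright in step 3, which is one reason the explicit map matters.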

@rmcrackan
Owner

I feel like we're kinda far into the weeds and I'm losing the plot. What's the actual problem we're trying to solve?

Are we looking for the option of just converting German-specific letters and diacritics? From all languages? (ie: option to use ascii letters only.) Are we trying to give the user the ability to select replacements from a list that we provide, such as Jo-Be-Co's examples?

We already have many mapped characters/string options and adding more is user-configurable. The author of #1667 mentions as much but finds it unintuitive to use. I don't disagree. I'm just not the right person to fix UX. I can imagine adding backup/restore or undo/redo to make it easier to recover from accidental changes but much more than that will require someone with actual UX aptitude.

Before we get any further into specifics though -- what's the actual problem we're trying to solve?

@Jo-Be-Co
Contributor

The existing implementation for filesystem‑safe filename normalization is solid and already well‑aligned with the actual needs of the underlying platforms. My concern is specifically about the proposed extension that adds replacements for a small set of German umlauts. As a native German speaker, I understand why these characters were chosen, but this approach addresses only a very narrow slice of the real problem.

Unicode issues in filenames are not limited to German at all. Audible metadata spans many languages, alphabets, and writing systems, and any solution that focuses on a handful of characters will inevitably remain incomplete. The current configuration mechanism allows unrestricted Unicode mappings, but it was designed for a small, manageable set of filesystem‑related edge cases — not for maintaining a comprehensive list of hundreds of potential characters.

For this reason, I believe expanding the mapping table is not the right direction. A more robust and maintainable approach would be a generic normalization mode, enabled via a simple switch, rather than an ever‑growing list of language‑specific exceptions.

@Jo-Be-Co
Contributor

One more point worth mentioning: any comprehensive custom implementation will inevitably require a mapping layer. Unicode normalization alone provides a solid fallback, but it cannot fully replace explicit mappings for certain characters. If the project intentionally avoids using an external library, such a mapping could be loaded from a file to keep the frontend clean and maintain maximum flexibility. However, this also introduces new risks: maintainability, consistency, and the potential for user‑defined mappings to create unexpected behavior.

@rmcrackan
Owner

I think this is out of scope for Libation. We already have a la carte string replacement. I can't imagine a replacement feature which would sufficiently address all of the other concerns here without it becoming its own mature tool. And requiring me/us to become experts in something outside of our focus. This last point is explicitly against my philosophy. I want everyone to stick to what they do best. Libation interfaces with audible-specific stuff; other people are language experts.

@Jo-Be-Co
Contributor

Yes, that makes sense. Given everything we’ve put together here, the best option would be to use a suitable library. A custom implementation would have to be very robust, and it would still need maintenance from time to time.

Since I had already concluded that the letter adjustments should only be made once the template has been processed, this also fits in well with the option of renaming files retrospectively using other tools.

@CLHatch
Contributor

CLHatch commented May 17, 2026

I think this is out of scope for Libation. We already have a la carte string replacement. I can't imagine a replacement feature which would sufficiently address all of the other concerns here without it becoming its own mature tool. And requiring me/us to become experts in something outside of our focus. This last point is explicitly against my philosophy. I want everyone to stick to what they do best. Libation interfaces with audible-specific stuff; other people are language experts.

Unless I'm mistaken, the current a la carte string replacement feature should work perfectly for this anyways, shouldn't it? Perhaps we should just "bake in" some presets to handle various languages? Perhaps be able to add those presets to the existing replacement lists, instead of the "pick from these sets, all or nothing" as we currently have? Could say to "Add the set to replace *, ?, :, etc, with unicode", then add various sets for languages on top of that?

@rmcrackan
Owner

@CLHatch :

Unless I'm mistaken, the current a la carte string replacement feature should work perfectly for this anyways, shouldn't it? Perhaps we should just "bake in" some presets to handle various languages? Perhaps be able to add those presets to the existing replacement lists, instead of the "pick from these sets, all or nothing" as we currently have? Could say to "Add the set to replace *, ?, :, etc, with unicode", then add various sets for languages on top of that?

I feel like we're doing pretty well with the non-letter characters. Letters are the well with no bottom:

  • Would you like latin-only?
  • Should we detect local language and offer substitutions?
  • What kind of sub.s? Just remove diacritics? Multi-characters subs -- such as with ß and þ?
  • Later, when we discover a better way of transliterating a language, do we discover which batch solutions the user selected and try to apply the new changes?

This quickly approaches me/us having to be language experts.

@CLHatch
Contributor

CLHatch commented May 17, 2026

@CLHatch :

Unless I'm mistaken, the current a la carte string replacement feature should work perfectly for this anyways, shouldn't it? Perhaps we should just "bake in" some presets to handle various languages? Perhaps be able to add those presets to the existing replacement lists, instead of the "pick from these sets, all or nothing" as we currently have? Could say to "Add the set to replace *, ?, :, etc, with unicode", then add various sets for languages on top of that?

I feel like we're doing pretty well with the non-letter characters. Letters are the well with no bottom:

  • Would you like latin-only?
  • Should we detect local language and offer substitutions?
  • What kind of sub.s? Just remove diacritics? Multi-characters subs -- such as with ß and þ?
  • Later, when we discover a better way of transliterating a language, do we discover which batch solutions the user selected and try to apply the new changes?

This quickly approaches me/us having to be language experts.

I was thinking more in terms of just pretty much what we already have, straight character substitution, just a bit more "modular". Then people could submit their own files with character replacements for different languages (similar to how localization files are done for some projects, or themes), and users could choose to add those files to their substitution lists? Let others be the "language experts" if they want to be.

@rmcrackan
Owner

@CLHatch I think our brains are in the same place. To me there's only 2 answers:

  1. A special case for latin-only characters (which would still need canonical mapping, but I could just defer to whatever .net provides out of the box)
  2. Explicit string replacement

For the latter case, the implementation is trivial: loop over a list and apply find+replace. The building of that list is where the UI comes in and is non-trivial.
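
As a rough illustration of the "loop over a list and apply find+replace" idea, a minimal sketch could look like the following. The class `FindReplaceSketch` and its `Apply` method are hypothetical names chosen for this example; they are not Libation's actual replacement code.

```csharp
using System;
using System.Collections.Generic;

public static class FindReplaceSketch
{
    // Applies each (Find, Replace) rule in list order.
    // Find strings must be non-empty, otherwise string.Replace throws.
    public static string Apply(string input, IEnumerable<(string Find, string Replace)> rules)
    {
        foreach (var (find, replace) in rules)
            input = input.Replace(find, replace, StringComparison.Ordinal);
        return input;
    }
}
```

Because the rules run one after another, a later rule also sees the output of earlier rules, so the order of the list is part of its meaning.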

As you said, you could seed it with localization files. The easiest way is: pick a set of replacements and add them to our list. If the replacement file changes later, it's up to the user to apply new changes.
The advanced way would be pointers to these files, so that if a file gets an update later, no action is needed by the user.

And the really easy way is: we've already got this tool, only -- the UI/UX sucks. Having established that I suck at UI and the next implementation is likely little more than a UI change, it's hard for me to want to sink more time into something that isn't going to be a lot better.

Conversation aside, and this thread has had great conversation, the only concrete implementation in this thread doesn't do much to advance these needs -- so I'm going to close this PR. I am NOT closing the door to conversation or to future solutions to this problem. I just don't think this particular solution is the one we want to pursue, for all the reasons listed here.

@rmcrackan rmcrackan closed this May 17, 2026
@Jo-Be-Co
Contributor

I was thinking more in terms of just pretty much what we already have, straight character substitution, just a bit more "modular". Then people could submit their own files with character replacements for different languages (similar to how localization files are done for some projects, or themes), and users could choose to add those files to their substitution lists? Let others be the "language experts" if they want to be.

First of all, I don’t think the problem of troublesome characters can really be narrowed down to a specific language per user, even if I only have books in one language in my library.
That, at least, is why I think you either need a fairly comprehensive solution or shouldn’t do it at all.
I actually think it’s a good idea to give the user the option to have problematic characters replaced according to their preferences. But the processing of a suitable file must also be sufficiently stable.

What’s needed here is a suitable fail-safe structure for the file, and a default template as well. But take a look at the ICU file Latin_ASCII.txt. In addition to all the lines listed here, lines 19–21 specify that accents are removed using NFD normalisation.
Before you start parsing a file like that, however, I would recommend using a port of ICU4C or something similar.

So, if we’re not going to leave it out entirely, my current suggestion would be:

  • A checkbox option to remove non-ASCII characters
  • Optional A: an input field where users can enter a list of characters that should not be replaced
  • Optional B: a simple file (with comments) such as ß<TAB>SZ for user-defined replacements (or perhaps JSON?)

For the implementation, I would use an ICU port with de-ASCII:

  • Use a regular expression to search for unwanted characters (taking user defined exceptions into account).
  • Send only the matches through an ICU correction
  • Optional: apply the user-defined replacements

Most of the fixes would therefore live in the library rather than in Libation; you only need to update the libraries from time to time. Users can choose to enable this feature and define exceptions. Option B would allow hardcore users to put their own expertise or preferences into practice.
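
The flow described above (find non-ASCII characters, honour the user's exceptions, convert only the matches) can be sketched without any external dependency. Note the assumptions: `SelectiveAsciiSketch`, `Convert`, and `FoldStandIn` are hypothetical names, the user-defined replacements are checked before the fallback rather than afterwards, and `FoldStandIn` uses plain .NET NFD normalization as a stand-in for the ICU transliterator, so its output will differ from what de-ASCII produces.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class SelectiveAsciiSketch
{
    // Matches runs of non-ASCII UTF-16 code units; ASCII text is never touched.
    private static readonly Regex NonAsciiRun = new(@"[^\x00-\x7F]+", RegexOptions.Compiled);

    public static string Convert(
        string input,
        ISet<char> keep,                                  // Optional A: characters the user wants left alone
        IReadOnlyDictionary<string, string> userRules)    // Optional B: user-defined replacements
    {
        return NonAsciiRun.Replace(input, match =>
        {
            var sb = new StringBuilder();
            foreach (Rune rune in match.Value.EnumerateRunes())
            {
                string text = rune.ToString();
                if (rune.IsBmp && keep.Contains((char)rune.Value))
                    sb.Append(text);                      // user exception: keep as-is
                else if (userRules.TryGetValue(text, out var replacement))
                    sb.Append(replacement);               // user-defined replacement wins
                else
                    sb.Append(FoldStandIn(text));         // fallback conversion
            }
            return sb.ToString();
        });
    }

    // Stand-in for an ICU transliterator call: decompose, then keep only ASCII.
    private static string FoldStandIn(string text)
    {
        var sb = new StringBuilder();
        foreach (char c in text.Normalize(NormalizationForm.FormD))
            if (c < 0x80)
                sb.Append(c);
        return sb.ToString();
    }
}
```

In a real implementation the `FoldStandIn` call is the single place where an ICU port would be plugged in, which keeps the language knowledge in the library rather than in this class.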

@Jo-Be-Co
Contributor

I had to send this after the close, as my comment box was already full and I then had to do other things.

@Jo-Be-Co
Contributor

My solution from "Nerdistan" would go like this:

using ICU4N.Text;
using System;
using System.IO;

public static class CustomAsciiTransliterator
{
    private static readonly Lazy<Transliterator> _instance =
        new Lazy<Transliterator>(() =>
        {
            const string rulesFile = "libation-icu.rules";

            if (File.Exists(rulesFile))
            {
                // Load custom rules from file
                string rules = File.ReadAllText(rulesFile);

                // Positional arguments (ID, rule text, direction) avoid
                // depending on the library's exact parameter names.
                return Transliterator.CreateFromRules(
                    "Custom-DeAscii",
                    rules,
                    TransliterationDirection.Forward
                );
            }

            // Fallback: use ICU's built-in de-ASCII
            return Transliterator.GetInstance("de-ASCII");
        });

    public static string Convert(string input)
    {
        return _instance.Value.Transliterate(input);
    }
}

And a rules file like:

# ===============================================
#  Simple ICU Transliteration Rules
#  Demonstrates:
#    - custom replacements
#    - final fallback to ICU rule set "de-ASCII"
# ===============================================


# -----------------------------------------------
# Custom replacements
# -----------------------------------------------
Æ > AE;
æ > ae;

Þ > Th;
þ > th;

Ä > Ae;
ä > ae;

Ö > Oe;
ö > oe;

Ü > Ue;
ü > ue;


# -----------------------------------------------
# Apply ICU's built-in de-ASCII rules
#    This will handle remaining characters.
#    NOTE: de-ASCII may still override exceptions
#          because no masking is used.
# -----------------------------------------------
:: de-ASCII;

@rmcrackan Would you create an issue here now, or would you rather wait until someone else takes it on?

@Jo-Be-Co
Contributor

A big apology to everyone. I’ve only just realised that this adjustment doesn’t actually extend the replacement dialogue, but ‘merely’ adds a few user preferences. These are then stored in the defaults and the table.

Otherwise, however, the key points remain the same:

  • The behaviour would change with a new release (which some people might not want)
  • Covering only German umlauts is too narrow

Anyone who has trouble with umlauts and other special characters can already enter alternatives here. It’s just that these haven’t been pre-assigned yet. However, I find this pre-assignment (also) problematic.

4 participants