Add German umlaut/eszett to default character replacement presets (#1667) #1809
CryptoJones wants to merge 3 commits into
Adds German umlaut (ä→ae, ö→oe, ü→ue, Ä→Ae, Ö→Oe, Ü→Ue) and eszett (ß→ss) mappings to all six default replacement presets. Users can remove or customize them via the character replacement editor. Covered by unit tests across Default, LoFi, and Barebones presets. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
This PR does not compile. Please test AI-generated code locally before submitting. |
|
Overall, however, I feel this falls a little short. The problem of special characters isn’t limited to German-speaking countries. French and Spanish characters immediately spring to mind. Would Libation's existing replacement mechanism be the right approach here, or would a custom mapping suffice? |
|
BTW: German authorities have a standard that specifies which characters are permitted in names. This applies, for example, to the data recorded on identity documents. There are almost 700 possible characters and accents. Take a look at this regular expression, which validates strings that comply with this standard: |
Holy cow. That's way more than I would have guessed! |
|
To see them, take a look at this file: latin_letters_1.3.txt |
…st data
The #Defaults region of ReplacementCharacters.cs had been auto-formatted
with U+201C/U+201D (curly quotation marks) used as C# string delimiters,
which is not valid syntax and broke every build target. Restore ASCII
straight quotes as delimiters while keeping the intended Unicode
replacement characters (e.g. "“", "”", "＂") as string content.
Also fix the three uppercase-umlaut test rows whose expected values did
not match the A_umlaut/O_umlaut/U_umlaut definitions ("Ae"/"Oe"/"Ue").
The function does not change context outside the umlaut itself, so
all-caps inputs like "ÄRGER" produced "AeRGER", not the expected
"AErger". Switch the rows to mixed-case inputs to test the realistic
case meaningfully.
Verified locally:
- FileManager.Tests UmlautReplacements: 12/12 pass
- LibationFileManager.Tests: no regressions
- (Pre-existing failures on master, unrelated: 2 ConditionalTag
catastrophic-backtracking tests, 1 Tag_culture_test de-CH locale)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A test that pins the actual Unicode codepoints used as OpenQuote (U+201C), CloseQuote (U+201D), and OtherQuote (U+FF02) replacements. Catches re-corruption of the #Defaults block by auto-formatters that substitute curly quotes for ASCII quotes — even when the file still compiles, the replacement output would change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
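A minimal sketch of what such a pinning check could look like. The class and the constants `OpenQuote`, `CloseQuote`, and `OtherQuote` are hypothetical stand-ins; the real definitions live in ReplacementCharacters.cs and may be shaped differently. The expected values are written as numeric code points, so an auto-formatter cannot silently rewrite the expectation itself.

```csharp
using System;

public static class QuoteCodepointPins
{
    // Hypothetical stand-ins for the replacement strings in the #Defaults region.
    // A real test would read these from the production class instead.
    public const string OpenQuote = "\u201C";   // LEFT DOUBLE QUOTATION MARK
    public const string CloseQuote = "\u201D";  // RIGHT DOUBLE QUOTATION MARK
    public const string OtherQuote = "\uFF02";  // FULLWIDTH QUOTATION MARK

    // True only if the string is exactly one UTF-16 code unit with the given value.
    public static bool IsExactly(string value, int codepoint) =>
        value.Length == 1 && value[0] == codepoint;

    // Throws if any replacement drifted away from its pinned code point.
    public static void Verify()
    {
        if (!IsExactly(OpenQuote, 0x201C)) throw new InvalidOperationException("OpenQuote changed");
        if (!IsExactly(CloseQuote, 0x201D)) throw new InvalidOperationException("CloseQuote changed");
        if (!IsExactly(OtherQuote, 0xFF02)) throw new InvalidOperationException("OtherQuote changed");
    }
}
```

Comparing code points rather than string literals is the point of the exercise: a literal comparison would be corrupted by the same formatter pass that corrupted the defaults.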
|
Apologies again, same story as #1810. The xkcd 2347 applies here too. Sorry for the noise on a Wednesday. |
|
Thanks @Jo-Be-Co — fair critique. The German-only framing is arbitrary; Humanizer's One possible compromise: use The DIN 91379 pointer is wild — 500 letters plus 150 sequences. The combining-diacritic sequences are exactly why Unicode-normalization-based approaches (which Humanizer does) win over hand-rolled mappings. Great reference, thank you. |
|
The list includes sequences only where no single-character alternative exists. In fact, however, most of the 500 characters can also be represented in Unicode as a sequence. Accent removal often relies on this decomposition, following the rule "keep the letter and remove the accent". But there are other two-letter replacements in addition to the German ones:
|
|
Thank you very much for the code clean-up. Since I don't know any German beyond "Guten Tag" and @Jo-Be-Co actually is German, I'm deferring to him about the language rules. I'll just do a technical code review:
|
|
A few thoughts on this:
|
|
As already mentioned, non-ASCII characters do not appear exclusively in the German language. Beyond the countries where Audible shops are located, there are undoubtedly other languages in which the books have been written, and likely even more countries from which the authors and narrators originate. The original poster of #1667 also states that they are only affected by German special characters, but assumes that other languages, such as French, are affected too. I therefore believe the solution to this issue is a "Convert filenames and path to ASCII" option next to the replacements. This should then enable the following:
However, these mappings and the code for cleaning the output do need to be maintained. There may be a library that could be used instead. Furthermore, using this option should ensure that the standard replacements remain within the ASCII character set. |
|
I feel like we're kinda far into the weeds and I'm losing the plot. What's the actual problem we're trying to solve? Are we looking for the option of just converting German-specific letters and diacritics? From all languages? (ie: option to use ascii letters only.) Are we trying to give the user the ability to select replacements from a list that we provide, such as Jo-Be-Co's examples? We already have many mapped characters/string options and adding more is user-configurable. The author of #1667 mentions as much but finds it unintuitive to use. I don't disagree. I'm just not the right person to fix UX. I can imagine adding backup/restore or undo/redo to make it easier to recover from accidental changes but much more than that will require someone with actual UX aptitude. Before we get any further into specifics though -- what's the actual problem we're trying to solve? |
|
The existing implementation for filesystem‑safe filename normalization is solid and already well‑aligned with the actual needs of the underlying platforms. My concern is specifically about the proposed extension that adds replacements for a small set of German umlauts. As a native German speaker, I understand why these characters were chosen, but this approach addresses only a very narrow slice of the real problem. Unicode issues in filenames are not limited to German at all. Audible metadata spans many languages, alphabets, and writing systems, and any solution that focuses on a handful of characters will inevitably remain incomplete. The current configuration mechanism allows unrestricted Unicode mappings, but it was designed for a small, manageable set of filesystem‑related edge cases — not for maintaining a comprehensive list of hundreds of potential characters. For this reason, I believe expanding the mapping table is not the right direction. A more robust and maintainable approach would be a generic normalization mode, enabled via a simple switch, rather than an ever‑growing list of language‑specific exceptions. |
|
One more point worth mentioning: any comprehensive custom implementation will inevitably require a mapping layer. Unicode normalization alone provides a solid fallback, but it cannot fully replace explicit mappings for certain characters. If the project intentionally avoids using an external library, such a mapping could be loaded from a file to keep the frontend clean and maintain maximum flexibility. However, this also introduces new risks: maintainability, consistency, and the potential for user‑defined mappings to create unexpected behavior. |
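To illustrate the point about a mapping layer on top of normalization, here is a small sketch that uses only standard .NET APIs (`string.Normalize`, `CharUnicodeInfo`). The class name `AccentFolder` and the contents of the mapping table are assumptions chosen for illustration; they are not part of Libation. Explicit mappings run first, then NFD decomposition strips any remaining combining marks.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class AccentFolder
{
    // Hypothetical explicit mappings: characters where simply dropping the accent
    // gives the wrong result (German umlauts) or that have no canonical
    // decomposition at all (ß, ø).
    private static readonly Dictionary<char, string> Explicit = new Dictionary<char, string>
    {
        ['ä'] = "ae", ['ö'] = "oe", ['ü'] = "ue",
        ['Ä'] = "Ae", ['Ö'] = "Oe", ['Ü'] = "Ue",
        ['ß'] = "ss", ['ø'] = "o", ['Ø'] = "O",
    };

    public static string Fold(string input)
    {
        // Step 1: compose (NFC) so that "a" + combining diaeresis matches the 'ä' key.
        string composed = input.Normalize(NormalizationForm.FormC);

        // Step 2: apply the explicit mappings; everything else is copied unchanged.
        var mapped = new StringBuilder(composed.Length);
        foreach (char c in composed)
        {
            if (Explicit.TryGetValue(c, out string replacement))
                mapped.Append(replacement);
            else
                mapped.Append(c);
        }

        // Step 3: decompose (NFD) and drop the combining marks ("keep the letter,
        // remove the accent"). Characters without a decomposition pass through as-is.
        var result = new StringBuilder(mapped.Length);
        foreach (char c in mapped.ToString().Normalize(NormalizationForm.FormD))
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                result.Append(c);
        }
        return result.ToString();
    }
}
```

Without the `Explicit` table, step 3 alone would turn "ü" into "u" and leave "ß" and "ø" untouched, which is the gap described above. The sketch does not guarantee pure ASCII output: any character that is neither mapped nor decomposable is passed through.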
|
I think this is out of scope for Libation. We already have a la carte string replacement. I can't imagine a replacement feature which would sufficiently address all of the other concerns here without it becoming its own mature tool. And requiring me/us to become experts in something outside of our focus. This last point is explicitly against my philosophy. I want everyone to stick to what they do best. Libation interfaces with audible-specific stuff; other people are language experts. |
|
Yes, that makes sense. Given everything we’ve put together here, the best option would be to use a suitable library. A custom implementation would have to be robust enough to cover all of these cases, and it would still require maintenance from time to time. Since I had already concluded that the letter adjustments should only be made once the template has been processed, this also fits in well with the option of renaming files retrospectively using other tools. |
Unless I'm mistaken, the current a la carte string replacement feature should work perfectly for this anyways, shouldn't it? Perhaps we should just "bake in" some presets to handle various languages? Perhaps be able to add those presets to the existing replacement lists, instead of the "pick from these sets, all or nothing" as we currently have? Could say to "Add the set to replace *, ?, :, etc, with unicode", then add various sets for languages on top of that? |
|
@CLHatch :
I feel like we're doing pretty well with the non-letter characters. Letters are the well with no bottom:
This quickly approaches me/us having to be language experts. |
I was thinking more in terms of just pretty much what we already have, straight character substitution, just a bit more "modular". Then people could submit their own files with character replacements for different languages (similar to how localization files are done for some projects, or themes), and users could choose to add those files to their substitution lists? Let others be the "language experts" if they want to be. |
|
@CLHatch I think our brains are in the same place. To me there's only 2 answers:
For the latter case, the implementation is trivial: loop over a list and apply find+replace. The building of that list is where the UI comes in and is non-trivial. As you said, you could seed it with localization files. The easiest way is: pick a set of replacements and add them to our list. If the replacement file changes later, it's up to the user to apply new changes. And the really easy way is: we've already got this tool, only -- the UI/UX sucks. Having established that I suck at UI and the next implementation is likely little more than a UI change, it's hard for me to want to sink more time into something that isn't going to be a lot better. Conversation aside, and this thread has had great conversation, the only concrete implementation in this thread doesn't do much to advance these needs -- so I'm going to close this PR. I am NOT closing the door to conversation or to future solutions to this problem. I just don't think this particular solution is the one we want to pursue, for all the reasons listed here. |
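The "loop over a list and apply find+replace" idea, combined with the modular sets suggested earlier in the thread, could be sketched roughly as follows. The class name and the two sets are hypothetical; in the proposal they would come from contributed files rather than being hard-coded, and none of this reflects Libation's actual replacement classes.

```csharp
using System;

public static class ReplacementSets
{
    // Hypothetical language set, hard-coded here only for illustration.
    public static readonly (string Find, string Replace)[] German =
    {
        ("ä", "ae"), ("ö", "oe"), ("ü", "ue"),
        ("Ä", "Ae"), ("Ö", "Oe"), ("Ü", "Ue"),
        ("ß", "ss"),
    };

    // A second hypothetical set that a user could add on top of the first.
    public static readonly (string Find, string Replace)[] Ligatures =
    {
        ("œ", "oe"), ("æ", "ae"),
    };

    // Applies every pair of every selected set, in order.
    // string.Replace(string, string) is an ordinal, culture-insensitive replacement.
    public static string Apply(string input, params (string Find, string Replace)[][] sets)
    {
        foreach (var set in sets)
        {
            foreach (var (find, replace) in set)
            {
                input = input.Replace(find, replace);
            }
        }
        return input;
    }
}
```

For example, `ReplacementSets.Apply("cœur für alle", ReplacementSets.German, ReplacementSets.Ligatures)` yields "coeur fuer alle". Because each pair is a plain substitution with no knowledge of its context, an all-caps input such as "ÄRGER" becomes "AeRGER", which matches the behaviour described in the commit message above.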
First of all, I don’t think the problem with troublesome characters can really be narrowed down to a specific, user-dependent language, even if I only have books in one language in my library. What’s needed here is a suitable fail-safe structure for the file, and a default template as well. But take a look at the ICU file Latin_ASCII.txt. In addition to all the lines listed here, lines 19–21 specify that accents are removed using NFD normalisation. So, if we’re not going to leave it out entirely, my current suggestion would be:
For the implementation, I would use an ICU port with de_ASCII:
Most of the fixes would therefore live in the library rather than in Libation; you would only need to update the library from time to time. Users can choose to enable this feature and define exceptions. Option B would allow hardcore users to put their own expertise or preferences into practice. |
|
I had to send this after the close, as my comment box was already full and I then had to do other things. |
|
My solution from "Nerdistan" would go like this:

    using ICU4N.Text;
    using System;
    using System.IO;

    public static class CustomAsciiTransliterator
    {
        // Built once, on first use; Lazy<T> is thread-safe by default.
        private static readonly Lazy<Transliterator> _instance =
            new Lazy<Transliterator>(() =>
            {
                // Resolved relative to the current working directory.
                const string rulesFile = "libation-icu.rules";
                if (File.Exists(rulesFile))
                {
                    // Load custom rules from file
                    string rules = File.ReadAllText(rulesFile);
                    return Transliterator.CreateFromRules(
                        "Custom-DeAscii",
                        rules,
                        TransliterationDirection.Forward
                    );
                }
                // Fallback: use ICU's built-in de-ASCII
                return Transliterator.GetInstance("de-ASCII");
            });

        // Transliterates the input with the custom rules, or with de-ASCII as fallback.
        public static string Convert(string input)
        {
            return _instance.Value.Transliterate(input);
        }
    }

And a rules file like:

@rmcrackan Would you create an issue here now, or would you rather wait until someone else takes it on? |
|
A big apology to everyone. I’ve only just realised that this adjustment doesn’t actually extend the replacement dialogue, but ‘merely’ adds a few user preferences. These are then stored in the defaults and the table. Otherwise, however, the key points remain the same:
Anyone who has trouble with umlauts and other special characters can already enter alternatives here. It’s just that these haven’t been pre-assigned yet. However, I find this pre-assignment (also) problematic. |
Summary
- ä→ae, ö→oe, ü→ue
- Ä→Ae, Ö→Oe, Ü→Ue
- ß→ss
- Replacement (same pattern as Colon(), Pipe(), etc.) so they can be referenced by name in tests and future presets
Motivation
Umlauts and eszett are valid filename characters on all modern filesystems but cause real compatibility problems when files are shared across systems or uploaded to services that normalize Unicode. This has been a long-standing pain point for German-language Audible content (#1667).
Test plan
- UmlautReplacements test class passes (13 cases across Default, LoFi, and Barebones presets on both platforms)
- GetSafePath, GetSafeFileName, and GetValidFilename tests still pass (no regressions)
- ae in the filename instead of ae being stripped to _
Closes #1667
🤖 Generated with Claude Code