Add <title sort>, <title short sort>, <first series sort> template tags (#1620) #1810

Open
CryptoJones wants to merge 2 commits into rmcrackan:master from CryptoJones:feat/1620-sort-tags

@CryptoJones

Summary

Adds three new template tags that strip a leading English article (A / An / The, case-insensitive) from the resolved value:

| Tag | Example input | Output |
|---|---|---|
| <title sort> | The Hobbit: There and Back Again | Hobbit: There and Back Again |
| <title short sort> | The Hobbit: There and Back Again | Hobbit |
| <first series sort> | The Lord of the Rings | Lord of the Rings |

Words that merely start with "The", "A", or "An" but aren't whole-word articles are untouched (e.g. "Theatre of War" → "Theatre of War").
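The whole-word rule can be sketched as follows. This is a minimal illustration, not the PR's actual `StripLeadingArticle()` helper (its body is not shown in this conversation); the class name `SortTagSketch` and the choice of a regular expression are assumptions.

```csharp
using System.Text.RegularExpressions;

public static class SortTagSketch
{
    // Matches one leading English article only when whitespace follows it,
    // so "Theatre" or "Another" are not mistaken for "The" / "An".
    private static readonly Regex LeadingArticle =
        new Regex(@"^\s*(?:the|an|a)\s+", RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    public static string StripLeadingArticle(string value)
    {
        if (string.IsNullOrWhiteSpace(value))
            return value ?? string.Empty;

        // Replace at most one match; the pattern is anchored at the start anyway.
        var stripped = LeadingArticle.Replace(value, string.Empty, 1);

        // Assumption: if stripping would leave nothing, keep the original value.
        return stripped.Length == 0 ? value : stripped;
    }
}
```

With this sketch, "The Hobbit: There and Back Again" becomes "Hobbit: There and Back Again", while "Theatre of War" is returned unchanged because no whitespace follows "The".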

Usage examples

Folder template that names the folder by the article-stripped short title:

<title short sort> [<id>]

Mixed template that files a book under its article-stripped series name (when it has one) and names it by the article-stripped title:

<if series-><first series sort>\<-if series><title sort> [<id>]

Changes

  • TemplateTags.cs: three new TemplateTags static properties
  • Templates.cs: StripLeadingArticle() private helper; registered in filePropertyTags (used by Folder, File, and ChapterFile templates) and the chapterPropertyTags inner collection
  • TemplatesTests.cs: new SortTags test class with 14 cases covering all three tags, no-article pass-through, case-insensitivity, and availability in the chapter template

Closes #1620

🤖 Generated with Claude Code

…gs (rmcrackan#1620)

Adds three new file/folder naming template tags that strip a leading
article (A/An/The, case-insensitive) from the resolved value:
  <title sort>       — full title, article removed
  <title short sort> — title up to first colon, article removed
  <first series sort>— first series name, article removed

Useful for organizing libraries so "The Hobbit" files into "H/" instead
of "T/". Article stripping is additive-only; existing templates are
unchanged. Covered by unit tests in TemplatesTests.SortTags.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@Jo-Be-Co
Contributor

I’ve considered similar cuts before.

Here are a few general comments:

  • These cuts should be available for every language so that the new tags also work for non-English books.

  • Would we always remove all leading components of all languages, or should we refer to the respective language of the book?

  • Currently, only a few selected tags are provided with the new logic. While this doesn’t make sense for every tag, there are at least more title tags and the list tag for series.

  • How about a formatting option for text output? Alongside the existing options for uppercase, lowercase, title case, and length restriction (as in <title[10l]>), we could add another letter, such as S.

  • Since you’ve already named the new tags with “sort,” would that also be a possible sorting option (at least for series)?

A completely different approach would be to give the user the possibility of text replacement. For example, with a replace-tag that is set around another tag. The regular expression ^(A|An|The) could, for example, capture the current list.
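A rough sketch of what such a replacement step could do to an already resolved tag value. `ReplaceSketch` and `ApplyReplace` are hypothetical names, not part of Libation, and the match timeout is an assumed safeguard for user-supplied patterns. The sketch also shows that `^(A|An|The)` as written needs a trailing `\s+` to behave like the whole-word rule in the PR description.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReplaceSketch
{
    // Hypothetical helper: applies a user-supplied pattern to a resolved tag value.
    // The timeout guards against pathological user patterns (an assumption, not a Libation setting).
    public static string ApplyReplace(string value, string pattern, string replacement = "")
        => Regex.Replace(value, pattern, replacement, RegexOptions.IgnoreCase, TimeSpan.FromMilliseconds(100));
}

// ReplaceSketch.ApplyReplace("The Hobbit", @"^(A|An|The)")        -> " Hobbit"   (leading space remains)
// ReplaceSketch.ApplyReplace("Theatre of War", @"^(A|An|The)")    -> "atre of War"
// ReplaceSketch.ApplyReplace("The Hobbit", @"^(A|An|The)\s+")     -> "Hobbit"
// ReplaceSketch.ApplyReplace("Theatre of War", @"^(A|An|The)\s+") -> "Theatre of War"
```

Requiring whitespace after the article is what keeps "Theatre" and similar words intact.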

@rmcrackan
Owner

This PR does not compile. Please test AI-generated code locally before submitting.

The new SortTags class lives in namespace Templates_ChapterFile_Tests
but referenced Shared.GetLibraryBook(). The Shared class is in
namespace TemplatesTests; the file's top-level `using static
TemplatesTests.Shared;` brings the methods in unqualified, so drop the
`Shared.` prefix to match the surrounding test conventions.

Verified locally: SortTags tests pass (14/14), full project builds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@CryptoJones
Author

Sincere apologies — you're absolutely right, and I'm sorry for the wasted CI runs. The PR went up without a local build (the new test class moved to a different namespace and lost access to the unqualified Shared symbol). Just pushed 002ebbb6 with the fix and verified dotnet test locally first this time — 14/14 new test cases pass.

Also: obligatory xkcd 2347 — you're the Nebraska person here. Thank you for the time you sink into reviewing AI-generated noise like this.

@CryptoJones
Author

Thanks for the careful review, @Jo-Be-Co. Genuinely good points:

  • Multi-language articles: Agreed, hard-coding English A/An/The is a real limitation. Cleanest fix is probably keying off the book's Language field. The replace-tag idea you raise below would subsume this entirely.
  • Tag selectivity: Fair — narrow scoping was deliberate but extending to author / series list is straightforward; happy to do that here or as a follow-up.
  • Format-option vs. dedicated tag: I prefer your <title[…s]>-style modifier over a separate <title sort> tag — it composes better. Open to reworking the PR that way if maintainers prefer.
  • series sort as a sort order: good catch, worth a separate issue.
  • Replace-tag with regex: that's the most flexible design and probably the right long-term answer. I think it deserves its own issue rather than expanding this PR.

Will hold off on rework until there's direction from maintainers.

@rmcrackan
Owner

I'm really torn on this. I like this idea. It very much feels like the kind of thing I might have included myself in Libation's early days when it was just me and my English-speaking self. And let's be honest -- everything about Libation is anglocentric. BUT the books it liberates are not -- and this feature is about those books and their potentially non-English metadata. I'll think about this.

I'm not crazy about the proposed syntax but you and @Jo-Be-Co are working through that; I'll chime in after you 2 come to a consensus.

@rmcrackan
Owner

I couldn't find an off-the-shelf solution and it looks like Humanizer removed this in v3 (boo!). But a good-enough version seems straightforward (I know: famous last words).

private static Dictionary<string, string[]> LeadingArticles { get; } = new(StringComparer.OrdinalIgnoreCase)
{
    ["en"] = new[] { "the", "a", "an" },
    ["fr"] = new[] { "le", "la", "les", "l'", "un", "une", "des" },
    ["es"] = new[] { "el", "la", "los", "las", "un", "una", "unos", "unas" },
    ["it"] = new[] { "il", "lo", "la", "i", "gli", "le", "l'", "un", "uno", "una" },
    ["de"] = new[] { "der", "die", "das", "ein", "eine", "einen", "einem", "einer" },
    ["pt"] = new[] { "o", "a", "os", "as", "um", "uma", "uns", "umas" },
    ["nl"] = new[] { "de", "het", "een" },
    ["sv"] = new[] { "en", "ett" },
};

public static string ToSorted(string title, string? languageHint = null)
{
    var trimmed = title.TrimStart();
    var lang = languageHint ?? DetectLanguage(trimmed) ?? "en";
    if (LeadingArticles.TryGetValue(lang, out var articles))
    {
        foreach (var art in articles)
        {
            var prefix = art.EndsWith("'") ? art : art + " ";
            if (trimmed.StartsWithInsensitive(prefix))
                return trimmed[prefix.Length..].TrimStart();
        }
    }
    return trimmed;
}

While I was playing with this, I also found some string sorting algo notes about sorting without diacritics, which relates to the other PR discussion. The Normalization Form D on line 1 of the snippet below is "canonical decomposition". It decomposes a single accented character into a plain letter plus a combining character. It prints the same but is now 2 characters, which allows us to strip the combining character in a later step. (Normalize method, NormalizationForm enum)

example:

  • before:
    U+00E9 (LATIN SMALL LETTER E WITH ACUTE)
  • after:
    U+0065 (plain e)
    U+0301 (COMBINING ACUTE ACCENT)
var normalized = trimmed.Normalize(NormalizationForm.FormD);
var sb = new StringBuilder(normalized.Length);
foreach (var ch in normalized)
    if (CharUnicodeInfo.GetUnicodeCategory(ch) != UnicodeCategory.NonSpacingMark)
        sb.Append(ch);
return sb.ToString().ToLowerInvariant();

@Jo-Be-Co
Contributor

Two good approaches that point in the right direction.

That said, I'm not sure whether I like the special treatment of `'` or whether I would rather add the trailing spaces to the other terms. In the end, it will come down to a developer-maintained solution anyway.

I would probably have stored one regular expression per language here again. 🤓

The topic of Unicode accent removal is exactly how to customise a large part without the need for explicit mapping. But we shouldn't delve into this here.
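The one-regular-expression-per-language idea could look roughly like this. It is only a sketch: `ArticlePatterns` is a hypothetical name, the language lists are shortened from the dictionary earlier in the thread, and accepting the typographic apostrophe for French is an added assumption. Encoding the trailing whitespace in each pattern removes the need to special-case `'` in code.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ArticlePatterns
{
    private const RegexOptions Options = RegexOptions.IgnoreCase | RegexOptions.CultureInvariant;

    // One anchored pattern per language code. Elided articles (l') need no trailing whitespace.
    private static readonly Dictionary<string, Regex> ByLanguage =
        new Dictionary<string, Regex>(StringComparer.OrdinalIgnoreCase)
        {
            ["en"] = new Regex(@"^\s*(?:the|an|a)\s+", Options),
            ["fr"] = new Regex(@"^\s*(?:l['’]|(?:les|le|la|une|un|des)\s+)", Options),
            ["de"] = new Regex(@"^\s*(?:der|die|das|einen|einem|einer|eine|ein)\s+", Options),
        };

    // Returns the title without its leading article, or unchanged if the language is unknown.
    public static string ToSorted(string title, string languageCode)
    {
        if (ByLanguage.TryGetValue(languageCode, out var pattern))
            return pattern.Replace(title, string.Empty, 1);
        return title;
    }
}
```

The language code matters: in this sketch "Die Hard" stays as it is for "en" but becomes "Hard" for "de", which is the argument for keying off the book's language rather than stripping every known article.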

@Jo-Be-Co
Contributor

So what solutions do I see here:

  • The existing implementation should be extended to other languages. In most cases, the language of the book should be the correct choice.

  • If this is implemented with new tags, it should be feature complete from the start. A <series sort> will then be a bit more complex.

  • The series tags already have a format option {N} for the output of the value. Another letter could be added here that implements the shortening. It would then also be usable directly for sorting: <series[format({X}) sort(X)]>. Unfortunately, nothing comparable exists yet for the simple text fields, but it would be worth considering (analogous to <tag[{S:3}]>): <title[5U]> <title[{S:5U}]> <title[{X:5U}]> ...

  • On the other hand, intervening in the text formatting means the extension is implemented in only one place. It then works everywhere, although it will not be needed everywhere: `<title[X]>`, `<series[format({N:X})]>`

  • A <replace> tag, or an equivalent option for the text outputs, is certainly very powerful and universally usable. But this feels like taking a truck to go shopping.

Personally, I would prefer one of the variants with X, whereby the X should of course be a wisely chosen letter here.

@Jo-Be-Co
Contributor

var normalized = trimmed.Normalize(NormalizationForm.FormD);
var sb = new StringBuilder(normalized.Length);
foreach (var ch in normalized)
    if (CharUnicodeInfo.GetUnicodeCategory(ch) != UnicodeCategory.NonSpacingMark)
        sb.Append(ch);
return sb.ToString().ToLowerInvariant();

My compact variant:

var cleaned = Regex.Replace(input.Normalize(NormalizationForm.FormD), @"\p{Mn}+", "");

I think it is quite likely that these replacements are rather rare. In that case, the character-by-character processing with a StringBuilder would only produce a copy of the string. A regexp without a match might be more efficient here.
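For comparison, both variants can be wrapped as self-contained methods. This is a sketch under assumptions: the names `DiacriticsSketch`, `StripWithLoop`, and `StripWithRegex` are made up here, and the lowercasing step from the loop version is left out so the two results are directly comparable. No performance measurement is implied.

```csharp
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

public static class DiacriticsSketch
{
    // Loop variant: decompose, then copy every char except non-spacing marks.
    public static string StripWithLoop(string input)
    {
        var normalized = input.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(normalized.Length);
        foreach (var ch in normalized)
            if (CharUnicodeInfo.GetUnicodeCategory(ch) != UnicodeCategory.NonSpacingMark)
                sb.Append(ch);
        return sb.ToString();
    }

    // Regex variant: \p{Mn} is the Unicode category "Mark, nonspacing".
    public static string StripWithRegex(string input)
        => Regex.Replace(input.Normalize(NormalizationForm.FormD), @"\p{Mn}+", "");
}
```

Both methods turn "Les Misérables" into "Les Miserables" and leave plain ASCII titles such as "Hobbit" unchanged. Which one is faster for mostly accent-free titles would need to be benchmarked.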

Development

Successfully merging this pull request may close these issues.

Removal of "A", "And", and "The" from the beginning of either a book series or book title

3 participants