Skip to content

feat(website, backend): SeqSet citation tracking#6304

Open
tombch wants to merge 70 commits into
mainfrom
seqset-citations
Open

feat(website, backend): SeqSet citation tracking#6304
tombch wants to merge 70 commits into
mainfrom
seqset-citations

Conversation

@tombch
Copy link
Copy Markdown
Collaborator

@tombch tombch commented Apr 27, 2026

Summary

Schema Changes

  • Adds a seqset_citing_source table: a 'citing source' being any publication that references one or more SeqSets. A citing source must have a DOI, title, year and contributors, and can have an origin either from Crossref or manually curated (manual curation endpoints will be added in a follow-up PR).
  • Adds a seqset_to_citing_source table for database joins between citing sources and the seqsets they cite.

Backend Changes

  • Adds a scheduled task to update SeqSet citation sources with results from Crossref. The task has an initial delay of one minute, then runs every 6 hours. It first checks if the Crossref backend service is active, then queries the Crossref Cited-by API to retrieve all citations for the DOI prefix configured in Loculus, merges these results into individual citing sources, logs any conflicts, and then upserts these results in the database. If a curated citing source exists for a DOI found on Crossref, it is updated with the results from Crossref.
  • Updates the backend endpoint /get-seqset-cited-by-publication: for a particular Seqset ID and version, retrieves citations and returns their source DOIs, titles, years and contributors for each citation of the SeqSet. This endpoint's return type has been modified: previously it returned the backend CitedBy type, now it returns List<SeqSetCitation>.
  • There aren't any endpoints yet for manual curation of citations, or for aggregating citations for groups, sequences etc - but I think these would be better in a follow-up PR.
  • Also - there should probably be some cleanup task for removing citing sources if all their SeqSets all get deleted - if that is still a supported feature.

Website Changes

  • Adds a View Citations modal to the SeqSet details page which lists citations for the seqset.
  • Updates the total count of citations and the citations over time graph on the SeqSet details page to use the get-seqset-citations endpoint.
  • The user citations graph doesn't display citations yet, but that will be updated in a follow-up PR.

Additional website changes

  • Updates the other modals on the SeqSet details page to use the BaseDialog component for consistency.

Screenshot

PR Checklist

  • All necessary documentation has been adapted.
  • The implemented feature is covered by appropriate, automated tests.
  • Any manual testing that has been done is documented (i.e. what exactly was tested?)

🚀 Preview: Add preview label to enable

@claude claude Bot added website Tasks related to the web application backend related to the loculus backend component seqset Related to seqsets labels Apr 27, 2026
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Apr 27, 2026

This PR may be related to: #1501 (Check Crossref via API to see what papers citing DOIs)

@tombch tombch changed the title feat(website, backend): SeqSet citation details feat(website, backend): SeqSet citation tracking May 7, 2026
Comment thread website/src/components/SeqSetCitations/SeqSetCitationsList.tsx
isLoading: isSeqSetCitationsLoading,
error: seqSetCitationsError,
data: seqSetCitations,
} = seqSetCitationClientHooks(clientConfig).useGetSeqSetCitedBy({
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SeqSetItem and SeqSetCitationsList both issue useGetSeqSetCitedBy with identical params. React Query will deduplicate in-flight requests and share the cache, so there's no extra network call once the data is loaded — but the hook is called on every render of SeqSetCitationsList, including every time the modal opens. This is fine for now, but note that if SeqSetCitationsList is ever moved out of the query-provider boundary it will need its own cache setup.

Copy link
Copy Markdown
Collaborator Author

@tombch tombch May 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed the issue with the hook called on every render, but the duplication is still there - this could be solved by either moving the fetching into the server-side, or update the withQueryProvider to use a shared query client so we have deduping of requests, but that's a change for outside this PR

Comment thread website/src/components/SeqSetCitations/SeqSetItemActions.tsx
Comment thread backend/src/main/kotlin/org/loculus/backend/service/crossref/CrossRefService.kt Outdated
@theosanderson
Copy link
Copy Markdown
Member

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 153cdaccd2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +119 to +121
properties.endpoint +
"/servlet/getForwardLinks?usr=${properties.username}&pwd=${properties.password}&doi=$doiPrefix&endDate=$endDate&include_postedcontent=true",
).toURL().openConnection() as HttpURLConnection
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge URL-encode Crossref query params before building URI

This request string injects username, password, and doiPrefix directly into the query. If any of those values contains reserved characters (for example & or = in a password), the query gets corrupted and the Crossref call will fail or send wrong parameters, which stops citation ingestion. Build the URL from separately encoded query parameter values instead of raw string interpolation.

Useful? React with 👍 / 👎.

Comment on lines +71 to +73
return doc.select("forward_link").map { forwardLink ->
val seqSetDOI = forwardLink.attr("doi").takeIf { it.isNotBlank() }
?: throw IllegalStateException("CrossRef forward_link missing SeqSet DOI: $forwardLink")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Skip malformed forward_link entries instead of aborting sync

The parser currently throws as soon as one forward_link is missing a required field, and that exception bubbles up as a fatal failure for the whole fetch/parse path. In practice this means one bad citation record from Crossref can block updates for all otherwise valid citations in that run, leaving data stale until the bad record disappears. Handle malformed entries per-item (log and continue) so the task can still ingest valid results.

Useful? React with 👍 / 👎.

val seqSetDOIs: Set<String> = emptySet(),
)

data class SeqSetCitation(
Copy link
Copy Markdown
Contributor

@maverbiest maverbiest May 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to compose this into SeqSetCitingSource?

I also find that the names of the two classes here don't convey their meaning very clearly (at least to me). Would something like SeqSetCitingPublication and Publication work? (or Record or w/e)

data class SeqSetCitingPublication(
    val publication: Publication,
    val seqSetDOIs: Set<String> = emptySet(),
)

data class Publication(
    val sourceDOI: String,
    val title: String,
    val year: Int,
    val contributors: List<Contributor>,
)

val seqSetDOI: String?,
)

data class CitationContributor(val givenName: String, val surname: String)
Copy link
Copy Markdown
Contributor

@maverbiest maverbiest May 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to my comment below, to me the thing that people contribute to is the Publication (or CitingSource etc.), not to the act of citing. The citation is a connection between two publications, not really it's own entity

(of course, feel free to disagree if you think I'm bikeshedding)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I agree in the current layout this one should be CitingSourceContributor

}

@Test
fun `parseCrossRefCitedByXML returns empty list for completely malformed input`() {
Copy link
Copy Markdown
Contributor

@maverbiest maverbiest May 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this should actually fail quite loudly.

It seems a bit inconsistent that atm we reject a whole batch of citations if one of them is missing a field, but for completely malformed XML we return silently and the task logs "Fetched 0 citing source(s)..." at INFO

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I agree, I left it like this as it seems to be the default behaviour for the Jsoup xml parser but I will update

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated this now to fail loudly for completely malformed xml (e.g. <<crossref_result>...) as well as xml missing <crossref_result>

* Runs every six hours, with an initial delay of one minute.
* Adds citing sources from CrossRef, and connects to SeqSets via their DOI.
*/
@Scheduled(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you hard-code how the task is scheduled, which is consistent with the CleanUpAuxTableTask. However, for the CleanUpStaleSequencesInProcessingTask the frequency is set in application.properties and then threaded through BackendSpringProperty.

It's not directly related to this PR but might make sense to do this consistently for scheduled tasks, perhaps someone else would like to weigh in?


private val log = mu.KotlinLogging.logger {}

internal fun mergeCitingSources(citingSources: List<SeqSetCitingSource>): Set<SeqSetCitingSource> {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are SeqSetCitingSources guaranteed to be ordered by date here? To me it would make sense to overwrite information from older entries with that from newer entries with the same sourceDOI (but maybe that's already happening)

Copy link
Copy Markdown
Collaborator Author

@tombch tombch May 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Crossref cited-by api doesn't return any indicator of when a citation was made, so there is no ordering by date. But (and I might be wrong here) all forward links with the same sourceDOI should be from the same time, as its all the same publication. So title/year/authors should be identical

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend related to the loculus backend component seqset Related to seqsets update_db_schema website Tasks related to the web application

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants