feat(website, backend): SeqSet citation tracking#6304
Conversation
|
This PR may be related to: #1501 (Check Crossref via API to see what papers citing DOIs) |
…d exposing doiPrefix for use outside of the service
…, added periodic task to update citations with crossref results
| isLoading: isSeqSetCitationsLoading, | ||
| error: seqSetCitationsError, | ||
| data: seqSetCitations, | ||
| } = seqSetCitationClientHooks(clientConfig).useGetSeqSetCitedBy({ |
There was a problem hiding this comment.
SeqSetItem and SeqSetCitationsList both issue useGetSeqSetCitedBy with identical params. React Query will deduplicate in-flight requests and share the cache, so there's no extra network call once the data is loaded — but the hook is called on every render of SeqSetCitationsList, including every time the modal opens. This is fine for now, but note that if SeqSetCitationsList is ever moved out of the query-provider boundary it will need its own cache setup.
There was a problem hiding this comment.
I fixed the issue with the hook called on every render, but the duplication is still there - this could be solved by either moving the fetching into the server-side, or update the withQueryProvider to use a shared query client so we have deduping of requests, but that's a change for outside this PR
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 153cdaccd2
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| properties.endpoint + | ||
| "/servlet/getForwardLinks?usr=${properties.username}&pwd=${properties.password}&doi=$doiPrefix&endDate=$endDate&include_postedcontent=true", | ||
| ).toURL().openConnection() as HttpURLConnection |
There was a problem hiding this comment.
URL-encode Crossref query params before building URI
This request string injects username, password, and doiPrefix directly into the query. If any of those values contains reserved characters (for example & or = in a password), the query gets corrupted and the Crossref call will fail or send wrong parameters, which stops citation ingestion. Build the URL from separately encoded query parameter values instead of raw string interpolation.
Useful? React with 👍 / 👎.
| return doc.select("forward_link").map { forwardLink -> | ||
| val seqSetDOI = forwardLink.attr("doi").takeIf { it.isNotBlank() } | ||
| ?: throw IllegalStateException("CrossRef forward_link missing SeqSet DOI: $forwardLink") |
There was a problem hiding this comment.
Skip malformed forward_link entries instead of aborting sync
The parser currently throws as soon as one forward_link is missing a required field, and that exception bubbles up as a fatal failure for the whole fetch/parse path. In practice this means one bad citation record from Crossref can block updates for all otherwise valid citations in that run, leaving data stale until the bad record disappears. Handle malformed entries per-item (log and continue) so the task can still ingest valid results.
Useful? React with 👍 / 👎.
| val seqSetDOIs: Set<String> = emptySet(), | ||
| ) | ||
|
|
||
| data class SeqSetCitation( |
There was a problem hiding this comment.
Would it make sense to compose this into SeqSetCitingSource?
I also find that the names of the two classes here don't convey their meaning very clearly (at least to me). Would something like SeqSetCitingPublication and Publication work? (or Record or w/e)
data class SeqSetCitingPublication(
val publication: Publication,
val seqSetDOIs: Set<String> = emptySet(),
)
data class Publication(
val sourceDOI: String,
val title: String,
val year: Int,
val contributors: List<Contributor>,
)| val seqSetDOI: String?, | ||
| ) | ||
|
|
||
| data class CitationContributor(val givenName: String, val surname: String) |
There was a problem hiding this comment.
Similar to my comment below, to me the thing that people contribute to is the Publication (or CitingSource etc.), not to the act of citing. The citation is a connection between two publications, not really it's own entity
(of course, feel free to disagree if you think I'm bikeshedding)
There was a problem hiding this comment.
yeah I agree in the current layout this one should be CitingSourceContributor
| } | ||
|
|
||
| @Test | ||
| fun `parseCrossRefCitedByXML returns empty list for completely malformed input`() { |
There was a problem hiding this comment.
I feel like this should actually fail quite loudly.
It seems a bit inconsistent that atm we reject a whole batch of citations if one of them is missing a field, but for completely malformed XML we return silently and the task logs "Fetched 0 citing source(s)..." at INFO
There was a problem hiding this comment.
Yeah I agree, I left it like this as it seems to be the default behaviour for the Jsoup xml parser but I will update
There was a problem hiding this comment.
Updated this now to fail loudly for completely malformed xml (e.g. <<crossref_result>...) as well as xml missing <crossref_result>
| * Runs every six hours, with an initial delay of one minute. | ||
| * Adds citing sources from CrossRef, and connects to SeqSets via their DOI. | ||
| */ | ||
| @Scheduled( |
There was a problem hiding this comment.
Here you hard-code how the task is scheduled, which is consistent with the CleanUpAuxTableTask. However, for the CleanUpStaleSequencesInProcessingTask the frequency is set in application.properties and then threaded through BackendSpringProperty.
It's not directly related to this PR but might make sense to do this consistently for scheduled tasks, perhaps someone else would like to weigh in?
|
|
||
| private val log = mu.KotlinLogging.logger {} | ||
|
|
||
| internal fun mergeCitingSources(citingSources: List<SeqSetCitingSource>): Set<SeqSetCitingSource> { |
There was a problem hiding this comment.
Are SeqSetCitingSources guaranteed to be ordered by date here? To me it would make sense to overwrite information from older entries with that from newer entries with the same sourceDOI (but maybe that's already happening)
There was a problem hiding this comment.
The Crossref cited-by api doesn't return any indicator of when a citation was made, so there is no ordering by date. But (and I might be wrong here) all forward links with the same sourceDOI should be from the same time, as its all the same publication. So title/year/authors should be identical
…issing crossref_result root element
Summary
Schema Changes
seqset_citing_sourcetable: a 'citing source' being any publication that references one or more SeqSets. A citing source must have a DOI, title, year and contributors, and can have an origin either from Crossref or manually curated (manual curation endpoints will be added in a follow-up PR).seqset_to_citing_sourcetable for database joins between citing sources and the seqsets they cite.Backend Changes
/get-seqset-cited-by-publication: for a particular Seqset ID and version, retrieves citations and returns their source DOIs, titles, years and contributors for each citation of the SeqSet. This endpoint's return type has been modified: previously it returned the backendCitedBytype, now it returnsList<SeqSetCitation>.Website Changes
View Citationsmodal to the SeqSet details page which lists citations for the seqset.get-seqset-citationsendpoint.Additional website changes
BaseDialogcomponent for consistency.Screenshot
PR Checklist
🚀 Preview: Add
previewlabel to enable