Simplify query for runs to export for samples and data during folder export#7448
Simplify query for runs to export for samples and data during folder export#7448
Conversation
There was a problem hiding this comment.
Pull request overview
This pull request refactors the folder export logic for sample types and data classes to reduce memory usage by replacing in-memory filtering with direct SQL queries. Instead of loading all ExpData, ExpMaterial, and ExpRun objects into memory and filtering them in Java, the new implementation queries for run IDs directly from the database with appropriate filters.
Changes:
- Replaced memory-intensive object loading and Java-based filtering with SQL-based queries for determining which runs to export
- Added two new methods to ExperimentService API:
getDerivationRunIdsForDataClassExportandgetDerivationRunIdsForSampleTypesExport - Removed intermediate data structures and helper methods (
isValidRunType) that were used for in-memory filtering
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| experiment/src/org/labkey/experiment/samples/SampleTypeFolderWriter.java | Simplified run selection by replacing material loading and filtering with direct SQL query; removed isValidRunType method |
| experiment/src/org/labkey/experiment/samples/DataClassFolderWriter.java | Simplified run selection by replacing data loading and filtering with direct SQL query; removed isValidRunType method |
| experiment/src/org/labkey/experiment/api/ExperimentServiceImpl.java | Implemented two new SQL-based query methods that return filtered run IDs for folder export |
| api/src/org/labkey/api/exp/api/ExperimentService.java | Added interface declarations for the two new query methods |
Comments suppressed due to low confidence (3)
experiment/src/org/labkey/experiment/samples/SampleTypeFolderWriter.java:106
- The comment states "only want the sample derivation runs" but the implementation also includes aliquot runs (SAMPLE_ALIQUOT_PROTOCOL_LSID). Consider updating the comment to clarify that both derivation and aliquot protocol runs are included.
// only want the sample derivation runs; other runs will get included in the experiment xar.
api/src/org/labkey/api/exp/api/ExperimentService.java:811
- The JavaDoc comment could be more detailed. Consider expanding it to explain: (1) what "derivation run IDs" means in this context, (2) that it includes runs where materials from the specified sample types are used as either inputs or outputs, (3) the behavior of the includeRunsWithDataIO parameter more explicitly (when true, includes all derivation and aliquot runs; when false, includes aliquot runs and only derivation runs without data inputs/outputs).
/** Get derivation/aliquot run IDs for sample types — filtered by protocol and optionally excluding runs with data inputs/outputs */
List<Long> getDerivationRunIdsForSampleTypesExport(Collection<String> sampleTypeLsids, Container c, boolean includeRunsWithDataIO);
api/src/org/labkey/api/exp/api/ExperimentService.java:808
- The JavaDoc comment could be more detailed. Consider expanding it to explain: (1) that it returns run IDs where the specified data class is used as either input or output, (2) that it only includes runs with the SAMPLE_DERIVATION_PROTOCOL that have no material inputs or outputs (since those are handled by the sample type writer).
/** Get derivation run IDs for a data class — runs with SAMPLE_DERIVATION_PROTOCOL that have no material inputs/outputs */
List<Long> getDerivationRunIdsForDataClassExport(long dataClassRowId);
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| }) | ||
| .collect(Collectors.toSet()) | ||
| .stream().map(ExpObject::getRowId).toList(); | ||
| List<Long> exportedRunIds = ExperimentService.get().getDerivationRunIdsForDataClassExport(dataClass.getRowId()); |
There was a problem hiding this comment.
nit: I realize this is a one-for-one replacement of the current logic but we could consider doing a coalesced query here that takes a collection of data class rowIds and produces the result with a single query. Much like the getDerivationRunIdsForSampleTypesExport implementation. Something like this:
List<Long> dataClassRowIdsForExport = new ArrayList<>();
for (ExpDataClass dataClass : ExperimentService.get().getDataClasses(c, false))
{
// ignore data classes that are filtered out
if (EXCLUDED_TYPES.contains(dataClass.getName()))
continue;
dataClasses.add(dataClass);
typesSelection.addDataClass(dataClass);
exportTypes = true;
if (exportDataClassData)
dataClassRowIdsForExport.add(dataClass.getRowId());
}
if (!dataClassRowIdsForExport.isEmpty())
{
// get the list of derivation runs for these data classes — only sample derivation protocol runs
// with no material inputs/outputs (those are handled by the sample type writer)
// Sample derivation protocols involving data classes can be either to/from another data
// class or also to/from a sample type. If it's the latter, we will let the sample writer handle it
// since on import, data classes run before sample types.
List<Long> exportedRunIds = ExperimentService.get().getDerivationRunIdsForDataClassExport(dataClassRowIdsForExport);
if (!exportedRunIds.isEmpty())
{
runsSelection.addRunIds(exportedRunIds);
exportRuns = true;
}
}The query could then be:
// Note: I did not test this
public List<Long> getDerivationRunIdsForDataClassExport(@NotNull Collection<Long> dataClassRowIds)
{
if (dataClassRowIds.isEmpty())
return List.of();
SQLFragment inClause = new SQLFragment().appendInClause(dataClassRowIds, getExpSchema().getSqlDialect());
SQLFragment sql = new SQLFragment("SELECT DISTINCT er.RowId FROM ")
.append(getTinfoExperimentRun(), "er")
.append(" WHERE er.ProtocolLSID = ? ").add(SAMPLE_DERIVATION_PROTOCOL_LSID);
sql.append("""
AND EXISTS (
SELECT 1 FROM exp.ProtocolApplication pa
LEFT JOIN exp.DataInput di ON di.TargetApplicationId = pa.RowId
LEFT JOIN exp.Data d1 ON di.DataId = d1.RowId AND d1.classId\s
""");
sql.append(inClause);
sql.append(" LEFT JOIN exp.Data d2 ON d2.SourceApplicationId = pa.RowId AND d2.classId ");
sql.append(inClause);
sql.append("""
WHERE pa.RunId = er.RowId
AND (d1.RowId IS NOT NULL OR d2.RowId IS NOT NULL)
)
AND NOT EXISTS (
SELECT 1 FROM exp.ProtocolApplication pa_m
LEFT JOIN exp.MaterialInput mi ON mi.TargetApplicationId = pa_m.RowId
LEFT JOIN exp.Material m ON m.SourceApplicationId = pa_m.RowId
WHERE pa_m.RunId = er.RowId
AND (mi.RowId IS NOT NULL OR m.RowId IS NOT NULL)
)
ORDER BY er.RowId
""");
return new SqlSelector(getExpSchema(), sql).getArrayList(Long.class);
}There was a problem hiding this comment.
Good point. But given we didn't adopt code review then manual testing workflow on this PR, I'll leave the code as currently is...
Rationale
DataClassFolderWriter.write() and SampleTypeFolderWriter.write() follow the problematic pattern that has the potential of large memory usage.
This PR simplifies the process by getting rid of the intermediate steps to query for all materials and runs, and instead query for run.rowId directly based on specified criteria.
Related Pull Requests
Changes