Assigning Files to Samples
After file discovery, sequencing files need to be linked to the correct samples.
SeqDesk provides both automatic matching and manual assignment from the
facility-admin Sequencing tab of an order (/orders/[id]/sequencing).
Auto-Detect Matching
Matching is barcode-first. For each sample, the engine tries these sources in order and stops at the first that produces a match:
- Run-plan barcode: the barcode assigned to the sample on a sequencing run plan. Files are matched by looking for the barcode (and, when known, the run ID) in the file’s directory path. (matchedBy: run-plan-barcode)
- Sample barcode: a _barcode value stored on the sample’s custom fields. (matchedBy: sample-barcode)
- Sample identifier: falls back to fuzzy filename matching against sampleId, sampleAlias, and sampleTitle. (matchedBy: sample-id)
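The barcode-first cascade can be sketched roughly as follows. This is an illustrative Python sketch, not SeqDesk's implementation: the `Sample` class and `match_files` function are hypothetical names, the sample barcode is assumed to be searched in the directory path like the run-plan barcode, and the identifier fallback uses a plain substring test as a stand-in for the real fuzzy matcher.

```python
import posixpath
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Sample:
    sample_id: str
    run_plan_barcode: Optional[str] = None   # barcode from a sequencing run plan
    run_id: Optional[str] = None             # run ID, when known
    custom_fields: dict = field(default_factory=dict)  # may hold "_barcode"


def match_files(sample: Sample, paths: list[str]) -> tuple[Optional[str], list[str]]:
    """Return (matchedBy, matching paths) from the first source that matches."""
    # 1. Run-plan barcode: search the directory part of each path.
    if sample.run_plan_barcode:
        hits = [
            p for p in paths
            if sample.run_plan_barcode in posixpath.dirname(p)
            and (sample.run_id is None or sample.run_id in posixpath.dirname(p))
        ]
        if hits:
            return "run-plan-barcode", hits

    # 2. Sample barcode from custom fields (assumed to be searched the same way).
    barcode = sample.custom_fields.get("_barcode")
    if barcode:
        hits = [p for p in paths if barcode in posixpath.dirname(p)]
        if hits:
            return "sample-barcode", hits

    # 3. Identifier fallback: a substring test stands in for fuzzy matching.
    needle = sample.sample_id.lower()
    hits = [p for p in paths if needle in posixpath.basename(p).lower()]
    if hits:
        return "sample-id", hits

    return None, []
```

The early `return` in each step is what makes the order matter: a sample with a run-plan barcode that matches never reaches the later sources.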
Match status and confidence
Each suggestion has a status and a confidence score (0–1):
| Status | Meaning |
|---|---|
| exact | A single confident match (paired ≈ 0.99, single-end ≈ 0.92 for barcode matches; score ≥ 0.7 for identifier matches) |
| partial | A weak or single-end match that needs review |
| ambiguous | Multiple candidate pairs matched; nothing is auto-picked, and the alternatives are listed for manual choice |
| none | No candidate files matched |
For the identifier fallback, the score is computed by string similarity: an
exact normalized match scores 1.0; a filename that contains the sample
identifier scores 0.5–0.9; partial overlaps score lower. Two or more pairs
scoring ≥ 0.7 are reported as ambiguous.
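As a rough illustration of the identifier fallback, the sketch below scores one identifier against one file name and then classifies a list of candidate scores. The function names, the normalization rule, the exact formula inside the 0.5–0.9 band, and the use of `difflib` for partial overlaps are all assumptions made for this example; only the 1.0 exact score, the 0.5–0.9 containment band, and the 0.7 threshold come from the text above. Mapping below-threshold candidates to partial is also an assumption.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and drop everything except letters and digits (assumed rule)."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def identifier_score(identifier: str, file_stem: str) -> float:
    ident, name = normalize(identifier), normalize(file_stem)
    if not ident:
        return 0.0
    if ident == name:
        return 1.0                      # exact normalized match
    if ident in name:
        # Containment lands in the 0.5-0.9 band; the scaling is illustrative.
        return 0.5 + 0.4 * len(ident) / len(name)
    # Partial overlap scores lower; SequenceMatcher is a stand-in similarity.
    return 0.5 * SequenceMatcher(None, ident, name).ratio()


def classify(scores: list[float], threshold: float = 0.7) -> str:
    if not scores:
        return "none"
    strong = [s for s in scores if s >= threshold]
    if len(strong) >= 2:
        return "ambiguous"              # several confident pairs: no auto-pick
    if len(strong) == 1:
        return "exact"
    return "partial"                    # assumption: weak candidates need review
```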
Auto-assignment
Auto-assignment is off by default (autoAssign: false). When it is enabled (per
request or in config), discovery auto-assigns a sample only when all of the following hold:
- the suggestion status is exact, and
- confidence is ≥ 0.9, and
- a Read 1 file is present.
Everything else (partial, ambiguous, lower confidence) is surfaced for manual review and left unassigned. Samples that already have an assigned read are skipped unless you force a re-scan.
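The auto-assignment gate amounts to a single predicate. The Python sketch below is a hypothetical restatement of the conditions above; the `Suggestion` class, the function name, and the `already_assigned` and `force` parameters are illustrative and not part of SeqDesk's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Suggestion:
    status: str                  # "exact", "partial", "ambiguous" or "none"
    confidence: float            # 0-1
    read1: Optional[str]         # path to the Read 1 file, if any
    read2: Optional[str] = None


def should_auto_assign(
    suggestion: Suggestion,
    auto_assign: bool = False,       # mirrors the autoAssign: false default
    already_assigned: bool = False,  # sample already has an assigned read
    force: bool = False,             # forced re-scan
) -> bool:
    if not auto_assign:
        return False
    if already_assigned and not force:
        return False                 # skipped unless a re-scan is forced
    return (
        suggestion.status == "exact"
        and suggestion.confidence >= 0.9
        and suggestion.read1 is not None
    )
```

Under these rules an exact identifier match scoring 0.75 is still left for manual review, because it clears the 0.7 status threshold but not the 0.9 confidence threshold.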
Manual Assignment
For files that do not auto-match or need correction, from the order’s Sequencing tab:
- Browse the discovered files / suggestions list
- For each file pair (R1/R2), select the target sample
- Confirm the assignment
The view shows, per sample:
- File path relative to the data base path
- File size
- Current assignment status and read data class
- The matched-by source and confidence for any suggestion
What Gets Stored
When files are assigned, a Read record is created linking the files to the sample:
| Field | Value |
|---|---|
| file1 | Path to forward reads (R1) |
| file2 | Path to reverse reads (R2), null for single-end |
| checksum1/checksum2 | MD5 checksums for integrity verification |
| sampleId | The assigned sample |
| dataClass | cleaned for associated/uploaded reads (see below) |
| dataClassSource | How the class was set (associate, upload, pipeline, manual, …) |
| isActive | Whether this is the sample’s current active read |
File paths are stored relative to the data base path and resolved at runtime.
Assigning reads moves the sample’s facility status from WAITING/PROCESSING
to SEQUENCED.
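To make the stored fields concrete, here is a minimal Python sketch that builds a record from paths relative to a base directory and computes the MD5 checksums. `ReadRecord`, `md5_hex`, and `build_read` are hypothetical names; the attributes mirror the table's fields in snake_case, and the defaults shown (cleaned, associate) assume an ordinary association.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class ReadRecord:
    sample_id: str
    file1: str                       # relative path to R1
    file2: Optional[str]             # relative path to R2, None for single-end
    checksum1: str
    checksum2: Optional[str]
    data_class: str = "cleaned"
    data_class_source: str = "associate"
    is_active: bool = True


def md5_hex(path: Path) -> str:
    """Stream the file in 1 MiB chunks so large FASTQ files are not loaded whole."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_read(base: Path, sample_id: str, file1: str,
               file2: Optional[str] = None) -> ReadRecord:
    # Paths stay relative in the record; they are joined to the base only to read.
    return ReadRecord(
        sample_id=sample_id,
        file1=file1,
        file2=file2,
        checksum1=md5_hex(base / file1),
        checksum2=md5_hex(base / file2) if file2 else None,
    )
```

Keeping the paths relative means the data directory can be moved or remounted without rewriting every record.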
Raw vs Cleaned reads
Each Read carries a data class that protects original sequencer output from being silently overwritten by derived (cleaned) reads.
| dataClass | Meaning |
|---|---|
| cleaned | Processed / analysis-ready reads (the default for associate and upload) |
| raw | Original sequencer output; protected |
| unknown | Unclassified; also protected |
raw and unknown are protected classes. When you assign or upload a
cleaned read over a sample whose current active read is protected (and the
files differ), SeqDesk does not overwrite it. Instead it:
- creates a new read record for the cleaned files and marks it active, and
- marks the previously active raw/unknown read isActive = false and sets its supersededByReadId to the new read.
The protected read is preserved as provenance, not deleted. Replacing a non-protected (cleaned) read in place is allowed.
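The supersede-versus-replace rule can be sketched as below. This is a hypothetical in-memory model in Python, not SeqDesk's code: `Read` and `assign_cleaned` are illustrative names, IDs are assigned naively, and the case where the incoming files equal the protected read's files is not described above, so the sketch leaves the read untouched.

```python
from dataclasses import dataclass
from typing import Optional

PROTECTED = {"raw", "unknown"}


@dataclass
class Read:
    id: int
    file1: str
    file2: Optional[str]
    data_class: str
    is_active: bool = True
    superseded_by_read_id: Optional[int] = None


def assign_cleaned(reads: list[Read], file1: str, file2: Optional[str]) -> Read:
    """Assign cleaned files to a sample whose existing reads are given in `reads`."""
    current = next((r for r in reads if r.is_active), None)
    next_id = max((r.id for r in reads), default=0) + 1

    if current is None:
        new = Read(next_id, file1, file2, "cleaned")
        reads.append(new)
        return new

    if current.data_class in PROTECTED:
        if (current.file1, current.file2) == (file1, file2):
            return current           # same files: unspecified above, left as-is
        # Protected read: keep it as provenance and supersede it.
        new = Read(next_id, file1, file2, "cleaned")
        reads.append(new)
        current.is_active = False
        current.superseded_by_read_id = new.id
        return new

    # Non-protected (cleaned) read: replacing in place is allowed.
    current.file1, current.file2 = file1, file2
    return current
```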
Choosing the active read
A sample can have several Read records. The active read is selected by preferring, in order: an active cleaned read with a file → any active read with a file → any read with a file. This is what downstream delivery and pipelines use.
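The preference order reads naturally as three tiers tried in sequence. The Python sketch below is illustrative only: `Read` and `select_active_read` are hypothetical names, "with a file" is taken to mean a non-empty `file1`, and ties within a tier are broken by list order, which is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Read:
    id: int
    file1: Optional[str]
    data_class: str
    is_active: bool


def select_active_read(reads: list[Read]) -> Optional[Read]:
    with_file = [r for r in reads if r.file1]
    tiers = (
        lambda r: r.is_active and r.data_class == "cleaned",  # active cleaned read
        lambda r: r.is_active,                                 # any active read
        lambda r: True,                                        # any read with a file
    )
    for accepts in tiers:
        for read in with_file:
            if accepts(read):
                return read
    return None
```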
Manual re-classification
Facility admins can manually re-classify a read’s data class (for example,
marking an associated read as raw). This sets dataClass,
dataClassSource = manual, and records who changed it and when. Manual
re-classification updates the read in place — it does not supersede it.
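In contrast to superseding, re-classification mutates the existing record. A minimal Python sketch, assuming hypothetical audit attributes `class_changed_by` and `class_changed_at` (the real field names for who and when are not given here):

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

DATA_CLASSES = {"cleaned", "raw", "unknown"}


@dataclass
class Read:
    id: int
    data_class: str
    data_class_source: str
    is_active: bool = True
    superseded_by_read_id: Optional[int] = None
    class_changed_by: Optional[str] = None        # hypothetical audit field
    class_changed_at: Optional[datetime] = None   # hypothetical audit field


def reclassify(read: Read, new_class: str, admin: str) -> Read:
    if new_class not in DATA_CLASSES:
        raise ValueError(f"unknown data class: {new_class}")
    # Update in place: no new record, no change to is_active or supersession.
    read.data_class = new_class
    read.data_class_source = "manual"
    read.class_changed_by = admin
    read.class_changed_at = datetime.now(timezone.utc)
    return read
```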
Bulk Assignment
For orders with many samples, discovery can process all samples at once:
- Open the order’s Sequencing tab
- Click Discover Files
- The system scans and suggests matches for all samples
- Review the suggestions
- Confirm assignments (or enable auto-assign to apply qualifying exact matches automatically; see the conditions above)
Partial, ambiguous, and low-confidence matches are left unassigned for manual review.
Cleaned reads from pipelines
The shipped read-cleaning pipeline does not write to Read records directly.
Instead it produces cleaned-read candidates (artifacts flagged as
sample_read_candidate). A facility admin reviews them under the run’s pending
writebacks and promotes the chosen candidates
(GET/POST /api/pipelines/runs/[id]/pending-writebacks). Promotion copies the
cleaned files into the data directory and creates a new active cleaned read
(with dataClassSource = pipeline), superseding the prior active read while
preserving any protected raw/unknown reads.
Requirements for Pipelines
Before a pipeline can run on a study:
- All included samples must have at least one read record
- Read files must exist at the configured paths
- For paired-end pipelines, both R1 and R2 must be assigned
The pipeline launcher validates these requirements before allowing a run to start. See Running a Pipeline.
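A pre-flight check along these lines can be sketched in Python. This is not the launcher's actual code: `validate_for_pipeline` is a hypothetical function, each read is modelled as a `(file1, file2)` tuple of relative paths, and every read of a sample is checked, whereas the real launcher may inspect only the active read.

```python
from pathlib import Path
from typing import Optional


def validate_for_pipeline(
    base: Path,
    samples: dict[str, list[tuple[str, Optional[str]]]],
    paired_end: bool,
) -> list[str]:
    """Return human-readable problems; an empty list means the run may start."""
    problems: list[str] = []
    for sample_id, reads in samples.items():
        if not reads:
            problems.append(f"{sample_id}: no read record")
            continue
        for file1, file2 in reads:
            if paired_end and not file2:
                problems.append(f"{sample_id}: R2 is not assigned")
            for rel in (file1, file2):
                # Paths are relative to the data base path, as stored on the read.
                if rel and not (base / rel).is_file():
                    problems.append(f"{sample_id}: missing file {rel}")
    return problems
```

Collecting all problems instead of stopping at the first lets an admin fix every affected sample in one pass.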