Assigning Files to Samples

After file discovery, sequencing files must be linked to the correct samples. SeqDesk supports both automatic matching and manual assignment. Both are available to facility admins on the Sequencing tab of an order (/orders/[id]/sequencing).

Auto-Detect Matching

Matching is barcode-first. For each sample, the engine tries these sources in order and stops at the first that produces a match:

  1. Run-plan barcode — the barcode assigned to the sample on a sequencing run plan. Files are matched by looking for the barcode (and, when known, the run ID) in the file’s directory path. (matchedBy: run-plan-barcode)
  2. Sample barcode — a _barcode value stored on the sample’s custom fields. (matchedBy: sample-barcode)
  3. Sample identifier — falls back to fuzzy filename matching against sampleId, sampleAlias, and sampleTitle. (matchedBy: sample-id)
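
The "first source that matches wins" order above can be sketched as follows. This is a minimal illustration, not SeqDesk's implementation: the `Sample` class, the finder functions, and the plain substring checks are all assumptions made for the example, and the real identifier fallback is fuzzy rather than a simple containment test.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath
from typing import Dict, List, Optional, Tuple


@dataclass
class Sample:
    # Hypothetical shape; field names are illustrative only.
    sample_id: str
    run_plan_barcode: Optional[str] = None
    run_id: Optional[str] = None
    custom_fields: Dict[str, str] = field(default_factory=dict)


def _directory(path: str) -> str:
    """Directory part of a relative file path, without the filename."""
    return str(PurePosixPath(path).parent)


def by_run_plan_barcode(sample: Sample, paths: List[str]) -> List[str]:
    if not sample.run_plan_barcode:
        return []
    hits = [p for p in paths if sample.run_plan_barcode in _directory(p)]
    if sample.run_id:  # narrow by run ID only when it is known
        hits = [p for p in hits if sample.run_id in _directory(p)]
    return hits


def by_sample_barcode(sample: Sample, paths: List[str]) -> List[str]:
    barcode = sample.custom_fields.get("_barcode")
    if not barcode:
        return []
    # Assumption: the sample barcode is also looked up in the directory path.
    return [p for p in paths if barcode in _directory(p)]


def by_sample_identifier(sample: Sample, paths: List[str]) -> List[str]:
    # Simplified stand-in for fuzzy matching: case-insensitive containment.
    ident = sample.sample_id.lower()
    return [p for p in paths if ident in PurePosixPath(p).name.lower()]


SOURCES = [
    ("run-plan-barcode", by_run_plan_barcode),
    ("sample-barcode", by_sample_barcode),
    ("sample-id", by_sample_identifier),
]


def match_sample(sample: Sample, paths: List[str]) -> Tuple[Optional[str], List[str]]:
    """Try each source in order and stop at the first one that yields files."""
    for matched_by, finder in SOURCES:
        hits = finder(sample, paths)
        if hits:
            return matched_by, hits
    return None, []
```

Note that a plain substring check would also match `barcode1` inside `barcode10`. A production matcher would need stricter boundaries than this sketch uses.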

Match status and confidence

Each suggestion has a status and a confidence score (0–1):

| Status | Meaning |
| --- | --- |
| exact | A single confident match (paired ≈ 0.99, single-end ≈ 0.92 for barcode matches; ≥ 0.7 score for identifier matches) |
| partial | A weak/single-end match that needs review |
| ambiguous | Multiple candidate pairs matched, so no auto-pick; the alternatives are listed for manual choice |
| none | No candidate files matched |

For the identifier fallback, the score is computed by string similarity: an exact normalized match scores 1.0; a filename that contains the sample identifier scores 0.5–0.9; partial overlaps score lower. Two or more pairs scoring ≥ 0.7 are reported as ambiguous.
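
To make the score bands concrete, here is a rough sketch of an identifier scorer and the status rule. Only the bands (1.0 for an exact normalized match, 0.5–0.9 for containment, lower otherwise) and the 0.7 threshold come from the text above. The normalization, the interpolation inside the containment band, the use of `difflib`, and mapping a single weak candidate to `partial` are assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher
from typing import List


def normalize(text: str) -> str:
    """Lowercase and drop everything except letters and digits (assumed rule)."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def identifier_score(identifier: str, file_stem: str) -> float:
    a, b = normalize(identifier), normalize(file_stem)
    if not a or not b:
        return 0.0
    if a == b:
        return 1.0  # exact normalized match
    if a in b:
        # Containment band 0.5-0.9; scaling by coverage is an assumption.
        return 0.5 + 0.4 * len(a) / len(b)
    # Partial overlap: always below the containment band in this sketch.
    return 0.5 * SequenceMatcher(None, a, b).ratio()


def classify(pair_scores: List[float], threshold: float = 0.7) -> str:
    """Map the scores of all candidate pairs for one sample to a status."""
    if not pair_scores:
        return "none"
    strong = [s for s in pair_scores if s >= threshold]
    if len(strong) >= 2:
        return "ambiguous"
    if len(strong) == 1:
        return "exact"
    return "partial"  # assumption: only weak candidates means "needs review"
```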

Auto-assignment

Auto-assignment is off by default (autoAssign: false). When it is enabled (per request or in config), discovery auto-assigns a sample only when all of the following hold:

  • the suggestion status is exact, and
  • confidence is ≥ 0.9, and
  • a Read 1 file is present.
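
The conditions above, together with the skip rule for already-assigned samples, amount to a single predicate. The sketch below is illustrative only: `Suggestion`, `should_auto_assign`, and the parameter names are hypothetical and not part of SeqDesk's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Suggestion:
    # Hypothetical container for one sample's match suggestion.
    status: str                  # "exact", "partial", "ambiguous" or "none"
    confidence: float            # 0-1
    read1: Optional[str]         # path to the Read 1 file, if any
    read2: Optional[str] = None  # path to the Read 2 file, if any


def should_auto_assign(
    suggestion: Suggestion,
    auto_assign: bool = False,      # mirrors the autoAssign default
    already_assigned: bool = False,
    force: bool = False,            # stands in for a forced re-scan
) -> bool:
    if not auto_assign:
        return False
    if already_assigned and not force:
        return False  # samples with an assigned read are skipped
    return (
        suggestion.status == "exact"
        and suggestion.confidence >= 0.9
        and suggestion.read1 is not None
    )
```

One consequence worth noting: an identifier match can be `exact` with a score between 0.7 and 0.9, and under these rules it is still left for manual review.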

Everything else (partial, ambiguous, lower confidence) is surfaced for manual review and left unassigned. Samples that already have an assigned read are skipped unless you force a re-scan.

Manual Assignment

For files that do not auto-match or need correction, from the order’s Sequencing tab:

  1. Browse the discovered files / suggestions list
  2. For each file pair (R1/R2), select the target sample
  3. Confirm the assignment

The view shows, per sample:

  • File path relative to the data base path
  • File size
  • Current assignment status and read data class
  • The matched-by source and confidence for any suggestion

What Gets Stored

When files are assigned, a Read record is created linking the files to the sample:

| Field | Value |
| --- | --- |
| file1 | Path to forward reads (R1) |
| file2 | Path to reverse reads (R2), null for single-end |
| checksum1/checksum2 | MD5 checksums for integrity verification |
| sampleId | The assigned sample |
| dataClass | cleaned for associated/uploaded reads (see below) |
| dataClassSource | How the class was set (associate, upload, pipeline, manual, …) |
| isActive | Whether this is the sample’s current active read |

File paths are stored relative to the data base path and resolved at runtime. Assigning reads moves the sample’s facility status from WAITING/PROCESSING to SEQUENCED.
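
As an illustration of the two rules in the paragraph above, the sketch below resolves a stored relative path against a base path and applies the status transition. The function names are hypothetical. Rejecting absolute or `..` paths and leaving other statuses unchanged are assumptions added for the example, not documented SeqDesk behavior.

```python
from pathlib import PurePosixPath


def resolve_read_path(data_base_path: str, stored_path: str) -> PurePosixPath:
    """Join a stored relative path onto the configured data base path."""
    relative = PurePosixPath(stored_path)
    # Defensive check (assumption): stored paths must stay inside the base.
    if relative.is_absolute() or ".." in relative.parts:
        raise ValueError(f"not a safe relative path: {stored_path}")
    return PurePosixPath(data_base_path) / relative


def status_after_assignment(facility_status: str) -> str:
    """WAITING/PROCESSING become SEQUENCED once reads are assigned."""
    if facility_status in {"WAITING", "PROCESSING"}:
        return "SEQUENCED"
    return facility_status  # assumption: any other status is left as-is
```

Storing relative paths means the data directory can be moved or remounted by changing one configured base path, without rewriting Read records.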

Raw vs Cleaned reads

Each Read carries a data class that protects original sequencer output from being silently overwritten by derived (cleaned) reads.

| dataClass | Meaning |
| --- | --- |
| cleaned | Processed / analysis-ready reads (the default for associate and upload) |
| raw | Original sequencer output (protected) |
| unknown | Unclassified (also protected) |

raw and unknown are protected classes. When you assign or upload a cleaned read over a sample whose current active read is protected (and the files differ), SeqDesk does not overwrite it. Instead it:

  1. creates a new read record for the cleaned files and marks it active, and
  2. marks the previously active raw/unknown read isActive = false and sets its supersededByReadId to the new read.

The protected read is preserved as provenance, not deleted. Replacing a non-protected (cleaned) read in place is allowed.
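
The supersede-versus-replace behavior can be sketched with an in-memory list of read records. This is a simplified model under stated assumptions: the `Read` class, `assign_cleaned`, and the integer IDs are hypothetical, and the case where the incoming files equal the protected read's files is treated as a no-op because the text above does not specify it.

```python
from dataclasses import dataclass
from typing import List, Optional

PROTECTED = {"raw", "unknown"}


@dataclass
class Read:
    # Hypothetical in-memory stand-in for a Read record.
    id: int
    file1: str
    file2: Optional[str] = None
    data_class: str = "cleaned"
    is_active: bool = True
    superseded_by_read_id: Optional[int] = None


def assign_cleaned(reads: List[Read], file1: str, file2: Optional[str] = None) -> Read:
    """Assign cleaned files to one sample's reads; returns the active read."""
    next_id = max((r.id for r in reads), default=0) + 1
    current = next((r for r in reads if r.is_active), None)

    if current is None:
        new = Read(id=next_id, file1=file1, file2=file2)
        reads.append(new)
        return new

    if current.data_class in PROTECTED:
        if (current.file1, current.file2) == (file1, file2):
            return current  # assumption: same files, nothing to do
        # Keep the protected read as provenance and point it at its successor.
        new = Read(id=next_id, file1=file1, file2=file2)
        current.is_active = False
        current.superseded_by_read_id = new.id
        reads.append(new)
        return new

    # Non-protected (cleaned) read: replacing in place is allowed.
    current.file1, current.file2 = file1, file2
    return current
```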

Choosing the active read

A sample can have several Read records. The active read is selected by preferring, in order: an active cleaned read with a file → any active read with a file → any read with a file. This is what downstream delivery and pipelines use.
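
The preference order reads naturally as a series of tiers tried in turn. The sketch below assumes that "with a file" means a non-empty `file1`. The `Read` class and `pick_active_read` are illustrative names, not SeqDesk's.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Read:
    # Hypothetical, reduced to the fields the selection needs.
    id: int
    file1: Optional[str]
    data_class: str = "cleaned"
    is_active: bool = True


def pick_active_read(reads: List[Read]) -> Optional[Read]:
    """Return the read that downstream delivery and pipelines would use."""
    with_file = [r for r in reads if r.file1]
    tiers: List[Callable[[Read], bool]] = [
        lambda r: r.is_active and r.data_class == "cleaned",  # best case
        lambda r: r.is_active,                                # any active read
        lambda r: True,                                       # any read at all
    ]
    for accepts in tiers:
        for read in with_file:
            if accepts(read):
                return read
    return None
```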

Manual re-classification

Facility admins can manually re-classify a read’s data class (for example, marking an associated read as raw). This sets dataClass, dataClassSource = manual, and records who changed it and when. Manual re-classification updates the read in place — it does not supersede it.

Bulk Assignment

For orders with many samples, discovery can process all samples at once:

  1. Open the order’s Sequencing tab
  2. Click Discover Files
  3. The system scans and suggests matches for all samples
  4. Review the suggestions
  5. Confirm assignments (or enable auto-assign to apply qualifying exact matches automatically — see the conditions above)

Partial, ambiguous, and low-confidence matches are left unassigned for manual review.

Cleaned reads from pipelines

The shipped read-cleaning pipeline does not write to Read records directly. Instead it produces cleaned-read candidates (artifacts flagged as sample_read_candidate). A facility admin reviews them under the run’s pending writebacks and promotes the chosen candidates (GET/POST /api/pipelines/runs/[id]/pending-writebacks). Promotion copies the cleaned files into the data directory and creates a new active cleaned read (with dataClassSource = pipeline), superseding the prior active read while preserving any protected raw/unknown reads.

Requirements for Pipelines

Before a pipeline can run on a study:

  • All included samples must have at least one read record
  • Read files must exist at the configured paths
  • For paired-end pipelines, both R1 and R2 must be assigned

The pipeline launcher validates these requirements before allowing a run to start. See Running a Pipeline.
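
A pre-flight check along these lines could look like the sketch below. It is not the launcher's actual code: `validate_for_pipeline` and its input shape (one `(file1, file2)` tuple per sample, or `None` when the sample has no read record) are assumptions made to keep the example small. The `exists` parameter is injectable so the check can be exercised without touching the filesystem.

```python
import os
import posixpath
from typing import Callable, Dict, List, Optional, Tuple

ReadFiles = Tuple[Optional[str], Optional[str]]  # (R1, R2), relative paths


def validate_for_pipeline(
    samples: Dict[str, Optional[ReadFiles]],
    data_base_path: str,
    paired_end: bool,
    exists: Callable[[str], bool] = os.path.exists,
) -> List[str]:
    """Return a list of human-readable problems; empty means ready to run."""
    problems: List[str] = []
    for sample_id, read in samples.items():
        if read is None:
            problems.append(f"{sample_id}: no read record")
            continue
        file1, file2 = read
        if paired_end and not (file1 and file2):
            problems.append(f"{sample_id}: paired-end pipeline needs both R1 and R2")
        for relative in (file1, file2):
            # Stored paths are relative, so resolve them against the base path.
            if relative and not exists(posixpath.join(data_base_path, relative)):
                problems.append(f"{sample_id}: missing file {relative}")
    return problems
```

Collecting every problem instead of stopping at the first one lets an admin fix all affected samples in a single pass.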