Available Pipelines

SeqDesk ships with built-in support for study pipelines and order pipelines. Study pipelines run across the samples of a study. Order pipelines operate on linked sequencing files inside an order and are typically used for simulation, validation, QC, and read cleaning.

The installed-pipeline catalog is built only from packages discovered under pipelines/. There are four built-in order pipelines: Simulate Reads, FASTQ Checksum, FastQC, and Read Cleaning.

Study Pipelines

MAG (Metagenome-Assembled Genomes)

Pipeline: nf-core/mag v3.0.0
Purpose: Assembly and binning of metagenomes
Input: Paired-end FASTQ reads
Role required: FACILITY_ADMIN

What MAG Does

The MAG pipeline takes raw metagenomic sequencing reads and produces:

Quality-controlled reads — trimmed and host-removed
Assembled contigs — via MEGAHIT and/or SPAdes
Genome bins — via MetaBAT2, MaxBin2, and/or CONCOCT
Refined bins — via DAS Tool
Quality scores — completeness and contamination via CheckM
Taxonomy — classification via GTDB-Tk
QC summary — MultiQC report

Configuration

Parameter	Default	Description
`stubMode`	false	Test mode (fast, no real analysis)
`skipMegahit`	false	Skip MEGAHIT assembler
`skipSpades`	true	Skip SPAdes assembler
`skipProkka`	true	Skip Prokka gene annotation
`skipConcoct`	true	Skip CONCOCT binning
`skipBinQc`	false	Skip bin quality control
`skipGtdb`	false	Skip GTDB-Tk taxonomy
`skipGunc`	false	Skip GUNC contamination check
`gtdbDb`	—	Path to GTDB-Tk database (optional)
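
The defaults in the table can be captured as a plain mapping and merged with per-run overrides. The following is a minimal sketch in Python, not SeqDesk code: the helper name `resolve_mag_config` and the rejection of unknown keys are assumptions made for illustration. Only the parameter names and default values are taken from the table.

```python
# Defaults copied from the MAG configuration table; gtdbDb has no default.
MAG_DEFAULTS = {
    "stubMode": False,
    "skipMegahit": False,
    "skipSpades": True,
    "skipProkka": True,
    "skipConcoct": True,
    "skipBinQc": False,
    "skipGtdb": False,
    "skipGunc": False,
    "gtdbDb": None,
}


def resolve_mag_config(overrides):
    """Merge user overrides onto the documented defaults (hypothetical helper)."""
    unknown = sorted(set(overrides) - set(MAG_DEFAULTS))
    if unknown:
        raise ValueError(f"Unknown MAG parameters: {', '.join(unknown)}")
    return {**MAG_DEFAULTS, **overrides}


# Example: enable SPAdes in addition to the default MEGAHIT assembler.
config = resolve_mag_config({"skipSpades": False})
enabled_assemblers = [
    name
    for name, skip_key in (("MEGAHIT", "skipMegahit"), ("SPAdes", "skipSpades"))
    if not config[skip_key]
]
print(enabled_assemblers)
```

With the defaults alone, only MEGAHIT is enabled, because `skipSpades` defaults to true.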

Outputs

Output	Location	Description
Assemblies	`Assembly/MEGAHIT/`	Per-sample contig files (`.contigs.fa.gz`)
Bins	`GenomeBinning/DASTool/bins/`	Refined genome bins (`.fa`)
CheckM	`GenomeBinning/QC/`	Completeness and contamination TSV
GTDB-Tk	`Taxonomy/GTDB-Tk/`	Taxonomy classification TSV
MultiQC	`multiqc/`	Aggregate QC HTML report

Assemblies and bins are automatically parsed and linked to samples in the database after the run completes.
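
How SeqDesk parses and links these files is not described here. As a rough illustration of walking the documented output locations, the sketch below collects assembly and bin files from a run directory. The function name `collect_mag_outputs` is hypothetical; the relative paths and file suffixes come from the outputs table.

```python
from pathlib import Path

# Relative locations copied from the MAG outputs table.
ASSEMBLY_DIR = Path("Assembly/MEGAHIT")
BIN_DIR = Path("GenomeBinning/DASTool/bins")


def collect_mag_outputs(run_dir):
    """Return sorted assembly and bin file paths found under a MAG run directory."""
    run_dir = Path(run_dir)
    return {
        "assemblies": sorted((run_dir / ASSEMBLY_DIR).glob("*.contigs.fa.gz")),
        "bins": sorted((run_dir / BIN_DIR).glob("*.fa")),
    }
```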

SubMG (ENA Submission Pipeline)

Pipeline: ttubb/submg v1.0.0
Purpose: Automated ENA submission of reads, assemblies, and bins
Input: Samples with reads, assemblies, and optionally bins
Role required: FACILITY_ADMIN

What SubMG Does

SubMG automates the submission of sequencing data to the European Nucleotide Archive (ENA). It handles the full submission workflow:

Validate inputs — checks study/sample prerequisites and ENA credentials
Generate config — creates SubMG YAML manifests and helper files
Submit — executes submg submit for each manifest
Parse receipts — reads ENA responses and stores accession numbers

Prerequisites

Before running SubMG, ensure:

The study has an ENA study accession (PRJEB...)
Samples have taxonomy IDs assigned
Reads are linked to samples
ENA credentials are configured (see ENA Credentials)
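
The checklist above can be expressed as a small pre-flight check. This is a sketch under stated assumptions: the `Sample` record shape and the function `check_submg_prerequisites` are hypothetical and do not reflect SeqDesk's data model. Only the four conditions are taken from the list.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Sample:
    # Hypothetical record shape, for illustration only.
    name: str
    taxonomy_id: Optional[int] = None
    read_files: List[str] = field(default_factory=list)


def check_submg_prerequisites(study_accession, samples, ena_credentials_configured):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    if not (study_accession or "").startswith("PRJEB"):
        problems.append("Study has no ENA study accession (PRJEB...)")
    for sample in samples:
        if sample.taxonomy_id is None:
            problems.append(f"Sample {sample.name} has no taxonomy ID")
        if not sample.read_files:
            problems.append(f"Sample {sample.name} has no linked reads")
    if not ena_credentials_configured:
        problems.append("ENA credentials are not configured")
    return problems
```

Returning every problem at once, rather than failing on the first, lets an admin fix all gaps before retrying.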

Configuration

Parameter	Default	Description
`skipChecks`	true	Skip pre-submission validation
`submitBins`	true	Include genome bins in the submission
`condaEnv`	`submg`	Conda environment name
`assemblySoftware`	`MEGAHIT`	Assembler used (for ENA metadata)
`completenessSoftware`	`CheckM`	QC software used
`binningSoftware`	`MetaBAT2`	Binner used

Outputs

After a successful run, accession numbers are stored back in the database:

Sample accessions — ERS/SAMEA numbers
Read accessions — ERX/ERR numbers
Assembly accessions — linked to Assembly records
Bin accessions — linked to Bin records
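
The accession prefixes listed above are enough to tell the record types apart. The sketch below classifies an accession string by prefix; in ENA terms, ERX identifies an experiment and ERR a run. The function name and the loose `\d+` digit matching are assumptions for illustration and are not a strict ENA format validator.

```python
import re

# Prefixes taken from the list above; digit counts are deliberately not enforced.
ACCESSION_PATTERNS = {
    "sample": re.compile(r"^(ERS|SAMEA)\d+$"),
    "experiment": re.compile(r"^ERX\d+$"),
    "run": re.compile(r"^ERR\d+$"),
}


def classify_accession(accession):
    """Return 'sample', 'experiment', or 'run', or None if no prefix matches."""
    for kind, pattern in ACCESSION_PATTERNS.items():
        if pattern.match(accession):
            return kind
    return None
```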

MetaxPath (Pathogen Profiling) — external / optional

MetaxPath is not shipped in the public package catalog: the installed-pipeline catalog is built only from packages discovered under pipelines/, and MetaxPath is not among them. It is an optional, private add-on package that an operator can install separately. The configuration below applies only to installs that have it.

Pipeline: hzi-bifo/MetaxPath-Nextflow v0.1.6
Purpose: Long-read clinical metagenomics for pathogen identification, virulence, and AMR detection
Input: Long-read FASTQ files (Oxford Nanopore or PacBio)
Role required: FACILITY_ADMIN

What MetaxPath Does

MetaxPath is designed for clinical metagenomics on long-read sequencing data. It performs:

Human-read filtering — removes host contamination
Taxonomic profiling — via Metax and Sylph
Assembly — via Flye (or configurable assemblers)
Virulence factor prediction — identifies pathogenic gene markers
AMR detection — predicts antibiotic resistance genes
Reporting — generates HTML reports and species abundance dotplots

Supported Sequencers

Oxford Nanopore (MinION, GridION, PromethION)
PacBio (Sequel, Revio)

Configuration

Parameter	Default	Description
`sequencer`	`Nanopore`	Sequencing platform (`Nanopore` or `PacBio`)
`assemblers`	`flye`	Comma-separated assembler list
`threads`	20	CPU threads per process
`topn`	50	Number of top species in reports
`skipSylph`	false	Skip Sylph profiling
`skipVirulence`	false	Skip virulence factor prediction
`skipAmr`	false	Skip AMR detection

Database paths (must be configured by admin):

Parameter	Description
`metaxDb`	Metax database prefix (without .json)
`metaxDmpDir`	NCBI taxonomy dump directory
`kraken2Db`	Kraken2 database path
`sylphDb`	Sylph database path
`refIndex`	Host reference minimap2 index

Outputs

Output	Scope	Description
Profile with VFs/AMRs	Per sample	Merged taxonomic profile with virulence and AMR annotations
Top-N HTML report	Per study	Combined species abundance report
Readcount stats	Per study	Combined readcount summary
Dotplots	Per study	Species abundance visualizations (PDF)

Reads QC (Quality Overview)

Pipeline: reads-qc v0.1.0
Purpose: Per-sample FASTQ statistics with an HTML summary report
Input: Linked sample reads (any scope; runs at study level)
Role required: FACILITY_ADMIN

Reads QC computes read count, base count, average quality, and GC content for each sample’s FASTQ files and rolls them up into a study-level HTML overview. It’s a lighter alternative to per-sample FastQC when all you need is a quick comparison across the samples in a study. macOS ARM local runs are supported.

Main outputs:

Per-sample read statistics (counts, bases, quality, GC%)
Study-level HTML summary report
Study-level TSV with per-sample metrics
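
To make the four metrics concrete, the sketch below computes them for a single FASTQ file. It is not the reads-qc implementation: it assumes four-line FASTQ records and Phred+33 quality encoding, and it defines average quality as the mean Phred score over all bases, which may differ from how the pipeline averages.

```python
import gzip
from pathlib import Path


def fastq_stats(path, phred_offset=33):
    """Compute read count, base count, mean base quality, and GC% for one FASTQ file."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    reads = bases = gc = quality_sum = 0
    with opener(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            line = line.rstrip("\r\n")
            if line_number % 4 == 1:  # sequence line of a 4-line record
                reads += 1
                bases += len(line)
                upper = line.upper()
                gc += upper.count("G") + upper.count("C")
            elif line_number % 4 == 3:  # quality line of a 4-line record
                quality_sum += sum(ord(ch) - phred_offset for ch in line)
    return {
        "reads": reads,
        "bases": bases,
        "mean_quality": quality_sum / bases if bases else 0.0,
        "gc_percent": 100.0 * gc / bases if bases else 0.0,
    }
```

Calling `fastq_stats` once per file and writing one row per sample would give a table of the same shape as the study-level TSV.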

Study Demo Report

Pipeline: study-demo-report v0.1.0
Purpose: Deterministic HTML, Markdown, and TSV outputs for testing pipeline integration
Input: Study + samples
Role required: FACILITY_ADMIN

Study Demo Report is a smoke-test pipeline that produces deterministic outputs without doing any real bioinformatics work. Use it to verify that pipeline execution, weblog ingestion, output parsing, and the Assemblies/Results UI work together end to end without spending CPU on real analysis. It is useful in CI and as a first run after configuring a new install. macOS ARM local runs are supported.

Main outputs:

Study-scope HTML report
Markdown summary
TSV per-sample table

Order Pipelines

Simulate Reads

Pipeline: simulate-reads v0.2.0
Purpose: Generate dummy FASTQ files for selected order samples
Input: Order samples
Role required: FACILITY_ADMIN

Simulate Reads generates synthetic FASTQ files and links them back to canonical Read records. It is mainly useful for demos, smoke tests, and exercising downstream order-scoped QC workflows.

Main outputs:

Generated FASTQ files linked to Read records
Read counts written back to canonical read fields
Run-level simulation summary TSV
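
For a sense of what a dummy FASTQ file looks like, the sketch below generates random reads with a fixed seed. It is illustrative only: the read naming scheme, constant quality string, and `simulate_fastq` helper are assumptions, and the actual simulate-reads pipeline may generate reads differently.

```python
import random


def simulate_fastq(sample_id, n_reads, read_length=100, seed=0):
    """Return FASTQ text with random reads; a fixed seed makes output reproducible."""
    rng = random.Random(seed)
    records = []
    for index in range(1, n_reads + 1):
        sequence = "".join(rng.choice("ACGT") for _ in range(read_length))
        quality = "I" * read_length  # constant Phred 40 in Phred+33 encoding
        records.append(f"@{sample_id}_{index}\n{sequence}\n+\n{quality}\n")
    return "".join(records)
```

Seeding the generator per call means the same sample ID, read count, and seed always yield identical files, which keeps smoke tests repeatable.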

FASTQ Checksum

Pipeline: fastq-checksum v0.1.0
Purpose: Compute MD5 checksums for linked FASTQ files
Input: Linked order FASTQ files
Role required: FACILITY_ADMIN

FASTQ Checksum computes canonical MD5 checksums for linked read files and stores them back on the corresponding Read records for downstream validation and submission workflows.

Main outputs:

checksum1 / checksum2 on Read records
Run-level checksum summary TSV
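
An MD5 checksum of a large FASTQ file is usually computed in chunks so the file never has to fit in memory. A minimal sketch using Python's standard `hashlib` follows; the helper name `md5_of_file` is hypothetical and this is not the pipeline's own code.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Stream a file through MD5 and return the hex digest."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The digest is computed over the bytes on disk, so a gzip-compressed file and its uncompressed contents produce different checksums.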

FastQC

Pipeline: fastqc v0.1.0
Purpose: Run read quality control on linked FASTQ files
Input: Linked order FASTQ files
Role required: FACILITY_ADMIN

FastQC runs per-sample QC against linked order reads, publishes HTML reports and zip archives, and stores selected summary metrics back onto the canonical Read record.

Main outputs:

Per-sample FastQC HTML reports
Per-sample FastQC zip archives
Read counts and average quality metrics on Read records
Run-level FastQC summary TSV

Read Cleaning

Pipeline: read-cleaning v0.1.0 (wraps nf-core/detaxizer)
Purpose: Screen raw or unknown FASTQ reads for host/contaminant sequences and stage cleaned reads for admin review
Input: Active order reads marked raw or unknown (single or paired)
Role required: FACILITY_ADMIN

Read Cleaning runs nf-core/detaxizer to identify contaminant reads (by default Homo sapiens), filters them out, and writes per-sample cleaned FASTQ files. The cleaned files are staged as candidate reads (run artifacts) for an admin to review and promote; the raw and unknown source reads are never silently overwritten. The pipeline is admin-only and not shown to researchers by default (visibility.showToUser: false, userCanStart: false).

Configuration:

Parameter	Default	Description
`tax2filter`	`Homo sapiens`	Taxon name/ID passed to detaxizer
`classificationKraken2`	true	Use Kraken2 to identify contaminant reads
`kraken2Db`	—	Local Kraken2 DB path or approved reference URI (required when Kraken2 is enabled)
`classificationBbduk`	false	Use BBDuk k-mer matching against a contaminant FASTA
`bbdukReference`	—	Local contaminant reference FASTA for BBDuk
`filteringTool`	`seqkit`	Read filtering backend (`seqkit` or `bbmap`)
`readType`	`auto`	Map single files to short- or long-read detaxizer columns (`auto`/`short`/`long`)
`outputRemovedReads`	false	Also write removed contaminant reads to the output folder
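
The table implies a few constraints: `kraken2Db` is required when Kraken2 is enabled, and `filteringTool` and `readType` accept only the listed values. The sketch below checks those constraints; the helper `validate_read_cleaning` is hypothetical and SeqDesk's own validation may be stricter or differ in detail.

```python
# Defaults copied from the Read Cleaning configuration table.
READ_CLEANING_DEFAULTS = {
    "tax2filter": "Homo sapiens",
    "classificationKraken2": True,
    "kraken2Db": None,
    "classificationBbduk": False,
    "bbdukReference": None,
    "filteringTool": "seqkit",
    "readType": "auto",
    "outputRemovedReads": False,
}


def validate_read_cleaning(overrides=None):
    """Merge overrides onto defaults and return (config, list of problems)."""
    config = {**READ_CLEANING_DEFAULTS, **(overrides or {})}
    problems = []
    if config["classificationKraken2"] and not config["kraken2Db"]:
        problems.append("kraken2Db is required when Kraken2 is enabled")
    if config["filteringTool"] not in ("seqkit", "bbmap"):
        problems.append("filteringTool must be 'seqkit' or 'bbmap'")
    if config["readType"] not in ("auto", "short", "long"):
        problems.append("readType must be 'auto', 'short', or 'long'")
    return config, problems
```

Because Kraken2 classification is on by default and `kraken2Db` has no default, the defaults alone fail this check until a database path is supplied.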

Main outputs:

Per-sample cleaned-read candidates staged for admin review (see Adding Custom Pipelines and Results)
MultiQC HTML report (classification/filtering evidence)
Run-level detaxizer classification/filtering summary TSV

Cleaned reads become the active reads used for downstream order pipelines only after an admin promotes the candidates via the pending-writebacks review flow.

Beta Pipelines (In Development)

These pipelines are implemented as declarative packages and pass validation, but they are not yet runnable because runner integration is still in progress: Taxonomic Profiling needs a one-time Bracken DB build, Study MultiQC needs sibling-run gathering, and NanoPlot needs a long-read demo dataset. They are listed here for reference and will move into the sections above once available.

Taxonomic Profiling (Kraken2 + Bracken)

Pipeline: kraken2-bracken v0.1.0
Purpose: Per-sample taxonomic classification of short reads with abundance estimation and Krona visualization
Input: Linked sample short reads (single or paired; runs at study level, also available for sequencing orders)
Role required: FACILITY_ADMIN

Kraken2 + Bracken classifies each sample’s reads against a Kraken2 database, re-estimates species (or genus/family) abundances with Bracken, and renders an interactive Krona chart per sample. The top taxa from each Bracken table are parsed into the per-sample result column. It is shown to researchers but only admins can start it (visibility.showToUser: true, userCanStart: false). macOS ARM local runs are supported (conda).

What it does

For each sample:

Kraken2 assigns taxonomy to every read against the configured database.
Bracken re-estimates abundances at the configured rank from the Kraken2 report.
Krona renders an interactive HTML chart of the Bracken abundances.
A run-level summary TSV records the top taxon per sample.

Configuration

Parameter	Default	Description
`kraken2Db`	—	Kraken2 database directory (also holds Bracken kmer distributions). Pinned by admins via the install profile.
`confidence`	0.0	Kraken2 confidence score threshold (0.0–1.0; 0.0 disables filtering)
`brackenReadLength`	150	Read length selecting the Bracken kmer distribution built on the DB
`brackenLevel`	`S`	Bracken rank (`S` species, `G` genus, `F` family)
`krona`	true	Render an interactive Krona chart per sample

Outputs

Output	Scope	Description
`kraken2/{sample_id}.kraken2.report.txt`	Sample	Kraken2 classification report
`bracken/{sample_id}.bracken.tsv`	Sample	Bracken abundance table (top taxa parsed into artifact metadata)
`krona/{sample_id}.krona.html`	Sample	Interactive Krona chart (previewable)
`summary/kraken2-bracken-summary.tsv`	Run	Top taxon per sample
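
As an illustration of extracting the top taxa from a Bracken table, the sketch below ranks rows by `fraction_total_reads`. It assumes the standard Bracken output columns (`name`, `taxonomy_id`, `taxonomy_lvl`, `kraken_assigned_reads`, `added_reads`, `new_est_reads`, `fraction_total_reads`); the function `top_taxa` is hypothetical and is not SeqDesk's parser.

```python
import csv
import io


def top_taxa(bracken_tsv_text, n=5):
    """Return the n most abundant taxa as (name, fraction_total_reads) tuples."""
    rows = csv.DictReader(io.StringIO(bracken_tsv_text), delimiter="\t")
    ranked = sorted(
        rows, key=lambda row: float(row["fraction_total_reads"]), reverse=True
    )
    return [(row["name"], float(row["fraction_total_reads"])) for row in ranked[:n]]
```

Taking `n=1` per sample gives the kind of value recorded in the run-level summary TSV.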

Scope

Runs as a study pipeline across the samples of a study, and is also available as a sequencing order pipeline (targets.supported: ["study", "order"]).

Reference database setup: the Kraken2 database is pinned on the runner to /net/broker/checkm_refdata/kraken2_db via the install profile (hideWhenServerConfigured: true). Bracken additionally requires databaseXXmers.kmer_distrib files inside that directory, built once with bracken-build for the configured read length.
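
Before a run, it can help to confirm that the Bracken distribution file for the configured read length exists in the database directory. The sketch below assumes the file name follows the pattern above with the read length in place of XX (for example `database150mers.kmer_distrib`); both helper names are hypothetical.

```python
from pathlib import Path


def bracken_distribution_path(kraken2_db, read_length=150):
    """Build the expected kmer distribution path for a given read length."""
    return Path(kraken2_db) / f"database{read_length}mers.kmer_distrib"


def missing_bracken_distribution(kraken2_db, read_length=150):
    """Return the expected path if the file is missing, otherwise None."""
    expected = bracken_distribution_path(kraken2_db, read_length)
    return None if expected.is_file() else expected
```

A non-None result means the distribution for that `brackenReadLength` still has to be built with bracken-build.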

Study MultiQC

Pipeline: SeqDesk Study MultiQC v0.1.0 (bioconda::multiqc=1.21)
Purpose: Aggregate the QC outputs of prior study runs into one report
Input: Study samples + QC output directories of earlier runs
Role required: FACILITY_ADMIN

What Study MultiQC Does

Study MultiQC runs a single MultiQC pass over the QC outputs produced by earlier runs in the same study (FastQC zips, seqkit TSVs, read-cleaning multiqc_data) and produces one consolidated, previewable HTML report plus the parsed MultiQC data tables. It does not re-run any analysis — it only aggregates existing QC.

The report is written as study-multiqc.html (not the default multiqc_report.html) so it never collides with the MultiQC report emitted by the MAG pipeline in the same study.

Configuration

Parameter	Default	Description
`reportTitle`	`Study MultiQC report`	Title shown at the top of the aggregate report

Outputs

Output	Location	Description
MultiQC report	`multiqc/study-multiqc.html`	Aggregate study-level HTML report (previewable, stored as a study report)
MultiQC data	`multiqc/multiqc_data/`	Parsed MultiQC data tables and metadata (run artifacts, downloadable)

Scope

Study-scoped. Appears in the study pipeline catalog and runs across the samples of a study. It consumes the output directories of prior QC runs in the same study; if none have been gathered, it still completes and emits a report shell.

NanoPlot (Long-read QC)

Pipeline: NanoPlot v0.1.0 (local Nextflow workflow, bioconda::nanoplot=1.42.0)
Purpose: Quality control of Oxford Nanopore / PacBio long reads
Input: Long-read FASTQ files linked to a sequencing order (single-end)
Scope: Sequencing order
Role required: FACILITY_ADMIN

What NanoPlot Does

For each sample’s long-read FASTQ file, NanoPlot computes quality-control metrics and an interactive HTML report:

Read count and total bases
Read-length statistics — mean, median, and N50
Mean read quality (Phred)
Distribution plots — read length and quality, in a self-contained HTML report

A run-level summary TSV combines the per-sample metrics, and the key values are written back onto the linked Read records.
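
For readers unfamiliar with N50, the sketch below computes it from a list of read lengths using the common definition: the largest length L such that reads of length L or longer contain at least half of all bases. It is an illustration of the metric, not NanoPlot's implementation.

```python
from statistics import mean, median


def n50(lengths):
    """Return the read-length N50, or 0 for an empty input."""
    total = sum(lengths)
    running = 0
    # Walk from the longest read down until half of all bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


lengths = [2, 3, 4, 5, 6]
print(mean(lengths), median(lengths), n50(lengths))
```

In this example the total is 20 bases; the two longest reads (6 and 5) already cover 11 of them, so the N50 is 5 while the mean and median are both 4.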

This pipeline only appears on long-read orders: its sequencingCompatibility.readLengthClass is set to long (layout single, platforms oxford-nanopore / pacbio), so it is not offered on short-read (Illumina) orders.

Configuration

Parameter	Default	Description
(none)	—	NanoPlot QC runs with no user-configurable parameters

Outputs

Output	Location	Description
NanoPlot report	`nanoplot/{sample_id}_NanoPlot-report.html`	Interactive per-sample HTML quality report (previewable)
NanoStats	`nanoplot/{sample_id}_NanoStats.txt`	Per-sample summary metrics
Summary	`summary/nanoplot-summary.tsv`	Combined long-read statistics for all samples

After the run, read count and mean quality are written back to each Read record (read length N50 and mean length are kept in the artifact metadata).

A long-read demo dataset for SeqDesk does not yet exist, so this pipeline has not been exercised against bundled demo data.

Adding More Pipelines

SeqDesk supports adding custom pipelines through a package structure. See Adding Custom Pipelines for details on creating your own pipeline integrations.

To propose a pipeline for the SeqDesk-hosted store, see Contributing to the Official Pipeline Store.