Getting Started with FastQC — Fast, Reliable Read Quality Checks

FastQC is a widely used tool for evaluating the quality of high-throughput sequencing data. It provides an easy-to-read set of visualizations and summary metrics that help you quickly identify potential problems with sequencing reads before downstream analysis. This guide walks you through installing FastQC, running it on common file types, interpreting its reports, troubleshooting common issues, and integrating FastQC into automated pipelines.


What is FastQC?

FastQC performs a suite of quality control checks on raw sequence data from platforms like Illumina, Ion Torrent, and others. Its goal is not to “fix” data but to highlight patterns and anomalies—such as low-quality cycles, adapter contamination, uneven base composition, or overrepresented sequences—that may require trimming, filtering, or further investigation.

Key outputs: per-base quality scores, per-sequence quality scores, GC content, sequence length distribution, duplication levels, overrepresented sequences, and more.


Installing FastQC

FastQC is a Java-based application. You can run it on Linux, macOS, or Windows.

  1. Download:

    • Get the latest FastQC release from the official distribution (e.g., the Babraham Institute website) or via package managers where available.
  2. Requirements:

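    • Java 1.6 or higher; newer FastQC releases may need a more recent Java, so check the install notes shipped with your download. Run java -version to confirm that a JVM is installed.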
    • Optional: a graphical environment to view HTML reports; otherwise view them in a browser on another machine.
  3. Installation steps (Linux/macOS example):

    • Unpack the downloaded archive:
      
      unzip fastqc_v0.11.9.zip
    • Make the launcher script executable:
      
      chmod +x FastQC/fastqc 
    • Optionally add FastQC to your PATH for convenience:
      
      export PATH=$PATH:/path/to/FastQC 

FastQC is also available on Linux and macOS through package managers such as conda (Bioconda channel):

conda install -c bioconda fastqc 
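
After installing, it can help to confirm that both Java and the fastqc launcher are visible on your PATH. The snippet below is a minimal sketch of such a check; the helper name require_tool is made up for this example and is not part of FastQC.

```shell
# Report whether a command is available on PATH.
# require_tool is a hypothetical helper name used only in this sketch.
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1" >&2
        return 1
    fi
}

# Typical use after installation (uncomment to run):
# require_tool java && require_tool fastqc && fastqc --version
```

If either tool is reported as missing, revisit the PATH export step or activate the conda environment that holds FastQC.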

Input file formats and running FastQC

FastQC accepts FASTQ files (plain or gzip-compressed) as well as SAM and BAM files. Typical usage on FASTQ files:

Basic command:

fastqc sample_R1.fastq.gz sample_R2.fastq.gz 

Common options:

  • -o / --outdir : specify output directory
  • -t / --threads : number of threads to use (each thread processes one file at a time)
  • -f / --format : input format (e.g., fastq, bam, sam)
  • --noextract : don’t unzip the zipped output

Example with options (the output directory must already exist; FastQC will not create it):

fastqc -o qc_reports -t 4 sample_R1.fastq.gz sample_R2.fastq.gz 

For BAM files:

fastqc -f bam aligned_reads.bam 
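
For more than a handful of files, the commands above are usually wrapped in a loop. The following is a rough sketch under a few assumptions: input files end in .fastq.gz, the function name run_fastqc_dir is invented for this example, and DRY_RUN is a hypothetical switch that prints the commands instead of running them.

```shell
# Run FastQC on every gzipped FASTQ file in a directory.
# Assumptions: files end in .fastq.gz; DRY_RUN is a hypothetical switch
# for this sketch that prints commands instead of executing them.
run_fastqc_dir() {
    indir=$1
    outdir=$2
    threads=${3:-2}

    # FastQC does not create the output directory itself.
    [ -n "${DRY_RUN:-}" ] || mkdir -p "$outdir"

    for fq in "$indir"/*.fastq.gz; do
        [ -e "$fq" ] || continue   # skip when the glob matches nothing
        if [ -n "${DRY_RUN:-}" ]; then
            echo "fastqc -o $outdir -t $threads $fq"
        else
            fastqc -o "$outdir" -t "$threads" "$fq"
        fi
    done
}

# Example: run_fastqc_dir raw_reads qc_reports 4
```

Calling fastqc once per file keeps the loop simple; passing all files to a single fastqc call with -t lets FastQC process several files in parallel instead.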

Output:

  • An HTML report per input file (plots and summaries for each module)
  • A zipped folder containing the HTML and supporting data files
  • A short text summary file (summary.txt, inside the zipped folder) showing PASS/WARN/FAIL for each module
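
Because summary.txt is a plain tab-separated file (status, module name, input file name), it is easy to scan from the command line. A minimal sketch, assuming a report named sample_R1_fastqc.zip; the function name list_flagged_modules is made up for this example.

```shell
# Print modules that did not pass, given a FastQC summary.txt file.
# summary.txt is tab-separated: STATUS <tab> module name <tab> input file.
list_flagged_modules() {
    awk -F '\t' '$1 != "PASS" { print $1 "\t" $2 }' "$1"
}

# Example, reading the summary straight from the zipped report
# (sample_R1_fastqc.zip is an assumed file name):
# unzip -p sample_R1_fastqc.zip sample_R1_fastqc/summary.txt > summary.txt
# list_flagged_modules summary.txt
```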

Understanding FastQC modules and reports

FastQC runs multiple modules; each module scores the data with Pass, Warn, or Fail. Key modules:

  1. Per base sequence quality

    • Visualizes quality scores across read positions.
    • Look for a steady decline towards the 3’ end (common in Illumina). Sharp drops or consistently low scores indicate trimming/filtering is needed.
    • Quality is usually plotted as Phred scores; as a rule of thumb, Phred ≥ 20 is acceptable, ≥ 30 is good.
  2. Per sequence quality scores

    • Shows the distribution of mean quality per read. A long tail of low-quality reads suggests that those reads should be filtered out before downstream analysis.
  3. Per base sequence content

    • Checks proportion of A/T/C/G at each position.
    • Ideally flat lines after the first ~5–10 bases for random libraries. Strong biases can indicate adapter contamination or library preparation issues.
  4. Per sequence GC content

    • Compares observed GC distribution to a theoretical normal distribution for random sequences. Significant deviations can indicate contamination or biased libraries.
  5. Per base N content

    • High N content at certain positions suggests sequencing uncertainty.
  6. Sequence Length Distribution

    • Useful for variable-length libraries (e.g., small RNA). Unexpected length distributions may indicate processing problems.
  7. Sequence Duplication Levels

    • High duplication in genomic DNA libraries may signal PCR artifacts or low-complexity libraries. For RNA-seq, some duplication is expected for highly expressed genes.
  8. Overrepresented sequences

    • Lists sequences that appear more frequently than expected; may be adapters, control sequences, or contamination.
  9. Adapter Content

    • Detects common adapter sequences. If present above a low threshold, trimming is recommended.
  10. Kmer Content

    • Looks for overrepresented kmers; spikes can point to contamination or biases.
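
The Phred thresholds mentioned under "Per base sequence quality" are easier to judge as error rates. A Phred score Q corresponds to a base-call error probability of 10^(-Q/10), so Q20 means roughly 1 error in 100 bases and Q30 roughly 1 in 1,000. A small sketch of the conversion (the function name phred_to_error is invented for this example):

```shell
# Convert a Phred quality score Q to the base-call error probability,
# using P = 10^(-Q/10).
phred_to_error() {
    awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}

phred_to_error 20   # prints 0.0100
phred_to_error 30   # prints 0.0010
```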

Typical interpretation and actions

  • Failing “Per base sequence quality” with low scores at read ends:
    • Trim low-quality bases with tools like Trimmomatic, cutadapt, or fastp.
  • High adapter content or overrepresented sequences:
    • Remove adapters with cutadapt or fastp.
  • Strong per-base composition bias at start of reads:
    • Trim the first few bases or review library prep method.
  • High duplication levels in DNA libraries:
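    • Consider PCR-free library prep or fewer PCR cycles, or mark duplicates downstream. Sequencing a low-complexity library more deeply tends to increase duplication rather than reduce it.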
  • Unexpected GC distribution:
    • Check for contamination (e.g., rRNA, bacterial sequences) by running a contaminant screen (Kraken2, FastQ Screen).

Example trimming command with fastp:

fastp -i sample_R1.fastq.gz -I sample_R2.fastq.gz -o trimmed_R1.fastq.gz -O trimmed_R2.fastq.gz -w 4 
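
The module-to-action pairs in the interpretation list above can also be written down as a small lookup, which is handy when annotating a batch of summary files. This is only a sketch: suggest_action is a hypothetical helper, and the suggestions simply restate the advice in this guide as a starting point.

```shell
# Map a FastQC module name to the follow-up action suggested in this guide.
# suggest_action is a hypothetical helper; the advice is a starting point only.
suggest_action() {
    case "$1" in
        "Per base sequence quality")
            echo "Trim low-quality bases (Trimmomatic, cutadapt, or fastp)" ;;
        "Adapter Content"|"Overrepresented sequences")
            echo "Remove adapters (cutadapt or fastp)" ;;
        "Per base sequence content")
            echo "Trim the first few bases or review library prep" ;;
        "Sequence Duplication Levels")
            echo "Review library complexity; mark duplicates downstream" ;;
        "Per sequence GC content")
            echo "Screen for contamination (Kraken2, FastQ Screen)" ;;
        *)
            echo "Inspect the report manually" ;;
    esac
}

suggest_action "Adapter Content"
```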

Integrating FastQC into pipelines and automation

  • Single-sample runs are common in exploratory analysis; for large projects, run FastQC on all samples and aggregate results.
  • MultiQC is a great tool to summarize many FastQC reports into a single interactive report:
    
    multiqc . 
  • Use workflow managers (Snakemake, Nextflow, or Cromwell) to include FastQC as an early QC step and to run trimming/cleaning conditionally based on results.
  • Example Snakemake rule (simplified):
    
    rule fastqc:
        input:
            "{sample}.fastq.gz"
        output:
            "qc/{sample}_fastqc.html"
        shell:
            "mkdir -p qc && fastqc -o qc {input}"
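
Running trimming conditionally, as mentioned in the list above, needs some rule for reading the FastQC result. One possible approach is sketched below: treat a sample as needing trimming when either "Adapter Content" or "Per base sequence quality" is flagged in summary.txt. The choice of modules and the function name needs_trimming are assumptions for illustration, not a FastQC convention.

```shell
# Decide whether a sample needs trimming, based on its FastQC summary.txt.
# Returns 0 (true) when either module below is flagged WARN or FAIL.
# The choice of modules is an assumption for this sketch.
needs_trimming() {
    awk -F '\t' '
        $1 != "PASS" && ($2 == "Adapter Content" || $2 == "Per base sequence quality") { flagged = 1 }
        END { exit (flagged ? 0 : 1) }
    ' "$1"
}

# Example use inside a pipeline step (file names are placeholders):
# if needs_trimming sample_R1_fastqc/summary.txt; then
#     fastp -i sample_R1.fastq.gz -o trimmed_R1.fastq.gz
# fi
```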

Common pitfalls and troubleshooting

  • Java memory errors: FastQC allocates memory per thread (about 250 MB each), so make sure enough memory is available or run with fewer threads.
  • Corrupted FASTQ headers: FastQC may fail; validate with seqkit or re-generate FASTQ.
  • Interpreting warnings: many “Warn” flags are not fatal; consider context (library type, platform).
  • Over-reliance: FastQC highlights issues but doesn’t replace informed judgment about whether to re-sequence, trim, or proceed.
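
When FastQC fails on an input file, a quick structural check can show whether the FASTQ itself is truncated or malformed. The sketch below is a rough sanity check rather than a full validator: it only confirms that the line count is a non-zero multiple of four and that header and separator lines start with "@" and "+". The function name fastq_quick_check is invented for this example.

```shell
# Quick structural check of a FASTQ file (plain or gzipped):
# the line count must be a non-zero multiple of 4, every record must
# start with "@", and the third line of each record must start with "+".
# This is a rough sanity check, not a full validator.
fastq_quick_check() {
    case "$1" in
        *.gz) gzip -cd "$1" ;;
        *)    cat "$1" ;;
    esac | awk '
        NR % 4 == 1 && substr($0, 1, 1) != "@" { bad = 1 }
        NR % 4 == 3 && substr($0, 1, 1) != "+" { bad = 1 }
        END {
            if (NR == 0 || NR % 4 != 0) bad = 1
            exit (bad ? 1 : 0)
        }
    '
}

# Example: fastq_quick_check sample_R1.fastq.gz && echo "structure looks OK"
```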

Recommended workflow

  1. Run FastQC on raw data.
  2. Inspect per-base quality, adapter content, and overrepresented sequences.
  3. Trim adapters and low-quality bases (fastp or cutadapt).
  4. Re-run FastQC on trimmed reads to confirm improvements.
  5. Aggregate results with MultiQC for batch QC review.
  6. Proceed to alignment/quantification.
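
The steps above can be strung together in a short script. What follows is a sketch for one paired-end sample, not a production pipeline. It assumes inputs named <sample>_R1.fastq.gz and <sample>_R2.fastq.gz, uses arbitrary directory names, and relies on a hypothetical DRY_RUN switch that prints the commands instead of executing them. Step 2 (inspecting the reports) stays manual.

```shell
# Sketch of the workflow above for one paired-end sample.
# Assumptions: inputs are named <sample>_R1.fastq.gz / <sample>_R2.fastq.gz,
# and the directory names are arbitrary choices for this example.
# Set DRY_RUN=1 (a hypothetical switch) to print commands without running them.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

qc_workflow() {
    s=$1
    run mkdir -p qc_raw qc_trimmed trimmed
    # Step 1: FastQC on raw reads
    run fastqc -o qc_raw "${s}_R1.fastq.gz" "${s}_R2.fastq.gz"
    # Step 3: trim adapters and low-quality bases
    run fastp -i "${s}_R1.fastq.gz" -I "${s}_R2.fastq.gz" \
        -o "trimmed/${s}_R1.fastq.gz" -O "trimmed/${s}_R2.fastq.gz"
    # Step 4: FastQC on trimmed reads
    run fastqc -o qc_trimmed "trimmed/${s}_R1.fastq.gz" "trimmed/${s}_R2.fastq.gz"
    # Step 5: aggregate both rounds of reports
    run multiqc -o multiqc_report qc_raw qc_trimmed
}

# Example: DRY_RUN=1 qc_workflow sampleA
```

Printing the commands first makes it easy to review file names before committing compute time to a full run.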

Further resources

  • FastQC documentation and user guide for detailed explanations of modules.
  • Tutorials combining FastQC + cutadapt/fastp + MultiQC.
  • Community forums and workflow repositories (GitHub) for example pipelines.

FastQC is a lightweight, informative first step for sequencing QC. Used routinely, it helps catch common problems early and steers decisions about trimming, filtering, and re-sequencing.
