Create OTUs from raw FASTQ


This tutorial describes how to produce OTUs and an abudance table from the raw sequenced reads.

Download reads

The raw data are those used in Mothur Miseq SOP.

wget http://www.mothur.org/w/images/d/d6/MiSeqSOPData.zip
unzip MiSeqSOPData.zip

Set environment

To automate this tutorial we suggest to set an environment that includes:

  • defining the list of files to be processed.
  • create aliases to the necessary tools.

Install:

ls MiSeq_SOP/* | cut -d "/" -f2 | grep "_R1_" > file.list
VSEARCH=/path/to/your/vsearch
BIOM=/path/to/your/biom

Run quality trimming

Trim reads to remove low quality regions.
Create a perl script [auto_trim.pl] containing the code below.
Edit parameters in the script:

  • -threads - Number of CPUs to use.
  • TRAILING:5 - Trim the 3' region until reaches the Phred score of 5.
  • LEADING:5 - Trim the 5' region until reaches the Phred score of 5.
  • SLIDINGWINDOW:4:15 - Slide a window of size 4. Inside the window the mean Phred quality must to be 15 or more. If not, trim the read at that point.
  • MINLEN:130 - Drop trimmed reads smaller than 130 bases.
#!/usr/bin/perl

use strict;
use warnings;

system("mkdir trimmed");
system("mkdir logs");

my $run = $ARGV[1];

open LIST, $ARGV[0];

while(<LIST>) {
        chomp;
        my $base;
        if ($_ =~ /(.*)_R1_/) {
                $base = $1;
        }

        my $out;
        if ($base =~ /(.*)_S\d+/) {
                $out = $1;
        }

        system "java -jar $run PE -threads 10 -phred33 MiSeq_SOP/$base\_R1_001.fastq MiSeq_SOP/$base\_R2_001.fastq trimmed/$out.paired.1.fastq trimmed/$out.single.1.fastq trimmed/$out.paired.2.fastq trimmed/$out.single.2.fastq TRAILING:5 LEADING:5 SLIDINGWINDOW:4:15 MINLEN:130 2> logs/$out.trim.log";
}

close LIST;

Save the script and run it. The parameters are:

  • the file.list containg the files names (generated in "Set environment" section)
  • the path to the trimmomatic jar file.
perl auto_trim.pl file.list /path/to/your/trimmomatic.jar

Run reads merging

Merge overlapping reads to create a contiguous sequence that represents the whole amplicon.
Create a script [auto_merge.pl], paste the code below and edit parameters.

  • -m - Minimum overlap length
  • -M - Maximum overlap length
  • -x - Mismatches rate in overlapping region
#!/usr/bin/perl

use strict;
use warnings;

system("mkdir merged");

my $run = $ARGV[1];

open LIST, $ARGV[0];

while(<LIST>) {
        chomp;
        my $base;
        if ($_ =~ /(.*)_R1_/) {
                $base = $1;
        }

        my $out;
        if ($base =~ /(.*)_S\d+/) {
                $out = $1;
        }
        system "$run -t 40 -m 20 -M 150 -x 0.15 trimmed/$out.paired.1.fastq trimmed/$out.paired.2.fastq -o merged/$out > logs/$out.flash.log";
}

Run the script as the same way as in previous step.

perl auto_merge.pl file.list /path/to/your/flash
grep "Combined pairs:" logs/* | cut -d ":" -f1,3 | sed -e 's/logs\///g'

Generate formatted FASTA

The script below converts the FASTQ files to one FASTA file, and formats the header to convert the mapping to OTU table in future steps.
Create a script [generate_fasta.pl], and paste the code below.

#!/usr/bin/perl

use strict;
use warnings;
use Bio::SeqIO;

open OUTFASTA, ">all_seqs.fasta";

while (<>) {
        chomp;
        my $base;
        if ($_ =~ /(.*)_R1_/) {
                $base = $1;
        }

        my $out;
        if ($base =~ /(.*)_S\d+/) {
                $out = $1;
        }

	my $seqio_obj = Bio::SeqIO->new(-file => "merged/$out.extendedFrags.fastq",
                             -format => "fastq" );

	my $count = 1;

	while (my $seq_obj = $seqio_obj->next_seq ) {
		my $seq =  $seq_obj->seq;
        	print OUTFASTA ">$out\_$count;barcodelabel=$out\n".$seq."\n";
		$count++;
	}

}
close OUTFASTA;

Run the command to generate the all_seqs.fasta file.

perl generate_fasta.pl file.list

Dereplicate

Select unique sequences that are represented at least by 2 reads.

$VSEARCH --derep_fulllength all_seqs.fasta --minuniquesize 2 --output derep.fasta

Create OTUs

At this point, there are two alternatives:

  • Map the reads to the unique dereplicated sequences. It can be considered as a 97% identity mapping to 100% OTU clusters.
  • Create OTU clusters from dereplicated sequences. And map all reads to the OTU sequences to quantify them.

Map to dereplicated sequences

$VSEARCH --usearch_global all_seqs.fasta --db derep.fasta --id 0.97 --uc map.uc --relabel_sha1 --relabel_keep

Create 97% OTU clusters

$VSEARCH  --cluster_fast derep.fasta --id 0.97 --centroids otu97.fasta
$VSEARCH --usearch_global all_seqs.fasta --db otu97.fasta --id 0.97 --uc map.uc --relabel_sha1 --relabel_keep

Create BIOM and OTU Count Table

Convert the UC mapping file to a BIOM file, and then to a Table Separated Values OTU file.

$BIOM from-uc -i map.uc -o otu_table.biom
$BIOM convert -i otu_table.biom -o otu_table.txt --to-tsv