This tutorial describes how to produce OTUs and an abudance table from the raw sequenced reads.
The raw data are those used in Mothur Miseq SOP.
wget http://www.mothur.org/w/images/d/d6/MiSeqSOPData.zip
unzip MiSeqSOPData.zip
To automate this tutorial we suggest to set an environment that includes:
Install:
vsearch
- Github link [10.7717/peerj.2584]biom
- Installation instructions [10.1186/2047-217X-1-7]trimmomatic
- Installation instructions [10.1093/bioinformatics/btu170]flash
- Github link [10.1093/bioinformatics/btr507]bioperl
- Installation instructionsls MiSeq_SOP/* | cut -d "/" -f2 | grep "_R1_" > file.list VSEARCH=/path/to/your/vsearch BIOM=/path/to/your/biom
Trim reads to remove low quality regions.
Create a perl script [auto_trim.pl] containing the code below.
Edit parameters in the script:
-threads
- Number of CPUs to use.TRAILING:5
- Trim the 3' region until reaches the Phred score of 5.LEADING:5
- Trim the 5' region until reaches the Phred score of 5.SLIDINGWINDOW:4:15
- Slide a window of size 4. Inside the window the mean Phred quality must to be 15 or more. If not, trim the read at that point.MINLEN:130
- Drop trimmed reads smaller than 130 bases.#!/usr/bin/perl use strict; use warnings; system("mkdir trimmed"); system("mkdir logs"); my $run = $ARGV[1]; open LIST, $ARGV[0]; while(<LIST>) { chomp; my $base; if ($_ =~ /(.*)_R1_/) { $base = $1; } my $out; if ($base =~ /(.*)_S\d+/) { $out = $1; } system "java -jar $run PE -threads 10 -phred33 MiSeq_SOP/$base\_R1_001.fastq MiSeq_SOP/$base\_R2_001.fastq trimmed/$out.paired.1.fastq trimmed/$out.single.1.fastq trimmed/$out.paired.2.fastq trimmed/$out.single.2.fastq TRAILING:5 LEADING:5 SLIDINGWINDOW:4:15 MINLEN:130 2> logs/$out.trim.log"; } close LIST;
Save the script and run it. The parameters are:
perl auto_trim.pl file.list /path/to/your/trimmomatic.jar
Merge overlapping reads to create a contiguous sequence that represents the whole amplicon.
Create a script [auto_merge.pl], paste the code below and edit parameters.
-m
- Minimum overlap length-M
- Maximum overlap length-x
- Mismatches rate in overlapping region#!/usr/bin/perl use strict; use warnings; system("mkdir merged"); my $run = $ARGV[1]; open LIST, $ARGV[0]; while(<LIST>) { chomp; my $base; if ($_ =~ /(.*)_R1_/) { $base = $1; } my $out; if ($base =~ /(.*)_S\d+/) { $out = $1; } system "$run -t 40 -m 20 -M 150 -x 0.15 trimmed/$out.paired.1.fastq trimmed/$out.paired.2.fastq -o merged/$out > logs/$out.flash.log"; }
Run the script as the same way as in previous step.
perl auto_merge.pl file.list /path/to/your/flash grep "Combined pairs:" logs/* | cut -d ":" -f1,3 | sed -e 's/logs\///g'
The script below converts the FASTQ files to one FASTA file, and formats the header to convert the mapping to OTU table in future steps.
Create a script [generate_fasta.pl], and paste the code below.
#!/usr/bin/perl use strict; use warnings; use Bio::SeqIO; open OUTFASTA, ">all_seqs.fasta"; while (<>) { chomp; my $base; if ($_ =~ /(.*)_R1_/) { $base = $1; } my $out; if ($base =~ /(.*)_S\d+/) { $out = $1; } my $seqio_obj = Bio::SeqIO->new(-file => "merged/$out.extendedFrags.fastq", -format => "fastq" ); my $count = 1; while (my $seq_obj = $seqio_obj->next_seq ) { my $seq = $seq_obj->seq; print OUTFASTA ">$out\_$count;barcodelabel=$out\n".$seq."\n"; $count++; } } close OUTFASTA;
Run the command to generate the all_seqs.fasta
file.
perl generate_fasta.pl file.list
Select unique sequences that are represented at least by 2 reads.
$VSEARCH --derep_fulllength all_seqs.fasta --minuniquesize 2 --output derep.fasta
At this point, there are two alternatives:
$VSEARCH --usearch_global all_seqs.fasta --db derep.fasta --id 0.97 --uc map.uc --relabel_sha1 --relabel_keep
$VSEARCH --cluster_fast derep.fasta --id 0.97 --centroids otu97.fasta $VSEARCH --usearch_global all_seqs.fasta --db otu97.fasta --id 0.97 --uc map.uc --relabel_sha1 --relabel_keep
Convert the UC mapping file to a BIOM file, and then to a Table Separated Values OTU file.
$BIOM from-uc -i map.uc -o otu_table.biom $BIOM convert -i otu_table.biom -o otu_table.txt --to-tsv