Users can build their own databases to train their models.
This requires one FASTA file containing the marker gene sequences and a taxonomic description for each sequence.
* Filter your files to keep only sequences with a precise taxonomic assignment.
In this example we used the SILVA SSU database.
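As an illustration of the filtering step, here is a minimal Python sketch. It is not part of createDB.pl. It assumes SILVA-style headers (`>ID rank1;rank2;...`), and the definition of "precise" used here (at least `min_ranks` levels and none of the terms in `AMBIGUOUS_TERMS`) is a hypothetical rule that you should adapt to your own database.

```python
import io

# Hypothetical markers of an imprecise assignment; adjust for your database.
AMBIGUOUS_TERMS = ("uncultured", "unidentified", "metagenome", "incertae sedis")


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def has_precise_taxonomy(header, min_ranks=6):
    """Assumed rule: enough ranks and no ambiguous term in the lineage."""
    parts = header.split(None, 1)
    if len(parts) < 2:
        return False  # no taxonomy string after the sequence ID
    lineage = parts[1]
    ranks = [r for r in lineage.split(";") if r.strip()]
    if len(ranks) < min_ranks:
        return False
    lowered = lineage.lower()
    return not any(term in lowered for term in AMBIGUOUS_TERMS)


def filter_fasta(handle, min_ranks=6):
    """Return the records whose header passes has_precise_taxonomy."""
    return [(h, s) for h, s in read_fasta(handle)
            if has_precise_taxonomy(h, min_ranks)]


# Toy input: seq1 has a full lineage, seq2 is short and "uncultured".
example = io.StringIO(
    ">seq1 Bacteria;Proteobacteria;Gammaproteobacteria;"
    "Enterobacterales;Enterobacteriaceae;Escherichia\n"
    "ACGTACGT\nACGT\n"
    ">seq2 Bacteria;Proteobacteria;uncultured bacterium\n"
    "GGGGCCCC\n"
)
kept = filter_fasta(example)
print([h.split()[0] for h, _ in kept])
```

To use it on a real file, open the file with `open("your_db.fasta")` and pass the handle to `filter_fasta`.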
This step is performed by a script: perl createDB.pl
Requires:
BioPerl, Bowtie, VSEARCH
This step is performed by a script: perl createModel.pl
Usage:
    perl createModel.pl -pr <primers.fasta> -fa <16sDB.fasta> [options]

Options:
    -o:   Output directory to create the models. [default: .]
    -t:   Number of threads [default: 1]
    -min: Minimum length of amplicon [default: 200]
    -max: Maximum length of amplicon [default: 600]
    -mm:  Mismatches allowed in PCR [default: 3]
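To make the roles of -min, -max and -mm concrete, here is a small Python sketch of an in-silico PCR. It is only an illustration and not how createModel.pl is implemented. It assumes plain A/C/G/T/N primers (no IUPAC degenerate bases), counts substitutions only, and measures the amplicon length including both primer sites. All function names are hypothetical.

```python
def reverse_complement(seq):
    """Reverse complement of an A/C/G/T/N sequence."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def find_primer(seq, primer, max_mismatches, start=0):
    """First index >= start where primer matches seq with at most
    max_mismatches substitutions, or -1 if there is no such site."""
    for i in range(start, len(seq) - len(primer) + 1):
        mismatches = 0
        for a, b in zip(seq[i:i + len(primer)], primer):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break
        else:
            return i  # inner loop finished without exceeding the limit
    return -1


def extract_amplicon(seq, forward, reverse,
                     min_len=200, max_len=600, max_mismatches=3):
    """Return the amplicon (primer sites included, an assumption here)
    or None if a primer is missing or the length is out of range."""
    seq = seq.upper()
    fwd_pos = find_primer(seq, forward.upper(), max_mismatches)
    if fwd_pos < 0:
        return None
    # The reverse primer anneals to the opposite strand, so search for
    # its reverse complement downstream of the forward primer.
    rev_site = reverse_complement(reverse)
    rev_pos = find_primer(seq, rev_site, max_mismatches,
                          start=fwd_pos + len(forward))
    if rev_pos < 0:
        return None
    amplicon = seq[fwd_pos:rev_pos + len(rev_site)]
    if min_len <= len(amplicon) <= max_len:
        return amplicon
    return None


# Toy template: the forward site carries one mismatch (ACGTAG vs ACGTAC).
template = "TT" + "ACGTAG" + "G" * 10 + "GAGCTT" + "AA"
amplicon = extract_amplicon(template, "ACGTAC", "AAGCTC",
                            min_len=10, max_len=50, max_mismatches=1)
print(amplicon, len(amplicon))
```

With the default limits (200 to 600) the same toy template is rejected, because its amplicon is far too short.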
For each taxon, and for each combination of cumulative variable importance (from 70% to 100%, in steps of 10%) and class size limit (from the 80th to the 100th percentile, in steps of 5), calculate the sensitivity and specificity.
This step is performed by the script:
    for v in 70 80 90; do Rscript ParallelCross_SaveEach.R -v $v; done
    for p in 0.85 0.9 1; do Rscript ParallelCross_SaveEach.R -v 80 -p $p; done
Requires:
R, with the libraries optparse, randomForest and doMC
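For reference, the sensitivity and specificity of a single taxon can be computed from true and predicted labels as in the Python sketch below. It uses the standard one-vs-rest definitions and is not taken from ParallelCross_SaveEach.R. The function name and the toy labels are hypothetical.

```python
def sensitivity_specificity(true_labels, predicted_labels, taxon):
    """One-vs-rest sensitivity and specificity for one taxon.

    sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == taxon:
            if pred == taxon:
                tp += 1
            else:
                fn += 1
        else:
            if pred == taxon:
                fp += 1
            else:
                tn += 1
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


# Toy labels for three taxa.
truth = ["A", "A", "A", "B", "B", "C"]
pred = ["A", "A", "B", "B", "A", "C"]
sens, spec = sensitivity_specificity(truth, pred, "A")
print(round(sens, 3), round(spec, 3))
```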
Calculate the best values of:
The number of features is estimated by the cumulative sum of the relative importance of each feature until it reaches the cutoff value: 70, 80, 90 and 100%.
The maximum number of points is estimated by the distribution of "points per class" and limited to a percentile cutoff: 80, 85, 90, 95 and 100 percentiles.
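The two estimates above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code of the R scripts: features are ranked by decreasing importance, the percentile uses linear interpolation between ranks, and the resulting limit is rounded down. All names and the toy numbers are hypothetical.

```python
import math


def n_features_for_cutoff(importances, cutoff_percent):
    """Smallest number of top-ranked features whose cumulative relative
    importance reaches cutoff_percent (for example 70, 80, 90 or 100)."""
    total = sum(importances)
    ranked = sorted(importances, reverse=True)
    cumulative = 0.0
    for n, value in enumerate(ranked, start=1):
        cumulative += value
        # Small tolerance so a 100% cutoff is not missed by rounding error.
        if 100.0 * cumulative / total >= cutoff_percent - 1e-9:
            return n
    return len(ranked)


def percentile(values, q):
    """Percentile q (0 to 100) by linear interpolation between ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower, upper = math.floor(pos), math.ceil(pos)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def max_points_per_class(points_per_class, percentile_cutoff):
    """Limit on points per class taken from the distribution of class sizes."""
    return math.floor(percentile(list(points_per_class.values()),
                                 percentile_cutoff))


# Toy importances that sum to 100, and toy class sizes with one outlier.
importances = [40, 25, 15, 10, 6, 4]
print([n_features_for_cutoff(importances, c) for c in (70, 80, 90, 100)])

points = {"a": 10, "b": 20, "c": 30, "d": 40, "e": 1000}
limit = max_points_per_class(points, 50)
capped = {name: min(size, limit) for name, size in points.items()}
print(limit, capped)
```

A lower percentile cutoff mainly trims the largest classes, which keeps a few over-represented taxa from dominating the training set.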
For each combination of taxon, cumulative importance cutoff and percentile limit on the number of points, the script takes each of the 20 files in the "outfiles" folder and calculates the Area Under the Curve (AUC).
It keeps increasing the values until the AUC no longer improves significantly (t-test, p < 0.05).
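The selection rule can be sketched in Python as below. This is not the code of GenerateThreshold.R, and several details are assumptions: the AUC is computed by pairwise ranking, the t-test is taken to be a paired, one-sided test over the 20 folds, and significance is checked against the tabulated critical value for 19 degrees of freedom instead of computing a p-value. All names and numbers are hypothetical.

```python
import math
from statistics import mean, stdev

# One-sided critical value of Student's t, alpha = 0.05, 19 degrees of
# freedom (20 folds). Assumption: the folds are compared pairwise.
T_CRITICAL_DF19 = 1.729


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive has the higher score; ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def paired_t_statistic(new_aucs, old_aucs):
    """t statistic of the per-fold AUC differences (new minus old)."""
    diffs = [a - b for a, b in zip(new_aucs, old_aucs)]
    sd = stdev(diffs)
    if sd == 0.0:
        return math.inf if mean(diffs) > 0 else 0.0
    return mean(diffs) / (sd / math.sqrt(len(diffs)))


def pick_setting(aucs_by_setting):
    """aucs_by_setting: list of (setting, fold AUCs) in increasing order.
    Move to the next setting only while the AUC improves significantly."""
    best_setting, best_aucs = aucs_by_setting[0]
    for setting, fold_aucs in aucs_by_setting[1:]:
        if paired_t_statistic(fold_aucs, best_aucs) <= T_CRITICAL_DF19:
            break
        best_setting, best_aucs = setting, fold_aucs
    return best_setting


print(auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0]))

# Toy fold AUCs: 80 clearly beats 70, 90 only adds noise around 80.
aucs_70 = [0.80 + 0.001 * i for i in range(20)]
aucs_80 = [a + 0.05 + 0.001 * (i % 3) for i, a in enumerate(aucs_70)]
aucs_90 = [a + (0.002 if i % 2 == 0 else -0.002)
           for i, a in enumerate(aucs_80)]
chosen = pick_setting([(70, aucs_70), (80, aucs_80), (90, aucs_90)])
print(chosen)
```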
Run script: Rscript GenerateThreshold.R
Perform 20-fold cross-validation model construction, similar to the previous cross-validation test, using the threshold values calculated in the previous step.
Generate tables for each taxon containing the Specificity of prediction for each Ratio cutoff value from 0.1 to 10.
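The shape of such a table can be illustrated with the Python sketch below. It is not the code of ParallelCross.R, and it rests on assumptions: each sequence has a per-taxon "ratio" score, a sequence is assigned to the taxon when its ratio is at or above the cutoff, and the cutoffs run from 0.1 to 10 in steps of 0.1 (the step size is not stated in this document). All names and values are hypothetical.

```python
def specificity_table(ratios, is_member, cutoffs):
    """Specificity at each cutoff: the fraction of non-member sequences
    whose ratio falls below the cutoff (true negatives / all negatives)."""
    negatives = [r for r, member in zip(ratios, is_member) if not member]
    if not negatives:
        raise ValueError("need at least one non-member sequence")
    table = []
    for cutoff in cutoffs:
        true_negatives = sum(1 for r in negatives if r < cutoff)
        table.append((cutoff, true_negatives / len(negatives)))
    return table


# Cutoffs 0.1, 0.2, ..., 10.0 (assumed step of 0.1).
cutoffs = [round(0.1 * k, 1) for k in range(1, 101)]

# Toy scores for one taxon: three non-members and two members.
ratios = [0.05, 0.5, 2.0, 8.0, 12.0]
is_member = [False, False, False, True, True]

table = dict(specificity_table(ratios, is_member, cutoffs))
for cutoff in (0.1, 1.0, 2.1):
    print(cutoff, round(table[cutoff], 3))
```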
The input files are the variable importance files for each taxon and the threshold.txt file calculated in the previous step.
The output is a set of TXT files containing the cutoff values. These files are used as input for the prediction step.
Command line: Rscript ParallelCross.R