Running ChloroScan
ChloroScan is specialized for plastid genome bin recovery, particularly photosynthetic microalgae and macroalgae and complex plastids derived from them, such as diatoms and haptophytes. The primary inputs it wants are contigs and their depth profiles.
For most cases the default settings will be working well, but you can also configure the parameters for each step. Here we will briefly introduce the inputs and outputs of each step, and the parameters you can configure for each step.
1. Are there any plastids in your data?
Photosynthetic algae/protists commonly dwell in environments like ocean waters, freshwater, lichen and humid soils. Anthropogenic environments, underground environments and sediments are less likely to have them in high abundances. So if your data come from these environments, ChloroScan may not recover any plastid MAGs.
Meanwhile, if your data contain too fragmented plastid contigs with low coverage (commonly < 5.0 x), ChloroScan may still miss them. Hopefully our newer versions could resolve these issues.
2. A common minimal command
You’ve got your contigs, mapping files in bam and configuration files. You now can run the whole workflow in one command:
chloroscan run --Inputs-assembly input_contigs.fasta --Inputs-alignment PATH/to/bams \
--Inputs-batch-name "my_batch" --outputdir Path/to/output --use-conda --cores=12 \
--cat-database PATH/to/CAT_db/db --cat-taxonomy path/to/CAT_db/tax --conda-prefix PATH/to/conda_envs
Sometimes you may also wish to get a tabular represented depth profile. It looks like:
S0C861 3.52491757
S0C1664 2.73124830
S0C2713 12.64139886
S0C3242 2.51473363
S0C8106 23.82718202
S0C8631 2.69335600
S0C9609 2.49439900
...
The first column is the contig id, and the second column is the average depth of contig. We also accept this format with the command changed into:
DEPTH_PROFILE=path/to/depth_profile.tsv
chloroscan run --Inputs-assembly input_contigs.fasta --Inputs-depth-profile $DEPTH_PROFILE \
--Inputs-batch-name "my_batch" --outputdir Path/to/output --use-conda --cores=12 \
--cat-database PATH/to/CAT_db/db --cat-taxonomy path/to/CAT_db/tax
For more details about the inputs, please check the inputs_and_outputs section.
If you want to run with our test data, you can use the commands shown in README to download the test data:
figshare download -o simulated_metagenomes.tar.gz 28748540
3. Explanations to arguments in commands
The whole command space of ChloroScan is shown below:
Usage: chloroscan run [OPTIONS]
Run the workflow.
All unrecognized arguments are passed onto Snakemake.
╭─ Options ────────────────────────────────────────────────────────────────────────────────╮
│ --config FILE Path to snakemake config file. Overrides existing │
│ workflow configuration. │
│ [default: None] │
│ --resource -r PATH Additional resources to copy from workflow directory │
│ at run time. │
│ --profile -p TEXT Name of profile to use for configuring Snakemake. │
│ [default: None] │
│ --dry -n Do not execute anything, and display what would be │
│ done. │
│ --lock -l Lock the working directory. │
│ --dag -d PATH Save directed acyclic graph to file. Must end in │
│ .pdf, .png or .svg │
│ [default: None] │
│ --cores -c INTEGER Set the number of cores to use. If None will use all │
│ cores. │
│ [default: None] │
│ --no-conda Do not use conda environments. │
│ --keep-resources Keep resources after pipeline completes. │
│ --keep-snakemake Keep .snakemake folder after pipeline completes. │
│ --verbose -v Run workflow in verbose mode. │
│ --help-snakemake -hs Print the snakemake help and exit. │
│ --help -h Show this message and exit. │
╰──────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Workflow Configuration ─────────────────────────────────────────────────────────────────╮
│ --Inputs-assembly -a PATH Path to fasta format │
│ assembly of contigs from │
│ all sorts of organisms. │
│ [default: None] │
│ --Inputs-depth-txt -d PATH Path to a tab-separated │
│ text storing abundance │
│ of each contig in the │
│ sample. │
│ [default: None] │
│ --Inputs-alignment -l PATH Path to the folder │
│ containing alignment │
│ files of the contigs. │
│ [default: None] │
│ --Inputs-batch-name -b TEXT Name of the batch. │
│ [default: None] │
│ --outputdir -o PATH Path to the output │
│ directory of the │
│ workflow. │
│ [default: None] │
│ --tmpdir -t PATH Path to the temporary │
│ directory of the │
│ workflow. │
│ [default: tmp] │
│ --binning-universal-le… INTEGER Length cutoff for │
│ universal binning. │
│ [default: 1500] │
│ --binning-snakemake-env TEXT Customized snakemake │
│ environment for binny to │
│ run. │
│ [default: None] │
│ --binning-mantis-env TEXT Customized Mantis │
│ virtual environment to │
│ have mantis_pfa │
│ installed, annotating │
│ genes. │
│ [default: None] │
│ --binning-outputdir -o PATH Path to the output │
│ directory of the │
│ binning. │
│ [default: binny_output] │
│ --binning-clustering-e… TEXT Range of epsilon values │
│ for HDBSCAN clustering. │
│ [default: 0.250,0.000] │
│ --binning-clustering-h… TEXT Range of min_samples │
│ values for HDBSCAN │
│ clustering, larger value │
│ means larger MAGs. │
│ [default: 1,5,10] │
│ --binning-bin-quality-… FLOAT Starting completeness │
│ for bin quality. │
│ [default: 92.5] │
│ --binning-bin-quality-… FLOAT Minimum completeness for │
│ bin quality. │
│ [default: 50] │
│ --binning-bin-quality-… FLOAT Purity for bin quality. │
│ [default: 95] │
│ --corgi-min-length INTEGER Minimum length of │
│ contigs to be processed │
│ by CORGI. │
│ [default: 500] │
│ --corgi-save-filter --no-corgi-save-filter Save the filtered │
│ contigs by CORGI (Note: │
│ may take long time). │
│ [default: │
│ no-corgi-save-filter] │
│ --corgi-batch-size INTEGER Batch size for CORGI to │
│ process contigs. │
│ [default: 1] │
│ --corgi-pthreshold FLOAT P-value threshold for │
│ CORGI to determine if │
│ the contigs category is │
│ authentically plastidial │
│ or something else. │
│ [default: 0.9] │
│ --cat-database -d PATH Path to the database of │
│ chloroplast genomes. │
│ [default: │
│ /home/yuhtong/scratch/a… │
│ --cat-taxonomy -t PATH Path to the taxonomy of │
│ the database. │
│ [default: │
│ /home/yuhtong/scratch/a… │
│ --krona-env TEXT Path to the Krona │
│ environment. │
│ [default: kronatools] │
╰──────────────────────────────────────────────────────────────────────────────────────────╯
- Below lists those arguments for ChloroScan.
--Inputs-assembly: Path to fasta format assembly of contigs from all sorts of organisms.--Inputs-depth-txt: Path to a tab-separated text storing abundance of each contig in the sample. The first column is the contig id, and the second column is the average depth of contig.--Inputs-alignment: Path to the folder containing bam alignment files of the contigs. The alignment files should be in bam format, and named as “sample_name.bam”. The sample name will be extracted from the bam file name by removing the “.bam” suffix. The sample name will be used in the downstream analysis and output files.--Inputs-batch-name: Name of the batch. This will be used in the downstream analysis and output files. Used to identify data from different running batches.--outputdir: Path to the output directory of the workflow. The final results will be stored in this directory. The intermediate results will be stored in a subdirectory called “working” under the output directory. The default value is “output”.--tmpdir: Path to the temporary directory of the workflow. The default value is “tmp”.--binning-universal-length-cutoff: Contig length cutoff for universal binning. Contigs shorter than this length will be filtered out before binning. The default value is 1500bp.--binning-snakemake-env: Customized snakemake environment for binny to run. If not specified, the default conda environment will be used.--binning-mantis-env: Customized Mantis virtual environment to have mantis_pfa installed, annotating genes. If not specified, the default conda environment will be used.--binning-outputdir: Path to the output directory of the binning. The default value is “binny_output”.--binning-clustering-epsilon-range: Range of epsilon values for HDBSCAN clustering. The default value is “0.250,0.000”.--binning-clustering-hdbscan-min-sample-range: Range of min_samples values for HDBSCAN clustering, larger value means larger MAGs. The default value is “1,5,10”.--binning-bin-quality-purity: Minimum purity for bin quality. The default value is 95.--binning-bin-quality-starting-completeness: Starting completeness for bin quality. Binny uses a sliding completeness to filter bins. The default value is 92.5.--binning-bin-quality-min-completeness: Minimum completeness for bin quality. The default value is 50.--corgi-min-length: Minimum length of contigs to be processed by CORGI. The default value is 500bp.--corgi-save-filter: Save the filtered contigs by CORGI (Note: may take long time). The default value is no-corgi-save-filter.--corgi-batch-size: Batch size for CORGI to infer contigs’ taxonomic labels. The default value is 1.--corgi-pthreshold: P-value threshold for CORGI to determine if the contigs category is authentically plastidial or something else. The default value is 0.9.--cat-database: Path to the database of CAT saving diamond-processed protein sequences. The default value is “PATH/TO/CAT_db/db”.--cat-taxonomy: Path to the taxonomy labels of the CAT database. The default value is “PATH/TO/CAT_db/tax”.--krona-env: Path to the Krona environment. The default value is “kronatools”. Now commonly we don’t need to set up this.