Getting Started
Installation
The recommended way to install Opfi is with Bioconda, which requires the conda package manager. This installs Opfi together with all of its dependencies (described under Dependencies below).
Currently, Bioconda supports only 64-bit Linux and Mac OS. Windows users can still install Opfi with pip (see below); however, the complete installation procedure has not been fully tested on a Windows system.
Install with conda (Linux and Mac OS only)
First, set up conda and Bioconda following the quickstart guide. Once this is done, run:
conda install -c bioconda opfi
And that’s it! Note that this installs Opfi in whichever conda environment is currently active. To create a fresh environment with Opfi installed, run:
conda create --name opfi-env -c bioconda opfi
conda activate opfi-env
Install with pip
This method does not automatically install non-Python dependencies, so they must be installed separately, following their individual installation instructions. A complete list of required software is provided under Dependencies below. Once the dependencies are in place, install Opfi with pip by running:
pip install opfi
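Because pip does not install the external tools, it can be useful to confirm that they are reachable before running Opfi. The following is a minimal sketch using only the Python standard library. The executable names it checks (`blastp`, `blastn`, `psiblast`, `diamond`, `mmseqs`, `pilercr`, `grf-main`) are assumptions based on how each tool is usually distributed and may differ on your system, and `find_missing_tools` is a hypothetical helper, not part of Opfi.

```python
import shutil

# Assumed executable names for each dependency; adjust if your installation differs.
REQUIRED_TOOLS = {
    "BLAST+": ["blastp", "blastn", "psiblast"],
    "Diamond": ["diamond"],
    "MMseqs2": ["mmseqs"],
    "PILER-CR": ["pilercr"],
    "Generic Repeat Finder": ["grf-main"],
}


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return {package: [executables not found on PATH]} for incomplete packages."""
    missing = {}
    for package, executables in tools.items():
        absent = [exe for exe in executables if which(exe) is None]
        if absent:
            missing[package] = absent
    return missing


if __name__ == "__main__":
    missing = find_missing_tools()
    if not missing:
        print("All listed executables were found on PATH.")
    for package, executables in missing.items():
        print(f"{package}: missing {', '.join(executables)}")
```

An empty result only means the executables were found on `PATH`; it does not verify their versions or that they run correctly.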
Install from source
Finally, the latest development build may be installed directly from GitHub. First, the non-Python dependencies need to be installed in the working environment. An easy way to do this is to install Opfi with conda using the Install with conda (Linux and Mac OS only) method; the development version of the Opfi package is then installed over it in the next step. Alternatively, the dependencies can be installed individually.
Once the dependencies have been installed in the working environment, run the following commands to download and install the development build:
git clone https://github.com/wilkelab/Opfi.git
cd Opfi
pip install . # or pip install -e . for an editable version
pip install -r requirements.txt # if conda was used, this can be skipped
Testing the build
Regardless of installation method, users can download and run Opfi’s suite of unit tests to confirm that the build is working as expected. First, download the tests from GitHub:
git clone https://github.com/wilkelab/Opfi
cd Opfi
And then run the test suite using pytest:
pytest --runslow --runmmseqs --rundiamond
This may take a minute or so to complete.
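As a quicker sanity check than the full test suite, one can confirm that the Python package is visible to the active interpreter. The sketch below uses only the standard library; it checks for the `gene_finder` package (the one that provides `gene_finder.pipeline.Pipeline`), and `is_importable` is a hypothetical helper name, not part of Opfi.

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module can be located in the current environment."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # gene_finder is the package that provides gene_finder.pipeline.Pipeline.
    name = "gene_finder"
    status = "found" if is_importable(name) else "NOT found"
    print(f"{name}: {status}")
```

This only confirms that the package can be located; it does not exercise the external search tools, so it complements the test suite and does not replace it.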
Dependencies
Opfi uses the following bioinformatics software packages to find and annotate genomic features:
| Application | Description |
|---|---|
| BLAST+ | Protein and nucleic acid homology search tool |
| Diamond | Alternative to BLAST+ for fast protein homology searches |
| MMseqs2 | Alternative to BLAST+ for fast protein homology searches |
| PILER-CR | CRISPR repeat detection |
| Generic Repeat Finder (GRF) | Transposon-associated repeat detection |
The first three (BLAST+, Diamond, and MMseqs2) are popular homology search applications, that is, programs that look for local similarities between input sequences (either protein or nucleic acid) and a target. These are used by Opfi in gene_finder.pipeline.Pipeline for annotation of genes or non-coding regions of interest in the input genome/contig. The user specifies which homology search tool to use during pipeline setup (see gene_finder.pipeline.Pipeline for details). Note that the BLAST+ distribution contains multiple programs for homology searching, three of which (blastp, blastn, and PSI-BLAST) are currently supported by Opfi.
The following table summarizes the main differences among the homology search programs. It may help users decide which application best meets their needs. Note that performance tests are inherently hardware- and context-dependent, so this should be taken as a loose guide rather than a definitive comparison.
| Application | Relative sensitivity | Relative speed | Requires a protein or nucleic acid sequence database? |
|---|---|---|---|
| Diamond | + | ++++ | protein |
| MMseqs2 | ++ | +++ | protein |
| blastp | +++ | ++ | protein |
| PSI-BLAST | ++++ | + | protein |
| blastn | NA | NA | nucleic acid |
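One way to read the table is as a trade-off: pick the fastest program that is still sensitive enough for the task. The sketch below encodes the relative ratings from the table (counting the `+` symbols) and applies that rule. It is illustrative only: `PROTEIN_SEARCH_TOOLS` and `fastest_tool` are hypothetical names, not part of Opfi, and the ratings are a loose guide, not measured benchmarks.

```python
# Relative ratings taken from the table above (number of "+" symbols).
# blastn is omitted because the table lists its ratings as NA.
PROTEIN_SEARCH_TOOLS = {
    "Diamond": {"sensitivity": 1, "speed": 4},
    "MMseqs2": {"sensitivity": 2, "speed": 3},
    "blastp": {"sensitivity": 3, "speed": 2},
    "PSI-BLAST": {"sensitivity": 4, "speed": 1},
}


def fastest_tool(min_sensitivity):
    """Return the fastest tool whose sensitivity rating is at least `min_sensitivity`."""
    candidates = [
        name for name, rating in PROTEIN_SEARCH_TOOLS.items()
        if rating["sensitivity"] >= min_sensitivity
    ]
    if not candidates:
        raise ValueError(f"no tool has sensitivity >= {min_sensitivity}")
    return max(candidates, key=lambda name: PROTEIN_SEARCH_TOOLS[name]["speed"])


if __name__ == "__main__":
    print(fastest_tool(min_sensitivity=3))  # blastp
```

For nucleic acid databases the choice does not arise, since blastn is the only program in the table that searches them.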
The last two software dependencies, PILER-CR and Generic Repeat Finder (GRF), handle annotation of repetitive sequences in DNA. PILER-CR identifies CRISPR arrays, regions of alternating ~30 bp direct repeats and variable sequences that play a role in prokaryotic immunity. GRF identifies repeats associated with transposable elements, such as terminal inverted repeats (TIRs) and long terminal repeats (LTRs).