Opfi¶
Welcome to the Opfi documentation site! Opfi is a modular, rule-based framework for creating gene cluster identification pipelines, particularly for large genomics or metagenomics datasets.
Opfi is implemented entirely in Python, and can be downloaded with conda or the from the Python Package Index. It consists of two major modules: Gene Finder, for discovery of novel gene clusters, and Operon Analyzer, for rule-based filtering, deduplication, visualization, and re-annotation of systems identified by Gene Finder.
Contents¶
Getting Started¶
Installation¶
The recommended way to install Opfi is with Bioconda, which requires the conda package manager. This will install Opfi and all of its dependencies (which you can read more about below, see Dependencies).
Currently, Bioconda supports only 64-bit Linux and Mac OS. Windows users can still install Opfi with pip (see below); however, the complete installation procedure has not been fully tested on a Windows system.
Install with conda (Linux and Mac OS only)¶
First, set up conda and Bioconda following the quickstart guide. Once this is done, run:
conda install -c bioconda opfi
And that’s it! Note that this will install Opfi in the conda environment that is currently active. To create a fresh environment with Opfi installed, do:
conda create --name opfi-env -c bioconda opfi
conda activate opfi-env
Install with pip¶
This method does not automatically install non-Python dependencies, so they will need to be installed separately, following their individual installation instructions. A complete list of required software is provided below, see Dependencies. Once this step is complete, install Opfi with pip by running:
pip install opfi
Install from source¶
Finally, the latest development build may be installed directly from Github. First, non-Python Dependencies will need to be installed in the working environment. An easy way to do this is to first install Opfi with conda using the Install with conda (Linux and Mac OS only) method (we’ll re-install the development version of the Opfi package in the next step). Alternatively, dependencies can be installed individually.
Once dependencies have been installed in the working environment, run the following code to download and install the development build:
git clone https://github.com/wilkelab/Opfi.git
cd Opfi
pip install . # or pip install -e . for an editable version
pip install -r requirements # if conda was used, this can be skipped
Testing the build¶
Regardless of installation method, users can download and run Opfi’s suite of unit tests to confirm that the build is working as expected. First download the tests from Github:
git clone https://github.com/wilkelab/Opfi
cd Opfi
And then run the test suite using pytest:
pytest --runslow --runmmseqs --rundiamond
This may take a minute or so to complete.
Dependencies¶
Opfi uses the following bioinformatics software packages to find and annotate genomic features:
Application |
Description |
---|---|
Protein and nucleic acid homology search tool |
|
Alternative to BLAST+ for fast protein homology searches |
|
Alternative to BLAST+ for fast protein homology searches |
|
CRISPR repeat detection |
|
Transposon-associated repeat detection |
The first three (BLAST+, Diamond, and MMseqs2) are popular homology search applications, that is, programs that look for local similarities between input sequences (either protein or nucleic acid) and a target. These are used by Opfi in gene_finder.pipeline.Pipeline
for annotation of genes or non-coding regions of interest in the input genome/contig. The user specifies which homology search tool to use during pipeline setup (see gene_finder.pipeline.Pipeline
for details). Note that the BLAST+ distribution contains multiple programs for homology searching, three of which (blastp, blastn, and PSI-BLAST) are currently supported by Opfi.
The following table summarizes the main difference between each homology search program. It may help users decide which application will best meet their needs. Note that performance tests are inherently hardware and context dependent, so this should be taken as a loose guide, rather than a definitive comparison.
Application |
Relative sensitivity |
Relative speed |
Requires a protein or nucleic acid sequence database? |
---|---|---|---|
Diamond |
+ |
++++ |
protein |
MMseqs2 |
++ |
+++ |
protein |
blastp |
+++ |
++ |
protein |
PSI-BLAST |
++++ |
+ |
protein |
blastn |
NA |
NA |
nucleic acid |
The last two software dependencies, PILER-CR and Generic Repeat Finder (GRF), deal with annotation of repetive sequences in DNA. PILER-CR identifies CRISPR arrays, regions of alternatating ~30 bp direct repeat and variable sequences that play a role in prokaryotic immunity. GRF identifies repeats associated with transposable elements, such as terminal inverted repeats (TIRs) and long terminal repeats (LTRs).
Example Usage¶
Example 1: Finding CRISPR-Cas systems in a cyanobacteria genome¶
In this example, we will annotate and visualize CRISPR-Cas systems in the cyanobacteria species Rippkaea orientalis. CRISPR-Cas is a widespread bacterial defense system, found in at least 50% of all known prokaryotic species. This system is significant in that it can be leveraged as a precision gene editing tool, an advancement that was awarded the 2020 Nobel Prize in Chemistry. The genome of R. orientalis harbors two complete CRISPR-Cas loci (one chromosomal, and one extrachromosomal/plasmid).
You can download the complete assembled genome here; it is also available at https://github.com/wilkelab/Opfi under tutorials
, along with the other data files necessary to run these examples, and an interactive jupyter notebook version of this tutorial.
This tutorial assumes the user has already installed Opfi and all dependencies (if installing with conda, this is done automatically). Some familiarity with BLAST and the basic homology search algorithm may also be helpful, but is not required.
1. Use the makeblastdb utility to convert a Cas protein database to BLAST format¶
We start by converting a Cas sequence database to a format that BLAST can recognize, using the command line utility makeblastdb, which is part of the core NCBI BLAST+ distribution. A set of ~20,000 non-redundant Cas sequences, downloaded from Uniprot is available as a tar archive tutorials/cas_database.tar.gz
. We’ll make a new directory, “blastdb”, and extract sequences there:
mkdir blastdb
cd blastdb && tar -xzf cas_database.tar.gz && cd ..
Next, create two BLAST databases for the sequence data: one containing Cas1 sequences only, and another that contains the remaining Cas sequences.
cd blastdb && cat cas1.fasta | makeblastdb -dbtype prot -title cas1 -hash_index -out cas1_db && cd ..
cd blastdb && cat cas[2-9].fasta cas1[0-2].fasta casphi.fasta | makeblastdb -dbtype prot -title cas_all -hash_index -out cas_all_but_1_db && cd ..
-dbtype prot
simply tells makeblastdb to expect amino acid sequences. We use -title
and -out
to name the database (required by BLAST) and to prefix the database files, respectively. -hash_index
directs makeblastdb to generate a hash index of protein sequences, which can speed up computation time.
2. Use Gene Finder to search for CRISPR-Cas loci¶
CRISPR-Cas systems are extremely diverse. The most recent classification effort identifies 6 major types, and over 40 subtypes, of compositionally destinct systems. Although there is sufficent sequence similarity between subtypes to infer the existence of a common ancestor, the only protein family present in the majority of CRISPR-cas subtypes is the conserved endonuclease Cas1. For our search, we will define candidate CRISPR-cas loci as having, minimally, a cas1 gene.
First, create another directory for output:
mkdir example_1_output
The following bit of code uses Opfi’s gene_finder.pipeline
module to search for CRISPR-Cas systems:
from gene_finder.pipeline import Pipeline
import os
genomic_data = "GCF_000024045.1_ASM2404v1_genomic.fna.gz"
output_directory = "example_1_output"
p = Pipeline()
p.add_seed_step(db="blastdb/cas1_db", name="cas1", e_val=0.001, blast_type="PROT", num_threads=1)
p.add_filter_step(db="blastdb/cas_all_but_1_db", name="cas_all", e_val=0.001, blast_type="PROT", num_threads=1)
p.add_crispr_step()
# use the input filename as the job id
# results will be written to the file <job id>_results.csv
job_id = os.path.basename(genomic_data)
results = p.run(job_id=job_id, data=genomic_data, output_directory=output_directory, min_prot_len=90, span=10000, gzip=True)
First, we initialize a gene_finder.pipeline.Pipeline
object, which keeps track of all search parameters, as well as a running list of systems that meet search criteria. Next, we add three search steps to the pipeline:
gene_finder.pipeline.Pipeline.add_seed_step()
: BLAST is used to search the input genome against a database of Cas1 sequences. Regions around putative Cas1 hits become the intial candidates, and the rest of the genome is ignored.gene_finder.pipeline.Pipeline.add_filter_step()
: Candidate regions are searched for any additional Cas genes. Candidates without at least one additional putative Cas gene are also discarded.gene_finder.pipeline.Pipeline.add_crispr_step()
: Remaining candidates are annotated for CRISPR repeat sequences using PILER-CR.
Finally, we run the pipeline, executing steps in the order they we added. min_prot_len
sets the minimum length (in amino acid residues) of hits to keep (really short hits are unlikely real protein encoding genes). span
is the region directly up- and downstream of initial hits. So, each candidate system will be about 20 kbp in length. Results are written to a single CSV file. Final candidate loci contain at least one putative Cas1 gene and one additional Cas gene. As we will see, this relatively permissive criteria captures some non-CRISPR-Cas loci. Opfi has additional modules for reducing unlikely systems after the gene finding stage.
3. Visualize annotated CRISPR-Cas gene clusters with Operon Analyzer¶
It is sometimes useful to visualize candidate systems, especially during the exploratory phase of a genomics survey. Opfi provides a few functions for visualizing candidate systems in operon_analyzer.visualize
. We’ll use these to visualize the CRISPR-Cas gene clusters in R. orientalis:
import csv
import sys
from operon_analyzer import load, visualize
feature_colors = { "cas1": "lightblue",
"cas2": "seagreen",
"cas3": "gold",
"cas4": "springgreen",
"cas5": "darkred",
"cas6": "thistle",
"cas7": "coral",
"cas8": "red",
"cas9": "palegreen",
"cas10": "yellow",
"cas11": "tan",
"cas12": "orange",
"cas13": "saddlebrown",
"casphi": "olive",
"CRISPR array": "purple"
}
# read in the output from Gene Finder and create a gene diagram for each cluster (operon)
with open("example_1_output/GCF_000024045.1_ASM2404v1_genomic.fna.gz_results.csv", "r") as operon_data:
operons = load.load_operons(operon_data)
visualize.plot_operons(operons=operons, output_directory="example_1_output", feature_colors=feature_colors, nucl_per_line=25000)
Running this script produces the following three gene diagrams, one for each system in the input CSV:

A CRISPR-Cas system in the chromosome of R. orientalis.¶

A second CRISPR-Cas system in R. orientalis plasmid 1.¶

An R. orientalis locus with a putative CRISPR-Cas gene.¶
We can see that both CRISPR-Cas systems were identified (Fig. 1 and Fig. 2). We also see some systems that don’t resemble functional CRISPR-Cas operons (Fig. 3). Because we used a relatively permissive e-value threshhold of 0.001 when running BLAST, Opfi retained regions with very low sequence similarity to true CRISPR-Cas genes. In fact, these regions are likely not CRISPR-Cas loci at all. Using a lower e-value would likely eliminate these “false positive” systems, but operon_analyzer.rules
exposes functions for filtering out unlikely candidates after the intial BLAST search.
In general, we have found that using permissive BLAST parameters intially, and then filtering or eliminating candidates during the downstream analysis, is an effective way to search for gene clusters in large amounts of genomic/metagenomic data. In this toy example, we could re-run BLAST many times without significant cost. But on a more realistic dataset, needing to re-do the computationally expensive homology search could detrail a project. Since the optimal search parameters may not be known a priori, it can be better to do a permissive homology search initially, and then narrow down results later.
Finally, clean up the temporary directories, if desired:
rm -r example_1_output blastdb
Example 2: Filter and classify CRISPR-Cas systems based on genomic composition¶
As discussed in the previous example, known CRISPR-Cas systems fall into 6 broad categories, based on the presence of particular “signature” genes, as well as overall composition and genomic architecture. In this example, we will use Opfi to search for and classify CRISPR-Cas systems in ~300 strains of fusobacteria.
This dataset was chosen because it is more representative (in magnitude) of what would be encountered in a real genomics study. Additionally, the fusobacteria phylum contains a variety of CRISPR-Cas subtypes. Given that the homology search portion of the analysis takes several hours (using a single core) to complete, we have pre-run Gene Finder using the same setup as the previous example.
1. Make another temporary directory for output:¶
mkdir example_2_output
2. Filter Gene Finder output and extract high-confidence CRISPR-Cas systems¶
The following code reads in unfiltered output from gene_finder.pipeline.Pipeline
and applies a set of conditions (“rules”) to accomplish two things:
1. Select (and bin) systems according to type, and,
2. Eliminate candidates that likely do not represent true CRISPR-Cas systems
To do this, we’ll leverage the operon_analyzer.rules
and operon_analyzer.analyze
modules.
from operon_analyzer import analyze, rules
fs = rules.FilterSet().pick_overlapping_features_by_bit_score(0.9)
cas_types = ["I", "II", "III", "V"]
rulesets = []
# type I rules
rulesets.append(rules.RuleSet().contains_group(feature_names = ["cas5", "cas7"], max_gap_distance_bp = 1000, require_same_orientation = True) \
.require("cas3"))
# type II rules
rulesets.append(rules.RuleSet().contains_at_least_n_features(feature_names = ["cas1", "cas2", "cas9"], feature_count = 3) \
.minimum_size("cas9", 3000))
# type III rules
rulesets.append(rules.RuleSet().contains_group(feature_names = ["cas5", "cas7"], max_gap_distance_bp = 1000, require_same_orientation = True) \
.require("cas10"))
# type V rules
rulesets.append(rules.RuleSet().contains_at_least_n_features(feature_names = ["cas1", "cas2", "cas12"], feature_count = 3))
for rs, cas_type in zip(rulesets, cas_types):
with open("refseq_fusobacteria.csv", "r") as input_csv:
with open(f"example_2_output/refseq_fuso_filtered_type{cas_type}.csv", "w") as output_csv:
analyze.evaluate_rules_and_reserialize(input_csv, rs, fs, output_csv)
The rule sets are informed by an established CRISPR-Cas classification system, which you can learn more about in this paper . The most recent system recognizes 6 major CRISPR-Cas types, but since fusobacteria doesn’t contain type IV or VI systems that can be identified with our protein dataset, we didn’t define the corresponding rule sets.
3. Verify results with additional visualizations¶
Altogther, this analysis will identify several hundred systems. We won’t look at each system individually (but you are free to do so!). For the sake of confirming that the code ran as expected, we’ll create gene diagrams for just the type V systems, since there are only two:
import csv
import sys
from operon_analyzer import load, visualize
feature_colors = { "cas1": "lightblue",
"cas2": "seagreen",
"cas3": "gold",
"cas4": "springgreen",
"cas5": "darkred",
"cas6": "thistle",
"cas7": "coral",
"cas8": "red",
"cas9": "palegreen",
"cas10": "yellow",
"cas11": "tan",
"cas12": "orange",
"cas13": "saddlebrown",
"casphi": "olive",
"CRISPR array": "purple"
}
# read in the output from Gene Finder and create a gene diagram for each cluster (operon)
with open("example_2_output/refseq_fuso_filtered_typeV.csv", "r") as operon_data:
operons = load.load_operons(operon_data)
visualize.plot_operons(operons=operons, output_directory="example_2_output", feature_colors=feature_colors, nucl_per_line=25000)
The plotted systems should look like this:

A type V CRISPR-Cas system.¶

A second type V CRISPR-Cas system.¶
Finally, clean up the temporary output directory, if desired:
rm -r example_2_output
Inputs and Outputs¶
Building sequence databases¶
To search for gene clusters with Opfi, users must compile representative protein (or nucleic acid) sequences for any genes expected in target clusters (or for any non-essential accessory genes of interest). These may be from a pre-existing, private collection of sequences (perhaps from a previous bioinformatics analysis). Alternatively, users may download sequences from a publically available database such as Uniprot (maintained by the European Bioinformatics Institute ) or one of the databases provided by the National Center for Biotechnology Information.
Once target sequences have been compiled, they must be converted to an application-specific database format. Opfi currently supports BLAST+, mmseqs2, and diamond for homology searching:
Instructions for creating sequence databases for BLAST using makeblastdb
Instructions for creating sequence databases for mmseqs2 using mmseqs createdb
The FASTA file format¶
Both genomic input data and reference sequence data should be in FASTA format. This is a simple flat text representation of biological sequence data, where individual sequences are delineated by the >
greater than character. For example:
>UniRef50_Q02ML7 CRISPR-associated endonuclease Cas1 n=1700 RepID=CAS1_PSEAB
MDDISPSELKTILHSKRANLYYLQHCRVLVNGGRVEYVTDEGRHSHYWNIPIANTTSLLL
GTGTSITQAAMRELARAGVLVGFCGGGGTPLFSANEVDVEVSWLTPQSEYRPTEYLQRWV
GFWFDEEKRLVAARHFQRARLERIRHSWLEDRVLRDAGFAVDATALAVAVEDSARALEQA
PNHEHLLTEEARLSKRLFKLAAQATRYGEFVRAKRGSGGDPANRFLDHGNYLAYGLAATA
TWVLGIPHGLAVLHGKTRRGGLVFDVADLIKDSLILPQAFLSAMRGDEEQDFRQACLDNL
SRAQALDFMIDTLKDVAQRSTVSA
>UniRef50_Q2RY21 CRISPR-associated endonuclease Cas1 1 n=1034 RepID=CAS1A_RHORT
MADPAFVPLRPIAIKDRSSIVFLQRGQLDVVDGAFVLIDQEGVRVQIPVGGLACLMLEPG
TRITHAAIVLCARVGCLVIWVGERGTRLYAAGQPGGARADRLLFQARNALDETARLNVVR
EMYRRRFDDDPPARRSVDQLRGMEGVRVREIYRLLAKKYAVDWNARRYDHNDWDGADIPN
RCLSAATACLYGLCEAAILAAGYAPAIGFLHRGKPQSFVYDVADLYKVETVVPTAFSIAA
KIAAGKGDDSPPERQVRIACRDQFRKSGLLEKIIPDIEEILRAGGLEPPLDAPEAVDPVI
PPEEPSGDDGHRG
The sequence definition (defline) comes directly after the >
character, and should be on a separate line from the sequence (which can be on one or more subsequent lines). There is no specific defline format, however, Opfi requires that, for both genomic input and sequence data, each definition line contain a unique sequence identifer. This should be a single word/token immediately following the >
character (i.e. spaces between the >
character and the identifier are not allowed). Any additional text on the defline is parsed as a single string, and appears in the output CSV (see Opfi output format).
Tip
Biological sequences downloaded from most public databases will have an accession number/identifier by default.
Annotating sequence databases¶
To take full advantage of the rule-based filtering methods in operon_analyzer.rules
, users are encouraged to annotate reference sequences with a name/label that is easily searched. Labels can be as broad or as specific as is necessary to provide meaningful annotation of target gene clusters.
Gene labels are parsed from sequence deflines; specifically, Opfi looks for the second word/token following the >
character. For example, the following FASTA sequence has been annotated with the label “cas1”:
>UniRef50_Q02ML7 cas1 CRISPR-associated endonuclease Cas1 n=1700 RepID=CAS1_PSEAB
MDDISPSELKTILHSKRANLYYLQHCRVLVNGGRVEYVTDEGRHSHYWNIPIANTTSLLL
GTGTSITQAAMRELARAGVLVGFCGGGGTPLFSANEVDVEVSWLTPQSEYRPTEYLQRWV
GFWFDEEKRLVAARHFQRARLERIRHSWLEDRVLRDAGFAVDATALAVAVEDSARALEQA
PNHEHLLTEEARLSKRLFKLAAQATRYGEFVRAKRGSGGDPANRFLDHGNYLAYGLAATA
TWVLGIPHGLAVLHGKTRRGGLVFDVADLIKDSLILPQAFLSAMRGDEEQDFRQACLDNL
SRAQALDFMIDTLKDVAQRSTVSA
After running gene_finder.pipeline.Pipeline
, users could select candidates with hits against this sequence using the following rule set:
from operon_analyzer.rules import RuleSet
rs = RuleSet.require("cas1")
In practice, a genomics search might use a reference database of hundreds (or even thousands) of representative protein sequences, in which case labeling each sequence individually would be tedious. It is recommended to organize sequences into groups of related proteins that can be given a single label. This script uses the Python package Biopython to annotate sequences in a multi-sequence FASTA file:
from Bio import SeqIO
import os, sys
def annotate_reference(prot_ref_file, label):
records = list(SeqIO.parse(ref_fasta, "fasta"))
for record in records:
des = record.description.split()
prot_id = des.pop(0)
des_with_label = "{} {} {}".format(prot_id, label, " ".join(des))
record.description = des_with_label
SeqIO.write(records, ref_fasta, "fasta")
if __name__ == "__main__":
ref_fasta = sys.argv[1]
label = sys.argv[2]
annotate_reference(ref_fasta, label)
It is possible to use the entire sequence description (i.e. all text following the sequence identifier) as the gene label. This is particularly useful when using a pre-built database like nr, which contains representative protein sequences for many different protein families. When using sequence databases that haven’t been annotated, users should set parse_descriptions=False
for each gene_finder.pipeline.Pipeline
add_step()
method call.
Converting sequence files to a sequence database¶
Once reference sequences have been compiled (and, optionally, labeled) they must be converted to a sequence database format that is specific to the homology search program used. Currently, Opfi supports BLAST, mmseqs2, and diamond. Each software package is automatically installed with a companion utility program for generating sequence databases. The following example shows what a typical call to makeblastdb, the BLAST+ database utility program, might look like:
makeblastdb -in "my_sequences.fasta" -out my_sequences/db -dbtype prot -title "my_sequences" -hash_index
The command takes a text/FASTA file my_sequences.fasta
as input, and writes the resulting database files to the directory my_sequences
. Database files are prefixed with “db”. -dbtype prot
specifies that the input is amino acid sequences. We use -title
to name the database (required by BLAST). -hash_index
directs makeblastdb to generate a hash index of protein sequences, which can speed up computation time.
Tip
mmseqs2 and diamond have similar database creation commands, see Building sequence databases.
BLAST advanced options¶
BLAST+ programs have a number of tunable parameters that can, for example, be used to adjust the sensitivity of the search algorithm. We anticipate that application defaults will be sufficient for most users; nevertheless, it is possible to use non-default program options by passing them as keyword arguments to gene_finder.pipeline.Pipeline
add_step()
methods.
For example, when using blastp on the command line, we could adjust the number of CPUs to four by passing the argument -num_threads 4
to the program. When using Opfi, this would look like num_threads=4
.
Flags (boolean arguments that generally do not precede additional data) are also possible. For example, the command line flag -use_sw_tback
tells blastp to compute locally optimal Smith-Waterman alignments. The correct way to specify this behavior via the gene_finder.pipeline.Pipeline
API would be to use the argument use_sw_tback=True
.
Below is a list of options accepted by Opfi. Note that some BLAST+ options are not allowed, mainly those that modify BLAST output.
Program |
Allowed Options |
---|---|
blastp and psiblast |
dbsize word_size gapopen gapextend qcov_hsp_perc xdrop_ungap xdrop_gap xdrop_gap_final searchsp sum_stats seg soft_masking matrix threshold culling_limit window_size num_threads comp_based_stats gilist seqidlist negative_gilistdb_soft_mask db_hard_mask entrez_query max_hspsbest_hit_overhang best_hit_score_edge max_target_seqsimport_search_strategy export_search_strategy num_alignments |
blastp only |
task |
psiblast only |
gap_trigger num_iterations out_pssm out_ascii_pssm pseudocount inclusion_ethresh |
blastp (flags) |
lcase_masking ungapped use_sw_tback remote |
psiblast (flags) |
lcase_masking use_sw_tback save_pssm_after_last_round save_each_pssm remote |
blastn |
filtering_algorithm sum_stats window_masker_db window_size template_type version parse_deflines min_raw_gapped_score string format max_hsps taxids negative_taxids num_alignments strand off_diagonal_range subject_besthit num_sequences no_greedy negative_taxidlist culling_limit xdrop_ungap open_penalty DUST_options sorthits xdrop_gap_final negative_gilist subject use_index bool_value filename seqidlist task_name sort_hits database_name lcase_masking query_loc subject_loc sort_hsps line_length boolean db_hard_mask negative_seqidlist template_length filtering_db filtering_database penalty searchsp ungapped type gapextend db_soft_mask dbsize qcov_hsp_perc sorthsps window_masker_taxid index_name export_search_strategy float_value soft_masking gilist entrez_query show_gis best_hit_score_edge gapopen subject_input_file range html word_size best_hit_overhang perc_identity input_file num_descriptions xdrop_gap dust taxidlist max_target_seqs num_threads task remote int_value extend_penalty reward import_search_strategy num_letters |
You can read more about BLAST+ options in the BLAST+ appendices.
Note
Using advanced options with mmseqs2 and diamond is not supported at this time.
Opfi output format¶
Results from gene_finder.pipeline.Pipeline
searches are written to a single CSV file. Below is an example from the tutorial (see Example Usage):
NC_013161.1 |
503817..525707 |
cas1 |
514110..513817 |
lcl|514110|513817|2|-1 |
-1 |
UniRef50_A0A179D3U4 |
1.24e-07 |
UniRef50_A0A179D3U4 cas1 CRISPR-associated endoribonuclease Cas2 n=2 Tax=Thermosulfurimonas dismutans TaxID=999894 RepID=A0A179D3U4_9BACT |
MNTLFYLIIYDLPATKAGNKRRKRLYEMLCGYGNWTQFSVFECFLTAVQFANLQSKLENLIQPNEDSVRIYILDAGSVRKTLTYGSEKPRQVDTLIL |
42.4 |
98 |
51 |
43.137 |
22 |
29 |
31 |
0 |
0 |
60.78 |
53 |
data/GCF_000024045.1_ASM2404v1_genomic.fna.gz |
NC_013161.1 |
503817..525707 |
cas1 |
515084..514107 |
lcl|515084|514107|3|-1 |
-1 |
UniRef50_A0A1Z3HN48 |
4.00e-177 |
UniRef50_A0A1Z3HN48 cas1 CRISPR-associated endonuclease Cas1 n=83 Tax=Cyanobacteria TaxID=1117 RepID=A0A1Z3HN48_9CYAN |
MSILYLTQPDAVLSKKQEAFHVALKQEDGSWKKQLIPAQTVEQIVLIGYPSITGEALCYALELGIPVHYLSCFGKYLGSALPGYSRNGQLRLAQYHVHDNEEQRLALVKTVVTGKIHNQYHVLYRYQQKDNPLKEHKQLVKSKTTLEQVRGVEGLAAKDYFNGFKLILDSQWNFNGRNRRPPTDPVNALLSFAYGLLRVQVTAAVHIAGLDPYIGYLHETTRGQPAMVLDLMEEFRPLIADSLVLSVISHKEIKPTDFNESLGAYLLSDSGRKTFLQAFERKLNTEFKHPVFGYQCSYRRSIELQARLFSRYLQENIPYKSLSLR |
489 |
1260 |
325 |
69.538 |
226 |
99 |
276 |
0 |
0 |
84.92 |
100 |
data/GCF_000024045.1_ASM2404v1_genomic.fna.gz |
NC_013161.1 |
503817..525707 |
cas1 |
515707..515117 |
lcl|515707|515117|1|-1 |
-1 |
UniRef50_A0A2I8A541 |
1.64e-100 |
UniRef50_A0A2I8A541 cas1 CRISPR-associated exonuclease Cas4 n=83 Tax=Cyanobacteria TaxID=1117 RepID=A0A2I8A541_9NOSO |
MIDNYLPLAYLNAFEYCTRRFYWEYVLGEMANNEHIIIGRHLHRNINQEGIIKEEDTIIHRQQWVWSDRLQIKGIIDAVEEKESSLVPVEYKKGRMSQHLNDHFQLCAAALCLEEKTGKIITYGEIFYHANRRRQRVDFSDRLRCSTEQAIHHAHELVNQKMPSPINNSKKCRDCSLKTMCLPKEVKQLRNSLISD |
285 |
729 |
195 |
66.154 |
129 |
66 |
162 |
0 |
0 |
83.08 |
99 |
data/GCF_000024045.1_ASM2404v1_genomic.fna.gz |
NC_013161.1 |
503817..525707 |
cas2 |
514110..513817 |
lcl|514110|513817|2|-1 |
-1 |
UniRef50_A0A1Z3HN55 |
7.36e-46 |
UniRef50_A0A1Z3HN55 cas2 CRISPR-associated endoribonuclease Cas2 n=68 Tax=Cyanobacteria TaxID=1117 RepID=A0A1Z3HN55_9CYAN |
MNTLFYLIIYDLPATKAGNKRRKRLYEMLCGYGNWTQFSVFECFLTAVQFANLQSKLENLIQPNEDSVRIYILDAGSVRKTLTYGSEKPRQVDTLIL |
142 |
357 |
94 |
67.021 |
63 |
31 |
77 |
0 |
0 |
81.91 |
97 |
data/GCF_000024045.1_ASM2404v1_genomic.fna.gz |
NC_013161.1 |
503817..525707 |
cas4 |
515084..514107 |
lcl|515084|514107|3|-1 |
-1 |
UniRef50_A0A1E5G3J0 |
1.01e-72 |
UniRef50_A0A1E5G3J0 cas4 CRISPR-associated endonuclease Cas1 n=4 Tax=Firmicutes TaxID=1239 RepID=A0A1E5G3J0_9BACL |
MSILYLTQPDAVLSKKQEAFHVALKQEDGSWKKQLIPAQTVEQIVLIGYPSITGEALCYALELGIPVHYLSCFGKYLGSALPGYSRNGQLRLAQYHVHDNEEQRLALVKTVVTGKIHNQYHVLYRYQQKDNPLKEHKQLVKSKTTLEQVRGVEGLAAKDYFNGFKLILDSQWNFNGRNRRPPTDPVNALLSFAYGLLRVQVTAAVHIAGLDPYIGYLHETTRGQPAMVLDLMEEFRPLIADSLVLSVISHKEIKPTDFNESLGAYLLSDSGRKTFLQAFERKLNTEFKHPVFGYQCSYRRSIELQARLFSRYLQENIPYKSLSLR |
233 |
595 |
333 |
39.940 |
133 |
179 |
191 |
6 |
21 |
57.36 |
98 |
data/GCF_000024045.1_ASM2404v1_genomic.fna.gz |
NC_013161.1 |
503817..525707 |
cas4 |
515707..515117 |
lcl|515707|515117|1|-1 |
-1 |
UniRef50_A0A2I8A541 |
1.92e-99 |
UniRef50_A0A2I8A541 cas4 CRISPR-associated exonuclease Cas4 n=83 Tax=Cyanobacteria TaxID=1117 RepID=A0A2I8A541_9NOSO |
MIDNYLPLAYLNAFEYCTRRFYWEYVLGEMANNEHIIIGRHLHRNINQEGIIKEEDTIIHRQQWVWSDRLQIKGIIDAVEEKESSLVPVEYKKGRMSQHLNDHFQLCAAALCLEEKTGKIITYGEIFYHANRRRQRVDFSDRLRCSTEQAIHHAHELVNQKMPSPINNSKKCRDCSLKTMCLPKEVKQLRNSLISD |
285 |
729 |
195 |
66.154 |
129 |
66 |
162 |
0 |
0 |
83.08 |
99 |
data/GCF_000024045.1_ASM2404v1_genomic.fna.gz |
NC_013161.1 |
503817..525707 |
cas6 |
516642..515833 |
lcl|516642|515833|2|-1 |
-1 |
UniRef50_A0A654SHL3 |
2.64e-108 |
UniRef50_A0A654SHL3 cas6 CRISPR_Cas6 domain-containing protein n=30 Tax=Cyanobacteria TaxID=1117 RepID=A0A654SHL3_9CYAN |
MVQDILPQLHKYQLQSLVIELGVAKQGKLPATLSRAIHACVLNWLSLADSQLANQIHDSQISPLCLSGLIGNRRQPYSLLGDYFLLRIGVLQPSLIKPLLKGIEAQETQTLELGKFPFIIRQVYSMPQSHKLSQLTDYYSLALYSPTMTEIQLKFLSPTSFKQIQGVQPFPLPELVFNSLLRKWNHFAPQELKFPEIQWQSFVSAFELKTHALKMEGGAQIGSQGWAKYCFKDTEQARIASILSHFAFYAGVGRKTTMGMGQTQLLVNT |
314 |
804 |
270 |
55.926 |
151 |
118 |
195 |
1 |
1 |
72.22 |
100 |
data/GCF_000024045.1_ASM2404v1_genomic.fna.gz |
NC_013161.1 |
503817..525707 |
cas5 |
517387..516611 |
lcl|517387|516611|1|-1 |
-1 |
UniRef50_A0A2I8AFZ3 |
1.43e-118 |
UniRef50_A0A2I8AFZ3 cas5 Type I-D CRISPR-associated protein Cas5/Csc1 n=62 Tax=Cyanobacteria TaxID=1117 RepID=A0A2I8AFZ3_9NOSO |
MNIYYCQLTLHDNIFFATREMGLLYETEKYLHNWALSYAFFKGTYIPHPYRLQGKSAQKPDYLDSTGEQSLAHLNRLKIYVFPAKPLRWSYQINTFKAAQTTYYGKSQQFGDKGANRNYPINYGRAKELAVGSEYHTFLISSQELNIPHWIRVGKWSAKVEVTSYLIPQKAISQHSGIYLCDHPLNPIDLPFDQELLLYNRIVMPPVSLVSQAQLQGNYCKINKNNWNDCPSNLTDLPQQICLPLGVNYGAGYIASAS |
338 |
866 |
252 |
65.079 |
164 |
71 |
194 |
3 |
17 |
76.98 |
98 |
data/GCF_000024045.1_ASM2404v1_genomic.fna.gz |
NC_013161.1 |
503817..525707 |
cas7 |
518600..517530 |
lcl|518600|517530|3|-1 |
-1 |
UniRef50_B7JVM8 |
0.0 |
UniRef50_B7JVM8 cas7 CRISPR-associated protein Csc2 n=52 Tax=Cyanobacteria TaxID=1117 RepID=B7JVM8_RIPO1 |
MSILETLKPQFQSAFPRLASANYVHFIMLRHSQSFPVFQTDGVLNTVRTQAGLMAKDSLSRLVMFKRKQTTPERLTGRELLRSLNITTADKNDKEKGCEYNGEGSCKKCPDCIIYGFAIGDSGSERSKVYSDSTFSLSAYEQSHRTFTFNAPFEGGTMSEQGVMRSAINELDHILPEITFPNIETLRDSTYEGFIYVLGNILRTKRYGAQESRTGTMKNHLVGIAFCDGEIFSNLRFTQALYDGLEGDVNKPIDEICYQASQIVQTLLSDEPVRKIKTIFGEELNHLINEVSGIYQNDALLTETLNMLYQQTKTYSENHGSLAKSKPPKAEGNKSKGRTKKKGDDEQTSLDLNIEE |
733 |
1891 |
356 |
98.876 |
352 |
4 |
354 |
0 |
0 |
99.44 |
100 |
data/GCF_000024045.1_ASM2404v1_genomic.fna.gz |
NC_013161.1 |
503817..525707 |
cas10 |
521597..518673 |
lcl|521597|518673|3|-1 |
-1 |
UniRef50_B7KB38 |
0.0 |
UniRef50_B7KB38 cas10 CRISPR-associated protein Csc3 n=52 Tax=Cyanobacteria TaxID=1117 RepID=B7KB38_GLOC7 |
MTLLQILLLETISQDTDPILISYLETVLPAMEPEFALIPALGGSQQIHYQNLIAIGNRYAQENAKRFSDKADQNLLVHVLNALLTAWNLVDHLTKPLSDIEKYLLCLGLTLHDYNKYCLGHGEESPKVSNINEIINICQELGKKLNFQAFWSDWEQYLPEIVYLAQNTQFKAGTNAIPANYPLFTLADSRRLDLPLRRLLAFGDIAVHLQDPADIISKTGGDRLREHLRFLGIKKALVYHRLRDTLGILSNGIHNATLRFAKDLNWQPLLFFAQGVIYLAPIDYTSPEKMELQGFIWQEISQLLASSMLKGEIGFKRDGKGLKVAPQTLELFTPVQLIRNLADVINVKVANAKVPATPKRLEKLELTDIERQLLEKGADLRADRIAELIILAQREFLADSPEFIDWTLQFWGLEKQITAEQTQEQSGGVNYGWYRVAANYIANHSTLSLEDVSGKLVDFCQQLADWATSNQLLSSHSSSTFEVFNSYLEQYLEIQGWQSSTPNFSQELSTYIMAKTQSSKQPICSLSSGEFISEDQMDSVVLFKPQQYSNKNPLGGGKIKRGISKIWALEMLLRQALWTVPSGKFEDQQPVFLYIFPAYVYSPQIAAAIRSLVNDMKRINLWDVRKHWLHEDMNLDSLRSLQWRKEEAEVGRFKDKYSRADIPFMGTVYTTTRGKTLTEAWIDPAFLTLALPILLGVKVIATSSSVPLYNSDNDFLDSVILDAPAGFWQLLKLSTSLRIQELSVALKRLLTIYTIHLDNRSNPPDARWQALNSTVREVITDVLNVFSIADEKLREDQREASPQEVQRYWKFAEIFAQGDTIMTEKLKLTKELVRQYRTFYQVKWSESSHTILLPLTKALEEILSTPEHWDDEELILQGAGILNDALDRQEVYKRPLLQDKSIPYEIRKQQELQAIHQFMTTCVKELFGQMCKGDRALLQEYRNRIKSGAESAYKLLAFEEKSNSSQQQKSSEDQ |
1073 |
2775 |
978 |
56.544 |
553 |
399 |
710 |
12 |
26 |
72.60 |
99 |
data/GCF_000024045.1_ASM2404v1_genomic.fna.gz |
NC_013161.1 |
503817..525707 |
cas3 |
523760..521655 |
lcl|523760|521655|3|-1 |
-1 |
UniRef50_A0A168SWH5 |
0.0 |
UniRef50_A0A168SWH5 cas3 Type I-D CRISPR-associated helicase Cas3 n=2 Tax=Phormidium TaxID=1198 RepID=A0A168SWH5_9CYAN |
MKINLKPLYSKLNAGVGNCPLGCQEMCRVQQQAPQFKAPSGCNCPLYQHQAESYPYLTKGDTDIIFITAPTAGGKSLLASLPSLLDPNFRMMGLYPTIELVEDQTEQQNNYHNLFGLNSEERIDKLFGVELTQRIKEFNSNRFQQLWLAIETKEVILTNPDIFHLMTHFRYRDNAYGTDELPLALAKFPDLWVFDEFHIFGAHQETAVLNSMMLIRRTQQQKKRFLFTSATVKTDFVEQLKQTGLKIKEIAGEYKSEAQQGYRQILQAVELSIINLKEEDGFSWLINNAAKIRKILKAEDKGRGLIILNSVVMVRRISQELQSLLPEIVVREISGRIDRKERSQTQQLLQEEEKPVLVVATSAVDVGVDFRIHLLITESSDSATVIQRLGRLGRHSGFSNYQAFLLLSGRTPWVINRLQEKLESKQDVTREELIEAIQYAFDPPKEYQEYRNRWGAIQVQGMFSQMMGSNAKVMQSIKERISEDLKRIYGNTLDNKAWYAMGHNCLGKAIQSELLRFRGGSTLQAAVWDEQRFYTYDLLRLLPYATVDILDRETFLKAATKAGHIEEAFPSQYLQVYLRIEQWLDKRLNLNLFCNRESDELLVGKLFLITRLKLDGHPQSDVISCLSRCNLLTFLVPVDRSRTQSHWEVSYCLHLNPLFGLYRLKDASEQAYACAFNQDALLLEALNWKLTKFYRERSLIF |
671 |
1731 |
720 |
49.028 |
353 |
341 |
479 |
10 |
26 |
66.53 |
100 |
data/GCF_000024045.1_ASM2404v1_genomic.fna.gz |
NC_013161.1 |
503817..525707 |
CRISPR array |
512560..513624 |
Copies: 15, Repeat: 37, Spacer: 36 |
–GTTTCAATCCC———–ATTACTAGGATTCATTAAAAAGAAAC |
data/GCF_000024045.1_ASM2404v1_genomic.fna.gz |
The first two columns contain the input genome/contig sequence ID (sometimes called an accession number) and the coordinates of the candidate gene cluster, respectively. Since an input file can have multiple genomic sequences, these two fields together uniquely specify a candidate gene cluster. Each row represents a single annotated feature in the candidate locus. Features from the same candidate are always grouped together in the CSV.
Descriptions of each output field are provided below. Alignment statistic naming conventions are from the BLAST documentation, see BLAST+ appendices (specifically “outfmt” in table C1). This glossary of common BLAST terms may also be useful in interpreting alignment statistic meaning.
index |
field name |
data type |
description |
---|---|---|---|
0 |
Contig |
string |
ID/accession for the parent contig/genome sequence. |
1 |
Loc_coordinates |
string |
Start and end position of the candidate locus (relative to the parent sequence). |
2 |
Name |
string |
Feature name/label. This is will be identical to “Description” (index 8) if |
3 |
Coordinates |
string |
Start and end position of this feature, relative to the parent sequence. |
4 |
ORFID |
string |
A unique ID given to this feature, primarily for internal use. Only applies to features that are genes. |
5 |
Strand |
signed int |
Specifies if the feature was found in the forward (1) or backward (-1) direction. Only applied to features that are genes. |
6 |
Accession |
string |
ID/accession for the reference sequence that had the best alignment (by e-value) with this feature’s translated sequence. |
7 |
E_val |
float |
The e-value score for the best alignment for this feature. |
8 |
Description |
string |
A description of this putative feature, parsed from the defline of best aligned reference sequence. |
9 |
Sequence |
string |
The (translated) amino acid sequence for this feature. |
10 |
Bitscore |
float |
The bitscore for the best alignment for this feature. |
11 |
Rawscore |
int |
The raw score for the best alignment for this feature. |
12 |
Aln_len |
int |
The length of the best scoring alignment, in base pairs. |
13 |
Pident |
float |
The fraction of identical positions in the best alignment. |
14 |
Nident |
int |
The number of identical positions in the best alignment. |
15 |
Mismatch |
int |
The number of mismatched positions in the best alignment. |
16 |
Positive |
int |
The number of positive-scoring matches in the best alignment. |
17 |
Gapopen |
int |
The number of gap openings. |
18 |
Gaps |
int |
Total number of gaps in the alignment. |
19 |
Ppos |
float |
Percentage of positive scoring matches. |
20 |
Qcovhsp |
int |
Query coverage per HSP. That is, the fraction of the query (this feature’s translated amino acid sequence) that was covered in the best alignment. |
21 |
Contig_filename |
string |
The input data (genomic sequence(s)) file path. |
API Reference¶
Gene Finder¶
gene_finder.pipeline¶
- class gene_finder.pipeline.Pipeline¶
Coordinates protein (or nucleic acid) searches to find gene clusters of interest in genomic/metagenomic data.
- add_seed_step(db, name, e_val, blast_type, sensitivity=None, parse_descriptions=True, blast_path=None, **kwargs)¶
Find genomic regions that contain at least one “seed” sequence.
- Parameters
db (str) – Path to the target (seed) protein database.
name (str) – A unique name/ID for this step in the pipeline.
e_val (float) – Expect value to use. Only keep hits with a an equivalent or better (lower) score.
blast_type (str) – Specifies which search program to use. This can be either “PROT” (blastp), “PSI” (psiblast), “mmseqs” (mmseqs2), or “diamond” (diamond).
sensitivity (str) – Sets the sensitivity param for mmseqs and diamond (does nothing if BLAST is the seach type).
parse_descriptions (bool, optional) – By default, reference protein descriptions (from fasta headers) are parsed for gene name labels; specifically, descriptions are split on whitespace characters and the second item is used for the label. Make this false to simply use the whole protein description for the label (i.e everything after the first whitespace in the header). If using this option with NCBI BLAST, DO NOT use the
-parse_seqids
flag when creating protein databases with makeblastdb.blast_path (string, optional) – Path to the blastp/mmseqs/diamond program, if not using the system default.
**kwargs – These can be any additional BLAST parameters, specified as key-value pairs. Note that certain parameters are not allowed, mainly those that control output formatting. Currently only supported for blastp/psiblast; if blast_type is set to mmseqs or diamond, kwargs will be silently ignored.
Note
This should be the first step added to a
gene_finder.pipeline.Pipeline
object. Additional gene finding steps can be added in any order.
- add_seed_with_coordinates_step(db, name, e_val, blast_type, sensitivity=None, parse_descriptions=True, start=None, end=None, contig_id=None, blast_path=None, **kwargs)¶
Define a genomic region of interest with coordinates instead of a seed sequence.
An alternative to
gene_finder.pipeline.Pipeline.add_seed_step()
. Most useful for re-annotating putative systems of interest, where the region coordinates are already known.- Parameters
db (str) – Path to the target database to search against.
name (str) – A unique name/ID for this step in the pipeline.
e_val (float) – Expect value to use. Only keep hits with an equivalent or better (lower) score.
blast_type (str) – Specifies which search program to use. This can be either “PROT” (blastp), “PSI” (psiblast), “mmseqs” (mmseqs2), or “diamond” (diamond).
sensitivity (str) – Sets the sensitivity param for mmseqs and diamond (does nothing if BLAST is the seach type).
parse_descriptions (bool, optional) – By default, reference protein descriptions (from fasta headers) are parsed for gene name labels; specifically, descriptions are split on whitespace characters and the second item is used for the label. Make this false to simply use the whole protein description for the label (i.e everything after the first whitespace in the header). If using this option with NCBI BLAST, DO NOT use the
-parse_seqids
flag when creating protein databases with makeblastdb.start (int) – Defines the beginning of the region to search, in base pairs (bp). If no start position is given the first (zero indexed) position in the genome/contig is used.
end (int) – Defines the end of the region to search, in base pairs (bp). If no end position is given the last position in the contig is used.
contig_id (string, optional) – An identifier for the contig to search. If no ID is given, the pipeline will search every contig in the input file using the coordinates specified. Note that the contig ID is defined as the substring between the “>” character and the first ” ” character in the contig header.
blast_path (string, optional) – Path to the blastp/mmseqs/diamond program, if not using the system default.
**kwargs – These can be any additional BLAST parameters, specified as key-value pairs. Note that certain parameters are not allowed, mainly those that control output formatting. Currently only supported for blastp/psiblast; if blast_type is set to mmseqs or diamond, kwargs will be silently ignored.
- add_filter_step(db, name, e_val, blast_type, min_prot_count=1, sensitivity=None, parse_descriptions=True, blast_path=None, **kwargs)¶
Add a step to search candidate regions for target sequences, and filter out candidates that do not have at least
min_prot_count
matching sequences.- Parameters
db (str) – Path to the target protein sequence database.
name (str) – A unique name/ID for this step in the pipeline.
e_val (float) – Expect value to use. Only keep hits with a an equivalent or better (lower) score.
blast_type (str) – Specifies which search program to use. This can be either “PROT” (blastp), “PSI” (psiblast), “mmseqs” (mmseqs2), or “diamond” (diamond).
min_prot_count (int, optional) – Minimum number of hits needed to keep each candidate.
sensitivity (str) – Sets the sensitivity param for mmseqs and diamond (does nothing if BLAST is the seach type).
parse_descriptions (bool, optional) – By default, reference protein descriptions (from fasta headers) are parsed for gene name labels; specifically, descriptions are split on whitespace characters and the second item is used for the label. Make this false to simply use the whole protein description for the label (i.e everything after the first whitespace in the header). If using this option with NCBI blast, DO NOT use the
-parse_seqids
flag when creating protein databases with makeblastdb.blast_path (string, optional) – Path to the blastp/mmseqs/diamond program, if not using the system default.
**kwargs – These can be any additional BLAST parameters, specified as key-value pairs. Note that certain parameters are not allowed, mainly those that control output formatting. Currently only supported for blastp/psiblast; if blast_type is set to mmseqs or diamond, kwargs will be silently ignored.
- add_blast_step(db, name, e_val, blast_type, sensitivity=None, parse_descriptions=True, blast_path=None, **kwargs)¶
Add a non-filtering search step to the pipeline. That is, search each candidate for target sequences without applying any filtering logic. This is most useful for annotating candidates for non-essential or ancillary genes.
- Parameters
db (str) – Path to the target protein sequence database.
name (str) – A unique name/ID for this step in the pipeline.
e_val (float) – Expect value to use. Only keep hits with a an equivalent or better (lower) score.
blast_type (str) – Specifies which search program to use. This can be either “PROT” (blastp), “PSI” (psiblast), “mmseqs” (mmseqs2), or “diamond” (diamond).
sensitivity (str) – Sets the sensitivity param for mmseqs and diamond (does nothing if BLAST is the seach type).
parse_descriptions (bool, optional) – By default, reference protein descriptions (from fasta headers) are parsed for gene name labels; specifically, descriptions are split on whitespace characters and the second item is used for the label. Make this false to simply use the whole protein description for the label (i.e everything after the first whitespace in the header). If using this option with NCBI BLAST, DO NOT use the
-parse_seqids
flag when creating protein databases with makeblastdb.blast_path (string, optional) – Path to the blastp/mmseqs/diamond program, if not using the system default.
**kwargs – These can be any additional BLAST parameters, specified as key-value pairs. Note that certain parameters are not allowed, mainly those that control output formatting. Currently only supported for blastp/psiblast; if blast_type is set to mmseqs or diamond, kwargs will be silently ignored.
- add_crispr_step()¶
Add a step to search for CRISPR arrays using PILER-CR.
- add_blastn_step(db, name, e_val, parse_descriptions=False, blastn_path='blastn', **kwargs)¶
Add a step to do nucleotide BLAST.
- Parameters
db (str) – Path to the target protein sequence database.
name (str) – A unique name/ID for this step in the pipeline.
e_val (float) – Expect value to use. Only keep hits with a an equivalent or better (lower) score.
parse_descriptions (bool, optional) – By default, reference protein descriptions (from fasta headers) are parsed for gene name labels; specifically, descriptions are split on whitespace characters and the second item is used for the label. Make this false to simply use the whole protein description for the label (i.e everything after the first whitespace in the header). If using this option with NCBI BLAST, DO NOT use the
-parse_seqids
flag when creating protein databases with makeblastdb.blast_path (string, optional) – Path to the blastn program, if not using the system default.
**kwargs – These can be any additional BLAST parameters, specified as key-value pairs. Note that certain parameters are not allowed, mainly those that control output formatting. Currently only supported for blastp/psiblast; if blast_type is set to mmseqs or diamond, kwargs will be silently ignored.
- run(data, job_id=None, output_directory=None, min_prot_len=60, span=10000, record_all_hits=False, incremental_output=False, starting_contig=None, gzip=False) → dict¶
Execute each step in the pipeline, in the order they were added.
- Parameters
data (str) – Path to the input data file. Can be a single- or multi-sequence file in fasta format.
job_id (str, optional) – A unique ID to prefix all output files. If no ID is given, the string “gene_finder” will be used as the prefix. In any case, results from the pipeline are written to the file <prefix>_results.csv.
output_directory (str, optional) – The directory to write output data files to. If no directory is given then the current (working) directory is used.
min_prot_len (int, optional) – Minimum ORF length (aa). Default is 60.
span (int, optional) – Length (nt) upsteam and downstream of each seed hit to keep. Defines the aproximate size of the genomic neighborhoods that will be used as the search space after the seed step.
record_all_hits (bool, optional) – Write data about all genes found (even discarded ones) to the file <job_id>_hits.json, grouped by contig. Note that this contains much of the same information as is in the results CSV file; nevertheless, it may be useful for analysis or troubleshooting a search.
incremental_output (bool, optional) – Write results to disk after each contig is processed. Using this option also creates a checkpoint file that gives the ID of the contig that is currently being processed; if the job finishes successfully, this file will be automatically cleaned up. This feature is especially useful for long-running jobs.
starting_contig (bool, optional) – The sequence identifier of the contig where the run should begin. In other words, skip over records in the input file until the specified contig is reached, and then run the pipeline as normal. This is usually used in conjunction with
incremental_output
.gzip (bool, optional) – Was this file compressed with gzip?
- Returns
Candidate systems, grouped by contig id and genomic location.
- Return type
dict
Operon Analyzer¶
The following modules comprise the core Operon Analyzer functionality.
operon_analyzer.genes¶
- class operon_analyzer.genes.Feature(name: str, coordinates: Tuple[int, int], orfid: str, strand: Optional[int], accession: str, e_val: Optional[float], description: str, sequence: str, bit_score: Optional[int] = None, raw_score: Optional[int] = None, aln_len: Optional[int] = None, pident: Optional[float] = None, nident: Optional[float] = None, mismatch: Optional[float] = None, positive: Optional[float] = None, gapopen: Optional[int] = None, gaps: Optional[int] = None, ppos: Optional[float] = None, qcovhsp: Optional[int] = None)¶
Represents a gene or CRISPR repeat array. This is used internally by
operon_analyzer.genes.Operon
, but appears in the auto-generated documentation for reference.
- class operon_analyzer.genes.Operon(contig: str, contig_filename: str, start: int, end: int, features: List[operon_analyzer.genes.Feature])¶
Provides access to features that were found in the same genomic region, which presumably comprise an actual operon. Whether this is true in reality must be determined by the user, if that is meaningful to them.
- set_sequence(sequence: Bio.Seq.Seq)¶
Stores the nucleotide sequence of the operon.
- property feature_region_sequence: str¶
Returns the nucleotide sequence of the operon, excluding the regions outside of the outermost Features.
- property all_genes¶
Iterates over all genes (i.e. not CRISPR arrays) in the operon regardless of whether it’s been ignored.
- property all_features¶
Iterates over all features in the operon regardless of whether it’s been ignored.
- property feature_names¶
Iterates over the name of each feature in the operon
- get(feature_name: str, regex=False) → List[operon_analyzer.genes.Feature]¶
Returns a list of every Feature with a given name.
- get_unique(feature_name: str, regex=False) → Optional[operon_analyzer.genes.Feature]¶
Returns a Feature or None if there is more than one Feature with the same name
- as_str() → str¶
Writes an Operon back out in the same CSV format that gene_finder produces. The text won’t be completely identical in the case where floats have trailing decimals, or zero values in scientific format are recast as a simple float.
operon_analyzer.rules¶
- class operon_analyzer.rules.SerializableFunction(name: str, function: Callable, *args, custom_repr: Optional[str] = None)¶
A base class for functions that we need to be able to serialize. Do not instantiate this directly.
- class operon_analyzer.rules.Rule(name: str, function: Callable, *args, custom_repr: Optional[str] = None)¶
Defines a requirement that elements of an operon must adhere to.
- evaluate(operon: operon_analyzer.genes.Operon) → bool¶
Determine if an operon adheres to this rule.
- class operon_analyzer.rules.Result(operon: operon_analyzer.genes.Operon)¶
Records which rules an operon passed and handles serialization of the data. Also makes it easy to run follow-up queries.
- add_passing(rule: operon_analyzer.rules.Rule)¶
Mark this rule as being one that the operon passed.
- add_failing(rule: operon_analyzer.rules.Rule)¶
Mark this rule as being one that the operon failed.
- property is_passing: bool¶
Declares whether the given Operon adhered to all given Rules.
- class operon_analyzer.rules.Filter(name: str, function: Callable, *args, custom_repr: Optional[str] = None)¶
A function that will be run on an Operon that marks Features as being ignorable for the purposes of evaluating RuleSets.
- run(operon: operon_analyzer.genes.Operon)¶
Mark Features in the Operon as ignored if they don’t pass the filter. This will prevent them from being taken into account during Rule evaluation and (by default) during visualization.
- class operon_analyzer.rules.FilterSet¶
Stores functions that take an Operon and mark individual Features as ignored in case we think they are not actually worth taking into account when evaluating rules. Features can be ignored for multiple reasons.
- must_be_within_n_bp_of_anything(distance_bp: int)¶
If a feature is very far away from anything it’s probably not part of an operon.
- must_be_within_n_bp_of_feature(feature_name: str, distance_bp: int, regex: bool = False)¶
There may be situations where two features always appear near each other in functional operons.
- pick_overlapping_features_by_bit_score(minimum_overlap_threshold: float)¶
If two features overlap by more than
minimum_overlap_threshold
, the one with the lower bit score is ignored.
- custom(filt: operon_analyzer.rules.Filter)¶
Add a rule with a user-defined function.
- evaluate(operon: operon_analyzer.genes.Operon)¶
Run the filters on the operon and set Features that fail to meet the requirements to be ignored.
- Parameters
operon – The
operon_analyzer.genes.Operon
object whose features will be evaluated.
- class operon_analyzer.rules.RuleSet¶
Creates, stores and evaluates
operon_analyzer.rules.Rule
s that an operon must adhere to.- exclude(feature_name: str, regex: bool = False)¶
Forbid the presence of a particular feature.
- require(feature_name: str, regex: bool = False)¶
Require the presence of a particular feature.
- max_distance(feature1_name: str, feature2_name: str, distance_bp: int, closest_pair_only: bool = False, regex: bool = False)¶
The two given features must be no further than
distance_bp
base pairs apart. If there is more than one match, all possible pairs must meet the criteria, unlessclosest_pair_only
is True in which case only the closets pair is considered.
- at_least_n_bp_from_anything(feature_name: str, distance_bp: int, regex=False)¶
Requires that a feature be at least
distance_bp
base pairs away from any other feature. This is mostly useful for eliminating overlapping features.
- at_most_n_bp_from_anything(feature_name: str, distance_bp: int, regex: bool = False)¶
A given feature must be within
distance_bp
base pairs of another feature. Requires exactly one matching feature to be present. Returns False if the given feature is the only feature.
- same_orientation(exceptions: Optional[List[str]] = None)¶
All features in the operon must have the same orientation.
- contains_any_set_of_features(sets: List[List[str]])¶
Returns True if the operon contains features with all of the names in at least one of the lists. Useful for determining if an operon contains all of the essential genes for a particular system, for example.
- contains_exactly_one_of(feature1_name: str, feature2_name: str, regex: bool = False)¶
An exclusive-or of the presence of two features. That is, one of the features must be present and the other must not.
- contains_at_least_n_features(feature_names: List[str], feature_count: int, count_multiple_copies: bool = False)¶
The operon must contain at least
feature_count
features in the list. By default, a matching feature that appears multiple times in the operon will only be counted once; to count multiple copies of the same feature, setcount_multiple_copies
to True.
- contains_group(feature_names: List[str], max_gap_distance_bp: int, require_same_orientation: bool)¶
The operon must contain a contiguous set of features (in any order) separated by no more than
max_gap_distance_bp
. Optionally, the user may require that the features must all have the same orientation.
- maximum_size(feature_name: str, max_bp: int, all_matching_features_must_pass: bool = False, regex: bool = False)¶
The operon must contain at least one feature with
feature_name
with a size (in base pairs) ofmax_bp
or smaller. Ifall_matching_features_must_pass
is True, every matching Feature must be at leastmax_bp
long.
- minimum_size(feature_name: str, min_bp: int, all_matching_features_must_pass: bool = False, regex: bool = False)¶
The operon must contain at least one feature with
feature_name
with a size (in base pairs) ofmin_bp
or larger. Ifall_matching_features_must_pass
is True, every matching Feature must be at leastmin_bp
long.
- custom(rule: operon_analyzer.rules.Rule)¶
Add a rule with a user-defined function.
- evaluate(operon: operon_analyzer.genes.Operon) → operon_analyzer.rules.Result¶
See if an operon adheres to all rules.
- Parameters
operon – The
operon_analyzer.genes.Operon
object to evaluate.
operon_analyzer.analyze¶
- operon_analyzer.analyze.analyze(input_lines: IO[str], ruleset: operon_analyzer.rules.RuleSet, filterset: Optional[operon_analyzer.rules.FilterSet] = None, output: Optional[IO] = None)¶
Takes a handle to the CSV generated by
gene_finder.pipeline.Pipeline
and aoperon_analyzer.rules.RuleSet
object, and produces text that describes which operons adhered to those rules. If an operon fails any of the rules, the exact rules will be enumerated.
- operon_analyzer.analyze.evaluate_rules_and_reserialize(input_lines: IO[str], ruleset: operon_analyzer.rules.RuleSet, filterset: Optional[operon_analyzer.rules.FilterSet] = None, output: Optional[IO] = None)¶
Takes a handle to the CSV generated by
gene_finder.pipeline.Pipeline
and aoperon_analyzer.rules.RuleSet
object, and writes passing operons back to stdout.
- operon_analyzer.analyze.load_analyzed_operons(f: IO[str]) → Iterator[Tuple[str, int, int, str]]¶
Loads and parses the data from the output of
operon_analyzer.analyze.analyze()
. This is typically used for analyzing or visualizing candidate operons.
- operon_analyzer.analyze.group_similar_operons(operons: List[operon_analyzer.genes.Operon], load_sequences: bool = True)¶
Groups operons together if the nucleotide sequences bounded by their outermost features (represented by
operon_analyzer.genes.Feature
objects) are identical. Ifload_sequences
is True, the nucleotide sequence of each operon will be loaded from disk as it is encountered.- Returns
A representative
operon_analyzer.genes.Operon
object for each group.- Return type
list
- operon_analyzer.analyze.deduplicate_operons_approximate(operons: Iterator[operon_analyzer.genes.Operon]) → List[operon_analyzer.genes.Operon]¶
Deduplicates Operons by examining the names and sequences of their features (represented by
operon_analyzer.genes.Feature
objects) and the sizes of the gaps between them. This is an approximate algorithm: false positives are possible when the nucleotide sequence varies between the Features (without changing the total number of base pairs) or if there are silent mutations in the Feature CDS. However, it is much faster than the exact method.- Returns
A representative
operon_analyzer.genes.Operon
object for each group.- Return type
list
- operon_analyzer.analyze.dedup_supersets(operons: List[operon_analyzer.genes.Operon]) → List[operon_analyzer.genes.Operon]¶
If the same inputs are searched with
gene_finder.pipeline.Pipeline
using an expanded database, the new results will be either exactly identical to the previous results, or will be supersets of the old results.This function takes all operons, and removes ones with identical accession IDs and contig coordinates, where the smaller operon’s features are all contained in the larger one.
- Returns
The non-redundant
operon_analyzer.genes.Operon
objects.- Return type
list
- operon_analyzer.analyze.cluster_operons_by_feature_order(operons: Iterator[operon_analyzer.genes.Operon])¶
Organizes all operons into a dictionary based on the order/identity of their features (represented by
operon_analyzer.genes.Feature
objects). Cases where the overall order is inverted are considered to be the same. The keys of the dictionary are the dash-delimited feature names, with one of the two orientations (if both exist) arbitrarily chosen. If there are ignored features, they will not appear in the key.- Returns
The resulting
operon_analyzer.genes.Operon
clusters.- Return type
dict
operon_analyzer.visualize¶
- operon_analyzer.visualize.plot_operons(operons: List[operon_analyzer.genes.Operon], output_directory: str, plot_ignored: bool = True, color_by_blast_statistic: Optional[str] = None, feature_colors: Optional[dict] = {}, nucl_per_line: Optional[int] = None, show_accession: bool = False, show_description: bool = False)¶
Takes
operon_analyzer.genes.Operon
objects and saves plots of them to disk.- Parameters
operons (list) – Operons to be plotted.
output_directory (str) – Path to the directory to save operon plots to.
plot_ignored (bool, optional) – Toggles plotting of features that were marked as ignorable by
operon_analyzer.rules.FilterSet
.color_by_blast_statistic (str, optional) – Map an alignment quality statistic using the virdis color scale. For a list of alignment statistics captured by Opfi, see Opfi output format .
feature_colors (dict, optional) – If a labeled database was used during candidate identification, features can be colored accordingly using “label”: “feature-color” pairs. For more information about labeling sequence databases, see Annotating sequence databases .
nucl_per_line (int, optional) – Length (in base pairs) to wrap gene diagrams on.
show_accession (bool, optional) – Show the accession number of the best hit for each plotted feature.
show_description (bool, optional) – Show the description of the best hit for each plotted feature.
- operon_analyzer.visualize.plot_operon_pairs(operons: List[operon_analyzer.genes.Operon], other_operons: List[operon_analyzer.genes.Operon], output_directory: str, color_by_blast_statistic: Optional[str] = None, plot_ignored: bool = False, feature_colors: Optional[dict] = {})¶
Takes two lists of presumably related Operons, pairs them up such that the pairs overlap the same genomic region, and plots one on top of the other. This allows side-by-side comparison of two different pipeline runs, so that you can, for example, run your regular pipeline, then re-BLAST with a more general protein database like nr, and easily see how the annotations differ.
- Parameters
operons (list) – Operons to be plotted.
other_operons (list) – Related operons to be plotted for comparison.
output_directory (str) – Path to the directory to save operon plots to.
plot_ignored (bool, optional) – Toggles plotting of features that were marked as ignorable by
operon_analyzer.rules.FilterSet
.color_by_blast_statistic (str, optional) – Map an alignment quality statistic using the virdis color scale. For a list of alignment statistics captured by Opfi, see Opfi output format .
feature_colors (dict, optional) – If a labeled database was used during candidate identification, features can be colored accordingly using “label”: “feature-color” pairs. For more information about labeling sequence databases, see Annotating sequence databases .
- operon_analyzer.visualize.make_clustered_stacked_operon_plots(operons: Iterable[operon_analyzer.genes.Operon], other_operons: Iterable[operon_analyzer.genes.Operon], image_directory: str, min_count: int = 10, plot_ignored: bool = False, color_by_blast_statistic: Optional[str] = None, feature_colors: Optional[dict] = None)¶
Clusters operons and plots them on top of a reannotated version of the same operon. This allows the user to BLAST some set of data with a curated database, then re-BLAST it against a more general database, and compare the two directly in a cluster-specific manner.
If a
operon_analyzer.rules.FilterSet
was used during analysis, that same set should be evaluated on each operon before passing it into this function.- Parameters
operons (iterable) – The operons of interest.
other_operons (iterable) – Reannotated operons.
image_directory (str) – The directory where all subdirectories will be created. Will be created if it does not exist.
min_count (int, optional) – Groups must have at least this many systems in order to be plotted. Default is 10.
plot_ignored (bool, optional) – Toggles plotting of features that were marked as ignorable by
operon_analyzer.rules.FilterSet
.color_by_blast_statistic (str, optional) – Map an alignment quality statistic using the virdis color scale. For a list of alignment statistics captured by Opfi, see Opfi output format .
feature_colors (dict, optional) – If a labeled database was used during candidate identification, features can be colored accordingly using “label”: “feature-color” pairs. For more information about labeling sequence databases, see Annotating sequence databases .
nucl_per_line (int, optional) – Length (in base pairs) to wrap gene diagrams on.
- operon_analyzer.visualize.make_clustered_operon_plots(analysis_csv: str, operons: Iterable[operon_analyzer.genes.Operon], image_directory: str, min_count: int = 10, diff_against_csv: Optional[str] = None, plot_ignored: bool = False, color_by_blast_statistic: Optional[str] = None, feature_colors: Optional[dict] = None, nucl_per_line: Optional[int] = None)¶
Clusters operons by the order of their features and plots them in separate directories, adding the number of systems to the directory name. Only systems that passed the rules specified as a
operon_analyzer.rules.RuleSet
object will be eligible to be plotted.If a
operon_analyzer.rules.FilterSet
was used during analysis, that same FilterSet should be evaluated on each operon before passing it into this function.- Parameters
analysis_csv (str) – Path to the CSV file created by
operon_analyzer.analyze.analyze()
.operons (iterable) – The operons of interest.
image_directory (str) – The directory where all subdirectories will be created. Will be created if it does not exist.
min_count (int, optional) – Groups must have at least this many systems in order to be plotted. Default is 10.
diff_against_csv (str) – Path to a CSV file created by operon analyzer. Any clusters in this file will be skipped when clustering operons from analysis_csv. The point of this is to see only new systems when making slight alterations to rules.
plot_ignored (bool, optional) – Toggles plotting of features that were marked as ignorable by
operon_analyzer.rules.FilterSet
.color_by_blast_statistic (str, optional) – Map an alignment quality statistic using the virdis color scale. For a list of alignment statistics captured by Opfi, see Opfi output format .
feature_colors (dict, optional) – If a labeled database was used during candidate identification, features can be colored accordingly using “label”: “feature-color” pairs. For more information about labeling sequence databases, see Annotating sequence databases .
operon_analyzer.overview¶
- operon_analyzer.overview.load_counts(lines: IO[str]) → Tuple[Dict[str, int], Dict[str, int], Dict[int, int]]¶
Takes a stream of the output from
operon_analyzer.analyze.analyze()
and computes some useful statistics. Namely, the number of times each rule was the only one broken for a contig, the number of times each rule was broken regardless of context, and the occurrences of the count of broken rules per contig.
operon_analyzer.reannotation¶
- operon_analyzer.reannotation.summarize(operons: List[operon_analyzer.genes.Operon], reannotated_operons: List[operon_analyzer.genes.Operon])¶
For each protein in each operon, prints out how frequently it was reannotated to a particular protein. For example, a putative Cas9 may be identified as a transposase when BLASTed with a more expansive database.
operon_analyzer.load¶
- operon_analyzer.load.load_sequence(operon: operon_analyzer.genes.Operon) → Optional[Bio.Seq.Seq]¶
Loads the DNA sequence for a given operon’s contig.
- operon_analyzer.load.load_gzipped_operons(filename: str) → Iterator[operon_analyzer.genes.Operon]¶
Create a
operon_analyzer.genes.Operon
representation for each candidate present in gzipped output fromgene_finder.pipeline.Pipeline
.- Parameters
filename (str) – Path to the compressed pipeline CSV file.
- operon_analyzer.load.load_operons(handle: IO[str]) → Iterator[operon_analyzer.genes.Operon]¶
Create a
operon_analyzer.genes.Operon
representation for each candidate present in output fromgene_finder.pipeline.Pipeline
.- Parameters
handle (IO) – Handle for the pipeline CSV file.
operon_analyzer.parse¶
- operon_analyzer.parse.assemble_operons(lines: Iterator[Tuple[str, str, str, str, str, str, str, str, str, str]]) → Iterator[operon_analyzer.genes.Operon]¶
Takes the output from
gene_finder.pipeline.Pipeline
and loads all features, then assembles them intooperon_analyzer.genes.Operon
objects.To keep things memory efficient while not allowing redundant operons from being loaded, we keep track of the hash of each operon, which is just an integer. Even for very large metagenomic databases this should use less than a gigabyte of memory.
This is a helper function used by the loaders in
operon_analyzer.load
, but appears in the auto-generated documentation for reference.
The next set of modules expose simple functions for dealing with CRISPR array sequences. Presumably, these would only be useful to researchers interested in CRISPR-Cas genomic systems.
operon_analyzer.piler_parse¶
Parses piler-cr output to extract the sequences of every spacer.
- class operon_analyzer.piler_parse.RepeatSpacer(position, repeat_len, spacer_len, sequence)¶
- property position¶
Alias for field number 0
- property repeat_len¶
Alias for field number 1
- property sequence¶
Alias for field number 3
- property spacer_len¶
Alias for field number 2
- class operon_analyzer.piler_parse.BrokenSpacer(position, repeat_len, sequence)¶
- property position¶
Alias for field number 0
- property repeat_len¶
Alias for field number 1
- property sequence¶
Alias for field number 2
- operon_analyzer.piler_parse.parse_pilercr_output(text: str, start: int, end: int)¶
Takes the text of pilercr raw output, and a start and end coordinate for an operon’s neighborhood, and extracts all spacers in that region.
operon_analyzer.repeat_finder¶
- class operon_analyzer.repeat_finder.Repeat(upstream_sequence, upstream_start, downstream_sequence, downstream_start)¶
- property downstream_sequence¶
Alias for field number 2
- property downstream_start¶
Alias for field number 3
- property upstream_sequence¶
Alias for field number 0
- property upstream_start¶
Alias for field number 1
- class operon_analyzer.repeat_finder.GRFResult(start, end, alignment)¶
- property alignment¶
Alias for field number 2
- property end¶
Alias for field number 1
- property start¶
Alias for field number 0
- operon_analyzer.repeat_finder.find_inverted_repeats(operon: operon_analyzer.genes.Operon, buffer_around_operon: int, min_repeat_size: int)¶
Searches an operon and the DNA flanking it for inverted repeats using GenericRepeatFinder. If found, they will be added to the operon as Feature objects with the name “TIR” and the sequence. The strand will be set to 1 for the upstream sequence and -1 for the downstream sequence.
- Parameters
operon (Operon) – The
operon_analyzer.genes.Operon
object.buffer_around_operon (int) – The number of base pairs on either side of the operon to search in addition to the operon’s internal sequence.
min_repeat_size (int) – The minimum number of base pairs that an inverted repeat must have.
operon_analyzer.spacers¶
- class operon_analyzer.spacers.AlignmentResult(match_count, spacer_sequence, contig_sequence, contig_start, contig_end, spacer_alignment, contig_alignment, comp_string, strand, spacer_order, array_length)¶
- property array_length¶
Alias for field number 10
- property comp_string¶
Alias for field number 7
- property contig_alignment¶
Alias for field number 6
- property contig_end¶
Alias for field number 4
- property contig_sequence¶
Alias for field number 2
- property contig_start¶
Alias for field number 3
- property match_count¶
Alias for field number 0
- property spacer_alignment¶
Alias for field number 5
- property spacer_order¶
Alias for field number 9
- property spacer_sequence¶
Alias for field number 1
- property strand¶
Alias for field number 8
- operon_analyzer.spacers.find_self_targeting_spacers(operons: List[operon_analyzer.genes.Operon], min_matching_fraction: float, num_processes: int = 32)¶
For each given Operon, this will determine if CRISPR spacers target a location in the operon’s parent contig. Matches with more than
min_matching_fraction
homology will be added to the Operon as a Feature named “CRISPR target”.
Contributing¶
Thank you for your interest in contributing to Opfi. Contribututions can take many forms, including:
Reporting a bug
Discussing the current state of the code
Submitting a fix
Proposing new features
Issues¶
Issues should be used to report problems or bugs with the library, to request new features, or to discuss potential changes before PRs are created.
Good bug reports have:
A summary of the issue
Specific steps to reproduce the problem, preferably in the form of code examples
A description of expected behavior
A description of actual behavior
Pull Requests¶
In general, we use Github Flow. This means that all changes happen through Pull Requests:
Fork the repository to your own Github account
Clone the project to your machine
Create a branch locally with a succinct but descriptive name
Commit changes to the branch
Push changes to your fork
Open a PR in our repository and include a description of the changes and a reference to the issue the PR addresses
Coding Style¶
Contributors should attempt to adhere to the PEP8 style guide when possible, although this is not currently strictly enforced. At a minimum, new contributions should follow a style that is consistent with the preexisting code base.
Testing¶
We use the pytest framework for creating unit tests. Please write tests for any new code contributed to this project.
License¶
By contributing, you agree that your contributions will be licensed under its MIT License.
Code of Conduct¶
We take our open source community seriously and hold ourselves and other contributors to high standards of communication. By participating and contributing to this project, you agree to uphold our Code of Conduct.
Attribution¶
This document was adapted from the General Contributing Guidelines of the rextendr project, and from the contributing template developed by briandk.