Operon Analyzer¶
The following modules comprise the core Operon Analyzer functionality.
operon_analyzer.genes¶
- class operon_analyzer.genes.Feature(name: str, coordinates: Tuple[int, int], orfid: str, strand: Optional[int], accession: str, e_val: Optional[float], description: str, sequence: str, bit_score: Optional[int] = None, raw_score: Optional[int] = None, aln_len: Optional[int] = None, pident: Optional[float] = None, nident: Optional[float] = None, mismatch: Optional[float] = None, positive: Optional[float] = None, gapopen: Optional[int] = None, gaps: Optional[int] = None, ppos: Optional[float] = None, qcovhsp: Optional[int] = None)¶
Represents a gene or CRISPR repeat array. This is used internally by
operon_analyzer.genes.Operon
, but appears in the auto-generated documentation for reference.
- class operon_analyzer.genes.Operon(contig: str, contig_filename: str, start: int, end: int, features: List[operon_analyzer.genes.Feature])¶
Provides access to features that were found in the same genomic region, which presumably comprise an actual operon. Whether this is true in reality must be determined by the user, if that is meaningful to them.
- set_sequence(sequence: Bio.Seq.Seq)¶
Stores the nucleotide sequence of the operon.
- property feature_region_sequence: str¶
Returns the nucleotide sequence of the operon, excluding the regions outside of the outermost Features.
- property all_genes¶
Iterates over all genes (i.e. not CRISPR arrays) in the operon regardless of whether it’s been ignored.
- property all_features¶
Iterates over all features in the operon regardless of whether it’s been ignored.
- property feature_names¶
Iterates over the name of each feature in the operon
- get(feature_name: str, regex=False) → List[operon_analyzer.genes.Feature]¶
Returns a list of every Feature with a given name.
- get_unique(feature_name: str, regex=False) → Optional[operon_analyzer.genes.Feature]¶
Returns a Feature or None if there is more than one Feature with the same name
- as_str() → str¶
Writes an Operon back out in the same CSV format that gene_finder produces. The text won’t be completely identical in the case where floats have trailing decimals, or zero values in scientific format are recast as a simple float.
operon_analyzer.rules¶
- class operon_analyzer.rules.SerializableFunction(name: str, function: Callable, *args, custom_repr: Optional[str] = None)¶
A base class for functions that we need to be able to serialize. Do not instantiate this directly.
- class operon_analyzer.rules.Rule(name: str, function: Callable, *args, custom_repr: Optional[str] = None)¶
Defines a requirement that elements of an operon must adhere to.
- evaluate(operon: operon_analyzer.genes.Operon) → bool¶
Determine if an operon adheres to this rule.
- class operon_analyzer.rules.Result(operon: operon_analyzer.genes.Operon)¶
Records which rules an operon passed and handles serialization of the data. Also makes it easy to run follow-up queries.
- add_passing(rule: operon_analyzer.rules.Rule)¶
Mark this rule as being one that the operon passed.
- add_failing(rule: operon_analyzer.rules.Rule)¶
Mark this rule as being one that the operon failed.
- property is_passing: bool¶
Declares whether the given Operon adhered to all given Rules.
- class operon_analyzer.rules.Filter(name: str, function: Callable, *args, custom_repr: Optional[str] = None)¶
A function that will be run on an Operon that marks Features as being ignorable for the purposes of evaluating RuleSets.
- run(operon: operon_analyzer.genes.Operon)¶
Mark Features in the Operon as ignored if they don’t pass the filter. This will prevent them from being taken into account during Rule evaluation and (by default) during visualization.
- class operon_analyzer.rules.FilterSet¶
Stores functions that take an Operon and mark individual Features as ignored in case we think they are not actually worth taking into account when evaluating rules. Features can be ignored for multiple reasons.
- must_be_within_n_bp_of_anything(distance_bp: int)¶
If a feature is very far away from anything it’s probably not part of an operon.
- must_be_within_n_bp_of_feature(feature_name: str, distance_bp: int, regex: bool = False)¶
There may be situations where two features always appear near each other in functional operons.
- pick_overlapping_features_by_bit_score(minimum_overlap_threshold: float)¶
If two features overlap by more than
minimum_overlap_threshold
, the one with the lower bit score is ignored.
- custom(filt: operon_analyzer.rules.Filter)¶
Add a rule with a user-defined function.
- evaluate(operon: operon_analyzer.genes.Operon)¶
Run the filters on the operon and set Features that fail to meet the requirements to be ignored.
- Parameters
operon – The
operon_analyzer.genes.Operon
object whose features will be evaluated.
- class operon_analyzer.rules.RuleSet¶
Creates, stores and evaluates
operon_analyzer.rules.Rule
s that an operon must adhere to.- exclude(feature_name: str, regex: bool = False)¶
Forbid the presence of a particular feature.
- require(feature_name: str, regex: bool = False)¶
Require the presence of a particular feature.
- max_distance(feature1_name: str, feature2_name: str, distance_bp: int, closest_pair_only: bool = False, regex: bool = False)¶
The two given features must be no further than
distance_bp
base pairs apart. If there is more than one match, all possible pairs must meet the criteria, unlessclosest_pair_only
is True in which case only the closets pair is considered.
- at_least_n_bp_from_anything(feature_name: str, distance_bp: int, regex=False)¶
Requires that a feature be at least
distance_bp
base pairs away from any other feature. This is mostly useful for eliminating overlapping features.
- at_most_n_bp_from_anything(feature_name: str, distance_bp: int, regex: bool = False)¶
A given feature must be within
distance_bp
base pairs of another feature. Requires exactly one matching feature to be present. Returns False if the given feature is the only feature.
- same_orientation(exceptions: Optional[List[str]] = None)¶
All features in the operon must have the same orientation.
- contains_any_set_of_features(sets: List[List[str]])¶
Returns True if the operon contains features with all of the names in at least one of the lists. Useful for determining if an operon contains all of the essential genes for a particular system, for example.
- contains_exactly_one_of(feature1_name: str, feature2_name: str, regex: bool = False)¶
An exclusive-or of the presence of two features. That is, one of the features must be present and the other must not.
- contains_at_least_n_features(feature_names: List[str], feature_count: int, count_multiple_copies: bool = False)¶
The operon must contain at least
feature_count
features in the list. By default, a matching feature that appears multiple times in the operon will only be counted once; to count multiple copies of the same feature, setcount_multiple_copies
to True.
- contains_group(feature_names: List[str], max_gap_distance_bp: int, require_same_orientation: bool)¶
The operon must contain a contiguous set of features (in any order) separated by no more than
max_gap_distance_bp
. Optionally, the user may require that the features must all have the same orientation.
- maximum_size(feature_name: str, max_bp: int, all_matching_features_must_pass: bool = False, regex: bool = False)¶
The operon must contain at least one feature with
feature_name
with a size (in base pairs) ofmax_bp
or smaller. Ifall_matching_features_must_pass
is True, every matching Feature must be at leastmax_bp
long.
- minimum_size(feature_name: str, min_bp: int, all_matching_features_must_pass: bool = False, regex: bool = False)¶
The operon must contain at least one feature with
feature_name
with a size (in base pairs) ofmin_bp
or larger. Ifall_matching_features_must_pass
is True, every matching Feature must be at leastmin_bp
long.
- custom(rule: operon_analyzer.rules.Rule)¶
Add a rule with a user-defined function.
- evaluate(operon: operon_analyzer.genes.Operon) → operon_analyzer.rules.Result¶
See if an operon adheres to all rules.
- Parameters
operon – The
operon_analyzer.genes.Operon
object to evaluate.
operon_analyzer.analyze¶
- operon_analyzer.analyze.analyze(input_lines: IO[str], ruleset: operon_analyzer.rules.RuleSet, filterset: Optional[operon_analyzer.rules.FilterSet] = None, output: Optional[IO] = None)¶
Takes a handle to the CSV generated by
gene_finder.pipeline.Pipeline
and aoperon_analyzer.rules.RuleSet
object, and produces text that describes which operons adhered to those rules. If an operon fails any of the rules, the exact rules will be enumerated.
- operon_analyzer.analyze.evaluate_rules_and_reserialize(input_lines: IO[str], ruleset: operon_analyzer.rules.RuleSet, filterset: Optional[operon_analyzer.rules.FilterSet] = None, output: Optional[IO] = None)¶
Takes a handle to the CSV generated by
gene_finder.pipeline.Pipeline
and aoperon_analyzer.rules.RuleSet
object, and writes passing operons back to stdout.
- operon_analyzer.analyze.load_analyzed_operons(f: IO[str]) → Iterator[Tuple[str, int, int, str]]¶
Loads and parses the data from the output of
operon_analyzer.analyze.analyze()
. This is typically used for analyzing or visualizing candidate operons.
- operon_analyzer.analyze.group_similar_operons(operons: List[operon_analyzer.genes.Operon], load_sequences: bool = True)¶
Groups operons together if the nucleotide sequences bounded by their outermost features (represented by
operon_analyzer.genes.Feature
objects) are identical. Ifload_sequences
is True, the nucleotide sequence of each operon will be loaded from disk as it is encountered.- Returns
A representative
operon_analyzer.genes.Operon
object for each group.- Return type
list
- operon_analyzer.analyze.deduplicate_operons_approximate(operons: Iterator[operon_analyzer.genes.Operon]) → List[operon_analyzer.genes.Operon]¶
Deduplicates Operons by examining the names and sequences of their features (represented by
operon_analyzer.genes.Feature
objects) and the sizes of the gaps between them. This is an approximate algorithm: false positives are possible when the nucleotide sequence varies between the Features (without changing the total number of base pairs) or if there are silent mutations in the Feature CDS. However, it is much faster than the exact method.- Returns
A representative
operon_analyzer.genes.Operon
object for each group.- Return type
list
- operon_analyzer.analyze.dedup_supersets(operons: List[operon_analyzer.genes.Operon]) → List[operon_analyzer.genes.Operon]¶
If the same inputs are searched with
gene_finder.pipeline.Pipeline
using an expanded database, the new results will be either exactly identical to the previous results, or will be supersets of the old results.This function takes all operons, and removes ones with identical accession IDs and contig coordinates, where the smaller operon’s features are all contained in the larger one.
- Returns
The non-redundant
operon_analyzer.genes.Operon
objects.- Return type
list
- operon_analyzer.analyze.cluster_operons_by_feature_order(operons: Iterator[operon_analyzer.genes.Operon])¶
Organizes all operons into a dictionary based on the order/identity of their features (represented by
operon_analyzer.genes.Feature
objects). Cases where the overall order is inverted are considered to be the same. The keys of the dictionary are the dash-delimited feature names, with one of the two orientations (if both exist) arbitrarily chosen. If there are ignored features, they will not appear in the key.- Returns
The resulting
operon_analyzer.genes.Operon
clusters.- Return type
dict
operon_analyzer.visualize¶
- operon_analyzer.visualize.plot_operons(operons: List[operon_analyzer.genes.Operon], output_directory: str, plot_ignored: bool = True, color_by_blast_statistic: Optional[str] = None, feature_colors: Optional[dict] = {}, nucl_per_line: Optional[int] = None, show_accession: bool = False, show_description: bool = False)¶
Takes
operon_analyzer.genes.Operon
objects and saves plots of them to disk.- Parameters
operons (list) – Operons to be plotted.
output_directory (str) – Path to the directory to save operon plots to.
plot_ignored (bool, optional) – Toggles plotting of features that were marked as ignorable by
operon_analyzer.rules.FilterSet
.color_by_blast_statistic (str, optional) – Map an alignment quality statistic using the virdis color scale. For a list of alignment statistics captured by Opfi, see Opfi output format .
feature_colors (dict, optional) – If a labeled database was used during candidate identification, features can be colored accordingly using “label”: “feature-color” pairs. For more information about labeling sequence databases, see Annotating sequence databases .
nucl_per_line (int, optional) – Length (in base pairs) to wrap gene diagrams on.
show_accession (bool, optional) – Show the accession number of the best hit for each plotted feature.
show_description (bool, optional) – Show the description of the best hit for each plotted feature.
- operon_analyzer.visualize.plot_operon_pairs(operons: List[operon_analyzer.genes.Operon], other_operons: List[operon_analyzer.genes.Operon], output_directory: str, color_by_blast_statistic: Optional[str] = None, plot_ignored: bool = False, feature_colors: Optional[dict] = {})¶
Takes two lists of presumably related Operons, pairs them up such that the pairs overlap the same genomic region, and plots one on top of the other. This allows side-by-side comparison of two different pipeline runs, so that you can, for example, run your regular pipeline, then re-BLAST with a more general protein database like nr, and easily see how the annotations differ.
- Parameters
operons (list) – Operons to be plotted.
other_operons (list) – Related operons to be plotted for comparison.
output_directory (str) – Path to the directory to save operon plots to.
plot_ignored (bool, optional) – Toggles plotting of features that were marked as ignorable by
operon_analyzer.rules.FilterSet
.color_by_blast_statistic (str, optional) – Map an alignment quality statistic using the virdis color scale. For a list of alignment statistics captured by Opfi, see Opfi output format .
feature_colors (dict, optional) – If a labeled database was used during candidate identification, features can be colored accordingly using “label”: “feature-color” pairs. For more information about labeling sequence databases, see Annotating sequence databases .
- operon_analyzer.visualize.make_clustered_stacked_operon_plots(operons: Iterable[operon_analyzer.genes.Operon], other_operons: Iterable[operon_analyzer.genes.Operon], image_directory: str, min_count: int = 10, plot_ignored: bool = False, color_by_blast_statistic: Optional[str] = None, feature_colors: Optional[dict] = None)¶
Clusters operons and plots them on top of a reannotated version of the same operon. This allows the user to BLAST some set of data with a curated database, then re-BLAST it against a more general database, and compare the two directly in a cluster-specific manner.
If a
operon_analyzer.rules.FilterSet
was used during analysis, that same set should be evaluated on each operon before passing it into this function.- Parameters
operons (iterable) – The operons of interest.
other_operons (iterable) – Reannotated operons.
image_directory (str) – The directory where all subdirectories will be created. Will be created if it does not exist.
min_count (int, optional) – Groups must have at least this many systems in order to be plotted. Default is 10.
plot_ignored (bool, optional) – Toggles plotting of features that were marked as ignorable by
operon_analyzer.rules.FilterSet
.color_by_blast_statistic (str, optional) – Map an alignment quality statistic using the virdis color scale. For a list of alignment statistics captured by Opfi, see Opfi output format .
feature_colors (dict, optional) – If a labeled database was used during candidate identification, features can be colored accordingly using “label”: “feature-color” pairs. For more information about labeling sequence databases, see Annotating sequence databases .
nucl_per_line (int, optional) – Length (in base pairs) to wrap gene diagrams on.
- operon_analyzer.visualize.make_clustered_operon_plots(analysis_csv: str, operons: Iterable[operon_analyzer.genes.Operon], image_directory: str, min_count: int = 10, diff_against_csv: Optional[str] = None, plot_ignored: bool = False, color_by_blast_statistic: Optional[str] = None, feature_colors: Optional[dict] = None, nucl_per_line: Optional[int] = None)¶
Clusters operons by the order of their features and plots them in separate directories, adding the number of systems to the directory name. Only systems that passed the rules specified as a
operon_analyzer.rules.RuleSet
object will be eligible to be plotted.If a
operon_analyzer.rules.FilterSet
was used during analysis, that same FilterSet should be evaluated on each operon before passing it into this function.- Parameters
analysis_csv (str) – Path to the CSV file created by
operon_analyzer.analyze.analyze()
.operons (iterable) – The operons of interest.
image_directory (str) – The directory where all subdirectories will be created. Will be created if it does not exist.
min_count (int, optional) – Groups must have at least this many systems in order to be plotted. Default is 10.
diff_against_csv (str) – Path to a CSV file created by operon analyzer. Any clusters in this file will be skipped when clustering operons from analysis_csv. The point of this is to see only new systems when making slight alterations to rules.
plot_ignored (bool, optional) – Toggles plotting of features that were marked as ignorable by
operon_analyzer.rules.FilterSet
.color_by_blast_statistic (str, optional) – Map an alignment quality statistic using the virdis color scale. For a list of alignment statistics captured by Opfi, see Opfi output format .
feature_colors (dict, optional) – If a labeled database was used during candidate identification, features can be colored accordingly using “label”: “feature-color” pairs. For more information about labeling sequence databases, see Annotating sequence databases .
operon_analyzer.overview¶
- operon_analyzer.overview.load_counts(lines: IO[str]) → Tuple[Dict[str, int], Dict[str, int], Dict[int, int]]¶
Takes a stream of the output from
operon_analyzer.analyze.analyze()
and computes some useful statistics. Namely, the number of times each rule was the only one broken for a contig, the number of times each rule was broken regardless of context, and the occurrences of the count of broken rules per contig.
operon_analyzer.reannotation¶
- operon_analyzer.reannotation.summarize(operons: List[operon_analyzer.genes.Operon], reannotated_operons: List[operon_analyzer.genes.Operon])¶
For each protein in each operon, prints out how frequently it was reannotated to a particular protein. For example, a putative Cas9 may be identified as a transposase when BLASTed with a more expansive database.
operon_analyzer.load¶
- operon_analyzer.load.load_sequence(operon: operon_analyzer.genes.Operon) → Optional[Bio.Seq.Seq]¶
Loads the DNA sequence for a given operon’s contig.
- operon_analyzer.load.load_gzipped_operons(filename: str) → Iterator[operon_analyzer.genes.Operon]¶
Create a
operon_analyzer.genes.Operon
representation for each candidate present in gzipped output fromgene_finder.pipeline.Pipeline
.- Parameters
filename (str) – Path to the compressed pipeline CSV file.
- operon_analyzer.load.load_operons(handle: IO[str]) → Iterator[operon_analyzer.genes.Operon]¶
Create a
operon_analyzer.genes.Operon
representation for each candidate present in output fromgene_finder.pipeline.Pipeline
.- Parameters
handle (IO) – Handle for the pipeline CSV file.
operon_analyzer.parse¶
- operon_analyzer.parse.assemble_operons(lines: Iterator[Tuple[str, str, str, str, str, str, str, str, str, str]]) → Iterator[operon_analyzer.genes.Operon]¶
Takes the output from
gene_finder.pipeline.Pipeline
and loads all features, then assembles them intooperon_analyzer.genes.Operon
objects.To keep things memory efficient while not allowing redundant operons from being loaded, we keep track of the hash of each operon, which is just an integer. Even for very large metagenomic databases this should use less than a gigabyte of memory.
This is a helper function used by the loaders in
operon_analyzer.load
, but appears in the auto-generated documentation for reference.
The next set of modules expose simple functions for dealing with CRISPR array sequences. Presumably, these would only be useful to researchers interested in CRISPR-Cas genomic systems.
operon_analyzer.piler_parse¶
Parses piler-cr output to extract the sequences of every spacer.
- class operon_analyzer.piler_parse.RepeatSpacer(position, repeat_len, spacer_len, sequence)¶
- property position¶
Alias for field number 0
- property repeat_len¶
Alias for field number 1
- property sequence¶
Alias for field number 3
- property spacer_len¶
Alias for field number 2
- class operon_analyzer.piler_parse.BrokenSpacer(position, repeat_len, sequence)¶
- property position¶
Alias for field number 0
- property repeat_len¶
Alias for field number 1
- property sequence¶
Alias for field number 2
- operon_analyzer.piler_parse.parse_pilercr_output(text: str, start: int, end: int)¶
Takes the text of pilercr raw output, and a start and end coordinate for an operon’s neighborhood, and extracts all spacers in that region.
operon_analyzer.repeat_finder¶
- class operon_analyzer.repeat_finder.Repeat(upstream_sequence, upstream_start, downstream_sequence, downstream_start)¶
- property downstream_sequence¶
Alias for field number 2
- property downstream_start¶
Alias for field number 3
- property upstream_sequence¶
Alias for field number 0
- property upstream_start¶
Alias for field number 1
- class operon_analyzer.repeat_finder.GRFResult(start, end, alignment)¶
- property alignment¶
Alias for field number 2
- property end¶
Alias for field number 1
- property start¶
Alias for field number 0
- operon_analyzer.repeat_finder.find_inverted_repeats(operon: operon_analyzer.genes.Operon, buffer_around_operon: int, min_repeat_size: int)¶
Searches an operon and the DNA flanking it for inverted repeats using GenericRepeatFinder. If found, they will be added to the operon as Feature objects with the name “TIR” and the sequence. The strand will be set to 1 for the upstream sequence and -1 for the downstream sequence.
- Parameters
operon (Operon) – The
operon_analyzer.genes.Operon
object.buffer_around_operon (int) – The number of base pairs on either side of the operon to search in addition to the operon’s internal sequence.
min_repeat_size (int) – The minimum number of base pairs that an inverted repeat must have.
operon_analyzer.spacers¶
- class operon_analyzer.spacers.AlignmentResult(match_count, spacer_sequence, contig_sequence, contig_start, contig_end, spacer_alignment, contig_alignment, comp_string, strand, spacer_order, array_length)¶
- property array_length¶
Alias for field number 10
- property comp_string¶
Alias for field number 7
- property contig_alignment¶
Alias for field number 6
- property contig_end¶
Alias for field number 4
- property contig_sequence¶
Alias for field number 2
- property contig_start¶
Alias for field number 3
- property match_count¶
Alias for field number 0
- property spacer_alignment¶
Alias for field number 5
- property spacer_order¶
Alias for field number 9
- property spacer_sequence¶
Alias for field number 1
- property strand¶
Alias for field number 8
- operon_analyzer.spacers.find_self_targeting_spacers(operons: List[operon_analyzer.genes.Operon], min_matching_fraction: float, num_processes: int = 32)¶
For each given Operon, this will determine if CRISPR spacers target a location in the operon’s parent contig. Matches with more than
min_matching_fraction
homology will be added to the Operon as a Feature named “CRISPR target”.