Operon Analyzer¶

The following modules comprise the core Operon Analyzer functionality.

operon_analyzer.genes¶

class operon_analyzer.genes.Feature(name: str, coordinates: Tuple[int, int], orfid: str, strand: Optional[int], accession: str, e_val: Optional[float], description: str, sequence: str, bit_score: Optional[int] = None, raw_score: Optional[int] = None, aln_len: Optional[int] = None, pident: Optional[float] = None, nident: Optional[float] = None, mismatch: Optional[float] = None, positive: Optional[float] = None, gapopen: Optional[int] = None, gaps: Optional[int] = None, ppos: Optional[float] = None, qcovhsp: Optional[int] = None)¶: Represents a gene or CRISPR repeat array. This is used internally by operon_analyzer.genes.Operon, but appears in the auto-generated documentation for reference.

class operon_analyzer.genes.Operon(contig: str, contig_filename: str, start: int, end: int, features: List[operon_analyzer.genes.Feature])¶

Provides access to features that were found in the same genomic region, which presumably comprise an actual operon. Whether this is true in reality must be determined by the user, if that is meaningful to them.

set_sequence(sequence: Bio.Seq.Seq)¶: Stores the nucleotide sequence of the operon.

property feature_region_sequence: str¶: Returns the nucleotide sequence of the operon, excluding the regions outside of the outermost Features.

property all_genes¶: Iterates over all genes (i.e. not CRISPR arrays) in the operon regardless of whether it’s been ignored.

property all_features¶: Iterates over all features in the operon regardless of whether it’s been ignored.

property feature_names¶: Iterates over the name of each feature in the operon

get(feature_name: str, regex=False) → List[operon_analyzer.genes.Feature]¶: Returns a list of every Feature with a given name.

get_unique(feature_name: str, regex=False) → Optional[operon_analyzer.genes.Feature]¶: Returns a Feature or None if there is more than one Feature with the same name

as_str() → str¶: Writes an Operon back out in the same CSV format that gene_finder produces. The text won’t be completely identical in the case where floats have trailing decimals, or zero values in scientific format are recast as a simple float.

operon_analyzer.rules¶

class operon_analyzer.rules.SerializableFunction(name: str, function: Callable, *args, custom_repr: Optional[str] = None)¶: A base class for functions that we need to be able to serialize. Do not instantiate this directly.

class operon_analyzer.rules.Rule(name: str, function: Callable, *args, custom_repr: Optional[str] = None)¶

Defines a requirement that elements of an operon must adhere to.

evaluate(operon: operon_analyzer.genes.Operon) → bool¶: Determine if an operon adheres to this rule.

class operon_analyzer.rules.Result(operon: operon_analyzer.genes.Operon)¶

Records which rules an operon passed and handles serialization of the data. Also makes it easy to run follow-up queries.

add_passing(rule: operon_analyzer.rules.Rule)¶: Mark this rule as being one that the operon passed.

add_failing(rule: operon_analyzer.rules.Rule)¶: Mark this rule as being one that the operon failed.

property is_passing: bool¶: Declares whether the given Operon adhered to all given Rules.

class operon_analyzer.rules.Filter(name: str, function: Callable, *args, custom_repr: Optional[str] = None)¶

A function that will be run on an Operon that marks Features as being ignorable for the purposes of evaluating RuleSets.

run(operon: operon_analyzer.genes.Operon)¶: Mark Features in the Operon as ignored if they don’t pass the filter. This will prevent them from being taken into account during Rule evaluation and (by default) during visualization.

class operon_analyzer.rules.FilterSet¶

Stores functions that take an Operon and mark individual Features as ignored in case we think they are not actually worth taking into account when evaluating rules. Features can be ignored for multiple reasons.

must_be_within_n_bp_of_anything(distance_bp: int)¶: If a feature is very far away from anything it’s probably not part of an operon.

must_be_within_n_bp_of_feature(feature_name: str, distance_bp: int, regex: bool = False)¶: There may be situations where two features always appear near each other in functional operons.

pick_overlapping_features_by_bit_score(minimum_overlap_threshold: float)¶: If two features overlap by more than minimum_overlap_threshold, the one with the lower bit score is ignored.

custom(filt: operon_analyzer.rules.Filter)¶: Add a rule with a user-defined function.

evaluate(operon: operon_analyzer.genes.Operon)¶

Run the filters on the operon and set Features that fail to meet the requirements to be ignored.

Parameters: operon – The operon_analyzer.genes.Operon object whose features will be evaluated.

class operon_analyzer.rules.RuleSet¶

Creates, stores and evaluates operon_analyzer.rules.Rule s that an operon must adhere to.

exclude(feature_name: str, regex: bool = False)¶: Forbid the presence of a particular feature.

require(feature_name: str, regex: bool = False)¶: Require the presence of a particular feature.

max_distance(feature1_name: str, feature2_name: str, distance_bp: int, closest_pair_only: bool = False, regex: bool = False)¶: The two given features must be no further than distance_bp base pairs apart. If there is more than one match, all possible pairs must meet the criteria, unless closest_pair_only is True in which case only the closets pair is considered.

at_least_n_bp_from_anything(feature_name: str, distance_bp: int, regex=False)¶: Requires that a feature be at least distance_bp base pairs away from any other feature. This is mostly useful for eliminating overlapping features.

at_most_n_bp_from_anything(feature_name: str, distance_bp: int, regex: bool = False)¶: A given feature must be within distance_bp base pairs of another feature. Requires exactly one matching feature to be present. Returns False if the given feature is the only feature.

same_orientation(exceptions: Optional[List[str]] = None)¶: All features in the operon must have the same orientation.

contains_any_set_of_features(sets: List[List[str]])¶: Returns True if the operon contains features with all of the names in at least one of the lists. Useful for determining if an operon contains all of the essential genes for a particular system, for example.

contains_exactly_one_of(feature1_name: str, feature2_name: str, regex: bool = False)¶: An exclusive-or of the presence of two features. That is, one of the features must be present and the other must not.

contains_at_least_n_features(feature_names: List[str], feature_count: int, count_multiple_copies: bool = False)¶: The operon must contain at least feature_count features in the list. By default, a matching feature that appears multiple times in the operon will only be counted once; to count multiple copies of the same feature, set count_multiple_copies to True.

contains_group(feature_names: List[str], max_gap_distance_bp: int, require_same_orientation: bool)¶: The operon must contain a contiguous set of features (in any order) separated by no more than max_gap_distance_bp. Optionally, the user may require that the features must all have the same orientation.

maximum_size(feature_name: str, max_bp: int, all_matching_features_must_pass: bool = False, regex: bool = False)¶: The operon must contain at least one feature with feature_name with a size (in base pairs) of max_bp or smaller. If all_matching_features_must_pass is True, every matching Feature must be at least max_bp long.

minimum_size(feature_name: str, min_bp: int, all_matching_features_must_pass: bool = False, regex: bool = False)¶: The operon must contain at least one feature with feature_name with a size (in base pairs) of min_bp or larger. If all_matching_features_must_pass is True, every matching Feature must be at least min_bp long.

custom(rule: operon_analyzer.rules.Rule)¶: Add a rule with a user-defined function.

evaluate(operon: operon_analyzer.genes.Operon) → operon_analyzer.rules.Result ¶

See if an operon adheres to all rules.

Parameters: operon – The operon_analyzer.genes.Operon object to evaluate.

operon_analyzer.analyze¶

operon_analyzer.analyze.analyze(input_lines: IO[str], ruleset: operon_analyzer.rules.RuleSet, filterset: Optional[operon_analyzer.rules.FilterSet] = None, output: Optional[IO] = None)¶: Takes a handle to the CSV generated by gene_finder.pipeline.Pipeline and a operon_analyzer.rules.RuleSet object, and produces text that describes which operons adhered to those rules. If an operon fails any of the rules, the exact rules will be enumerated.

operon_analyzer.analyze.evaluate_rules_and_reserialize(input_lines: IO[str], ruleset: operon_analyzer.rules.RuleSet, filterset: Optional[operon_analyzer.rules.FilterSet] = None, output: Optional[IO] = None)¶: Takes a handle to the CSV generated by gene_finder.pipeline.Pipeline and a operon_analyzer.rules.RuleSet object, and writes passing operons back to stdout.

operon_analyzer.analyze.load_analyzed_operons(f: IO[str]) → Iterator[Tuple[str, int, int, str]]¶: Loads and parses the data from the output of operon_analyzer.analyze.analyze(). This is typically used for analyzing or visualizing candidate operons.

operon_analyzer.analyze.group_similar_operons(operons: List[operon_analyzer.genes.Operon], load_sequences: bool = True)¶

Groups operons together if the nucleotide sequences bounded by their outermost features (represented by operon_analyzer.genes.Feature objects) are identical. If load_sequences is True, the nucleotide sequence of each operon will be loaded from disk as it is encountered.

Returns: A representative operon_analyzer.genes.Operon object for each group.
Return type: list

operon_analyzer.analyze.deduplicate_operons_approximate(operons: Iterator[operon_analyzer.genes.Operon]) → List[operon_analyzer.genes.Operon]¶

Deduplicates Operons by examining the names and sequences of their features (represented by operon_analyzer.genes.Feature objects) and the sizes of the gaps between them. This is an approximate algorithm: false positives are possible when the nucleotide sequence varies between the Features (without changing the total number of base pairs) or if there are silent mutations in the Feature CDS. However, it is much faster than the exact method.

Returns: A representative operon_analyzer.genes.Operon object for each group.
Return type: list

operon_analyzer.analyze.dedup_supersets(operons: List[operon_analyzer.genes.Operon]) → List[operon_analyzer.genes.Operon]¶

If the same inputs are searched with gene_finder.pipeline.Pipeline using an expanded database, the new results will be either exactly identical to the previous results, or will be supersets of the old results.

This function takes all operons, and removes ones with identical accession IDs and contig coordinates, where the smaller operon’s features are all contained in the larger one.

Returns: The non-redundant operon_analyzer.genes.Operon objects.
Return type: list

operon_analyzer.analyze.cluster_operons_by_feature_order(operons: Iterator[operon_analyzer.genes.Operon])¶

Organizes all operons into a dictionary based on the order/identity of their features (represented by operon_analyzer.genes.Feature objects). Cases where the overall order is inverted are considered to be the same. The keys of the dictionary are the dash-delimited feature names, with one of the two orientations (if both exist) arbitrarily chosen. If there are ignored features, they will not appear in the key.

Returns: The resulting operon_analyzer.genes.Operon clusters.
Return type: dict

operon_analyzer.visualize¶

operon_analyzer.visualize.plot_operons(operons: List[operon_analyzer.genes.Operon], output_directory: str, plot_ignored: bool = True, color_by_blast_statistic: Optional[str] = None, feature_colors: Optional[dict] = {}, nucl_per_line: Optional[int] = None, show_accession: bool = False, show_description: bool = False)¶

Takes operon_analyzer.genes.Operon objects and saves plots of them to disk.

Parameters

operons (list) – Operons to be plotted.
output_directory (str) – Path to the directory to save operon plots to.
plot_ignored (bool, optional) – Toggles plotting of features that were marked as ignorable by operon_analyzer.rules.FilterSet .
color_by_blast_statistic (str, optional) – Map an alignment quality statistic using the virdis color scale. For a list of alignment statistics captured by Opfi, see Opfi output format .
feature_colors (dict, optional) – If a labeled database was used during candidate identification, features can be colored accordingly using “label”: “feature-color” pairs. For more information about labeling sequence databases, see Annotating sequence databases .
nucl_per_line (int, optional) – Length (in base pairs) to wrap gene diagrams on.
show_accession (bool, optional) – Show the accession number of the best hit for each plotted feature.
show_description (bool, optional) – Show the description of the best hit for each plotted feature.

operon_analyzer.visualize.plot_operon_pairs(operons: List[operon_analyzer.genes.Operon], other_operons: List[operon_analyzer.genes.Operon], output_directory: str, color_by_blast_statistic: Optional[str] = None, plot_ignored: bool = False, feature_colors: Optional[dict] = {})¶

Takes two lists of presumably related Operons, pairs them up such that the pairs overlap the same genomic region, and plots one on top of the other. This allows side-by-side comparison of two different pipeline runs, so that you can, for example, run your regular pipeline, then re-BLAST with a more general protein database like nr, and easily see how the annotations differ.

Parameters

operons (list) – Operons to be plotted.
other_operons (list) – Related operons to be plotted for comparison.
output_directory (str) – Path to the directory to save operon plots to.
plot_ignored (bool, optional) – Toggles plotting of features that were marked as ignorable by operon_analyzer.rules.FilterSet .
color_by_blast_statistic (str, optional) – Map an alignment quality statistic using the virdis color scale. For a list of alignment statistics captured by Opfi, see Opfi output format .
feature_colors (dict, optional) – If a labeled database was used during candidate identification, features can be colored accordingly using “label”: “feature-color” pairs. For more information about labeling sequence databases, see Annotating sequence databases .

operon_analyzer.visualize.make_clustered_stacked_operon_plots(operons: Iterable[operon_analyzer.genes.Operon], other_operons: Iterable[operon_analyzer.genes.Operon], image_directory: str, min_count: int = 10, plot_ignored: bool = False, color_by_blast_statistic: Optional[str] = None, feature_colors: Optional[dict] = None)¶

Clusters operons and plots them on top of a reannotated version of the same operon. This allows the user to BLAST some set of data with a curated database, then re-BLAST it against a more general database, and compare the two directly in a cluster-specific manner.

If a operon_analyzer.rules.FilterSet was used during analysis, that same set should be evaluated on each operon before passing it into this function.

Parameters

operons (iterable) – The operons of interest.
other_operons (iterable) – Reannotated operons.
image_directory (str) – The directory where all subdirectories will be created. Will be created if it does not exist.
min_count (int, optional) – Groups must have at least this many systems in order to be plotted. Default is 10.
plot_ignored (bool, optional) – Toggles plotting of features that were marked as ignorable by operon_analyzer.rules.FilterSet .
color_by_blast_statistic (str, optional) – Map an alignment quality statistic using the virdis color scale. For a list of alignment statistics captured by Opfi, see Opfi output format .
feature_colors (dict, optional) – If a labeled database was used during candidate identification, features can be colored accordingly using “label”: “feature-color” pairs. For more information about labeling sequence databases, see Annotating sequence databases .
nucl_per_line (int, optional) – Length (in base pairs) to wrap gene diagrams on.

operon_analyzer.visualize.make_clustered_operon_plots(analysis_csv: str, operons: Iterable[operon_analyzer.genes.Operon], image_directory: str, min_count: int = 10, diff_against_csv: Optional[str] = None, plot_ignored: bool = False, color_by_blast_statistic: Optional[str] = None, feature_colors: Optional[dict] = None, nucl_per_line: Optional[int] = None)¶

Clusters operons by the order of their features and plots them in separate directories, adding the number of systems to the directory name. Only systems that passed the rules specified as a operon_analyzer.rules.RuleSet object will be eligible to be plotted.

If a operon_analyzer.rules.FilterSet was used during analysis, that same FilterSet should be evaluated on each operon before passing it into this function.

Parameters

analysis_csv (str) – Path to the CSV file created by operon_analyzer.analyze.analyze() .
operons (iterable) – The operons of interest.
image_directory (str) – The directory where all subdirectories will be created. Will be created if it does not exist.
min_count (int, optional) – Groups must have at least this many systems in order to be plotted. Default is 10.
diff_against_csv (str) – Path to a CSV file created by operon analyzer. Any clusters in this file will be skipped when clustering operons from analysis_csv. The point of this is to see only new systems when making slight alterations to rules.
plot_ignored (bool, optional) – Toggles plotting of features that were marked as ignorable by operon_analyzer.rules.FilterSet .
color_by_blast_statistic (str, optional) – Map an alignment quality statistic using the virdis color scale. For a list of alignment statistics captured by Opfi, see Opfi output format .
feature_colors (dict, optional) – If a labeled database was used during candidate identification, features can be colored accordingly using “label”: “feature-color” pairs. For more information about labeling sequence databases, see Annotating sequence databases .

operon_analyzer.overview¶

operon_analyzer.overview.load_counts(lines: IO[str]) → Tuple[Dict[str, int], Dict[str, int], Dict[int, int]]¶: Takes a stream of the output from operon_analyzer.analyze.analyze() and computes some useful statistics. Namely, the number of times each rule was the only one broken for a contig, the number of times each rule was broken regardless of context, and the occurrences of the count of broken rules per contig.

operon_analyzer.reannotation¶

operon_analyzer.reannotation.summarize(operons: List[operon_analyzer.genes.Operon], reannotated_operons: List[operon_analyzer.genes.Operon])¶: For each protein in each operon, prints out how frequently it was reannotated to a particular protein. For example, a putative Cas9 may be identified as a transposase when BLASTed with a more expansive database.

operon_analyzer.load¶

operon_analyzer.load.load_sequence(operon: operon_analyzer.genes.Operon) → Optional[Bio.Seq.Seq]¶: Loads the DNA sequence for a given operon’s contig.

operon_analyzer.load.load_gzipped_operons(filename: str) → Iterator[operon_analyzer.genes.Operon]¶

Create a operon_analyzer.genes.Operon representation for each candidate present in gzipped output from gene_finder.pipeline.Pipeline.

Parameters: filename (str) – Path to the compressed pipeline CSV file.

operon_analyzer.load.load_operons(handle: IO[str]) → Iterator[operon_analyzer.genes.Operon]¶

Create a operon_analyzer.genes.Operon representation for each candidate present in output from gene_finder.pipeline.Pipeline.

Parameters: handle (IO) – Handle for the pipeline CSV file.

operon_analyzer.parse¶

operon_analyzer.parse.assemble_operons(lines: Iterator[Tuple[str, str, str, str, str, str, str, str, str, str]]) → Iterator[operon_analyzer.genes.Operon]¶

Takes the output from gene_finder.pipeline.Pipeline and loads all features, then assembles them into operon_analyzer.genes.Operon objects.

To keep things memory efficient while not allowing redundant operons from being loaded, we keep track of the hash of each operon, which is just an integer. Even for very large metagenomic databases this should use less than a gigabyte of memory.

This is a helper function used by the loaders in operon_analyzer.load, but appears in the auto-generated documentation for reference.

The next set of modules expose simple functions for dealing with CRISPR array sequences. Presumably, these would only be useful to researchers interested in CRISPR-Cas genomic systems.

operon_analyzer.piler_parse¶

Parses piler-cr output to extract the sequences of every spacer.

class operon_analyzer.piler_parse.RepeatSpacer(position, repeat_len, spacer_len, sequence)¶

property position¶: Alias for field number 0

property repeat_len¶: Alias for field number 1

property sequence¶: Alias for field number 3

property spacer_len¶: Alias for field number 2

class operon_analyzer.piler_parse.BrokenSpacer(position, repeat_len, sequence)¶

property position¶: Alias for field number 0

property repeat_len¶: Alias for field number 1

property sequence¶: Alias for field number 2

operon_analyzer.piler_parse.parse_pilercr_output(text: str, start: int, end: int)¶: Takes the text of pilercr raw output, and a start and end coordinate for an operon’s neighborhood, and extracts all spacers in that region.

operon_analyzer.repeat_finder¶

class operon_analyzer.repeat_finder.Repeat(upstream_sequence, upstream_start, downstream_sequence, downstream_start)¶

property downstream_sequence¶: Alias for field number 2

property downstream_start¶: Alias for field number 3

property upstream_sequence¶: Alias for field number 0

property upstream_start¶: Alias for field number 1

class operon_analyzer.repeat_finder.GRFResult(start, end, alignment)¶

property alignment¶: Alias for field number 2

property end¶: Alias for field number 1

property start¶: Alias for field number 0

operon_analyzer.repeat_finder.find_inverted_repeats(operon: operon_analyzer.genes.Operon, buffer_around_operon: int, min_repeat_size: int)¶

Searches an operon and the DNA flanking it for inverted repeats using GenericRepeatFinder. If found, they will be added to the operon as Feature objects with the name “TIR” and the sequence. The strand will be set to 1 for the upstream sequence and -1 for the downstream sequence.

Parameters

operon (Operon) – The operon_analyzer.genes.Operon object.
buffer_around_operon (int) – The number of base pairs on either side of the operon to search in addition to the operon’s internal sequence.
min_repeat_size (int) – The minimum number of base pairs that an inverted repeat must have.

operon_analyzer.spacers¶

class operon_analyzer.spacers.AlignmentResult(match_count, spacer_sequence, contig_sequence, contig_start, contig_end, spacer_alignment, contig_alignment, comp_string, strand, spacer_order, array_length)¶

property array_length¶: Alias for field number 10

property comp_string¶: Alias for field number 7

property contig_alignment¶: Alias for field number 6

property contig_end¶: Alias for field number 4

property contig_sequence¶: Alias for field number 2

property contig_start¶: Alias for field number 3

property match_count¶: Alias for field number 0

property spacer_alignment¶: Alias for field number 5

property spacer_order¶: Alias for field number 9

property spacer_sequence¶: Alias for field number 1

property strand¶: Alias for field number 8

operon_analyzer.spacers.find_self_targeting_spacers(operons: List[operon_analyzer.genes.Operon], min_matching_fraction: float, num_processes: int = 32)¶: For each given Operon, this will determine if CRISPR spacers target a location in the operon’s parent contig. Matches with more than min_matching_fraction homology will be added to the Operon as a Feature named “CRISPR target”.