Gene Finder

gene_finder.pipeline

class gene_finder.pipeline.Pipeline

Coordinates protein (or nucleic acid) searches to find gene clusters of interest in genomic/metagenomic data.

add_seed_step(db, name, e_val, blast_type, sensitivity=None, parse_descriptions=True, blast_path=None, **kwargs)

Find genomic regions that contain at least one “seed” sequence.

Parameters
  • db (str) – Path to the target (seed) protein database.

  • name (str) – A unique name/ID for this step in the pipeline.

  • e_val (float) – Expect value to use. Only keep hits with an equivalent or better (lower) score.

  • blast_type (str) – Specifies which search program to use. This can be either “PROT” (blastp), “PSI” (psiblast), “mmseqs” (mmseqs2), or “diamond” (diamond).

  • sensitivity (str, optional) – Sets the sensitivity parameter for mmseqs and diamond (has no effect if BLAST is the search type).

  • parse_descriptions (bool, optional) – By default, reference protein descriptions (from fasta headers) are parsed for gene name labels; specifically, descriptions are split on whitespace characters and the second item is used for the label. Set this to False to use the whole protein description for the label (i.e., everything after the first whitespace in the header). If using this option with NCBI BLAST, DO NOT use the -parse_seqids flag when creating protein databases with makeblastdb.

  • blast_path (str, optional) – Path to the blastp/mmseqs/diamond program, if not using the system default.

  • **kwargs – These can be any additional BLAST parameters, specified as key-value pairs. Note that certain parameters are not allowed, mainly those that control output formatting. Currently only supported for blastp/psiblast; if blast_type is set to mmseqs or diamond, kwargs will be silently ignored.

Note

This should be the first step added to a gene_finder.pipeline.Pipeline object. Additional gene finding steps can be added in any order.
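The label rule described under parse_descriptions can be sketched in plain Python. This is only an illustration of the documented behavior, not the library's code: the function name label_from_header is hypothetical, the header is assumed to be split as a whole (so the "second item" is the first word after the sequence ID), and the fallbacks for headers with no description are assumptions.

```python
def label_from_header(header: str, parse_descriptions: bool = True) -> str:
    """Derive a gene label from one FASTA header line (illustrative only)."""
    text = header.strip().lstrip(">").strip()
    fields = text.split()
    if not fields:
        # Assumption: an empty header yields an empty label.
        return ""
    if parse_descriptions:
        # Second whitespace-separated item, typically the gene name.
        # Assumption: fall back to the sequence ID if there is no description.
        return fields[1] if len(fields) > 1 else fields[0]
    # Whole description: everything after the first run of whitespace.
    parts = text.split(None, 1)
    return parts[1] if len(parts) > 1 else parts[0]


header = ">UniRef50_Q02ML7 cas1 CRISPR-associated endonuclease Cas1"
short_label = label_from_header(header)
full_label = label_from_header(header, parse_descriptions=False)
```

With a header laid out this way, the default setting gives a short gene-name label, while parse_descriptions=False keeps the full description text.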

add_seed_with_coordinates_step(db, name, e_val, blast_type, sensitivity=None, parse_descriptions=True, start=None, end=None, contig_id=None, blast_path=None, **kwargs)

Define a genomic region of interest with coordinates instead of a seed sequence.

An alternative to gene_finder.pipeline.Pipeline.add_seed_step(). Most useful for re-annotating putative systems of interest, where the region coordinates are already known.

Parameters
  • db (str) – Path to the target database to search against.

  • name (str) – A unique name/ID for this step in the pipeline.

  • e_val (float) – Expect value to use. Only keep hits with an equivalent or better (lower) score.

  • blast_type (str) – Specifies which search program to use. This can be either “PROT” (blastp), “PSI” (psiblast), “mmseqs” (mmseqs2), or “diamond” (diamond).

  • sensitivity (str, optional) – Sets the sensitivity parameter for mmseqs and diamond (has no effect if BLAST is the search type).

  • parse_descriptions (bool, optional) – By default, reference protein descriptions (from fasta headers) are parsed for gene name labels; specifically, descriptions are split on whitespace characters and the second item is used for the label. Set this to False to use the whole protein description for the label (i.e., everything after the first whitespace in the header). If using this option with NCBI BLAST, DO NOT use the -parse_seqids flag when creating protein databases with makeblastdb.

  • start (int, optional) – Defines the beginning of the region to search, in base pairs (bp). If no start position is given, the first (zero-indexed) position in the genome/contig is used.

  • end (int, optional) – Defines the end of the region to search, in base pairs (bp). If no end position is given, the last position in the contig is used.

  • contig_id (str, optional) – An identifier for the contig to search. If no ID is given, the pipeline will search every contig in the input file using the coordinates specified. Note that the contig ID is defined as the substring between the “>” character and the first space character in the contig header.

  • blast_path (str, optional) – Path to the blastp/mmseqs/diamond program, if not using the system default.

  • **kwargs – These can be any additional BLAST parameters, specified as key-value pairs. Note that certain parameters are not allowed, mainly those that control output formatting. Currently only supported for blastp/psiblast; if blast_type is set to mmseqs or diamond, kwargs will be silently ignored.
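The contig ID rule and the coordinate defaults described above can be sketched as follows. The helper names contig_id_from_header and resolve_region are hypothetical, and treating the region as a zero-based, half-open interval (so a missing end becomes the contig length) is an assumption; the library may represent the last position differently.

```python
from typing import Optional, Tuple


def contig_id_from_header(header: str) -> str:
    """Return the substring between '>' and the first space in a contig header."""
    return header.strip().lstrip(">").split(" ", 1)[0]


def resolve_region(contig_length: int,
                   start: Optional[int] = None,
                   end: Optional[int] = None) -> Tuple[int, int]:
    """Fill in missing coordinates with the contig boundaries (assumed half-open)."""
    region_start = 0 if start is None else start
    region_end = contig_length if end is None else end
    if not 0 <= region_start <= region_end <= contig_length:
        raise ValueError("region must lie within the contig")
    return region_start, region_end


contig = contig_id_from_header(">NZ_CP011286.1 Vibrio cholerae strain, complete genome")
whole_contig = resolve_region(1000)
sub_region = resolve_region(1000, start=100, end=500)
```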

add_filter_step(db, name, e_val, blast_type, min_prot_count=1, sensitivity=None, parse_descriptions=True, blast_path=None, **kwargs)

Add a step to search candidate regions for target sequences, and filter out candidates that do not have at least min_prot_count matching sequences.

Parameters
  • db (str) – Path to the target protein sequence database.

  • name (str) – A unique name/ID for this step in the pipeline.

  • e_val (float) – Expect value to use. Only keep hits with an equivalent or better (lower) score.

  • blast_type (str) – Specifies which search program to use. This can be either “PROT” (blastp), “PSI” (psiblast), “mmseqs” (mmseqs2), or “diamond” (diamond).

  • min_prot_count (int, optional) – Minimum number of hits needed to keep each candidate.

  • sensitivity (str, optional) – Sets the sensitivity parameter for mmseqs and diamond (has no effect if BLAST is the search type).

  • parse_descriptions (bool, optional) – By default, reference protein descriptions (from fasta headers) are parsed for gene name labels; specifically, descriptions are split on whitespace characters and the second item is used for the label. Set this to False to use the whole protein description for the label (i.e., everything after the first whitespace in the header). If using this option with NCBI BLAST, DO NOT use the -parse_seqids flag when creating protein databases with makeblastdb.

  • blast_path (str, optional) – Path to the blastp/mmseqs/diamond program, if not using the system default.

  • **kwargs – These can be any additional BLAST parameters, specified as key-value pairs. Note that certain parameters are not allowed, mainly those that control output formatting. Currently only supported for blastp/psiblast; if blast_type is set to mmseqs or diamond, kwargs will be silently ignored.
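The effect of min_prot_count can be illustrated with a small stand-alone sketch. The data layout (a dict mapping a candidate ID to the list of hits found in that candidate) and the function name filter_candidates are assumptions made for illustration; counting every hit toward the minimum is also an assumption about how matches are tallied.

```python
from typing import Dict, List


def filter_candidates(candidates: Dict[str, List[str]],
                      min_prot_count: int = 1) -> Dict[str, List[str]]:
    """Keep only candidates with at least min_prot_count matching hits."""
    return {
        candidate_id: hits
        for candidate_id, hits in candidates.items()
        if len(hits) >= min_prot_count
    }


# Hypothetical candidates and the hits a filter step found in each one.
candidates = {
    "contig1:15000-36200": ["tnsA", "tnsB", "tnsC"],
    "contig1:70000-91000": ["tnsB"],
    "contig2:0-8000": [],
}
kept = filter_candidates(candidates, min_prot_count=2)
```

With min_prot_count=2, only the first candidate survives; with the default of 1, the second would be kept as well.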

add_blast_step(db, name, e_val, blast_type, sensitivity=None, parse_descriptions=True, blast_path=None, **kwargs)

Add a non-filtering search step to the pipeline. That is, search each candidate for target sequences without applying any filtering logic. This is most useful for annotating candidates for non-essential or ancillary genes.

Parameters
  • db (str) – Path to the target protein sequence database.

  • name (str) – A unique name/ID for this step in the pipeline.

  • e_val (float) – Expect value to use. Only keep hits with an equivalent or better (lower) score.

  • blast_type (str) – Specifies which search program to use. This can be either “PROT” (blastp), “PSI” (psiblast), “mmseqs” (mmseqs2), or “diamond” (diamond).

  • sensitivity (str, optional) – Sets the sensitivity parameter for mmseqs and diamond (has no effect if BLAST is the search type).

  • parse_descriptions (bool, optional) – By default, reference protein descriptions (from fasta headers) are parsed for gene name labels; specifically, descriptions are split on whitespace characters and the second item is used for the label. Set this to False to use the whole protein description for the label (i.e., everything after the first whitespace in the header). If using this option with NCBI BLAST, DO NOT use the -parse_seqids flag when creating protein databases with makeblastdb.

  • blast_path (str, optional) – Path to the blastp/mmseqs/diamond program, if not using the system default.

  • **kwargs – These can be any additional BLAST parameters, specified as key-value pairs. Note that certain parameters are not allowed, mainly those that control output formatting. Currently only supported for blastp/psiblast; if blast_type is set to mmseqs or diamond, kwargs will be silently ignored.

add_crispr_step()

Add a step to search for CRISPR arrays using PILER-CR.

add_blastn_step(db, name, e_val, parse_descriptions=False, blastn_path='blastn', **kwargs)

Add a step to do nucleotide BLAST.

Parameters
  • db (str) – Path to the target nucleotide sequence database.

  • name (str) – A unique name/ID for this step in the pipeline.

  • e_val (float) – Expect value to use. Only keep hits with an equivalent or better (lower) score.

  • parse_descriptions (bool, optional) – If True, reference sequence descriptions (from fasta headers) are parsed for gene name labels; specifically, descriptions are split on whitespace characters and the second item is used for the label. If False (the default for this step, per the signature above), the whole description is used for the label (i.e., everything after the first whitespace in the header). If using this option with NCBI BLAST, DO NOT use the -parse_seqids flag when creating databases with makeblastdb.

  • blastn_path (str, optional) – Path to the blastn program, if not using the system default.

  • **kwargs – These can be any additional BLAST parameters, specified as key-value pairs. Note that certain parameters are not allowed, mainly those that control output formatting.

run(data, job_id=None, output_directory=None, min_prot_len=60, span=10000, record_all_hits=False, incremental_output=False, starting_contig=None, gzip=False) → dict

Execute each step in the pipeline, in the order they were added.

Parameters
  • data (str) – Path to the input data file. Can be a single- or multi-sequence file in fasta format.

  • job_id (str, optional) – A unique ID to prefix all output files. If no ID is given, the string “gene_finder” will be used as the prefix. In any case, results from the pipeline are written to the file <prefix>_results.csv.

  • output_directory (str, optional) – The directory to write output data files to. If no directory is given then the current (working) directory is used.

  • min_prot_len (int, optional) – Minimum ORF length (aa). Default is 60.

  • span (int, optional) – Length (nt) upstream and downstream of each seed hit to keep. Defines the approximate size of the genomic neighborhoods that will be used as the search space after the seed step. Default is 10000.

  • record_all_hits (bool, optional) – Write data about all genes found (even discarded ones) to the file <job_id>_hits.json, grouped by contig. Note that this contains much of the same information as is in the results CSV file; nevertheless, it may be useful for analysis or troubleshooting a search.

  • incremental_output (bool, optional) – Write results to disk after each contig is processed. Using this option also creates a checkpoint file that gives the ID of the contig that is currently being processed; if the job finishes successfully, this file will be automatically cleaned up. This feature is especially useful for long-running jobs.

  • starting_contig (str, optional) – The sequence identifier of the contig where the run should begin. In other words, skip over records in the input file until the specified contig is reached, and then run the pipeline as normal. This is usually used in conjunction with incremental_output.

  • gzip (bool, optional) – Set to True if the input data file is compressed with gzip. Default is False.
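To make the span parameter concrete, the sketch below computes the neighborhood kept around a single seed hit. The function name neighborhood is hypothetical, and clamping the window to the contig boundaries is an assumption; the description above only says that span sets the approximate size of the region on each side of the hit.

```python
from typing import Tuple


def neighborhood(hit_start: int, hit_end: int, contig_length: int,
                 span: int = 10000) -> Tuple[int, int]:
    """Return the (start, end) window extending span nt on each side of a hit."""
    # Hits on the reverse strand may be reported with start > end.
    low, high = min(hit_start, hit_end), max(hit_start, hit_end)
    # Assumption: the window is clipped so it never runs off the contig.
    return max(0, low - span), min(contig_length, high + span)


mid_contig = neighborhood(25000, 26200, contig_length=100000)
near_edges = neighborhood(3000, 4000, contig_length=8000)
```

Under these assumptions, a 1.2 kb seed hit in the middle of a long contig yields a window of roughly 21 kb, while a hit on a short contig simply yields the whole contig.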

Returns

Candidate systems, grouped by contig id and genomic location.

Return type

dict
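A possible end-to-end workflow, using only the method signatures documented above, might look like the following sketch. The database paths, step names, E-value threshold, and job ID are hypothetical placeholders, and the function build_and_run is not part of the library. It accepts any object that exposes these methods, so it can be read (and exercised) without the library installed.

```python
def build_and_run(pipeline, fasta_path):
    """Add steps to a Pipeline-like object, then execute them in order."""
    # The seed step is added first, as the note under add_seed_step requires.
    pipeline.add_seed_step(db="db/seed_proteins", name="seed",
                           e_val=1e-6, blast_type="PROT")
    # Discard candidates with fewer than two hits to the required proteins.
    pipeline.add_filter_step(db="db/required_proteins", name="required",
                             e_val=1e-6, blast_type="PROT", min_prot_count=2)
    # Annotate ancillary genes without filtering.
    pipeline.add_blast_step(db="db/accessory_proteins", name="accessory",
                            e_val=1e-6, blast_type="PROT")
    # Search for CRISPR arrays in each candidate.
    pipeline.add_crispr_step()
    # Results are also written to <job_id>_results.csv.
    return pipeline.run(data=fasta_path, job_id="example", span=10000)
```

With the library installed, this could presumably be called as build_and_run(Pipeline(), "genome.fasta") after importing Pipeline from gene_finder.pipeline; the no-argument constructor is an assumption, since the constructor signature is not shown in this section.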