Inputs and Outputs

Building sequence databases

To search for gene clusters with Opfi, users must compile representative protein (or nucleic acid) sequences for any genes expected in target clusters (or for any non-essential accessory genes of interest). These may come from a pre-existing, private collection of sequences (perhaps from a previous bioinformatics analysis). Alternatively, users may download sequences from a publicly available database such as UniProt (maintained by the European Bioinformatics Institute) or one of the databases provided by the National Center for Biotechnology Information.

Once target sequences have been compiled, they must be converted to an application-specific database format. Opfi currently supports BLAST+, mmseqs2, and diamond for homology searching.

The FASTA file format

Both genomic input data and reference sequence data should be in FASTA format. This is a simple flat text representation of biological sequence data, in which each sequence record begins with the greater-than character (>). For example:

>UniRef50_Q02ML7 CRISPR-associated endonuclease Cas1 n=1700 RepID=CAS1_PSEAB
MDDISPSELKTILHSKRANLYYLQHCRVLVNGGRVEYVTDEGRHSHYWNIPIANTTSLLL
GTGTSITQAAMRELARAGVLVGFCGGGGTPLFSANEVDVEVSWLTPQSEYRPTEYLQRWV
GFWFDEEKRLVAARHFQRARLERIRHSWLEDRVLRDAGFAVDATALAVAVEDSARALEQA
PNHEHLLTEEARLSKRLFKLAAQATRYGEFVRAKRGSGGDPANRFLDHGNYLAYGLAATA
TWVLGIPHGLAVLHGKTRRGGLVFDVADLIKDSLILPQAFLSAMRGDEEQDFRQACLDNL
SRAQALDFMIDTLKDVAQRSTVSA
>UniRef50_Q2RY21 CRISPR-associated endonuclease Cas1 1 n=1034 RepID=CAS1A_RHORT
MADPAFVPLRPIAIKDRSSIVFLQRGQLDVVDGAFVLIDQEGVRVQIPVGGLACLMLEPG
TRITHAAIVLCARVGCLVIWVGERGTRLYAAGQPGGARADRLLFQARNALDETARLNVVR
EMYRRRFDDDPPARRSVDQLRGMEGVRVREIYRLLAKKYAVDWNARRYDHNDWDGADIPN
RCLSAATACLYGLCEAAILAAGYAPAIGFLHRGKPQSFVYDVADLYKVETVVPTAFSIAA
KIAAGKGDDSPPERQVRIACRDQFRKSGLLEKIIPDIEEILRAGGLEPPLDAPEAVDPVI
PPEEPSGDDGHRG

The sequence definition (defline) comes directly after the > character, and should be on a separate line from the sequence (which can be on one or more subsequent lines). There is no specific defline format. However, Opfi requires that, for both genomic input and reference sequence data, each definition line contain a unique sequence identifier. This should be a single word/token immediately following the > character (i.e. spaces between the > character and the identifier are not allowed). Any additional text on the defline is parsed as a single string, and appears in the output CSV (see Opfi output format).
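The defline requirements above can be checked before running a search. The following is a minimal sketch, not part of Opfi; the function name check_deflines is hypothetical, and it only tests the two rules stated above (an identifier directly after >, and no repeated identifiers).

```python
def check_deflines(fasta_text):
    """Return a list of problems found in the deflines of a FASTA string."""
    problems = []
    seen = set()
    for line_no, line in enumerate(fasta_text.splitlines(), start=1):
        if not line.startswith(">"):
            continue  # sequence line, nothing to check
        rest = line[1:]
        # The identifier must follow '>' immediately, with no whitespace.
        if not rest or rest[0].isspace():
            problems.append(f"line {line_no}: no identifier directly after '>'")
            continue
        seq_id = rest.split()[0]
        # Identifiers must be unique within the file.
        if seq_id in seen:
            problems.append(f"line {line_no}: duplicate identifier {seq_id!r}")
        seen.add(seq_id)
    return problems


# Made-up records, used only to trigger both kinds of problem.
example = ">seq1 first protein\nMDDIS\n> seq2 bad spacing\nMADPA\n>seq1 repeated\nMKT\n"
for problem in check_deflines(example):
    print(problem)
```

An empty list means every defline satisfies both rules.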

Tip

Biological sequences downloaded from most public databases will have an accession number/identifier by default.

Annotating sequence databases

To take full advantage of the rule-based filtering methods in operon_analyzer.rules, users are encouraged to annotate reference sequences with a name/label that is easily searched. Labels can be as broad or as specific as is necessary to provide meaningful annotation of target gene clusters.

Gene labels are parsed from sequence deflines; specifically, Opfi looks for the second word/token following the > character. For example, the following FASTA sequence has been annotated with the label “cas1”:

>UniRef50_Q02ML7 cas1 CRISPR-associated endonuclease Cas1 n=1700 RepID=CAS1_PSEAB
MDDISPSELKTILHSKRANLYYLQHCRVLVNGGRVEYVTDEGRHSHYWNIPIANTTSLLL
GTGTSITQAAMRELARAGVLVGFCGGGGTPLFSANEVDVEVSWLTPQSEYRPTEYLQRWV
GFWFDEEKRLVAARHFQRARLERIRHSWLEDRVLRDAGFAVDATALAVAVEDSARALEQA
PNHEHLLTEEARLSKRLFKLAAQATRYGEFVRAKRGSGGDPANRFLDHGNYLAYGLAATA
TWVLGIPHGLAVLHGKTRRGGLVFDVADLIKDSLILPQAFLSAMRGDEEQDFRQACLDNL
SRAQALDFMIDTLKDVAQRSTVSA

After running gene_finder.pipeline.Pipeline, users could select candidates with hits against this sequence using the following rule set:

from operon_analyzer.rules import RuleSet

rs = RuleSet().require("cas1")

In practice, a genomics search might use a reference database of hundreds (or even thousands) of representative protein sequences, in which case labeling each sequence individually would be tedious. It is recommended to organize sequences into groups of related proteins that can be given a single label. This script uses the Python package Biopython to annotate sequences in a multi-sequence FASTA file:

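from Bio import SeqIO
import sys

def annotate_reference(ref_fasta, label):
    # Load every record first so the input file can be overwritten below.
    records = list(SeqIO.parse(ref_fasta, "fasta"))

    for record in records:
        # The first token of the description is the sequence ID.
        des = record.description.split()
        prot_id = des.pop(0)
        # Insert the label as the second token, directly after the ID.
        des_with_label = "{} {} {}".format(prot_id, label, " ".join(des))
        record.description = des_with_label

    # Write the annotated records back to the same file.
    SeqIO.write(records, ref_fasta, "fasta")

if __name__ == "__main__":
    ref_fasta = sys.argv[1]
    label = sys.argv[2]
    annotate_reference(ref_fasta, label)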

It is also possible to use the entire sequence description (i.e. all text following the sequence identifier) as the gene label. This is particularly useful with a pre-built database like nr, which contains representative protein sequences for many different protein families. For sequence databases that haven’t been annotated, users should set parse_descriptions=False in each gene_finder.pipeline.Pipeline add_step() method call.
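To make the two labeling modes concrete, here is a small sketch of how a label could be derived from a defline under each setting. It illustrates the behavior described in this section and is not Opfi's own parsing code; the function label_from_defline and its empty-string fallback for deflines with no description are assumptions.

```python
def label_from_defline(defline, parse_descriptions=True):
    """Derive a gene label from a FASTA defline.

    parse_descriptions=True  -> second token (the annotated label)
    parse_descriptions=False -> all text after the sequence identifier
    """
    tokens = defline.lstrip(">").split()
    description = tokens[1:]  # everything after the sequence identifier
    if not description:
        return ""  # assumed fallback: nothing to use as a label
    if parse_descriptions:
        return description[0]
    return " ".join(description)


defline = ">UniRef50_Q02ML7 cas1 CRISPR-associated endonuclease Cas1 n=1700 RepID=CAS1_PSEAB"
print(label_from_defline(defline))
print(label_from_defline(defline, parse_descriptions=False))
```

With the annotated defline from the example above, the first call yields the short label and the second yields the full description text.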

Converting sequence files to a sequence database

Once reference sequences have been compiled (and, optionally, labeled), they must be converted to a sequence database format that is specific to the homology search program used. Currently, Opfi supports BLAST+, mmseqs2, and diamond. Each software package is automatically installed with a companion utility program for generating sequence databases. The following example shows what a typical call to makeblastdb, the BLAST+ database utility program, might look like:

makeblastdb -in "my_sequences.fasta" -out my_sequences/db -dbtype prot -title "my_sequences" -hash_index

The command takes a text/FASTA file my_sequences.fasta as input, and writes the resulting database files to the directory my_sequences. Database files are prefixed with “db”. -dbtype prot specifies that the input is amino acid sequences. We use -title to name the database (required by BLAST). -hash_index directs makeblastdb to generate a hash index of protein sequences, which can speed up computation time.
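When many reference files need to be converted, the same command can be assembled in Python. This is a sketch only: the helper makeblastdb_command is hypothetical, it uses just the flags shown above, and actually running the command assumes BLAST+ is installed and on the PATH.

```python
import shlex


def makeblastdb_command(fasta_path, out_prefix, title, dbtype="prot"):
    """Build the makeblastdb argument list for one FASTA file."""
    return [
        "makeblastdb",
        "-in", fasta_path,
        "-out", out_prefix,    # directory plus file prefix, e.g. my_sequences/db
        "-dbtype", dbtype,     # "prot" for amino acid input
        "-title", title,
        "-hash_index",
    ]


cmd = makeblastdb_command("my_sequences.fasta", "my_sequences/db", "my_sequences")
print(shlex.join(cmd))

# To build the database, pass the list to subprocess (requires BLAST+):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list, rather than a single shell string, avoids quoting problems with file names that contain spaces.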

Tip

mmseqs2 and diamond have similar database creation commands; see Building sequence databases.

BLAST advanced options

BLAST+ programs have a number of tunable parameters that can, for example, be used to adjust the sensitivity of the search algorithm. We anticipate that application defaults will be sufficient for most users; nevertheless, it is possible to use non-default program options by passing them as keyword arguments to gene_finder.pipeline.Pipeline add_step() methods.

For example, when using blastp on the command line, we could adjust the number of CPUs to four by passing the argument -num_threads 4 to the program. When using Opfi, this would look like num_threads=4.

Flags (boolean arguments that generally do not precede additional data) are also possible. For example, the command line flag -use_sw_tback tells blastp to compute locally optimal Smith-Waterman alignments. The correct way to specify this behavior via the gene_finder.pipeline.Pipeline API would be to use the argument use_sw_tback=True.
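The correspondence between keyword arguments and command line options can be sketched as follows. This is an illustration of the mapping described above, not Opfi's internal code; the function kwargs_to_cli is hypothetical.

```python
def kwargs_to_cli(**options):
    """Translate keyword arguments into BLAST+-style command line arguments."""
    args = []
    for name, value in options.items():
        if value is True:
            # Flags take no value: use_sw_tback=True -> -use_sw_tback
            args.append(f"-{name}")
        elif value is False or value is None:
            # A disabled flag is simply left off the command line.
            continue
        else:
            # Ordinary options: num_threads=4 -> -num_threads 4
            args.extend([f"-{name}", str(value)])
    return args


print(kwargs_to_cli(num_threads=4, use_sw_tback=True))
# ['-num_threads', '4', '-use_sw_tback']
```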

Below is a list of options accepted by Opfi. Note that some BLAST+ options are not allowed, mainly those that modify BLAST output.

Program | Allowed Options
blastp and psiblast | dbsize word_size gapopen gapextend qcov_hsp_perc xdrop_ungap xdrop_gap xdrop_gap_final searchsp sum_stats seg soft_masking matrix threshold culling_limit window_size num_threads comp_based_stats gilist seqidlist negative_gilist db_soft_mask db_hard_mask entrez_query max_hsps best_hit_overhang best_hit_score_edge max_target_seqs import_search_strategy export_search_strategy num_alignments
blastp only | task
psiblast only | gap_trigger num_iterations out_pssm out_ascii_pssm pseudocount inclusion_ethresh
blastp (flags) | lcase_masking ungapped use_sw_tback remote
psiblast (flags) | lcase_masking use_sw_tback save_pssm_after_last_round save_each_pssm remote
blastn | filtering_algorithm sum_stats window_masker_db window_size template_type version parse_deflines min_raw_gapped_score string format max_hsps taxids negative_taxids num_alignments strand off_diagonal_range subject_besthit num_sequences no_greedy negative_taxidlist culling_limit xdrop_ungap open_penalty DUST_options sorthits xdrop_gap_final negative_gilist subject use_index bool_value filename seqidlist task_name sort_hits database_name lcase_masking query_loc subject_loc sort_hsps line_length boolean db_hard_mask negative_seqidlist template_length filtering_db filtering_database penalty searchsp ungapped type gapextend db_soft_mask dbsize qcov_hsp_perc sorthsps window_masker_taxid index_name export_search_strategy float_value soft_masking gilist entrez_query show_gis best_hit_score_edge gapopen subject_input_file range html word_size best_hit_overhang perc_identity input_file num_descriptions xdrop_gap dust taxidlist max_target_seqs num_threads task remote int_value extend_penalty reward import_search_strategy num_letters

You can read more about BLAST+ options in the BLAST+ appendices.
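Because only some BLAST+ options are accepted, it can be useful to check a set of keyword arguments before building a pipeline. The sketch below is hypothetical and not part of Opfi: the function unsupported_options is an illustration, and the sets contain only a subset of the blastp entries from the table above.

```python
# Subset of the blastp entries listed in the table above (not exhaustive).
ALLOWED_BLASTP_OPTIONS = {
    "task", "word_size", "gapopen", "gapextend", "matrix",
    "threshold", "num_threads", "max_target_seqs",
}
ALLOWED_BLASTP_FLAGS = {"lcase_masking", "ungapped", "use_sw_tback", "remote"}


def unsupported_options(options):
    """Return the option names that are not in the allowed sets, sorted."""
    allowed = ALLOWED_BLASTP_OPTIONS | ALLOWED_BLASTP_FLAGS
    return sorted(set(options) - allowed)


requested = {"num_threads": 4, "use_sw_tback": True, "outfmt": 6}
print(unsupported_options(requested))
# ['outfmt']
```

Here outfmt is reported because it changes the BLAST output format, which is the kind of option the note above says is not allowed.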

Note

Using advanced options with mmseqs2 and diamond is not supported at this time.

Opfi output format

Results from gene_finder.pipeline.Pipeline searches are written to a single CSV file. Below is an example from the tutorial (see Example Usage):

NC_013161.1,503817..525707,cas1,514110..513817,lcl|514110|513817|2|-1,-1,UniRef50_A0A179D3U4,1.24e-07,UniRef50_A0A179D3U4 cas1 CRISPR-associated endoribonuclease Cas2 n=2 Tax=Thermosulfurimonas dismutans TaxID=999894 RepID=A0A179D3U4_9BACT,MNTLFYLIIYDLPATKAGNKRRKRLYEMLCGYGNWTQFSVFECFLTAVQFANLQSKLENLIQPNEDSVRIYILDAGSVRKTLTYGSEKPRQVDTLIL,42.4,98,51,43.137,22,29,31,0,0,60.78,53,data/GCF_000024045.1_ASM2404v1_genomic.fna.gz
NC_013161.1,503817..525707,cas1,515084..514107,lcl|515084|514107|3|-1,-1,UniRef50_A0A1Z3HN48,4.00e-177,UniRef50_A0A1Z3HN48 cas1 CRISPR-associated endonuclease Cas1 n=83 Tax=Cyanobacteria TaxID=1117 RepID=A0A1Z3HN48_9CYAN,MSILYLTQPDAVLSKKQEAFHVALKQEDGSWKKQLIPAQTVEQIVLIGYPSITGEALCYALELGIPVHYLSCFGKYLGSALPGYSRNGQLRLAQYHVHDNEEQRLALVKTVVTGKIHNQYHVLYRYQQKDNPLKEHKQLVKSKTTLEQVRGVEGLAAKDYFNGFKLILDSQWNFNGRNRRPPTDPVNALLSFAYGLLRVQVTAAVHIAGLDPYIGYLHETTRGQPAMVLDLMEEFRPLIADSLVLSVISHKEIKPTDFNESLGAYLLSDSGRKTFLQAFERKLNTEFKHPVFGYQCSYRRSIELQARLFSRYLQENIPYKSLSLR,489,1260,325,69.538,226,99,276,0,0,84.92,100,data/GCF_000024045.1_ASM2404v1_genomic.fna.gz
NC_013161.1,503817..525707,cas1,515707..515117,lcl|515707|515117|1|-1,-1,UniRef50_A0A2I8A541,1.64e-100,UniRef50_A0A2I8A541 cas1 CRISPR-associated exonuclease Cas4 n=83 Tax=Cyanobacteria TaxID=1117 RepID=A0A2I8A541_9NOSO,MIDNYLPLAYLNAFEYCTRRFYWEYVLGEMANNEHIIIGRHLHRNINQEGIIKEEDTIIHRQQWVWSDRLQIKGIIDAVEEKESSLVPVEYKKGRMSQHLNDHFQLCAAALCLEEKTGKIITYGEIFYHANRRRQRVDFSDRLRCSTEQAIHHAHELVNQKMPSPINNSKKCRDCSLKTMCLPKEVKQLRNSLISD,285,729,195,66.154,129,66,162,0,0,83.08,99,data/GCF_000024045.1_ASM2404v1_genomic.fna.gz
NC_013161.1,503817..525707,cas2,514110..513817,lcl|514110|513817|2|-1,-1,UniRef50_A0A1Z3HN55,7.36e-46,UniRef50_A0A1Z3HN55 cas2 CRISPR-associated endoribonuclease Cas2 n=68 Tax=Cyanobacteria TaxID=1117 RepID=A0A1Z3HN55_9CYAN,MNTLFYLIIYDLPATKAGNKRRKRLYEMLCGYGNWTQFSVFECFLTAVQFANLQSKLENLIQPNEDSVRIYILDAGSVRKTLTYGSEKPRQVDTLIL,142,357,94,67.021,63,31,77,0,0,81.91,97,data/GCF_000024045.1_ASM2404v1_genomic.fna.gz
NC_013161.1,503817..525707,cas4,515084..514107,lcl|515084|514107|3|-1,-1,UniRef50_A0A1E5G3J0,1.01e-72,UniRef50_A0A1E5G3J0 cas4 CRISPR-associated endonuclease Cas1 n=4 Tax=Firmicutes TaxID=1239 RepID=A0A1E5G3J0_9BACL,MSILYLTQPDAVLSKKQEAFHVALKQEDGSWKKQLIPAQTVEQIVLIGYPSITGEALCYALELGIPVHYLSCFGKYLGSALPGYSRNGQLRLAQYHVHDNEEQRLALVKTVVTGKIHNQYHVLYRYQQKDNPLKEHKQLVKSKTTLEQVRGVEGLAAKDYFNGFKLILDSQWNFNGRNRRPPTDPVNALLSFAYGLLRVQVTAAVHIAGLDPYIGYLHETTRGQPAMVLDLMEEFRPLIADSLVLSVISHKEIKPTDFNESLGAYLLSDSGRKTFLQAFERKLNTEFKHPVFGYQCSYRRSIELQARLFSRYLQENIPYKSLSLR,233,595,333,39.940,133,179,191,6,21,57.36,98,data/GCF_000024045.1_ASM2404v1_genomic.fna.gz
NC_013161.1,503817..525707,cas4,515707..515117,lcl|515707|515117|1|-1,-1,UniRef50_A0A2I8A541,1.92e-99,UniRef50_A0A2I8A541 cas4 CRISPR-associated exonuclease Cas4 n=83 Tax=Cyanobacteria TaxID=1117 RepID=A0A2I8A541_9NOSO,MIDNYLPLAYLNAFEYCTRRFYWEYVLGEMANNEHIIIGRHLHRNINQEGIIKEEDTIIHRQQWVWSDRLQIKGIIDAVEEKESSLVPVEYKKGRMSQHLNDHFQLCAAALCLEEKTGKIITYGEIFYHANRRRQRVDFSDRLRCSTEQAIHHAHELVNQKMPSPINNSKKCRDCSLKTMCLPKEVKQLRNSLISD,285,729,195,66.154,129,66,162,0,0,83.08,99,data/GCF_000024045.1_ASM2404v1_genomic.fna.gz
NC_013161.1,503817..525707,cas6,516642..515833,lcl|516642|515833|2|-1,-1,UniRef50_A0A654SHL3,2.64e-108,UniRef50_A0A654SHL3 cas6 CRISPR_Cas6 domain-containing protein n=30 Tax=Cyanobacteria TaxID=1117 RepID=A0A654SHL3_9CYAN,MVQDILPQLHKYQLQSLVIELGVAKQGKLPATLSRAIHACVLNWLSLADSQLANQIHDSQISPLCLSGLIGNRRQPYSLLGDYFLLRIGVLQPSLIKPLLKGIEAQETQTLELGKFPFIIRQVYSMPQSHKLSQLTDYYSLALYSPTMTEIQLKFLSPTSFKQIQGVQPFPLPELVFNSLLRKWNHFAPQELKFPEIQWQSFVSAFELKTHALKMEGGAQIGSQGWAKYCFKDTEQARIASILSHFAFYAGVGRKTTMGMGQTQLLVNT,314,804,270,55.926,151,118,195,1,1,72.22,100,data/GCF_000024045.1_ASM2404v1_genomic.fna.gz
NC_013161.1,503817..525707,cas5,517387..516611,lcl|517387|516611|1|-1,-1,UniRef50_A0A2I8AFZ3,1.43e-118,UniRef50_A0A2I8AFZ3 cas5 Type I-D CRISPR-associated protein Cas5/Csc1 n=62 Tax=Cyanobacteria TaxID=1117 RepID=A0A2I8AFZ3_9NOSO,MNIYYCQLTLHDNIFFATREMGLLYETEKYLHNWALSYAFFKGTYIPHPYRLQGKSAQKPDYLDSTGEQSLAHLNRLKIYVFPAKPLRWSYQINTFKAAQTTYYGKSQQFGDKGANRNYPINYGRAKELAVGSEYHTFLISSQELNIPHWIRVGKWSAKVEVTSYLIPQKAISQHSGIYLCDHPLNPIDLPFDQELLLYNRIVMPPVSLVSQAQLQGNYCKINKNNWNDCPSNLTDLPQQICLPLGVNYGAGYIASAS,338,866,252,65.079,164,71,194,3,17,76.98,98,data/GCF_000024045.1_ASM2404v1_genomic.fna.gz
NC_013161.1,503817..525707,cas7,518600..517530,lcl|518600|517530|3|-1,-1,UniRef50_B7JVM8,0.0,UniRef50_B7JVM8 cas7 CRISPR-associated protein Csc2 n=52 Tax=Cyanobacteria TaxID=1117 RepID=B7JVM8_RIPO1,MSILETLKPQFQSAFPRLASANYVHFIMLRHSQSFPVFQTDGVLNTVRTQAGLMAKDSLSRLVMFKRKQTTPERLTGRELLRSLNITTADKNDKEKGCEYNGEGSCKKCPDCIIYGFAIGDSGSERSKVYSDSTFSLSAYEQSHRTFTFNAPFEGGTMSEQGVMRSAINELDHILPEITFPNIETLRDSTYEGFIYVLGNILRTKRYGAQESRTGTMKNHLVGIAFCDGEIFSNLRFTQALYDGLEGDVNKPIDEICYQASQIVQTLLSDEPVRKIKTIFGEELNHLINEVSGIYQNDALLTETLNMLYQQTKTYSENHGSLAKSKPPKAEGNKSKGRTKKKGDDEQTSLDLNIEE,733,1891,356,98.876,352,4,354,0,0,99.44,100,data/GCF_000024045.1_ASM2404v1_genomic.fna.gz
NC_013161.1,503817..525707,cas10,521597..518673,lcl|521597|518673|3|-1,-1,UniRef50_B7KB38,0.0,UniRef50_B7KB38 cas10 CRISPR-associated protein Csc3 n=52 Tax=Cyanobacteria TaxID=1117 RepID=B7KB38_GLOC7,MTLLQILLLETISQDTDPILISYLETVLPAMEPEFALIPALGGSQQIHYQNLIAIGNRYAQENAKRFSDKADQNLLVHVLNALLTAWNLVDHLTKPLSDIEKYLLCLGLTLHDYNKYCLGHGEESPKVSNINEIINICQELGKKLNFQAFWSDWEQYLPEIVYLAQNTQFKAGTNAIPANYPLFTLADSRRLDLPLRRLLAFGDIAVHLQDPADIISKTGGDRLREHLRFLGIKKALVYHRLRDTLGILSNGIHNATLRFAKDLNWQPLLFFAQGVIYLAPIDYTSPEKMELQGFIWQEISQLLASSMLKGEIGFKRDGKGLKVAPQTLELFTPVQLIRNLADVINVKVANAKVPATPKRLEKLELTDIERQLLEKGADLRADRIAELIILAQREFLADSPEFIDWTLQFWGLEKQITAEQTQEQSGGVNYGWYRVAANYIANHSTLSLEDVSGKLVDFCQQLADWATSNQLLSSHSSSTFEVFNSYLEQYLEIQGWQSSTPNFSQELSTYIMAKTQSSKQPICSLSSGEFISEDQMDSVVLFKPQQYSNKNPLGGGKIKRGISKIWALEMLLRQALWTVPSGKFEDQQPVFLYIFPAYVYSPQIAAAIRSLVNDMKRINLWDVRKHWLHEDMNLDSLRSLQWRKEEAEVGRFKDKYSRADIPFMGTVYTTTRGKTLTEAWIDPAFLTLALPILLGVKVIATSSSVPLYNSDNDFLDSVILDAPAGFWQLLKLSTSLRIQELSVALKRLLTIYTIHLDNRSNPPDARWQALNSTVREVITDVLNVFSIADEKLREDQREASPQEVQRYWKFAEIFAQGDTIMTEKLKLTKELVRQYRTFYQVKWSESSHTILLPLTKALEEILSTPEHWDDEELILQGAGILNDALDRQEVYKRPLLQDKSIPYEIRKQQELQAIHQFMTTCVKELFGQMCKGDRALLQEYRNRIKSGAESAYKLLAFEEKSNSSQQQKSSEDQ,1073,2775,978,56.544,553,399,710,12,26,72.60,99,data/GCF_000024045.1_ASM2404v1_genomic.fna.gz
NC_013161.1,503817..525707,cas3,523760..521655,lcl|523760|521655|3|-1,-1,UniRef50_A0A168SWH5,0.0,UniRef50_A0A168SWH5 cas3 Type I-D CRISPR-associated helicase Cas3 n=2 Tax=Phormidium TaxID=1198 RepID=A0A168SWH5_9CYAN,MKINLKPLYSKLNAGVGNCPLGCQEMCRVQQQAPQFKAPSGCNCPLYQHQAESYPYLTKGDTDIIFITAPTAGGKSLLASLPSLLDPNFRMMGLYPTIELVEDQTEQQNNYHNLFGLNSEERIDKLFGVELTQRIKEFNSNRFQQLWLAIETKEVILTNPDIFHLMTHFRYRDNAYGTDELPLALAKFPDLWVFDEFHIFGAHQETAVLNSMMLIRRTQQQKKRFLFTSATVKTDFVEQLKQTGLKIKEIAGEYKSEAQQGYRQILQAVELSIINLKEEDGFSWLINNAAKIRKILKAEDKGRGLIILNSVVMVRRISQELQSLLPEIVVREISGRIDRKERSQTQQLLQEEEKPVLVVATSAVDVGVDFRIHLLITESSDSATVIQRLGRLGRHSGFSNYQAFLLLSGRTPWVINRLQEKLESKQDVTREELIEAIQYAFDPPKEYQEYRNRWGAIQVQGMFSQMMGSNAKVMQSIKERISEDLKRIYGNTLDNKAWYAMGHNCLGKAIQSELLRFRGGSTLQAAVWDEQRFYTYDLLRLLPYATVDILDRETFLKAATKAGHIEEAFPSQYLQVYLRIEQWLDKRLNLNLFCNRESDELLVGKLFLITRLKLDGHPQSDVISCLSRCNLLTFLVPVDRSRTQSHWEVSYCLHLNPLFGLYRLKDASEQAYACAFNQDALLLEALNWKLTKFYRERSLIF,671,1731,720,49.028,353,341,479,10,26,66.53,100,data/GCF_000024045.1_ASM2404v1_genomic.fna.gz
NC_013161.1,503817..525707,CRISPR array,512560..513624,,,,,"Copies: 15, Repeat: 37, Spacer: 36",--GTTTCAATCCC-----------ATTACTAGGATTCATTAAAAAGAAAC,,,,,,,,,,,,data/GCF_000024045.1_ASM2404v1_genomic.fna.gz

The first two columns contain the input genome/contig sequence ID (sometimes called an accession number) and the coordinates of the candidate gene cluster, respectively. Since an input file can have multiple genomic sequences, these two fields together uniquely specify a candidate gene cluster. Each row represents a single annotated feature in the candidate locus. Features from the same candidate are always grouped together in the CSV.
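Because the first two columns identify a candidate, results can be regrouped per candidate with the Python standard library alone. This is a minimal sketch rather than an Opfi API: the function group_by_candidate is hypothetical, it assumes the file has no header row (skip the first row otherwise), and the sample rows are shortened, partly made-up stand-ins for real output.

```python
import csv
import io
from collections import defaultdict


def group_by_candidate(csv_file):
    """Map (contig ID, locus coordinates) to the feature names in that candidate."""
    candidates = defaultdict(list)
    for row in csv.reader(csv_file):
        if not row:
            continue  # ignore blank lines
        contig, locus, name = row[0], row[1], row[2]
        candidates[(contig, locus)].append(name)
    return dict(candidates)


# Shortened rows: only the first four columns, and "contig_B" is invented.
sample = io.StringIO(
    "NC_013161.1,503817..525707,cas1,515084..514107\n"
    "NC_013161.1,503817..525707,cas2,514110..513817\n"
    "contig_B,100..9000,cas3,200..2300\n"
)
for key, names in group_by_candidate(sample).items():
    print(key, names)
```

With a real results file, pass an open file handle, for example open(path, newline=""), in place of the StringIO object.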

Descriptions of each output field are provided below. Alignment statistic naming conventions follow the BLAST documentation; see the BLAST+ appendices (specifically “outfmt” in table C1). This glossary of common BLAST terms may also be useful for interpreting the alignment statistics.

index | field name | data type | description
0 | Contig | string | ID/accession for the parent contig/genome sequence.
1 | Loc_coordinates | string | Start and end position of the candidate locus (relative to the parent sequence).
2 | Name | string | Feature name/label. This will be identical to “Description” (index 8) if parse_descriptions is False.
3 | Coordinates | string | Start and end position of this feature, relative to the parent sequence.
4 | ORFID | string | A unique ID given to this feature, primarily for internal use. Only applies to features that are genes.
5 | Strand | signed int | Specifies whether the feature was found on the forward (1) or reverse (-1) strand. Only applies to features that are genes.
6 | Accession | string | ID/accession for the reference sequence that had the best alignment (by e-value) with this feature’s translated sequence.
7 | E_val | float | The e-value of the best alignment for this feature.
8 | Description | string | A description of this putative feature, parsed from the defline of the best-aligned reference sequence.
9 | Sequence | string | The (translated) amino acid sequence for this feature.
10 | Bitscore | float | The bitscore for the best alignment for this feature.
11 | Rawscore | int | The raw score for the best alignment for this feature.
12 | Aln_len | int | The length of the best-scoring alignment, in alignment positions (amino acid residues for protein searches).
13 | Pident | float | The percentage of identical positions in the best alignment.
14 | Nident | int | The number of identical positions in the best alignment.
15 | Mismatch | int | The number of mismatched positions in the best alignment.
16 | Positive | int | The number of positive-scoring matches in the best alignment.
17 | Gapopen | int | The number of gap openings.
18 | Gaps | int | Total number of gaps in the alignment.
19 | Ppos | float | Percentage of positive-scoring matches.
20 | Qcovhsp | int | Query coverage per HSP. That is, the percentage of the query (this feature’s translated amino acid sequence) that was covered in the best alignment.
21 | Contig_filename | string | The input data (genomic sequence(s)) file path.
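The two coordinate fields are stored as strings of the form start..end, so they need to be split before any numeric comparison. The helpers below are a hypothetical sketch, not part of Opfi. Inferring orientation from coordinate order is an assumption based on the example rows, where features with Strand -1 list the larger coordinate first.

```python
def parse_coordinates(coords):
    """Split a 'start..end' string into a pair of integers."""
    start, end = coords.split("..")
    return int(start), int(end)


def orientation(coords):
    """Return 1 if start <= end, otherwise -1 (assumed convention)."""
    start, end = parse_coordinates(coords)
    return 1 if start <= end else -1


def within_locus(feature_coords, locus_coords):
    """Check that a feature lies inside the candidate locus, ignoring orientation."""
    f_lo, f_hi = sorted(parse_coordinates(feature_coords))
    l_lo, l_hi = sorted(parse_coordinates(locus_coords))
    return l_lo <= f_lo and f_hi <= l_hi


print(parse_coordinates("514110..513817"))
print(orientation("514110..513817"))
print(within_locus("514110..513817", "503817..525707"))
```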