Inputs and Outputs

Building sequence databases

To search for gene clusters with Opfi, users must compile representative protein (or nucleic acid) sequences for any genes expected in target clusters (or for any non-essential accessory genes of interest). These may come from a pre-existing, private collection of sequences (perhaps from a previous bioinformatics analysis). Alternatively, users may download sequences from a publicly available database such as UniProt (maintained by the European Bioinformatics Institute) or one of the databases provided by the National Center for Biotechnology Information.

Once target sequences have been compiled, they must be converted to an application-specific database format. Opfi currently supports BLAST+, mmseqs2, and diamond for homology searching.

The FASTA file format

Both genomic input data and reference sequence data should be in FASTA format. This is a simple flat text representation of biological sequence data, where individual sequences are delineated by the greater-than character (>). For example:

>UniRef50_Q02ML7 CRISPR-associated endonuclease Cas1 n=1700 RepID=CAS1_PSEAB
MDDISPSELKTILHSKRANLYYLQHCRVLVNGGRVEYVTDEGRHSHYWNIPIANTTSLLL
GTGTSITQAAMRELARAGVLVGFCGGGGTPLFSANEVDVEVSWLTPQSEYRPTEYLQRWV
GFWFDEEKRLVAARHFQRARLERIRHSWLEDRVLRDAGFAVDATALAVAVEDSARALEQA
PNHEHLLTEEARLSKRLFKLAAQATRYGEFVRAKRGSGGDPANRFLDHGNYLAYGLAATA
TWVLGIPHGLAVLHGKTRRGGLVFDVADLIKDSLILPQAFLSAMRGDEEQDFRQACLDNL
SRAQALDFMIDTLKDVAQRSTVSA
>UniRef50_Q2RY21 CRISPR-associated endonuclease Cas1 1 n=1034 RepID=CAS1A_RHORT
MADPAFVPLRPIAIKDRSSIVFLQRGQLDVVDGAFVLIDQEGVRVQIPVGGLACLMLEPG
TRITHAAIVLCARVGCLVIWVGERGTRLYAAGQPGGARADRLLFQARNALDETARLNVVR
EMYRRRFDDDPPARRSVDQLRGMEGVRVREIYRLLAKKYAVDWNARRYDHNDWDGADIPN
RCLSAATACLYGLCEAAILAAGYAPAIGFLHRGKPQSFVYDVADLYKVETVVPTAFSIAA
KIAAGKGDDSPPERQVRIACRDQFRKSGLLEKIIPDIEEILRAGGLEPPLDAPEAVDPVI
PPEEPSGDDGHRG

The sequence definition line (defline) comes directly after the > character and should be on a separate line from the sequence (which can span one or more subsequent lines). There is no specific defline format; however, Opfi requires that, for both genomic input and reference sequence data, each defline contain a unique sequence identifier. This should be a single word/token immediately following the > character (i.e., spaces between the > character and the identifier are not allowed). Any additional text on the defline is parsed as a single string and appears in the output CSV (see Opfi output format).
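The identifier requirements described above can be checked before building a database. The following is a minimal sketch, not part of Opfi: the function name check_deflines is hypothetical, and it only tests the two conditions stated here (the identifier immediately follows the > character, and no identifier is repeated).

```python
def check_deflines(lines):
    """Return the sequence identifiers found in FASTA `lines`.

    Raises ValueError if an identifier is missing, is separated from '>'
    by whitespace, or appears more than once. Hypothetical helper, not Opfi API.
    """
    seen = set()
    identifiers = []
    for line_number, line in enumerate(lines, start=1):
        if not line.startswith(">"):
            continue  # sequence line, nothing to check
        header = line[1:].rstrip("\n")
        if not header or header[0].isspace():
            raise ValueError(
                f"line {line_number}: identifier must immediately follow '>'"
            )
        # The identifier is the first whitespace-delimited token of the defline.
        identifier = header.split(maxsplit=1)[0]
        if identifier in seen:
            raise ValueError(
                f"line {line_number}: duplicate identifier {identifier!r}"
            )
        seen.add(identifier)
        identifiers.append(identifier)
    return identifiers
```

Passing an open file handle (for example, check_deflines(open("my_sequences.fasta"))) works as well as a list of strings, since the function only iterates over lines.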

Tip

Biological sequences downloaded from most public databases will have an accession number/identifier by default.

Annotating sequence databases

To take full advantage of the rule-based filtering methods in operon_analyzer.rules, users are encouraged to annotate reference sequences with a name/label that is easily searched. Labels can be as broad or as specific as is necessary to provide meaningful annotation of target gene clusters.

Gene labels are parsed from sequence deflines; specifically, Opfi looks for the second word/token following the > character. For example, the following FASTA sequence has been annotated with the label “cas1”:

>UniRef50_Q02ML7 cas1 CRISPR-associated endonuclease Cas1 n=1700 RepID=CAS1_PSEAB
MDDISPSELKTILHSKRANLYYLQHCRVLVNGGRVEYVTDEGRHSHYWNIPIANTTSLLL
GTGTSITQAAMRELARAGVLVGFCGGGGTPLFSANEVDVEVSWLTPQSEYRPTEYLQRWV
GFWFDEEKRLVAARHFQRARLERIRHSWLEDRVLRDAGFAVDATALAVAVEDSARALEQA
PNHEHLLTEEARLSKRLFKLAAQATRYGEFVRAKRGSGGDPANRFLDHGNYLAYGLAATA
TWVLGIPHGLAVLHGKTRRGGLVFDVADLIKDSLILPQAFLSAMRGDEEQDFRQACLDNL
SRAQALDFMIDTLKDVAQRSTVSA

After running gene_finder.pipeline.Pipeline, users could select candidates with hits against this sequence using the following rule set:

from operon_analyzer.rules import RuleSet

rs = RuleSet.require("cas1")

In practice, a genomics search might use a reference database of hundreds (or even thousands) of representative protein sequences, in which case labeling each sequence individually would be tedious. It is recommended to organize sequences into groups of related proteins that can be given a single label. This script uses the Python package Biopython to annotate sequences in a multi-sequence FASTA file:

import sys

from Bio import SeqIO


def annotate_reference(ref_fasta, label):
    """Insert `label` as the second defline token of every record in `ref_fasta`."""
    records = list(SeqIO.parse(ref_fasta, "fasta"))

    for record in records:
        # record.description holds the full defline text, starting with the sequence ID.
        des = record.description.split()
        prot_id = des.pop(0)
        des_with_label = "{} {} {}".format(prot_id, label, " ".join(des))
        record.description = des_with_label

    # Overwrite the input file with the labeled records.
    SeqIO.write(records, ref_fasta, "fasta")


if __name__ == "__main__":
    ref_fasta = sys.argv[1]
    label = sys.argv[2]
    annotate_reference(ref_fasta, label)

It is possible to use the entire sequence description (i.e. all text following the sequence identifier) as the gene label. This is particularly useful when using a pre-built database like nr, which contains representative protein sequences for many different protein families. When using sequence databases that haven’t been annotated, users should set parse_descriptions=False for each gene_finder.pipeline.Pipeline add_step() method call.
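To make the two labeling modes concrete, here is a small sketch of how a defline maps to an identifier and a label. It is an illustrative model of the behavior described in this section, not Opfi's actual parsing code; the function name split_defline is hypothetical, and its parse_descriptions argument simply mirrors the add_step() option of the same name.

```python
def split_defline(defline, parse_descriptions=True):
    """Return (identifier, label) for a FASTA defline.

    Illustrative model only (hypothetical helper, not Opfi's implementation):
    - parse_descriptions=True: the label is the second token of the defline.
    - parse_descriptions=False: the label is all text after the identifier.
    """
    text = defline[1:] if defline.startswith(">") else defline
    tokens = text.split()
    identifier = tokens[0]
    if parse_descriptions:
        label = tokens[1] if len(tokens) > 1 else ""
    else:
        label = " ".join(tokens[1:])
    return identifier, label


defline = ">UniRef50_Q02ML7 cas1 CRISPR-associated endonuclease Cas1 n=1700 RepID=CAS1_PSEAB"
print(split_defline(defline))
print(split_defline(defline, parse_descriptions=False))
```

With the annotated defline, the first call yields the short label "cas1", while the second keeps the full description text as the label.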

Converting sequence files to a sequence database

Once reference sequences have been compiled (and, optionally, labeled), they must be converted to a sequence database format that is specific to the homology search program used. Currently, Opfi supports BLAST+, mmseqs2, and diamond. Each software package is automatically installed with a companion utility program for generating sequence databases. The following example shows what a typical call to makeblastdb, the BLAST+ database utility program, might look like:

makeblastdb -in "my_sequences.fasta" -out my_sequences/db -dbtype prot -title "my_sequences" -hash_index

The command takes a text/FASTA file my_sequences.fasta as input, and writes the resulting database files to the directory my_sequences. Database files are prefixed with “db”. -dbtype prot specifies that the input is amino acid sequences. We use -title to name the database (required by BLAST). -hash_index directs makeblastdb to generate a hash index of protein sequences, which can speed up computation time.
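When several reference files need to be converted, the same call can be scripted. The sketch below only uses the makeblastdb flags shown above; the function names makeblastdb_command and build_database are hypothetical helpers and are not part of Opfi or BLAST+.

```python
import shutil
import subprocess


def makeblastdb_command(fasta_path, db_prefix, title, dbtype="prot"):
    """Build the makeblastdb argument list used in the example above."""
    return [
        "makeblastdb",
        "-in", fasta_path,
        "-out", db_prefix,
        "-dbtype", dbtype,
        "-title", title,
        "-hash_index",
    ]


def build_database(fasta_path, db_prefix, title, dbtype="prot"):
    """Run makeblastdb, failing early if the program is not on PATH."""
    if shutil.which("makeblastdb") is None:
        raise FileNotFoundError("makeblastdb was not found on PATH")
    command = makeblastdb_command(fasta_path, db_prefix, title, dbtype)
    # check=True raises CalledProcessError if makeblastdb exits with an error.
    subprocess.run(command, check=True)
    return command
```

For example, build_database("my_sequences.fasta", "my_sequences/db", "my_sequences") would run the same command as the one-liner above.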

Tip

mmseqs2 and diamond have similar database creation commands; see Building sequence databases.

BLAST advanced options

BLAST+ programs have a number of tunable parameters that can, for example, be used to adjust the sensitivity of the search algorithm. We anticipate that application defaults will be sufficient for most users; nevertheless, it is possible to use non-default program options by passing them as keyword arguments to gene_finder.pipeline.Pipeline add_step() methods.

For example, when using blastp on the command line, we could adjust the number of CPUs to four by passing the argument -num_threads 4 to the program. When using Opfi, this would look like num_threads=4.

Flags (boolean arguments that generally do not precede additional data) are also possible. For example, the command line flag -use_sw_tback tells blastp to compute locally optimal Smith-Waterman alignments. The correct way to specify this behavior via the gene_finder.pipeline.Pipeline API would be to use the argument use_sw_tback=True.
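The mapping from command-line options to keyword arguments follows a simple pattern: drop the leading dash, pass the value through, and use True for flags. The sketch below illustrates that pattern; cli_options_to_kwargs is a hypothetical helper, not part of Opfi, and it assumes options are given as separate tokens (for example, after shlex.split).

```python
def _convert(value):
    """Convert a token to int or float when possible; otherwise keep the string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value


def cli_options_to_kwargs(args):
    """Translate BLAST+ style options into keyword arguments.

    An option followed by a value becomes name=value; an option with no
    value (a flag) becomes name=True. Hypothetical helper, not Opfi API.
    """
    kwargs = {}
    i = 0
    while i < len(args):
        name = args[i].lstrip("-")
        nxt = args[i + 1] if i + 1 < len(args) else None
        # A following token is a value unless it looks like another option.
        # Negative numbers (e.g. "-3") start with '-' but are still values.
        is_value = nxt is not None and (
            not nxt.startswith("-") or not isinstance(_convert(nxt), str)
        )
        if is_value:
            kwargs[name] = _convert(nxt)
            i += 2
        else:
            kwargs[name] = True
            i += 1
    return kwargs


print(cli_options_to_kwargs(["-num_threads", "4", "-use_sw_tback"]))
```

The resulting dictionary could then be unpacked (with **) into an add_step() call alongside that method's own arguments.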

Below is a list of options accepted by Opfi. Note that some BLAST+ options are not allowed, mainly those that modify BLAST output.

Program	Allowed Options
blastp and psiblast	dbsize word_size gapopen gapextend qcov_hsp_perc xdrop_ungap xdrop_gap xdrop_gap_final searchsp sum_stats seg soft_masking matrix threshold culling_limit window_size num_threads comp_based_stats gilist seqidlist negative_gilist db_soft_mask db_hard_mask entrez_query max_hsps best_hit_overhang best_hit_score_edge max_target_seqs import_search_strategy export_search_strategy num_alignments
blastp only	task
psiblast only	gap_trigger num_iterations out_pssm out_ascii_pssm pseudocount inclusion_ethresh
blastp (flags)	lcase_masking ungapped use_sw_tback remote
psiblast (flags)	lcase_masking use_sw_tback save_pssm_after_last_round save_each_pssm remote
blastn	filtering_algorithm sum_stats window_masker_db window_size template_type version parse_deflines min_raw_gapped_score string format max_hsps taxids negative_taxids num_alignments strand off_diagonal_range subject_besthit num_sequences no_greedy negative_taxidlist culling_limit xdrop_ungap open_penalty DUST_options sorthits xdrop_gap_final negative_gilist subject use_index bool_value filename seqidlist task_name sort_hits database_name lcase_masking query_loc subject_loc sort_hsps line_length boolean db_hard_mask negative_seqidlist template_length filtering_db filtering_database penalty searchsp ungapped type gapextend db_soft_mask dbsize qcov_hsp_perc sorthsps window_masker_taxid index_name export_search_strategy float_value soft_masking gilist entrez_query show_gis best_hit_score_edge gapopen subject_input_file range html word_size best_hit_overhang perc_identity input_file num_descriptions xdrop_gap dust taxidlist max_target_seqs num_threads task remote int_value extend_penalty reward import_search_strategy num_letters

You can read more about BLAST+ options in the BLAST+ appendices.
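Because options outside the table are not accepted, it can be convenient to check a set of keyword arguments before starting a long search. The following sketch covers blastp only, using option names transcribed from the table above. The function name unsupported_blastp_kwargs is hypothetical, and how Opfi itself reacts to a disallowed option is not shown here.

```python
# Option names transcribed from the "blastp and psiblast" and "blastp only" rows.
BLASTP_OPTIONS = {
    "dbsize", "word_size", "gapopen", "gapextend", "qcov_hsp_perc",
    "xdrop_ungap", "xdrop_gap", "xdrop_gap_final", "searchsp", "sum_stats",
    "seg", "soft_masking", "matrix", "threshold", "culling_limit",
    "window_size", "num_threads", "comp_based_stats", "gilist", "seqidlist",
    "negative_gilist", "db_soft_mask", "db_hard_mask", "entrez_query",
    "max_hsps", "best_hit_overhang", "best_hit_score_edge", "max_target_seqs",
    "import_search_strategy", "export_search_strategy", "num_alignments",
    "task",
}

# Flag names transcribed from the "blastp (flags)" row.
BLASTP_FLAGS = {"lcase_masking", "ungapped", "use_sw_tback", "remote"}


def unsupported_blastp_kwargs(kwargs):
    """Return the names in `kwargs` that are not listed for blastp.

    Only BLAST+ options should be passed in; an add_step() method's own
    arguments are outside the scope of this check.
    """
    allowed = BLASTP_OPTIONS | BLASTP_FLAGS
    return [name for name in kwargs if name not in allowed]


print(unsupported_blastp_kwargs({"num_threads": 4, "use_sw_tback": True}))
print(unsupported_blastp_kwargs({"outfmt": 6}))
```

The second call reports outfmt, an option that changes BLAST output and does not appear in the table.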

Note

Using advanced options with mmseqs2 and diamond is not supported at this time.

Opfi output format

Results from gene_finder.pipeline.Pipeline searches are written to a single CSV file. Below is an example from the tutorial (see Example Usage):

NC_013161.1	503817..525707	cas1	514110..513817	lcl\|514110\|513817\|2\|-1	-1	UniRef50_A0A179D3U4	1.24e-07	UniRef50_A0A179D3U4 cas1 CRISPR-associated endoribonuclease Cas2 n=2 Tax=Thermosulfurimonas dismutans TaxID=999894 RepID=A0A179D3U4_9BACT	MNTLFYLIIYDLPATKAGNKRRKRLYEMLCGYGNWTQFSVFECFLTAVQFANLQSKLENLIQPNEDSVRIYILDAGSVRKTLTYGSEKPRQVDTLIL	42.4	98	51	43.137	22	29	31	0	0	60.78	53	data/GCF_000024045.1_ASM2404v1_genomic.fna.gz
NC_013161.1	503817..525707	cas1	515084..514107	lcl\|515084\|514107\|3\|-1	-1	UniRef50_A0A1Z3HN48	4.00e-177	UniRef50_A0A1Z3HN48 cas1 CRISPR-associated endonuclease Cas1 n=83 Tax=Cyanobacteria TaxID=1117 RepID=A0A1Z3HN48_9CYAN	MSILYLTQPDAVLSKKQEAFHVALKQEDGSWKKQLIPAQTVEQIVLIGYPSITGEALCYALELGIPVHYLSCFGKYLGSALPGYSRNGQLRLAQYHVHDNEEQRLALVKTVVTGKIHNQYHVLYRYQQKDNPLKEHKQLVKSKTTLEQVRGVEGLAAKDYFNGFKLILDSQWNFNGRNRRPPTDPVNALLSFAYGLLRVQVTAAVHIAGLDPYIGYLHETTRGQPAMVLDLMEEFRPLIADSLVLSVISHKEIKPTDFNESLGAYLLSDSGRKTFLQAFERKLNTEFKHPVFGYQCSYRRSIELQARLFSRYLQENIPYKSLSLR	489	1260	325	69.538	226	99	276	0	0	84.92	100	data/GCF_000024045.1_ASM2404v1_genomic.fna.gz
NC_013161.1	503817..525707	cas1	515707..515117	lcl\|515707\|515117\|1\|-1	-1	UniRef50_A0A2I8A541	1.64e-100	UniRef50_A0A2I8A541 cas1 CRISPR-associated exonuclease Cas4 n=83 Tax=Cyanobacteria TaxID=1117 RepID=A0A2I8A541_9NOSO	MIDNYLPLAYLNAFEYCTRRFYWEYVLGEMANNEHIIIGRHLHRNINQEGIIKEEDTIIHRQQWVWSDRLQIKGIIDAVEEKESSLVPVEYKKGRMSQHLNDHFQLCAAALCLEEKTGKIITYGEIFYHANRRRQRVDFSDRLRCSTEQAIHHAHELVNQKMPSPINNSKKCRDCSLKTMCLPKEVKQLRNSLISD	285	729	195	66.154	129	66	162	0	0	83.08	99	data/GCF_000024045.1_ASM2404v1_genomic.fna.gz
NC_013161.1	503817..525707	cas2	514110..513817	lcl\|514110\|513817\|2\|-1	-1	UniRef50_A0A1Z3HN55	7.36e-46	UniRef50_A0A1Z3HN55 cas2 CRISPR-associated endoribonuclease Cas2 n=68 Tax=Cyanobacteria TaxID=1117 RepID=A0A1Z3HN55_9CYAN	MNTLFYLIIYDLPATKAGNKRRKRLYEMLCGYGNWTQFSVFECFLTAVQFANLQSKLENLIQPNEDSVRIYILDAGSVRKTLTYGSEKPRQVDTLIL	142	357	94	67.021	63	31	77	0	0	81.91	97	data/GCF_000024045.1_ASM2404v1_genomic.fna.gz
NC_013161.1	503817..525707	cas4	515084..514107	lcl\|515084\|514107\|3\|-1	-1	UniRef50_A0A1E5G3J0	1.01e-72	UniRef50_A0A1E5G3J0 cas4 CRISPR-associated endonuclease Cas1 n=4 Tax=Firmicutes TaxID=1239 RepID=A0A1E5G3J0_9BACL	MSILYLTQPDAVLSKKQEAFHVALKQEDGSWKKQLIPAQTVEQIVLIGYPSITGEALCYALELGIPVHYLSCFGKYLGSALPGYSRNGQLRLAQYHVHDNEEQRLALVKTVVTGKIHNQYHVLYRYQQKDNPLKEHKQLVKSKTTLEQVRGVEGLAAKDYFNGFKLILDSQWNFNGRNRRPPTDPVNALLSFAYGLLRVQVTAAVHIAGLDPYIGYLHETTRGQPAMVLDLMEEFRPLIADSLVLSVISHKEIKPTDFNESLGAYLLSDSGRKTFLQAFERKLNTEFKHPVFGYQCSYRRSIELQARLFSRYLQENIPYKSLSLR	233	595	333	39.940	133	179	191	6	21	57.36	98	data/GCF_000024045.1_ASM2404v1_genomic.fna.gz
NC_013161.1	503817..525707	cas4	515707..515117	lcl\|515707\|515117\|1\|-1	-1	UniRef50_A0A2I8A541	1.92e-99	UniRef50_A0A2I8A541 cas4 CRISPR-associated exonuclease Cas4 n=83 Tax=Cyanobacteria TaxID=1117 RepID=A0A2I8A541_9NOSO	MIDNYLPLAYLNAFEYCTRRFYWEYVLGEMANNEHIIIGRHLHRNINQEGIIKEEDTIIHRQQWVWSDRLQIKGIIDAVEEKESSLVPVEYKKGRMSQHLNDHFQLCAAALCLEEKTGKIITYGEIFYHANRRRQRVDFSDRLRCSTEQAIHHAHELVNQKMPSPINNSKKCRDCSLKTMCLPKEVKQLRNSLISD	285	729	195	66.154	129	66	162	0	0	83.08	99	data/GCF_000024045.1_ASM2404v1_genomic.fna.gz
NC_013161.1	503817..525707	cas6	516642..515833	lcl\|516642\|515833\|2\|-1	-1	UniRef50_A0A654SHL3	2.64e-108	UniRef50_A0A654SHL3 cas6 CRISPR_Cas6 domain-containing protein n=30 Tax=Cyanobacteria TaxID=1117 RepID=A0A654SHL3_9CYAN	MVQDILPQLHKYQLQSLVIELGVAKQGKLPATLSRAIHACVLNWLSLADSQLANQIHDSQISPLCLSGLIGNRRQPYSLLGDYFLLRIGVLQPSLIKPLLKGIEAQETQTLELGKFPFIIRQVYSMPQSHKLSQLTDYYSLALYSPTMTEIQLKFLSPTSFKQIQGVQPFPLPELVFNSLLRKWNHFAPQELKFPEIQWQSFVSAFELKTHALKMEGGAQIGSQGWAKYCFKDTEQARIASILSHFAFYAGVGRKTTMGMGQTQLLVNT	314	804	270	55.926	151	118	195	1	1	72.22	100	data/GCF_000024045.1_ASM2404v1_genomic.fna.gz
NC_013161.1	503817..525707	cas5	517387..516611	lcl\|517387\|516611\|1\|-1	-1	UniRef50_A0A2I8AFZ3	1.43e-118	UniRef50_A0A2I8AFZ3 cas5 Type I-D CRISPR-associated protein Cas5/Csc1 n=62 Tax=Cyanobacteria TaxID=1117 RepID=A0A2I8AFZ3_9NOSO	MNIYYCQLTLHDNIFFATREMGLLYETEKYLHNWALSYAFFKGTYIPHPYRLQGKSAQKPDYLDSTGEQSLAHLNRLKIYVFPAKPLRWSYQINTFKAAQTTYYGKSQQFGDKGANRNYPINYGRAKELAVGSEYHTFLISSQELNIPHWIRVGKWSAKVEVTSYLIPQKAISQHSGIYLCDHPLNPIDLPFDQELLLYNRIVMPPVSLVSQAQLQGNYCKINKNNWNDCPSNLTDLPQQICLPLGVNYGAGYIASAS	338	866	252	65.079	164	71	194	3	17	76.98	98	data/GCF_000024045.1_ASM2404v1_genomic.fna.gz
NC_013161.1	503817..525707	cas7	518600..517530	lcl\|518600\|517530\|3\|-1	-1	UniRef50_B7JVM8	0.0	UniRef50_B7JVM8 cas7 CRISPR-associated protein Csc2 n=52 Tax=Cyanobacteria TaxID=1117 RepID=B7JVM8_RIPO1	MSILETLKPQFQSAFPRLASANYVHFIMLRHSQSFPVFQTDGVLNTVRTQAGLMAKDSLSRLVMFKRKQTTPERLTGRELLRSLNITTADKNDKEKGCEYNGEGSCKKCPDCIIYGFAIGDSGSERSKVYSDSTFSLSAYEQSHRTFTFNAPFEGGTMSEQGVMRSAINELDHILPEITFPNIETLRDSTYEGFIYVLGNILRTKRYGAQESRTGTMKNHLVGIAFCDGEIFSNLRFTQALYDGLEGDVNKPIDEICYQASQIVQTLLSDEPVRKIKTIFGEELNHLINEVSGIYQNDALLTETLNMLYQQTKTYSENHGSLAKSKPPKAEGNKSKGRTKKKGDDEQTSLDLNIEE	733	1891	356	98.876	352	4	354	0	0	99.44	100	data/GCF_000024045.1_ASM2404v1_genomic.fna.gz
NC_013161.1	503817..525707	cas10	521597..518673	lcl\|521597\|518673\|3\|-1	-1	UniRef50_B7KB38	0.0	UniRef50_B7KB38 cas10 CRISPR-associated protein Csc3 n=52 Tax=Cyanobacteria TaxID=1117 RepID=B7KB38_GLOC7	MTLLQILLLETISQDTDPILISYLETVLPAMEPEFALIPALGGSQQIHYQNLIAIGNRYAQENAKRFSDKADQNLLVHVLNALLTAWNLVDHLTKPLSDIEKYLLCLGLTLHDYNKYCLGHGEESPKVSNINEIINICQELGKKLNFQAFWSDWEQYLPEIVYLAQNTQFKAGTNAIPANYPLFTLADSRRLDLPLRRLLAFGDIAVHLQDPADIISKTGGDRLREHLRFLGIKKALVYHRLRDTLGILSNGIHNATLRFAKDLNWQPLLFFAQGVIYLAPIDYTSPEKMELQGFIWQEISQLLASSMLKGEIGFKRDGKGLKVAPQTLELFTPVQLIRNLADVINVKVANAKVPATPKRLEKLELTDIERQLLEKGADLRADRIAELIILAQREFLADSPEFIDWTLQFWGLEKQITAEQTQEQSGGVNYGWYRVAANYIANHSTLSLEDVSGKLVDFCQQLADWATSNQLLSSHSSSTFEVFNSYLEQYLEIQGWQSSTPNFSQELSTYIMAKTQSSKQPICSLSSGEFISEDQMDSVVLFKPQQYSNKNPLGGGKIKRGISKIWALEMLLRQALWTVPSGKFEDQQPVFLYIFPAYVYSPQIAAAIRSLVNDMKRINLWDVRKHWLHEDMNLDSLRSLQWRKEEAEVGRFKDKYSRADIPFMGTVYTTTRGKTLTEAWIDPAFLTLALPILLGVKVIATSSSVPLYNSDNDFLDSVILDAPAGFWQLLKLSTSLRIQELSVALKRLLTIYTIHLDNRSNPPDARWQALNSTVREVITDVLNVFSIADEKLREDQREASPQEVQRYWKFAEIFAQGDTIMTEKLKLTKELVRQYRTFYQVKWSESSHTILLPLTKALEEILSTPEHWDDEELILQGAGILNDALDRQEVYKRPLLQDKSIPYEIRKQQELQAIHQFMTTCVKELFGQMCKGDRALLQEYRNRIKSGAESAYKLLAFEEKSNSSQQQKSSEDQ	1073	2775	978	56.544	553	399	710	12	26	72.60	99	data/GCF_000024045.1_ASM2404v1_genomic.fna.gz
NC_013161.1	503817..525707	cas3	523760..521655	lcl\|523760\|521655\|3\|-1	-1	UniRef50_A0A168SWH5	0.0	UniRef50_A0A168SWH5 cas3 Type I-D CRISPR-associated helicase Cas3 n=2 Tax=Phormidium TaxID=1198 RepID=A0A168SWH5_9CYAN	MKINLKPLYSKLNAGVGNCPLGCQEMCRVQQQAPQFKAPSGCNCPLYQHQAESYPYLTKGDTDIIFITAPTAGGKSLLASLPSLLDPNFRMMGLYPTIELVEDQTEQQNNYHNLFGLNSEERIDKLFGVELTQRIKEFNSNRFQQLWLAIETKEVILTNPDIFHLMTHFRYRDNAYGTDELPLALAKFPDLWVFDEFHIFGAHQETAVLNSMMLIRRTQQQKKRFLFTSATVKTDFVEQLKQTGLKIKEIAGEYKSEAQQGYRQILQAVELSIINLKEEDGFSWLINNAAKIRKILKAEDKGRGLIILNSVVMVRRISQELQSLLPEIVVREISGRIDRKERSQTQQLLQEEEKPVLVVATSAVDVGVDFRIHLLITESSDSATVIQRLGRLGRHSGFSNYQAFLLLSGRTPWVINRLQEKLESKQDVTREELIEAIQYAFDPPKEYQEYRNRWGAIQVQGMFSQMMGSNAKVMQSIKERISEDLKRIYGNTLDNKAWYAMGHNCLGKAIQSELLRFRGGSTLQAAVWDEQRFYTYDLLRLLPYATVDILDRETFLKAATKAGHIEEAFPSQYLQVYLRIEQWLDKRLNLNLFCNRESDELLVGKLFLITRLKLDGHPQSDVISCLSRCNLLTFLVPVDRSRTQSHWEVSYCLHLNPLFGLYRLKDASEQAYACAFNQDALLLEALNWKLTKFYRERSLIF	671	1731	720	49.028	353	341	479	10	26	66.53	100	data/GCF_000024045.1_ASM2404v1_genomic.fna.gz
NC_013161.1	503817..525707	CRISPR array	512560..513624					Copies: 15, Repeat: 37, Spacer: 36	–GTTTCAATCCC———–ATTACTAGGATTCATTAAAAAGAAAC												data/GCF_000024045.1_ASM2404v1_genomic.fna.gz

The first two columns contain the input genome/contig sequence ID (sometimes called an accession number) and the coordinates of the candidate gene cluster, respectively. Since an input file can have multiple genomic sequences, these two fields together uniquely specify a candidate gene cluster. Each row represents a single annotated feature in the candidate locus. Features from the same candidate are always grouped together in the CSV.

Descriptions of each output field are provided below. Alignment statistic naming conventions are from the BLAST documentation; see the BLAST+ appendices (specifically “outfmt” in table C1). This glossary of common BLAST terms may also be useful for interpreting the alignment statistics.

index	field name	data type	description
0	Contig	string	ID/accession for the parent contig/genome sequence.
1	Loc_coordinates	string	Start and end position of the candidate locus (relative to the parent sequence).
2	Name	string	Feature name/label. This will be identical to “Description” (index 8) if `parse_descriptions` is `False`.
3	Coordinates	string	Start and end position of this feature, relative to the parent sequence.
4	ORFID	string	A unique ID given to this feature, primarily for internal use. Only applies to features that are genes.
5	Strand	signed int	Specifies whether the feature was found in the forward (1) or reverse (-1) direction. Only applies to features that are genes.
6	Accession	string	ID/accession for the reference sequence that had the best alignment (by e-value) with this feature’s translated sequence.
7	E_val	float	The e-value score for the best alignment for this feature.
8	Description	string	A description of this putative feature, parsed from the defline of best aligned reference sequence.
9	Sequence	string	The (translated) amino acid sequence for this feature.
10	Bitscore	float	The bitscore for the best alignment for this feature.
11	Rawscore	int	The raw score for the best alignment for this feature.
12	Aln_len	int	The length of the best scoring alignment, in aligned positions (amino acid residues for protein searches).
13	Pident	float	The percentage of identical positions in the best alignment.
14	Nident	int	The number of identical positions in the best alignment.
15	Mismatch	int	The number of mismatched positions in the best alignment.
16	Positive	int	The number of positive-scoring matches in the best alignment.
17	Gapopen	int	The number of gap openings.
18	Gaps	int	Total number of gaps in the alignment.
19	Ppos	float	Percentage of positive scoring matches.
20	Qcovhsp	int	Query coverage per HSP. That is, the percentage of the query (this feature’s translated amino acid sequence) that was covered in the best alignment.
21	Contig_filename	string	The input data (genomic sequence(s)) file path.
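As a rough illustration of how the output can be consumed, the sketch below groups rows by candidate using the first two fields and recomputes percent identity from Nident and Aln_len. It assumes a comma-delimited file with columns in the order listed above; the function names are hypothetical, and a header row is skipped if one is present.

```python
import csv
from collections import defaultdict

# Column names in the order given in the field table above.
FIELD_NAMES = [
    "Contig", "Loc_coordinates", "Name", "Coordinates", "ORFID", "Strand",
    "Accession", "E_val", "Description", "Sequence", "Bitscore", "Rawscore",
    "Aln_len", "Pident", "Nident", "Mismatch", "Positive", "Gapopen", "Gaps",
    "Ppos", "Qcovhsp", "Contig_filename",
]


def group_features(handle):
    """Group output rows by candidate, keyed on (Contig, Loc_coordinates)."""
    candidates = defaultdict(list)
    for row in csv.reader(handle):
        if not row or row[0] == FIELD_NAMES[0]:
            continue  # skip blank lines and a header row, if present
        feature = dict(zip(FIELD_NAMES, row))
        key = (feature["Contig"], feature["Loc_coordinates"])
        candidates[key].append(feature)
    return dict(candidates)


def percent_identity(nident, aln_len):
    """Percentage of identical positions, rounded to three decimals."""
    return round(100 * nident / aln_len, 3)


# First example row: Nident = 22 and Aln_len = 51.
print(percent_identity(22, 51))  # 43.137, the Pident value reported in that row
```

A typical use would be group_features(open(path, newline="")), where path is a placeholder for the results file; each value in the returned dictionary is then the list of features annotated in one candidate locus.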