prot_db

The prot_db module contains classes to handle protein file and protein description which can be either generate by Prodigal or Provide by Gembase. It also provide an interface to abstract the way to get protein sequences and descriptions

class integron_finder.prot_db.GembaseDB(replicon, cfg, gembase_path=None, prot_file=None)[source]

Implements ProteinDB from a Gembase. Managed proteins from Proteins directory corresponding to a replicon/contig

__getitem__(prot_seq_id)[source]
Parameters:

prot_seq_id (str) – the id of a protein sequence

Returns:

The Sequence corresponding to the prot_seq_id.

Return type:

Bio.SeqRecord object

__init__(replicon, cfg, gembase_path=None, prot_file=None)[source]
Parameters:
  • replicon (Bio.SeqRecord object with a extra attribute path) – The replicon used to create ProteinDB (protein files and extra information)

  • cfg (integron_finder.config.Config object) – The integron_finder configuration

  • prot_file – The path to a protein file in fasta format which is the translation of the replicon

Warning

The replicon is a modified Bio.SeqRecord object. The attribute path must be injected in the object This attribute represent the path to a fasta file representing this replicon

__iter__()[source]
Returns:

a generator which iterate on the gene seq_ids which constitute the contig (genes and pseudogenes).

Return type:

generator

_make_protfile(path=None)[source]

Create fasta file with protein corresponding to this sequence, from the corresponding Gembase protfile This step is necessary because in Gembase1&2 Draft One nucleic file can contain several contigs, but all proteins are in the same file. and in Gembase2 replicons file can contain several chromosomes and plasmids :return: the path of the created protein file :rtype: str

_parse_lst(lst_path)[source]

Parse the LSTINFO file and extract information specific to the replicon :return: class:pandas.DataFrame` object :raise IntegronError: when LST dir is not found

coding_prot_ids()[source]
Returns:

a generator which iterate on coding genes seq_ids (the non coding genes are discarded)

Return type:

generator

classmethod find_gembase_file_basename(gembase_path, input_seq_path)[source]

from the input file name, try to retrieve the basename which is used in gembase This specially useful when IF is run in parallel. The input sequence is split in chunks and treat in parallel. But in this case the name of the chunk does not match neither the lstinfo file nor the protein file. So this method try retrieve the original basename without extension for instance:

ACBA.0917.00019.fna               => ACBA.0917.00019
ACBA.0917.00019.0001.fna          => ACBA.0917.00019
ESCO001.C.00001.C001.fst          => ESCO001.C.00001.C001
ESCO001.C.00001.C001_chunk_1.fst  => ESCO001.C.00001.C001
Returns:

the gembase basename corresponding to the input file

Return type:

string

static gembase1_complete_parser(lst_path, sequence_id)[source]
Parameters:
  • lst_path (str) – the path of the LSTINFO file Gembase Complet

  • sequence_id (str) – the id of the genomic sequence to analyse

Returns:

the information related to the ‘valid’ CDS corresponding to the sequence_id

Return type:

class:pandas.DataFrame` object

static gembase1_draft_parser(lst_path, replicon_id)[source]
Parameters:
  • lst_path (str) – the path of the LSTINFO file from a Gembase Draft

  • sequence_id (str) – the id of the genomic sequence to analyse

Returns:

the information related to the ‘valid’ CDS corresponding to the sequence_id

Return type:

class:pandas.DataFrame` object

static gembase2_parser(lst_path, replicon_id)[source]
Parameters:
  • lst_path (str) – the path of the LSTINFO file from a Gembase Draft

  • sequence_id (str) – the id of the genomic sequence to analyse

Returns:

the information related to the ‘valid’ CDS corresponding to the sequence_id

Return type:

class:pandas.DataFrame` object

classmethod gembase_sniffer(lst_path)[source]

Detect the type of gembase :param str lst_path: the path to the LSTINFO file corresponding to the nucleic sequence :returns: either the type of replicon (‘Complet’ or ‘Draft’, 1 or 2) :rtype: tuple

get_description(gene_id)[source]
Parameters:

gene_id (str) – a protein/gene identifier

Returns:

The description of the protein corresponding to the gene_id

Return type:

SeqDesc namedtuple object

Raises:
  • IntegronError – when gene_id is not a valid Gembase gene identifier

  • KeyError – if gene_id is not found in GembaseDB instance

static get_lst_dir(gembase_path)[source]
Returns:

The path to the Gembase LST directory

Return type:

str

classmethod get_replicon_type(seq_id='', rep_id='')[source]
Parameters:
  • seq_id (str) – the sequence id to parse

  • rep_id (str) – the replicon identifier

Returns:

the kind of genome, it can be either:

  • Chromosome

  • Plasmid

  • Phage

  • Other

  • Draft

Return type:

RepliconType object

class integron_finder.prot_db.ProdigalDB(replicon, cfg, prot_file=None)[source]

Creates proteins from Replicon/contig using prodigal and provide facilities to access them.

__getitem__(prot_seq_id)[source]
Parameters:

prot_seq_id (str) – the id of a protein sequence

Returns:

The Sequence corresponding to the prot_seq_id.

Return type:

Bio.SeqRecord object

__iter__()[source]
Returns:

a generator which iterate on the protein seq_id which constitute the contig.

Return type:

generator

_make_protfile(path=None)[source]

Use prodigal to generate proteins corresponding to the replicon

Returns:

the path of the created protfile

Return type:

str

coding_prot_ids()[source]
Returns:

a generator which iterate on coding genes seq_ids (the non coding genes are discarded)

Return type:

generartor

get_description(gene_id)[source]
Parameters:

gene_id (str) – a protein/gene identifier

Returns:

The description of the protein corresponding to the gene_id

Return type:

SeqDesc namedtuple object

Raises:
  • IntegronError – when gene_id is not a valid Gembase gene identifier

  • KeyError – if gene_id is not found in ProdigalDB instance

class integron_finder.prot_db.ProteinDB(replicon, cfg, prot_file=None)[source]

AbstractClass defining the interface for ProteinDB. ProteinDB provide an abstraction and a way to access to proteins corresponding to the replicon/contig CDS.

abstractmethod __getitem__(prot_seq_id)[source]
Parameters:

prot_seq_id (str) – the id of a protein sequence

Returns:

The Sequence corresponding to the prot_seq_id.

Return type:

Bio.SeqRecord object

Raises:

KeyError – when seq_id does not match any sequence in DB

__init__(replicon, cfg, prot_file=None)[source]
Parameters:
  • replicon

  • cfg

  • prot_file

abstractmethod __iter__()[source]
Returns:

a generator which iterate on the protein seq_id which constitute the contig.

Return type:

generator

__weakref__

list of weak references to the object

_make_db()[source]
Returns:

an index of the sequence contains in protfile corresponding to the replicon

abstractmethod _make_protfile(path=None)[source]

Create fasta file with protein corresponding to the nucleic sequence (replicon)

Returns:

the path of the created protein file

Return type:

str

abstractmethod coding_prot_ids()[source]
Returns:

a generator which iterate on coding genes seq_ids (the non coding genes are discarded)

Return type:

generartor

abstractmethod get_description(gene_id)[source]
Parameters:

gene_id (str) – a protein/gene identifier

Returns:

The description of the protein corresponding to the gene_id

Return type:

SeqDesc namedtuple object

Raises:
  • IntegronError – when gene_id is not a valid Gembase gene identifier

  • KeyError – if gene_id is not found in GembaseDB instance

is_pseudo_gene(seq_id)[source]
Parameters:

seq_id – The sequence id to test

Returns:

True if the seq_id correspond to gene which is a pseudo gene, False otherwise

property protfile
Returns:

The absolute path to the protein file corresponding to contig id

Return type:

str

class integron_finder.prot_db.SeqDesc(id, strand, start, stop)
__getnewargs__()

Return self as a plain tuple. Used by copy and pickle.

static __new__(_cls, id, strand, start, stop)

Create new instance of SeqDesc(id, strand, start, stop)

__replace__(**kwds)

Return a new SeqDesc object replacing specified fields with new values

__repr__()

Return a nicely formatted representation string

_asdict()

Return a new dict which maps field names to their values.

classmethod _make(iterable)

Make a new SeqDesc object from a sequence or iterable

_replace(**kwds)

Return a new SeqDesc object replacing specified fields with new values

id

Alias for field number 0

start

Alias for field number 2

stop

Alias for field number 3

strand

Alias for field number 1