Pynteny


1. What is Pynteny?

Pynteny is a Python tool to search for synteny blocks in (prokaryotic) sequence data, using profile HMMs of the ORFs of interest and HMMER. By leveraging genomic context information, Pynteny can reduce the uncertainty that paralogs introduce into the functional annotation of unlabelled sequence data. Pynteny can be accessed (i) through the command line or (ii) as a Python module.

The CLI is run as pynteny <subcommand> <options>. The subcommands covered in this document are build, download and search.

2. Setup

Pynteny is a pure-Python package (it no longer requires conda or any external binaries). The HMMER and Prodigal engines are provided by the pip packages pyhmmer and pyrodigal.

Install with pip:

pip install pynteny

Or install the latest development version directly from GitHub:

pip install git+https://github.com/Robaina/Pynteny.git

Check that the installation worked:

pynteny --help
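The same check can be made from Python. The snippet below is a minimal sketch that only uses the standard library (importlib.metadata) to report which versions of pynteny, pyhmmer and pyrodigal are installed; the helper name installed_version is hypothetical and not part of Pynteny.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a pip distribution, or None if absent.

    Hypothetical helper for illustration; it is not part of Pynteny.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Distribution names taken from the pip commands in this section.
for name in ("pynteny", "pyhmmer", "pyrodigal"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If any of the three is reported as not installed, the pip install step above most likely ran in a different Python environment.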

2.1. Installing on Windows and macOS

Pynteny is developed and tested on Linux, but since it is now a pure-Python package, pip install pynteny also works on Windows and macOS (including Apple Silicon / ARM64), provided wheels are available for pyhmmer and pyrodigal on your platform.

3. General usage

Pynteny's main subcommand, pynteny search, requires a peptide (ORF) sequence database in fasta format in which each record label encodes the position of the sequence within its contig of origin. Labels must follow this format:

<genome ID>_<contig ID>_<gene position>_<locus start>_<locus end>_<strand>
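To make the label layout concrete, here is a minimal sketch of how such a label could be split into its fields. It is not Pynteny's own parser: the names GeneLabel and parse_label are hypothetical, the example label and its strand value ("pos") are made up, and the code assumes that only the genome ID may itself contain underscores.

```python
from typing import NamedTuple


class GeneLabel(NamedTuple):
    """Fields of a record label (hypothetical container, not Pynteny's API)."""
    genome_id: str
    contig_id: str
    gene_position: int
    locus_start: int
    locus_end: int
    strand: str  # kept as raw text; the exact strand encoding is not assumed


def parse_label(label: str) -> GeneLabel:
    # Split from the right so that underscores inside the genome ID survive.
    # Assumption: the contig ID and the remaining fields contain no underscores.
    parts = label.rsplit("_", 5)
    if len(parts) != 6:
        raise ValueError(f"expected 6 underscore-separated fields in {label!r}")
    genome_id, contig_id, position, start, end, strand = parts
    return GeneLabel(genome_id, contig_id, int(position), int(start), int(end), strand)


# Made-up label, for illustration only.
print(parse_label("my_genome_contig12_7_1500_2300_pos"))
```

Splitting from the right is a design choice for this sketch: the five trailing fields are the ones with a fixed shape, so any extra underscores are attributed to the genome ID.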

To facilitate the construction of this database, Pynteny provides the subcommand pynteny build, which takes as input a fasta file containing assembled nucleotide sequence data (a single genome or a collection of genomes), such as that retrieved from MAG reconstruction pipelines.

Additionally, pynteny search requires either the set of profile Hidden Markov Models (HMMs) used in the provided synteny structure or a database of profile HMMs from which to retrieve the necessary HMMs. You may provide your own set of HMMs, or download the entire PGAP HMM database through pynteny download. This subcommand downloads the database and stores its path for future use, so that pynteny search falls back to it by default whenever no additional HMMs are provided as an argument.

Once both the peptide database and the required HMMs are ready, you can query the peptide database with a text string encoding the query synteny structure such as the following:

\(>HMM_a \:\: n_{ab} \:\: > (HMM_{b1} | HMM_{b2}) \:\: n_{bc} \:\: < HMM_c.\)

Here, \(HMM_a\) is the name of the HMM to be used (the file name without the extension), and \(n_{ab}\) is an integer giving the maximum number of genes allowed between HMMs a and b. The symbols < and > indicate the strand on which to search for the HMM pattern: antisense and sense, respectively. More than one HMM can be used for a single gene in the structure, as indicated by the HMM group \((HMM_{b1} | HMM_{b2})\) above. In these cases, pynteny search looks for sequences that match any HMM contained in the group.
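The structure string can be read as an alternating sequence of genes and gap limits. The sketch below illustrates that reading; it is not Pynteny's parser. The function parse_synteny_structure is hypothetical, the example string is made up, and the code assumes that every gene carries a strand symbol and that HMM names consist of letters, digits, underscores, dots or hyphens.

```python
import re

# A gene token is a strand symbol followed by one HMM name or a
# parenthesised group of names; a gap token is a bare integer.
TOKEN = re.compile(
    r"(?P<strand>[<>])\s*(?P<group>\([^()]*\)|[\w.-]+)|(?P<gap>\d+)"
)


def parse_synteny_structure(structure):
    """Split a structure string into genes and maximum gaps (illustrative)."""
    genes, gaps = [], []
    for match in TOKEN.finditer(structure):
        if match.group("strand"):
            names = match.group("group").strip("()").split("|")
            genes.append({
                "strand": match.group("strand"),  # ">" sense, "<" antisense
                "hmms": [name.strip() for name in names],
            })
        else:
            gaps.append(int(match.group("gap")))
    # Only the counts are checked here, not the strict alternation.
    if len(gaps) != len(genes) - 1:
        raise ValueError("expected one gap value between each pair of genes")
    return genes, gaps


genes, gaps = parse_synteny_structure(">HMM_a 2 >(HMM_b1|HMM_b2) 1 <HMM_c")
for gene in genes:
    print(gene["strand"], gene["hmms"])
print("max genes between consecutive HMMs:", gaps)
```

In this made-up example, up to 2 genes may lie between the HMM_a hit and the hit for either HMM_b1 or HMM_b2, and up to 1 gene between that hit and HMM_c.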

4. Getting started and Examples

Here are some Jupyter Notebooks with examples to show how Pynteny works.

You can find more notebooks in the examples directory. Find more info in the documentation.

5. Contributing

Contributions are always welcome! If you don't know where to start, you may find an interesting issue to work on here. Please read our contribution guidelines first.

6. Citation

If you use this software, please cite it as below:

Semidán Robaina Estévez. (2022). Pynteny: synteny-aware hmm searches made easy (Version 0.0.4). Zenodo. https://doi.org/10.5281/zenodo.7048685