Semidán Robaina, September 2022.
In this Notebook, we will use Pynteny through its Python API to find candidate peptide sequences beloging to the leu operon of Escherichia coli.
Note that we could have conducted the same search through Pynteny's command-line interface. Check out a more complete example based on Pynteny's command-line interface here.
Find more info in the documentation pages!
Let's start by importing some required modules.
from pathlib import Path
from pandas import DataFrame
from pynteny.filter import SyntenyHits
from pynteny import Search, Build, Download
Let's now create a directory to store results
Path("example_api/data").mkdir(exist_ok=False, parents=True)
Download PGAP profile HMM database¶
Pynteny downloads the PGAP's profile HMM database by default from the NCBI webpage when no path to a HMM database is provided. However, we can also manually download PGAP's database within Python through the class Download
, which will unzip and store files in the specified output directory. The metadata file will be parsed and filtered to remove HMM entries that are not available in the downloaded database (this is to avoid possible downstream errors). Here is how you would run it:
# Optional if PGAP's database has already been downloaded by Pynteny
Download(
outdir="example_api/data/hmms",
unpack=True
).run()
Build peptide sequence database¶
For this example, we are going to use the complete genome of E. coli's K-12 MG1655 in genbank format. Our final goal is to build a peptide sequence database in a single FASTA file where each record corresponds to an inferred ORF, which will display the positional information (i.e. ORF number within the parent contig as well as the DNA strand). To this end, we will run pynteny's subcommand build
within Python through the class Build
.
Since we are providing a genome which is already annotated (genbank file), we don't need to predict and translate ORFs as in the command-line example. Instead, Pynteny will directly label each ORF with a unique identifier and add positional metadata (with respect to the parent contig). The labels will be organized following the structure:
<genome ID>__<contig ID>_<gene position>_<locus start>_<locus end>_<strand>
where gene position, locus start, and locus end are taken with respect to the contig.
NOTE: You'll need E. coli's genome to follow this example. It's already downloaded in the repo (tests/test_data/MG1655.gb
), but you can also download it here.
Build(
data="../../tests/test_data/MG1655.gb",
outfile="example_api/data/labelled_MG1655.fasta",
logfile=None
).run()
2023-01-31 10:14:01,795 | INFO: Building annotated peptide database 2023-01-31 10:14:02,289 | INFO: Parsing GenBank data. 2023-01-31 10:14:02,705 | INFO: Database built successfully!
Search synteny structure in E. coli¶
Finally, we are going to use pynteny class Search
to search for a specific syntenic block within the previously built peptide database. Specifically, we are interested in the following structure:
<leuD 0 <leuC 1 <leuA
With this synteny structure, we are searching for peptide sequences matching the profile HMM corresponding to these gene symbols, which are also arranged in this particular order, all in the negative (antisense) strand, as indicated by <
, and which are located exactly next to each other in the same contig in the case of leuD
and leuC
, and with at most one ORF in between in the case of leuC
and leuA
(as indicated by a maximum number of in-between ORFs of 0 and 1, respectively.)
First, we need to initialize the class Search
with the appropiate parameters to conduct our synteny-aware search. Find more info about the parameters in the wiki pages.
Some notes:
The only required parameters are
data
, the path to the position-labeled peptide database andsynteny_struc
, a string containing the definition of the synteny block to search forProviding a path to the HMM database directory (parameter
hmm_dir
) is optional. If not provided, then pynteny will download and store the PGAP HMM database (only once if not previously downloaded) and use it to run the search. A custom HMM database provided inhmm_dir
will override pynteny's default databaseWe can also manually download the PGAP HMM database with the subcommand
pynteny download
, or within Python through the classDownload
as shown above.
# Initialize class
search = Search(
data="example_api/data/labelled_MG1655.fasta",
synteny_struc="<leuD 0 <leuC 1 <leuA",
hmm_dir=None,
hmm_meta=None,
outdir="example_api/",
prefix="",
hmmsearch_args=None,
gene_ids=False,
logfile="example_api/pynteny.log",
processes=None,
unordered=False,
)
# Parse gene IDs in synteny structure according to PGAP HMM database metadata
parsed_struc = search.parse_genes(synteny_struc="<leuD 0 <leuC 1 <leuA")
2022-10-04 12:02:26,644 | INFO: Translated "<leuD 0 <leuC 1 <leuA" to "<TIGR00171.1 0 <TIGR00170.1 1 <TIGR00973.1" according to provided HMM database metadata
We see that pynteny parse
has found three profile HMMs matching the corresponding gene symbols in the provided synteny structure:
<TIGR00171.1 0 <TIGR00170.1 1 <TIGR00973.1
Alright, now that we know that our HMM database contains models for all the gene symbols in our synteny structure, let's execute Search.run
to find matches in our peptide sequence database.
Some notes:
- We could have directly input the synteny string composed of gene symbols. In that case we would have to set
gene_ids=True
with the methodSearch.update
.
# Update parsed synteny structure and Rrun Pynteny search
search.update("synteny_struc", parsed_struc)
synhits: SyntenyHits = search.run()
synhits_df: DataFrame = synhits.hits
2022-10-04 12:02:29,054 | INFO: Searching database by synteny structure 2022-10-04 12:02:29,054 | INFO: Running Hmmer 2022-10-04 12:02:29,055 | INFO: Reusing Hmmer results for HMM: TIGR00973.1 2022-10-04 12:02:29,058 | INFO: Reusing Hmmer results for HMM: TIGR00171.1 2022-10-04 12:02:29,060 | INFO: Reusing Hmmer results for HMM: TIGR00170.1 2022-10-04 12:02:29,062 | INFO: Filtering results by synteny structure 2022-10-04 12:02:29,091 | INFO: Writing matching sequences to FASTA files 2022-10-04 12:02:29,146 | INFO: Finished!
Displaying the first synteny match¶
Pynteny has generated a number of output files in the provided output directory. HMMER3 hit results are stored within the subdirectory hmmer_outputs
. The main output file, synteny_matched.tsv
contains the labels of the matched sequences grouped by synteny block and sorted by gene number within their parent contig. The remaining (FASTA) files contain the retrieved peptide sequences for each gene symbol / HMM name in the synteny structure.
Displayed below is the first synteny match in our peptide database, we see that all peptides are located within the same parent contig and respect the positional restrictions of our input synteny structure: <leuD 0 <leuC 1 <leuA
.
synhits_df.head()
contig | gene_id | gene_number | locus | strand | full_label | hmm | gene_symbol | label | product | ec_number | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | U00096 | b0071 | 71 | (78847, 79453) | neg | b0071__U00096_71_78847_79453_neg | TIGR00171.1 | leuD | leuD | 3-isopropylmalate dehydratase small subunit | 4.2.1.33 |
1 | U00096 | b0072 | 72 | (79463, 80864) | neg | b0072__U00096_72_79463_80864_neg | TIGR00170.1 | leuC | leuC | 3-isopropylmalate dehydratase large subunit | 4.2.1.33 |
2 | U00096 | b0074 | 74 | (81957, 83529) | neg | b0074__U00096_74_81957_83529_neg | TIGR00973.1 | leuA | leuA_bact | 2-isopropylmalate synthase | 2.3.3.13 |
Get citation¶
We can get the citation string by calling the cite
method:
Search.cite()
Semidán Robaina Estévez (2022). Pynteny: synteny-aware hmm searches made easy(Version 0.0.2). Zenodo. https://doi.org/10.5281/zenodo.7048685