Semidán Robaina, September 2022.
In this Notebook, we will use Pynteny through its command-line interface to find candidate peptide sequences beloging to the sox operon within unannotated, fully sequenced, genomes of marine prokaryotic organisms. Note that we could have conducted the same search through Pynteny's python API. Find more info in the documentation pages. Let's start by importing some required modules.
from pathlib import Path
from IPython.display import display, HTML
import pandas as pd
Download PGAP profile HMM database¶
First, let's download PGAP's profile HMM database from the NCBI webpage. To this end, we will use pynteny subcommand download
, which will unzip and store files in the specified output directory. The metadata file will be parsed and filtered to remove HMM entries that are not available in the downloaded database (this is to avoid possible downstream errors).
%%bash
pynteny download --outdir data/hmms --unpack
Build peptide sequence database¶
For this example, we are going to use the MAR reference database (currently version v7), a collection of 970 fully sequenced prokaryotic genomes from the marine environment. Specifically, we will use the assembly data file containing the assembled nucleotide sequences.
Our final goal is to build a peptide sequence database in a single FASTA file where each record corresponds to an inferred ORF, which will display the positional information (i.e. ORF number within the parent contig as well as the DNA strand). To this end, we will run pynteny's subcommand build
, which will take care of:
- Predict and translate ORFs with prodigal
- Label each ORF with a unique identifier and add positional metadata (with respect to the parent contig)
To follow this example, you should have previously downloaded the assembly data file, assembly.fa
, from MAR ref. Here is what the first lines of assembly.fa
look like, each record corresponds to a single, assembled genome:
%%bash
head -n 4 data/MARref_v7/assembly.fa
>CP000435.1 Synechococcus sp. CC9311, complete genome ACATCGTTTCCCCTGTTTCCACAAGACCTACTACGGCTGTTTTCGTAGTTCTTTTAAGAGAATAAAAACAGCCCTAAAGC CGGGGAACACGAAAAAAACGTGAAACCATTGCGCTTCTCCCTTGCCTGTGAAATTGTGAGGAGAGATTTGTTCACGCCGT TGACTCGGACCTCATGAAATTGGTCTGTTCCCAGGCAGAACTCAACGCAGCTCTGCAGTTGGTCAGTCGGGCTGTCGCCT
Let's run pynteny build
to generate a peptide database labeled with positional information. The labels are organized following the structure:
<genome ID>__<contig ID>_<gene position>_<locus start>_<locus end>_<strand>
where gene position, locus start, and locus end are taken with respect to the contig.
%%bash
pynteny build \
--data data/MARref_v7/assembly.fa \
--outfile data/labeled_marref.fasta
Here are some position-labeled predicted peptides corresponding to the assembled genome displayed above (CP000435.1
):
%%bash
grep -A 1 "CP000435.1" data/labeled_marref.fasta | head -n 6
>CP000435.1_1__CP000435.1_1_174_1331_pos MKLVCSQAELNAALQLVSRAVASRPTHPVLANVLLTADAGTDRLSLTGFDLNLGIQTSLAASVDTSGAVTLPARLLGEIVSKLSSDSPVSLSSDAGADQVELTSSSGSYQMRGMPADDFPELPLVENGTALRVDPASLLKALRATLFASSGDEAKQLLTGVHLRFNQKRLEAASTDGHRLAMLTVEDALQAEISAEESEPDELAVTLPARSLREVERLMASWKGGDPVSLFCERGQVVVLAADQMVTSRTLEGTYPNYRQLIPDGFSRTIDLDRRAFISALERIAVLADQHNNVVRIATEPATGLVQISADAQDVGSGSESLPAEINGDAVQIAFNARYVLDGLKAMDCDRVRLSCNAPTTPAILTPANDDPGLTYLVMPVQIRT* >CP000435.1_2__CP000435.1_2_1435_2148_pos MAWMHPPVHRLLGWVSRPSALRTSRDVWRLDQCRGFDDQQVFVKGAPAEADQITLDRLPTLLDADLLNADGERVGIIADLAFLPASGQISHYLVARSDPRLPGTSRWRLLPDRIVDQQPGLVSSAIHELDDLPLARASVRQDFLQRSRHWREQLQQFGDRAGERLEGWLEEPPWDEPPAVSDVASSYSSTAAPTVDPLDDWDDGDWTDAPRVERGRSVRNDPTDRNDWPDHEEDPWV* >CP000435.1_3__CP000435.1_3_2185_4518_pos MTQSSHAVAAFDLGAALRQEGLTETDYSEIQRRLGRDPNRAELGMFGVMWSEHCCYRNSRPLLSGFPTEGPRILVGPGENAGVVDLGEGHHLAFKVESHNHPSAVEPFQGAATGVGGILRDIFTMGARPIALLNALRFGPLDEPATRGLVEGVVAGIAHYGNCVGVPTVGGEVAFDPSYRGNPLVNAMALGLMETDEIVRSGAAGVGNPVVYVGSTTGRDGMGGASFASAELSADSLDDRPAVQVGDPFLEKGLIEACLEAFQSGDVVAAQDMGAAGLTCSCSEMAAKGDVGVELDLDRVPAREKGMTAYEFLLSESQERMLFVVRAGREEQLMQRFRRWGLQAAVVGRVLEEKVVRVLQHGAVAAEVPARALAEDTPINKHELLSEPPDDIQTHWTWRESDLPSPAIDRDWNADLLRLLDDPTIASKRWIYRQYDQQVLANTVIRAGGADAAVVRLRPQQGDASLQMSQRGVAATLDCPNRWVALDPERGAIAAVAEAARNLSCVGAQPIAVTDNLNFPSPETSKGYWQLAMACRGLSHACRSMGTPVTGGNVSLYNETRADDGSLQPIHPTPVVGMVGLVEDLDRSGGLAWRQPGDFVVLLGVSTDEEGNESVGLAGSSYQGAVHGLLTGRPPSVDLELEGQVQALVRQAFAQGVLASAHDSSDGGLAVALAESTLASGLGVDLNLPHRSARLDRVLFAEGGARIVVSVRAEQRSAWHSLVASQEHRSVPVTEIGTVADHGCFRLAVEKHPVIDLAVETLREQYEQAVPRRLGAV*
Search synteny structure in MAR ref¶
Finally, we are going to use pynteny's search
subcommand to search for a specific syntenic block within the previously built peptide database. Specifically, we are interested in the sox operon:
>soxX 0 >soxY 0 >soxZ 0 >soxA 0 >soxB 0 >soxC
We this synteny structure, we are searching for peptide sequences matching the profile HMM corresponding to these gene symbols, which are also arranged in this particular order, all in the positive (sense) strand, as indicated by >
, and which are located exactly next to each other in the same contig (no ORFs allowed between them, as indicated by a maximum number of in-between ORFs of 0 in all cases.)
%%bash
pynteny parse \
--synteny_struc ">soxX 0 >soxY 0 >soxZ 0 >soxA 0 >soxB 0 >soxC" \
--hmm_meta data/hmms/hmm_meta.tsv
____ __ / __ \__ ______ / /____ ____ __ __ / /_/ / / / / __ \/ __/ _ \/ __ \/ / / / / ____/ /_/ / / / / /_/ __/ / / / /_/ / /_/ \__, /_/ /_/\__/\___/_/ /_/\__, / /____/ /____/ Synteny-based Hmmer searches made easy, v0.0.2 Semidán Robaina Estévez (srobaina@ull.edu.es), 2022 2022-09-29 15:19:30,733 | INFO: Translated ">soxX 0 >soxY 0 >soxZ 0 >soxA 0 >soxB 0 >soxC" to ">TIGR04485.1 0 >TIGR04488.1 0 >TIGR04490.1 0 >(TIGR01372.1|TIGR04484.1) 0 >(TIGR01373.1|TIGR04486.1) 0 >TIGR04555.1" according to provided HMM database metadata
We see that pynteny parse
has found a number of profile HMMs matching the gene symbols in the provided synteny structure. Additionally, in two cases it has found two HMMs matching a single gene symbol, which are displayed within parentheses and separated by "|":
>TIGR04485.1 0 >TIGR04488.1 0 >TIGR04490.1 0 >(TIGR01372.1|TIGR04484.1) 0 >(TIGR01373.1|TIGR04486.1) 0 >TIGR04555.1
In these cases, pynteny search
will match sequences by either or all of the HMMs in each group within parentheses.
Alright, now that we know that our HMM database contains models for all the gene symbols in our synteny structure, let's run pynteny search
to find matches in our peptide sequence database.
Some notes:
Since we are using gene symbols instead of HMM names, we need to add the flag
--gene_ids
We could have directly input the synteny string composed of HMM names. In that case, we wouldn't need to provide the path to the HMM metadata file (
--hmm_meta
) and we would remove the flag--gene_ids
Providing a path to the HMM database directory (
--hmm_dir
) is optional. If not provided, then pynteny will download and store the PGAP HMM database (only once if not previously downloaded) and use it to run the search. A custom HMM database provided in--hmm_dir
will override pynteny's default databaseWe can also manually download the PGAP HMM database with the subcommand
pynteny download
%%bash
pynteny search \
--synteny_struc ">soxX 0 >soxY 0 >soxZ 0 >soxA 0 >soxB 0 >soxC" \
--data data/labeled_marref.fasta \
--outdir example_cli/results/ \
--hmm_dir data/hmms/hmm_PGAP \
--hmm_meta data/hmms/hmm_meta.tsv \
--gene_ids
____ __ / __ \__ ______ / /____ ____ __ __ / /_/ / / / / __ \/ __/ _ \/ __ \/ / / / / ____/ /_/ / / / / /_/ __/ / / / /_/ / /_/ \__, /_/ /_/\__/\___/_/ /_/\__, / /____/ /____/ Synteny-based Hmmer searches made easy, v0.0.2 Semidán Robaina Estévez (srobaina@ull.edu.es), 2022 2022-09-29 15:11:25,776 | INFO: Finding matching HMMs for gene symbols 2022-09-29 15:11:25,947 | INFO: Found the following HMMs in database for given structure: >TIGR04485.1 0 >TIGR04488.1 0 >TIGR04490.1 0 >(TIGR01372.1|TIGR04484.1) 0 >(TIGR01373.1|TIGR04486.1) 0 >TIGR04555.1 2022-09-29 15:11:26,134 | INFO: Searching database by synteny structure 2022-09-29 15:11:26,135 | INFO: Running Hmmer 2022-09-29 15:14:35,607 | INFO: Filtering results by synteny structure 2022-09-29 15:15:58,818 | INFO: Writing matching sequences to FASTA files 2022-09-29 15:16:03,511 | INFO: Finished!
Pynteny has generated a number of output files in the provided output directory. HMMER3 hit results are stored within the subdirectory hmmer_outputs
. The main output file, synteny_matched.tsv
contains the labels of the matched sequences grouped by synteny block and sorted by gene number within their parent contig. The remaining (FASTA) files contain the retrieved peptide sequences for each gene symbol / HMM name in the synteny structure.
output_files = sorted([
file.name for file in Path("example_cli/").iterdir()
])
output_files
['hmmer_outputs', 'soxA_TIGR04484.1_hits.fasta', 'soxB_TIGR04486.1_hits.fasta', 'soxC_TIGR04555.1_hits.fasta', 'soxX_TIGR04485.1_hits.fasta', 'soxY_TIGR04488.1_hits.fasta', 'soxZ_TIGR04490.1_hits.fasta', 'synteny_matched.tsv']
Displaying the first synteny match¶
So far, we have identified sequences (putatively) belonging to the sox operon in the MAR ref database. However, we don't know to which organisms these sequences belong. Luckily, the MAR ref database provides a metadata file that contains GTDB taxonomical information for each genome. Let's extract that info and add it to Pynteny's output table.
# Assign species (GTDB) to each genome ID
meta = pd.read_csv("data/MARref_v7/MarRef_1.7.tsv", sep="\t")
def assign_tax(genome_id: str) -> str:
try:
return meta.loc[
meta['acc:genbank'].str.contains(f'ena.embl:{genome_id.split(".")[0]}'), 'tax:gtdb_classification'
].item().split(">")[-1]
except:
return ""
df = pd.read_csv("example_cli/results/synteny_matched.tsv", sep="\t")
df["taxonomy"] = df.contig.apply(assign_tax)
# Display main results
till_row = 5
display_cols = ["gene_id", "contig", "gene_symbol", "gene_number", "locus", "strand", "hmm", "taxonomy"]
display(HTML(df.loc[:till_row, display_cols].to_html()))
gene_id | gene_symbol | gene_number | locus | strand | hmm | taxonomy | |
---|---|---|---|---|---|---|---|
0 | CP000031.2_985 | soxX | 985 | (1045616, 1046089) | pos | TIGR04485.1 | s__Ruegeria_B pomeroyi |
1 | CP000031.2_986 | soxY | 986 | (1046132, 1046548) | pos | TIGR04488.1 | s__Ruegeria_B pomeroyi |
2 | CP000031.2_987 | soxZ | 987 | (1046582, 1046911) | pos | TIGR04490.1 | s__Ruegeria_B pomeroyi |
3 | CP000031.2_988 | soxA | 988 | (1046989, 1047840) | pos | TIGR04484.1 | s__Ruegeria_B pomeroyi |
4 | CP000031.2_989 | soxB | 989 | (1047975, 1049645) | pos | TIGR04486.1 | s__Ruegeria_B pomeroyi |
5 | CP000031.2_990 | soxC | 990 | (1049724, 1051007) | pos | TIGR04555.1 | s__Ruegeria_B pomeroyi |
Displayed above the first synteny match in our peptide database, we see that all peptides are located within the same parent contig and respect the positional restrictions of our input synteny structure. Furthermore, all sequences belong to Ruegeria pomeroyii, and alphaproteobacteria of the family Rhodobacteraceae.
Additional notes:
The previous results are strand-specific (all ORFs must be located in the positive or sense strand). However, we could have made them strand-agnostic by omitting the strand symbols in the synteny structure (i.e., using
soxX 0 soxY 0 soxZ 0 soxA 0 soxB 0 soxC
)We could have made the search even more general dropping the constraint on the arrangement by adding the flag
pynteny search --unordered
. In which case, Pynteny would match any group of 6 ORFs corresponding to the provided HMM names, located in the same contig and adjacent to each other, but not necessarily arranged in the same order displayed by the synteny structure. In other words,--unordered
enables searching for "true" synteny, as opposed to the, more restrictive, collinearity.
Alright, finally, let's get the peptide sequences in our original input database that correspond to the identified synteny block displayed above:
# Get original peptide sequences
hit_labels = df.loc[:till_row, "full_label"].values
grep_labels = "|".join(hit_labels)
%%bash -s "$grep_labels"
grep -A 1 -E $1 data/labeled_marref.fasta
>CP000031.2_985__CP000031.2_985_1045616_1046089_pos MKTTILTLAAALISGAAWAGETAPGDVVYADGAVEASLTGTPGDAANGAMVVGSKKHGNCVACHQVGALADVPFQGEIGPALDGAGSRWSEAELRGLVANAKLTFEGSMMPSFYRIDGYIRPGDAYTGKAAKGALTPLLSAQEIEDVVAFLATLKDE* >CP000031.2_986__CP000031.2_986_1046132_1046548_pos MDFSRRDTLGLALGAAALTVLPFRVNAAAEDRIAEFTGGAEMGEGGLTLTAPEIAENGNTVPIEVSAPGAAAIMVLAMGNPTPGVAQFNFGPLAAAQAASTRIRLAGTQDVVAIAKMADGSFVKASSTVKVTIGGCGG* >CP000031.2_987__CP000031.2_987_1046582_1046911_pos MASGVKPRVKVPKSVAAGEAITIKTLISHAMESGQRKDKEGNVIPRSIINRFTCEFNGQSVIDITMEPAISTNPYFQFDATVPEAGEFVFTWYDDDGSVYNDNKSITIA* >CP000031.2_988__CP000031.2_988_1046989_1047840_pos MKVRAMTAIAALLAAPLAAVAGPDSDELVVNGEINMVTQTEAPAHLDGALSELYSGWRFRSDETQALQMDDFDNPAMVFVDQAQEAWDTADGTEGKSCASCHGDAADSMAGVRAVYPKWNEAAGEVRTLEAQVNDCRENRMGAKAWKYDGGDMASMTALISVQSRGLPVNVAIDGPAQATWEMGKEIYYTRYGQLELSCANCHEDNYGNMIRADHLSQGHINGFPTYRLKNAKLNTSHARFKGCVRDTRAETFNPGSPEFVALELYVASRGNGLSVEGPSVRN* >CP000031.2_989__CP000031.2_989_1047975_1049645_pos MAASALVGASGFGNWSRLAAQQALTQDQLLEFDTFGNLTLIHITDIHAQLMPIYFREPEVNLGIGAAKGQMPHITGADFRRFYGIEDGSPSAYALTYDDFSSLARTYGRVGGMDRVANVVNAIRADRPDALLLDGGDTWHGSYTCHHTEGQDVVNVMNALKPDAMTFHWEFTLGTDRVTELVESLPFASLGQNIFDAEWDEPAELFKPYKFFERGGVKVAVIGQAFPYMPIANPGWMFPEFSFGIREENMQAMVDEVRAEGAEVVVVLSHNGFDVDKKMAGRVTGIDVILSGHTHDALPEPVLVEQTYVIASGSNGKFVSRVDLDIRDGRMMGLKHKLIPIFSDVIAPDPQVTALINAQREPHIDQLREVIGRTADDALLYRRGNFNGTWDDLICDALISERDADIAMSPGVRWGPSILPGQEITREDIWNVTSMSYGAAYRTEMTGEFLHVVLEDVGDNLFNPDPYYQQGGDMVRVGGLGYRIDVTKPQGSRISDMTLLKTGEQIDPAKNYVVAGWASVNEGTEGPQIWDVVENHIRKLGTVNVTPNTSVEVVGA* >CP000031.2_990__CP000031.2_990_1049724_1051007_pos MSESSGKHSPSRRRFLTSVAAAGAGAVAAGAARAAGPDPLITEVQDWARATGDGVDATPYGLPIEYEGDVIRRNVEWLTADTISSINFTPIHALDGTITPQGCAFERHHSGAIALKKEDYRLMINGLVDTPLVFSYADLERFPRENHVYFCECAANSGMEWAGAQLNGAQFTHGMIHNMEYSGVSLRTLLEEAGLSAAGDLKDKWVYVEGADASSNGRSIPMEKALDDVLVAFKANGEALRKEHGYPVRLVVPGWEGNMWVKWLRRIEVMDGPVESREETSKYTDTLEDGTSRKWTWVMDAKSVVTSPSPQAPITHGAGPLVITGLAWSGRGAITGVDVSIDGGKSWQPARLAAPGQDKALTRFYLDTNWDGSEMLLQSRARDSSGYVQPTKAQLREVRGLNSIYHNNAIQTWWVKANGVAENVEVS*
There you go! All six peptide sequences corresponding to the synteny block >soxX 0 >soxY 0 >soxZ 0 >soxA 0 >soxB 0 >soxC
. We have come from unannotated nucleotide assembly data to identify and locate (part of) the sox operon of Ruegeria pomeroyi. Not bad...
Here is the KEGG genome visualization centered around the sox operon. Click on the image to open it in the browser, then change gene identifiers from KID
to Symbol
to display gene symbols. The sox operon is depicted in purple.