Module pipelines
Preset pipelines to infer trees and to place and label environmental sequences
QueryLabeller
Place queries onto reference tree and assign function and taxonomy
Source code in src/metatag/pipelines.py
assign_directory: Path
property
Path to directory with assignment results.
count_directory: Path
property
Path to directory with count results.
jplace: Path
property
Path to output file with placements in jplace format.
logfile: Path
property
Path to logfile.
place_directory: Path
property
Path to directory with placement results.
placements_tree: Path
property
Path to output file with placements in Newick format.
query_labels: Path
property
Path to output file with query labels.
taxtable: Path
property
Path to output file with taxonomic assignments.
__init__(input_query, reference_alignment, reference_tree, reference_labels, tree_model, tree_clusters=None, tree_cluster_scores=None, tree_cluster_score_threshold=None, alignment_method='papara', output_directory=None, maximum_placement_distance=1.0, distance_measure='pendant_diameter_ratio', minimum_placement_lwr=0.8, logfile=None)
Place queries onto reference tree and assign function and taxonomy
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_query | Path | path to query FASTA file | required |
reference_alignment | Path | path to reference alignment in FASTA format | required |
reference_tree | Path | path to reference tree in Newick format | required |
tree_model | str | substitution model to use for tree inference | required |
tree_clusters | Path | path to tsv file containing tree cluster definitions. Defaults to None. | None |
tree_cluster_scores | Path | path to tsv file containing tree cluster scores. Defaults to None. | None |
reference_labels | Path | path to reference labels file in pickle format | required |
alignment_method | str | choose aligner: "papara" or "hmmalign". Defaults to "papara". | 'papara' |
output_directory | Path | path to output directory. Defaults to None. | None |
maximum_placement_distance | float | maximum distance of placed sequences, measured with distance_measure (see below). Defaults to 1.0. | 1.0 |
distance_measure | str | choose distance measure for placements: "pendant_diameter_ratio", "pendant_distal_ratio" or "pendant". Defaults to "pendant_diameter_ratio". | 'pendant_diameter_ratio' |
minimum_placement_lwr | float | cutoff value for the likelihood weight ratio (LWR) of placements. Defaults to 0.8. | 0.8 |
logfile | Path | path to logfile. Defaults to None. | None |
Source code in src/metatag/pipelines.py
run()
Run pipeline to annotate query sequences through evolutionary placement.
Source code in src/metatag/pipelines.py
QueryProcessor
Preprocess query sequences: remove duplicates, translate and relabel if needed, prefilter sequences by HMM.
Source code in src/metatag/pipelines.py
filtered_query: Path
property
Get path to filtered query.
logfile: Path
property
Get path to log file.
output_directory: Path
property
Get output directory.
__init__(input_query, hmms=None, minimum_sequence_length=30, maximum_sequence_length=None, idprefix='query_', relabel=False, translate=False, export_dup=False, output_directory=None, hmmsearch_args=None, logfile=None)
Preprocess query sequences
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_query | Path | path to query sequences | required |
hmms | list[Path] | list of paths to HMMs to prefilter query sequences. Defaults to None. | None |
minimum_sequence_length | int | minimum length for sequences to be kept. Defaults to 30. | 30 |
maximum_sequence_length | int | maximum length for sequences to be kept. Defaults to None. | None |
idprefix | str | prefix for sequence IDs. Defaults to "query_". | 'query_' |
relabel | bool | relabel sequences to short IDs. Defaults to False. | False |
translate | bool | translate DNA sequences to peptide. Defaults to False. | False |
export_dup | bool | export duplicated sequences. Defaults to False. | False |
output_directory | Path | path to output directory. Defaults to None. | None |
logfile | Path | path to logfile. Defaults to None. | None |
Source code in src/metatag/pipelines.py
run()
Run pipeline to preprocess query sequences
Source code in src/metatag/pipelines.py
ReferenceTreeBuilder
Reconstruct reference phylogenetic tree from sequence database and hmms
Source code in src/metatag/pipelines.py
cleaned_database: Path
property
Get path to cleaned database.
logfile: Path
property
Get path to log file.
output_directory: Path
property
Get output directory.
reference_alignment: Path
property
Get path to reference alignment.
reference_database: Path
property
Get path to reference database.
reference_labels: Path
property
Get path to reference labels.
reference_tree: Path
property
Get path to reference tree.
__init__(input_database, hmms, maximum_hmm_reference_sizes=None, minimum_sequence_length=30, maximum_sequence_length=None, output_directory=None, translate=False, relabel=True, remove_duplicates=True, relabel_prefixes=None, hmmsearch_args=None, tree_method='fasttree', tree_model='iqtest', msa_method='muscle', skip_preprocess=False, logfile=None)
Reconstruct reference phylogenetic tree from sequence database and hmms
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_database | Path | path to input sequence database in FASTA format | required |
hmms | list[Path] | list of paths to HMM files | required |
maximum_hmm_reference_sizes | list[int] | list of maximum database sizes, one for each HMM. Defaults to None. | None |
minimum_sequence_length | int | minimum length of sequences in final database. Defaults to 30. | 30 |
maximum_sequence_length | int | maximum length of sequences in final database. Defaults to None. | None |
output_directory | Path | path to output directory. Defaults to None. | None |
translate | bool | whether to translate input (DNA) sequences. Defaults to False. | False |
relabel | bool | whether to relabel records in reference database with provisional short labels. Defaults to True. | True |
remove_duplicates | bool | whether to remove duplicated sequences in database. Defaults to True. | True |
relabel_prefixes | list[str] | list of prefixes to be added to each relabelled record, one for each HMM-derived database. Defaults to None. | None |
hmmsearch_args | str | additional arguments to hmmsearch as a string. Defaults to None. | None |
tree_method | str | choose tree inference method: "iqtree" or "fasttree". Defaults to "fasttree". | 'fasttree' |
tree_model | str | choose method to select substitution model: "iqtest", "modeltest", or a valid substitution model name (compatible with EPA-ng). Defaults to "iqtest". | 'iqtest' |
msa_method | str | choose msa method for reference database: "muscle" or "mafft". Defaults to "muscle". | 'muscle' |
skip_preprocess | bool | whether to skip preprocessing step. Defaults to False. | False |
logfile | Path | path to logfile. Defaults to None. | None |
Source code in src/metatag/pipelines.py
run()
Run pipeline to build reference tree.
Source code in src/metatag/pipelines.py
Module database.preprocessing
Tools to preprocess sequence databases
assert_correct_sequence_format(fasta_file, output_file=None, is_peptide=True)
Filter out (DNA or peptide) sequences containing illegal characters
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fasta_file | Path | path to input FASTA file | required |
output_file | Path | path to output file. Defaults to None. | None |
is_peptide | bool | whether FASTA contains peptide sequences. Defaults to True. | True |
Source code in src/metatag/database/preprocessing.py
fasta_contains_nucleotide_sequences(fasta_file)
Check whether fasta file contains nucleotide sequences
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fasta_file | Path | path to input FASTA file | required |
Returns:
Name | Type | Description |
---|---|---|
bool | bool | whether FASTA contains nucleotide sequences |
Source code in src/metatag/database/preprocessing.py
is_fasta(filename)
Check whether file is of type FASTA
Parameters:
Name | Type | Description | Default |
---|---|---|---|
filename | Path | path to input file | required |
Returns:
Name | Type | Description |
---|---|---|
bool | bool | whether the file is of type FASTA |
Source code in src/metatag/database/preprocessing.py
is_legit_dna_sequence(record_seq)
Check that DNA sequence contains only valid symbols
Parameters:
Name | Type | Description | Default |
---|---|---|---|
record_seq | str | record sequence as a string | required |
Returns:
Name | Type | Description |
---|---|---|
bool | bool | whether sequence corresponds to DNA |
Source code in src/metatag/database/preprocessing.py
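The symbol check described above can be sketched in a few lines. This is an illustration only: the function name `looks_like_dna` and the alphabet `DNA_SYMBOLS` are assumptions made for the example and are not taken from metatag's source, which may accept a different set of symbols.

```python
# Hypothetical alphabet: the four bases plus N for unknown positions.
# metatag's actual set of legal DNA symbols is not shown in this page.
DNA_SYMBOLS = frozenset("ACGTN")


def looks_like_dna(record_seq, alphabet=DNA_SYMBOLS):
    """Return True if the sequence is non-empty and uses only symbols in alphabet."""
    seq = record_seq.upper()  # treat lower-case input the same as upper-case
    return bool(seq) and set(seq) <= alphabet
```

The same pattern would apply to peptide sequences by swapping in an amino acid alphabet.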
is_legit_peptide_sequence(record_seq)
Check that peptide sequence contains only valid symbols
Parameters:
Name | Type | Description | Default |
---|---|---|---|
record_seq | str | record sequence as a string | required |
Returns:
Name | Type | Description |
---|---|---|
bool | bool | whether sequence corresponds to a peptide |
Source code in src/metatag/database/preprocessing.py
merge_fastas(input_fastas_dir, output_fasta=None)
Merge input fasta files into a single fasta
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_fastas_dir | Path | path to directory containing input FASTA files | required |
output_fasta | Path | path to output file. Defaults to None. | None |
Source code in src/metatag/database/preprocessing.py
remove_duplicates_from_fasta(input_fasta, output_fasta=None, export_duplicates=False, duplicates_file=None)
Removes duplicate entries (either by sequence or ID) from fasta.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_fasta | Path | path to input FASTA file | required |
output_fasta | Path | path to output file. Defaults to None. | None |
export_duplicates | bool | whether to export a file containing duplicated sequences. Defaults to False. | False |
duplicates_file | Path | path to duplicates output file. Defaults to None. | None |
Source code in src/metatag/database/preprocessing.py
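As a rough illustration of removing duplicates "either by sequence or ID", the sketch below works on in-memory (ID, sequence) tuples instead of FASTA files. The function name `drop_duplicate_records` and the keep-first policy are assumptions for this example; the actual implementation may differ.

```python
def drop_duplicate_records(records):
    """Split records into unique and duplicated entries.

    records: iterable of (record_id, sequence) tuples.
    A record counts as a duplicate if its ID or its sequence was already seen;
    the first occurrence is kept (assumed policy).
    """
    seen_ids, seen_seqs = set(), set()
    unique, duplicates = [], []
    for record_id, seq in records:
        if record_id in seen_ids or seq in seen_seqs:
            duplicates.append((record_id, seq))
            continue
        seen_ids.add(record_id)
        seen_seqs.add(seq)
        unique.append((record_id, seq))
    return unique, duplicates
```

Returning the duplicates separately mirrors the export_duplicates option: they can be written to a second file or simply discarded.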
set_original_record_ids_in_fasta(input_fasta, label_dict=None, output_fasta=None)
Replace temporary record IDs with the original IDs
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_fasta | Path | path to input FASTA file | required |
label_dict | dict | dictionary mapping labels to short IDs. Defaults to None. | None |
output_fasta | Path | path to output file. Defaults to None. | None |
Source code in src/metatag/database/preprocessing.py
set_temp_record_ids_in_fasta(input_fasta, output_dir=None, prefix=None, output_fasta=None, output_dict=None)
Replace record IDs with numbers and store them in a dictionary
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_fasta | Path | path to input FASTA file | required |
output_dir | Path | path to output directory. Defaults to None. | None |
prefix | str | prefix to record names in output FASTA. Defaults to None. | None |
output_fasta | Path | path to output FASTA file. Defaults to None. | None |
output_dict | Path | path to output dictionary file. Defaults to None. | None |
Source code in src/metatag/database/preprocessing.py
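The relabelling round trip (temporary numeric IDs, then back to the original IDs) can be sketched as follows. The function names and the direction of the mapping (short ID as key, original label as value) are assumptions made for the example and are not confirmed by this page.

```python
def assign_temp_ids(record_ids, prefix=""):
    """Map prefixed running numbers to the original record IDs.

    Returns a dict {short_id: original_id} (assumed direction of the mapping).
    """
    return {f"{prefix}{n}": original for n, original in enumerate(record_ids)}


def restore_original_ids(short_ids, label_dict):
    """Translate short IDs back; IDs missing from label_dict are left unchanged."""
    return [label_dict.get(short_id, short_id) for short_id in short_ids]
```

Short, uniform IDs avoid problems with tools that truncate or reject long FASTA headers, which is presumably why the labels are swapped before alignment and tree inference.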
write_record_names_to_file(input_fasta, filter_by_tag=None, output_file=None)
Write a txt file containing a list of record IDs in fasta
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_fasta | Path | path to input FASTA file | required |
filter_by_tag | str | set to str containing a pattern to match in record labels. In this case, only matched record labels are returned. Defaults to None. | None |
output_file | Path | path to output file. Defaults to None. | None |
Source code in src/metatag/database/preprocessing.py
Module database.manipulation
Tools to create peptide-specific sequence databases
convert_fasta_aln_to_phylip(input_fasta_aln, output_phylip=None)
Convert alignments in Fasta to Phylip.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_fasta_aln | Path | path to input alignment file | required |
output_phylip | Path | path to output file. Defaults to None. | None |
Source code in src/metatag/database/manipulation.py
convert_phylip_to_fasta_aln(input_phylip, output_file=None)
Convert alignments in Phylip to Fasta format
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_phylip | Path | path to input alignment file | required |
output_file | Path | path to output file. Defaults to None. | None |
Source code in src/metatag/database/manipulation.py
convert_stockholm_to_fasta_aln(input_stockholm, output_fasta=None)
Convert alignment file in Stockholm format to fasta
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_stockholm | Path | path to input alignment file | required |
output_fasta | Path | path to output file. Defaults to None. | None |
Source code in src/metatag/database/manipulation.py
filter_fasta_by_hmm(hmm_model, input_fasta, output_fasta=None, hmmer_output=None, method='hmmsearch', additional_args=None)
Generate protein-specific database by filtering sequence database to contain only sequences corresponding to the protein of interest
Parameters:
Name | Type | Description | Default |
---|---|---|---|
hmm_model | Path | path to hmm model | required |
input_fasta | Path | path to input FASTA file | required |
output_fasta | Path | path to output file. Defaults to None. | None |
hmmer_output | Path | path to hmmer output directory. Defaults to None. | None |
method | str | choose hmmer method: "hmmsearch" or "hmmscan". Defaults to "hmmsearch". | 'hmmsearch' |
additional_args | str | additional arguments to hmmsearch/scan as a string. Defaults to None. | None |
Source code in src/metatag/database/manipulation.py
filter_fasta_by_ids(input_fasta, record_ids, output_fasta=None)
Filter records in fasta file matching provided IDs
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_fasta | Path | path to input FASTA file | required |
record_ids | list | list of record IDs to filter FASTA file by | required |
output_fasta | Path | path to output file. Defaults to None. | None |
Source code in src/metatag/database/manipulation.py
filter_fasta_by_sequence_length(input_fasta, min_length=None, max_length=None, output_fasta=None)
Filter sequences by length in fasta file
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_fasta | Path | path to input FASTA file | required |
min_length | int | minimum length of sequences to keep. Defaults to None. | None |
max_length | int | maximum length of sequences to keep. Defaults to None. | None |
output_fasta | Path | path to output file. Defaults to None. | None |
Source code in src/metatag/database/manipulation.py
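A minimal sketch of length filtering with optional bounds is given here for orientation. It operates on (ID, sequence) tuples instead of FASTA files, and the name `filter_by_length` and the inclusive bounds are assumptions made for the example.

```python
def filter_by_length(records, min_length=None, max_length=None):
    """Keep records whose sequence length lies within the given bounds.

    records: iterable of (record_id, sequence) tuples.
    A bound set to None is not applied; bounds are treated as inclusive (assumption).
    """
    kept = []
    for record_id, seq in records:
        if min_length is not None and len(seq) < min_length:
            continue
        if max_length is not None and len(seq) > max_length:
            continue
        kept.append((record_id, seq))
    return kept
```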
get_fasta_record_ids(fasta_file)
Extract record ids from fasta
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fasta_file | Path | path to input FASTA file | required |
Returns:
Name | Type | Description |
---|---|---|
set | set | set of record IDs in FASTA file |
Source code in src/metatag/database/manipulation.py
parse_hmmsearch_output(hmmer_output)
Parse hmmsearch or hmmscan summary table output file
Parameters:
Name | Type | Description | Default |
---|---|---|---|
hmmer_output | Path | path to hmmsearch or hmmscan summary table output file | required |
Source code in src/metatag/database/manipulation.py
split_reference_from_query_alignments(ref_query_msa, ref_ids=None, ref_prefix=None, output_dir=None)
Separate reference sequences from query sequences in msa fasta file
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ref_query_msa | Path | path to input papara/hmmalign alignment file | required |
ref_ids | set | IDs of reference sequences. Defaults to None. | None |
ref_prefix | str | prefix employed by all reference sequences (use instead of ref_ids). Defaults to None. | None |
output_dir | Path | path to output directory. Defaults to None. | None |
Source code in src/metatag/database/manipulation.py
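The two ways of identifying reference sequences (an explicit set of IDs or a shared prefix) can be sketched as below. This is an illustration on in-memory (ID, aligned sequence) tuples; the name `split_ref_and_query` and the precedence of ref_ids over ref_prefix are assumptions, not metatag's documented behaviour.

```python
def split_ref_and_query(records, ref_ids=None, ref_prefix=None):
    """Separate aligned records into reference and query lists.

    records: iterable of (record_id, aligned_sequence) tuples.
    ref_ids takes precedence over ref_prefix when both are given (assumption).
    """
    if ref_ids is None and ref_prefix is None:
        raise ValueError("Provide either ref_ids or ref_prefix")

    def is_reference(record_id):
        if ref_ids is not None:
            return record_id in ref_ids
        return record_id.startswith(ref_prefix)

    references, queries = [], []
    for record_id, seq in records:
        (references if is_reference(record_id) else queries).append((record_id, seq))
    return references, queries
```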
Module database.reduction
Tools to reduce the size of the peptide-specific reference database
Currently based on: 1. CD-HIT 2. Repset: https://onlinelibrary.wiley.com/doi/10.1002/prot.25461
get_representative_set(input_seqs, input_pi, max_size=None, output_file=None)
Runs repset.py to obtain a representative set of at most max_size sequences (fewer if the input contains fewer than max_size sequences). If max_size is set to None, returns a list of representative sequences ordered by 'representativeness'.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_seqs | Path | path to input FASTA file containing sequences | required |
input_pi | Path | path to input file containing pairwise identity | required |
max_size | int | maximum number of sequences in reduced database. Defaults to None. | None |
output_file | Path | path to output file. Defaults to None. | None |
Source code in src/metatag/database/reduction.py
reduce_database_redundancy(input_fasta, output_fasta=None, cdhit=True, maxsize=None, cdhit_args=None)
Reduce redundancy of peptide database. If cdhit is set, CD-HIT is run first; additional arguments to CD-HIT may be passed as a string (cdhit_args). Repset is then run to obtain a final database containing no more than maxsize sequences. If maxsize is None, repset is not run.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_fasta | Path | path to input FASTA file. | required |
output_fasta | Path | path to output, reduced fasta. Defaults to None. | None |
cdhit | bool | whether to use CD-HIT alongside repset. Defaults to True. | True |
maxsize | int | maximum number of sequences in final database. Defaults to None. | None |
cdhit_args | str | additional arguments to CD-HIT. Defaults to None. | None |
Source code in src/metatag/database/reduction.py
Module database.labelparsers
Tools to parse sequence labels from different databases
LabelParser
Source code in src/metatag/database/labelparsers.py
__init__(label)
Parse labels to extract genome ID and metadata
Parameters:
Name | Type | Description | Default |
---|---|---|---|
label | str | label to be parsed | required |
Source code in src/metatag/database/labelparsers.py
extract_genome_id()
Extract genome ID from sequence label
Returns:
Name | Type | Description |
---|---|---|
str | str | Genome ID |
Source code in src/metatag/database/labelparsers.py
extract_mmp_id()
Extract MMP ID from sequence label
Returns:
Name | Type | Description |
---|---|---|
str | str | MMP ID |
Source code in src/metatag/database/labelparsers.py
extract_oceanmicrobiome_id()
Extract OceanMicrobiome ID from sequence label
Returns:
Name | Type | Description |
---|---|---|
str | str | OceanMicrobiome ID |
Source code in src/metatag/database/labelparsers.py
extract_taxid()
Extract NCBI taxon ID from sequence label
Returns:
Name | Type | Description |
---|---|---|
str | str | NCBI taxon ID |
Source code in src/metatag/database/labelparsers.py
Module alignment
Tools to perform multiple sequence alignments
align_peptides(input_fasta, method='muscle', output_file=None, additional_args=None)
Perform MSA on reference peptide sequences. Outputs in format fasta.aln
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_fasta | Path | path to fasta file containing reference sequences | required |
method | str | choose alignment method: "muscle" or "mafft". Defaults to "muscle". | 'muscle' |
output_file | Path | path to output alignment file. Defaults to None. | None |
additional_args | str | additional arguments to aligner. Defaults to None. | None |
Source code in src/metatag/alignment.py
align_short_reads_to_reference_msa(ref_msa, query_seqs, method='papara', tree_nwk=None, output_dir=None)
Align short read query sequences to reference MSA (fasta format). Outputs fasta msa alignment between query and reference sequences
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ref_msa | Path | path to reference MSA in fasta format | required |
query_seqs | Path | path to query sequences in fasta format | required |
method | str | choose aligner: "hmmalign" or "papara". Defaults to "papara". | 'papara' |
tree_nwk | Path | path to reference tree in newick format. Defaults to None. | None |
output_dir | Path | path to output directory. Defaults to None. | None |
Source code in src/metatag/alignment.py
Module phylotree
Tools to perform phylogenetic tree reconstructions and query sequence placements onto trees
get_iq_tree_model_from_log_file(iqtree_log)
Parse iqtree log file and return best fit model
If a model was supplied, it is read from the command recorded in the log (Command: iqtree ... -m 'model'). If the command used -m TEST or -m MFP instead, the model is read from the line: Best-fit model: 'model' chosen according to BIC
Parameters:
Name | Type | Description | Default |
---|---|---|---|
iqtree_log | Path | path to iqtree log file | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | best substitution model employed by iqtree |
Source code in src/metatag/phylotree.py
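The two-step lookup described for the iqtree log can be sketched as follows. This is only an illustration that works on the log text directly: the name `model_from_iqtree_log` is hypothetical, and the log layout is assumed to match the two lines quoted in the description ("Command: ..." and "Best-fit model: ...").

```python
import re


def model_from_iqtree_log(log_text):
    """Return the substitution model recorded in an iqtree log, or None.

    1. If the "Command:" line passes an explicit model with -m, return it.
    2. If -m is TEST or MFP (or absent), fall back to the "Best-fit model:" line.
    """
    for line in log_text.splitlines():
        if line.startswith("Command:"):
            match = re.search(r"(?:^|\s)-m\s+(\S+)", line)
            if match and match.group(1) not in ("TEST", "MFP"):
                return match.group(1)
        elif line.startswith("Best-fit model:"):
            # e.g. "Best-fit model: LG+G4 chosen according to BIC"
            return line.split(":", 1)[1].split()[0]
    return None
```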
get_tree_model_from_modeltest_log(modeltest_log, criterion='BIC')
Parse modeltest-ng log file and return best fit model according to selected criterion: BIC, AIC or AICc
Parameters:
Name | Type | Description | Default |
---|---|---|---|
modeltest_log | Path | path to modeltest-ng log file | required |
criterion | str | criterion used to choose the best model: "BIC", "AIC" or "AICc". Defaults to "BIC". | 'BIC' |
Returns:
Name | Type | Description |
---|---|---|
str | str | best substitution model employed by modeltest-ng |
Source code in src/metatag/phylotree.py
infer_tree(ref_aln, output_dir, method='iqtree', substitution_model='modeltest', additional_args=None)
Infer tree from reference msa. Best substitution model selected by default.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ref_aln | Path | path to reference alignment in FASTA format | required |
output_dir | Path | path to output directory | required |
method | str | choose tree inference method: "iqtree" or "fasttree". Defaults to "iqtree". | 'iqtree' |
substitution_model | str | substitution model employed to infer tree or path to iqtree log file containing model. The name of an algorithm to choose best substitution model can also be provided. In that case, choose between "iqtest" or "modeltest". Defaults to "modeltest". | 'modeltest' |
additional_args | str | additional arguments to tree algorithm as a string. Defaults to None. | None |
Source code in src/metatag/phylotree.py
relabel_tree(input_newick, label_dict, output_file=None, iTOL=True)
Relabel tree leaves with labels from provided dictionary. If iTOL is set, then labels are checked for iTOL compatibility
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_newick | Path | path to input tree in newick format | required |
label_dict | dict | dictionary with labels to replace. Keys are original labels, values are new labels. | required |
output_file | Path | path to output relabelled tree. Defaults to None. | None |
iTOL | bool | whether to check if label complies with iTOL requirements. Defaults to True. | True |
Source code in src/metatag/phylotree.py
sanity_check_for_iTOL(label)
Reformat label to comply with iTOL requirements by removing: 1. white spaces, 2. double underscores, 3. symbols other than English letters and numbers
Parameters:
Name | Type | Description | Default |
---|---|---|---|
label | str | tree label as a string | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | reformatted label |
Source code in src/metatag/phylotree.py
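One possible reading of the three clean-up rules is sketched below. The details are assumptions, since the description does not say what replaces the removed characters: here white space becomes a single underscore, other non-alphanumeric symbols are dropped, and repeated underscores are collapsed. The name `itol_safe` is hypothetical.

```python
import re


def itol_safe(label):
    """Return a label containing only letters, digits and single underscores."""
    label = re.sub(r"\s+", "_", label.strip())   # white space -> underscore (assumed)
    label = re.sub(r"[^A-Za-z0-9_]", "", label)  # drop other symbols
    return re.sub(r"_{2,}", "_", label)          # collapse double underscores
```

For instance, `itol_safe("Escherichia coli  (K-12)__strain")` yields `"Escherichia_coli_K12_strain"` under these assumptions.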
Module placement
Tools to quantify and assign labels to placed sequences
JplaceParser
Methods to parse jplace files, as specified in https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0031009
Source code in src/metatag/placement.py
fields
property
Print data fields
meta
property
Print metadata
placements
property
Return placement objects
compute_tree_diameter()
Find maximum (pairwise) distance between two tips (leaves) in the tree
Source code in src/metatag/placement.py
filter_placements_by_max_pendant_length(max_pendant_length, output_file=None)
Filter placements by maximum pendant length
Parameters:
Name | Type | Description | Default |
---|---|---|---|
max_pendant_length | float | cutoff value for pendant length of placements | required |
output_file | Path | path to output file. Defaults to None. | None |
Source code in src/metatag/placement.py
filter_placements_by_max_pendant_to_distal_length_ratio(max_pendant_ratio, output_file=None)
Filter placements by maximum pendant to distal length ratio
Parameters:
Name | Type | Description | Default |
---|---|---|---|
max_pendant_ratio | float | cutoff value for the pendant to distal length ratio of placements | required |
output_file | Path | path to output file. Defaults to None. | None |
Source code in src/metatag/placement.py
filter_placements_by_max_pendant_to_tree_diameter_ratio(max_pendant_ratio, output_file=None)
Filter placements by maximum pendant length to tree diameter ratio
Parameters:
Name | Type | Description | Default |
---|---|---|---|
max_pendant_ratio | float | cutoff value for pendant to tree diameter ratio | required |
output_file | Path | path to output file. Defaults to None. | None |
Source code in src/metatag/placement.py
filter_placements_by_minimum_lwr(minimum_lwr, output_file=None)
Filter placements by minimum likelihood weight ratio (LWR)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
minimum_lwr | float | LWR threshold (between 0 and 1) | required |
output_file | Path | path to output file. Defaults to None. | None |
Source code in src/metatag/placement.py
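A minimal sketch of LWR filtering on an already-parsed jplace document, assuming the standard jplace layout in which `fields` names the columns of every placement row stored under `p`. The function name `filter_by_minimum_lwr` and the toy data are hypothetical and do not reflect how JplaceParser works internally.

```python
import copy


def filter_by_minimum_lwr(jplace: dict, minimum_lwr: float) -> dict:
    """Return a copy of a parsed jplace dict keeping placements with LWR >= minimum_lwr."""
    lwr_idx = jplace["fields"].index("like_weight_ratio")
    filtered = copy.deepcopy(jplace)  # leave the input untouched
    kept = []
    for query in filtered["placements"]:
        query["p"] = [row for row in query["p"] if row[lwr_idx] >= minimum_lwr]
        if query["p"]:  # drop queries left without any placement
            kept.append(query)
    filtered["placements"] = kept
    return filtered


# Hypothetical toy document with two queries
example = {
    "fields": ["edge_num", "likelihood", "like_weight_ratio",
               "distal_length", "pendant_length"],
    "placements": [
        {"n": ["query_1"], "p": [[0, -100.0, 0.9, 0.01, 0.02],
                                 [1, -105.0, 0.1, 0.02, 0.30]]},
        {"n": ["query_2"], "p": [[2, -110.0, 0.5, 0.03, 0.10]]},
    ],
}
result = filter_by_minimum_lwr(example, minimum_lwr=0.8)
```

The pendant-length filters documented above could follow the same pattern with a different column and an upper rather than lower bound.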
get_reference_sequences()
Get list of reference sequences in the placement tree
Source code in src/metatag/placement.py
newickfy_tree(tree_str)
staticmethod
Remove branch IDs from jplace tree string
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tree_str | str | jplace tree string | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | newick tree string |
Source code in src/metatag/placement.py
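In the jplace format, each branch of the tree string carries an edge number in curly braces after the branch length. The sketch below strips those numbers with a regular expression to recover plain Newick. The function name `strip_edge_ids` is an assumption for this example, and the library may implement the step differently.

```python
import re


def strip_edge_ids(jplace_tree: str) -> str:
    """Remove jplace edge numbers such as {12} from a tree string."""
    return re.sub(r"\{\d+\}", "", jplace_tree)


jplace_tree = "((A:0.1{0},B:0.2{1}):0.3{2},C:0.4{3}){4};"
newick = strip_edge_ids(jplace_tree)
```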
TaxAssignParser
Parse function and taxonomy placement assignments table
Source code in src/metatag/placement.py
__init__(tax_assign_path)
Parse function and taxonomy placement assignments table
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tax_assign_path | Path | path to tsv file containing taxonomic assignments | required |
Source code in src/metatag/placement.py
count_hits(cluster_ids=None, score_threshold=None, taxopath_type='taxopath', path_to_query_list=None)
Count hits within given cluster IDs and at specified taxon level
Parameters:
Name | Type | Description | Default |
---|---|---|---|
cluster_ids | list[str] | IDs of tree clusters to be included in the counting of placements. Defaults to None. | None |
score_threshold | float | global placement score threshold to filter out low-quality placements. Defaults to None. | None |
taxopath_type | str | 'taxopath' to use gappa-assign taxopath or 'cluster_taxopath' to use lowest common taxopath of the reference tree cluster. Defaults to "taxopath". | 'taxopath' |
path_to_query_list | Path | if not None, a tsv is exported to the given location containing those queries with correct cluster assignment (according to the defined 'valid' cluster IDs or threshold). Defaults to None. | None |
Returns:
Name | Type | Description |
---|---|---|
TaxonomyCounter | TaxonomyCounter | TaxonomyCounter object |
Source code in src/metatag/placement.py
add_clusters_to_tax_table(in_taxtable, clusters=None, out_taxtable=None)
Add tree cluster info at the beginning of each taxopath according to clusters defined in dictionary 'clusters'
Parameters:
Name | Type | Description | Default |
---|---|---|---|
in_taxtable | Path | path to taxonomy table | required |
clusters | dict | dictionary with keys equal to cluster IDs and values to lists of cluster members. If None, then all sequences are assumed to belong to the same cluster. Defaults to None. | None |
out_taxtable | Path | path to output taxonomy table. Defaults to None. | None |
Source code in src/metatag/placement.py
add_duplicates_to_assignment_table(taxtable, query_duplicates, output_file=None)
Add duplicated query IDs to cluster and taxonomic assignment table
Parameters:
Name | Type | Description | Default |
---|---|---|---|
taxtable | Path | path to cluster and taxonomic assignment table | required |
query_duplicates | Path | path to query duplicates file as output by seqkit rmdup | required |
output_file | Path | path to output file. Defaults to None. | None |
Source code in src/metatag/placement.py
add_query_labels_to_assign_table(input_table, query_labels, output_table=None)
Add new column containing actual query labels to query taxonomy/cluster assignment table
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_table | Path | path to query taxonomy/cluster assignment table | required |
query_labels | dict | dictionary with keys equal to query short IDs and values to query labels | required |
output_table | Path | path to output table. Defaults to None. | None |
Source code in src/metatag/placement.py
assign_labels_to_placements(jplace, ref_labels, query_labels=None, output_dir=None, output_prefix=None, only_best_hit=True, ref_clusters_file=None, ref_cluster_scores_file=None, gappa_additional_args=None, only_unique_cluster=True, taxo_file=None)
Assign taxonomy and/or tree cluster IDs to placed query sequences based on taxonomy assigned to tree reference sequences using gappa examine assign.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
jplace | Path | path to jplace file | required |
ref_labels | dict | dictionary containing short IDs as keys and long labels as values for reference sequences | required |
query_labels | dict | dictionary containing short IDs as keys and long labels as values for query sequences. Defaults to None. | None |
output_dir | Path | path to output directory. Defaults to None. | None |
output_prefix | str | prefix to output files. Defaults to None. | None |
only_best_hit | bool | only report taxonomy with largest LWR per query. Defaults to True. | True |
ref_clusters_file | Path | path to tsv containing reference cluster definitions. Defaults to None. | None |
ref_cluster_scores_file | Path | path to tsv containing cluster scores. Defaults to None. | None |
gappa_additional_args | str | additional arguments to gappa. Defaults to None. | None |
only_unique_cluster | bool | if True, keep queries with multiple placement locations only if all of them were assigned to the same cluster. Defaults to True. | True |
taxo_file | Path | path to taxonomy database. Defaults to None. | None |
Source code in src/metatag/placement.py
filter_non_unique_placement_assignments(placed_tax_assignments, output_file=None)
Remove queries that were assigned to more than one cluster from placements assignments table
Parameters:
Name | Type | Description | Default |
---|---|---|---|
placed_tax_assignments | Path | path to placed taxonomic assignments table | required |
output_file | Path | path to output file. Defaults to None. | None |
Source code in src/metatag/placement.py
find_queries_placed_in_several_clusters(placed_tax_assignments)
Find queries placed in more than one cluster
Parameters:
Name | Type | Description | Default |
---|---|---|---|
placed_tax_assignments | Path | path to placed taxonomic assignments table | required |
Returns:
Type | Description |
---|---|
tuple[list, pd.DataFrame] | list of query IDs placed in more than one cluster and dataframe with unique cluster assignments per query |
Source code in src/metatag/placement.py
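The idea of detecting queries placed in more than one cluster can be sketched without pandas by grouping cluster IDs per query. The `(query_id, cluster_id)` row layout and the function name `find_multi_cluster_queries` are assumptions for this example. The actual function reads a table from disk and also returns a dataframe.

```python
from collections import defaultdict


def find_multi_cluster_queries(assignments):
    """Return sorted query IDs that appear with more than one distinct cluster ID."""
    clusters_per_query = defaultdict(set)
    for query_id, cluster_id in assignments:
        clusters_per_query[query_id].add(cluster_id)
    return sorted(q for q, clusters in clusters_per_query.items() if len(clusters) > 1)


# Hypothetical rows: q1 is placed twice in the same cluster, q2 in two clusters
rows = [("q1", "C1"), ("q1", "C1"), ("q2", "C1"), ("q2", "C3"), ("q3", "C2")]
ambiguous = find_multi_cluster_queries(rows)
```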
parse_duplicates_from_seqkit(query_duplicates)
Add a column with the query IDs of duplicated sequences to the taxonomy assignments file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query_duplicates | Path | path to query duplicates file as output by seqkit rmdup | required |
Returns:
Name | Type | Description |
---|---|---|
_type_ | None | description |
Source code in src/metatag/placement.py
parse_gappa_assign_table(input_table, has_cluster_id=True, cluster_scores=None, clusters_taxopath=None, output_file=None)
Parse gappa assign per query taxonomy assignment result tsv
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_table | Path | path to gappa assign per query taxonomy assignment result tsv | required |
has_cluster_id | bool | set to True if results table includes reference cluster info in the first element of taxopath. Defaults to True. | True |
cluster_scores | dict | dictionary with values set to cluster quality scores. It is only used if has_cluster_id = True. Defaults to None. | None |
clusters_taxopath | dict | dict with keys equal to cluster IDs and values corresponding to the lowest common taxopath for the cluster. Defaults to None. | None |
output_file | Path | path to output file. Defaults to None. | None |
Source code in src/metatag/placement.py
parse_tree_cluster_quality_scores(cluster_scores_tsv)
Parse cluster quality scores file into dictionary
Parameters:
Name | Type | Description | Default |
---|---|---|---|
cluster_scores_tsv | Path | path to tsv containing cluster scores | required |
Returns:
Name | Type | Description |
---|---|---|
dict | dict | dict with keys equal to cluster IDs and values to scores |
Source code in src/metatag/placement.py
parse_tree_clusters(clusters_tsv, cluster_as_key=True)
Parse clusters text file into dictionary
Parameters:
Name | Type | Description | Default |
---|---|---|---|
clusters_tsv | Path | path to tsv containing tree cluster definitions | required |
cluster_as_key | bool | if True then dict keys are cluster IDs and values | True |
Returns:
Name | Type | Description |
---|---|---|
dict | dict | dictionary of tree clusters |
Source code in src/metatag/placement.py
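A minimal sketch of turning a cluster definition file into a dictionary. It assumes a two-column, tab-separated file with a header row, sequence ID first and cluster ID second; the real file layout is not documented here. It also assumes that `cluster_as_key=False` maps each sequence ID to its cluster ID, which the description above does not confirm. The function name `parse_clusters` is hypothetical.

```python
import csv
import io
from collections import defaultdict


def parse_clusters(handle, cluster_as_key=True):
    """Parse (sequence ID, cluster ID) rows from an open tsv handle into a dict."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header row (assumed to be present)
    if cluster_as_key:
        clusters = defaultdict(list)
        for seq_id, cluster_id in reader:
            clusters[cluster_id].append(seq_id)
        return dict(clusters)
    return {seq_id: cluster_id for seq_id, cluster_id in reader}


# Hypothetical file content, held in memory for the example
tsv = "id\tcluster\nref_1\tC1\nref_2\tC1\nref_3\tC2\n"
by_cluster = parse_clusters(io.StringIO(tsv))
by_sequence = parse_clusters(io.StringIO(tsv), cluster_as_key=False)
```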
pick_taxopath_with_highest_lwr(placed_tax_assignments, output_file=None)
Pick taxopath assignment with highest LWR for each placed query
Parameters:
Name | Type | Description | Default |
---|---|---|---|
placed_tax_assignments | Path | path to placed taxonomic assignments table | required |
output_file | Path | path to output file. Defaults to None. | None |
Source code in src/metatag/placement.py
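The selection described above amounts to keeping, for each query, the row with the largest LWR. The sketch below shows this on in-memory rows. The dictionary keys `query_id`, `taxopath` and `lwr` and the function name `pick_highest_lwr` are assumptions for illustration, not the column names of the real assignments table.

```python
def pick_highest_lwr(rows):
    """Keep, for each query ID, the row with the largest LWR (first one wins on ties)."""
    best = {}
    for row in rows:
        current = best.get(row["query_id"])
        if current is None or row["lwr"] > current["lwr"]:
            best[row["query_id"]] = row
    return list(best.values())


# Hypothetical assignment rows
rows = [
    {"query_id": "q1", "taxopath": "A;B", "lwr": 0.6},
    {"query_id": "q1", "taxopath": "A;C", "lwr": 0.3},
    {"query_id": "q2", "taxopath": "A;D", "lwr": 0.9},
]
best_rows = pick_highest_lwr(rows)
```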
place_reads_onto_tree(input_tree, tree_model, ref_aln, query_seqs, aln_method='papara', output_dir=None)
Perform short read placement onto a phylogenetic tree. tree_model is either the model name or the path to the log output by iqtree. Workflow example: https://github.com/Pbdas/epa-ng/wiki/Full-Stack-Example
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_tree | Path | path to input tree | required |
tree_model | str | substitution model used for tree inference | required |
ref_aln | Path | path to reference alignment | required |
query_seqs | Path | path to query sequences | required |
aln_method | str | choose either "papara" or "hmmalign". Defaults to "papara". | 'papara' |
output_dir | Path | path to output directory. Defaults to None. | None |
Source code in src/metatag/placement.py
Module taxonomy
Tools to assign taxonomy to reference and query (placed) sequences
TaxonomyAssigner
Methods to assign taxonomy to reference sequences
Source code in src/metatag/taxonomy.py
assign_lowest_common_taxonomy_to_clusters(clusters, label_dict=None)
Find the lowest possible common taxonomy of the reference labels in each cluster. If reference labels do not contain genome IDs, a dictionary, label_dict, of reference labels and genome IDs (or labels with genome IDs) must be passed.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
clusters | dict | dictionary with keys as cluster IDs and values as lists of reference labels in each cluster | required |
label_dict | dict | dictionary with keys as short IDs and values as reference (full) labels. Defaults to None. | None |
Returns:
Name | Type | Description |
---|---|---|
dict | dict | dictionary with keys as cluster IDs and values as lowest common taxopath for each cluster |
Source code in src/metatag/taxonomy.py
assign_lowest_common_taxonomy_to_labels(labels)
Assign taxonomy to a set of labels and find the lowest common taxonomy among them
Parameters:
Name | Type | Description | Default |
---|---|---|---|
labels | list[str] | list of reference labels containing genome IDs | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | lowest common GTDB taxonomy |
Source code in src/metatag/taxonomy.py
assign_taxonomy_to_label(label)
Assign GTDB taxonomy to label based on genome ID
Parameters:
Name | Type | Description | Default |
---|---|---|---|
label | str | reference label containing genome ID | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | GTDB taxonomy as a string taxopath |
Source code in src/metatag/taxonomy.py
build_gappa_taxonomy_table(ref_id_dict, output_file=None)
Build gappa-compatible taxonomy file as specified here: https://github.com/lczech/gappa/wiki/Subcommand:-assign. References without assigned taxonomy are removed.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ref_id_dict | dict | dictionary with keys as reference IDs and values as reference labels | required |
output_file | Path | path to output file. Defaults to None. | None |
Source code in src/metatag/taxonomy.py
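The taxonomy file consumed by gappa assign is a tab-separated table mapping each reference name to a semicolon-delimited taxonomic path. The sketch below writes such a table and skips references without a taxopath. It assumes taxopaths are already known per reference ID, whereas the documented function receives reference labels and derives the taxonomy itself. The name `write_taxonomy_table` is hypothetical.

```python
import io


def write_taxonomy_table(ref_taxonomy, handle):
    """Write 'ref_id<TAB>taxopath' lines, skipping references with no taxonomy."""
    written = 0
    for ref_id, taxopath in ref_taxonomy.items():
        if not taxopath:  # empty or None: no assigned taxonomy
            continue
        handle.write(f"{ref_id}\t{taxopath}\n")
        written += 1
    return written


# Hypothetical reference IDs and taxopaths
refs = {
    "ref_1": "d__Bacteria;p__Proteobacteria",
    "ref_2": "",
    "ref_3": "d__Archaea",
}
buffer = io.StringIO()
n_written = write_taxonomy_table(refs, buffer)
```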
lowest_common_taxonomy(taxopaths)
staticmethod
Find lowest common taxonomy among set of taxopaths
Parameters:
Name | Type | Description | Default |
---|---|---|---|
taxopaths | list[str] | list of taxopath strings | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | lowest common taxopath |
Source code in src/metatag/taxonomy.py
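One way to picture the lowest common taxonomy is as the longest shared prefix of the taxopaths once each is split into levels. The following sketch assumes ";" as delimiter, matching the Taxopath default documented below. The function name `lowest_common_taxopath` is an assumption, and the library's own method may handle edge cases differently.

```python
def lowest_common_taxopath(taxopaths, delimiter=";"):
    """Return the longest prefix of taxonomic levels shared by all taxopaths."""
    split_paths = [path.split(delimiter) for path in taxopaths]
    common = []
    for taxa_at_level in zip(*split_paths):  # stops at the shortest taxopath
        if len(set(taxa_at_level)) != 1:
            break
        common.append(taxa_at_level[0])
    return delimiter.join(common)


paths = [
    "d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria",
    "d__Bacteria;p__Proteobacteria;c__Alphaproteobacteria",
]
shared = lowest_common_taxopath(paths)
```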
TaxonomyCounter
Source code in src/metatag/taxonomy.py
__init__(taxopath_list)
Tools to summarize taxonomical diversity in a list of taxopaths
Source code in src/metatag/taxonomy.py
get_counts(taxlevel='family', output_tsv=None, plot_type='bar', output_pdf=None)
Compute counts and fraction at specified taxonomy levels
Parameters:
Name | Type | Description | Default |
---|---|---|---|
taxlevel | str | taxonomy level to perform counts at. Defaults to "family". | 'family' |
output_tsv | Path | path to output file. Defaults to None. | None |
plot_type | str | choose either "bar" or "pie". Defaults to "bar". | 'bar' |
output_pdf | Path | path to output pdf with figures. Defaults to None. | None |
Returns:
Type | Description |
---|---|
None | pd.DataFrame: dataframe with counts and fraction at specified taxlevel |
Source code in src/metatag/taxonomy.py
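Counting at a taxonomy level can be sketched with `collections.Counter`, assuming taxopaths list the seven GTDB ranks in order from domain to species. The `LEVELS` list, the "Unspecified" placeholder for missing ranks and the function name `count_taxa` are assumptions for this example. The documented method additionally writes a tsv and produces plots.

```python
from collections import Counter

# Assumed rank order within a taxopath
LEVELS = ["domain", "phylum", "class", "order", "family", "genus", "species"]


def count_taxa(taxopaths, taxlevel="family", delimiter=";"):
    """Return {taxon: (count, fraction)} at the requested taxonomy level."""
    idx = LEVELS.index(taxlevel)
    taxa = []
    for path in taxopaths:
        levels = path.split(delimiter)
        taxa.append(levels[idx] if idx < len(levels) else "Unspecified")
    counts = Counter(taxa)
    total = sum(counts.values())
    return {taxon: (n, n / total) for taxon, n in counts.most_common()}


paths = [
    "d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria",
    "d__Bacteria;p__Proteobacteria;c__Alphaproteobacteria",
    "d__Archaea;p__Thermoproteota",
]
phylum_counts = count_taxa(paths, taxlevel="phylum")
```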
plot_counts(count_data, output_pdf, plot_type='bar', title=None)
Make (and optionally export) barplot ('bar') or pieplot ('pie') figure depicting counting results at specified taxonomic level
Parameters:
Name | Type | Description | Default |
---|---|---|---|
count_data | pd.DataFrame | dataframe with counts and fraction at specified taxonomy level as returned by get_counts() | required |
plot_type | str | choose between "bar" and "pie". Defaults to "bar". | 'bar' |
output_pdf | Path | path to output pdf containing figure. | required |
title | str | figure title. Defaults to None. | None |
Source code in src/metatag/taxonomy.py
Taxopath
Object to contain taxopath
Source code in src/metatag/taxonomy.py
__init__(taxopath_str=None, delimiter=';')
Object to contain taxopath
Parameters:
Name | Type | Description | Default |
---|---|---|---|
taxopath_str | str | taxopath as a string. Defaults to None. | None |
delimiter | str | taxa delimiter in taxopath. Defaults to ";". | ';' |
Source code in src/metatag/taxonomy.py
from_dict(taxodict, delimiter=';')
classmethod
Instantiate Taxopath object from dict
Parameters:
Name | Type | Description | Default |
---|---|---|---|
taxodict | dict | dict of taxonomic levels and taxa | required |
delimiter | str | delimiter to separate taxon levels. Defaults to ";". | ';' |
Returns:
Name | Type | Description |
---|---|---|
Taxopath | Taxopath | Taxopath object |
Source code in src/metatag/taxonomy.py
Module visualization
Tools to visualize phylogenetic trees with and without short read placement.
make_feature_metadata_table(label_dict, output_tsv, original_labels=True)
Construct feature metadata tsv classifying reference and query sequences for empress. See https://github.com/biocore/empress/issues/548
Parameters:
Name | Type | Description | Default |
---|---|---|---|
label_dict | dict | dictionary of sequence short IDs and labels. Reference sequences should be prefixed with "ref_", and query sequences should be prefixed with "query_". | required |
output_tsv | Path | path to output tsv file with metadata table. | required |
original_labels | bool | whether to include original long labels in tree. Defaults to True. | True |
Source code in src/metatag/visualization.py
make_tree_html(input_tree, output_dir=None, feature_metadata=None)
Runs empress tree-plot. empress: https://github.com/biocore/empress
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_tree | Path | path to tree in newick format | required |
output_dir | Path | path to output directory. Defaults to None. | None |
feature_metadata | Path | path to feature metadata table as output by make_feature_metadata_table. Defaults to None. | None |
Source code in src/metatag/visualization.py
plot_tree_in_browser(input_tree, output_dir=None, feature_metadata=None)
Runs empress tree-plot and opens the generated html in the browser. empress: https://github.com/biocore/empress
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_tree | Path | path to tree in newick format | required |
output_dir | Path | path to output directory. Defaults to None. | None |
feature_metadata | Path | path to feature metadata table as output by make_feature_metadata_table. Defaults to None. | None |
Source code in src/metatag/visualization.py
Module utils
Functions and classes for general purposes
CommandArgs
Base class to hold command line arguments.
Source code in src/metatag/utils.py
ConfigParser
Handle MetaTag configuration file.
Source code in src/metatag/utils.py
__init__(config_file)
Handle MetaTag configuration file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
config_file | Path | path to configuration file | required |
Source code in src/metatag/utils.py
get_config()
Load config file.
Returns:
Name | Type | Description |
---|---|---|
dict | dict | dict containing fields and values of config file. |
Source code in src/metatag/utils.py
get_config_path()
Show config file path.
Source code in src/metatag/utils.py
get_default_config()
classmethod
Initialize ConfigParser with default config file.
Source code in src/metatag/utils.py
get_field(key)
Get field from config file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
key | str | key name to get the value from. | required |
Returns:
Name | Type | Description |
---|---|---|
str | str | key value. |
Source code in src/metatag/utils.py
initialize_config_file()
staticmethod
Initialize empty config file.
Returns:
Name | Type | Description |
---|---|---|
Path | Path | path to generated config file. |
Source code in src/metatag/utils.py
update_config(key, value)
Update config file
Parameters:
Name | Type | Description | Default |
---|---|---|---|
key | str | config file key name to be updated. | required |
value | str | new value. | required |
Source code in src/metatag/utils.py
write_config()
Write config dict to file.
Source code in src/metatag/utils.py
DictMerger
Source code in src/metatag/utils.py
__init__(dicts)
Tools to merge Python dictionaries into a single one
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dicts | list | list of dictionaries to be merged | required |
Source code in src/metatag/utils.py
from_pickle_paths(dict_paths)
classmethod
Initialize class from list of paths to dictionaries (pickle)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dict_paths | list[Path] | list of paths to pickle files | required |
Returns:
Name | Type | Description |
---|---|---|
DictMerger | DictMerger | DictMerger instance |
Source code in src/metatag/utils.py
merge(dict_prefixes=None, save_pickle_path=None)
Merge dictionaries
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dict_prefixes | list[str] | list of strings containing prefixes. Defaults to None. | None |
save_pickle_path | Path | path to output pickle file. Defaults to None. | None |
Returns:
Name | Type | Description |
---|---|---|
dict | dict | merged dictionary |
Source code in src/metatag/utils.py
read_from_pickle_file(path_to_file='object.pkl')
staticmethod
Load python object from pickle file
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path_to_file | Path | path to pickle file. Defaults to "object.pkl". | 'object.pkl' |
Returns:
Name | Type | Description |
---|---|---|
_type_ | | Python object |
Source code in src/metatag/utils.py
TemporaryDirectoryPath
Custom context manager to create a temporary directory which is removed when exiting context manager
Source code in src/metatag/utils.py
__init__(work_dir=None)
Custom context manager to create a temporary directory which is removed when exiting context manager
Parameters:
Name | Type | Description | Default |
---|---|---|---|
work_dir | Path | path to working directory. Defaults to None. | None |
Source code in src/metatag/utils.py
TemporaryFilePath
Custom context manager to create a temporary file which is removed when exiting context manager
Source code in src/metatag/utils.py
__init__(work_dir=None, extension=None, create_file=False)
Custom context manager to create a temporary file which is removed when exiting context manager
Parameters:
Name | Type | Description | Default |
---|---|---|---|
work_dir | Path | path to working directory. Defaults to None. | None |
extension | str | file extension. Defaults to None. | None |
create_file | bool | whether to create a permanent file. Defaults to False. | False |
Source code in src/metatag/utils.py
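A context manager of this kind can be sketched with the standard `tempfile` module. The class below is a hypothetical stand-in named `TempFilePath`, not the library's code. In particular, it assumes that `create_file` controls whether the file already exists on disk when the context is entered, which may differ from the semantics of the parameter documented above.

```python
import os
import tempfile
from pathlib import Path


class TempFilePath:
    """Yield a temporary file path and delete the file when the context exits."""

    def __init__(self, work_dir=None, extension=None, create_file=False):
        self.work_dir = work_dir          # None means the system temp directory
        self.extension = extension or ""
        self.create_file = create_file

    def __enter__(self):
        fd, name = tempfile.mkstemp(suffix=self.extension, dir=self.work_dir)
        os.close(fd)
        self.path = Path(name)
        if not self.create_file:
            self.path.unlink()            # hand out only an unused file name
        return self.path

    def __exit__(self, exc_type, exc_value, traceback):
        self.path.unlink(missing_ok=True)  # remove the file if it was written
        return False                       # do not swallow exceptions


with TempFilePath(extension=".fasta") as tmp_fasta:
    tmp_fasta.write_text(">seq1\nMKV\n")
    existed_inside = tmp_fasta.exists()
exists_after = tmp_fasta.exists()
```

Returning a `Path` from `__enter__` lets callers pass the temporary location straight to functions that expect file paths.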
easy_pattern_matching(text, left_pattern, right_pattern=None)
Straightforward string search between two patterns
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | string to be searched | required |
left_pattern | str | leftmost border pattern | required |
right_pattern | str | rightmost border pattern. Defaults to None. | None |
Returns:
Name | Type | Description |
---|---|---|
str | str | description |
Source code in src/metatag/utils.py
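Extracting the text between two border patterns can be done with plain `str.find`. In the sketch below, the function name `text_between`, the empty-string result when the left pattern is missing, and the fallback to the end of the string when the right pattern is missing are all assumptions. The documented function may behave differently in those cases.

```python
def text_between(text, left_pattern, right_pattern=None):
    """Return the substring after left_pattern and before right_pattern."""
    start = text.find(left_pattern)
    if start == -1:
        return ""                       # left border not found (assumed behavior)
    start += len(left_pattern)
    if right_pattern is None:
        return text[start:]
    end = text.find(right_pattern, start)
    return text[start:end] if end != -1 else text[start:]


short_id = text_between("ref_0012_label", "ref_", "_label")
model = text_between("model: LG+F", "model: ")
```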
init_logger(args)
Initialize logger object
Parameters:
Name | Type | Description | Default |
---|---|---|---|
args | Union[CommandArgs, ArgumentParser] | arguments object | required |
Returns:
Type | Description |
---|---|
logging.Logger | initialized logger object |
Source code in src/metatag/utils.py
parallelize_over_input_files(callable, input_list, processes=None, **callable_kwargs)
Parallelize callable over a set of input objects using a pool of workers. Inputs in input list are passed to the first argument of the callable. Additional callable named arguments may be passed.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
callable | Callable | function to be called. | required |
input_list | list | list of input objects to callable. | required |
processes | int | maximum number of processes. Defaults to None. | None |
Source code in src/metatag/utils.py
read_from_pickle_file(path_to_file='object.pkl')
Load python object from pickle file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path_to_file | Path | path to pickle file. Defaults to "object.pkl". | 'object.pkl' |
Returns:
Name | Type | Description |
---|---|---|
_type_ | | Python object. |
Source code in src/metatag/utils.py
save_to_pickle_file(python_object, path_to_file='object.pkl')
Save python object to pickle file
Parameters:
Name | Type | Description | Default |
---|---|---|---|
python_object | object | Python object to be saved | required |
path_to_file | Path | path to pickle file. Defaults to "object.pkl". | 'object.pkl' |
Source code in src/metatag/utils.py
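Saving and loading with the standard `pickle` module follows a simple round trip, sketched below. The helper names `save_pickle` and `load_pickle` and the example dictionary are hypothetical and stand in for the two functions documented above.

```python
import pickle
import tempfile
from pathlib import Path


def save_pickle(python_object, path_to_file):
    """Serialize a Python object to a pickle file."""
    with open(path_to_file, "wb") as handle:
        pickle.dump(python_object, handle)


def load_pickle(path_to_file):
    """Load a Python object from a pickle file (only use trusted files)."""
    with open(path_to_file, "rb") as handle:
        return pickle.load(handle)


# Round trip inside a throwaway directory
with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = Path(tmp_dir) / "object.pkl"
    labels = {"ref_1": "genome_A_gene_1", "ref_2": "genome_B_gene_7"}
    save_pickle(labels, pickle_path)
    restored = load_pickle(pickle_path)
```

Unpickling can execute arbitrary code, so pickle files should only be loaded from sources you trust.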
set_default_output_path(input_path, tag=None, extension=None, only_filename=False, only_basename=False, only_dirname=False)
Utility function to generate a default path to output file or directory based on an input file name and path.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_path | Path | path to input file. | required |
tag | str | text tag to be added to file name. Defaults to None. | None |
extension | str | replace input file extension with this one. Defaults to None. | None |
only_filename | bool | output only default filename. Defaults to False. | False |
only_basename | bool | output only default basename (no extension). Defaults to False. | False |
only_dirname | bool | output only path to default output directory. Defaults to False. | False |
Returns:
Name | Type | Description |
---|---|---|
Path | Path | a path or name to a default output file. |
Source code in src/metatag/utils.py
terminal_execute(command_str, suppress_shell_output=False, work_dir=None, return_output=False)
Execute given command in terminal through Python.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
command_str | str | terminal command to be executed. | required |
suppress_shell_output | bool | suppress shell output. Defaults to False. | False |
work_dir | Path | change working directory. Defaults to None. | None |
return_output | bool | whether to return execution output. Defaults to False. | False |
Returns:
Type | Description |
---|---|
subprocess.STDOUT | subprocess output. |
Source code in src/metatag/utils.py
Module wrappers
Simple CLI wrappers to several tools
get_percent_identity_from_msa(input_msa, output_file=None)
Run esl-alipid to compute pairwise percent identity (PI) from an MSA.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_msa |
Path
|
path to input MSA file |
required |
output_file |
Path
|
path to output file. Defaults to None. |
None
|
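To make the quantity concrete, the sketch below computes pairwise percent identity directly in Python rather than calling esl-alipid. The function names are hypothetical, and normalising by the shorter ungapped sequence length is an assumption about the definition in use; other conventions exist.

```python
from itertools import combinations
from typing import Dict, Tuple

GAPS = set("-.")


def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent identity between two rows of the same alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    # Count columns where both rows hold the same residue (gaps excluded).
    identical = sum(
        1 for x, y in zip(seq_a.upper(), seq_b.upper()) if x == y and x not in GAPS
    )
    # Assumption: normalise by the shorter ungapped length.
    shorter = min(
        sum(c not in GAPS for c in seq_a),
        sum(c not in GAPS for c in seq_b),
    )
    return 100.0 * identical / shorter if shorter else 0.0


def pairwise_identities(msa: Dict[str, str]) -> Dict[Tuple[str, str], float]:
    """Percent identity for every unordered pair of sequences in the MSA."""
    return {(a, b): percent_identity(msa[a], msa[b]) for a, b in combinations(msa, 2)}


scores = pairwise_identities({"s1": "ACGT", "s2": "ACGA", "s3": "ACGT"})
```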
Source code in src/metatag/wrappers.py
remove_auxiliary_output(output_prefix)
Removes iqtree auxiliary output files
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output_prefix |
str
|
prefix of output files |
required |
Source code in src/metatag/wrappers.py
run_cdhit(input_fasta, output_fasta=None, additional_args=None)
Simple CLI wrapper to cd-hit to obtain representative sequences. CD-HIT may be used to remove duplicated sequences (keeping one representative) with parameters -c 1 -t 1.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_fasta |
Path
|
path to input fasta file |
required |
output_fasta |
Path
|
path to output file. Defaults to None. |
None
|
additional_args |
str
|
additional arguments to cdhit. Defaults to None. |
None
|
Source code in src/metatag/wrappers.py
run_epang(input_tree, input_aln_ref, input_aln_query, model=None, output_dir=None, processes=None, overwrite_previous_results=True, additional_args=None)
Simple CLI wrapper to EPA-ng. See epa-ng -h for additional parameters.
input_tree: newick format
input_aln_ref: fasta format
input_aln_query: fasta format (sequences must be aligned to the reference MSA and have the same length as the reference alignment)
epa-ng: https://github.com/Pbdas/epa-ng
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_tree |
Path
|
path to tree file in newick format |
required |
input_aln_ref |
Path
|
path to reference alignment in fasta format |
required |
input_aln_query |
Path
|
path to query alignment in fasta format |
required |
model |
str
|
substitution model employed to infer tree. Defaults to None. |
None
|
output_dir |
Path
|
path to output directory. Defaults to None. |
None
|
processes |
int
|
maximum number of processes. Defaults to None. |
None
|
overwrite_previous_results |
bool
|
whether to overwrite result files of a previous run. Defaults to True. |
True
|
additional_args |
str
|
additional arguments to epa-ng. Defaults to None. |
None
|
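Because the query alignment must have the same length as the reference alignment, a quick check before placement can catch mismatches early. The sketch below is one way to do that; `read_fasta` and `mismatched_queries` are hypothetical helpers, not part of the library.

```python
from typing import Dict, List


def read_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into a {header: sequence} dictionary."""
    records: Dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def mismatched_queries(ref_fasta: str, query_fasta: str) -> List[str]:
    """Return names of query sequences whose length differs from the reference."""
    lengths = {len(seq) for seq in read_fasta(ref_fasta).values()}
    if len(lengths) != 1:
        raise ValueError("reference sequences do not share one alignment length")
    ref_len = lengths.pop()
    return [n for n, s in read_fasta(query_fasta).items() if len(s) != ref_len]


reference = ">a\nAC-GT\n>b\nACGGT\n"
queries = ">q1\n--CGT\n>q2\nCGT\n"
bad = mismatched_queries(reference, queries)
```

An empty result means every query row matches the reference alignment length.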
Source code in src/metatag/wrappers.py
run_fasttree(input_algns, output_file=None, nucleotides=False, starting_tree=None, quiet=True, additional_args=None)
Simple CLI wrapper to fasttree. fasttree accepts multiple alignments in fasta or phylip formats. It seems that fasttree does not allow inputting a substitution model. The default substitution model for protein sequences is JTT.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_algns |
Path
|
path to input fasta file |
required |
output_file |
Path
|
path to output file. Defaults to None. |
None
|
nucleotides |
bool
|
whether data is DNA. Defaults to False. |
False
|
starting_tree |
str
|
path to starting tree to help inference. Defaults to None. |
None
|
additional_args |
str
|
additional arguments to fasttree. Defaults to None. |
None
|
Source code in src/metatag/wrappers.py
run_gappa_assign(jplace, taxonomy_file, output_dir=None, output_prefix=None, only_best_hit=True, additional_args=None, delete_output_tree=True)
Use gappa examine assign to assign taxonomy to placed query sequences based on the taxonomy assigned to the reference sequences in the tree.
Arguments: --resolve-missing-paths can be added alongside --root-outgroup to find missing taxonomic info in labels.
Info: https://github.com/lczech/gappa/wiki/Subcommand:-assign
Parameters:
Name | Type | Description | Default |
---|---|---|---|
jplace |
Path
|
path to jplace file |
required |
taxonomy_file |
Path
|
path to taxonomy file as required by gappa assign |
required |
output_dir |
Path
|
path to output directory. Defaults to None. |
None
|
output_prefix |
str
|
prefix to be added to output files. Defaults to None. |
None
|
only_best_hit |
bool
|
return only best hit (highest LWR). Defaults to True. |
True
|
additional_args |
str
|
additional arguments to gappa assign. Defaults to None. |
None
|
delete_output_tree |
bool
|
whether to delete gappa assign output tree. Defaults to True. |
True
|
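The taxonomy file format is not described above. As a hedged sketch, the snippet below assumes the layout described in the gappa assign documentation: one tab-separated line per reference label, followed by a semicolon-separated taxonomic path. The helper name and the example labels are hypothetical.

```python
from typing import Dict, List


def format_taxonomy_table(label_to_ranks: Dict[str, List[str]]) -> str:
    """Build taxonomy-file text: '<reference label><TAB><rank1;rank2;...>' per line."""
    lines = []
    for label, ranks in label_to_ranks.items():
        # Assumption: ranks are ordered from highest (domain) to lowest.
        lines.append(f"{label}\t{';'.join(ranks)}")
    return "\n".join(lines) + "\n"


table = format_taxonomy_table(
    {
        "ref_1": ["Bacteria", "Proteobacteria"],
        "ref_2": ["Archaea", "Euryarchaeota"],
    }
)
```

The reference labels in this file need to match the leaf names of the tree stored in the jplace file, otherwise no taxonomy can be transferred to the placements.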
Source code in src/metatag/wrappers.py
run_gappa_graft(input_jplace, output_dir=None, output_prefix=None, additional_args=None)
Run gappa examine graft to obtain tree with placements in newick format
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_jplace |
Path
|
path to jplace file |
required |
output_dir |
Path
|
path to output directory. Defaults to None. |
None
|
output_prefix |
str
|
prefix to be added to output files. Defaults to None. |
None
|
additional_args |
str
|
additional arguments to gappa. Defaults to None. |
None
|
Source code in src/metatag/wrappers.py
run_hmmalign(input_hmm, input_aln, input_seqs, output_aln_seqs=None, additional_args=None)
Simple CLI wrapper to hmmalign. Aligns short-read query sequences to a reference MSA.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_hmm |
Path
|
path to input hmm |
required |
input_aln |
Path
|
path to input alignment file |
required |
input_seqs |
Path
|
path to input sequences to be aligned |
required |
output_aln_seqs |
Path
|
path to output alignment. Defaults to None. |
None
|
additional_args |
str
|
additional arguments to hmmalign. Defaults to None. |
None
|
Source code in src/metatag/wrappers.py
run_hmmbuild(input_aln, output_hmm=None, additional_args=None)
Simple CLI wrapper to hmmbuild (builds an HMM profile from an MSA file). For additional arguments, see hmmbuild -h.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_aln |
Path
|
path to input alignment file |
required |
output_hmm |
Path
|
path to output hmm. Defaults to None. |
None
|
additional_args |
str
|
additional arguments to hmmbuild. Defaults to None. |
None
|
Source code in src/metatag/wrappers.py
run_hmmsearch(hmm_model, input_fasta, output_file=None, method='hmmsearch', processes=None, additional_args=None)
Simple CLI wrapper to hmmsearch or hmmscan
Parameters:
Name | Type | Description | Default |
---|---|---|---|
hmm_model |
Path
|
path to hmm model |
required |
input_fasta |
Path
|
path to input fasta file |
required |
output_file |
Path
|
path to output file. Defaults to None. |
None
|
method |
str
|
choose method: "hmmscan" or "hmmsearch". Defaults to "hmmsearch". |
'hmmsearch'
|
processes |
int
|
maximum number of processes. Defaults to None. |
None
|
additional_args |
str
|
additional arguments to hmmsearch/scan. Defaults to None. |
None
|
Source code in src/metatag/wrappers.py
run_iqtree(input_algns, output_dir=None, output_prefix=None, keep_recovery_files=False, nucleotides=False, processes=None, substitution_model='TEST', starting_tree=None, bootstrap_replicates=1000, max_bootstrap_iterations=1000, overwrite_previous_results=True, quiet=True, additional_args=None)
Simple CLI wrapper to iqtree. iqtree accepts multiple alignments in fasta or phylip formats.
additional_args: a string containing additional parameters and parameter values to be passed to iqtree.
output: iqtree outputs several files.
Reducing computational time via model selection: http://www.iqtree.org/doc/Command-Reference
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_algns |
Path
|
path to input fasta file |
required |
output_dir |
Path
|
path to output directory. Defaults to None. |
None
|
output_prefix |
str
|
prefix to be added to output files. Defaults to None. |
None
|
keep_recovery_files |
bool
|
whether to keep recovery files. Defaults to False. |
False
|
nucleotides |
bool
|
whether sequence data are DNA. Defaults to False. |
False
|
processes |
int
|
maximum number of processes. Defaults to None. |
None
|
substitution_model |
str
|
substitution model employed to infer tree. Defaults to "TEST". |
'TEST'
|
starting_tree |
Path
|
path to tree file with starting tree to help inference. Defaults to None. |
None
|
bootstrap_replicates |
int
|
number of bootstrap replicates. Defaults to 1000. |
1000
|
max_bootstrap_iterations |
int
|
maximum number of bootstrap iterations. Defaults to 1000. |
1000
|
overwrite_previous_results |
bool
|
whether to overwrite results of a previous run. Defaults to True. |
True
|
additional_args |
str
|
additional arguments to iqtree. Defaults to None. |
None
|
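The sketch below shows how the documented parameters might translate into an iqtree command line. It is not the wrapper's actual code: `build_iqtree_command` is a hypothetical helper, the parameter-to-flag mapping is an assumption, and the flag spellings follow IQ-TREE 1.x conventions, so they should be checked against `iqtree -h` for the installed version.

```python
import shlex
from typing import List, Optional


def build_iqtree_command(
    input_algns: str,
    output_prefix: str,
    substitution_model: str = "TEST",
    processes: Optional[int] = None,
    bootstrap_replicates: int = 1000,
    max_bootstrap_iterations: int = 1000,
    overwrite_previous_results: bool = True,
    quiet: bool = True,
    additional_args: Optional[str] = None,
) -> List[str]:
    """Hypothetical mapping from wrapper parameters to iqtree arguments."""
    cmd = [
        "iqtree",
        "-s", input_algns,                     # input alignment
        "-m", substitution_model,              # model, or model selection with TEST
        "-pre", output_prefix,                 # prefix for all output files
        "-bb", str(bootstrap_replicates),      # ultrafast bootstrap replicates
        "-nm", str(max_bootstrap_iterations),  # maximum number of iterations
        "-nt", "AUTO" if processes is None else str(processes),
    ]
    if overwrite_previous_results:
        cmd.append("-redo")
    if quiet:
        cmd.append("-quiet")
    if additional_args:
        # Split the extra-argument string the way a shell would.
        cmd.extend(shlex.split(additional_args))
    return cmd


cmd = build_iqtree_command("ref.faln", "out/ref", processes=4, additional_args="-alrt 1000")
print(" ".join(cmd))
```

Building the command as a list and splitting `additional_args` with `shlex.split` keeps quoted values intact when the extra arguments contain spaces.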
Source code in src/metatag/wrappers.py
run_mafft(input_fasta, output_file=None, processes=-1, parallel=True, quiet=True, additional_args=None)
Simple CLI wrapper to mafft (MSA)
Manual: https://mafft.cbrc.jp/alignment/software/manual/manual.html
CLI examples:
mafft --globalpair --thread n in > out
mafft --localpair --thread n in > out
mafft --large --globalpair --thread n in > out
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_fasta |
Path
|
path to input fasta file |
required |
output_file |
Path
|
path to output file. Defaults to None. |
None
|
processes |
int
|
maximum number of processes. Defaults to -1. |
-1
|
parallel |
bool
|
whether to run mafft in parallel. Defaults to True. |
True
|
additional_args |
str
|
additional arguments to mafft. Defaults to None. |
None
|
Source code in src/metatag/wrappers.py
run_muscle(input_fasta, output_file=None, maxiters=None, quiet=True, additional_args=None)
Simple CLI wrapper to muscle (MSA).
muscle: https://www.drive5.com/muscle/manual/output_formats.html
Output formats: phylip and fasta.aln
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_fasta |
Path
|
path to input fasta file |
required |
output_file |
Path
|
path to output file. Defaults to None. |
None
|
maxiters |
int
|
maximum number of iterations. Defaults to None. |
None
|
additional_args |
str
|
additional arguments to muscle. Defaults to None. |
None
|
Source code in src/metatag/wrappers.py
run_papara(tree_nwk, msa_phy, query_fasta, output_aln=None, additional_args=None)
Simple CLI wrapper to Papara. Outputs the alignment in fasta format.
Run Papara to align query sequences to the reference MSA and tree (required for EPA-ng). Alignment could be done with hmmalign or muscle as well, but these tools don't consider the tree during alignment (would this be a justified improvement over hmmalign?).
There seems to be a problem with enabling multithreading in papara when it is run as a static executable. It looks like multithreading has to be enabled during compilation (but compilation is currently not working): https://stackoverflow.com/questions/19618926/thread-doesnt-work-with-an-error-enable-multithreading-to-use-stdthread-ope
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tree_nwk |
Path
|
path to tree file in newick format |
required |
msa_phy |
Path
|
path to reference alignment in phylip format |
required |
query_fasta |
Path
|
path to query fasta file |
required |
output_aln |
Path
|
path to output alignment. Defaults to None. |
None
|
additional_args |
str
|
additional arguments to papara. Defaults to None. |
None
|
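Since the reference alignment is expected in phylip format, a FASTA alignment has to be converted first. The sketch below writes a relaxed sequential PHYLIP layout (a header line with the number of sequences and alignment length, then one `name sequence` line per record). The helper `fasta_to_phylip` is hypothetical, and whether this exact layout matches what the library produces is an assumption.

```python
def fasta_to_phylip(fasta_text: str) -> str:
    """Convert an aligned FASTA string to relaxed sequential PHYLIP text."""
    records = []
    for block in fasta_text.split(">")[1:]:
        lines = block.splitlines()
        if not lines or not lines[0].strip():
            continue  # skip empty records
        # Keep only the first word of the header as the sequence name.
        name = lines[0].split()[0]
        sequence = "".join(part.strip() for part in lines[1:])
        records.append((name, sequence))
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("all sequences must be aligned to the same length")
    out = [f"{len(records)} {lengths.pop()}"]
    out += [f"{name} {seq}" for name, seq in records]
    return "\n".join(out) + "\n"


phylip_text = fasta_to_phylip(">a desc\nAC-G\nT\n>b\nACGGT\n")
print(phylip_text)
```

Sequence names must not contain whitespace in this layout, which is why only the first word of each FASTA header is kept.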
Source code in src/metatag/wrappers.py
run_prodigal(input_file, output_prefix=None, output_dir=None, metagenome=False, additional_args=None)
Simple CLI wrapper to prodigal
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_file |
Path
|
path to input file |
required |
output_prefix |
str
|
prefix to be prepended to output files. Defaults to None. |
None
|
output_dir |
Path
|
path to output directory. Defaults to None. |
None
|
metagenome |
bool
|
whether original sequences are metagenomic. Defaults to False. |
False
|
additional_args |
str
|
additional arguments to prodigal. Defaults to None. |
None
|
Source code in src/metatag/wrappers.py
run_seqkit_nodup(input_fasta, output_fasta=None, export_duplicates=False, duplicates_file=None)
Simple CLI wrapper to seqkit rmdup
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_fasta |
Path
|
path to input fasta file |
required |
output_fasta |
Path
|
path to output file. Defaults to None. |
None
|
export_duplicates |
bool
|
whether to export duplicated sequences. Defaults to False. |
False
|
duplicates_file |
Path
|
path to output file containing duplicates. Defaults to None. |
None
|
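To illustrate what duplicate removal with exported duplicates amounts to, here is a pure-Python sketch that does not call seqkit. It assumes duplicates are identified by identical sequence content, which may differ from the criterion the wrapper passes to seqkit; `remove_duplicates` is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def remove_duplicates(
    records: Dict[str, str]
) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Keep the first record per distinct sequence and report the rest."""
    unique: Dict[str, str] = {}
    first_seen: Dict[str, str] = {}        # sequence -> name of kept record
    duplicates: Dict[str, List[str]] = {}  # kept name -> names of removed copies
    for name, sequence in records.items():
        if sequence in first_seen:
            duplicates.setdefault(first_seen[sequence], []).append(name)
        else:
            first_seen[sequence] = name
            unique[name] = sequence
    return unique, duplicates


unique, duplicates = remove_duplicates({"a": "ACGT", "b": "ACGT", "c": "TTGA"})
```

The second return value plays the role of the duplicates file: it records which removed entries each retained sequence stands for.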
Source code in src/metatag/wrappers.py