SEMPHY (Version 2.00)
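This manual covers the following topics:
   Neighbor Joining (NJ) trees
   Maximum Likelihood (ML) trees using SEMPHY
   List of options and parameters
   Running on large datasets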

SEMPHY is always run by calling the executable semphy from a command prompt with a set of parameters:
   semphy [parameters...]

Here we give a few examples of using SEMPHY for the most common tasks.  Below is a table listing most of the options; the full list is available by typing 'semphy -h' at the command prompt.

Neighbor Joining (NJ) trees
Running standard NJ on protein sequences using the JTT replacement matrix:
   semphy -s prots.fasta -o out.txt -T prots.tree -l log.txt -a 20 --jtt -J -H
(Meaning:  Input is a fasta sequence file and output is written to three files (general output, tree file and log file).  Use alphabet of 20, i.e. amino acids.  Use the JTT matrix.  Do Neighbor Joining, with a homogeneous rates model.)

Same thing, but using the new iterative NJ method with Bayesian estimation of the rate at each site (i.e. "posterior" estimates of the rates).  The -O flag requests optimization of the rate parameters.  For a description of the algorithm, please see our paper referenced on the SEMPHY homepage (Ninio et al. 2006).
   semphy -s prots.fasta -o out.txt -T prots.tree -l log.txt -a 20 --jtt --posteriorDTME -O

Running iterative NJ on DNA sequences, using the HKY model:
   semphy -s genes.fasta -o out.txt -T genes.tree -l log.txt -a 4 --hky --posteriorDTME -O
(Alphabet of 4 indicates DNA or RNA)

Same thing, with 100 bootstrap iterations:
   semphy -s genes.fasta -o out.txt -T genes.tree -l log.txt -a 4 --hky --posteriorDTME -O --BPrepeats 100
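
The iterative NJ examples above differ only in the alphabet, the model flag and the optional bootstrap count.  As a minimal sketch of that pattern, the hypothetical helper nj_command below (not part of SEMPHY) prints the command line for one alignment without running it.  The output file naming and the "dna"/"protein" labels are assumptions made for illustration; only flags shown in this manual are used.

```shell
# Hypothetical helper, not part of SEMPHY: print (do not run) the
# iterative-NJ command for one alignment.
#   $1 = FASTA file
#   $2 = "dna" or "protein" (labels chosen for this sketch)
#   $3 = optional number of bootstrap iterations
nj_command() {
    aln=$1
    base=${aln%.*}            # strip the file extension, e.g. genes.fasta -> genes
    case "$2" in
        dna)     model="-a 4 --hky" ;;    # alphabet of 4 with the HKY model
        protein) model="-a 20 --jtt" ;;   # alphabet of 20 with the JTT matrix
        *) echo "unknown data type: $2" >&2; return 1 ;;
    esac
    cmd="semphy -s $aln -o $base.out.txt -T $base.tree -l $base.log.txt $model --posteriorDTME -O"
    # Append the bootstrap option only when a count was given.
    if [ -n "${3:-}" ]; then
        cmd="$cmd --BPrepeats $3"
    fi
    echo "$cmd"
}
```

For example, 'nj_command genes.fasta dna 100' prints a command equivalent to the bootstrap example above, apart from the output file names.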

Maximum Likelihood (ML) trees using SEMPHY
Running SEMPHY to find the ML tree for a set of protein sequences using the JTT replacement matrix (standard NJ is used for the initial tree):
   semphy -s prots.fasta -o out.txt -T prots.tree -l log.txt -a 20 --jtt -S -O
(Meaning:  Run SEMPHY on a fasta sequence file and write output, tree and log files.  Use alphabet of 20, i.e. amino acids.  Use the JTT matrix.  Do SEMPHY steps.  -O requests optimization of rate parameters.)

Same thing, but using the new iterative NJ method for the initial tree:
   semphy -s prots.fasta -o out.txt -T prots.tree -l log.txt -a 20 --jtt -S --posteriorDTME -O

Running SEMPHY to find the ML tree for a set of DNA sequences using the HKY model:
   semphy -s genes.fasta -o out.txt -T genes.tree -l log.txt -a 4 --hky -S --posteriorDTME -O
(Meaning:  Run SEMPHY on a fasta sequence file and write output, tree and log files.  Use alphabet of 4, i.e. nucleotides.  Use the HKY model.  Do SEMPHY steps.  Use iterative NJ for the initial tree.  -O requests optimization of rate parameters.)

Same thing, with 100 bootstrap iterations:
   semphy -s genes.fasta -o out.txt -T genes.tree -l log.txt -a 4 --hky -S --posteriorDTME -O --BPrepeats 100
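
The ML examples above start the search either from a standard NJ tree or from an iterative NJ tree.  The following sketch runs both variants on the same alignment and keeps their outputs apart, so the two resulting trees can be compared.  The function names, the DRY_RUN variable and the output file names are hypothetical and not part of SEMPHY; only flags shown in this manual are used.

```shell
# Hypothetical wrapper, not part of SEMPHY.  With DRY_RUN set to a non-empty
# value, commands are printed instead of executed.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Run the ML search twice, once per starting tree.
#   $1 = FASTA file, $2 = alphabet size (4 or 20), $3 = model flag (e.g. --jtt)
ml_both_starts() {
    aln=$1
    base=${aln%.*}
    # ML search with no NJ method chosen (standard NJ is used)
    run semphy -s "$aln" -o "$base.stdnj.out.txt" -T "$base.stdnj.tree" \
        -l "$base.stdnj.log.txt" -a "$2" "$3" -S -O
    # ML search starting from the iterative NJ tree
    run semphy -s "$aln" -o "$base.iternj.out.txt" -T "$base.iternj.tree" \
        -l "$base.iternj.log.txt" -a "$2" "$3" -S --posteriorDTME -O
}
```

Setting DRY_RUN=1 before calling 'ml_both_starts prots.fasta 20 --jtt' shows the two command lines without starting a run.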

List of options and parameters
The following table lists most of the available options and parameters (the full list can be printed by typing 'semphy -h' at the command prompt)

Flag                    Full name          Description                                              Default
-h                      --help             Print help and exit
                        --full-help        Print help, including advanced options, and exit
-s [MSA file]           --sequence         The input sequence file. Supported formats:              Obligatory
                                           Mase, Molphy, Phylip, Clustal, Fasta
-t [tree file]          --tree             An initial input tree file (in Newick format)            Optional
-o [output file]        --outputfile       File for general outputs                                 Optional
-T [output tree file]   --treeoutputfile   Output of the final tree                                 Optional
-l [log file]           --Logfile          Log file                                                 Optional
-v [verbosity level]    --verbose          Verbosity level of the log file (between 0 and 10)       Optional
-a [alphabet]           --alphabet         4 - nucleotides; 20 - amino acids; 61 - codons           20
                        --BPrepeats        Perform a number of bootstrap iterations                 Optional
-S                      --SEMPHY           Do SEMPHY steps to search for the ML tree                Optional

Distance Table Estimation Method (DTME)
Choice of NJ variant to be used in SEMPHY, or by itself.  Specifies the method that NJ uses to calculate the distance table.  Standard NJ is -J.  The recommended iterative NJ method is --posteriorDTME.

Simple pairwise methods:
   -J                     Standard NJ, using ML distance with a homogeneous rates model (also invoked by --homogeneousRatesDTME)
   --pairwiseGammaDTME    ML distance with a Gamma-ASRV model (usually not recommended)

Iterative distance-based tree reconstruction methods:
   --commonAlphaDTME      Using the common alpha parameter
   --rate4siteDTME        Using the ML rate for each site
   --posteriorDTME        Using the posterior distribution of the rate at each site

NJ is not run unless one of these methods is requested.  The exception is -S: if it is used with no method chosen, -J is implied.

Evolutionary model
The following models are supported:
   JC for nucleotides (--nucjc) or amino acids (--aaJC)
   K2P (--k2p), HKY (--hky), Dayhoff (--day), JTT (--jtt)
   REV (--rev), WAG (--wag), cpREV (--cprev)
Or load a matrix from a file:  --modelfile=[file name]
Default:  JTT

Among Site Rate Variation
NOTE:  Either -H, -A, or -O must be given

Flag                    Full name          Description                                              Default
-H                      --homogeneous      A homogeneous rates model (no Gamma ASRV)                See note above
-A [alpha]              --alpha            Set the initial alpha parameter for Gamma ASRV           See note above
-O                      --optimizeAlpha    Optimize the alpha parameter for the reconstructed tree  See note above
-C [categories number]  --categories       The number of discrete categories used in the            8
                                           approximation of the Gamma distribution of rates
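
Because one of -H, -A or -O must be given, a wrapper script may want to check its arguments before starting a long run.  The sketch below is a hypothetical helper (not part of SEMPHY) that succeeds if at least one of these flags appears in an argument list.  Matching the '--alpha=value' spelling is an assumption, based on the '--modelfile=[file name]' form used elsewhere in this manual.

```shell
# Hypothetical pre-flight check, not part of SEMPHY: return 0 if the argument
# list contains at least one among-site rate variation flag, 1 otherwise.
has_asrv_flag() {
    for arg in "$@"; do
        case "$arg" in
            -H|--homogeneous)        return 0 ;;  # homogeneous rates
            -A|--alpha|--alpha=*)    return 0 ;;  # initial alpha given (the = form is assumed)
            -O|--optimizeAlpha)      return 0 ;;  # alpha optimized
        esac
    done
    return 1
}
```

A script could then stop early with a line such as:
   has_asrv_flag "$@" || { echo "one of -H, -A or -O must be given" >&2; exit 1; }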

Running on large datasets
Starting from version 2.0, SEMPHY can be run on very large datasets of many thousands of sequences.  However, handling such datasets requires a different build of the program: SEMPHY needs to be compiled with the doubleRep flag:
> make doubleRep all
This compilation command makes a copy of SEMPHY called semphy.doubleRep.  This build uses a different implementation of the double data type that allows the handling of a virtually unlimited number of sequences.  The price for using doubleRep is a running time that is slower by about one order of magnitude.  Therefore, it is recommended to use the normal build of SEMPHY where possible, and to use the doubleRep build only for datasets of 300 sequences or more.
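
The 300-sequence rule of thumb above can be automated.  The following sketch counts the sequences in a Fasta file by counting its header lines and prints the name of the executable to use.  The function names are hypothetical and not part of SEMPHY, and the sketch assumes Fasta input only; other supported formats would need a different count.

```shell
# Hypothetical helpers, not part of SEMPHY.

# Count sequences in a Fasta file by counting header lines (starting with '>').
count_fasta_seqs() {
    grep -c '^>' "$1"
}

# Print the executable to use for the given Fasta file: semphy.doubleRep for
# 300 sequences or more, the normal semphy otherwise.
choose_semphy() {
    # grep exits non-zero on zero matches or a missing file; fall back to 0.
    n=$(count_fasta_seqs "$1" 2>/dev/null) || n=0
    if [ "$n" -ge 300 ]; then
        echo semphy.doubleRep
    else
        echo semphy
    fi
}
```

The result can be used in place of the program name, for example:
   $(choose_semphy genes.fasta) -s genes.fasta -o out.txt -T genes.tree -l log.txt -a 4 --hky --posteriorDTME -O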