FAQ (Frequently Asked Questions)

Last updated 22 February 2007

Why are all the answers given in terms of commands and not menu choices?

Primarily this is to maintain consistency. All versions of PAUP* have a command-line
interface, whereas only a few versions have a menu system. If answers were given
in terms of menu choices, users of the UNIX and DOS versions would be out of luck.
Also, many users prefer to put all of the commands for a particular analysis in a
PAUP block directly in the data file itself. This maintains a complete record of
how the analysis was carried out, which is useful later for purposes of writing
the “Methods” section of a paper. The commands presented here can all be used within
PAUP blocks as well as on the command line itself, thus facilitating the creation
of PAUP blocks.


If I don’t find it here, does that mean that it doesn’t exist?

This FAQ is written as the need arises, so it will continue to grow in completeness
each week. It is not intended to be a replacement for the PAUP* manual, but
we hope it is a useful surrogate until the program and manual are officially published.
The FAQ’s authors frequently receive questions (usually by email) about using
PAUP*, and this provides a convenient mechanism for responding to common questions that we
receive over and over again. Please feel free to submit candidate
questions for inclusion in the FAQ.


Can I submit questions that I think should be part of this FAQ?

Please do. We welcome submission of candidate questions for the PAUP* FAQ, but be
aware that the decision to include any particular question resides with the authors
of the FAQ. The
questions most likely to make it into the FAQ are those that we feel would benefit a
large proportion of PAUP* users. Answers in the form of
a series of PAUP* commands are of course very much appreciated. Please refrain from
using abbreviations of commands, as abbreviations change over time as more commands
are added to PAUP*. Also, if you find answers that are incorrect or ambiguous,
please let us know!


I just updated PAUP* using the updater on your web site, yet when I try to run PAUP* I still get the message that PAUP* has expired. Why?

Occasionally this happens because a user’s computer is not set to the
correct date or the user is clicking on an icon that is not linked to
the beta 8 binary. Because PAUP is sensitive to both the creation
date and expiration date, back-dating your computer to a time before
the program was created will also generate the expiration notice.
After checking your system date, make sure that you are executing the beta
8 binary.


Is PAUP* Year 2000 Compliant?

Yes, PAUP* is “Year 2000 Compliant.” The only time PAUP* uses dates is
to output them to the main display and/or log file for the user’s
information. If the host operating system returns the correct date
when PAUP* requests it, then PAUP* will show the correct date in
its output. Even if the host operating system fails to return the
correct date in the year 2000, the only consequence is that the date
will not be shown correctly by PAUP* in its display output and log file.


What is a batch file?

A batch file contains commands that you would otherwise issue
interactively (i.e., from pull-down menus or the command line).
For example, using the pull-down menus in the Mac version of PAUP* you could:
1) open the data file combine2.dat from the file menu and execute it
2) exclude the character sets cytb and junk2 from the data menu
3) start a heuristic search from the analysis menu

You could obtain the same result by executing a simple text file containing the
following PAUP block.

begin paup;
execute d:\data\combine2.dat;
exclude cytb junk2;
hsearch;
end;


I’m using a beta version of PAUP* 4.0. How should I cite the program?

Swofford, D. L. 2003. PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods).
Version 4. Sinauer Associates, Sunderland, Massachusetts.

Note: Because there are a number of beta and test versions of the program
you should mention the specific version of PAUP* somewhere in the methods.


Is there a version of PAUP* that will run a search in parallel on a multiple processor machine or a cluster of machines?

Right now the answer is no. PAUP* is a single-threaded application that will only take advantage of one processor at a time. Dave is in the process of parallelizing the code for the portable or Unix version of PAUP*, but it will be a while before a general parallel version of PAUP* is available.


Could you recommend some textbooks that will help me to learn more about the analyses that can be done in PAUP*?

There are a number of good books out there that deal with the subject of phylogenetic analyses. The selection below is just a few of the textbooks that I find myself referring to.

  • Felsenstein, J. 2002. Inferring Phylogenies. Sinauer Associates. Sunderland, Massachusetts.
  • Li, W. 1997. Molecular Evolution. Sinauer Associates. Sunderland, Massachusetts.
  • Nei, M. and Kumar, S. 2000. Molecular Evolution and Phylogenetics. Oxford University Press, New York, New York.
  • Page, R. D. and Holmes, E. C. 1998. Molecular Evolution: A Phylogenetic Approach. Blackwell Science, Oxford
  • Hillis, D. M., Moritz, C., and Mable, B. Molecular Systematics (2nd ed.) Sinauer Associates. Sunderland, Massachusetts.


What are the maximum dimensions (i.e., characters x sequences) of a data matrix that PAUP* will read?

The maximum number of sequences (AKA taxa) is 16384. The maximum number of characters (AKA positions or sites) will depend on the type of computer you are using. If your machine uses a 32-bit processor, the maximum will be 2^30 (2 raised to the power of 30), whereas machines with 64-bit processors can read a maximum of 2^62 characters.


What is the maximum number of character states that can be assigned to a character in PAUP*?

16 for a 16-bit machine
32 for a 32-bit machine
64 for a 64-bit machine
This limit stems from the use of bit manipulation to perform the state-set calculations in parsimony, and corresponds to the “word length” of the computer: usually 32 bits (e.g., most x86 PCs) but occasionally 64 bits (e.g., Alpha, G5, etc.).
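
To see why the number of states is tied to the word length, consider the following minimal Python sketch. It is an illustration only, not PAUP*'s actual code, and the names encode_states and fitch_node are hypothetical. Each state gets one bit of an integer, so a whole state set fits in a single machine word, and a Fitch-style parsimony step at a node reduces to one AND and one OR.

```python
def encode_states(states):
    """Pack a collection of state indices (0, 1, 2, ...) into one integer bit mask."""
    mask = 0
    for s in states:
        mask |= 1 << s             # state s occupies bit s of the word
    return mask


def fitch_node(left, right):
    """Fitch step for one internal node: return (state_set, added_steps)."""
    common = left & right          # intersection of the two child state sets
    if common:
        return common, 0           # the children share a state: no extra step
    return left | right, 1         # otherwise take the union and count one step


a = encode_states([0])             # a taxon showing state 0
b = encode_states([1, 2])          # a taxon showing state 1 or 2
node, steps = fitch_node(a, b)
print(bin(node), steps)            # prints: 0b111 1
```

With one bit per state, a 32-bit word can hold at most 32 distinct states, which is the limit described above.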


Why doesn’t PAUP* allow me to set the criterion to likelihood after I execute my data set?

To use the maximum likelihood criterion in PAUP* your dataset must be composed of DNA, Nucleotide, or RNA characters and the “datatype” option under the “format” command must also be set to one of these values. For example:

Begin characters;
Dimensions nchar=200;
Format datatype=dna interleave;


How do I tell PAUP* I want to use the likelihood criterion?

set criterion=likelihood;


How do I tell PAUP* I want to use the parsimony criterion?

set criterion=parsimony;


How do I tell PAUP* I want to use the minimum evolution criterion?

set criterion=distance;
dset objective=me;


How do I tell PAUP* I want to use the least-squares criterion?

set criterion=distance;
dset objective=lsfit;

The default least-squares objective function is for weighted least squares,
with the weights equal to the reciprocal of the square of the distance
between each pair of taxa (see below).


How do I tell PAUP* I want to use the unweighted least-squares criterion?

set criterion=distance;
dset objective=lsfit power=0;

In general, the “power” specifies the power to which the reciprocal of
the distance between each pair of taxa is raised. Raising this value
to the zero(th) power is equivalent to weighting all pairwise deviations
by the constant “1”.
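
The effect of the power option can be illustrated with a small Python sketch. It assumes the usual weighted least-squares form, the sum over taxon pairs of w * (d - p)^2 with w = 1 / d^power, where d is the observed distance and p is the path length on the tree. The function name ls_fit and the numbers are made up for illustration; they are not PAUP* output.

```python
def ls_fit(observed, path_lengths, power=2):
    """Sum of w * (d - p)**2 over taxon pairs, with weight w = 1 / d**power."""
    total = 0.0
    for d, p in zip(observed, path_lengths):
        weight = 1.0 / d ** power      # power=0 gives weight 1 for every pair
        total += weight * (d - p) ** 2
    return total


observed = [0.10, 0.20, 0.40]          # hypothetical pairwise distances
on_tree = [0.12, 0.18, 0.44]           # hypothetical path lengths on a tree

print(ls_fit(observed, on_tree, power=0))   # unweighted least squares
print(ls_fit(observed, on_tree, power=2))   # weights are 1 / d**2
```

With power=0 every pairwise deviation counts equally; with power=2 deviations on short distances are penalized more heavily than the same deviations on long distances.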


Which non-NEXUS file formats will PAUP* import?

  • FreqPars
  • Hennig86
  • MEGA
  • Phylip 3.X
  • Simple text
  • Tab-delimited text


Where can I find examples of non-NEXUS file formats that PAUP* will import?

Sample non-NEXUS files are given at http://paup.csit.fsu.edu/nfiles.html.


How do I import non-NEXUS formatted files into PAUP*?

To import non-NEXUS formatted files into PAUP* you need to use the tonexus command. For example:

tonexus format=gcg fromfile=mygcgfile.gcg tofile=mynexusfile.nex;

If you are using the Mac interface you can get to the import dialog box by selecting File and then Import data…


How do I tell PAUP* to ignore certain taxa in further analyses?

The following lines show five alternative ways of telling PAUP* to ignore the taxa
P._articulata, P._gracilis, P._fimbriata, and P._robusta (we’ll assume these
were numbers 2, 3, 4 and 7 in the data matrix, respectively) in further analyses.

delete P._articulata P._gracilis P._fimbriata P._robusta;
delete 'P. articulata' 'P. gracilis' 'P. fimbriata' 'P. robusta';
delete 'P. articulata'-'P. fimbriata' P._robusta;
delete 2 3 4 7;
delete 2-4 7;

Note: If you plan to refer to a set of taxa frequently, you may find it convenient
to set up a taxset. Sets are defined in a sets block. For the four taxa
given above, defining a taxset would look like this:

begin sets;
taxset junk = P._articulata P._gracilis P._fimbriata P._robusta;
end;

After the taxset is defined, simply refer to the taxset to ignore these taxa
in further analyses. For example:

delete junk;


How do I tell PAUP* to use taxa that I previously told it to ignore?

The following lines show five alternative ways to tell PAUP* to
reinstate four taxa previously deleted (see above):

restore P._articulata P._gracilis P._fimbriata P._robusta;
restore 'P. articulata' 'P. gracilis' 'P. fimbriata' 'P. robusta';
restore 'P. articulata'-'P. fimbriata' P._robusta;
restore 2 3 4 7;
restore 2-4 7;

Note: If you’ve defined a taxset then you can use the following syntax:

restore junk;


How do I tell PAUP* to ignore certain characters (sites) in further analyses?

The following lines show five alternative ways of telling PAUP* to ignore the
characters leaf_length, leaf_width, stamen_number, and carpel_number (we’ll assume these
were characters number 2, 3, 4 and 7 in the data matrix, respectively) in further analyses.

exclude leaf_length leaf_width stamen_number carpel_number;
exclude 'leaf length' 'leaf width' 'stamen number' 'carpel number';
exclude leaf_length-stamen_number 'carpel number';
exclude 2 3 4 7;
exclude 2-4 7;

If you plan to exclude these characters frequently, it is a good idea to define them in a character set. You can then exclude them by referencing the character set. For example:

charset foo = 2-4 7;
exclude foo;

Here’s how to tell PAUP* to ignore nucleotide sites
359 to 367, 586 to 588 and 693 to the last site in further analyses.

exclude 359-367 586-588 693-.;

Here’s how to tell PAUP* to ignore every third nucleotide site
in further analyses (starting with the third site).

exclude 3-.\3;
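
The range notation used in these character lists can be made concrete with a short Python sketch. This is only an illustration of how the notation reads, not how PAUP* parses it; the function name expand_item is hypothetical, and it handles numeric items only (not character names or set names).

```python
def expand_item(item, nchar):
    """Expand one numeric item of a character list into a list of site numbers.

    nchar is the number of characters in the matrix; "." stands for the last one.
    """
    step = 1
    if "\\" in item:                      # a trailing "\k" means "every k-th site"
        item, step_text = item.split("\\")
        step = int(step_text)
    first, _, last = item.partition("-")  # "7" has no "-", so last is empty
    start = int(first)
    if last == "":
        stop = start                      # a single site
    elif last == ".":
        stop = nchar                      # "." means the last site
    else:
        stop = int(last)
    return list(range(start, stop + 1, step))


print(expand_item(r"3-.\3", 12))          # [3, 6, 9, 12]
print(expand_item("693-.", 695))          # [693, 694, 695]
```

For a 12-site matrix, the item 3-.\3 therefore selects sites 3, 6, 9 and 12, i.e. every third site starting with the third.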


How do I tell PAUP* to use characters (sites) that I previously told it to ignore?

The following lines show five alternative ways to tell PAUP* to
reinstate four characters previously excluded (see above)

include leaf_length leaf_width stamen_number carpel_number;
include 'leaf length' 'leaf width' 'stamen number' 'carpel number';
include leaf_length-stamen_number 'carpel number';
include 2 3 4 7;
include 2-4 7;

Here’s how to tell PAUP* to include previously excluded nucleotide sites
359 to 367, 586 to 588 and 693 to the last site in further analyses.

include 359-367 586-588 693-.;

Here’s how to tell PAUP* to include every third nucleotide site
(starting with site number 1) in further analyses.

include 1-.\3;


How do I exclude all the constant characters?

exclude constant;


How do I exclude all constant as well as autapomorphic characters?

exclude uninf;


How do I combine different data sets into a single NEXUS file?

In the example below, protein and nucleotide data are combined in a single interleaved data set. Notice that character sets are used to distinguish the data sets.

Begin data;
Dimensions ntax=5 nchar=20;
Format datatype=protein interleave symbols="ACGT" gap=-;


Begin Assumptions;
charset protein = 1-10;
charset dna = 11-.;

usertype 5_1 stepmatrix = 4 acgt
- 5 1 5
5 - 5 1
1 5 - 5
5 1 5 -
;

Begin paup;
outgroup t2 t3;
ctype 5_1:dna;
hsearch addseq=random;


How do I code indels so that they are not treated as missing data?

If you are confident about the homology of the indels then you might consider setting up an additional character for each site in the original matrix that contains an indel. The new sites would be represented by a binary character. The syntax for doing this looks like this:

begin data;
dimensions ntax=4 nchar=10;
format datatype=dna gap=- interleave symbols="01";
options gapmode=missing;
matrix
one ATGGT--
two AtggT--
three A-GGTTG
four A-GGTAG

one 011
two 011
three 100
four 100
;
end;


What are data partitions and why are they useful?

Data partitions divide the characters in your data matrix into two or more
groups. This is useful for performing the partition homogeneity test
or for estimating site-specific rates by maximum likelihood.


How do I define and name a data partition?

Here a partition is created and named codons. The partition divides sites into
first, second and third codon positions. The first partition, named firstpos,
includes every third site (the \3 means every third site) starting from site
1 and ending with the last site (the period means last character). The
second and third partitions, named secondpos and thirdpos respectively,
are defined similarly, except they have different starting points.

charpartition codons = firstpos:1-.\3, secondpos:2-.\3, thirdpos:3-.\3;


How do I do a partition homogeneity test?

First you’ll need to set up a partition. For this example, I’ll set up a partition
called genes for two partial gene sequences.

charpartition genes = gene1:1-210, gene2:230-.;

Next I’ll need to exclude the characters that are in the NEXUS data set but are not assigned to either of the
two subsets, gene1 or gene2.

exclude 211-229;

Now I can use the partition homogeneity test.

hompart partition=genes;


What are topological constraints?

Topological constraints are unresolved trees used to filter out trees discovered during
the search that do not match a particular topological criterion. One possible use of
a topological constraint is to force a particular group to be convex (i.e., monophyletic
if the tree is rooted outside the group). This type of topological constraint is
referred to as a monophyly constraint. Monophyly constraint trees contain all
the taxa but are unresolved to some degree. A second type of constraint is called
a backbone constraint. Backbone constraint trees are normally fully resolved,
but are missing one or more taxa. A tree encountered during a search is consistent
with a backbone constraint tree so long as pruning all taxa not in the constraint
tree yields the constraint tree topology. One may wish to compare the
support of the data for the best tree obtained under the constraint to the best tree
without the constraint. Note that PAUP* offers much more flexibility in terms of
topological constraints than is indicated here; the manual for version 3.1 explains
constraints thoroughly.


How do I define and name a topological constraint?

Suppose you are studying bot flies that parasitize either lagomorphs or
rodents depending on the species. You may be interested in finding the
best tree in which the lagomorph-infecting species of bot flies form
a monophyletic group. Assume that there are 10 taxa, and taxa
2, 3, 5, 7 and 9 are lagomorph-infecting species, while the others (1, 4,
6, 8 and 10) are rodent-infecting species.

constraints lagomorph (monophyly) = (1,4,6,8,10,(2,3,5,7,9));

Here, the word lagomorph is the name of the topological constraint, and
the word monophyly is a keyword indicating the type of constraint (the other
possible type is specified using the keyword backbone).
Note that taxa connected directly to the root node do not have to
be specified explicitly in constraint-tree definitions, and monophyly
constraints are the default. The above example could thus also be written:

constraints lagomorph = ((2,3,5,7,9));


How do I load a topological constraint in the form of a tree file?

Suppose one or more constraint trees exist as tree definitions in a tree file named
“foo.tree” (the names of the trees in the tree file will become the names of
the corresponding constraint definitions when the treefile is loaded).

loadconstr file=foo.tree;

If the trees in “foo.tree” are to be considered backbone constraints,
then the keyword “asbackbone” must be included (otherwise the trees are
considered to be monophyly constraints):

loadconstr file=foo.tree asbackbone;


How do I apply a previously-defined topological constraint to a search?

The command below will perform a heuristic search using all default options
except that the (predefined) topological constraint named lagomorph will be enforced:

hsearch constraints=lagomorph enforce=yes;

Other search-related commands for which the constraints and enforce options are
available are illustrated in the examples below:

nj constraints=lagomorph enforce=yes; [neighbor-joining]
alltrees constraints=lagomorph enforce=yes; [exhaustive search]
bandb constraints=lagomorph enforce=yes; [branch-and-bound search]


How do I get a single majority-rule bootstrap consensus tree from the results of multiple bootstrap runs performed at different times or on different machines?

First, save the trees found during each bootstrap run. By default, PAUP* uses the system clock to seed the random number generator; thus, provided you do not change the value of bseed, characters will be sampled differently from run to run. After the bootstrap runs have completed, retrieve the tree files and compute the consensus tree using the options given below.

begin paup;
execute my_nexus_file.nex;
bootstrap treefile=futz1.out nreps=10 bseed=0 search=heuristic;
end;

begin paup;
execute my_nexus_file.nex;
bootstrap treefile=futz2.out nreps=10 bseed=0 search=heuristic;
end;

begin paup;
execute my_nexus_file.nex;
bootstrap treefile=futz3.out nreps=10 bseed=0 search=heuristic;
end;

begin paup;
execute my_nexus_file.nex;
gettrees file=futz1.out StoreTreeWts=yes mode=3;
gettrees file=futz2.out StoreTreeWts=yes mode=7;
gettrees file=futz3.out StoreTreeWts=yes mode=7;
contree all/strict=no majrule=yes usetreewts=yes;
end;


How do I tell PAUP* to save the trees currently in memory to a file?

Here’s how to save the trees (and the estimated branch lengths) to the
file ‘foo.trees’

savetrees file=foo.trees brlens;


How do I tell PAUP* to read in trees previously saved in a file?

Here’s how to load into memory the trees saved in the file ‘foo.trees’

gettrees file=foo.trees;


Why can’t I get PAUP* to save branch length on the bootstrap consensus tree?

The bootstrap tree is a consensus of the trees found for each replicate sample of the
data. Since each replicate tree will have a different set of branch lengths none are
displayed or saved on the bootstrap consensus tree.


How can I limit the number of rearrangements PAUP* evaluates during a heuristic search?

There are several different ways to go about this. First, there is a “rearrlimit=n” option on the hsearch
command, which limits the total number of rearrangements for each search to n. Second, there is a
“timelimit=n” option, where n is the number of seconds that PAUP* will use to search for a tree. Note that
if you use these options in conjunction with random-addition-sequence searches,
the “limitperrep=y|n” option determines whether the limit applies per replicate or to the search overall.
You can also specify “reconlimit=n”, where n is the maximum “reconnection distance” for
an SPR or TBR reconnection (1 is equivalent to NNI, infinity to TBR, and
values in between restrict the size of the neighborhood of trees that are examined).

Why are fractions listed in the bootstrap bipartition table when 100 bootstrap replicates are performed?

In some cases PAUP* might find multiple optimal trees for a given replicate. If it does, PAUP* will give each of those trees a weight equal to the reciprocal of the number of trees found in the replicate. You can see this for yourself if you use the treefile option under the bootstrap command to save all trees found during the search. For example:

bootstrap treefile=bstrees.tre;
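
The arithmetic behind the fractional values can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not PAUP*'s implementation: the function name bipartition_support is hypothetical, and each tree is reduced to a set of made-up labels standing in for its bipartitions.

```python
from collections import defaultdict
from fractions import Fraction


def bipartition_support(replicates):
    """Sum tree weights per bipartition.

    replicates is a list of replicates; each replicate is a list of the optimal
    trees found for it; each tree is a set of bipartition labels.
    """
    support = defaultdict(Fraction)
    for trees in replicates:
        weight = Fraction(1, len(trees))   # k equally good trees get weight 1/k each
        for tree in trees:
            for split in tree:
                support[split] += weight
    return dict(support)


replicates = [
    [{"AB", "CD"}],                                  # one optimal tree
    [{"AB", "CD"}, {"AB", "CE"}],                    # two optimal trees
    [{"AB", "CE"}, {"AC", "DE"}, {"AB", "DE"}],      # three optimal trees
]
support = bipartition_support(replicates)
print(support["AB"], support["CD"])                  # prints: 8/3 3/2
```

A bipartition that appears in only some of the equally optimal trees of a replicate therefore contributes a fraction of that replicate, which is why non-integer frequencies can appear even with 100 replicates.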


How do I ask PAUP* to examine every possible tree topology?

Use the alltrees command shown below, but keep in mind that the number of possible unrooted
bifurcating tree topologies increases factorially with the number of taxa.
This means that even for a 14-taxon problem, it will take PAUP* a very long time
to complete this analysis! It probably is not a good idea to
try this command if you have more than ten taxa currently included.

alltrees;
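
To get a feel for how quickly the number of topologies grows, here is a short Python sketch using the standard count of unrooted, fully bifurcating trees for n taxa, (2n-5)!! = 3 x 5 x ... x (2n-5). The function name unrooted_tree_count is hypothetical; the sketch only counts trees and says nothing about how long PAUP* needs to evaluate each one.

```python
def unrooted_tree_count(ntax):
    """Number of unrooted, fully bifurcating topologies for ntax >= 3 taxa."""
    count = 1
    for k in range(3, 2 * ntax - 4, 2):   # odd factors 3, 5, ..., 2*ntax - 5
        count *= k
    return count


for ntax in (5, 10, 14):
    print(ntax, unrooted_tree_count(ntax))
# prints:
# 5 15
# 10 2027025
# 14 316234143225
```

Going from ten to fourteen taxa multiplies the number of trees to be examined by roughly 156,000.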



How do I evaluate 500 random-addition replicates but prevent PAUP* from branch swapping on each one?

hsearch addseq=random nreps=500 swap=none;


How do I set a maxtree limit for each random addition sequence replicate?

If you are doing a number of random addition sequence replicates you’ll need a way to get around
the problem of hitting the maxtree limit on the first replicate and hence aborting the search
before PAUP* gets to the remaining replicates. For example, if you want to apply a maxtree limit of 100 to
each of 10 random addition sequence replicates, then you will need to set the maxtree limit to 1000
and use two options under the hsearch command. The syntax will look like this:

set maxtrees = 1000 increase=no;
hsearch addseq=random nreps=10 nchuck=100 chuckscore=1;


I have performed a heuristic likelihood search and specified 100 replicates within the hsearch command. When I examine the progress reports, it looks like PAUP* is finding many different tree islands; however, the summary at the end says that only one island was found and that island was hit 100 times. What is going on here?

The problem is that PAUP* makes progress reports only once per minute by
default. Once PAUP* encounters a tree in the same island as trees it has
found previously, it immediately abandons the current replicate and begins
working on the next replicate. Thus, even if you set the progress report
interval to 1 second as follows:

set dstatus=1;

you will probably never catch PAUP* at just the moment when it is finishing
one replicate and about to begin the next. As a result, it is very common
for the last entry of a replicate to report a likelihood score that is
worse than the best likelihood score found thus far.


Do you have equations for estimating the relative (or actual) time required for heuristic searches for sequences of different length and for different numbers of sequences?

Unfortunately, the time required to complete a heuristic search cannot be estimated based on the size of a data set. There are a number of reasons why this is so; however, one important reason has to do with the quality of the data (i.e., how homoplastic the data are).

Another important reason is that there is no simple expression for calculating the number of tree bisection-reconnection (TBR) or subtree pruning-regrafting (SPR) rearrangements that will be made on a given tree. That is, the shape of a starting tree will determine the total number of rearrangements that can be made using one of the aforementioned swapping techniques. The problem is further complicated by the fact that it is not known how many suboptimal trees will be found during a search before optimal trees are found, and what portion of potential rearrangements of a given tree will be performed before a better tree is found.


Is there a version of PAUP* that will work on my new Intel-based Mac?

Yes, Mac users who have upgraded to an Intel-based Mac must follow the instructions on this page to get a version of PAUP* that will work on this platform.


Is there a version of PAUP* that will work natively under Mac OS X?

Yes, we have compiled a command-line only version of PAUP* 4.0 beta that will run on Mac OS X in
a terminal window. Note, this version takes full advantage of Mac OS X’s memory protection
and preemptive multitasking but LACKS a Graphical User Interface (GUI). Starting with the forthcoming release of Beta 11,
Mac users will be given a choice to install the command-line version of PAUP* as well as the classic Mac GUI version.
Work is currently underway to “carbonize” the GUI Mac version of PAUP*; however, at this time, we cannot speculate on
when this version will be available. The Mac GUI version of PAUP* is compatible with Mac OS X when run in the classic layer.
If you are only interested in the command-line version of PAUP* then you may purchase
the portable version http://www.paup.csit.fsu.edu/port.html


I just purchased a new Mac and Classic support is not installed on the system. How do I run PAUP* without classic support?

You have two choices. The first is to install classic support on your machine. While Apple no longer installs classic support by default on new systems, you can install it yourself with very little effort. A classic support installer is included on the “Additional Software & Apple Hardware Test” CD. This CD is included with your set of system CDs. Open the file labeled “About the Additional Software & Apple Hardware Test Disc” on the “Additional Software & Apple Hardware Test” CD and you will find concise instructions for installing classic support.

Your second choice is to use the command-line version of PAUP* for OS X. Starting with the forthcoming release of Beta 11, Mac users can use a command-line version of PAUP* in addition to the classic Mac GUI version. The command-line version runs on Mac OS X in a terminal window and takes full advantage of Mac OS X’s memory protection and preemptive multitasking but LACKS a Graphical User Interface (GUI). The Beta 11 installer and updater will automatically add the command-line program to your system path. To start the command-line program, type “paup” in a terminal window. See the quick-start document http://paup.csit.fsu.edu/quickstart.pdf for more details regarding the use of the command-line version of PAUP*.


I get an error when I try to print from the classic version of PAUP*. How do I print from the classic version of PAUP*?

This is probably happening because you do not have a printer set up for the classic layer. A complete description of how to set up printing can be found in Apple’s “Knowledge Base”. The short version is:

  1. If you plan to use an Appletalk printer then you will need to Turn on AppleTalk. Go to your System Preferences > Network > Configure … Select the AppleTalk Tab and then the “Make AppleTalk active” toggle.
  2. Open the Desktop Printer Utility. This is typically located in the Utilities folder within the Applications (Mac OS 9) folder. A window named “New Desktop Printer” should open after a few seconds (give it some time).
  3. Select the printer type that you would like to use and follow the instructions.

If you are only interested in using the Mac tree preview window in PAUP*, you can also set up a “dummy” printer. Open the Desktop Printer Utility. Under “Create Desktop …” select Translator and then click OK. After you do this you should be in business.


How do I increase the amount of memory available to PAUP*?

This is pretty much straight out of Mac’s online help: First, quit PAUP* if it is open. Click the program’s icon to select it. (Make sure to click the program icon itself, not an alias.) Open the File menu and choose Get Info. For Mac OS 8.1 and below, double-click the “Preferred size” box and type a new number. For Mac OS 8.5 and up, you’ll need to select memory under the “Show” pull-down menu to get to the “Preferred Size” box. The program can use this amount of memory if enough memory is available.


Can I download a Mac updater to a PC and transfer the updater to a Mac that is not online?

Yes, download the BinHexed updater for the appropriate Mac version. From your PC click the BinHex
link. Your browser will ask you if you want to save the file or run it. Select the save option.
You’ll get another dialog box allowing you to select a save location. Save the updater to a PC
formatted floppy disk. Mount the floppy on your Mac’s desktop. If your Mac doesn’t already have
one, you’ll need a utility to decompress the BinHexed file. After the file is decompressed, double
click the updater icon and you should be good to go.


How do I tell PAUP to automatically close the heuristic search status window at the end of the search?

set autoclose=yes;


How do I keep information from scrolling off the screen before I have read it?

If the PAUSE option of the SET command equals Silent, Beep, or Msg the output will stop after every screenful and wait for you to press the return key.

set pause=No|Silent|Beep|msg


How do I recall a PAUP* command?

We strongly recommend using the public domain command-line editor CED, which provides command-line editing and recall capabilities within PAUP*.


How do I print trees using the Windows Interface?

To print an ASCII tree, direct the general output to a file using the log command,
issue the showtrees command, stop the log, and print the log file using your favorite text editor.

log file=tree.log;
showtrees 1;
log stop;

Note: The Windows interface of PAUP* 4.0 does not print graphical trees. We plan to make graphical
printing a part of the Windows package, but this feature will not be available in 4.0.
The program TreeView, written
by Rod Page, is an excellent program for creating and manipulating graphical trees from NEXUS files.
To output NEXUS trees from any version of PAUP*, use the savetrees command.

savetrees file=mytree.trees;


How does PAUP* deal with missing characters under the parsimony criterion?

PAUP* deals with missing characters under the parsimony criterion by assigning to the taxon the character state that would be most parsimonious given its placement on the tree. Therefore, only the characters with no missing data will affect the placement of the taxon.


What options are available in PAUP* for dealing with multi-state taxa?

Under the “Set” or “Pset” commands you are given an option to change the way in which PAUP* deals with multi-state taxa. When the data set below is analyzed under the parsimony criterion, changing the designation of multi-state taxa to uncertain (default), variable, and polymorphic gives three different scores: 5, 6, and 7, respectively. For “Pset mstaxa=uncertain” PAUP* picks the variable state that minimizes the tree length; for “Pset mstaxa=polymorphic” PAUP* assumes that taxa with variable characters are heterogeneous terminal groups; and for “Pset mstaxa=variable” PAUP* treats the characters inside the curly braces as uncertain and those inside the parentheses as polymorphic.

NOTE: For display reasons, the curly braces are replaced by square brackets. To get the results described above, replace the square brackets with curly braces.

begin data;
dimensions ntax=4 nchar=4;
format symbols="012";
matrix
t1 11 00
t2 1[12] 10
t3 02 1(01)
t4 00 11
;
end;


How do I define multistate characters as ordered in PAUP?

There are several ways to assign character types to specific characters in the data matrix. One way is to define a typeset in an assumptions block and then use the assume command to set the character type. For example:

begin assumptions;
typeset myTypesetName = ord: 1 4 5;
end;
begin paup;
assume typeset = myTypesetName;
end;

You can skip the assume command and set the character type from within the assumptions block if you precede the typeset name with an asterisk (“*”). For example:

begin assumptions;
typeset *myTypesetName = ord: 1 4 5;
end;

Yet another way to set character types is by using the ctype command from within a paup block or at the command line. For example the following command has the same effect as those given above:

ctype ord:1 4 5;


If a patristic distance is the sum of branch lengths on a path between a pair of taxa, why do the summed branch lengths between a pair of taxa not add up to the patristic distance reported under the “describetrees” command?

The most likely reason for this is that you have unordered multistate characters in your data matrix.
PAUP* does not include unordered multistate characters in the patristic distance calculation, because
reconstruction of these characters can be ambiguous. To calculate branch lengths, and by extension
the entire tree length, PAUP* will arbitrarily accept one of the possible ancestral state assignments.
Therefore, the sum of the branch lengths is greater than the patristic distance because the branch
length calculations included the multistate characters.

If you don’t care about what ancestral states PAUP has used there is a way to get a patristic distance
for all of the characters in your data set. First, save the tree in matrix representation
including the branch lengths as a weight set.

matrixrep brlens=yes file=mytreefile.nex;

Next, open the matrix tree file and apply the weight set to all of the characters.

execute mytreefile.nex;
assume wtset=brlens;

Finally, rebuild the tree and generate the patristic distance matrix:

describetrees 1/ patristic=yes;

The patristic distances will now equal the summed branch lengths.


I did a search under the parsimony criterion and got two trees that look just alike. Why does PAUP consider them to be different?

The answer involves how PAUP collapses zero-length branches. The default collapsing rule is that a branch is retained if it is supported under at least one most-parsimonious reconstruction (MPR) of the ancestral states, for at least one character.

Here is a simple data matrix that will generate this result.

taxa 1 23 45
A 0 00 00
B 0 11 11
C 0 11 11
D 1 00 11
E 1 00 00
F 1 00 11

Analysis of this matrix using PAUP gives two most-parsimonious trees:

:         A  B   C D   F  E
:          \  \ /   \ /  /
:           \  *     *  /
:            \  \   /  /
:             \  \ /  /
: tree1        \  *  /
:               \ | /
:                \|/
:                 *

:         A  B   C   D   F  E
:          \  \ /   /   /  /
:           \  *   /   /  /
:            \  \ /   /  /
:             \  *   /  /
:              \  \ /  /
: tree2         \  *  /
:                \ | /
:                 \|/
:                  *

An MPR of character 1 on tree1 requires two steps, and there are two such MPRs:

:   A  B   C D   F  E       A  B   C D   F  E
:   0  0   0 1   1  1       0  0   0 1   1  1
:    \  \ /   \ /  /         \  \ /   \ /  /
:     \  0     1  /           \  0     1  /
:      \  \   /  /             \  \   /  /
:       \  \ /  /               \  \ /  /
:        \  0  /                 \  1  /
:         \ | /                   \ | /
:          \|/                     \|/
:           0                       1

Because one of these two MPRs assigns a change leading to the group DF, PAUP does not collapse the branch connecting DF to the remainder of the tree.

On the other hand, tree2 has only a single MPR for character 1:

:   A  B   C   D   F  E
:   0  0   0   1   1  1
:    \  \ /   /   /  /
:     \  0   /   /  /
:      \  \ /   /  /
:       \  1   /  /
:        \  \ /  /
:         \  1  /
:          \ | /
:           \|/
:            1

This character does not provide support for the BCD group, and since there are no other characters that support it, the branch leading to BCD is collapsed, yielding the tree:

:   A  B   C   D       F  E
:   0  0   0   1       1  1
:    \  \ /    |      /  /
:     \  0     |     /  /
:      \  \    |    /  /
:       \  \   |   /  /
:        \  \  |  /  /
:         \  \ | /  /
:          \  \|/  /
:           \  1  /
:            \ | /
:             \|/
:              1

PAUP considers both of these trees to be distinct, recognizing that there is a tree for which the group DF receives support (albeit ambiguous support) and another tree for which DF receives no support.
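
The MPR counts quoted above can be checked by brute force. The following Python sketch is not part of PAUP*; it simply enumerates every 0/1 assignment to the internal nodes of tree1 and tree2 for character 1, keeps the assignments with the fewest steps, and reports which internal branches carry a change in at least one of them. The node names (n_bc, n_df, and so on) and the function names are made up for this illustration.

```python
from itertools import product

# Tip states for character 1 of the example matrix.
TIPS = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1, "F": 1}

# Each tree maps an internal node (hypothetical names) to its children.
TREE1 = {"root": ["A", "n_bcdf", "E"], "n_bcdf": ["n_bc", "n_df"],
         "n_bc": ["B", "C"], "n_df": ["D", "F"]}
TREE2 = {"root": ["A", "n_bcdf", "E"], "n_bcdf": ["n_bcd", "F"],
         "n_bcd": ["n_bc", "D"], "n_bc": ["B", "C"]}


def most_parsimonious_reconstructions(tree, tips):
    """Return (minimum steps, list of internal-state assignments achieving it)."""
    internal = sorted(tree)
    best_len, best = None, []
    for states in product((0, 1), repeat=len(internal)):
        assign = dict(tips)
        assign.update(zip(internal, states))
        # One step for every branch whose two ends differ in state.
        steps = sum(assign[parent] != assign[child]
                    for parent, children in tree.items() for child in children)
        if best_len is None or steps < best_len:
            best_len, best = steps, [dict(zip(internal, states))]
        elif steps == best_len:
            best.append(dict(zip(internal, states)))
    return best_len, best


def supported_branches(tree, mprs):
    """Internal branches (parent, child) that carry a change in at least one MPR."""
    return {(p, c) for m in mprs for p, kids in tree.items()
            for c in kids if c in tree and m[p] != m[c]}


for name, tree in (("tree1", TREE1), ("tree2", TREE2)):
    length, mprs = most_parsimonious_reconstructions(tree, TIPS)
    print(name, length, len(mprs), sorted(supported_branches(tree, mprs)))
```

Exhaustive enumeration is only practical for a tree this small; it is used here because it makes the "at least one MPR" collapsing rule easy to see, not because PAUP* works this way.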


How do I perform a Kishino-Hasegawa test to see if the support for the first and second trees stored in memory is significantly different?

set criterion=parsimony;
pscores 1-2 / khtest;


How do I perform a partition homogeneity (congruence) test?

The following example uses the partition definition
named “foo”, specifies 1000 randomizations using the random number seed
1234567, and uses a branch and bound search to obtain the sum of tree lengths
for each partition.

set criterion=parsimony;
hompart partition=foo nreps=1000 seed=1234567 search=bandb;


How do I downweight third position transitions only in a parsimony analysis?

First you need to identify the codon positions. Probably the most efficient way to
do this is to set up a codons block where the reading frame for the coding genes
is identified. Then you need to define the weighting for transitions and transversions by
creating a step matrix within an assumptions block. Finally, use the ctype
command within a paup block to apply the stepmatrix to 3rd position sites only.

begin assumptions;
charset coding = 2-457 660-896;
charset noncoding = 1 458-659 897-898;
charset 1stpos = 2-457\3 660-896\3;
charset 2ndpos = 3-457\3 661-896\3;
charset 3rdpos = 4-457\3 662-.\3;
usertype 5_1 stepmatrix = 4 acgt
- 5 1 5
5 - 5 1
1 5 - 5
5 1 5 -
;
end;
begin paup;
ctype 5_1:3rdpos;
end;


How do I weight specific character positions in my alignment?

You can give different weights to different character positions by using the “weights” command.
There are several ways to identify the characters to be weighted. One efficient way to
identify characters is to include them in a character set, which must be defined within an
assumptions block. For example:

begin assumptions;
charset coding = 2-457 660-896;
charset noncoding = 1 458-659 897-898;
end;

Next, you can issue the “weights” command at the command line or within a paup block. In the example
below, the first “weights” command assigns a weight of three to all characters defined as coding. The
second “weights” command does the same thing except that the characters are identified directly.

begin paup;
weights 3:coding;
end;

weights 3:2-457, 3:660-896;


Do stepmatrices for character state transformations have to be symmetric?

User-defined stepmatrices do not need to be symmetric. The only requirement imposed on a stepmatrix is that it may not violate the triangle inequality.


Why does PAUP* tell me that my stepmatrix violates the triangle inequality?

The triangle inequality requires that no edge of a triangle be longer than the sum of the other two edges. In terms of stepmatrices, this means that the cost of changing directly from one state to another may not exceed the cost of making the same change through a third state. For example:

d(a,c) <= d(a,t) + d(t,c)

The stepmatrix given below satisfies this rule. For the change from a to c routed through t:

3 <= 3 + 1

Stepmatrix “asym” (asymmetric):

        TO:  a  c  g  t
FROM: a      -  3  1  3
      c      2  -  3  1
      g      1  4  -  3
      t      3  1  3  -

The following matrix, by contrast, is inconsistent with the triangle inequality, because the direct a to c cost (5) exceeds the cost of going from a to t to c (3 + 1 = 4):

Stepmatrix “asymNT” (asymmetric triangle violation):

        TO:  a  c  g  t
FROM: a      -  5  1  3
      c      2  -  3  1
      g      1  4  -  3
      t      3  1  3  -

PAUP* would adjust the a to c transformation from 5 to 4.


Why does PAUP* warn me that the stepmatrix supplied in Xu and Miranker (2004, “A metric model of amino acid substitution”, Bioinformatics 20:1214-1221) is “internally inconsistent”?

Symmetric stepmatrices in PAUP* are required to satisfy the triangle inequality. If they fail to do so, a warning is issued and the costs in the matrix are adjusted until the triangle inequality is satisfied for all possible triplets of states. Unfortunately, the matrix given in the paper by Xu and Miranker contained a minor error. A corrected matrix is available at the following location: http://www.cs.utexas.edu/users/mobios/Publications/mPAMErrata.pdf.


What do the indices under the “pscores” command mean?

PAUP outputs several indices that measure the “fit” of characters to particular trees.
The indices can be defined in terms of the following three parameters:

  1. s= length (number of steps) required by the characters on the tree being evaluated
  2. m= minimum amount of change that the character may show on any conceivable tree
  3. g= maximum possible amount of change that a character could possibly require on any
    conceivable tree (i.e., the length of the character on a completely unresolved bush).

You can calculate a value for each character using the following formulae:

ci= m/s

ri= (g-s)/(g-m)

rc= ri*ci

hi= 1-ci

To get the overall value for a suite of characters, simply calculate the sums
of s, m, and g over all the characters in the suite and use the summed values in the
equations described above.
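
As a worked illustration of these formulae, here is a small Python sketch. The function names and the example values of s, m, and g are invented for the illustration and are not PAUP* output; the retention index is returned as None when g equals m, since the formula is undefined in that case.

```python
def fit_indices(s, m, g):
    """Return ci, ri, rc and hi for one character (or for summed s, m, g)."""
    ci = m / s
    # ri is undefined when g == m (the character cannot vary in length).
    ri = (g - s) / (g - m) if g != m else None
    rc = ri * ci if ri is not None else None
    return {"ci": ci, "ri": ri, "rc": rc, "hi": 1 - ci}


def ensemble_indices(characters):
    """Sum s, m and g over (s, m, g) tuples, then apply the same formulae."""
    s = sum(c[0] for c in characters)
    m = sum(c[1] for c in characters)
    g = sum(c[2] for c in characters)
    return fit_indices(s, m, g)


# Hypothetical character: 2 steps on the tree, minimum 1, maximum 3.
print(fit_indices(2, 1, 3))
# Hypothetical suite of two characters.
print(ensemble_indices([(2, 1, 3), (1, 1, 2)]))
```

Note that the ensemble values are computed from the summed s, m, and g, not by averaging the per-character indices.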


How does PAUP* deal with missing characters under the likelihood criterion?

The likelihood is computed by summing the likelihoods over each possible assignment of A, C, G, or T to the taxon with the missing datum. Generally, if all of the nearby taxa have the same state, this sum will be dominated by the term with this same state assigned to the “missing” value, but each of the other states will contribute some small, nonzero, value to the likelihood.
On the other hand, if there is considerable ambiguity in the sense that the surrounding taxa have different states, or the branch leading to a missing-data taxon is very long, each of the possible assignments makes a larger contribution to the total likelihood.
It’s all in the same spirit as likelihood in the absence of missing data–there are lots of ways that the pattern of nucleotides at the tips of the tree could have been generated, and all of them contribute something to the total likelihood (generally some much more than others).
With missing data, there are several states that a taxon might have taken if an insertion/deletion event had not happened (or an ambiguity in the sequencing hadn’t occurred) and likelihood considers the probability of each of those alternatives.
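
The summation described above can be illustrated with a toy calculation in Python. This is a sketch under simplifying assumptions and not PAUP*'s implementation: it uses the JC69 model, a star tree in which every tip attaches directly to the root, and made-up branch lengths. The function names are hypothetical.

```python
import math

BASES = "ACGT"


def jc69_prob(x, y, t):
    """JC69 probability of observing y at the end of a branch of length t starting from x."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e


def star_tree_site_likelihood(tips, branch_lengths):
    """Site likelihood on a star tree; a tip coded '?' is summed over A, C, G and T."""
    total = 0.0
    for root in BASES:
        term = 0.25  # equilibrium frequency of the root state under JC69
        for state, t in zip(tips, branch_lengths):
            if state == "?":
                term *= sum(jc69_prob(root, y, t) for y in BASES)
            else:
                term *= jc69_prob(root, state, t)
        total += term
    return total


def missing_tip_contributions(known, known_lengths, t_missing):
    """Likelihood contribution of each possible assignment to one missing tip."""
    return {y: star_tree_site_likelihood(known + [y], known_lengths + [t_missing])
            for y in BASES}


known, lengths = ["A", "A"], [0.1, 0.1]
print(missing_tip_contributions(known, lengths, 0.1))   # short branch to the missing tip
print(missing_tip_contributions(known, lengths, 10.0))  # very long branch to the missing tip
```

With a short branch, the assignment that matches the neighbouring taxa dominates the sum; with a very long branch, the four assignments contribute almost equally, which is the behaviour described in the answer above.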


How do I tell PAUP* I want to use the JC69 model (Jukes & Cantor, 1969)?

set criterion=likelihood;
lset nst=1 basefreq=equal;


How do I tell PAUP* I want to use the K2P model (Kimura, 1980)?

set criterion=likelihood;
lset nst=2 basefreq=equal;


How do I tell PAUP* I want to use the F81 model (Felsenstein, 1981)?

set criterion=likelihood;
lset nst=1 basefreq=empirical;


How do I tell PAUP* I want to use the F84 model (i.e., the model used in DNAML)?

set criterion=likelihood;
lset nst=2 basefreq=empirical variant=f84;


How do I tell PAUP* I want to use the HKY model (Hasegawa, Kishino, & Yano, 1985)?

set criterion=likelihood;
lset nst=2 basefreq=empirical variant=hky;


How do I tell PAUP* I want to use the GTR model (i.e., the general time reversible model)?

set criterion=likelihood;
lset nst=6 basefreq=empirical;


How do I obtain likelihoods for all trees in memory?

lscores all;

Notes: you must first instruct PAUP* to use the likelihood criterion and you
may also wish to change the current substitution model before issuing the
above command.


How do I obtain likelihoods corresponding to each individual nucleotide site in my data using the first tree in memory?

lscores 1 / sitelikes;

Notes: you must first instruct PAUP* to use the likelihood criterion and you
may also wish to change the current substitution model before issuing the
above command.


How do I force PAUP* to use the branch lengths I specify when computing site likelihoods?

Assuming that you have a tree file (for example, “foo.tre”) in which descriptions
of trees contain branch length information, you could read in the trees from
this file and preserve the branch length information as follows:

gettrees file=foo.tre storebrlens;
lscores 1 / sitelikes userbrlens;

Notes: you must first instruct PAUP* to use the likelihood criterion and you
may also wish to change the current substitution model before issuing the
above commands.

An example of a tree file containing one unrooted tree with branch length
information is shown below. In this example, all branches in the four-taxon
unrooted tree have length 0.1 except for the central branch, which has length 0.2.

begin trees;
utree best = (taxonA:0.1,taxonB:0.1,(taxonC:0.1,taxonD:0.1):0.2);
end;


How do I perform a Kishino-Hasegawa test to see if the support for the first and second trees stored in memory is significantly different?

lscores 1-2 / khtest;

Notes: you must first instruct PAUP* to use the likelihood criterion and you
may also wish to change the current substitution model before issuing the
above command.


What is the difference between the transition/transversion ratio and the transition/transversion rate ratio?

The transition/transversion rate ratio is simply the instantaneous
rate of transitions divided by the instantaneous rate of transversions.
I will refer to this quantity as k. If k
is 1.0, this means that transitions are occurring at the same rate as transversions.
The transition/transversion ratio, however, is the probability of
any transition (over a single unit of time) divided by the probability
of any transversion (over a single unit of time). To find the probability
of any transition during a single unit of time, one must consider each
of the ways a transition can occur (i.e., A to G, G to A, C to T, and T
to C) and add together the probabilities of each (note that this will be
a sum of four terms). Likewise, finding the probability of any transversion
during a single unit of time involves a sum of eight terms (i.e., A to
C, A to T, G to C, G to T, C to A, C to G, T to A, and T to G). The probability
of the specific transition A to G can be determined as follows: it is the
probability that one begins in state A and changes from state A
to state G in a single unit of time. Using the Felsenstein 1981 substitution
model, the probability of the second part of the above statement, namely
the probability of changing from state A to state G, can be written as
π_G β, where π_G is the equilibrium frequency of G and β is the rate
parameter of the model. The
first part of the statement, namely the probability of starting with state
A, is simply the equilibrium nucleotide frequency of A, or π_A.
The transition/transversion ratio, then, involves the equilibrium base
frequencies, whereas the transition/transversion rate ratio does not.
Still another definition of transition/transversion ratio exists. That
definition is that this ratio is the observed number of transitions between
two sequences divided by the observed number of transversions between two
sequences. This definition is problematic because the magnitude of this
measure depends on the amount of time separating the two sequences being
considered. It is thus difficult to compare meaningfully transition/transversion
ratios obtained in this way across different pairs of sequences, since
these will generally be separated by different amounts of time. Also, one
should be aware that the symbol k has been used
in other contexts; for example, k as used in
the model implemented in the program DNAML is not comparable to k
as described here.
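
The relationship between the two quantities can be made concrete with a short Python sketch. It is an illustration under stated assumptions, not PAUP* code: it assumes an HKY-style model in which the rate of change to a base is proportional to that base's equilibrium frequency, multiplied by k for transitions, and it uses these rates in place of probabilities over a very short unit of time (the common rate factor cancels in the ratio). The function names are hypothetical.

```python
BASES = "ACGT"
PURINES = {"A", "G"}


def is_transition(x, y):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return (x in PURINES) == (y in PURINES)


def expected_tratio(k, freqs):
    """Transition/transversion ratio implied by rate ratio k and base frequencies."""
    ts = tv = 0.0
    for x in BASES:
        for y in BASES:
            if x == y:
                continue
            # Probability of starting in x, times the relative rate of changing to y.
            if is_transition(x, y):
                ts += freqs[x] * freqs[y] * k   # four terms in total
            else:
                tv += freqs[x] * freqs[y]       # eight terms in total
    return ts / tv


equal = {b: 0.25 for b in BASES}
skewed = {"A": 0.4, "C": 0.1, "G": 0.4, "T": 0.1}
for k in (1.0, 2.0):
    print(k, expected_tratio(k, equal), expected_tratio(k, skewed))
```

With equal base frequencies, a rate ratio of 1.0 gives a transition/transversion ratio of 0.5 because there are twice as many kinds of transversions as transitions; unequal base frequencies change the ratio even though k stays the same.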


How do I tell PAUP* to estimate the transition/transversion ratio when using the HKY substitution model?

set criterion=likelihood;
lset nst=2 basefreq=empirical variant=hky;
lset tratio=estimate;


How do I take account of rate heterogeneity across sites using a discrete gamma distribution, four rate categories, and a shape value of 0.2?

set criterion=likelihood;
lset rates=gamma ncat=4 shape=0.2;


How do I estimate the shape parameter when I am using a four-category discrete gamma distribution to account for heterogeneity in rates across sites?

set criterion=likelihood;
lset rates=gamma ncat=4 shape=estimate;


How do I tell PAUP* to estimate the proportion of invariant sites?

set criterion=likelihood;
lset pinvar=estimate;


How do I tell PAUP* to assume there are no invariant sites?

set criterion=likelihood;
lset pinvar=0;


How do I tell PAUP* to estimate the proportion of invariant sites and estimate the shape parameter of a discrete, four-category gamma distribution applied to the sites that are not invariant?

set criterion=likelihood;
lset pinvar=estimate;
lset rates=gamma ncat=4 shape=estimate;


I think most of the rate heterogeneity in my sequences is the result of codon structure. How can I tell PAUP* to assume a different rate for each codon position (i.e., estimate site-specific rates)?

set criterion=likelihood;
charpartition codons = firstpos:1-.\3, secondpos:2-.\3, thirdpos:3-.\3;
lset rates=sitespec siterates=partition:codons;

At this point, any command that causes likelihoods to be computed will make use of the
charpartition named codons and a different rate will be estimated for each
codon position class of sites.


How do I tell paup to use site-specific rates that I have already estimated?

You can do this a couple of different ways. The first way is to estimate the rates on a given tree and then apply the estimated rates by using the previous option. In the following example, a character partition defines three genes and the site specific rates for each gene are estimated on a neighbor joining tree. Finally, a heuristic search is executed using the site-specific rates estimated on the neighbor joining tree.

set criterion=likelihood;
charpartition genes=g1:1-300, g2:301-600, g3:601-700;
nj;
lscore 1/rates=sitespec siterates=partition:genes;
lset rates=sitespec siterates=previous;
hsearch;

The second way to use previously estimated site-specific rates is to define them explicitly in a rate set. In the following example 1st, 2nd, and 3rd positions are assigned a rate of 2, 1, and 3, respectively. Character sets are used to define which characters represent the codon positions.

charset 1stpos = 2-457\3 660-896\3;
charset 2ndpos = 3-457\3 661-896\3;
charset 3rdpos = 4-457\3 662-.\3;
rateset codonrates = 2.0:1stpos, 1.0:2ndpos, 3.0:3rdpos;
lscore / rates=sitespec siterates = rateset:codonrates;


When I estimate the shape parameter of the gamma-distributed rates model and the proportion of invariable sites simultaneously, PAUP tells me that pinvar is zero even though the empirical number of invariable sites is about 30 percent. Why?

When you use gamma-distributed rates, invariable sites can sometimes be accommodated by the left tail of the gamma distribution (i.e., while these sites are technically not “invariable”, they are changing slowly enough that a fair number of constant sites are expected when the gamma shape parameter is small). The two parameters are highly correlated; often similar likelihood scores can be achieved with a small pinv and small gamma shape or a larger pinv with a correspondingly larger gamma shape. When the gamma shape parameter is larger, fewer low-rate sites are expected, and the pinv must increase to account for the presence of these low-rate sites. The following article deals with this issue in more depth:

Sullivan, J.; Swofford, D. L., and Naylor, G. J. P. The effect of taxon sampling on estimating rate heterogeneity parameters of maximum-likelihood models. Molecular Biology and Evolution. 1999; 16:1347-1356.


How do I get the relative probabilities for each ancestral base assignment?

There are basically two steps to getting the relative probabilities of each base assignment.
First you need to tell PAUP to display the values when characters are reconstructed and then
you’ll need to reconstruct the characters. The following block shows how this may be done.

begin paup;
set crit=like;
lset allprobs=yes;
describetrees 1/plot=no xout=internal;
end;


How do I import a pairwise distance matrix from another program into PAUP*?

The easiest way to do this is to include the custom distance in a NEXUS formatted distance block.
For example, below is a distance matrix for four sequences followed by a paup block that uses the
distances to build a neighbor joining tree.

[!user defined distances]
Begin distances;
Dimensions ntax=4;
format nodiagonal;
matrix
t1
t2 4
t3 3 4
t4 2 3 4 ;
end;
[! nj with user defined distances]
Begin paup;
dset distance=user;
nj;
end;

A more detailed description of the distance block is given in the command reference pdf document.


How does PAUP* distribute missing or ambiguous changes proportionally to unambiguous changes?

Take for example the following sequences:

t1 aaaaaccg
t2 tgca-gtt
t3 tgcaagtt

The p-distance, or dissimilarity, between sequences t1 and t3 is easy to calculate: 6 of the 8 comparisons do not match, so the p-distance between t1 and t3 is 3/4 or .75. If you chose to ignore missing sites, the comparison between sequences t1 and t2 would be equally straightforward: 6 of the 7 comparisons do not match, giving a p-distance of .85714. Deciding to distribute the missing comparisons to the unambiguous changes tells PAUP* to look at all the “a” pairs between sequences t1 and t2. For the example above these are:

1 a-t
1 a-g
1 a-c
1 a-a

Distributing the changes proportionally to each unambiguous change would give 1/4 to each “a” comparison. Therefore if we tallied the number of comparisons between sequence t1 and t2 we would get a matrix that looked like this:

.   a    c    g    t
a 1.25 1.25 1.25 1.25
c      0    1.00 1.00
g           0    1.00
t                0

To get the p-distance we add up the off diagonals to get 6.75 differences out of 8 comparisons or .84375.
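
The arithmetic above can be reproduced with the following Python sketch. It reflects one reading of the scheme described in this answer and is not PAUP*'s code: a site where one sequence is missing is shared out among the fully observed pairs that have the same base in the other sequence, in proportion to how often each pair occurs. The function name and the treatment of sites with no matching observed pair (they are dropped) are assumptions made for this illustration.

```python
from collections import Counter


def p_distance_distributed(seq1, seq2, missing="-?"):
    """p-distance with half-missing sites distributed over observed pairs."""
    observed = Counter()   # fully observed (base1, base2) pairs
    half_missing = []      # (index of the observed side, observed base)
    total = 0
    for x, y in zip(seq1, seq2):
        x_missing, y_missing = x in missing, y in missing
        if x_missing and y_missing:
            continue       # nothing to compare at this site
        total += 1
        if y_missing:
            half_missing.append((0, x))
        elif x_missing:
            half_missing.append((1, y))
        else:
            observed[(x, y)] += 1

    tallies = {pair: float(n) for pair, n in observed.items()}
    for side, base in half_missing:
        matching = {pair: n for pair, n in observed.items() if pair[side] == base}
        weight = sum(matching.values())
        if weight == 0:
            total -= 1     # assumption: no observed pair to share with, so drop the site
            continue
        for pair, n in matching.items():
            tallies[pair] += n / weight

    differences = sum(v for (x, y), v in tallies.items() if x != y)
    return differences / total


t1, t2, t3 = "aaaaaccg", "tgca-gtt", "tgcaagtt"
print(p_distance_distributed(t1, t3))
print(p_distance_distributed(t1, t2))
```

For t1 and t2 the single missing site adds 1/4 to each of the four "a" pairs, which gives the 6.75 differences out of 8 comparisons worked out above.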


How do I run a likelihood search on a UNIX machine under a general time reversible model with invariable sites and gamma-distributed rates (GTR+I+Gamma, i.e., some sites assumed to be invariable, with gamma-distributed rates at the variable sites), using a heuristic search with 10 random-addition-sequence replicates and TBR branch swapping?

begin paup;
set criterion=likelihood;
lset nst=6 basefreq=empirical;
lset pinvar=estimate;
lset rates=gamma ncat=4 shape=estimate;
hsearch nreps=10 addseq=random swap=tbr;
end;

Notes: This analysis would be expected to take a very long time if more
than four or five taxa are included in the analysis. Simply using the GTR
model is going to cost a lot in terms of computation time, since there
are many more rate parameters that need estimating in GTR compared with
HKY or even simpler models. The amount of time could be reduced considerably
by not estimating both the gamma shape parameter and the pinvar parameter.
Instead of pinvar=estimate, for example, use pinvar=0.1,
and instead of shape=estimate, use shape=0.25. These
values need not come out of thin air, however. One could supply a pretty
good tree, estimate these parameters using that tree, and then set the
pinvar and shape parameters to those estimates for purposes of conducting
a search. Once the search is finished, these parameters could be estimated
again to see if they change much. If so, it might be worth redoing the
search using the new, better estimates.


I have a sequence data set for which I would like to infer the phylogeny. What is a sequence of analyses that I can perform that will cover most potential pitfalls I am likely to encounter?

begin paup;
log file=log.txt start;
set criterion=parsimony;
hsearch nreps=10 addseq=random swap=tbr;
savetrees file=mp.tre brlens;
set criterion=distance;
dset distance=logdet objective=me;
hsearch nreps=10 addseq=random swap=tbr;
savetrees file=me.tre brlens;
set criterion=likelihood;
lset nst=2 basefreq=empirical rates=gamma ncat=4;
lset tratio=estimate shape=estimate;
lscore 1;
lset tratio=previous shape=previous;
hsearch nreps=1 swap=tbr start=1;
savetrees file=ml.tre brlens;
log stop;
end;

Notes: This PAUP block infers phylogeny using three different optimality
criteria and stores all the output in a log file named log.txt. The first
analysis uses the criterion of maximum parsimony to obtain a tree (or set
of trees), which are then saved to a tree file named mp.tre. The second
analysis uses the minimum evolution criterion in conjunction with
LogDet/paralinear pairwise distances and saves the resulting tree(s) in a
tree file named me.tre. The third analysis makes use of the maximum
likelihood criterion in conjunction with the HKY-gamma substitution model.
Estimates of the tratio (the transition/transversion ratio) parameter and
the gamma shape parameter are obtained using the LogDet tree already in
memory. Then, these two parameters are fixed at these estimated values for
the duration of the heuristic search. The tree(s) resulting from the
hsearch command are saved in the tree file ml.tre.
Each phylogeny method has its Achilles Heel. Maximum parsimony can be
misled if there is too much heterogeneity in substitution rates among
lineages (the classic “long edges attract” problem) in the underlying true
phylogeny. Minimum evolution using LogDet distances can be misled if there
is too much site-to-site rate heterogeneity, or if some of the pairwise
distances are undefined (use the “showdist” command to check). Maximum
likelihood under the HKY-gamma model can be misled if parameters that are
assumed to be constant across the phylogeny (such as the tratio or base
frequencies) actually vary among lineages in the true phylogeny. Because
of these inherent weaknesses in individual methods, it is a good idea to
try several methods that have strengths in different areas. If you get
the same tree under all methods, then you are in good shape because
apparently there are no major pitfalls in your data. Of course, there may
be a major unknown pitfall affecting all methods, but there is not much you
can do about that. You may get trees that are not identical, but are also
not significantly different (in terms of data support) from one another.
The Kishino-Hasegawa test can be used to see whether one tree is supported
significantly less by the data than a second tree. The last possibility is
that you get truly different trees from the different methods. In this
case, it is in your best interest to examine these trees carefully for
evidence that a particular method has fallen victim to its particular
Achilles Heel. For example, if your log.txt file shows that there is strong
rate heterogeneity in your data (let’s say the shape parameter is estimated
to be 0.01), then the LogDet and parsimony trees fall under a certain
degree of suspicion compared to the likelihood tree, which should be
relatively immune to this pitfall since the model used allows for rate
heterogeneity. If the parsimony tree differs from the LogDet and
likelihood tree, look for evidence of long branch (edge) attraction in the
parsimony tree. If the LogDet tree differs from the parsimony and
likelihood trees, see if the base frequencies vary considerably between tip
taxa (a useful tool for this purpose is the basefreq command). In other
words, use PAUP* as a tool for discovering what evolutionary factors are at
work in your particular set of sequences, and use this knowledge to make an
intelligent choice between the alternatives presented to you by different
phylogeny methods.