This is the manual for the program suite EOM 2.0 (Ensemble Optimisation Method), which seeks to describe experimental SAXS data using an ensemble representation of atomic models.
The following sections briefly describe the program EOM 2.0 and detail the steps required to run the program. File input and output are explained.
If you use results from EOM in your own publication, please cite:
EOM 2.0 is a program that fits an averaged theoretical scattering intensity derived from an ensemble of conformations to experimental SAXS data.
A pool of n independent models based upon sequence and structural information is first generated.
For multi-domain proteins where high-resolution structures of individual subunits/domains are available, these structures, and distance/orientation information derived from them, can be used as rigid bodies and/or constraints during EOM model generation.
For proteins expected to be intrinsically unfolded no rigid bodies are required as input, and completely random configurations of the alpha-carbon trace are created based upon the sequence alone.
Crystallographic symmetry and inter-domain/subunit contacts can be imposed in the following ways:
(1) Using an oligomeric assembly of high-resolution structures (e.g. downloaded from the PDB or using the PISA server) to define the required interface as a single rigid body (e.g. a homodimer with P2 symmetry). N.B. The option to FIX the coordinates should be used in this case.
(2) Specifying a potential oligomerization interface between the rigid bodies using distance constraints.
Once pool generation is complete, a genetic algorithm is run to select an ensemble. The genetic algorithm compares the averaged theoretical scattering intensity from n independent ensembles of conformations against the experimental scattering data.
The ensemble that best describes the experimental SAXS data is selected.
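The comparison made during selection can be illustrated with a small Python sketch. It averages the theoretical intensities of the models in one candidate ensemble and scores that average against an experimental curve. The function names, the least-squares scale factor and the reduced chi-squared form are assumptions made for illustration; this manual does not give the exact discrepancy formula used by EOM.

```python
from typing import List, Sequence


def average_intensity(curves: Sequence[Sequence[float]]) -> List[float]:
    """Point-by-point mean of the theoretical curves in one ensemble."""
    n_models = len(curves)
    return [sum(column) / n_models for column in zip(*curves)]


def chi_squared(i_exp: Sequence[float], sigma: Sequence[float],
                i_calc: Sequence[float]) -> float:
    """Reduced chi-squared after least-squares scaling of i_calc (assumed form)."""
    weights = [1.0 / (s * s) for s in sigma]
    # Scale factor that minimises the weighted squared residuals.
    num = sum(w * e * c for w, e, c in zip(weights, i_exp, i_calc))
    den = sum(w * c * c for w, c in zip(weights, i_calc))
    scale = num / den
    total = sum(w * (e - scale * c) ** 2
                for w, e, c in zip(weights, i_exp, i_calc))
    return total / (len(i_exp) - 1)


# Two hypothetical model curves sampled at the same three s values.
ensemble = [[10.0, 5.0, 1.0], [8.0, 6.0, 2.0]]
mean_curve = average_intensity(ensemble)
```

A selection procedure would evaluate `chi_squared` for many candidate ensembles and keep those with the lowest values.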
Supported symmetries are: p1, p2, …, p19 (nineteen-fold), p22, p32, p42, p52, p62, …, p122, p222.
The n-fold axis is typically Z; if there is in addition a two-fold axis, it coincides with Y. N.B. If a symmetric core is expected, the multichain model defining this core must also be FIXED (i.e. enter "Y" to the question: Fix the subunit in original position/orientation?)
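For scripted set-ups it can help to check a symmetry name before starting a run. The sketch below rebuilds the list of supported symmetries from the description above and returns the 1-based position of a name in that list, which matches the numbering shown in the interactive dialog. The function names are hypothetical and are not part of EOM.

```python
def supported_symmetries():
    """Symmetry names in the order p1..p19, p22..p122, p222."""
    cyclic = [f"p{n}" for n in range(1, 20)]      # p1 ... p19
    dihedral = [f"p{n}2" for n in range(2, 13)]   # p22, p32 ... p122
    return cyclic + dihedral + ["p222"]


def menu_index(symmetry: str) -> int:
    """Return the 1-based position of a symmetry name, or raise ValueError."""
    names = supported_symmetries()
    key = symmetry.strip().lower()
    if key not in names:
        raise ValueError(f"unsupported symmetry: {symmetry!r}")
    return names.index(key) + 1
```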
Overall symmetry of the particles in the pool. [S]ymmetry generates only symmetric multichain particles. [A]symmetry generates multichain particles with a symmetric core structure but leaves the remaining structure asymmetric. [M]ix generates a pool of both symmetric and asymmetric multichain particles. NOTE: If P1 Core Symmetry is selected, this option will not appear.
Percentage of symmetric structures that will be in the pool. This option requires Mix to be selected for the Overall Symmetry. If Mix is not selected for Overall Symmetry, this option will not appear.
CA dihedral angle distribution to use for modelling the missing regions. Random uses a CA distribution consistent with chemically denatured proteins, while native uses a CA distribution consistent with disordered proteins. On average, random models will be more extended than native-like models. The compact option uses a CA distribution consistent with disordered proteins, but also forces the reconstructed linkers to be more compact.
Filename containing the amino acid sequence. Standard amino acids should be UPPERCASE for plain text (*.seq) and FASTA (*.fasta) file input. For the non-standard residues PTR, HSD, SEP, TPO, MSE the lowercase letters p, h, s, t, m should be used, respectively. N.B.: the filename must include an extension (e.g., seq.txt, seq.seq, seq.fasta)
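A sequence can be checked against these rules before it is passed to EOM. The following Python sketch is an illustration only: it assumes the standard residues are the 20 common one-letter codes (the manual does not list them) and that FASTA header lines start with ">". The helper names are hypothetical.

```python
STANDARD = set("ACDEFGHIKLMNPQRSTVWY")  # assumed: the 20 common one-letter codes
NON_STANDARD = {"p": "PTR", "h": "HSD", "s": "SEP", "t": "TPO", "m": "MSE"}


def read_residues(text: str) -> str:
    """Join sequence lines, skipping FASTA header lines that start with '>'."""
    lines = [ln for ln in text.splitlines() if not ln.startswith(">")]
    return "".join("".join(ln.split()) for ln in lines)


def invalid_residues(residues: str):
    """Return (1-based position, character) for every character not accepted."""
    return [(pos, ch) for pos, ch in enumerate(residues, start=1)
            if ch not in STANDARD and ch not in NON_STANDARD]
```

An empty list from `invalid_residues` means every character is either an uppercase standard code or one of the five lowercase non-standard codes.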
Filename of the domain/subunit/pdb file that you wish to define as a rigid body. N.B.: The filename must contain the relevant extension (e.g. domainA.pdb).
Type [Y]es if the input PDB file contains a multichain interface (symmetric or asymmetric) that will be used as the structural core of each model. Type [N]o if the PDB file defines a monomeric subunit/domain. N.B.: The number of chains in the multichain PDB file must also match the symmetry definition from the first question (i.e. for P2 symmetry a two-chain PDB is required as input, the final models will be dimers). Note also that a multichain input PDB will count as a single rigid body for the Number of Domains question.
A contact region or interface between subunits/domains can be defined using a contacts file (see below), which specifies regions in close proximity and is used to generate symmetry mates (requires that a non-P1 symmetry has been selected). If no contacts file or multichain input is provided, a contact interface is generated by default and covers the whole range of the particle.
If [Y]es, each model in the pool is saved to disk (not required for regular use). If [N]o, only the PDB files of the selected ensemble are saved (default).
Maximum order of harmonics (min = 10, max = 50). Defines the resolution of the calculated curve. The default value should be sufficient in most cases. For large particles, higher orders could improve the results, but more CPU time is required. Fractional values are not allowed.
Number of points in theoretical curve (min = 10, max = 201). Default value should be sufficient for most cases. A larger number of points is recommended for very large particles (e.g. MDa size) at a higher cost of CPU time. Fractional values are not allowed.
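The limits for the number of harmonics and the number of points can be enforced in a wrapper script before EOM is called. This is a minimal sketch with hypothetical names; only the numeric ranges come from the descriptions above.

```python
LIMITS = {"harmonics": (10, 50), "points": (10, 201)}


def check_setting(name: str, value) -> int:
    """Return value if it is a whole number inside the documented range."""
    low, high = LIMITS[name]
    # Fractional values are not allowed, so reject anything that is not an int.
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError(f"{name} must be a whole number, got {value!r}")
    if not low <= value <= high:
        raise ValueError(f"{name} must be between {low} and {high}, got {value}")
    return value
```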
Number of experimental curves you wish to use for ensemble selection. An independent run will be conducted on each experimental data set. The results will be written to a separate sub-directory (e.g. curve_1, curve_2 ...).
To define an interface for a subunit/domain interaction, and if a multichain PDB file is not available where such an interface is defined, an optional contact conditions file can be used. The format of the contacts file is as follows:
dist 8.0
1 25 28
This condition defines a distance of 8 Å between a symmetry related range of residues (i.e. an interface formed by residues 25-28 of the current body - in this case domain/subunit 1 - and residues 25-28 of a generated symmetry mate). The integer 1 means the first domain/subunit specified in the EOM dialog. N.B. non P1 symmetry must be defined.
dist 8.0
1 25 28
1 40 47
This condition defines a distance of 8 Å between two alternative symmetry related residue ranges. In this case, either the interface defined by residues 25 to 28 or the interface defined by residues 40 to 47 and their respective symmetry mates will be used (randomly selected during pool generation). The actual residue within the range to use as the contact point is random.
dist 8.0
1 25 28
2 88 94
This condition defines two independent sets of distances of 8 Å, the first between domain/subunit 1 and its corresponding symmetry mate (contacts involving residues 25 to 28), and the second set between domain/subunit 2 and its corresponding symmetry mate (contacts involving residues 88 to 94). N.B. This does not define a set of contacts between domain/subunit 1 and domain/subunit 2. Non P1 symmetry must be defined.
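The three examples above share one layout: a "dist" line followed by lines of the form "domain first-residue last-residue". A small Python reader for that layout might look as follows. It assumes a single "dist" line per file and groups the residue ranges by domain number, so that several ranges under one domain can be treated as alternatives; the function name and return structure are hypothetical.

```python
def parse_contacts(text: str):
    """Return (distance, {domain: [(first, last), ...]}) from contacts text."""
    distance = None
    ranges = {}
    for raw in text.splitlines():
        fields = raw.split()
        if not fields:
            continue  # skip blank lines
        if fields[0].lower() == "dist":
            distance = float(fields[1])
        else:
            domain, first, last = (int(f) for f in fields[:3])
            ranges.setdefault(domain, []).append((first, last))
    if distance is None:
        raise ValueError("no 'dist' line found")
    return distance, ranges


alternatives = parse_contacts("dist 8.0\n1 25 28\n1 40 47\n")
independent = parse_contacts("dist 8.0\n1 25 28\n2 88 94\n")
```

Here `alternatives` holds two ranges for domain 1, while `independent` holds one range each for domains 1 and 2.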
Sequence: e.g. sequence.seq - amino acid sequence;
Domains/Subunits: e.g. dom_n.pdb - atomic coordinates of the folded domain(s), if present;
Data: e.g. data.dat - experimental data.
In this example the following input is available: protein sequence (sequence.seq); PDB files of two known domains, dom1.pdb as a dimer (i.e. a multichain PDB file defining an interface), dom2.pdb as a monomer; and the experimental SAXS data (data.dat).
All the files for this example are included in the documentation directory of the ATSAS installation package.
The two domains are joined by a disordered linker of 10 residues.
EOM 2.0 will be run using the following set up in order to generate a pool of 10000 independent models with random conformations, and enforcing crystallographic point symmetry p2 for the core of each model (the symmetry axis in this case is defined by the orientation of the chains in the dom1.pdb file).
Following pool generation, a genetic algorithm will select the ensemble of scattering profiles computed from the pool that best fits the data.
Typing:
$ eom
on the command line brings up the interactive dialog:
******* ------------------------------------------------------ *******
******* Advanced Ensemble Optimization Method - EOM 2.0 *******
******* Copyright (c) ATSAS Team *******
******* EMBL, Hamburg Outstation, 2007 - 2014 *******
******* *******
******* For doubts/questions please visit SAXIER forum: *******
******* http://www.saxier.org/forum/viewtopic.php?f=10 *******
******* *******
******* In case of bugs please refer to: *******
******* G. Tria, D.I. Svergun, EMBL BioSAXS group *******
******* atsas@embl-hamburg.de *******
******* ------------------------------------------------------ *******
Core symmetry? Select one of: (1) p1, (2) p2, (3) p3, (4) p4, (5) p5,
(6) p6, (7) p7, (8) p8, (9) p9, (10) p10, (11) p11, (12) p12, (13)
p13, (14) p14, (15) p15, (16) p16, (17) p17, (18) p18, (19) p19, (20)
p22, (21) p32, (22) p42, (23) p52, (24) p62, (25) p72, (26) p82, (27)
p92, (28) p102, (29) p112, (30) p122, (31) p222 (default: p1) ........ : p2
Overall symmetry? Select one of: (s) Symmetry, (a) Asymmetry, (m) Mix
(default: Mix) ....................................................... : s
Chain type to generate? Select one of: (c) compact-chain, (n)
native-like, (r) random-coil (default: random-coil) .................. :
Sequence file name? .................................................. : sequence.seq
Number of domains? (default: 0) ...................................... : 2
------ Domain 1 ------
PDB file name for domain ............................................. : dom1.pdb
Does this PDB contain multiple chains? (default: no) ................. : y
Keep the subunit in the original PDB coordinates? (default: no) ...... : y
PDB file for D/RNA bound to the subunit (CR for none) ................ :
------ Domain 2 ------
PDB file name for domain ............................................. : dom2.pdb
Does this PDB contain multiple chains? (default: no) ................. : n
Keep the subunit in the original PDB coordinates? (default: no) ...... : n
PDB file for D/RNA bound to the subunit (CR for none) ................ :
Total number of models to generate (default: 10000) .................. :
File enumeration starting from? (default: 1) ......................... :
Save the generated PDB files? (default: no) .......................... : n
Suffix of generated pdb files? (default: eom) ........................ :
Calculate Intensity? (default: yes) .................................. : y
Number of harmonics (min. 10, max. 50)? (default: 15) ................ :
Maximum s value (min. 0.1, max. 0.5)? (default: 0.500) ............... :
Number of points (min. 10, max. 201)? (default: 51) .................. :
Run the Genetic Algorithm? (default: yes) ............................ : y
How many times (min. 1)? (default: 1) ................................ :
Number of experimental curves to fit? (default: 1) ................... :
Experimental data file name 1? ...................................... : data.dat
Loading values and configuration ...
Number of residues per chain: 387
... starts making models ...
[ 1%] >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[ 10%] >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[ 20%] >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[ 30%] >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[ 40%] >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[ 50%] >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[ 60%] >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[ 70%] >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[ 80%] >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[ 90%] >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[100%]
... completed making models
Running GAJOE: 1
Curve: data.dat - Loading values and configuration ...
Number of theoretical curves : 10000
Starting the Genetic Algorithm ...
CYCLE: 1
Chi^2: 0.389
Ensemble size: 5
...
...
...
CYCLE: 100
Chi^2: 0.388
Ensemble size: 4
... finished the Genetic Algorithm!
Rflex (random) / Rsigma: ~ 66.6% (~ 91.2%) / 0.62
Re-making selected structures ...
[ 20%] [ 40%] [ 60%] [ 80%] [100%] ... completed!
Rancheom.log: log file. File containing the configuration for the run.
juneom.int: intensity file. File containing the theoretical intensity computed for each model in the pool.
Size_listeom.txt: size file. File containing the Rg and Dmax parameters for each model in the pool.
eomTemp_eom.pdb: temporary model file generated by RANCH (is constantly overwritten during RANCH execution)
GA00n/curve_m/: folder containing the result of the genetic algorithm.
An incrementing index n is used for each GAJOE run conducted in the current directory.
The index m is assigned for each experimental data file used.
GA00n/curve_m/logFile_00n_m.log: log file. File containing the configuration for the genetic algorithm.
GA00n/curve_m/profiles_00n_m.fit: fit file. File containing the fit for the best ensemble.
Can be directly opened by SASPLOT or PRIMUS.
If opened with a text editor more information about the selected ensemble can be visualized.
GA00n/curve_m/Rg_distr_00n_m.dat: Rg distribution file. File containing the Rg distribution for the selected models compared to the Rg distribution for the whole pool.
It can be opened using SASPLOT or PRIMUS and selecting View > AbsY:X to display them on a linear scale.
The average Rg values of the pool and the selected structures are contained in the file header.
GA00n/curve_m/Size_distr_00n_m.dat: Size distribution file. File containing the Size distribution for the selected models compared to the Size distribution for the whole pool.
It can be opened using SASPLOT or PRIMUS and selecting View > AbsY:X for display on a linear scale.
The average Dmax values of the pool and the selected structures are contained in the file header.
GA00n/curve_m/pdbs: folder containing the PDB files of the models composing the best fitting ensemble.
Please note that the PDB files of the selected models are not "the structure of the flexible system" but only models that suggest the behaviour of the system in solution.
Please note that the suffix of filenames (e.g. PDB files) will correspond to the answer of the question (default is "eom"): Suffix of generated PDB files ................ < eom >
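When post-processing many runs, the output locations listed above can be generated rather than typed. The sketch below builds the expected relative paths for run n and curve m. It assumes that n is zero-padded to three digits (as in GA001, GA002); the function name and dictionary keys are hypothetical.

```python
from pathlib import PurePosixPath


def gajoe_outputs(run: int, curve: int):
    """Expected output paths for one GAJOE run and one experimental curve."""
    tag = f"{run:03d}_{curve}"
    base = PurePosixPath(f"GA{run:03d}") / f"curve_{curve}"
    return {
        "log": base / f"logFile_{tag}.log",
        "fit": base / f"profiles_{tag}.fit",
        "rg": base / f"Rg_distr_{tag}.dat",
        "size": base / f"Size_distr_{tag}.dat",
        "pdbs": base / "pdbs",
    }
```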
The distributions of Rg and Dmax generated by EOM can be represented as probability density functions. This allows for a quantitative estimation of the flexibility of the system using the concept of information entropy. For example, an ensemble/pool of structural parameters for a protein showing a broad Gaussian-like distribution (where it is assumed the disordered regions move randomly in solution) can be viewed as a carrier of high uncertainty. Conversely, an ensemble/pool of parameters for a protein with a narrow size distribution (a scenario where the particle exhibits limited flexibility) provides low uncertainty. Useful metrics for the quantitative description of uncertainty (flexibility) provided by EOM 2.0 are:
$R_{flex} = -H_b(S)$, where $H_b(S) = -\sum_{i=1}^{n} p(x_i) \log_b[p(x_i)]$,
with $\log_b[p(x_i)] = 0$ if $p(x_i) = 0$
(For further detail refer to the EOM 2.0 paper)
Metric for the degree of flexibility of the selected ensemble and that of the pool.
Rflex = 100% for a fully flexible system, Rflex = 0% for a fully rigid system.
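The entropy term Hb(S) can be computed directly from a binned distribution. The Python sketch below does this and skips empty bins, following the convention given with the formula. Choosing the base b equal to the number of bins, so that a uniform distribution gives 1 and a single populated bin gives 0, is an assumption made here for illustration; the normalisation actually used to report Rflex as a percentage is described in the EOM 2.0 paper and is not reproduced.

```python
import math


def probabilities(counts):
    """Convert histogram counts to probabilities that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


def shannon_entropy(probs, base):
    """H_b(S) = -sum p(x_i) log_b p(x_i), with empty bins contributing 0."""
    entropy = 0.0
    for p in probs:
        if p > 0:
            entropy -= p * math.log(p, base)
    return entropy


# Hypothetical Rg histograms with four bins each.
broad = shannon_entropy(probabilities([5, 5, 5, 5]), base=4)
narrow = shannon_entropy(probabilities([0, 20, 0, 0]), base=4)
```

With these inputs `broad` is larger than `narrow`, reflecting the higher uncertainty of the wider distribution.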
Metric for evaluation of the variance of the distributions of the selected ensemble and that of the pool, defined as the ratio of the standard deviations of the selected ensemble and that of the pool.
Rsigma approaches 1.0 for a fully flexible system and Rsigma < 1.0 for systems with significant flexibility.
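The ratio of standard deviations can be sketched in a few lines of Python. Whether EOM uses the population or the sample standard deviation is not stated in this manual; the population form is assumed below, and the function name is hypothetical.

```python
import statistics


def r_sigma(selected, pool):
    """Ratio of the standard deviation of the selected ensemble to the pool's."""
    return statistics.pstdev(selected) / statistics.pstdev(pool)


# Hypothetical Rg values (in Angstrom) for a pool and a narrower selection.
pool_rg = [10.0, 20.0, 30.0, 40.0, 50.0]
selected_rg = [20.0, 30.0, 40.0]
ratio = r_sigma(selected_rg, pool_rg)
```

A selection that spans a narrower range than the pool gives a ratio below 1, as in this example.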
For example, the following output from EOM/GAJOE facilitates assessment of the flexibility of the system:
Rflex (random) / Rsigma: ~ 66.6% (~ 91.2%) / 0.62
Rflex of the selected ensemble is ~67%, compared to ~91% for the pool, suggesting that this system is significantly less flexible than the pool. Rsigma is much less than 1.0, supporting the hypothesis that the system is significantly flexible.
N.B. If Rflex of the ensemble is significantly smaller than that of the pool, but Rsigma > 1.0, this may indicate a problem with the experimental data and further investigation is required.
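To collect these metrics from many runs, the summary line can be parsed with a regular expression. The pattern below is written against the single example line shown in this manual and may need adjusting if the program prints the values differently; the function name and dictionary keys are hypothetical.

```python
import re

SUMMARY = re.compile(
    r"Rflex \(random\) / Rsigma:\s*~\s*([\d.]+)%\s*\(~\s*([\d.]+)%\)\s*/\s*([\d.]+)"
)


def parse_flexibility(line: str):
    """Return the three values from a summary line, or None if it does not match."""
    match = SUMMARY.search(line)
    if match is None:
        return None
    ensemble, pool, rsigma = (float(g) for g in match.groups())
    return {"rflex_ensemble": ensemble, "rflex_pool": pool, "rsigma": rsigma}


metrics = parse_flexibility("Rflex (random) / Rsigma: ~ 66.6% (~ 91.2%) / 0.62")
```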
The following examples demonstrate command line usage of EOM, using the example files provided in the EOM documentation directory of the ATSAS installation.
These examples cover generation of a random pool for completely disordered protein sequences (where no high-resolution PDB input is required), and also the generation of multichain flexible models.
Example 1, sequence only (no PDB file input):
$ eom sequence.seq -q=15000 -c=n
This command generates a pool of scattering intensities computed from 15000 native chain models. No individual PDB files are created.
$ eom sequence.seq -w=. -i
This command generates a pool of scattering intensities computed from the default number (10000) of random chain/coil models.
In addition, the -w flag saves all generated models in PDB format in the specified directory ("." here indicating the current directory).
Inclusion of the -i flag means that the theoretical intensities of the pool models will not be calculated.
$ eom sequence.seq -q=15000 -data=data.dat -t=50
This command generates a pool of 15000 random coil models, calculates the scattering intensities and subsequently performs 50 iterations of the genetic algorithm using the experimental data file data.dat.
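For batch work, command lines like those above can be assembled in a script. The Python sketch below only builds the argument list and the equivalent shell string; it does not run EOM. It uses only the flags that appear in the examples above (-q, -c, -data, -t), and the function and parameter names are hypothetical.

```python
import shlex


def eom_command(sequence, models=None, chain=None, data=None, repeats=None):
    """Build an argument list for eom from the flags shown in the examples."""
    argv = ["eom", sequence]
    if models is not None:
        argv.append(f"-q={models}")      # number of models in the pool
    if chain is not None:
        argv.append(f"-c={chain}")       # chain type, e.g. "n" for native
    if data is not None:
        argv.append(f"-data={data}")     # experimental data file
    if repeats is not None:
        argv.append(f"-t={repeats}")     # genetic algorithm repetitions
    return argv


argv = eom_command("sequence.seq", models=15000, data="data.dat", repeats=50)
command_line = shlex.join(argv)
```

The resulting list can be passed to `subprocess.run` on a machine where ATSAS is installed.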
Example 2, usage of high-resolution models and symmetry:
This command generates a pool of 5000 dimer models using the subunits dom1.pdb and dom2.pdb.
The first PDB file (dom1.pdb) contains an oligomerized/multichain subunit (-o=yes) and will be fixed in the original coordinates (-f=yes). The pool of scattering intensities will be computed and the genetic algorithm run (by default: 1 x 100 cycles).
The same output files are generated by EOM in command line mode and using the interactive configuration.
EOM can also be used for pool generation and ensemble selection separately, as was performed in previous versions using the independent programs, RANCH and GAJOE.
When doing so, please ensure both tools originate from the same ATSAS release.
N.B.: using versions of RANCH and GAJOE from separate ATSAS releases may generate spurious results.
RANCH (RANdom CHain) - tool for the generation of a pool of random models based upon user supplied sequence and structural information;
GAJOE (Genetic Algorithm Judging Optimisation of Ensembles) - tool using a genetic algorithm for the selection of an ensemble of models from a previously generated pool, whose combined theoretical scattering intensity best describes the experimental SAXS data.
RANCH is a program that generates a pool of n independent models based upon sequence and structural information. For multi-domain proteins where high-resolution structures for individual domains are available, such files (e.g. PDB) can be used as rigid-body domains/subunits and/or as an aid to define distance constraints during model generation. For proteins expected to be intrinsically unfolded, no rigid bodies are used and random configurations of the alpha-carbon trace are created based upon the sequence. Crystallographic symmetry can also be applied (P1, P2, ..., Pn); this requires either a high-resolution multichain/oligomerized PDB file as input or the specification of a potential oligomerization interface via a set of user-defined distance constraints. To run RANCH type:
Size file. File containing the Rg and Dmax parameters for each model in the pool.
Please note that the suffix of filenames (e.g. RANCH log file, intensities file, PDB files) will correspond to the answer of the question (default is "eom"): Suffix of generated PDB files ................ < eom >
Rancheom.log:
RANCH Version 2.0
Started: ............................................... : Thu Nov 14 14:01:02 2013
iSeed .................................................. : 1040227523
Chain type ............................................. : Random
Sequence file name: .................................... : sequence.seq
Symmetry: .............................................. : p2
Symmetry type: ......................................... : Mix
Percentage of Symmetric: ............................... : 50
Number of residues per chain: .......................... : 387
Number of atoms: ....................................... : 3095
Number of domains: ..................................... : 2
---------------------------------------------------------
Domain number: ......................................... : 1
Path: .................................................. : dom1.pdb
Kept in the original PDB coordinates: .................. : 1
Oligomer: .............................................. : 1
DNA file name: ......................................... : None
---------------------------------------------------------
Domain number: ......................................... : 2
Path: .................................................. : dom2.pdb
Kept in the original PDB coordinates: .................. : 0
Oligomer: .............................................. : 0
DNA file name: ......................................... : None
---------------------------------------------------------
Contact file name: ..................................... : None
Number of structures: .................................. : 10000
Suffix of the pdb files: ............................... : eom
Output Folder Path: .................................... : .
Number of experimental curves: ......................... : 0
Structure number Seed Type(0=aSymm,1=Symm,2=noSymm)
1 1040227523 0
2 955133408 1
3 78042755 0
4 1813283763 1
5 1527150186 0
. .......... .
. .......... .
. .......... .
9998 811760116 0
9999 1500495652 1
10000 1885282777 0
Completed: ............................................. : Thu Nov 14 18:33:22 2013
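The per-structure table at the end of the log can be used to check how many symmetric and asymmetric models a Mix pool actually contains. The Python sketch below counts the Type column; it assumes that table rows are the only lines consisting of exactly three integers with the last one being 0, 1 or 2. The function name is hypothetical.

```python
import re
from collections import Counter

ROW = re.compile(r"^\s*(\d+)\s+(\d+)\s+([012])\s*$")
TYPE_NAMES = {0: "asymmetric", 1: "symmetric", 2: "no symmetry"}


def structure_types(log_text: str) -> Counter:
    """Count structures per type from the table in a RANCH log."""
    counts = Counter()
    for line in log_text.splitlines():
        match = ROW.match(line)
        if match:
            counts[TYPE_NAMES[int(match.group(3))]] += 1
    return counts


excerpt = """Number of structures: .................................. : 10000
Structure number Seed Type(0=aSymm,1=Symm,2=noSymm)
1 1040227523 0
2 955133408 1
3 78042755 0
4 1813283763 1
5 1527150186 0
"""
counts = structure_types(excerpt)
```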
Where:
Rg = radius of gyration of the MODEL,
Dmax = maximum particle dimension of the MODEL,
CA-CA = end-to-end distance of MODEL (CA of n-term to CA of c-term),
Volume = dry volume computed for the MODEL
GAJOE (Genetic Algorithm Judging Optimisation of Ensembles) is a program that uses a genetic algorithm for the selection of an ensemble of models whose combined theoretical scattering intensity best describes the experimental SAXS data.
Typically, this program is executed following the generation of a random pool of models using RANCH; however, GAJOE may also be run on a pool of PDBs generated externally by other methods/programs.
GAJOE requires several RANCH format output files as input.
Number of experimental curves to use for ensemble selection. An independent GAJOE run will be conducted for each input data set and the results written to a separate subdirectory (e.g. curve_1, curve_2 etc).
Number of ensembles of theoretical curves (from the models in the pool) to use for ensemble selection by the genetic algorithm. It is NOT recommended to use more than 200.
Number of mutations (curves to be crossed/replaced) in the genetic algorithm. This value should normally be around half of the number of curves per ensemble. This question appears only if a pre-fixed ensemble size is selected.
Number of best ensembles to pass to the next generation in the genetic algorithm. This value should normally be around half of the number of ensembles for each generation. This question appears only if a pre-fixed ensemble size is selected.
Number of times (cycles) to repeat the genetic algorithm. It is recommended to run the genetic algorithm at least 50 times. Please note that final distributions are based on the repetition of the genetic algorithm. Selecting a small number of repetitions may generate nonsensical final distributions.
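The rules of thumb given for these questions can be collected in a small helper. The Python sketch below derives the suggested number of mutations and of ensembles passed to the next generation, and returns warnings when the recommended limits are exceeded. The function name, parameter names and warning texts are hypothetical; only the numeric guidance comes from the descriptions above.

```python
def suggest_ga_settings(curves_per_ensemble: int,
                        ensembles_per_generation: int,
                        repetitions: int):
    """Apply the manual's rules of thumb to a proposed GAJOE configuration."""
    warnings = []
    if ensembles_per_generation > 200:
        warnings.append("more than 200 ensembles is not recommended")
    if repetitions < 50:
        warnings.append("fewer than 50 repetitions may distort the distributions")
    return {
        # Around half of the curves per ensemble.
        "mutations": curves_per_ensemble // 2,
        # Around half of the ensembles in each generation.
        "ensembles_to_pass": ensembles_per_generation // 2,
        "warnings": warnings,
    }
```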
This command runs GAJOE (by default: 100 times) against the file data.dat using three different pools of intensities. In this example, the constant subtraction is disabled.
N.B.: the parameter -m=3 is required to define the number of pools used
gajoe data.dat -i=juneom.int -f=10 -t=150
This command runs GAJOE 150 times using a single pool and a pre-fixed ensemble size (10 model curves).
Once completed, GAJOE creates a subfolder in the working directory containing all files resulting from the computation. The subfolders are named in the form GAnum where num is the sequential number for each independent run (e.g. GA001, GA002 etc.). In each subfolder the following files/folders are written:
Fit file. File containing the fit for the best ensemble for the curve m. It can be opened directly in SASPLOT/PRIMUS. Detailed information (e.g. the discrepancy, CHI^2) is contained in the header of the file and can be viewed with a text editor.
Rg distribution file. File containing the Rg distribution of the selected ensemble for the curve m and the Rg distribution for the pool. The file can be directly opened using SASPLOT/PRIMUS and selecting View > AbsY:X (linear scale). The average Rg values of the pool and the selected structures are listed in the file header and can be viewed with a text editor.
Size distribution file. File containing the distribution of maximum model dimensions (Dmax) of the selected ensemble for the curve m, and the corresponding distribution for the pool. The file can be directly opened using SASPLOT/PRIMUS and selecting View > AbsY:X (linear scale). The average Dmax values of the pool and the selected structures are listed in the file header and can be viewed with a text editor.
Additionally, a subfolder named pdbs is created, containing the models from the selected ensemble produced in the cycle with the lowest CHI^2 value for the curve m.
Please note that the PDB files in this folder are NOT the structure of the flexible system but serve as descriptors of the behaviour of the system in solution and are used to generate the Rg/Dmax distributions and flexibility metrics.
IMPORTANT: in order to re-create the selected models with GAJOE all the RANCH files used during pool generation must be in the same directory (i.e.: sequence file, PDB files of domains/subunits, contact file, etc.).
When running GAJOE using an external pool, in addition to the standard output files, a pseudo RANCH log file (RanchXXX.log), an intensity file (junXXX.int) and a Size file (Size_listXXX.txt) are also generated (where XXX is the file name entered by the user for the RANCH pool generation). These can be used as input for additional GAJOE runs without needing the option -p.