This is the manual for the program suite EOM 3.0 (Ensemble Optimisation Method), which seeks to describe experimental SAXS data using an ensemble representation of atomic models.
The following sections briefly describe the components of the new program EOM 3.0 and detail the steps required to run the suite. File input and output are explained.
NOTE that the latest version of the program uses an updated modular protocol for: model generation (RANCH), intensities calculation (FFMAKER) and choice of several selection options (GAJOE/NNLSJOE). It is no longer possible to run the deprecated EOM executable. If one requires model generation it is recommended to execute RANCH (or obtain models from some other program), then subsequently use these models as input for the selection algorithms GAJOE or NNLSJOE.
If you use results from EOM in your own publication, please cite:
EOM 3.0 is a suite of programs that facilitate fitting of an averaged theoretical scattering intensity derived from an ensemble of conformations to experimental SAXS data.
A pool of n independent models based upon sequence and structural information is first generated (e.g. using the updated program RANCH).
For multi-domain proteins where high-resolution structures for individual subunits/domains are available, these structures and distance/orientation information derived from them can be used as rigid-bodies and/or constraints in EOM model generation.
For proteins expected to be intrinsically unfolded no rigid bodies are required as input, and completely random configurations of the alpha-carbon trace are created based upon the sequence alone.
Crystallographic symmetry, if required, must be defined by the user as an appropriately arranged set of input rigid bodies (CIF or PDB format, with the user applying the fixed flag to maintain the desired orientation of such bodies). RANCH will not apply symmetry operations. Inter-domain/subunit contacts can be imposed to generate homo/hetero oligomers and complexes by providing distance constraints.
Once pool generation is complete, the user can compute the theoretical scattering intensities of the models in the pool using FFMAKER. FFMAKER generates the input for the ensemble selection methods: a genetic algorithm (GAJOE) or a non-negative linear least-squares algorithm (NNLSJOE). The selection algorithm compares the averaged theoretical scattering intensity from n independent ensembles of conformations against the experimental scattering data.
The ensemble that best describes the experimental SAXS data is selected.
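The comparison described above can be illustrated with a minimal Python sketch. It assumes that the ensemble intensity is the plain point-by-point average of the member curves and that the discrepancy is a reduced chi^2 with a least-squares scale factor, which is a common convention for SAXS fits; the exact expression used by GAJOE/NNLSJOE is not given in this manual. All function names and the toy data are hypothetical.

```python
def ensemble_average(curves):
    """Average several intensity curves point by point."""
    n = len(curves)
    return [sum(values) / n for values in zip(*curves)]


def scale_factor(i_exp, sigma, i_calc):
    """Least-squares scale c minimising sum(((i_exp - c * i_calc) / sigma)^2)."""
    num = sum(e * t / s ** 2 for e, s, t in zip(i_exp, sigma, i_calc))
    den = sum(t * t / s ** 2 for s, t in zip(sigma, i_calc))
    return num / den


def chi_squared(i_exp, sigma, i_calc):
    """Reduced chi^2 between the experimental and the scaled calculated curve."""
    c = scale_factor(i_exp, sigma, i_calc)
    total = sum(((e - c * t) / s) ** 2 for e, s, t in zip(i_exp, sigma, i_calc))
    return total / (len(i_exp) - 1)


# Toy pool of three model curves sampled at four points (assumed values).
pool = [
    [10.0, 8.0, 5.0, 2.0],
    [10.0, 6.0, 3.0, 1.0],
    [10.0, 9.0, 7.0, 4.0],
]
# "Experimental" data built as 2 x the average of pool[0] and pool[1].
i_exp = [20.0, 14.0, 8.0, 3.0]
sigma = [0.5, 0.5, 0.5, 0.5]

# Score every two-member ensemble and keep the one with the lowest chi^2.
candidates = [(0, 1), (0, 2), (1, 2)]
best = min(
    candidates,
    key=lambda idx: chi_squared(i_exp, sigma, ensemble_average([pool[i] for i in idx])),
)
print(best)
```

For a realistic pool the number of possible ensembles is far too large to enumerate, which is why a search strategy such as a genetic algorithm or a non-negative least-squares fit is used instead.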
The distributions of Rg and Dmax generated by EOM (specifically the GAJOE module) can be represented as probability density functions. This allows for a quantitative estimation of the flexibility of the system using the concept of information entropy. For example, an ensemble/pool of structural parameters for a protein showing a broad Gaussian-like distribution (where it is assumed the disordered regions move randomly in solution) can be viewed as a carrier of high uncertainty. Conversely, an ensemble/pool of parameters for a protein with a narrow size distribution (a scenario where the particle exhibits limited flexibility) provides low uncertainty. Useful metrics for the quantitative description of uncertainty (flexibility) provided by EOM 2.0 are:
$R_{flex} = -H_b(S)$, where $H_b(S) = -\sum_{i=1}^{n} p(x_i) \log_b[p(x_i)]$,
with $\log_b[p(x_i)] = 0$ if $p(x_i) = 0$
(For further detail refer to the EOM 2.0 paper)
Rflex: metric for the degree of flexibility of the selected ensemble and that of the pool.
Rflex = 100% for a fully flexible system, Rflex = 0% for a fully rigid system.
Rsigma: metric for evaluating the variance of the distribution of the selected ensemble relative to that of the pool, defined as the ratio of the standard deviation of the selected ensemble to that of the pool.
Rsigma approaches 1.0 for a fully flexible system and Rsigma < 1.0 for systems with limited flexibility.
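As an illustration of the two metrics, the sketch below computes an entropy-based flexibility value and a ratio of standard deviations from lists of Rg values. It assumes that the logarithm base b equals the number of histogram bins and that the sign is chosen so the result runs from 0% (all models in one bin) to 100% (uniform distribution), matching the limits quoted above. The binning, normalisation and the use of the population standard deviation are assumptions, not the GAJOE implementation, and all names are hypothetical.

```python
import math
import statistics


def histogram_probabilities(values, lo, hi, n_bins):
    """Bin values into n_bins equal-width bins on [lo, hi]; return probabilities."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        i = min(int((v - lo) / width), n_bins - 1)  # put v == hi into the last bin
        counts[i] += 1
    return [c / len(values) for c in counts]


def r_flex(probabilities):
    """Shannon entropy with log base = number of bins (>= 2), in percent."""
    b = len(probabilities)
    # Terms with p == 0 are skipped, i.e. they contribute 0 to the sum.
    entropy = -sum(p * math.log(p, b) for p in probabilities if p > 0)
    return 100.0 * entropy


def r_sigma(ensemble_values, pool_values):
    """Ratio of the standard deviations of the ensemble and of the pool."""
    return statistics.pstdev(ensemble_values) / statistics.pstdev(pool_values)


# Toy Rg values in angstrom (assumed): a broad pool and a narrower ensemble.
pool_rg = [20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0, 55.0]
ensemble_rg = [30.0, 30.0, 35.0, 35.0]

pool_p = histogram_probabilities(pool_rg, 20.0, 60.0, 4)
ensemble_p = histogram_probabilities(ensemble_rg, 20.0, 60.0, 4)
print(r_flex(ensemble_p), r_flex(pool_p), r_sigma(ensemble_rg, pool_rg))
```

In this toy case the pool fills all four bins evenly while the ensemble occupies a single bin, so the ensemble value is the lower of the two and the ratio of standard deviations is below 1.0.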
For example, the following output from EOM/GAJOE facilitates assessment of the flexibility of the system:
Rflex (random) / Rsigma: ~ 66.6% (~ 91.2%) / 0.62
Rflex of the selected ensemble is ~67%, compared to ~91% for the pool, suggesting that this system is significantly less flexible than the pool. Rsigma (0.62) is well below 1.0, supporting the hypothesis that the flexibility of the system is limited.
N.B. If Rflex of the ensemble is significantly smaller than that of the pool, but Rsigma > 1.0, this may indicate a problem with the experimental data and further investigation is required.
RANCH is a program that generates a pool of n independent models based upon sequence and structural information. For multi-domain proteins where high-resolution structures for individual domains are available, such files (e.g. CIF/PDB) can be used as rigid-body domains/subunits. For proteins expected to be intrinsically unfolded, no rigid bodies are used and random configurations of the alpha-carbon trace are created based upon the sequence. Crystallographic symmetry can be constructed by fixing input rigid bodies at the required positions upon input; symmetry operations (P1, P2, ..., Pn) are not applied by RANCH. Contacts between rigid bodies and unstructured regions of the sequence can be introduced via a set of user-defined distance constraints.
RANCH accepts absolute as well as relative paths to the input SEQUENCE, ASSIGNMENT and atomic coordinate FILE(s).
If no path is provided, RANCH reads from stdin. In all cases the coordinate input may be either in PDB or mmCIF format.
The OPTIONS known by RANCH are described in the next section.
Required. Domain assignments: the chain ID and residue numbering corresponding to structured and unstructured regions of the sequence. This file defines the sequence regions corresponding to input CIF/PDB files, as well as user-defined stretches of ideal strand and helix.
number of output model files (CIF); default: 10000
--database=<FILE>
Quasi-Ramachandran database file (dihedral map).
NOTE that three designations in the ASSIGNMENT file can be used that define the dihedral angles used: disordered (for intrinsically disordered and unstructured regions), denatured (for chemically denatured proteins/peptides) and compact (for compact structure).
--database-threshold=<ARG>
probabilities from the Quasi-Ramachandran dihedral map less than this threshold will be set to 0.0 (default: 0.0025)
--distance-constraints=<FILE>
File listing distance constraints between specified sequence positions/amino-acids
RANCH accepts atomic coordinate data in PDB or mmCIF format as input, and a single sequence file in FASTA format. This may
be either a relative or absolute file path, or data will be read from stdin.
RANCH writes atomic coordinate data in PDB or mmCIF format on output. By
default the coordinate files are written to the current directory, or a directory may be specified as part of the prefix.
#assignment_ss.txt
A 1 10 disordered
A 11 22 helix
A 22 26 disordered
A 27 37 strand
A 38 100 disordered
In this case, for a single chain (A), RANCH generates coordinates for unstructured residues 1-10, 22-26 and 38-100 using the disordered Quasi-Ramachandran database for dihedral angles, and uses dihedral angles from the helical and beta-strand regions of the Quasi-Ramachandran database for residues 11-22 and 27-37, respectively.
#assignment.txt
A 1 218 structure fixed
A 219 228 disordered
A 229 387 structure
B 1 218 structure fixed
In this case a multi-domain protein (chain A) forms a complex with another protein (chain B). Chain A contains a structured N-terminal region (1-218), an unstructured region (219-228) and a second structured region (229-387). The interface is defined by the user input coordinate files (domain1.cif and domain2.cif), and these pre-oriented coordinate files are fixed in position. RANCH will allow the unstructured region to undergo conformational sampling while the interface is maintained.
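To check an assignment file before passing it to RANCH, it can help to read it into records. The following is a minimal sketch that assumes only what the examples above show: whitespace-separated fields (chain, first residue, last residue, type, optional flag such as fixed) and lines starting with # treated as comments. It is not the parser used by RANCH, and the names Segment and parse_assignment are hypothetical.

```python
from typing import NamedTuple, Optional


class Segment(NamedTuple):
    chain: str
    first: int
    last: int
    kind: str            # e.g. "structure", "disordered", "helix", "strand"
    flag: Optional[str]  # e.g. "fixed", or None when absent


def parse_assignment(text):
    """Parse assignment lines into Segment records, skipping comments and blanks."""
    segments = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        chain, first, last, kind = fields[:4]
        flag = fields[4] if len(fields) > 4 else None
        segments.append(Segment(chain, int(first), int(last), kind, flag))
    return segments


example = """\
#assignment.txt
A 1 218 structure fixed
A 219 228 disordered
A 229 387 structure
B 1 218 structure fixed
"""

segments = parse_assignment(example)
fixed = [s for s in segments if s.flag == "fixed"]
for s in segments:
    print(s.chain, s.first, s.last, s.kind, s.flag or "-")
```

Here the two fixed segments are the bodies that keep their input positions, while the remaining segments are free to move during sampling.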
Apply distance constraints to define an interface, rather than pre-orienting domain1.cif and domain2.cif, using the following assignment.txt and distances.txt files:
# assignment.txt
A 1 218 structure
A 219 228 disordered
A 229 387 structure
B 1 218 structure
# distances.txt
A 140 145 B 140 145 15
In the above case a 15 angstrom upper limit distance is defined between residues 140-145 of chain A and residues 140-145 of chain B.
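The meaning of such a constraint can be sketched in Python as a check on a generated model. The sketch assumes that the constraint is satisfied when the smallest distance between any residue of the first range and any residue of the second range does not exceed the limit, and that one representative point (for example the alpha-carbon) is used per residue. How RANCH actually measures the distance is not specified in this manual, and all names below are hypothetical.

```python
import math


def parse_constraint(line):
    """Split 'A 140 145 B 140 145 15' into two residue ranges and a limit."""
    c1, a1, b1, c2, a2, b2, limit = line.split()
    return (c1, int(a1), int(b1)), (c2, int(a2), int(b2)), float(limit)


def min_distance(coords, range1, range2):
    """Smallest distance between any residue in range1 and any in range2."""
    chain1, first1, last1 = range1
    chain2, first2, last2 = range2
    best = math.inf
    for r1 in range(first1, last1 + 1):
        for r2 in range(first2, last2 + 1):
            p = coords.get((chain1, r1))
            q = coords.get((chain2, r2))
            if p is not None and q is not None:  # skip residues without coordinates
                best = min(best, math.dist(p, q))
    return best


def satisfied(coords, line):
    """True if the model described by coords meets the upper distance limit."""
    range1, range2, limit = parse_constraint(line)
    return min_distance(coords, range1, range2) <= limit


# Toy coordinates keyed by (chain, residue number); values are assumed.
close_model = {("A", 140): (0.0, 0.0, 0.0), ("B", 142): (0.0, 0.0, 12.0)}
far_model = {("A", 140): (0.0, 0.0, 0.0), ("B", 142): (0.0, 0.0, 20.0)}

print(satisfied(close_model, "A 140 145 B 140 145 15"))
print(satisfied(far_model, "A 140 145 B 140 145 15"))
```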
The program FFMAKER is designed to facilitate the creation of form-factor
files for input to the programs OLIGOMER and EOM (GAJOE/NNLSJOE). FFMAKER computes the scattering amplitudes from an input set of models (CIF/PDB) and optionally generates an intensities file and corresponding size distribution for EOM analysis (pool mode).
FFMAKER accepts absolute as well as relative paths to the input atomic coordinate FILE(s) and generates an intensities file (txt,csv) and size file containing the statistics of the pool (Rg, Dmax, Ca-Ca distance, volume).
If no path is provided, FFMAKER reads from stdin. In all cases the coordinate input may be either in PDB or mmCIF format.
The OPTIONS known by FFMAKER are described in the next section.
FFMAKER accepts atomic coordinate data in PDB or mmCIF format as input. This may
be either a relative or absolute file path, or data will be read from stdin.
FFMAKER writes a tabular file of computed scattering intensities from input atomic models. In pool mode FFMAKER also writes a tabular size file of statistics from input atomic models (Rg, Dmax, end-to-end Ca-Ca distance & model volume). By
default the intensities and size/statistics files are written to the current directory, or a directory may be specified as part of the prefix and/or pooli options.
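The size statistics listed above have simple geometric definitions, sketched here in Python for a list of alpha-carbon positions. The sketch uses unweighted points; FFMAKER may weight atoms or account for hydration differently, so the numbers are illustrative only. The function names are hypothetical.

```python
import math
from itertools import combinations


def radius_of_gyration(points):
    """Root-mean-square distance of the points from their centroid."""
    n = len(points)
    centroid = [sum(p[k] for p in points) / n for k in range(3)]
    return math.sqrt(sum(math.dist(p, centroid) ** 2 for p in points) / n)


def max_dimension(points):
    """Largest distance between any two points (Dmax)."""
    return max(math.dist(p, q) for p, q in combinations(points, 2))


def end_to_end(points):
    """Distance between the first and the last point of the chain."""
    return math.dist(points[0], points[-1])


# Toy trace: three points on a line, 3.8 angstrom apart (assumed values).
trace = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]

print(radius_of_gyration(trace), max_dimension(trace), end_to_end(trace))
```

For a straight chain Dmax and the end-to-end distance coincide; for compact or bent conformations the end-to-end distance is generally the smaller of the two.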
GAJOE is a program that uses a genetic algorithm to select an ensemble of models whose combined theoretical scattering intensity best describes the experimental SAXS data.
GAJOE can be run using tabular files of intensities and size/statistics generated by FFMAKER. Thus the input models describing a pool of conformations may be derived from e.g. RANCH or any other program that provides CIF/PDB format.
This command runs GAJOE 100 times against the file datafile.dat, using the pool of intensities file intensities.csv and size/statistic file size_statistics.csv.
No models are repeated and the ensemble size is maintained at a size of 50 members.
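The idea of selecting a fixed-size ensemble without repeated models can be sketched as a toy genetic algorithm in Python. This is not the GAJOE implementation: the crossover, mutation and elitism steps, the unscaled mean-squared discrepancy, and all parameter values are assumptions chosen to keep the example short, and every name is hypothetical. The ensemble size must be smaller than the pool size.

```python
import random


def discrepancy(pool, members, target):
    """Mean squared deviation between the ensemble-averaged curve and the target."""
    n = len(members)
    avg = [sum(pool[m][k] for m in members) / n for k in range(len(target))]
    return sum((a - t) ** 2 for a, t in zip(avg, target)) / len(target)


def crossover(a, b, size, rng):
    """Draw a child of `size` distinct models from the union of two parents."""
    return tuple(sorted(rng.sample(sorted(set(a) | set(b)), size)))


def mutate(members, pool_size, rng):
    """Swap one member for a pool model that is not yet in the ensemble."""
    unused = [i for i in range(pool_size) if i not in members]
    child = list(members)
    child[rng.randrange(len(child))] = rng.choice(unused)
    return tuple(sorted(child))


def select_ensemble(pool, target, size, generations=100, population=20, seed=0):
    """Evolve fixed-size ensembles of distinct pool indices; return the best one."""
    rng = random.Random(seed)
    pop = [tuple(sorted(rng.sample(range(len(pool)), size))) for _ in range(population)]
    for _generation in range(generations):
        children = [
            mutate(crossover(rng.choice(pop), rng.choice(pop), size, rng), len(pool), rng)
            for _child in range(population)
        ]
        # Elitism: parents and children compete, the best `population` survive.
        ranked = sorted(set(pop + children), key=lambda m: discrepancy(pool, m, target))
        pop = ranked[:population]
    return pop[0]


# Toy pool of eight three-point curves; the target is the average of two of them.
pool = [[1.0, 1.0 / (1 + i), 1.0 / (1 + i) ** 2] for i in range(8)]
target = [(a + b) / 2 for a, b in zip(pool[2], pool[5])]

best = select_ensemble(pool, target, size=2)
print(best, discrepancy(pool, best, target))
```

Because each ensemble is stored as a sorted tuple of distinct indices, the "no repeated models" and "fixed ensemble size" conditions hold by construction throughout the run.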
Once completed, GAJOE creates a subfolder in the working directory containing all files resulting from the computation. The subfolders are named in the form GAnum where num is the sequential number for each independent run (e.g. GA001, GA002 etc.). In each subfolder the following files/folders are written:
Fit file. File containing the fit for the best ensemble for the curve m. It can be opened directly in SASPLOT/PRIMUS. Detailed information (e.g. the discrepancy, CHI^2) is contained in the header of the file and can be viewed with a text editor.
Rg distribution file. File containing the Rg distribution of the selected ensemble for the curve m and the Rg distribution for the pool. The file can be directly opened using SASPLOT/PRIMUS and selecting View > AbsY:X (linear scale). Detailed information on the average Rg values of the pool and the selected structures is listed in the file header and can be viewed with a text editor.
Size distribution file. File containing the size distribution of maximum model dimensions (Dmax) of the selected ensemble for the curve m, and the size distribution for the pool. The file can be directly opened using SASPLOT/PRIMUS and selecting View > AbsY:X (linear scale). Detailed information on the average Dmax values of the pool and the selected structures is listed in the file header and can be viewed with a text editor.
Additionally, a subfolder named pdbs is created, containing the models from the selected ensemble produced in the cycle with the lowest CHI^2 value for the curve m.
Please note that the PDB files in this folder are NOT the structure of the flexible system but serve as descriptors of the behaviour of the system in solution and are used to generate the Rg/Dmax distributions and flexibility metrics.
Warning: When running GAJOE on short peptides it is recommended to use a fixed-size ensemble
with 50 curves per ensemble and to disallow repetitions. See paper
NNLSJOE is an alternative program for selecting an ensemble of models whose combined theoretical scattering intensity best describes the experimental SAXS data.
NNLSJOE can be run using tabular files of intensities and size/statistics generated by FFMAKER. Thus the input models describing a pool of conformations may be derived from e.g. RANCH or any other program that provides CIF/PDB format.
This command runs NNLSJOE against the file datafile.dat, using the pool of intensities file intensities.csv and size/statistic file size_statistics.csv.
All models and repeats are considered and the optimum ensemble size determined.
Once completed, NNLSJOE writes a file describing the fit of the selected intensities to the experimental data, and reports the models selected and statistics to stdout.
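The principle of a non-negative least-squares selection can be sketched in pure Python: find weights w_j >= 0 so that the weighted sum of the pool curves is as close as possible to the data, then keep the models with non-zero weight. The solver below is a simple projected gradient descent, swapped in for whatever solver NNLSJOE uses internally (not described in this manual); in practice an established routine such as scipy.optimize.nnls would be a more typical choice. Experimental errors are ignored for brevity, and the names and toy data are hypothetical.

```python
def nnls_projected_gradient(curves, target, iterations=5000):
    """Minimise ||sum_j w_j * curves[j] - target||^2 subject to w_j >= 0."""
    m, n = len(curves), len(target)
    # Gram matrix G = A^T A and vector h = A^T b, with the curves as columns of A.
    gram = [
        [sum(curves[i][k] * curves[j][k] for k in range(n)) for j in range(m)]
        for i in range(m)
    ]
    h = [sum(curves[i][k] * target[k] for k in range(n)) for i in range(m)]
    # The trace of G bounds its largest eigenvalue, so this step size is safe.
    step = 1.0 / sum(gram[i][i] for i in range(m))
    w = [0.0] * m
    for _ in range(iterations):
        grad = [sum(gram[i][j] * w[j] for j in range(m)) - h[i] for i in range(m)]
        # Gradient step followed by projection onto the non-negative orthant.
        w = [max(0.0, wi - step * gi) for wi, gi in zip(w, grad)]
    return w


# Toy pool of three curves; the target is 0.3 * curve 0 + 0.7 * curve 1.
curves = [
    [1.0, 0.2, 0.0],
    [1.0, 0.9, 0.1],
    [1.0, 0.5, 0.9],
]
target = [0.3 * a + 0.7 * b for a, b in zip(curves[0], curves[1])]

weights = nnls_projected_gradient(curves, target)
selected = [j for j, w in enumerate(weights) if w > 1e-6]
print(weights, selected)
```

Unlike the genetic algorithm, this approach does not fix the ensemble size in advance: the number of models with non-zero weight follows from the fit itself.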