EOM 3.0: Ensemble Optimization Method

Written by H. Mertens, D. Franke, P. Markov, G. Tria & D.I. Svergun.
Post all your questions about EOM to the ATSAS Forum.

This is the manual for the program suite EOM 3.0 (Ensemble Optimisation Method), which seeks to describe experimental SAXS data using an ensemble representation of atomic models.

The following sections briefly describe the components of the new program EOM 3.0 and detail the steps required to run the suite. File input and output are explained.

NOTE: the latest version of the program uses an updated modular protocol consisting of model generation (RANCH), intensity calculation (FFMAKER) and a choice of selection algorithms (GAJOE/NNLSJOE). It is no longer possible to run the deprecated EOM executable. If model generation is required, run RANCH (or obtain models from another program), then use these models, via FFMAKER, as input for the selection algorithms GAJOE or NNLSJOE.

If you use results from EOM in your own publication, please cite:

Tria, G., Mertens, H. D. T., Kachala, M. & Svergun, D. I. (2015) Advanced ensemble modelling of flexible macromolecules using X-ray solution scattering. IUCrJ 2, 207-217.

Bernado, P., Mylonas, E., Petoukhov, M.V., Blackledge, M., Svergun, D.I. (2007) Structural Characterization of Flexible Proteins Using Small-Angle X-ray Scattering. J. Am. Chem. Soc. 129(17), 5656-5664.

EOM 3.0

Manual

Introduction

EOM 3.0 is a suite of programs that fit an averaged theoretical scattering intensity, derived from an ensemble of conformations, to experimental SAXS data. A pool of n independent models based upon sequence and structural information is first generated (e.g. using the updated program RANCH). For multi-domain proteins where high-resolution structures for individual subunits/domains are available, these structures and distance/orientation information derived from them can be used as rigid bodies and/or constraints in EOM model generation. For proteins expected to be intrinsically unfolded no rigid bodies are required as input, and completely random configurations of the alpha-carbon trace are created based upon the sequence alone.

Crystallographic symmetry, if required, must be defined by the user as an appropriately arranged set of input rigid bodies (CIF or PDB format, with the user applying the fixed flag to maintain the desired orientation of such bodies). RANCH will not apply symmetry operations. Inter-domain/subunit contacts can be imposed to generate homo/hetero-oligomers and complexes by providing distance constraints.

Once the pool generation is completed the user can compute the theoretical scattering intensities of the models in the pool using FFMAKER. FFMAKER generates the input to be passed to the ensemble selection methods: a genetic algorithm (GAJOE) or a non-negative linear least-squares algorithm (NNLSJOE). The selection algorithm compares the averaged theoretical scattering intensity from n independent ensembles of conformations against the scattering data. The ensemble that best describes the experimental SAXS data is selected.
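The three stages described above can be sketched as a small Python helper that only assembles the command lines. It uses the program names and options listed in this manual; the function name, the default file names and the pool prefix are illustrative assumptions, and nothing is executed.

```python
import shlex


def build_eom_commands(assignment, sequence, model_files, data_file,
                       pool_prefix="pool/model_", n_models=10000,
                       intensities="intensities.csv",
                       sizes="size_statistics.csv"):
    """Return argv lists for RANCH, FFMAKER and GAJOE, in that order.

    File names and the prefix are placeholders (assumptions), not defaults
    of the programs themselves.
    """
    ranch = ["ranch", "--repetitions", str(n_models),
             "--prefix", pool_prefix, assignment, sequence]
    # FFMAKER takes the pool models and writes intensities plus pool statistics.
    ffmaker = ["ffmaker", "--output", intensities, "--pool", sizes,
               *model_files]
    # GAJOE selects an ensemble from the pool against the experimental data.
    gajoe = ["gajoe", "--intensity", intensities, "--size", sizes, data_file]
    return [ranch, ffmaker, gajoe]


if __name__ == "__main__":
    commands = build_eom_commands("assignment.txt", "sequence.fasta",
                                  ["pool/m0001.pdb"], "datafile.dat")
    for argv in commands:
        print(shlex.join(argv))  # review before running, e.g. via subprocess
```

Each list can be handed to `subprocess.run(argv, check=True)` once the models written by the first stage exist on disk.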

Metrics for quantitative assessment of system flexibility

The distributions of R_g and D_max generated by EOM (specifically the GAJOE module) can be represented as probability density functions. This allows for a quantitative estimation of the flexibility of the system using the concept of information entropy. For example, an ensemble/pool of structural parameters for a protein showing a broad Gaussian-like distribution (where it is assumed the disordered regions move randomly in solution) can be viewed as a carrier of high uncertainty. Conversely, an ensemble/pool of parameters for a protein with a narrow size distribution (a scenario where the particle exhibits limited flexibility) provides low uncertainty. Useful metrics for the quantitative description of uncertainty (flexibility) provided by EOM 2.0 are:

$R_{flex} = -H_b(S)$, where $H_b(S) = -\sum_{i=1}^{n} p(x_i)\,\log_b[p(x_i)]$, with $\log_b[p(x_i)] = 0$ if $p(x_i) = 0$. (For further detail refer to the EOM 2.0 paper.)

Metric for the degree of flexibility of the selected ensemble and that of the pool. Rflex = 100% for a fully flexible system, Rflex = 0% for a fully rigid system.
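As an illustration of the entropy term, the following Python sketch computes H_b(S) for a histogram of R_g or D_max values and expresses it as a percentage. It assumes that the logarithm base b equals the number of histogram bins, so that a uniform histogram gives 100% and a single occupied bin gives 0%; the binning, sign convention and normalisation used inside GAJOE are not specified here, and the function names are hypothetical.

```python
import math


def shannon_entropy(probabilities, base):
    """H_b(S) = -sum p(x_i) * log_b p(x_i), skipping bins with p(x_i) = 0."""
    total = 0.0
    for p in probabilities:
        if p > 0.0:
            total -= p * math.log(p, base)
    return total


def entropy_percent(histogram_counts):
    """Entropy of a histogram as a percentage of its maximum.

    Assumption: base b = number of bins, so the result lies in [0, 100].
    """
    n_bins = len(histogram_counts)
    total = sum(histogram_counts)
    if n_bins < 2 or total <= 0:
        raise ValueError("need at least two bins and a non-empty histogram")
    probabilities = [count / total for count in histogram_counts]
    return 100.0 * shannon_entropy(probabilities, base=n_bins)
```

A flat histogram such as `[5, 5, 5, 5]` gives the maximum value, while `[10, 0, 0, 0]` gives zero, mirroring the fully flexible and fully rigid limits described above.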

Rsigma = standard_deviation(ensemble) / standard_deviation(pool)

Metric for evaluation of the variance of the distribution of the selected ensemble relative to that of the pool, defined as the ratio of the standard deviation of the selected ensemble to that of the pool. Rsigma approaches 1.0 when the selected ensemble is as broadly distributed as the pool (a fully flexible system), and Rsigma < 1.0 when the selected ensemble is narrower than the pool (a system with restricted flexibility).
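The ratio can be sketched in a few lines of Python. This is a minimal illustration that assumes the population standard deviation is used for both distributions; the manual does not state which estimator GAJOE applies, and the function name is hypothetical.

```python
import statistics


def r_sigma(ensemble_values, pool_values):
    """Ratio of the ensemble standard deviation to the pool standard deviation.

    Assumption: population standard deviation (statistics.pstdev) for both.
    """
    pool_sd = statistics.pstdev(pool_values)
    if pool_sd == 0.0:
        raise ValueError("pool has zero spread; ratio is undefined")
    return statistics.pstdev(ensemble_values) / pool_sd
```

Passing the R_g values of the selected models and of the whole pool returns a value near 1.0 when both are equally broad and a value below 1.0 when the selection is narrower.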

For example, the following output from EOM/GAJOE facilitates assessment of the flexibility of the system:

Rflex (random) / Rsigma: ~ 66.6% (~ 91.2%) / 0.62

Rflex of the selected ensemble is ~67%, compared to ~91% for the pool, suggesting that this system is significantly less flexible than the pool. Rsigma is much less than 1.0, which is consistent with this: the selected ensemble is narrower than the pool, so the flexibility of the system is restricted.

N.B. If Rflex of the ensemble is significantly smaller than that of the pool, but Rsigma > 1.0, this may indicate a problem with the experimental data and further investigation is required.
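When many runs are processed, the summary line can be read programmatically. The sketch below parses a line in the format shown above and applies the N.B. rule as a simple flag; the regular expression is written against that one example line, and the names `FlexSummary`, `parse_flex_summary` and `looks_suspicious` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class FlexSummary(NamedTuple):
    rflex_ensemble: float  # Rflex of the selected ensemble, in percent
    rflex_pool: float      # Rflex of the pool ("random"), in percent
    rsigma: float


_SUMMARY = re.compile(
    r"Rflex \(random\) / Rsigma:\s*~\s*([\d.]+)%\s*\(~\s*([\d.]+)%\)\s*/\s*([\d.]+)"
)


def parse_flex_summary(line: str) -> Optional[FlexSummary]:
    """Return the three metrics from a GAJOE summary line, or None."""
    match = _SUMMARY.search(line)
    if match is None:
        return None
    return FlexSummary(*(float(group) for group in match.groups()))


def looks_suspicious(summary: FlexSummary) -> bool:
    """Flag the case from the N.B.: lower Rflex than the pool but Rsigma > 1.

    Assumption: any decrease in Rflex counts; no significance threshold is
    applied because the manual does not define one.
    """
    return summary.rflex_ensemble < summary.rflex_pool and summary.rsigma > 1.0
```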

RANCH (RANdom CHains)

Manual

Introduction

RANCH is a program that generates a pool of n independent models based upon sequence and structural information. For multi-domain proteins where high-resolution structures for individual domains are available, such files (e.g. CIF/PDB) can be used as rigid-body domains/subunits. For proteins expected to be intrinsically unfolded, no rigid bodies are used and random configurations of the alpha-carbon trace are created based upon the sequence. Crystallographic symmetry can be constructed by the user fixing input rigid bodies at the required positions; symmetry operations (P1, P2, ..., Pn) are not applied by RANCH. Contacts between rigid bodies and unstructured regions of the sequence can be introduced via a set of user-defined distance constraints.

Running RANCH

Usage:

$ ranch [OPTIONS] [COORDINATE FILE(S)]

RANCH accepts absolute as well as relative paths to the input SEQUENCE, ASSIGNMENT and atomic coordinate FILE(s). If no path is provided, RANCH reads from stdin. In all cases the coordinate input may be in either PDB or mmCIF format.

The OPTIONS known by RANCH are described in the next section.

RANCH input files

Command-Line Arguments and Options

RANCH requires the following command line arguments:

Argument	Description
`SEQUENCE`	Required. The amino-acid sequence of the protein/peptide(s) in FASTA format, in a single file. If no FILE path is specified, input is read from stdin.
`ASSIGNMENT`	Required. Domain assignments. The assignment of chain ID and residue numbering corresponding to structured and unstructured sequence. Here can be defined sequence regions corresponding to input CIF/PDB files and also user defined stretches of ideal strand and helix.
`FILE`	Optional. The atomic coordinate files of any input rigid bodies in PDB or mmCIF format.

RANCH recognizes the following command-line options.

Short option	Long option	Description
`-p`	`--prefix=<ARG>`	Output filename prefix (default: ranch)
	`--offset=<ARG>`	Output file numbering offset (default: 0)
	`--repetitions=<ARG>`	Number of output model files (CIF) (default: 10000)
	`--database=<FILE>`	Quasi-Ramachandran database file (dihedral map). NOTE: three designations in the ASSIGNMENT file select which dihedral angles are used: disordered (for intrinsically disordered and unstructured regions), denatured (for chemically denatured proteins/peptides) and compact (for compact structure).
	`--database-threshold=<ARG>`	Probabilities from the Quasi-Ramachandran dihedral map less than this threshold will be set to 0.0 (default: 0.0025)
	`--distance-constraints=<FILE>`	File listing distance constraints between specified sequence positions/amino acids
	`--seed=<INT>`	Set the seed for the random number generator.
`-v`	`--version`	Print version information and exit.
`-h`	`--help`	Print a summary of arguments and options and exit.

Runtime Output

RANCH does not have any runtime output.

RANCH Input Files

RANCH accepts atomic coordinate data in PDB or mmCIF format as input, and a single sequence file in FASTA format. This may be either a relative or absolute file path, or data will be read from stdin.

RANCH Output Files

RANCH writes atomic coordinate data in PDB or mmCIF format on output. By default the coordinate files are written to the current directory, or a directory may be specified as part of the prefix.

Examples

RANCH for generation of a pool of unstructured peptides

Use RANCH to generate a pool of 10000 models based only on amino-acid sequence sequence.fasta and write the models to the directory pool:

$ ranch --repetitions 10000 --prefix pool/pep_ assignment.txt sequence.fasta

Example of the FASTA sequence file format:

> A
DSHAKRHHGYKRKFHEKHHSHRGYADSHAKRHHGYKRKFHEKHHSHRGYA
AAAAAAAAAAARKFHEKHHSHRGYADSHAKRHHGYKRKFHEKHHSHRGYA

In this case a single chain (A) of 100 residues is defined. Additional chains can be appended to the file following this format.
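Before starting a long RANCH run it can be useful to confirm the chain names and lengths in the sequence file. The following sketch reads FASTA text into a dictionary of chains; it assumes that the text after `>` is the chain identifier, as in the example above, and the function name is hypothetical.

```python
def read_fasta(text):
    """Map each FASTA header (e.g. the chain ID) to its concatenated sequence."""
    chains = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].strip()
            chains[current] = ""
        elif current is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chains[current] += line
    return chains


if __name__ == "__main__":
    example = "> A\nDSHAKRHHGY\nKRKFHEKHHS\n"
    for chain, sequence in read_fasta(example).items():
        print(chain, len(sequence))
```

The reported length per chain should agree with the last residue number used for that chain in the assignment file.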

Example of the assignment file format:

A 1 100 disordered

In this case, for a single chain (A), coordinates are generated for residues 1 to 100 using the disordered Quasi-Ramachandran database for dihedral angles.

RANCH for generation of a pool of peptides with user defined regions of secondary structure

Use RANCH to generate a pool of 10000 models with stretches of ideal secondary structure:

$ ranch --repetitions 10000 --prefix pool/pep_ assignment_ss.txt sequence.fasta

Example of the assignment file format:

#assignment_ss.txt
A 1 10 disordered
A 11 22 helix
A 23 26 disordered
A 27 37 strand
A 38 100 disordered

In this case, for a single chain (A), coordinates for the unstructured residues 1-10, 23-26 and 38-100 are generated using the disordered Quasi-Ramachandran database for dihedral angles, while dihedral angles from the helical and beta-strand regions of the Quasi-Ramachandran database are used for residues 11-22 and 27-37, respectively.
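An assignment file with several segments is easy to mistype, so a quick consistency check may help. The sketch below parses lines of the form `chain first last type [fixed]` and reports residues that are assigned twice or not at all. It assumes that `#` starts a comment and that segments should tile each chain without gaps or overlaps; whether RANCH itself enforces this is not stated here, and all names are hypothetical.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    chain: str
    first: int
    last: int
    kind: str     # e.g. disordered, helix, strand, structure
    fixed: bool


def parse_assignment(text):
    """Parse assignment lines, ignoring blanks and '#' comments (assumed)."""
    segments = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if len(fields) < 4:
            raise ValueError(f"cannot parse assignment line: {raw!r}")
        segments.append(Segment(fields[0], int(fields[1]), int(fields[2]),
                                fields[3], "fixed" in fields[4:]))
    return segments


def find_problems(segments):
    """Return messages for overlapping or missing residue ranges per chain."""
    problems = []
    by_chain = {}
    for segment in segments:
        by_chain.setdefault(segment.chain, []).append(segment)
    for chain, chain_segments in by_chain.items():
        ordered = sorted(chain_segments, key=lambda s: s.first)
        for previous, following in zip(ordered, ordered[1:]):
            if following.first <= previous.last:
                end = min(previous.last, following.last)
                problems.append(
                    f"chain {chain}: residues {following.first}-{end} assigned twice")
            elif following.first > previous.last + 1:
                problems.append(
                    f"chain {chain}: residues {previous.last + 1}-"
                    f"{following.first - 1} unassigned")
    return problems
```

An empty list from `find_problems(parse_assignment(text))` means every listed chain is covered by contiguous, non-overlapping segments.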

RANCH for generation of a pool of protein homo-oligomers with user defined coordinates for several domains

Use RANCH to generate a pool of 10000 multi-chain models with an interface defined by input PDB/CIF orientation:

$ ranch --repetitions 10000 --prefix pool/complex_ assignment.txt sequence.fasta domain1.cif domain2.cif

Example of the assignment file format:

#assignment.txt
A   1 218 structure fixed
A 219 228 disordered
A 229 387 structure
B   1 218 structure fixed

In this case a multi-domain protein (chain A) forms a complex with another protein (chain B). Chain A contains a structured N-terminal region (1-218), an unstructured region (219-228) and a second structured region (229-387). The interface is defined by the user-supplied coordinate files (domain1.cif and domain2.cif), which are pre-oriented and fixed in position. RANCH will allow the unstructured region to undergo conformational sampling while the interface is maintained.

To define an interface with distance constraints rather than by pre-orientation of domain1.cif and domain2.cif, use the following assignment.txt and distances.txt files:

# assignment.txt
A   1 218 structure
A 219 228 disordered
A 229 387 structure
B   1 218 structure

# distances.txt
A 140 145 B 140 145 15

In the above case a 15 angstrom upper limit distance is defined between residues 140-145 of chain A and residues 140-145 of chain B.
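The constraint line can be read into a structured record for checking or reporting. This sketch assumes the column order implied by the example above (chain, first residue, last residue for each partner, then the upper distance limit in angstrom) and that `#` starts a comment; the names are hypothetical.

```python
from typing import NamedTuple


class DistanceConstraint(NamedTuple):
    chain_a: str
    first_a: int
    last_a: int
    chain_b: str
    first_b: int
    last_b: int
    max_distance: float  # upper limit, in angstrom


def parse_distance_constraints(text):
    """Parse constraint lines; blank lines and '#' comments are skipped."""
    constraints = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if len(fields) != 7:
            raise ValueError(f"expected 7 columns, got {len(fields)}: {raw!r}")
        constraints.append(DistanceConstraint(
            fields[0], int(fields[1]), int(fields[2]),
            fields[3], int(fields[4]), int(fields[5]),
            float(fields[6])))
    return constraints
```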

FFMAKER (Form-Factor Maker)

Manual

Introduction

The program FFMAKER is designed to facilitate the creation of form-factor files for input to the programs OLIGOMER and EOM (GAJOE/NNLSJOE). FFMAKER computes the scattering amplitudes from an input set of models (CIF/PDB) and optionally generates an intensities file and corresponding size distribution for EOM analysis (pool mode).

Running FFMAKER on a model pool

Usage:

$ ffmaker [OPTIONS] [COORDINATE FILE(S)]

FFMAKER accepts absolute as well as relative paths to the input atomic coordinate FILE(s) and generates an intensities file (txt, csv) and a size file containing the statistics of the pool (Rg, Dmax, Ca-Ca distance, volume). If no path is provided, FFMAKER reads from stdin. In all cases the coordinate input may be in either PDB or mmCIF format.

The OPTIONS known by FFMAKER are described in the next section.

FFMAKER input files

Command-Line Arguments and Options

FFMAKER requires the following command line arguments:

Argument	Description
`FILE`	Required. The atomic coordinate files of the input models in PDB or mmCIF format.

FFMAKER recognizes the following command-line options.

Short option	Long option	Description
	`--undat=<ARG>`	Units for .dat files (1=1/A, 2=1/nm)
	`--unout=<ARG>`	Units for .out files (1=1/A, 2=1/nm)
	`--smin=<ARG>`	Minimum value of S
	`--smax=<ARG>`	Maximum value of S
`-o`	`--output=<FILE>`	Output file name (computed intensities)
	`--explicit-hydrogens=<ARG>`	Use explicit hydrogens provided in the atomic structure file (default: no)
	`--lm=<ARG>`	Number of spherical harmonics to use (default: 20)
	`--ns=<INT>`	Number of data points (default: 101)
	`--pool=<FILE>`	Calculate and store pool statistics for EOM in size FILE
`-v`	`--version`	Print version information and exit.
`-h`	`--help`	Print a summary of arguments and options and exit.

Runtime Output

FFMAKER does not have any runtime output.

FFMAKER Input Files

FFMAKER accepts atomic coordinate data in PDB or mmCIF format as input. This may be either a relative or absolute file path, or data will be read from stdin.

FFMAKER Output Files

FFMAKER writes a tabular file of computed scattering intensities from the input atomic models. In pool mode FFMAKER also writes a tabular size file of statistics from the input atomic models (Rg, Dmax, end-to-end Ca-Ca distance and model volume). By default the intensities and size/statistics files are written to the current directory, or a directory may be specified as part of the `--output` and/or `--pool` options.

Examples

FFMAKER for generation of input to the EOM selection programs (GAJOE/NNLSJOE)

Use FFMAKER to compute the intensities and size/statistic files from a pre-generated (in RANCH or other software) pool of models:

$ ffmaker --output intensities.csv --pool size_statistics.csv pool/*.pdb

Example of the intensities file format:

0.000000e+00, 3.840161e+05, 3.954654e+05, 4.285681e+05, ...
5.000000e-03, 3.831351e+05, 3.947898e+05, 4.281577e+05, ...
1.000000e-02, 3.805064e+05, 3.927709e+05, 4.269289e+05, ...
1.500000e-02, 3.761729e+05, 3.894313e+05, 4.248892e+05, ...
...           ...           ...          ...           ...

The first column is the momentum transfer, s. Each subsequent column holds the intensities of one model at each s value.
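For inspection outside the EOM programs, a file in this layout can be loaded with the Python standard library. The sketch below assumes comma-separated numeric rows with no header line, as in the example above; the function name is hypothetical.

```python
import csv
import io


def read_intensities(handle):
    """Return (s_values, curves), where curves[i] is the intensity list of model i."""
    s_values = []
    curves = []
    for row in csv.reader(handle):
        cells = [cell.strip() for cell in row if cell.strip()]
        if not cells:
            continue
        values = [float(cell) for cell in cells]
        if not curves:
            # One list per model column, created on the first data row.
            curves = [[] for _ in values[1:]]
        elif len(values) - 1 != len(curves):
            raise ValueError("rows have differing numbers of columns")
        s_values.append(values[0])
        for curve, intensity in zip(curves, values[1:]):
            curve.append(intensity)
    return s_values, curves


if __name__ == "__main__":
    demo = io.StringIO("0.0, 10.0, 20.0\n0.1, 8.0,16.0\n")
    s, curves = read_intensities(demo)
    print(len(s), "points,", len(curves), "models")
```

For a real file, pass an open handle, for example `open("intensities.csv", newline="")`.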

Example of the size/statistics file format:

    16.3    52.1    47.2   0.262E+04       1  m0001.pdb
    13.8    41.4    32.4   0.262E+04       1  m0002.pdb
    10.3    33.4    33.4   0.262E+04       1  m0003.pdb
    12.2    37.1    34.1   0.262E+04       1  m0004.pdb
    ...     ...     ...    ...            ... ...

Columns correspond to Rg, Dmax, end-to-end Ca-Ca distance, model volume, index and model file name.
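The size/statistics file can be summarised in the same way. This sketch assumes six whitespace-separated columns in the order given above and computes the mean R_g of the pool; the record and function names are hypothetical.

```python
from typing import NamedTuple


class PoolEntry(NamedTuple):
    rg: float
    dmax: float
    end_to_end: float  # end-to-end Ca-Ca distance
    volume: float
    index: int
    filename: str


def parse_size_file(text):
    """Parse size/statistics rows; blank lines are skipped."""
    entries = []
    for raw in text.splitlines():
        fields = raw.split()
        if not fields:
            continue
        if len(fields) != 6:
            raise ValueError(f"expected 6 columns, got {len(fields)}: {raw!r}")
        entries.append(PoolEntry(float(fields[0]), float(fields[1]),
                                 float(fields[2]), float(fields[3]),
                                 int(fields[4]), fields[5]))
    return entries


def mean_rg(entries):
    """Arithmetic mean of R_g over all pool entries."""
    return sum(entry.rg for entry in entries) / len(entries)
```

Applied to the four example rows above, `mean_rg` averages 16.3, 13.8, 10.3 and 12.2.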

GAJOE (Genetic Algorithm Judging Optimisation of Ensembles)

Manual

Introduction

GAJOE is a program that uses a genetic algorithm to select an ensemble of models whose combined theoretical scattering intensity best describes the experimental SAXS data. GAJOE is run using the tabular files of intensities and size/statistics generated by FFMAKER. Thus the input models describing a pool of conformations may be derived from e.g. RANCH or any other program that provides CIF/PDB format.
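The quantity that a selection algorithm of this kind scores can be sketched as follows: average the intensities of the chosen pool members, scale the average to the data, and compute a reduced chi-squared. This is a generic illustration under stated assumptions (a single least-squares scale factor, no constant subtraction, division by N - 1); it is not taken from the GAJOE source, and the function names are hypothetical.

```python
def ensemble_average(curves, members):
    """Point-wise mean intensity of the selected pool members.

    `members` is a list of indices into `curves`; an index may appear more
    than once if repetition is allowed.
    """
    n_points = len(curves[members[0]])
    return [sum(curves[m][k] for m in members) / len(members)
            for k in range(n_points)]


def scaled_chi_squared(i_exp, sigma, i_calc):
    """Return (scale, chi2) for the best least-squares scaling of i_calc.

    Assumption: chi2 = 1/(N-1) * sum(((I_exp - c*I_calc)/sigma)^2).
    """
    numerator = sum(e * c / (s * s) for e, c, s in zip(i_exp, i_calc, sigma))
    denominator = sum(c * c / (s * s) for c, s in zip(i_calc, sigma))
    scale = numerator / denominator
    chi2 = sum(((e - scale * c) / s) ** 2
               for e, c, s in zip(i_exp, i_calc, sigma)) / (len(i_exp) - 1)
    return scale, chi2
```

A genetic algorithm would evaluate this score for many candidate member lists and keep those with the lowest value.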

Running GAJOE

Usage:

$ gajoe [OPTIONS] [DATAFILE]

The OPTIONS known by GAJOE are described in the next section.

GAJOE input files

Command-Line Arguments and Options

GAJOE requires the following command line arguments:

Argument	Description
`DATAFILE`	Required. Scattering data file in 3 column format (s, intensities, errors).

GAJOE recognizes the following command-line options.

Short option	Long option	Description
`-i`	`--intensity=<FILE>`	pool of theoretical intensity curves (dat,csv,txt)
`-s`	`--size=<FILE>`	pool size/statistics list file (dat,csv,txt)
`-g`	`--generation=<N>`	number of generations (default: 1000)
`-e`	`--ensemble=<N>`	number of ensembles per generation (default: 50)
`-z`	`--maximum=<N>`	maximum number of conformers in selected ensemble (default: 50)
`-a`	`--minimum=<N>`	minimum number of conformers in selected ensemble (default: 1)
`-c`	`--no-constant`	Disable constant subtraction (default: enabled)
`-o`	`--no-repeated`	Disable curve repetition in the ensemble (default: disabled)
`-w`	`--work-files=<ARG>`	Enable writing workfiles to track the GA runs (default: disabled)
`-t`	`--times=<INT>`	number of times to repeat the search (default: 100)
`-m`	`--multipool=<N>`	this option is deprecated. Users are advised to use FFMAKER to generate pool files that include intensities of all models
	`--seed`	Set the seed for the random number generator.
`-v`	`--version`	Print version information and exit.
`-h`	`--help`	Print a summary of arguments and options and exit.

Runtime Output

On runtime, the following lines of output will be written to standard output:

*******  ------------------------------------------------------  *******
*******     GAJOE - version 3.0                                  *******
*******     Copyright (c) ATSAS Team                             *******
*******     EMBL, Hamburg Outstation, 2007 - 2022                *******
*******                                                          *******
*******     For doubts/questions please visit SAXIER forum:      *******
*******     http://www.saxier.org/forum/viewforum.php?f=10       *******
*******                                                          *******
*******     In case of bugs please refer to:                     *******
*******     H. Mertens, D.I. Svergun, EMBL BioSAXS group         *******
*******     atsas@embl-hamburg.de                                *******
*******  ------------------------------------------------------  *******
 Experimental data file name ............................ : datafile.dat
 Intensities file name .................................. : intensities.csv
 Number of cycles of the genetic algorithm to run (min. 1): 100
 Random number generator has not been initialised; using current time
 Random seed is:    861619947286592639
 Curve: datafile.dat - Loading values and configuration ...
   Number of theoretical curves :       10000
 Starting the Genetic Algorithm ...
 CYCLE:   1
    Chi^2:  0.985
    No. unique models: 50
    Ensemble size: 50
...
...
...
 CYCLE: 100
    Chi^2:  0.991
    Ensemble size: 50
 ... finished the Genetic Algorithm!
 Rflex (random) / Rsigma: ~ 66.6% (~ 91.2%) / 0.62
 Re-making selected structures ...
[ 20%] [ 40%] [ 60%] [ 80%] [100%]  ... completed

Examples

GAJOE for ensemble selection from a pool of intensities generated by FFMAKER

Use GAJOE to perform selection from a pool of models:

$ gajoe --intensity intensities.csv --size size_statistics.csv --no-repeated --maximum 50 --minimum 50 datafile.dat

This command runs GAJOE 100 times (the default) against the file datafile.dat, using the intensities file intensities.csv and the size/statistics file size_statistics.csv. No models are repeated and the ensemble size is fixed at 50 members.

GAJOE output files

Once completed, GAJOE creates a subfolder in the working directory containing all files resulting from the computation. The subfolders are named in the form GAnum where num is the sequential number for each independent run (e.g. GA001, GA002 etc.). In each subfolder the following files/folders are written:

File Name	Description
`GA00n/curve_m/`	Folder containing the result of the genetic algorithm for each experimental data set m (curve_1, curve_2,... curve_m)
`GA00n/curve_m/logFile_00n_m.log`	Log file. File containing the configuration for the genetic algorithm for the experimental data set m.
`GA00n/curve_m/profiles_00n_m.fit`	Fit file. File containing the fit for the best ensemble for the curve m. It can be opened directly in SASPLOT/PRIMUS. Detailed information (e.g. the discrepancy, CHI^2) is contained in the header of the file and can be viewed with a text editor.
`GA00n/curve_m/Rg_distr_00n_m.dat`	R_g distribution file. File containing the R_g distribution of the selected ensemble for the curve m and the R_g distribution for the pool. The file can be directly opened using SASPLOT/PRIMUS and selecting View > AbsY:X (linear scale). Detailed information on the average R_g values of the pool and the selected structures are listed in the file header and can be viewed with a text editor.
`GA00n/curve_m/Size_distr_00n_m.dat`	Size distribution file. File containing the size distribution of maximum model dimensions (D_max) of the selected ensemble for the curve m, and the size distribution for the pool. The file can be directly opened using SASPLOT/PRIMUS and selecting View > AbsY:X (linear scale). Detailed information on the average D_max values of the pool and the selected structures are listed in the file header and can be viewed with a text editor.
`GA00n/curve_m/CaCa_distr_00n_m.dat`	End-to-end CA-CA size distribution file.
`GA00n/curve_m/Volume_distr_00n_m.dat`	Model volume distribution file.
`GA00n/curve_m/pdbs`	Additionally, a subfolder named pdbs is created, containing the models from the selected ensemble produced in the cycle with the lowest CHI^2 value for the curve m. Please note that the PDB files in this folder are NOT the structure of the flexible system but serve as descriptors of the behaviour of the system in solution and are used to generate the R_g/D_max distributions and flexibility metrics.

GAJOE on short peptides

Warning: When running GAJOE on short peptides it is recommended to use a fixed-size ensemble with 50 curves per ensemble and to disallow repetitions. See paper.

NNLSJOE (Non-Negative Linear Least-Squares algorithm Judging Optimisation of Ensembles)

Manual

Introduction

NNLSJOE is an alternative program for the selection of an ensemble of models whose combined theoretical scattering intensity best describes the experimental SAXS data. NNLSJOE is run using the tabular files of intensities and size/statistics generated by FFMAKER. Thus the input models describing a pool of conformations may be derived from e.g. RANCH or any other program that provides CIF/PDB format.
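The idea behind a non-negative least-squares selection can be illustrated with a small solver: find weights w >= 0 that minimise the squared difference between the weighted sum of pool curves and the data. The sketch below uses simple coordinate descent, which is swapped in here for brevity and is not claimed to be the solver used by NNLSJOE; error weighting is omitted and the function name is hypothetical.

```python
def nnls_coordinate_descent(columns, target, sweeps=200):
    """Minimise ||sum_j w_j * columns[j] - target||^2 subject to w_j >= 0.

    `columns[j]` is the intensity curve of pool model j. Each sweep updates
    one weight at a time and clips it at zero.
    """
    weights = [0.0] * len(columns)
    residual = list(target)  # target minus the current weighted sum
    norms = [sum(v * v for v in column) for column in columns]
    for _ in range(sweeps):
        for j, column in enumerate(columns):
            if norms[j] == 0.0:
                continue
            step = sum(c * r for c, r in zip(column, residual)) / norms[j]
            new_weight = max(0.0, weights[j] + step)
            delta = new_weight - weights[j]
            if delta != 0.0:
                residual = [r - delta * c for r, c in zip(residual, column)]
                weights[j] = new_weight
    return weights
```

Models that end with a weight of zero are excluded from the selection, which is how an approach of this type can arrive at an ensemble size without it being fixed in advance.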

Running NNLSJOE

Usage:

$ nnlsjoe [OPTIONS] [DATAFILE]

The OPTIONS known by NNLSJOE are described in the next section.

NNLSJOE input files

Command-Line Arguments and Options

NNLSJOE requires the following command line arguments:

Argument	Description
`DATAFILE`	Required. Scattering data file in 3 column format (s, intensities, errors).

NNLSJOE recognizes the following command-line options.

Short option	Long option	Description
`-i`	`--intensity=<FILE>`	pool of theoretical intensity curves (dat,csv,txt)
`-s`	`--size=<FILE>`	pool size/statistics list file (dat,csv,txt)
	`--poolsize=<N>`	number of form-factors to use.
	`--fit=<ARG>`	name of output fit file (.dat, .csv, .txt)
`-v`	`--version`	Print version information and exit.
`-h`	`--help`	Print a summary of arguments and options and exit.

Examples

NNLSJOE for ensemble selection from a pool of intensities generated by FFMAKER

Use NNLSJOE to perform selection from a pool of models:

$ nnlsjoe --intensity intensities.csv --size size_statistics.csv datafile.dat

This command runs NNLSJOE against the file datafile.dat, using the intensities file intensities.csv and the size/statistics file size_statistics.csv. All models and repeats are considered and the optimum ensemble size is determined.

NNLSJOE output files

Once completed, NNLSJOE writes a file describing the fit of the selected intensities to the experimental data, and reports the models selected and statistics to stdout.

File Name	Description
`user defined *.fit filename`	File containing the fit of the selected intensities to the experimental data.
