EMBL Hamburg Biological
Small Angle Scattering
BioSAXS

EOM 2.0: Ensemble Optimization Method 2.0


Written by G. Tria & D.I. Svergun.
Post all your questions about EOM to the ATSAS Forum.

© ATSAS Team, 2001-2014

This is the manual for the program suite EOM 2.0 (Ensemble Optimisation Method), which seeks to describe experimental SAXS data using an ensemble representation of atomic models.

The following sections briefly describe the program EOM 2.0 and detail the steps required to run the program. File input and output are explained. If you use results from EOM in your own publication, please cite:

Tria, G., Mertens, H. D. T., Kachala, M. & Svergun, D. I. (2015) Advanced ensemble modelling of flexible macromolecules using X-ray solution scattering. IUCrJ 2, 207-217.

Bernado, P., Mylonas, E., Petoukhov, M.V., Blackledge, M., Svergun, D.I. (2007) Structural Characterization of Flexible Proteins Using Small-Angle X-ray Scattering. J. Am. Chem. Soc. 129(17), 5656-5664.

EOM 2.0

Manual

Introduction

EOM 2.0 is a program that fits an averaged theoretical scattering intensity, derived from an ensemble of conformations, to experimental SAXS data. A pool of n independent models based upon sequence and structural information is generated first. For multi-domain proteins where high-resolution structures of individual subunits/domains are available, these structures, and distance/orientation information derived from them, can be used as rigid bodies and/or constraints in EOM model generation. For proteins expected to be intrinsically unfolded, no rigid bodies are required as input, and completely random configurations of the alpha-carbon trace are created based upon the sequence alone.

Crystallographic symmetry and inter-domain/subunit contacts can be imposed in the following ways:

(1) Using an oligomeric assembly of high-resolution structures (e.g. downloaded from the PDB or using the PISA server) to define the required interface as a single rigid body (e.g. a homodimer with P2 symmetry). N.B. The option to FIX the coordinates should be used in this case.

(2) Specifying a potential oligomerization interface between the rigid-bodies using distance constraints.

Once pool generation is complete, a genetic algorithm is used to select an ensemble. The genetic algorithm compares the averaged theoretical scattering intensity from n independent ensembles of conformations against the scattering data. The ensemble that best describes the experimental SAXS data is selected.
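The comparison between an averaged ensemble curve and the data can be illustrated with a small Python sketch. This is not EOM's implementation: the function name `ensemble_chi2`, the least-squares scale factor and the normalisation by the number of points minus one are all assumptions chosen for illustration, and all curves are assumed to be sampled on the same s grid.

```python
def ensemble_chi2(exp_i, exp_sigma, model_curves):
    """Chi^2 between experimental data and the averaged curve of one ensemble.

    Illustrative only: the scaling and normalisation are assumptions,
    not necessarily what EOM/GAJOE does internally.
    """
    n_points = len(exp_i)
    # Average the theoretical intensities of the ensemble point by point.
    avg = [sum(curve[k] for curve in model_curves) / len(model_curves)
           for k in range(n_points)]
    # Least-squares scale factor between the averaged curve and the data.
    num = sum(e * a / s ** 2 for e, a, s in zip(exp_i, avg, exp_sigma))
    den = sum(a * a / s ** 2 for a, s in zip(avg, exp_sigma))
    scale = num / den
    # Sum of squared, error-weighted residuals, divided by (points - 1).
    residuals = sum(((e - scale * a) / s) ** 2
                    for e, a, s in zip(exp_i, avg, exp_sigma))
    return residuals / (n_points - 1)
```

A genetic algorithm would evaluate a score of this kind for many candidate ensembles and keep the best-scoring ones for the next generation.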

Running EOM 2.0

Usage:

$ eom

Interactive Configuration (Dialog mode)

Settings can be configured interactively as shown in the table below:

EOM interactive prompt:

Screen Text   Option (short)   Option (long)   Default   Description
Core symmetry? Select one of: (1) p1, (2) p2, (3) p3, … , (22) p42, (23) p52, (24) p62, … , (31) p222 -s=<VALUE> --symmetry=<VALUE> p1 Supported symmetries are: p1, p2, …, p19 (nineteen-fold), p22, p32, p42, p52, p62, …, p122, p222. The n-fold axis is typically Z, if there is in addition a two-fold axis it coincides with Y. N.B. If a symmetric core is expected the multichain model defining this core must also be FIXED (i.e. enter "Y" to the question: Fix the subunit in original position/orientation?)
Overall Symmetry: [S]ymmetry, [A]symmetry or [M]ix (default) -y=<VALUE> --overall-symmetry=<VALUE> Mix Overall symmetry of the particles in the pool. [S]ymmetry generates only symmetric multichain particles. [A]symmetry generates multichain particles with a symmetric core structure but leaves the remaining structure asymmetric. [M]ix generates a pool of both symmetric and asymmetric multichain particles. NOTE: If P1 Core Symmetry is selected, this option will not appear.
Percentage of Symmetric structures 50 Percentage of symmetric structures that will be in the pool. This option requires Mix to be selected for the Overall Symmetry. If Mix is not selected for Overall Symmetry, this option will not appear.
Chain type to generate? Select one of: (c) compact-chain, (n) native-like, (r) random-coil -c=<VALUE> --chain-type=<VALUE> r (Random) CA dihedral angle distribution to use for modelling the missing regions. Random uses a CA distribution consistent with chemically denatured proteins, while native-like uses a CA distribution consistent with disordered proteins. On average, Random models will be more extended than native-like ones. The compact option uses a CA distribution consistent with disordered proteins, but also forces the reconstructed linkers to be more compact.
Sequence file name UNKNOWN Filename containing the amino acid sequence. Standard amino acids should be UPPERCASE for plain text (*.seq) and FASTA (*.fasta) file input. For the non-standard residues PTR, HSD, SEP, TPO, MSE the lowercase letters p, h, s, t, m should be used, respectively. N.B.: the filename must include an extension (e.g., seq.txt, seq.seq, seq.fasta)
Number of domains 0 Number of high-resolution PDB files to use as rigid bodies. Enter "0" if the protein is expected to be intrinsically unfolded.
PDB file name for domain n -x=<FILENAME> --mx=<FILENAME> UNKNOWN Filename of the domain/subunit/pdb file that you wish to define as a rigid body. N.B.: The filename must contain the relevant extension (e.g. domainA.pdb).
Fix the subunit in original position/orientation? -f=<yes|no> --fixed=<yes|no> No Select [Y]es if you wish to maintain the original coordinate position of this rigid body for each model generated, [N]o otherwise.
Does this PDB contain multiple chains (ie. a pre-defined interface)? -o=<yes|no> --oligomer=<yes|no> No Type [Y]es if the input PDB file contains a multichain interface (symmetric or asymmetric) that will be used as the structural core of each model. Type [N]o if the PDB file defines a monomeric subunit/domain. N.B.: The number of chains in the multichain PDB file must also match the symmetry definition from the first question (i.e. for P2 symmetry a two-chain PDB is required as input, the final models will be dimers). Note also that a multichain input PDB will count as a single rigid body for the Number of Domains question.
PDB file for D/RNA bound to the subunit (CR for none) No Filename of the additional PDB file containing the DNA/RNA bound by the subunit (e.g. dna.pdb).
Contacts file name, define distance constraints/contacts between subunits (CR for none) -u=<PATH> --contact-filename=<PATH> UNKNOWN A contact region or interface between subunits/domains can be defined using a contacts file (see below) specifying regions in close proximity and to generate symmetry mates (requires that a non P1 symmetry has been selected). If no contact file or multichain input is provided, a contact interface is generated by default and covers the whole range of the particle.
Total number of models to generate (min.10) -q=<N> --quantity=<N> 10000 Number of independent models in the pool.
Files enumeration starting from? -e=<N> --enum=<N> 1 Enumeration starting point for naming the files in the pool (i.e., 00001eom.pdb).
Save all generated PDB files? -w=<PATH> --save=<PATH> No If Yes each model of the pool will be saved on the disk (not required for regular use). If [N]o then only the PDB files of the selected ensemble will be saved (default).
Directory to save the generated PDB files? . Path to the folder where the models are to be saved.
Suffix of generated pdb files? -f=<SUFFIX> --filesuffix=<SUFFIX> eom Suffix for each model/PDB file created (e.g., 00001eom.pdb).
Calculate Intensity? -i --no-intensity Yes If [Y]es a CRYSOL routine will be run for each model generated and the theoretical scattering intensity as well as Rg and Dmax will be computed.
Number of harmonics (min. 10, max. 50) 15 Maximum order of harmonics (min = 10, max =50). Defines the resolution of the calculated curve. Default value should be sufficient in most of the cases. For large particles high orders could improve the results, but more CPU time is required. Fractional values are not allowed.
Maximum s value, sm (min. 0.1, max. 0.5) 0.5 Maximum scattering vector, sm (max = 0.5 Å-1) used for defining the upper limit of the s-range for theoretical curve calculation and/or fitting.
Number of points (min 10, max. 201) 51 Number of points in theoretical curve (min = 10, max = 201). Default value should be sufficient for most cases. A larger number of points is recommended for very large particles (e.g. MDa size) at a higher cost of CPU time. Fractional values are not allowed.
Run the Genetic Algorithm? Yes Type Yes if you wish to run the genetic algorithm on the generated pool of models. Type No for pool generation only.
How many times (min. 1)? -t=<N> --times=<N> 1 Number of times to run the genetic algorithm. Each run will be independent.
Number of experimental curves to fit? -d=<FILE> --data=<FILE> 1 Number of experimental curves you wish to use for ensemble selection. An independent run will be conducted on each experimental data set. The results will be written to a separate sub-directory (e.g. curve_1, curve_2 ...).
Experimental data filename (curve n) UNKNOWN Filename of the experimental scattering data. N.B.: the filename must contain the extension (e.g., data.dat)
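The sequence-file rules in the table above (uppercase for standard amino acids, lowercase p, h, s, t, m for PTR, HSD, SEP, TPO, MSE) can be checked before running EOM. The following Python sketch is a hypothetical helper, not part of ATSAS; the 20-letter set of standard residues is an assumption, and FASTA header lines or whitespace would need to be stripped beforehand.

```python
# One-letter codes assumed to be the 20 standard amino acids.
STANDARD = set("ACDEFGHIKLMNPQRSTVWY")
# Lowercase codes for the non-standard residues listed in the table above.
NON_STANDARD = {"p": "PTR", "h": "HSD", "s": "SEP", "t": "TPO", "m": "MSE"}


def check_sequence(seq):
    """Return (position, character) pairs that do not follow the stated rules.

    Positions are 1-based. An empty list means no problem was found.
    """
    return [(i + 1, ch) for i, ch in enumerate(seq)
            if ch not in STANDARD and ch not in NON_STANDARD]
```

For example, `check_sequence("MKxA")` reports the lowercase `x` at position 3, while a sequence containing a lowercase `s` (SEP) passes.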

Contacts

To define an interface for a subunit/domain interaction, and if a multichain PDB file is not available where such an interface is defined, an optional contact conditions file can be used. The format of the contacts file is as follows:

 dist 8.0
 1 25 28

This condition defines a distance of 8 Å between a symmetry related range of residues (i.e. an interface formed by residues 25-28 of the current body - in this case domain/subunit 1 - and residues 25-28 of a generated symmetry mate). The integer 1 means the first domain/subunit specified in the EOM dialog. N.B. non P1 symmetry must be defined.

 dist 8.0
 1 25 28
 1 40 47

This condition defines a distance of 8 Å between two alternative symmetry related residue ranges. In this case, either the interface defined by residues 25 to 28 or the interface defined by residues 40 to 47 and their respective symmetry mates will be used (randomly selected during pool generation). The actual residue within the range to use as the contact point is random.

 dist 8.0
 1 25 28
 2 88 94

This condition defines two independent sets of distances of 8 Å, the first between domain/subunit 1 and its corresponding symmetry mate (contacts involving residues 25 to 28), and the second set between domain/subunit 2 and its corresponding symmetry mate (contacts involving residues 88 to 94). N.B. This does not define a set of contacts between domain/subunit 1 and domain/subunit 2. Non P1 symmetry must be defined.
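A contacts file in the layout shown above can also be generated programmatically. The Python sketch below is a hypothetical helper that reproduces the three examples; the leading space on each line mirrors the examples and may not be required by EOM.

```python
def format_contacts(distance, ranges):
    """Build contacts-file text from a distance and (body, first, last) tuples.

    `body` is the domain/subunit index as entered in the EOM dialog;
    `first` and `last` delimit the residue range of the interface.
    """
    lines = [f" dist {distance:.1f}"]
    for body, first, last in ranges:
        if first > last:
            raise ValueError(f"range {first}-{last} is reversed")
        lines.append(f" {body} {first} {last}")
    return "\n".join(lines) + "\n"
```

Calling `format_contacts(8.0, [(1, 25, 28), (2, 88, 94)])` yields the text of the third example; the result can be written to a file and supplied at the contacts file prompt.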

Examples

Interactive Configuration

EOM 2.0 input files and Runtime Output

EOM 2.0 needs the following input files:

  • Sequence: e.g. sequence.seq - amino acid sequence;
  • Domains/Subunits: e.g. dom_n.pdb - atomic coordinates of the folded domain(s), if present;
  • Data: e.g. data.dat - experimental data.

In this example the following input is available: protein sequence (sequence.seq); PDB files of two known domains, dom1.pdb as a dimer (i.e. a multichain PDB file defining an interface), dom2.pdb as a monomer; and the experimental SAXS data (data.dat). All the files for this example are included in the documentation directory of the ATSAS installation package. The two domains are joined by a disordered linker of 10 residues. EOM 2.0 will be run using the following set up in order to generate a pool of 10000 independent models with random conformations, and enforcing crystallographic point symmetry p2 for the core of each model (the symmetry axis in this case is defined by the orientation of the chains in the dom1.pdb file). Following pool generation, a genetic algorithm will select the ensemble of scattering profiles computed from the pool that best fits the data.

Typing:

$ eom

on the command line brings up the interactive dialog:

*******  ------------------------------------------------------  *******
*******     Advanced Ensemble Optimization Method - EOM 2.0      *******
*******     Copyright (c) ATSAS Team                             *******
*******     EMBL, Hamburg Outstation, 2007 - 2014                *******
*******                                                          *******
*******     For doubts/questions please visit SAXIER forum:      *******
*******     http://www.saxier.org/forum/viewtopic.php?f=10       *******
*******                                                          *******
*******     In case of bugs please refer to:                     *******
*******     G. Tria, D.I. Svergun, EMBL BioSAXS group            *******
*******     atsas@embl-hamburg.de                                *******
*******  ------------------------------------------------------  *******
Core symmetry? Select one of: (1) p1, (2) p2, (3) p3, (4) p4, (5) p5, 
(6) p6, (7) p7, (8) p8, (9) p9, (10) p10, (11) p11, (12) p12, (13) 
p13, (14) p14, (15) p15, (16) p16, (17) p17, (18) p18, (19) p19, (20) 
p22, (21) p32, (22) p42, (23) p52, (24) p62, (25) p72, (26) p82, (27) 
p92, (28) p102, (29) p112, (30) p122, (31) p222 (default: p1) ........ : p2
Overall symmetry? Select one of: (s) Symmetry, (a) Asymmetry, (m) Mix 
(default: Mix) ....................................................... : s
Chain type to generate? Select one of: (c) compact-chain, (n) 
native-like, (r) random-coil (default: random-coil) .................. : 
Sequence file name? .................................................. : sequence.seq
Number of domains? (default: 0) ...................................... : 2
------ Domain  1 ------
PDB file name for domain ............................................. : dom1.pdb
Does this PDB contain multiple chains? (default: no) ................. : y
Keep the subunit in the original PDB coordinates? (default: no) ...... : y
PDB file for D/RNA bound to the subunit (CR for none) ................ : 
------ Domain  2 ------
PDB file name for domain ............................................. : dom2.pdb
Does this PDB contain multiple chains? (default: no) ................. : n
Keep the subunit in the original PDB coordinates? (default: no) ...... : n
PDB file for D/RNA bound to the subunit (CR for none) ................ : 
Total number of models to generate (default: 10000) .................. : 
File enumeration starting from? (default: 1) ......................... : 
Save the generated PDB files? (default: no) .......................... : n
Suffix of generated pdb files? (default: eom) ........................ : 
Calculate Intensity? (default: yes) .................................. : y
Number of harmonics (min. 10, max. 50)? (default: 15) ................ : 
Maximum s value (min. 0.1, max. 0.5)? (default: 0.500) ............... : 
Number of points (min. 10, max. 201)? (default: 51) .................. : 
Run the Genetic Algorithm? (default: yes) ............................ : y
How many times (min. 1)? (default: 1) ................................ : 
Number of experimental curves to fit? (default: 1) ................... : 
Experimental data file name  1? ...................................... : data.dat
 Loading values and configuration ...
 Number of residues per chain:          387
 ... starts making models ...
[  1%] >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[ 10%] >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[ 20%] >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
[ 30%] >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
[ 40%] >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> 
[ 50%] >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[ 60%] >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[ 70%] >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[ 80%] >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[ 90%] >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[100%] 
 ... completed making models
 Running GAJOE:            1
 Curve: data.dat - Loading values and configuration ...
   Number of theoretical curves :         10000
 Starting the Genetic Algorithm ...
 CYCLE:   1
    Chi^2:  0.389
    Ensemble size:  5
 ...
 ...
 ... 
 CYCLE: 100
    Chi^2:  0.388
    Ensemble size:  4
 ... finished the Genetic Algorithm!
 Rflex (random) / Rsigma: ~ 66.6% (~ 91.2%) / 0.62
 Re-making selected structures ...
[ 20%] [ 40%] [ 60%] [ 80%] [100%]  ... completed!

The output of the EOM 2.0 run is by default saved in the current working directory and consists of the following files:

  • Rancheom.log: log file. File containing the configuration for the run.
  • juneom.int: intensity file. File containing the theoretical intensity computed for each model in the pool.
  • Size_listeom.txt: size file. File containing the Rg and Dmax parameters for each model in the pool.
  • eomTemp_eom.pdb: temporary model file generated by RANCH (is constantly overwritten during RANCH execution)
  • GA00n/curve_m/: folder containing the result of the genetic algorithm. An incrementing index n is used for each GAJOE run conducted in the current directory. The index m is assigned for each experimental data file used.
  • GA00n/curve_m/logFile_00n_m.log: log file. File containing the configuration for the genetic algorithm.
  • GA00n/curve_m/profiles_00n_m.fit: fit file. File containing the fit for the best ensemble. Can be directly opened by SASPLOT or PRIMUS. If opened with a text editor more information about the selected ensemble can be visualized.
  • GA00n/curve_m/Rg_distr_00n_m.dat: Rg distribution file. File containing the Rg distribution for the selected models compared to the Rg distribution for the whole pool. It can be opened using SASPLOT or PRIMUS and selecting View > AbsY:X to display them on a linear scale. The average Rg values of the pool and the selected structures are contained in the file header.
  • GA00n/curve_m/Size_distr_00n_m.dat: Size distribution file. File containing the Size distribution for the selected models compared to the Size distribution for the whole pool. It can be opened using SASPLOT or PRIMUS and selecting View > AbsY:X for display on a linear scale. The average Dmax values of the pool and the selected structures are contained in the file header.
  • GA00n/curve_m/pdbs: folder containing the PDB files of the models composing the best fitting ensemble. Please note that the PDB files of the selected models are not "the structure of the flexible system" but only models that suggest the behaviour of the system in solution.
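When post-processing several runs, it can help to build the expected result paths from the run index n and the curve index m. The Python sketch below is a hypothetical helper; it assumes that "GA00n" means a three-digit, zero-padded run number, which is inferred from the names listed above rather than documented.

```python
from pathlib import PurePosixPath


def ga_output_paths(run, curve):
    """Expected GAJOE result files for GA run `run` and data set `curve`.

    Assumes three-digit zero padding of the run index (GA001, GA002, ...).
    """
    folder = PurePosixPath(f"GA{run:03d}") / f"curve_{curve}"
    tag = f"{run:03d}_{curve}"
    return {
        "log": folder / f"logFile_{tag}.log",
        "fit": folder / f"profiles_{tag}.fit",
        "rg": folder / f"Rg_distr_{tag}.dat",
        "size": folder / f"Size_distr_{tag}.dat",
        "pdbs": folder / "pdbs",
    }
```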

Please note that the suffix of filenames (e.g. PDB files) will correspond to the answer of the question (default is "eom"): Suffix of generated PDB files ................ < eom >

Metrics for quantitative assessment of system flexibility

The distributions of Rg and Dmax generated by EOM can be represented as probability density functions. This allows for a quantitative estimation of the flexibility of the system using the concept of information entropy. For example, an ensemble/pool of structural parameters for a protein showing a broad Gaussian-like distribution (where it is assumed the disordered regions move randomly in solution) can be viewed as a carrier of high uncertainty. Conversely, an ensemble/pool of parameters for a protein with a narrow size distribution (a scenario where the particle exhibits limited flexibility) provides low uncertainty. Useful metrics for the quantitative description of uncertainty (flexibility) provided by EOM 2.0 are:

$R_{flex} = -H_b(S)$, where $H_b(S) = -\sum_{i=1}^{n} p(x_i) \log_b[p(x_i)]$, with $\log_b[p(x_i)] = 0$ if $p(x_i) = 0$ (for further detail refer to the EOM 2.0 paper)

Metric for the degree of flexibility of the selected ensemble and that of the pool. Rflex = 100% for a fully flexible system, Rflex = 0% for a fully rigid system.
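The entropy idea behind Rflex can be illustrated on a histogram of Rg or Dmax values. The Python sketch below computes a Shannon entropy normalised to the range 0% to 100% by using the number of bins as the logarithm base; that choice of base, the sign convention and the function name are assumptions made for illustration, and the result is not guaranteed to match the value EOM reports.

```python
import math


def normalised_entropy_percent(counts):
    """Shannon entropy of a histogram, as a percentage of its maximum.

    `counts` holds the number of models in each bin. A flat histogram
    gives 100, a histogram with a single occupied bin gives 0.
    """
    total = sum(counts)
    n_bins = len(counts)
    if total == 0 or n_bins < 2:
        raise ValueError("need at least two bins and one observation")
    entropy = 0.0
    for c in counts:
        p = c / total
        if p > 0:  # the term p*log(p) is taken as 0 when p == 0
            entropy -= p * math.log(p, n_bins)
    return 100.0 * entropy
```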

Rsigma = standard_deviation(ensemble) / standard_deviation(pool)

Metric comparing the spread of the distribution of the selected ensemble with that of the pool, defined as the ratio of their standard deviations. Rsigma approaches 1.0 for a fully flexible system, where the selected ensemble is as broad as the pool, and falls below 1.0 as the distribution of the selected ensemble becomes narrower than that of the pool.
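The ratio itself is straightforward to compute from two lists of Rg (or Dmax) values. The Python sketch below uses the population standard deviation; whether EOM uses the population or the sample form is not stated here, so that choice is an assumption.

```python
import statistics


def r_sigma(ensemble_values, pool_values):
    """Ratio of the standard deviations of the ensemble and of the pool."""
    return statistics.pstdev(ensemble_values) / statistics.pstdev(pool_values)
```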

For example, the following output from EOM/GAJOE facilitates assessment of the flexibility of the system:

Rflex (random) / Rsigma: ~ 66.6% (~ 91.2%) / 0.62

Rflex of the selected ensemble is ~67%, compared to ~91% for the pool, suggesting that this system is less flexible than the random pool. Rsigma is 0.62, well below 1.0, which is consistent with this interpretation: the distribution of the selected ensemble is narrower than that of the pool.

N.B. If Rflex of the ensemble is significantly smaller than that of the pool, but Rsigma > 1.0, this may indicate a problem with the experimental data and further investigation is required.
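To collect these metrics from many runs, the summary line can be parsed from the captured console output. The Python sketch below is a hypothetical helper whose regular expression is written against the single example line shown above; other EOM versions may format the line differently.

```python
import re

# Matches e.g. "Rflex (random) / Rsigma: ~ 66.6% (~ 91.2%) / 0.62"
_PATTERN = re.compile(
    r"Rflex \(random\) / Rsigma:\s*~\s*([\d.]+)%\s*\(~\s*([\d.]+)%\)\s*/\s*([\d.]+)"
)


def parse_flexibility(line):
    """Return (rflex_ensemble, rflex_pool, rsigma) or None if no match."""
    match = _PATTERN.search(line)
    if match is None:
        return None
    return tuple(float(group) for group in match.groups())
```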

Command Line Configuration (options)

EOM 2.0 input files and Runtime Output

$ eom SEQUENCEFILE [OPTION [<FILE>|<SUFFIX>|<PATH>|<N>]]

The following examples demonstrate command line usage of EOM, using the example files provided in the EOM documentation directory of the ATSAS installation. These examples cover generation of a random pool for completely disordered protein sequences (where no high-resolution PDB input is required), and also the generation of multichain flexible models.

Example 1, sequence only (no PDB file input):

$ eom sequence.seq -q=15000 -c=n

This command generates a pool of scattering intensities computed from 15000 native-like chain models. No individual PDB files are created.

$ eom sequence.seq -w=. -i

This command generates a pool of the default number (10000) of random-coil models. The -w flag saves all generated models in PDB format in the specified directory ("." here indicating the current directory). Because the -i flag (--no-intensity) is included, the theoretical intensities of the pool models are not calculated.

$ eom sequence.seq -q=15000 --data=data.dat -t=50

This command generates a pool of 15000 random-coil models, calculates the scattering intensities and subsequently runs the genetic algorithm 50 times using the experimental data file data.dat.

Example 2, usage of high-resolution models and symmetry:

$ eom sequence.seq -q=5000 -s=p2 -x=dom1.pdb -x=dom2.pdb -f=yes -f=no -o=yes -o=no -d=data.dat

This command generates a pool of 5000 dimer models using the subunits dom1.pdb and dom2.pdb. The first PDB file (dom1.pdb) contains an oligomerized/multichain subunit (-o=yes) and will be fixed in the original coordinates (-f=yes). The pool of scattering intensities will be computed and the genetic algorithm run (by default: 1 x 100 cycles).
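For scripted use, the argument list of Example 2 can be assembled in Python. The sketch below is a hypothetical helper that only builds the list and does not run EOM; it uses only the short options shown in the table and examples above (-q, -s, -x, -f, -o, -d) and mirrors the ordering of Example 2, where the n-th -f and -o values refer to the n-th -x file.

```python
def build_eom_command(sequence, quantity=None, symmetry=None,
                      domains=(), data=None):
    """Assemble an `eom` argument list.

    `domains` is a sequence of (pdb_file, fixed, oligomer) tuples,
    with `fixed` and `oligomer` given as booleans.
    """
    def yes_no(flag):
        return "yes" if flag else "no"

    cmd = ["eom", sequence]
    if quantity is not None:
        cmd.append(f"-q={quantity}")
    if symmetry is not None:
        cmd.append(f"-s={symmetry}")
    # Options are grouped by type, in the same order as in Example 2.
    cmd += [f"-x={pdb}" for pdb, _, _ in domains]
    cmd += [f"-f={yes_no(fixed)}" for _, fixed, _ in domains]
    cmd += [f"-o={yes_no(oligomer)}" for _, _, oligomer in domains]
    if data is not None:
        cmd.append(f"-d={data}")
    return cmd
```

The resulting list could be passed to `subprocess.run` on a machine where ATSAS is installed.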

The same output files are generated by EOM in command line mode and using the interactive configuration.

EOM 2.0 version

Usage:

$ eom -v

or

$ eom --version

EOM 2.0 help

Usage:

$ eom -h

or

$ eom --help

EOM (old style)

Introduction

EOM can also be used for pool generation and ensemble selection separately, as was performed in previous versions using the independent programs, RANCH and GAJOE.

When doing so, please ensure both tools originate from the same ATSAS release.

N.B.: using versions of RANCH and GAJOE from separate ATSAS releases may generate spurious results.

  • RANCH (RANdom CHain) - tool for the generation of a pool of random models based upon user supplied sequence and structural information;
  • GAJOE (Genetic Algorithm Judging Optimisation of Ensembles) - tool using a genetic algorithm for the selection of an ensemble of models from a previously generated pool, whose combined theoretical scattering intensity best describes the experimental SAXS data.

RANCH

Manual

Introduction

RANCH is a program that generates a pool of n independent models based upon sequence and structural information. For multi-domain proteins where high-resolution structures for individual domains are available, such files (e.g. PDB) can be used as rigid-body domains/subunits and/or as an aid to define distance constraints during model generation. For proteins expected to be intrinsically unfolded, no rigid bodies are used and random configurations of the alpha-carbon trace are created based upon the sequence. Crystallographic symmetry (P1, P2, ..., Pn) can also be applied, either by supplying a high-resolution multichain/oligomerized PDB file as input or by specifying a potential oligomerization interface via a set of user-defined distance constraints. To run RANCH type:

Running RANCH

Usage:

$ ranch

RANCH input files

Command line usage of RANCH follows that used for EOM 2.0. To run the genetic algorithm, GAJOE must be executed separately.

Running RANCH alone provides the possibility to generate/remake specific models in PDB format according to the Ranch log file. For example:

Option (short)   Option (long)   Description
-r --remake Indicates that you want to re-make a specific model
-l --log=<FILE> Ranch log filename (i.e. Rancheom.log)
-m --model=<N> Number of the single model you want to re-make (according to the Ranch log file)
$ ranch -r -l Rancheom.log -m=1245

N.B.: in case of re-making a specific model, all the above parameters are mandatory

RANCH output files

RANCH will generate files to use as input for subsequent GAJOE runs on the generated pool:

File Name   Description
Rancheom.log Ranch Log file. File containing the set-up used for the run.
juneom.int Intensity file. File containing the theoretical intensity for each model in the pool.
Size_listeom.txt Size file. File containing the Rg and Dmax parameters for each model in the pool.

Please note that the suffix of filenames (e.g. RANCH log file, intensities file, PDB files) will correspond to the answer of the question (default is "eom"): Suffix of generated PDB files ................ < eom >
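The way the suffix enters the file names can be sketched in Python. The patterns below are inferred from the default names shown in this manual (Rancheom.log, juneom.int, Size_listeom.txt, 00001eom.pdb) and are assumptions; both function names are hypothetical.

```python
def ranch_file_names(suffix="eom"):
    """File names RANCH is expected to write for a given suffix (inferred)."""
    return {
        "log": f"Ranch{suffix}.log",
        "intensity": f"jun{suffix}.int",
        "sizes": f"Size_list{suffix}.txt",
    }


def model_file_name(index, suffix="eom"):
    """Name of the PDB file of pool model `index`, zero-padded to 5 digits."""
    return f"{index:05d}{suffix}.pdb"
```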

Rancheom.log:

 RANCH  Version 2.0
 Started: ............................................... : Thu Nov 14 14:01:02 2013
 iSeed .................................................. : 1040227523
 Chain type ............................................. : Random
 Sequence file name: .................................... : sequence.seq
 Symmetry: .............................................. : p2
 Symmetry type: ......................................... : Mix
 Percentage of Symmetric: ............................... : 50
 Number of residues per chain: .......................... : 387
 Number of atoms: ....................................... : 3095
 Number of domains: ..................................... : 2
  ---------------------------------------------------------
 Domain number: ......................................... : 1
 Path: .................................................. : dom1.pdb
 Kept in the original PDB coordinates: .................. : 1
 Oligomer: .............................................. : 1
 DNA file name: ......................................... : None
  ---------------------------------------------------------
 Domain number: ......................................... : 2
 Path: .................................................. : dom2.pdb
 Kept in the original PDB coordinates: .................. : 0
 Oligomer: .............................................. : 0
 DNA file name: ......................................... : None
  ---------------------------------------------------------
 Contact file name: ..................................... : None
 Number of structures: .................................. : 10000
 Suffix of the pdb files: ............................... : eom
 Output Folder Path: .................................... : .
 Number of experimental curves: ......................... : 0
 Structure number         Seed       Type(0=aSymm,1=Symm,2=noSymm)
       1               1040227523     0
       2                955133408     1
       3                 78042755     0
       4               1813283763     1
       5               1527150186     0
       .               ..........     .
       .               ..........     .
       .               ..........     .           
    9998                811760116     0
    9999               1500495652     1
   10000               1885282777     0
 Completed: ............................................. : Thu Nov 14 18:33:22 2013       

This is the Ranch log file.
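The per-model table at the end of the log (structure number, seed, symmetry type) can be read back with a few lines of Python, for instance to look up the model number to pass to the remake option. This is a hypothetical helper written against the excerpt above; it skips the dotted placeholder rows and any line that is not three integers.

```python
def parse_seed_table(log_text):
    """Return (structure_number, seed, type) tuples from a Ranch log."""
    rows = []
    in_table = False
    for line in log_text.splitlines():
        if line.strip().startswith("Structure number"):
            in_table = True  # data rows follow this header line
            continue
        if not in_table:
            continue
        parts = line.split()
        if len(parts) == 3 and all(p.isdigit() for p in parts):
            rows.append(tuple(int(p) for p in parts))
    return rows
```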

juneom.int:

   S values    51
  0.000000E+00
  0.100000E-01
  0.200000E-01
  0.300000E-01
  0.400000E-01
  0.500000E-01
  0.600000E-01
  0.700000E-01
  0.800000E-01
  0.900000E-01

This is the intensities file containing the theoretical scattering intensities of each pool model generated.

Size_listeom.txt:

     1   36.43   124.71    33.85  143298.47
     2   34.64   101.59    43.90  144916.18
     .     .       .
     .     .       .
     .     .       .
     .     .       .
 10000   36.04   122.38    48.57  143807.71

This is the size list file. The file columns are:

 MODEL   Rg      Dmax      CA-CA  Volume   

Where: Rg = radius of gyration of the MODEL, Dmax = maximum particle dimension of the MODEL, CA-CA = end-to-end distance of MODEL (CA of n-term to CA of c-term), Volume = dry volume computed for the MODEL
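Given the five columns described above, the size list can be loaded into named records for further analysis. The Python sketch below is a hypothetical helper; it assumes whitespace-separated columns as in the excerpt and silently skips any line that does not contain exactly five numeric fields.

```python
from typing import NamedTuple


class ModelSize(NamedTuple):
    model: int
    rg: float
    dmax: float
    ca_ca: float
    volume: float


def parse_size_list(text):
    """Parse the contents of a size list file into ModelSize records."""
    records = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 5:
            continue  # blank, dotted or malformed line
        try:
            records.append(ModelSize(int(parts[0]), *map(float, parts[1:])))
        except ValueError:
            continue  # non-numeric content
    return records
```

The `rg` and `dmax` fields of the returned records are the per-model values from which the pool distributions are built.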

GAJOE

Manual

Introduction

GAJOE (Genetic Algorithm Judging Optimisation of Ensembles) is a program that uses a genetic algorithm to select an ensemble of models whose combined theoretical scattering intensity best describes the experimental SAXS data. Typically, this program is executed after a random pool of models has been generated using RANCH; however, GAJOE may also be run on a pool of PDBs generated externally by other methods/programs. GAJOE requires several RANCH-format output files as input.

Running GAJOE

Usage:

$ gajoe

Interactive Configuration (Dialog mode)

The following settings and options are configured interactively as shown in the table below:

GAJOE interactive prompt:

Screen Text   Option (short)   Option (long)   Default   Description
Number of experimental curves to fit? 1 Number of experimental curves to use for ensemble selection. An independent GAJOE run will be conducted for each input data set and the results written to a separate subdirectory (e.g. curve_1, curve_2 etc).
Experimental data file name n? (*.dat) UNKNOWN Experimental data filename (*.dat). Please make sure to type the entire filename including the extension *.dat.
Number of pools to use? 1 GAJOE can operate on multiple pools simultaneously. Enter the number of different pools you want to use.
Intensities file name n? (*.int) -i=<FILE> --intensity=<FILE> UNKNOWN Intensity filename for the pool n (*.int). Please make sure to type the entire filename including the extension *.int.
Percentage of models from this pool in the final ensemble? no Type [Y]es if you wish to force the final ensemble to include a percentage of models from this pool.
Percentage of models for this pool n? 100 Percentage of models from this pool to appear in the final ensemble.
Number of generations (min. 10) -g=<VALUE> --generation=<VALUE> 1000 Number of generations used by the genetic algorithm to optimize the ensemble (Maximum 10,000). It is NOT recommended to use less than 500 generations.
Number of ensembles (min. 10) -e=<VALUE> --ensemble=<VALUE> 50 Number of ensembles of theoretical curves (from the models in the pool) to use for ensemble selection by the genetic algorithm. It is NOT recommended to use more than 200.
Ensemble size fixed? -f=<VALUE> --fixed=<VALUE> n Type [Y]es if you wish to fix the size of the final ensemble. In this case you will be asked to configure the genetic algorithm parameters manually.
Number of curves per ensemble (min. 1, max. 50) 20 Number of curves to use for each ensemble. This question appears only if a pre-fixed ensemble size is selected.
Number of mutations per ensemble (min. 1, max. 20) 10 Number of mutations (curves to be crossed/replaced) in the genetic algorithm. This value should normally be around half of the number of curves per ensemble. This question appears only if a pre-fixed ensemble size is selected.
Number of crossings per generation (min. 1, max. 50) 20 Number of best ensembles to pass to the next generation in the genetic algorithm. This value should normally be around half of the number of ensembles for each generation. This question appears only if a pre-fixed ensemble size is selected.
Maximum number of curves per ensemble (min. 1, max. 50) -a=<VALUE> --maximum=<VALUE> 20 Maximum number of curves to compose the ensemble when size not fixed. This question does not appear if a pre-fixed ensemble size is selected.
Minimum number of curves per ensemble (min. 1) -m=<VALUE> --minimum=<VALUE> 5 Minimum number of curves to compose the ensemble when size not fixed. This question does not appear if a pre-fixed ensemble size is selected.
Curve repetition in the ensemble allowed? -o --no-repeated yes Type [N]o if you do not wish to have multiple instances of the same curve in the ensemble.
Constant subtraction allowed? -c --no-constant yes Type [N]o if you do not wish to subtract a constant during the fitting of the experimental data.
Number of times you want the genetic algorithm repeated (min. 1) -t=<VALUE> --times=<VALUE> 100 Number of times (cycles) to repeat the genetic algorithm. It is recommended to run the genetic algorithm at least 50 times. Please note that final distributions are based on the repetition of the genetic algorithm. Selecting a small number of repetitions may generate nonsensical final distributions.
Save workfiles with the entire GA runs? no Provides summary files for each iteration of the genetic algorithm, detailing the number of times each model profile is selected.
-v --version Print version information and exit.
-h --help Print a summary of arguments and options and exit.
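The limits and recommendations in the table above can be checked before a long run is started. The following is a minimal sketch only: the function name check_gajoe_settings and its argument names are hypothetical and not part of ATSAS, and the rule that the minimum ensemble size must not exceed the maximum is an assumption rather than something stated in the table.

```python
def check_gajoe_settings(generations=1000, ensembles=50, repetitions=100,
                         min_curves=5, max_curves=20):
    """Return (errors, warnings) for a set of GAJOE settings.

    Hard limits and recommendations are copied from the option table.
    This helper is illustrative and is not shipped with ATSAS.
    """
    errors, warnings = [], []

    # Number of generations: min. 10, maximum 10,000, at least 500 recommended.
    if not 10 <= generations <= 10000:
        errors.append("generations must be between 10 and 10000")
    elif generations < 500:
        warnings.append("fewer than 500 generations is not recommended")

    # Number of ensembles: min. 10, more than 200 not recommended.
    if ensembles < 10:
        errors.append("ensembles must be at least 10")
    elif ensembles > 200:
        warnings.append("more than 200 ensembles is not recommended")

    # Repetitions of the genetic algorithm: min. 1, at least 50 recommended.
    if repetitions < 1:
        errors.append("repetitions must be at least 1")
    elif repetitions < 50:
        warnings.append("at least 50 repetitions are recommended")

    # Ensemble size limits when the size is not fixed.
    if min_curves < 1:
        errors.append("minimum number of curves must be at least 1")
    if not 1 <= max_curves <= 50:
        errors.append("maximum number of curves must be between 1 and 50")
    # Assumption: the minimum should not exceed the maximum.
    if min_curves > max_curves:
        errors.append("minimum number of curves exceeds the maximum")

    return errors, warnings


errors, warnings = check_gajoe_settings(generations=100, repetitions=10)
print(errors)
print(warnings)
```

With the default values the helper returns no errors and no warnings.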

GAJOE input files

GAJOE requires all the output files from RANCH in order to run.

Runtime Output

At runtime, the following lines of output will be written to standard output:

*******  ------------------------------------------------------  *******
*******     GAJOE - version 2.1                                  *******
*******     Copyright (c) ATSAS Team                             *******
*******     EMBL, Hamburg Outstation, 2007 - 2015                *******
*******                                                          *******
*******     For doubts/questions please visit SAXIER forum:      *******
*******     http://www.saxier.org/forum/viewtopic.php?f=10       *******
*******                                                          *******
*******     In case of bugs please refer to:                     *******
*******     G. Tria, D.I. Svergun, EMBL BioSAXS group            *******
*******     atsas@embl-hamburg.de                                *******
*******  ------------------------------------------------------  *******
Number of experimental curves to fit? (default: 1) ................... : 1
Experimental data file name  1? (*.dat) .............................. : data.dat
Number of pools to use? (default: 1) ................................. : 1
Intensities file name 1? (*.int) ..................................... : juneom.int
Number of generations (min. 10) (default: 1000) ...................... : 
Number of ensembles (min. 10) (default: 50) .......................... : 
Ensemble size fixed? (default: no) ................................... : n
Maximum number of curves per ensemble (min. 1, max. 50) (default: 20) : 
Minimum number of curves per ensemble (min. 1) (default: 5) .......... : 
Constant subtraction allowed? (default: yes) ......................... : y
Number of times you want the genetic algorithm repeated (min. 1) 
(default: 100) ....................................................... : 
Save workfiles with the entire GA runs? (default: no) ................ :
 Curve: data.dat - Loading values and configuration ...
   Number of theoretical curves :         10000
 Starting the Genetic Algorithm ...
 CYCLE:   1
    Chi^2:  0.389
    Ensemble size:  5
 ...
 ...
 ...
 CYCLE: 100
    Chi^2:  0.388
    Ensemble size:  4
 ... finished the Genetic Algorithm!
 Rflex (random) / Rsigma: ~ 66.6% (~ 91.2%) / 0.62
 Re-making selected structures ...
[ 20%] [ 40%] [ 60%] [ 80%] [100%]  ... completed!
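If the standard output of a run is saved to a text file, the per-cycle values can be collected with a short script. The sketch below assumes the output has exactly the layout shown above (a CYCLE line followed by Chi^2 and Ensemble size lines); the function name parse_cycles is hypothetical and the sample text is copied from the listing above.

```python
import re

CYCLE_RE = re.compile(r"CYCLE:\s*(\d+)")
CHI2_RE = re.compile(r"Chi\^2:\s*([0-9.]+)")
SIZE_RE = re.compile(r"Ensemble size:\s*(\d+)")


def parse_cycles(text):
    """Collect cycle number, Chi^2 and ensemble size for every cycle."""
    cycles = []
    current = None
    for line in text.splitlines():
        match = CYCLE_RE.search(line)
        if match:
            # A new cycle starts; later lines fill in its values.
            current = {"cycle": int(match.group(1))}
            cycles.append(current)
            continue
        if current is None:
            # Skip the banner and the prompts before the first cycle.
            continue
        match = CHI2_RE.search(line)
        if match:
            current["chi2"] = float(match.group(1))
            continue
        match = SIZE_RE.search(line)
        if match:
            current["size"] = int(match.group(1))
    return cycles


sample = """
 Starting the Genetic Algorithm ...
 CYCLE:   1
    Chi^2:  0.389
    Ensemble size:  5
 ...
 CYCLE: 100
    Chi^2:  0.388
    Ensemble size:  4
 ... finished the Genetic Algorithm!
"""

cycles = parse_cycles(sample)
# The cycle with the lowest Chi^2 is the one whose models are written to pdbs.
best = min(cycles, key=lambda c: c["chi2"])
print(best)
```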

Command-Line Arguments and Options

GAJOE also accepts command-line options. Using the options listed in the table above, the program can be run as follows:

gajoe data.dat -i=juneom.int -t=50

This command runs GAJOE 50 times against the file data.dat using the pool of intensities juneom.int.

gajoe data.dat -m=3 -i=junPOOL1.int -i=junPOOL2.int -i=junPOOL3.int -c

This command runs GAJOE (by default: 100 times) against the file data.dat using three different pools of intensities. In this example, the constant subtraction is disabled.

N.B.: the parameter -m=3 is required to define the number of pools used

gajoe data.dat -i=juneom.int -f=10 -t=150

This command runs GAJOE 150 times using a single pool and a pre-fixed ensemble size (10 model curves).
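For scripted or batch use, a command line of the kind shown above can be assembled programmatically. The following is a sketch for the single-pool case only; the function name build_gajoe_command and its argument names are hypothetical, and only the options -i, -g, -e, -t, -o and -c from the table are used. The code builds the argument list but does not run GAJOE.

```python
import shlex


def build_gajoe_command(data_file, intensity_file, repetitions=None,
                        generations=None, ensembles=None,
                        allow_repeats=True, constant_subtraction=True):
    """Assemble a single-pool gajoe argument list (illustrative only)."""
    cmd = ["gajoe", data_file, f"-i={intensity_file}"]
    if generations is not None:
        cmd.append(f"-g={generations}")
    if ensembles is not None:
        cmd.append(f"-e={ensembles}")
    if repetitions is not None:
        cmd.append(f"-t={repetitions}")
    if not allow_repeats:
        # -o / --no-repeated: no multiple instances of the same curve.
        cmd.append("-o")
    if not constant_subtraction:
        # -c / --no-constant: disable constant subtraction.
        cmd.append("-c")
    return cmd


cmd = build_gajoe_command("data.dat", "juneom.int", repetitions=50)
print(shlex.join(cmd))  # gajoe data.dat -i=juneom.int -t=50
```

The resulting list can be passed to subprocess.run on a system where gajoe is installed and on the PATH.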

GAJOE output files

Once completed, GAJOE creates a subfolder in the working directory containing all files resulting from the computation. The subfolders are named in the form GAnum where num is the sequential number for each independent run (e.g. GA001, GA002 etc.). In each subfolder the following files/folders are written:

File Name Description
GA00n/curve_m/ Folder containing the result of the genetic algorithm for each experimental data set m (curve_1, curve_2,... curve_m)
GA00n/curve_m/logFile_00n_m.log Log file. File containing the configuration for the genetic algorithm for the experimental data set m.
GA00n/curve_m/profiles_00n_m.fit Fit file. File containing the fit for the best ensemble for the curve m. It can be opened directly in SASPLOT/PRIMUS. Detailed information (e.g. the discrepancy, CHI^2) is contained in the header of the file and can be viewed with a text editor.
GA00n/curve_m/Rg_distr_00n_m.dat Rg distribution file. File containing the Rg distribution of the selected ensemble for the curve m and the Rg distribution for the pool. The file can be directly opened using SASPLOT/PRIMUS and selecting View > AbsY:X (linear scale). Detailed information on the average Rg values of the pool and the selected structures are listed in the file header and can be viewed with a text editor.
GA00n/curve_m/Size_distr_00n_m.dat Size distribution file. File containing the size distribution of maximum model dimensions (Dmax) of the selected ensemble for the curve m, and the size distribution for the pool. The file can be directly opened using SASPLOT/PRIMUS and selecting View > AbsY:X (linear scale). Detailed information on the average Dmax values of the pool and the selected structures are listed in the file header and can be viewed with a text editor.
GA00n/curve_m/pdbs Additionally, a subfolder named pdbs is created, containing the models from the selected ensemble produced in the cycle with the lowest CHI^2 value for the curve m. Please note that the PDB files in this folder are NOT the structure of the flexible system but serve as descriptors of the behaviour of the system in solution and are used to generate the Rg/Dmax distributions and flexibility metrics.

IMPORTANT: in order to re-create the selected models with GAJOE, all the RANCH files used during pool generation must be in the same directory (e.g. sequence file, PDB files of domains/subunits, contact file, etc.).
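The naming scheme of the output files listed above can be captured in a small helper when results from several runs need to be collected. This is a sketch only: the function name gajoe_output_paths is hypothetical, and the three-digit zero-padded run number is an assumption based on the examples GA001 and GA002.

```python
from pathlib import Path


def gajoe_output_paths(workdir, run, curve):
    """Return the expected GAJOE output paths for one run and one curve.

    Assumes the run number is zero-padded to three digits (GA001, GA002).
    """
    tag = f"{run:03d}"
    base = Path(workdir) / f"GA{tag}" / f"curve_{curve}"
    suffix = f"{tag}_{curve}"
    return {
        "log": base / f"logFile_{suffix}.log",
        "fit": base / f"profiles_{suffix}.fit",
        "rg": base / f"Rg_distr_{suffix}.dat",
        "size": base / f"Size_distr_{suffix}.dat",
        "pdbs": base / "pdbs",
    }


paths = gajoe_output_paths("work", run=1, curve=1)
for name, path in paths.items():
    print(name, path.as_posix())
```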

GAJOE on an external pool (own pool)

Models created by other methods/programs (i.e. not using RANCH) can also be used as input for GAJOE.

Usage:

$ gajoe -p

or

$ gajoe --pool

File format: there is no restriction on the type of models as long as they are in standard PDB format.

Calculation of theoretical intensities is performed using CRYSOL.

IMPORTANT: running GAJOE on an externally generated pool of PDB files may take a long time due to the computation of intensities.

Runtime Output on an own pool

At runtime, the following lines of output will be written to standard output:

*******  ------------------------------------------------------  *******
*******     GAJOE - version 2.1 - (r7286M)                  *******
*******     Copyright (c) ATSAS Team                             *******
*******     EMBL, Hamburg Outstation, 2007 - 2015                *******
*******                                                          *******
*******     For doubts/questions please visit SAXIER forum:      *******
*******     http://www.saxier.org/forum/viewforum.php?f=10       *******
*******                                                          *******
*******     In case of bugs please refer to:                     *******
*******     G. Tria, D.I. Svergun, EMBL BioSAXS group            *******
*******     atsas@embl-hamburg.de                                *******
*******  ------------------------------------------------------  *******
 ---------------------------- ATTENTION ------------------------------
 Using a pool of PDBs could take a while due to the CRYSOL computation
 ---------------------------------------------------------------------
Number of experimental curves to fit? (default: 1) ................... : 
Experimental data file name  1? (*.dat) .............................. : data.dat
Directory containing the PDB files? (default: .) ..................... : models
Number of harmonics (min. 10, max. 50)? (default: 15) ................ : 
Maximum s value (min. 0.1, max. 0.5)? (default: 0.500) ............... : 
Number of points (min. 10, max. 201)? (default: 51) .................. : 
Number of generations (min. 10) (default: 1000) ...................... : 1000
Number of ensembles (min. 10) (default: 50) .......................... : 
Ensemble size fixed? (default: no) ................................... : n
Maximum number of curves per ensemble (min. 1, max. 50) (default: 20) : 
Minimum number of curves per ensemble (min. 1) (default: 5) .......... : 
Curve repetition in the ensemble allowed? (default: yes) ............. : 
Constant subtraction allowed? (default: yes) ......................... : 
Number of times you want the genetic algorithm repeated (min. 1) 
(default: 100) ....................................................... : 
 Random seed is:   3178319491719824691
 Curve: data.dat - Loading values and configuration ...
 ... Theoretical scattering computation (CRYSOL time) ...
[  1%] >
[  5%] >
[ 10%] >
[ 15%] >
[ 20%] >
[ 25%] >
[ 30%] >
[ 35%] >
[ 40%] >
[ 45%] >
[ 50%] >
[ 55%] >
[ 60%] >
[ 65%] >
[ 70%] >
[ 75%] >
[ 80%] >
[ 85%] >
[ 90%] >
[ 95%] >
[100%] 
 ... CRYSOL time completed!
   Number of theoretical curves :       10000
 Starting the Genetic Algorithm ...
 CYCLE:   1
    Chi^2:  0.389
    Ensemble size:  5
 ...
 ...
 ...
 CYCLE: 100
    Chi^2:  0.388
    Ensemble size:  4
 ... finished the Genetic Algorithm!
 Rflex (random) / Rsigma: ~ 66.6% (~ 91.2%) / 0.62
 Copying the selected models ...
[  5%] [ 10%] [ 15%] [ 21%] [ 26%] [ 31%] [ 36%] [ 42%] [ 47%] [ 52%] [ 57%] [ 63%] [ 68%] [ 73%] [ 78%] [ 84%] [ 89%] [ 94%] [100%]  ... done!
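The flexibility metrics reported at the end of the run can be extracted from saved output for comparison between runs. The sketch below assumes the summary line has exactly the form shown above; the function name parse_flexibility and the dictionary keys are hypothetical, with the bracketed value stored under the label "random" as printed by the program.

```python
import re

RFLEX_RE = re.compile(
    r"Rflex \(random\) / Rsigma:\s*~\s*([0-9.]+)%"
    r"\s*\(~\s*([0-9.]+)%\)\s*/\s*([0-9.]+)"
)


def parse_flexibility(text):
    """Extract Rflex, Rflex (random) and Rsigma, or None if not found."""
    match = RFLEX_RE.search(text)
    if match is None:
        return None
    return {
        "rflex_selected": float(match.group(1)),  # value before the brackets
        "rflex_random": float(match.group(2)),    # value labelled (random)
        "rsigma": float(match.group(3)),
    }


line = " Rflex (random) / Rsigma: ~ 66.6% (~ 91.2%) / 0.62"
metrics = parse_flexibility(line)
print(metrics)
```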

Output files

When running GAJOE using an external pool, in addition to the standard output files, a pseudo RANCH log file (RanchXXX.log), an intensity file (junXXX.int) and a Size file (Size_listXXX.txt) are also generated (where XXX is the file name entered by the user for the RANCH pool generation). These can be used as input for additional GAJOE runs without needing the option -p.
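As a small illustration of the naming described above, the extra files written for an external pool can be derived from the name entered by the user. This is a sketch only: the function name external_pool_files is hypothetical, and the follow-up command simply reuses the -i option in the form shown in the command-line examples.

```python
def external_pool_files(name):
    """Return the extra file names written for an external pool called name."""
    return {
        "ranch_log": f"Ranch{name}.log",
        "intensities": f"jun{name}.int",
        "sizes": f"Size_list{name}.txt",
    }


files = external_pool_files("eom")
print(files)

# A later run can reuse the intensity file without the option -p.
followup = ["gajoe", "data.dat", f"-i={files['intensities']}"]
print(" ".join(followup))
```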


  Last modified: October 6, 2016

© BioSAXS group 2016