EOM: Ensemble Optimization Method

Written by P. Bernado, E. Mylonas, M.V. Petoukhov & D.I. Svergun.
Post all your questions about EOM to the ATSAS Forum.

© ATSAS Team, 2001-2010

This is the manual for the program suite EOM (Ensemble Optimisation Method), which seeks to describe experimental SAXS data using an ensemble representation of atomic models. The program is separated into two components:

  • RANCH (RANdom CHain) - for the generation of a pool of random models based upon user supplied sequence and structural information;
  • GAJOE (Genetic Algorithm Judging Optimisation of Ensembles) - the genetic algorithm used for the selection of an ensemble of models whose combined theoretical scattering intensity best describes the experimental SAXS data.

The following sections briefly describe the RANCH and GAJOE programs that form EOM, how to run them and what the input and output files are. If you use results from EOM in your own publication, please cite:

Bernado, P., Mylonas, E., Petoukhov, M.V., Blackledge, M., Svergun, D.I. (2007) Structural Characterization of Flexible Proteins Using Small-Angle X-ray Scattering. J. Am. Chem. Soc. 129(17), 5656-5664.

RANCH

Introduction

RANCH is a program that generates a pool of models based upon sequence and structural information. For multi-domain proteins where high-resolution structures are available for individual domains, a pool of models with random linker configurations is created. For proteins expected to be completely random or unfolded no input rigid bodies are required and completely random configurations of the alpha-carbon trace are created based upon the sequence. An empirical database of backbone torsion angles is used to generate realistic conformations of linkers (represented as dummy residues).

Running RANCH

Usage:

$> ranch [[INPUTFILE] [OPTIONS]]

RANCH supports only a limited set of command-line options (see below), so running it through the interactive dialog is recommended. Alternatively, an input file containing a list of answers to the prompts of a previous run can be used (see RANCH input files). If no INPUTFILE or OPTIONS are given, the interactive dialog mode is invoked. OPTIONS for RANCH are described in the following section.

Command-Line Arguments and Options

RANCH recognizes the following command-line options.

Option | Description
/l | Enable command-line mode.
/t | Write the models generated for the random pool to the current directory (e.g. ranch /t). This option must be given AFTER the /l option.
/seq <File.seq> | The sequence file name.
/pdb <NUMBER> <File.pdb> | Number of folded domains and their corresponding file names. This option can be omitted if the protein has no folded domains; 0 is then assumed automatically.
/mod n | Use for native-like models. Random coil is assumed if the option is not present.
/ext <EXT> | The suffix of the pdb files.
/dat <File.dat> | The experimental data file name.
/nst <NUMBER> | Number of structures to be generated. 10000 is assumed if the option is not present.
/lm <NUMBER> | Maximum order of harmonics. 15 is assumed if the option is not present.
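
As an illustration, the options in the table above can be assembled into a command line programmatically. The following is a minimal Python sketch and not part of ATSAS: the function name build_ranch_command is hypothetical, and the executable name ranch is an assumption (the installed binary may be called ranch13 or Ranch13.exe, as in the examples later in this manual). The code only builds the argument list; it does not run RANCH.

```python
def build_ranch_command(seq_file, dat_file, ext, domains=(),
                        native_like=False, n_structures=10000,
                        max_harmonics=15, keep_models=False,
                        executable="ranch"):
    """Return a RANCH command line as a list of arguments.

    Only options listed in the table above are used; the defaults for
    /nst and /lm mirror the documented defaults. The executable name is
    an assumption and may differ between installations.
    """
    cmd = [executable, "/l"]              # /l enables command-line mode
    if keep_models:
        cmd.append("/t")                  # /t must come after /l
    cmd += ["/seq", seq_file]
    if domains:                           # /pdb is omitted without folded domains
        cmd += ["/pdb", str(len(domains)), *domains]
    if native_like:
        cmd += ["/mod", "n"]              # random coil is assumed otherwise
    cmd += ["/ext", ext, "/dat", dat_file,
            "/nst", str(n_structures), "/lm", str(max_harmonics)]
    return cmd


cmd = build_ranch_command("sequence.seq", "expdata.dat", "ss.pdb",
                          domains=["one.pdb", "two.pdb", "three.pdb"],
                          native_like=True, n_structures=1000,
                          max_harmonics=30)
print(" ".join(cmd))
```

Once the executable name has been confirmed for the local installation, the returned list can be passed to subprocess.run(cmd, check=True).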

Interactive Configuration (Dialog mode)

Settings available through command-line arguments and options may also be configured interactively as shown in the table below:

RANCH interactive prompt:

Screen Text | Default | Description
Type of models to create? [R]andom coil (default) or [N]ative-like | R | The CA-angle distribution to be used. Random models use a CA-angle distribution consistent with a random coil, while native-like models use a CA-distribution consistent with folded proteins. Random models will on average be more extended than native-like. Change the default only when the protein's linkers are supposed to be relatively compact (or if the random coil models seem to be too big to fit the data).
Input sequence file name | UNKNOWN | Input the file name of the sequence text file (e.g. protein.seq). Currently the sequence file must be single-letter format and upper case. Headers must be removed from the file.
Number of domains? | 0 | Number of folded domains in the protein, leave 0 if the protein is considered to have no folded domains.
Total number of structures to generate? | 10000 | Number of random models/structures to generate for the pool.
Order of harmonics | 15 | Order of spherical harmonics for the calculation of scattering curves from the generated models. Using higher harmonics will increase the accuracy of calculated scattering patterns especially at higher angles, however, it makes the program significantly slower. Maximum 50.
Extension of new pdb files | UNKNOWN | This is the extension for the pdb files that will be created by the program (written to the current directory if the keep option is used).
Experimental data filename | UNKNOWN | The experimental scattering data file. This file is used for the naming of RANCH output files. It is not used directly for the pool generation.

Runtime Output

At runtime, the following lines of output are written to the RANCH log file (RanchX.log, where X is the extension of the new pdb files):

 RANCH  Version 1.3
 Start:  Fri Oct  8 20:16:21 2010
 Chain type:                  Random
 Sequence file name           protein.seq
 Number of residues           476
 Number of domains             2
 Domain number           1 :  dom1.pdb
 Domain number           2 :  dom2.pdb
 Extension of pdb files       model.pdb
 Experimental data filename  data.dat
 Structure number         Seed
       1                201617000
       2               1260097231
       .                    .
       .                    .
       .                    .
       .                    .
       .                    .
     10000             8795950758

The above output is produced for a two domain protein (with high resolution structures of domain 1 and 2, and a flexible linker between these domains of unknown structure) with the sequence file protein.seq and data file data.dat.

RANCH input files

An input file for ranch can be created by hand and has the following format:

r
protein.seq
2
dom1.pdb
dom2.pdb
10000
15
0.5
51
model
data.dat

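This text file is a list of answers, one per line, to the interactive prompts in the order in which RANCH asks them. The values 0.5 and 51 answer the "Maximum s value" and "Number of points" prompts, which appear in the example session under Examples but are not listed in the Dialog mode table above.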

To use such an answers file ranch is executed on the command line as follows: ranch < INPUTFILE
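
A minimal Python sketch of generating such an answers file is shown here. The function name ranch_answers is hypothetical, and the order of the lines simply follows the example file above. Whether a given RANCH version asks exactly these questions in this order is an assumption that should be checked against its interactive dialog.

```python
def ranch_answers(seq_file, domains, ext, dat_file, model_type="r",
                  n_structures=10000, harmonics=15, s_max=0.5, n_points=51):
    """Return the text of a RANCH answers file, one answer per line.

    The line order mirrors the example answers file in this manual;
    it is an assumption that the installed version asks the same prompts.
    """
    lines = [model_type, seq_file, str(len(domains)), *domains,
             str(n_structures), str(harmonics), str(s_max), str(n_points),
             ext, dat_file]
    return "\n".join(lines) + "\n"


text = ranch_answers("protein.seq", ["dom1.pdb", "dom2.pdb"],
                     "model", "data.dat")
print(text, end="")
```

The printed text can be saved to a file and passed to RANCH with input redirection as described above.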

RANCH output files

RANCH creates the following three output files:
File Name | Description
junX00.int | This file contains the intensities of the random models created by RANCH and calculated in the same way as in CRYSOL. X is the extension of the new pdb files, e.g. junMODEL00.int, where MODEL is the pdb file extension.
RanchX.log | This is a log file for the RANCH run. It is also required as input for GAJOE, e.g. RanchMODEL.log, where MODEL is the pdb file extension.
Size_listZ.txt | The size list file of the pool, where Z is the experimental data file name as entered in the interactive dialog, e.g. Size_listDATA.txt, where DATA.dat is the SAXS data file.

CAUTION: Changing the names of the output files could cause the EOM programs to function improperly, since they rely on naming assumptions. This is especially important for the fixed part of the file names (e.g. jun, Size_list etc.). Manually changing the variable part of the names (X and Z in this example) is allowed. Size_listZ.txt must be copied and renamed when the ensemble selection is to be tested against several experimental data files. In this case, create one copy of the same Size_list file per data file and name each copy accordingly, i.e. if you have the experimental files Z1.dat, Z2.dat and Z3.dat, the size list files should be named Size_listZ1.txt, Size_listZ2.txt and Size_listZ3.txt.
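
The copying step described in the caution above can be scripted. The following Python sketch is an illustration only: the function name copy_size_list is hypothetical, and it assumes that the variable part Z of Size_listZ.txt is the data file name without its .dat extension, with the case left unchanged.

```python
import shutil
import tempfile
from pathlib import Path


def copy_size_list(size_list, data_files):
    """Copy one Size_list file so that each data file has a matching copy.

    For a data file Z.dat the copy is named Size_listZ.txt and placed next
    to the original size list. Returns the list of paths created.
    """
    size_list = Path(size_list)
    created = []
    for data_file in data_files:
        target = size_list.with_name(f"Size_list{Path(data_file).stem}.txt")
        if target != size_list:           # never copy the file onto itself
            shutil.copyfile(size_list, target)
            created.append(target)
    return created


# Demonstration in a throw-away directory.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "Size_listZ1.txt"
    original.write_text("     1   29.75  100.66\n")
    copies = copy_size_list(original, ["Z1.dat", "Z2.dat", "Z3.dat"])
    print(sorted(path.name for path in copies))
```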

Examples

Interactive Configuration

In this example the user has the protein sequence (prot.seq), the pdb files of two known domains (dom1.pdb, dom2.pdb) and experimental SAXS data (data.dat). The two domains are joined by a linker and their relative orientation is unknown. RANCH is run using this information to generate a pool of 10000 models with random conformations. Typing

$> ranch13

on the command line brings up the interactive dialog:

 ***  ------------------------------------------------  ***
 ***    RANCH  Wintel/UNIX/Linux version 1.3            ***
 ***   Please reference: P. Bernado, E. Mylonas,        ***
 ***   M.V. Petoukhov, M. Blackledge, D.I. Svergun      ***
 ***   Copyright (c) ATSAS Team                         ***
 ***   EMBL, Hamburg Outstation, 2007                   ***
 ***  ------------------------------------------------  ***
    Type of models to create
 [R]andom coil (default) or [N]ative-like <       Random >: r
 Input sequence file name ............... <         .seq >: prot.seq
 Number of residues read ................................ : 476
 Number of domains ...................... <            0 >: 2
 Domain number           1
 Input pdb file name .................... <         .pdb >: dom1.pdb
 Domain number           2
 Input pdb file name .................... <         .pdb >: dom2.pdb
 Total number of structures to generate . <        10000 >: 10000
 Order of harmonics (max. 50) ........... <           15 >: 15
 Maximum s value ........................ <       0.5000 >: 0.5
 Number of points (max. 101) ............ <           51 >: 51
 Extension of new pdb files ............. <         .pdb >: model
 Experimental data filename ............. <         .dat >: data.dat

The results of this RANCH run are written to the following files:

ranchmodel.log:

 RANCH  Version 1.3
 Start:  Fri Oct  8 20:16:21 2010
 Chain type:                  Random
 Sequence file name           prot.seq
 Number of residues           476
 Number of domains             2
 Domain number           1 :  dom1.pdb
 Domain number           2 :  dom2.pdb
 Extension of pdb files       model.pdb
 Experimental data filename  data.dat
 Structure number         Seed
       1                201617000
       2               1260097231
       .                    .
       .                    .

junmodel00.int:

   S values    51
  0.000000E+00
  0.100000E-01
  0.200000E-01
  0.300000E-01
  0.400000E-01
  0.500000E-01
  0.600000E-01
  0.700000E-01
  0.800000E-01
  0.900000E-01

This is the intensities file.

Size_listdata.txt:

     1   29.75  100.66
     2   28.98   89.38
     .     .       .
     .     .       .
     .     .       .
     .     .       .
 10000   29.88  100.22

This is the size list file.
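
For quick inspection, a size list file can be read with a few lines of Python. The sketch below is hypothetical and rests on an assumption: this manual does not label the columns, so the code treats them as model number, Rg and Dmax, based on the Dmax/Rg analysis that GAJOE performs with this file. Lines that cannot be parsed (such as the elision dots in the excerpt above) are skipped.

```python
def read_size_list(lines):
    """Parse size-list lines into (model_number, rg, dmax) tuples.

    ASSUMPTION: the three columns are model number, Rg and Dmax;
    the manual itself does not label them.
    """
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue                       # blank or malformed line
        try:
            records.append((int(fields[0]), float(fields[1]), float(fields[2])))
        except ValueError:
            continue                       # e.g. a line of elision dots
    return records


sample = [
    "     1   29.75  100.66",
    "     2   28.98   89.38",
    "     .     .       .",
    " 10000   29.88  100.22",
]
records = read_size_list(sample)
mean_rg = sum(r[1] for r in records) / len(records)
mean_dmax = sum(r[2] for r in records) / len(records)
print(f"{len(records)} models, mean Rg {mean_rg:.2f}, mean Dmax {mean_dmax:.2f}")
```

For a real file, the lines can be supplied with open("Size_listdata.txt") in place of the sample list.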

Command Line Execution

Ranch13.exe /l /seq sequence.seq /pdb 3 one.pdb two.pdb three.pdb /mod n /ext ss.pdb /dat expdata.dat /nst 1000 /lm 30

GAJOE

Introduction

GAJOE (Genetic Algorithm Judging Optimisation of Ensembles) is a program that uses a genetic algorithm to select an ensemble of models whose combined theoretical scattering intensity best describes the experimental SAXS data. Typically, this program is executed after a random pool of models has been generated with RANCH; however, the necessary input files can be adjusted so that a pool generated by some other method is used instead. GAJOE requires several RANCH-format output files as input.

Running gajoe

Usage:

$> gajoe [INPUTFILE]

Here, INPUTFILE is an answers file that can be optionally created and used as described for RANCH. GAJOE requires that an intensities file, size-list file and RANCH log file exist in the working directory. If an INPUTFILE is not specified then dialog mode is invoked. OPTIONS known by gajoe are described in the next section.

Command-Line Arguments and Options

GAJOE currently does not support command-line options.

Interactive Configuration (Dialog mode)

The following settings and options are configured interactively as shown in the table below:

GAJOE interactive prompt:

Screen Text | Default | Description
Program mode? 0 = Ensemble Optimization (default); 1 = Size/Rg Analysis; 2 = Create Size_list file. Enter your option (0-2) | 0 | Choose the program mode. Size/Rg Analysis is used to repeat the Dmax/Rg analysis using the output of a previous GAJOE run. This can be useful in 2 cases: 1) to create new histogram files when there was no appropriate Size_list file when GAJOE finished (and thus no histograms were created) and 2) to change the histogram intervals (in this case the program will overwrite the preexisting files). Create Size_list file is used to generate input for GAJOE using a pool of structures created by the user and requires CRYSOL .log files.
Input the experimental data file name. | .dat | The experimental data filename.
Angular units in the input file | UNKNOWN | Units of the scattering data file: (1) 4*pi*sin(theta)/lambda [angstrom-1], (2) 4*pi*sin(theta)/lambda [nm-1].
Input the name of the intensities file. | .int | The junX00.int file name.
Do you want to change the settings? (y/n) | n | With the default (n), the program asks no further questions and starts running the algorithm using default values for all the parameters (which should be OK for most cases). Answering y enters Advanced Mode, in which the value of each parameter can be adjusted.
Enter number of generations | 1000 | Number of generations the genetic algorithm can use to optimize the ensemble. Maximum 10,000. Using fewer than 1000 generations is NOT suggested.
Enter number of ensembles | 50 | Number of ensembles of theoretical curves (calculated from the structures created by RANCH) to compare with experimental data. Avoid numbers bigger than 200.
Enter number of curves per ensemble | 20 | Number of curves that compose one ensemble. Again, numbers above 200 probably make no sense. Setting this number very low can help determine the minimum number of curves/structures that are needed to fit the data.
Enter maximum number of mutations per ensemble | 10 | Should normally be around half the number of curves per ensemble. Avoid using a number bigger than the number of curves per ensemble.
Enter number of crossings per generation | 20 | This number should range from around half the number of ensembles up to the number of ensembles.
Enter number of times you want the process repeated. | 50 | How many times (cycles) the genetic algorithm process is repeated. For a test run trying different parameter values it should probably be changed to 1, but for Size/Rg analysis it is advisable to keep it at 50 or maybe even 100.
Do you want to allow repetitions? (y/n) | y | If yes, one ensemble can contain the same structure-curve more than once. Should be fine as yes.
Allow constant subtraction? (y/n) | y | If yes, the fit of the theoretical ensemble to the experimental data may be adjusted using a constant. It probably needs to be kept enabled, especially in the case of completely unfolded systems. Alternatively, program runs with constant subtraction disabled can be performed and compared with program runs with it enabled.
Create convergence file? (y/n) | n | The creation of the convergence file should only be allowed when checking at which generation the CHI2 is sufficiently small (it is also advisable to use it only in one-cycle runs to avoid creating a huge convergence file). If, for example, the program is run for 10,000 generations and the CHI2 improves slightly only 3 times after e.g. 1000 generations, it is probably a waste of time to use 10,000 generations.
Create analysis files? (y/n) | y | If yes, Size/Rg distribution histograms are created when the program is finished. If the appropriate Size_list file is not found, nothing is created and the program will not continue.
Enter the Size of the (max. size) histogram intervals | 2.00 | The size of the Dmax (maximum size) histogram intervals can be selected here. The Rg intervals are going to be 1/4 of the Dmax intervals. This is asked after the run is finished.
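
The rules of thumb in the table above can be checked before an Advanced Mode run. The Python sketch below is an illustration under stated assumptions: check_gajoe_settings is a hypothetical helper, not part of GAJOE, and it encodes only the limits that the table states explicitly. The guidance that crossings should start at "around half" the number of ensembles is too loose to test, so only the upper bound is checked.

```python
def check_gajoe_settings(generations=1000, ensembles=50, curves=20,
                         mutations=10, crossings=20):
    """Return a list of warnings for settings outside the suggested ranges.

    The default arguments are the defaults listed in the table above.
    """
    warnings = []
    if generations > 10000:
        warnings.append("generations exceeds the maximum of 10,000")
    elif generations < 1000:
        warnings.append("fewer than 1000 generations is not suggested")
    if ensembles > 200:
        warnings.append("more than 200 ensembles should be avoided")
    if curves > 200:
        warnings.append("more than 200 curves per ensemble probably makes no sense")
    if mutations > curves:
        warnings.append("mutations should not exceed curves per ensemble")
    if crossings > ensembles:
        warnings.append("crossings should not exceed the number of ensembles")
    return warnings


print(check_gajoe_settings())                       # documented defaults
for message in check_gajoe_settings(generations=500, curves=5, mutations=10):
    print("warning:", message)
```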

Runtime Output

At runtime, the following lines of output are written to standard output:

Reading the theoretical curves
                       Number of theoretical curves : 10000
 Reading finished
 CYCLE:           1
 Initial genes created
 Starting Genetic Algorithm
 CYCLE:           2
 Initial genes created
 Starting Genetic Algorithm
 CYCLE:           3
 Initial genes created
 Starting Genetic Algorithm
 CYCLE:           4
 Initial genes created
          ...
          ...
          ...
 Starting Genetic Algorithm
 CYCLE:       10000
 Initial genes created

GAJOE input files

An input file for gajoe can be created by hand and has the following format (no advanced settings used here):

0
data.dat
2
junmodel00.int
n

This text file is a list of answers to the command prompt in the order shown above for the Dialog mode.

To use such an answers file gajoe is executed on the command line as follows: gajoe < INPUTFILE
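
The redirection above can also be driven from a script. The following Python sketch is hypothetical: gajoe_answers and run_with_answers are illustrative names, the answers follow the default-settings example file above (program mode 0, no advanced settings), and the executable name gajoe is an assumption about the local installation. The call that would start GAJOE is left commented out.

```python
import subprocess


def gajoe_answers(dat_file, int_file, units):
    """Return a GAJOE answers file for mode 0 that keeps all default settings.

    units: 1 for inverse angstrom, 2 for inverse nanometre, as in the
    dialog table of this manual.
    """
    if units not in (1, 2):
        raise ValueError("units must be 1 (angstrom-1) or 2 (nm-1)")
    lines = ["0", dat_file, str(units), int_file, "n"]
    return "\n".join(lines) + "\n"


def run_with_answers(command, answers):
    """Run a command with the answers text on standard input.

    Equivalent in spirit to `gajoe < INPUTFILE` on the command line.
    """
    return subprocess.run(command, input=answers, text=True, check=True)


answers = gajoe_answers("data.dat", "junmodel00.int", units=2)
print(answers, end="")
# run_with_answers(["gajoe"], answers)   # enable where GAJOE is installed
```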

GAJOE output files

GAJOE will create a subfolder with all the files that it creates. All subfolders have a name in the form GAnum where num is the sequential number for every time you run the program (e.g. GA001, GA002 etc.). In each subfolder the following files can be found:
File Name | Description
GAnum.log | The log file showing the files used, the parameter values etc. A replicate of this file can also be found in the original folder in order to easily see what is in each folder, e.g. GA001.log, where 001 is the run number.
best_curvenum.txt | File showing the number of times each curve-structure was selected in the final generation of all cycles (curves-structures that were never selected are not shown).
selected_ensemnum.txt | File showing the best 10 ensembles at the final generation of each cycle.
profilesZnum.fit | The fit of the selected ensemble of the best cycle to the experimental data Z.dat. It can be opened by Sasplot. If opened with a text editor the CHI of all cycles can be seen.
Size_distrZnum.dat and Rg_distrZnum.dat | The Dmax/Rg distribution histograms for the experimental data Z.dat compared to the distribution of the pool of all curves-structures created by RANCH. They can be opened using Sasplot and selecting View > AbsY:X to watch them in linear scale (logarithmic Y scale is the default). By opening them with a text editor the average Dmax/Rg values of the pool and the selected structures can also be seen.

Additionally, a subfolder named pdbs is created, which contains the structures that were selected in the cycle with the lowest CHI.
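
Because each run produces a new GAnum subfolder, a small script can help locate the most recent one. The Python sketch below is an illustration; ga_run_folders is a hypothetical helper that relies only on the GAnum naming scheme described above, and it is demonstrated on a temporary directory rather than on real GAJOE output.

```python
import re
import tempfile
from pathlib import Path


def ga_run_folders(workdir="."):
    """Return GAJOE output folders (GA001, GA002, ...) sorted by run number."""
    pattern = re.compile(r"GA(\d+)")
    runs = []
    for entry in Path(workdir).iterdir():
        match = pattern.fullmatch(entry.name)
        if match and entry.is_dir():      # GAnum.log copies are files, not folders
            runs.append((int(match.group(1)), entry))
    return [path for _, path in sorted(runs, key=lambda item: item[0])]


# Demonstration on a throw-away directory that mimics three GAJOE runs.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("GA002", "GA001", "GA010"):
        (Path(tmp) / name).mkdir()
    (Path(tmp) / "GA010.log").write_text("")
    folders = ga_run_folders(tmp)
    print([folder.name for folder in folders])
    print("latest run:", folders[-1].name)
```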

Using models created from an external program (own pool):

Models created by other methods/programs (i.e. not using RANCH) can also be used.

File format: there is no restriction on the type of models as long as they are in standard PDB format. All the files should have the same common extension (e.g. X.pdb) and be numbered in one of two ways:

  • 1X.pdb, 2X.pdb, ..., 10X.pdb, ..., 100X.pdb, ..., 1000X.pdb, ..., 10000X.pdb, ... (i.e. the length of the file name varies with the size of the number);
  • 00001X.pdb, 00002X.pdb, ..., 00010X.pdb, ..., 00100X.pdb, ..., 01000X.pdb, ..., 10000X.pdb, ... (i.e. 5 digit number in all cases).
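
The two numbering schemes can be generated as follows. This is a minimal Python sketch; pool_file_name is a hypothetical helper, and it assumes that numbering starts at 1 as in the examples above.

```python
def pool_file_name(number, suffix="X.pdb", zero_padded=False):
    """Return the file name of model `number` in a user-supplied pool.

    zero_padded=False gives names such as 12X.pdb (variable length);
    zero_padded=True gives names such as 00012X.pdb (always 5 digits).
    """
    if number < 1:
        raise ValueError("model numbers are assumed to start at 1")
    if zero_padded:
        if number > 99999:
            raise ValueError("the 5-digit scheme cannot hold numbers above 99999")
        return f"{number:05d}{suffix}"
    return f"{number}{suffix}"


print([pool_file_name(n) for n in (1, 10, 10000)])
print([pool_file_name(n, zero_padded=True) for n in (1, 10, 10000)])
```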

Calculation of theoretical intensities: run CRYSOL on the *X.pdb files using the default parameters. Three types of files will be created: *X00.alm, *X00.log and *X00.int. The scattering amplitude files (*X00.alm) are not used by GAJOE, so they can be discarded.
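
Before continuing, it can be useful to confirm that CRYSOL produced a .log and an .int file for every model in the pool. The Python sketch below is hypothetical: missing_crysol_outputs is an illustrative helper, and it assumes the naming described above, i.e. that model 1X.pdb yields 1X00.log and 1X00.int. It works on a plain list of file names, such as the result of os.listdir on the working folder.

```python
def missing_crysol_outputs(file_names, suffix="X.pdb"):
    """Return (pdb_name, missing_files) pairs for models lacking CRYSOL output.

    ASSUMPTION: for a model named <stem>.pdb, CRYSOL writes <stem>00.log
    and <stem>00.int into the same folder.
    """
    names = set(file_names)
    missing = []
    for name in sorted(names):
        if not name.endswith(suffix):
            continue                       # not a model of this pool
        stem = name[:-len(".pdb")]         # e.g. "1X" for "1X.pdb"
        needed = [f"{stem}00.log", f"{stem}00.int"]
        absent = [item for item in needed if item not in names]
        if absent:
            missing.append((name, absent))
    return missing


listing = ["1X.pdb", "1X00.log", "1X00.int", "1X00.alm",
           "2X.pdb", "2X00.log"]
print(missing_crysol_outputs(listing))
```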

Creation of the intensities master file: while it is possible to use the original *.int files, it is advisable to instead create a master file junX00.int (the same type of file that RANCH creates), either by hand using a text editor or using the small program ONEFILE2 (Windows only).

onefile2 interactive prompt:

Screen Text | Default | Description
Enter the extension of the crysol files | .int | The filename extension of the calculated intensities (e.g. X00, for X00.int)

The program will then generate an input file for GAJOE.

Creation of the Size_list file for GAJOE: in the same folder where the *X00.log files are located run GAJOE in mode 2 to generate the size list file, then run GAJOE again with the newly created input files. A subfolder named pdbs containing the pool of models can also be placed in the same folder so that the program automatically copies the selected models to the output subfolder GAnum.


  Last modified: April 11, 2013

© BioSAXS group 2013