This is the manual for the program suite EOM (Ensemble Optimisation Method), which seeks to describe experimental SAXS data using an ensemble representation of atomic models. The program is separated into two components:
RANCH (RANdom CHain) - for the generation of a pool of random models based upon user supplied sequence and structural information;
GAJOE (Genetic Algorithm Judging Optimisation of Ensembles) - the genetic algorithm used for the selection of an ensemble of models whose combined theoretical scattering intensity best describes the experimental SAXS data.
The following sections briefly describe the RANCH and GAJOE programs that form EOM, how to run them and what the input and output files are.
If you use results from EOM in your own publication, please cite:
RANCH is a program that generates a pool of models based upon sequence and structural information. For multi-domain proteins where high-resolution structures are available for individual domains, a pool of models with random linker configurations is created. For proteins expected to be completely random or unfolded, no input rigid bodies are required, and completely random configurations of the alpha-carbon trace are created based upon the sequence. An empirical database of backbone torsion angles is used to generate realistic conformations of linkers (represented as dummy residues).
Currently only a single command line option for RANCH is available (see below), so it is recommended to run RANCH using the interactive dialog. However, an input file containing a list of answers to the command prompt from a previous run can be used instead. If no INPUTFILE or OPTIONS are given, the interactive dialog mode is invoked.
OPTIONS for RANCH are described in the following section.
Number of folded domains and their corresponding names. This option can be skipped if the protein has no folded domains; RANCH will then automatically assume 0.
The CA-angle distribution to be used. Random models use a CA-angle distribution consistent with a random coil, while native-like models use a CA-angle distribution consistent with folded proteins. Random models will on average be more extended than native-like ones. Change the default only when the protein's linkers are expected to be relatively compact (or if the random coil models seem to be too big to fit the data).
Input the file name of the sequence text file (e.g. protein.seq). Currently the sequence file must be single-letter format and upper case. Headers must be removed from the file.
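The sequence file requirements above (single-letter codes, upper case, no headers) can be met by cleaning an existing FASTA-style file first. The following is a minimal sketch, not part of EOM: the function name is hypothetical, and the check against the 20 standard amino-acid letters is an assumption, since the manual does not state which letters RANCH accepts.

```python
def fasta_to_ranch_sequence(text):
    """Drop header lines and whitespace, and upper-case the residues."""
    parts = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and FASTA header lines
        parts.append("".join(line.split()).upper())
    sequence = "".join(parts)
    # Assumption: only the 20 standard one-letter codes are acceptable.
    unknown = sorted(set(sequence) - set("ACDEFGHIKLMNPQRSTVWY"))
    if unknown:
        raise ValueError("non-standard residue codes: " + "".join(unknown))
    return sequence


example = ">hypothetical header line\nmkt ayiak\nQRQISFVK\n"
print(fasta_to_ranch_sequence(example))
```

The returned string can then be written to a text file such as protein.seq.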
Order of spherical harmonics for the calculation of scattering curves from the generated models. Using higher harmonics increases the accuracy of the calculated scattering patterns, especially at higher angles, but makes the program significantly slower. Maximum 50.
At runtime, the following lines of output will be written to the ranch.log file:
RANCH Version 1.3
Start: Fri Oct 8 20:16:21 2010
Chain type: Random
Sequence file name protein.seq
Number of residues 476
Number of domains 2
Domain number 1 : dom1.pdb
Domain number 2 : dom2.pdb
Extension of pdb files model.pdb
Experimental data filename data.dat
Structure number Seed
1 201617000
2 1260097231
. .
. .
. .
. .
. .
10000 8795950758
The above output is produced for a two domain protein (with high resolution structures of domain 1 and 2, and a flexible linker between these domains of unknown structure) with the sequence file protein.seq and data file data.dat.
This file contains the intensities of the random models created by RANCH and calculated in the same way as in CRYSOL. X is the extension of the new pdb files.
e.g. junMODEL00.int, where MODEL is the pdb file extension.
The size list file of the pool, where Z is the experimental data file name as entered in the interactive dialog.
e.g. Size_listDATA.txt, where DATA.dat is the SAXS data file.
CAUTION: Changing the names of the output files could cause the EOM programs to function improperly, since they rely on naming assumptions. This is especially important for the fixed part of the file names (e.g. jun, Size_list etc.). Manually changing the varying part of the names (in this example X and Z) is allowed. Size_listZ.txt needs to be renamed when the ensemble selection is to be tested against several experimental data files. In this case you should create copies of the same Size_list file and name them accordingly, i.e. if you have the experimental files Z1.dat, Z2.dat and Z3.dat, the size list files should be named Size_listZ1.txt, Size_listZ2.txt and Size_listZ3.txt.
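The naming rules above can be sketched in a few lines. This is an illustrative helper only: the function names are hypothetical, and the demonstration works on a placeholder file in a temporary directory rather than on real RANCH output.

```python
import shutil
import tempfile
from pathlib import Path


def intensity_file_name(pdb_extension):
    """'MODEL' -> 'junMODEL00.int' (the 'jun' prefix is fixed)."""
    return f"jun{pdb_extension}00.int"


def size_list_name(data_file):
    """'Z1.dat' -> 'Size_listZ1.txt' (the 'Size_list' prefix is fixed)."""
    return f"Size_list{Path(data_file).stem}.txt"


def copy_size_list(original, data_files):
    """Copy one size list file once per experimental data file."""
    copies = []
    for data_file in data_files:
        target = original.with_name(size_list_name(data_file))
        if target != original:
            shutil.copyfile(original, target)
        copies.append(target)
    return copies


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "Size_listdata.txt"
    source.write_text("placeholder contents\n")  # stand-in, not a real size list
    made = copy_size_list(source, ["Z1.dat", "Z2.dat", "Z3.dat"])
    copied_names = [path.name for path in made]
    print(copied_names)
```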
In this example the user has the protein sequence (prot.seq), the pdb files of two known domains (dom1.pdb, dom2.pdb) and experimental SAXS data (data.dat). The two domains are joined by a linker, and the orientation of the domains relative to each other is unknown. RANCH is run using this information to generate a pool of 10000 models with random conformations.
i.e., typing
$> ranch13
on the command line brings up the interactive dialog:
*** ------------------------------------------------ ***
*** RANCH Wintel/UNIX/Linux version 1.3 ***
*** Please reference: P. Bernado, E. Mylonas, ***
*** M.V. Petoukhov, M. Blackledge, D.I. Svergun ***
*** Copyright (c) ATSAS Team ***
*** EMBL, Hamburg Outstation, 2007 ***
*** ------------------------------------------------ ***
Type of models to create
[R]andom coil (default) or [N]ative-like < Random >: r
Input sequence file name ............... < .seq >: prot.seq
Number of residues read ................................ : 476
Number of domains ...................... < 0 >: 2
Domain number 1
Input pdb file name .................... < .pdb >: dom1.pdb
Domain number 2
Input pdb file name .................... < .pdb >: dom2.pdb
Total number of structures to generate . < 10000 >: 10000
Order of harmonics (max. 50) ........... < 15 >: 15
Maximum s value ........................ < 0.5000 >: 0.5
Number of points (max. 101) ............ < 51 >: 51
Extension of new pdb files ............. < .pdb >: model
Experimental data filename ............. < .dat >: data.dat
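The answers typed in the dialog above could be collected for reuse in an input file. The sketch below only assembles the answers in prompt order; the one-answer-per-line layout and the function name are assumptions, since the exact format of the answers file is not described here. The limits on harmonics (50) and points (101) are taken from the prompts.

```python
def ranch_answers(sequence_file, domain_files, data_file, pdb_extension,
                  chain_type="r", n_structures=10000, harmonics=15,
                  s_max=0.5, n_points=51):
    """Return the dialog answers in prompt order, one per line (assumed layout)."""
    if not 1 <= harmonics <= 50:
        raise ValueError("order of harmonics must be between 1 and 50")
    if not 1 <= n_points <= 101:
        raise ValueError("number of points must be between 1 and 101")
    answers = [chain_type, sequence_file, str(len(domain_files))]
    answers += list(domain_files)  # one pdb file name per domain
    answers += [str(n_structures), str(harmonics), str(s_max),
                str(n_points), pdb_extension, data_file]
    return "\n".join(answers) + "\n"


text = ranch_answers("prot.seq", ["dom1.pdb", "dom2.pdb"], "data.dat", "model")
print(text, end="")
```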
The results of this RANCH run are written to the following files:
ranchmodel.log:
RANCH Version 1.3
Start: Fri Oct 8 20:16:21 2010
Chain type: Random
Sequence file name prot.seq
Number of residues 476
Number of domains 2
Domain number 1 : dom1.pdb
Domain number 2 : dom2.pdb
Extension of pdb files model.pdb
Experimental data filename data.dat
Structure number Seed
1 201617000
2 1260097231
. .
. .
GAJOE (Genetic Algorithm Judging Optimisation of Ensembles) is a program that uses a genetic algorithm for the selection of an ensemble of models whose combined theoretical scattering intensity best describes the experimental SAXS data. Typically, this program is executed following the generation of a random pool of models using RANCH. However, it is possible to adjust the necessary input files such that a random pool generated by some other method can be used. GAJOE requires several RANCH format output files as input.
Here, INPUTFILE is an answers file that can be optionally created and used as described for RANCH. GAJOE requires that an intensities file, size-list file and RANCH log file exist in the working directory. If an INPUTFILE is not specified then dialog mode is invoked. OPTIONS known by gajoe are described in the next section.
Size/Rg Analysis is used to repeat the Dmax/Rg analysis using the output of a previous GAJOE run. This can be useful in two cases: 1) to create new histogram files when there was no appropriate Size_list file when GAJOE finished (and thus no histograms were created) and 2) to change the histogram intervals (in this case the program will overwrite the preexisting files).
Create Size_list file is used to generate input for GAJOE using a pool of structures created by the user and requires CRYSOL .log files.
If you choose the default No the program will not ask any more questions and will start running the algorithm using default values for all the parameters (which should be OK for most cases). If you press y/yes then you enter Advanced Mode and you can tweak the values of each parameter.
Select the number of generations the genetic algorithm can use to optimize the ensemble. Maximum 10,000. Using fewer than 1000 generations is NOT recommended.
Number of ensembles of theoretical curves (calculated from the structures created by RANCH) to compare with experimental data. Avoid numbers bigger than 200.
Number of curves that compose one ensemble. Again, numbers above 200 would probably make no sense. Setting this number very low can help determine the minimum number of curves/structures that are needed to fit the data.
This asks how many times (cycles) the genetic algorithm process is going to be repeated. For a test run trying different parameter values it should probably be changed to 1, but for Size/Rg analysis it is advisable to keep it at 50 or maybe even 100.
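To make the roles of the parameters above concrete (generations, number of ensembles, curves per ensemble, cycles), here is a toy genetic algorithm in Python. It is a sketch under stated assumptions and not GAJOE's actual implementation: the mutation, crossover and elitism rules are generic choices, the chi-square omits any scaling factor or constant subtraction, and the pool is a set of synthetic Guinier-like curves rather than RANCH output.

```python
import math
import random


def chi2(ensemble, pool, data, sigma):
    """Reduced chi-square between the ensemble-averaged curve and the data."""
    total = 0.0
    for j, (i_exp, err) in enumerate(zip(data, sigma)):
        i_avg = sum(pool[k][j] for k in ensemble) / len(ensemble)
        total += ((i_avg - i_exp) / err) ** 2
    return total / (len(data) - 1)


def run_cycle(pool, data, sigma, generations, n_ensembles, n_curves, rng):
    """One cycle: evolve ensembles (lists of pool indices), return the best."""
    def score(ensemble):
        return chi2(ensemble, pool, data, sigma)

    population = [[rng.randrange(len(pool)) for _ in range(n_curves)]
                  for _ in range(n_ensembles)]
    population.sort(key=score)
    for _ in range(generations):
        children = []
        for parent in population:
            # Mutation: replace one curve by a random curve from the pool.
            mutant = list(parent)
            mutant[rng.randrange(n_curves)] = rng.randrange(len(pool))
            children.append(mutant)
            # Crossover: head of this ensemble joined to the tail of another.
            mate = rng.choice(population)
            cut = rng.randrange(n_curves + 1)
            children.append(parent[:cut] + mate[cut:])
        # Elitism: keep the n_ensembles best of parents and children.
        population = sorted(population + children, key=score)[:n_ensembles]
    return population[0], score(population[0])


# Synthetic pool (assumption): curves exp(-(s*Rg)^2/3) for Rg = 10..59.
s_values = [0.01 * j for j in range(1, 31)]
pool = [[math.exp(-((s * rg) ** 2) / 3.0) for s in s_values]
        for rg in range(10, 60)]
data = [(a + b) / 2.0 for a, b in zip(pool[5], pool[40])]  # fake "experiment"
sigma = [0.01] * len(data)

rng = random.Random(0)
cycles = [run_cycle(pool, data, sigma, 100, 20, 4, rng) for _ in range(3)]
best_ensemble, best_chi2 = min(cycles, key=lambda result: result[1])
print(sorted(best_ensemble), best_chi2)
```

The 100 generations used here only keep the toy example fast; they are not a recommendation for real runs.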
If yes, the fit of the theoretical ensemble to the experimental data may be adjusted using a constant. It should probably be kept enabled, especially for completely unfolded systems. Alternatively, runs with constant subtraction disabled can be performed and compared with runs with it enabled.
The convergence file should only be created when checking at which generation the CHI2 becomes sufficiently small (it is also advisable to use it only in one-cycle runs to avoid creating a huge convergence file). If, for example, the program is run for 10,000 generations and the CHI2 improves only slightly, just 3 times, after e.g. 1000 generations, it is probably a waste of time to use 10,000 generations.
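The check described above can be sketched as follows. The layout of the convergence file is not described here, so the function takes a plain list of CHI2 values, one per generation; the function name, the 1% threshold and the example values are all hypothetical.

```python
def last_significant_improvement(chi2_by_generation, min_relative_gain=0.01):
    """Return the 1-based generation of the last improvement above the threshold."""
    if not chi2_by_generation:
        raise ValueError("at least one CHI2 value is required")
    best = chi2_by_generation[0]
    last = 1
    for generation, value in enumerate(chi2_by_generation[1:], start=2):
        # Count an improvement only if it beats the reference by the threshold.
        if value < best * (1.0 - min_relative_gain):
            best = value
            last = generation
    return last


history = [5.0, 3.0, 2.0, 1.5, 1.2, 1.19, 1.19, 1.189, 1.189]  # made-up values
print(last_significant_improvement(history))
```

If the returned generation is far below the number of generations used, a shorter run would probably give a similar CHI2.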
If yes it will create Size/Rg distribution histograms when the program is finished. If it does not find the appropriate Size_list file it will create nothing and the program will not continue.
The size of the Dmax (maximum size) histogram intervals can be selected here. The Rg intervals are going to be 1/4 of the Dmax intervals. This is asked after the run is finished.
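The relation between the two interval sizes can be illustrated with a small binning helper. This is a sketch only: how GAJOE places the bin edges is not stated here, so the edges below (multiples of the interval) are an assumption, and the Rg values are made up.

```python
import math
from collections import Counter


def histogram(values, interval):
    """Count values per bin; keys are the lower edges of the bins."""
    counts = Counter(math.floor(v / interval) * interval for v in values)
    return dict(sorted(counts.items()))


dmax_interval = 8.0                 # chosen by the user
rg_interval = dmax_interval / 4.0   # Rg intervals are 1/4 of the Dmax intervals

rg_values = [21.3, 22.0, 25.9]      # hypothetical Rg values
rg_histogram = histogram(rg_values, rg_interval)
print(rg_interval, rg_histogram)
```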
GAJOE will create a subfolder with all the files that it creates. All subfolders have a name in the form GAnum where num is the sequential number for every time you run the program (e.g. GA001, GA002 etc.). In each subfolder the following files can be found:
The log file showing the files used, the parameter values etc. A replicate of this file can be also found in the original folder in order to easily see what is in each folder e.g. GA001.log, where 001 is the run number.
file showing the number of times each curve-structure was selected in the final generation of all cycles (curves-structures that were never selected are not shown).
the fit of the selected ensemble of the best cycle to the experimental data Z.dat. It can be opened by Sasplot. If opened with a text editor the CHI of all cycles can be seen.
the Dmax/Rg distribution histograms for the experimental data Z.dat compared to the distribution of the pool of all curves-structures created by RANCH. They can be opened using Sasplot and selecting View > AbsY:X to watch them in linear scale (logarithmic Y scale is the default). By opening them with a text editor the average Dmax/Rg values of the pool and the selected structures can also be seen.
Additionally, a subfolder named pdbs is created, which contains the structures that were selected in the cycle with the lowest CHI.
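When scripting around several runs, it can help to predict the name of the next GAnum subfolder described above. The helper below is a hypothetical sketch that assumes three-digit numbering, as in GA001 and GA002; what GAJOE does beyond GA999 is not covered here.

```python
import re


def next_run_folder(existing_names):
    """Return the next name in the GA001, GA002, ... sequence."""
    numbers = [int(match.group(1)) for name in existing_names
               if (match := re.fullmatch(r"GA(\d{3})", name))]
    return f"GA{max(numbers, default=0) + 1:03d}"


print(next_run_folder(["GA001", "GA002", "pdbs"]))
```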
Models created by other methods/programs (i.e. not using RANCH) can also be used.
File format:
there is no restriction on the type of models as long as they are in standard PDB format. All the files should have the same common extension:
e.g. X.pdb
and be numbered either in the form 1X.pdb, 2X.pdb, ..., 10X.pdb, ..., 100X.pdb, ..., 1000X.pdb, ..., 10000X.pdb, ...
(i.e. the length of the file name varies with the size of the number)
or
00001X.pdb, 00002X.pdb, ..., 00010X.pdb, ..., 00100X.pdb, ..., 01000X.pdb, ..., 10000X.pdb, ...
(i.e. 5 digit number in all cases).
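The two numbering schemes above can be produced and recognised with a few helper functions. The names are hypothetical, and the sketch assumes that the common extension X does not itself start with a digit, since the model number could not otherwise be separated from it.

```python
import re


def plain_name(number, extension):
    """Variable-length numbering, e.g. 10 -> '10X.pdb'."""
    return f"{number}{extension}.pdb"


def padded_name(number, extension):
    """Five-digit numbering, e.g. 10 -> '00010X.pdb'."""
    return f"{number:05d}{extension}.pdb"


def parse_model_number(file_name, extension):
    """Return the model number for either scheme, or None if no match."""
    pattern = rf"(\d+){re.escape(extension)}\.pdb"
    match = re.fullmatch(pattern, file_name)
    return int(match.group(1)) if match else None


for number in (1, 10, 100, 1000, 10000):
    print(plain_name(number, "X"), padded_name(number, "X"))
```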
Calculation of theoretical intensities: run CRYSOL on the *X.pdb files using the default parameters. Three types of files will be created: *X00.alm, *X00.log and *X00.int. The scattering amplitude files (*.alm) are not used by GAJOE, so they can be discarded.
Creation of the intensities master file: while it is possible to use the original *.int files, it is advisable to avoid doing so and instead create a master file junX00.int (the same type of file that RANCH creates), either by hand using a text editor or using the small program ONEFILE2 (Windows only).
The filename extension of the calculated intensities (e.g. X00, for X00.int)
The program will then generate an input file for GAJOE.
Creation of the Size_list file for GAJOE: in the same folder where the *X00.log files are located run GAJOE in mode 2 to generate the size list file, then run GAJOE again with the newly created input files. A subfolder named pdbs containing the pool of models can also be placed in the same folder so that the program automatically copies the selected models to the output subfolder GAnum.