
Biological Small Angle Scattering Group


 

EOM Manual

EOM: Ensemble Optimization Method

Two programs are needed to run the genetic algorithm that selects the ensembles that fit the experimental data. RanCh (Random Chain) creates the random models of intrinsically unfolded or multidomain proteins with flexible linkers. GAJOE (Genetic Algorithm Judging Optimization of Ensembles) is the selection program. Preferably, both should be kept in the same directory (or in an executables directory). Additionally, the program ONEFILE is provided for those who want to use their own models.
Files required:

  1. The subtracted experimental data file (the original *.dat, NOT the GNOM *.out file). It must be in ASCII format, where the first column contains the s values, the second column the intensities and the third the experimental errors (if they exist).
  2. A text file with the sequence in FASTA format (*.seq).
  3. If the protein is multidomain then the *.pdb files of the folded domains. CAUTION: It is absolutely required that the sequence of these domains corresponds exactly to a specific part of the sequence given.
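As a rough way to check the CAUTION above before running RanCh, the Python sketch below reads the one-letter sequence from a FASTA text and from the CA atoms of a domain PDB file, then reports where the domain sits in the full sequence. The function names are illustrative and not part of EOM. It assumes standard fixed-column PDB ATOM records and the 20 standard residues; anything else is mapped to X.

```python
# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def read_fasta_sequence(fasta_text):
    """Join all non-header lines of a FASTA text into one upper-case string."""
    lines = (line.strip() for line in fasta_text.splitlines())
    return "".join(l for l in lines if l and not l.startswith(">")).upper()


def pdb_ca_sequence(pdb_text):
    """One-letter sequence taken from the CA atoms of the ATOM records."""
    residues = []
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM") or len(line) < 20:
            continue
        is_ca = line[12:16].strip() == "CA"      # atom name, columns 13-16
        first_altloc = line[16] in (" ", "A")    # skip alternate locations
        if is_ca and first_altloc:
            residues.append(THREE_TO_ONE.get(line[17:20].strip().upper(), "X"))
    return "".join(residues)


def domain_offset(full_sequence, domain_sequence):
    """0-based position of the domain in the full sequence, or -1 if absent."""
    return full_sequence.find(domain_sequence)
```

A return value of -1 from domain_offset means the domain sequence does not occur in the full sequence as written, so either the PDB file or the sequence file needs to be corrected before running RanCh.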

Quick start guide

This is a quick explanation of how to run the programs using default parameters. A more detailed explanation of all program questions and parameters follows.
RanCh:

  • Type of models to create
    [R]andom coil (default) or [N]ative-like < Random >:
    Keep the default mode.
  • Input sequence file name ............... < .seq >: The FASTA sequence file name.
  • Number of domains ...................... < 0 >: Number of folded domains. 0 for completely unfolded proteins.
  • Domain number X
    Input pdb file name .................... <.pdb >:
    The file name of folded domain X.
  • Total number of structures to generate . < 10000 >: Keep the default number.
  • Order of harmonics ..................... < 15 >: Leave default value.
  • Extension of new pdb files ............. <.pdb >: A sort of "ID" for the created files.
  • Experimental data filename ............. < .dat >: The experimental data file name.

GAJOE:

  • Program mode?
    0 = Ensemble Optimization Method (default)
    1 = Size/Rg Analysis
    2 = Create Size_list file
    Enter your option (0-2) ................ < 0 >:
    Choose "0".
  • Input the experimental file name ....... <.dat >: The experimental data file name.
  • Input the name of the intensities file . <.int >: The junX00.int file name that was created by RanCh where X is the "ID" (extension) RanCh asked for.
  • Do you want to change the settings? (y/n) < No >: Choose "n" for default parameters.
  • Enter the Size of the (max. size) histogram intervals < 2.000 >: The interval of the Dmax histogram. It should be adjusted according to the number of amino acids in the protein.

 

DETAILED DESCRIPTION

Running RanCh: it should be pretty straightforward but let's look at the questions asked by the program.

  • Type of models to create [R]andom coil (default) or [N]ative-like < Random >: The random coil models are on average more extended and have a random coil CA-angle distribution, while native-like ones resemble the distribution of folded proteins (although they are still more extended than a normal folded protein). Change the default only when the protein's linkers are expected to be relatively compact (or if the random coil models seem to be too big to fit the data).
  • Input sequence file name ............... < .seq >: Input the file name of the sequence text file.
  • Number of domains ...................... < 0 >: Number of folded domains in the protein, leave 0 if the protein is considered to have no folded domains.
  • Total number of structures to generate . < 10000 >: Number of random structures to generate. Normally default 10000 should be OK.
  • Order of harmonics ..................... < 15 >: Using higher harmonics will make the calculated scattering patterns more accurate especially at higher angles but it makes the program significantly slower. Maximum 50.
  • Extension of new pdb files ............. <.pdb >: This is the extension for the pdb files that will be created by the program. By default, you will see no pdb files when the program finishes running but it is used as a sort of "ID" for the output files that it creates (see below for the explanation of the output).
  • Experimental data filename ............. < .dat >: The name of the experimental data file you are going to use.

RanCh creates three output files:

  • junX00.int where X is the extension of the new pdb files. This file contains the intensities of the random models created by RanCh calculated in the same way as in CRYSOL.
  • RanchX.log where X is the extension of the new pdb files. This is a log file needed also by GAJOE.
  • Size_listZ.txt where Z is the experimental data file name as entered in the last question.

CAUTION: Changing the names of the output files could cause the programs to function improperly, since they rely on naming assumptions. This is especially important for the fixed part of the file names (e.g. jun, Size_list etc.). Manually changing the varying part of the names (in this example X and Z) is allowed. The only reason to do so, though, is for Size_listZ.txt, when you have mistyped the experimental file name or want to use more than one experimental file. In that case, create copies of the same Size_list file and name them accordingly, i.e. if you have the experimental files Z1.dat, Z2.dat and Z3.dat, the size list files should be named Size_listZ1.txt, Size_listZ2.txt and Size_listZ3.txt.

Extra option: By default RanCh does not save the pdbs it creates. If you want to keep the pdbs, run the program with the option /t, i.e. in the command line type: RanCh10.exe /t.
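The Size_list copies described above can be made by hand or with a small script. The following Python sketch is not part of EOM. It assumes that Z is the experimental file name without its .dat extension, as in the Z1.dat example, and the function name is made up for illustration.

```python
import shutil
from pathlib import Path


def copy_size_list(folder, source_dat, other_dats):
    """Copy Size_list<source>.txt to Size_list<Z>.txt for each other data file.

    Returns the names of the copies that were created.
    """
    folder = Path(folder)
    source = folder / f"Size_list{Path(source_dat).stem}.txt"
    created = []
    for dat in other_dats:
        target = folder / f"Size_list{Path(dat).stem}.txt"
        if target == source:
            continue  # the original file already carries this name
        shutil.copyfile(source, target)
        created.append(target.name)
    return created
```

For example, copy_size_list(".", "Z1.dat", ["Z2.dat", "Z3.dat"]) would leave Size_listZ1.txt in place and add Size_listZ2.txt and Size_listZ3.txt next to it.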

Running RanCh with command line options:
/l: To enable command line mode
/seq *.seq: The sequence file name
/pdb number *.pdb *.pdb...: Number of folded domains and their corresponding file names. This option can be omitted if the protein has no folded domains; the program will then assume 0.

The following arguments are optional:
/mod r/n: Use r for random coil and n for native-like models. Assumes random coil if the option is not present.
/ext X.pdb: The extension of the pdb files. Assumes a.pdb if the option is not present.
/dat Z.dat: The experimental data file name. Assumes a.dat if the option is not present.
/nst number: Number of structures to be generated. Assumes 10000 if the option is not present.
/lm number: Order of harmonics. Assumes 15 if the option is not present.
/t: The /t option is also compatible with the command line mode (but it should come AFTER the /l option).

Example: ranch13.exe /l /seq sequence.seq /pdb 3 one.pdb two.pdb three.pdb /mod n /ext ss.pdb /dat expdata.dat /nst 1000 /lm 30
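For scripted runs, the same argument list can be assembled programmatically. The sketch below is a hypothetical Python helper (not shipped with EOM) that only uses the options listed above. The executable name is whatever your RanCh binary is called, and placing /t directly after /l is just one choice that satisfies the ordering rule.

```python
def ranch_command(exe, seq, pdbs=(), mod=None, ext=None, dat=None,
                  nst=None, lm=None, save_pdbs=False):
    """Build a RanCh command-line argument list from the documented options."""
    if mod is not None and mod not in ("r", "n"):
        raise ValueError("mod must be 'r' (random coil) or 'n' (native-like)")
    if lm is not None and not 1 <= lm <= 50:
        raise ValueError("order of harmonics must be between 1 and 50")

    args = [exe, "/l"]              # /l enables command line mode
    if save_pdbs:
        args.append("/t")           # /t must come after /l
    args += ["/seq", seq]
    if pdbs:                        # omitted entirely when there are no domains
        args += ["/pdb", str(len(pdbs)), *pdbs]

    # Optional arguments: left out when None so RanCh uses its own defaults.
    optional = (("/mod", mod), ("/ext", ext), ("/dat", dat),
                ("/nst", nst), ("/lm", lm))
    for flag, value in optional:
        if value is not None:
            args += [flag, str(value)]
    return args
```

The returned list can be passed to subprocess.run on a system where the RanCh executable is available.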

 

Running GAJOE: a look at the questions of GAJOE and their meaning.

  • Program mode?
    0 = Ensemble Optimization Method (default)
    1 = Size/Rg Analysis
    2 = Create Size_list file
    Enter your option (0-2) ................ < 0 >:
    Choose the program mode. Let's first take a look at the default "0" mode.
  • Input the experimental file name ....... <.dat >: The experimental data filename.
  • Input the name of the intensities file . <.int >: The junX00.int file name.
  • Do you want to change the settings? (y/n) < No >: If you choose the default No the program will not ask any more questions and will start running the algorithm using default values for all the parameters (which should be OK for most cases). If you press y/yes then you enter Advanced Mode and you can tweak the values of each parameter.
  • Enter number of generations ............ < 1000 >: Select the number of generations the genetic algorithm can use to optimize the ensemble. Maximum 10,000. It is NOT suggested to use fewer than 1000 generations.
  • Enter number of ensembles .............. < 50 >: Number of ensembles of theoretical curves (calculated from the structures created by RanCh) to compare with experimental data. Avoid numbers bigger than 200.
  • Enter number of curves per ensemble .... < 20 >: Number of curves that compose one ensemble. Again, it would probably make no sense to use numbers above 200. Setting this number very low can help determine the minimum number of curves-structures that are needed to fit the data.
  • Enter maximum number of mutations per ensemble < 10 >: Should be normally around half of the number of curves per ensemble. Avoid using a number bigger than the number of curves per ensemble.
  • Enter number of crossings per generation < 20 >: This number should have values from around half the number of ensembles up to the number of ensembles.
  • Enter number of times you want the process repeated < 50 >: This asks how many times (cycles) the genetic algorithm process is going to be repeated. For a test run trying different parameter values it should probably be changed to 1, but for Size/Rg analysis it is advisable to keep it at 50 or maybe even 100.
  • Do you want to allow repetitions? (y/n) < Yes >: If yes, one ensemble can contain the same structure-curve more than once. Should be fine as yes.
  • Allow constant subtraction? (y/n) ..... < Yes >: If yes, the fit of the theoretical ensemble to the experimental data may be adjusted using a constant. It probably needs to be kept enabled, especially in the case of completely unfolded systems. Alternatively, runs with constant subtraction disabled can be performed and compared with runs with it enabled.
  • Create convergence file? (y/n) ......... < No >: The convergence file should only be created when checking at which generation the χ² is sufficiently small (it is also advisable to use it only in one-cycle runs to avoid creating a huge convergence file). If, for example, the program is run for 10,000 generations and the χ² improves only slightly 3 times after e.g. 1000 generations, it is probably a waste of time to use 10,000 generations.
  • Create analysis files? (y/n) ........... < Yes >: If yes it will create Size/Rg distribution histograms when the program is finished. If it does not find the appropriate Size_list file it will create nothing. If it does find it though it will ask the following question.
  • Experimental file Z
    Enter the Size of the (max. size) histogram intervals < 2.000 >:
    The size of the Dmax (maximum size) histogram intervals can be selected here. The Rg intervals are going to be 1/4 of the Dmax intervals. This is asked after the run is finished.
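The rules of thumb given for the Advanced Mode parameters can be collected into a quick sanity check. This Python sketch is only an illustration: the class and function names are invented here, the defaults mirror the prompts above, and the messages paraphrase this manual rather than anything GAJOE itself prints.

```python
from dataclasses import dataclass


@dataclass
class GajoeSettings:
    """Advanced Mode values, with the defaults shown in the prompts above."""
    generations: int = 1000
    ensembles: int = 50
    curves_per_ensemble: int = 20
    max_mutations: int = 10
    crossings: int = 20
    repeats: int = 50


def check_settings(s):
    """Return a list of notes for values outside the manual's guidelines."""
    notes = []
    if s.generations > 10000:
        notes.append("generations: the maximum is 10,000")
    elif s.generations < 1000:
        notes.append("generations: fewer than 1000 is not suggested")
    if s.ensembles > 200:
        notes.append("ensembles: avoid numbers bigger than 200")
    if s.curves_per_ensemble > 200:
        notes.append("curves per ensemble: more than 200 makes little sense")
    if s.max_mutations > s.curves_per_ensemble:
        notes.append("mutations: should not exceed curves per ensemble")
    # Only the upper guideline is enforced for crossings. The lower one
    # ("around half the number of ensembles") is loose: the default of 20
    # for 50 ensembles already sits slightly below half.
    if s.crossings > s.ensembles:
        notes.append("crossings: should not exceed the number of ensembles")
    return notes


def rg_interval(dmax_interval):
    """Rg histogram interval, which is 1/4 of the Dmax interval."""
    return dmax_interval / 4.0
```

With the default values check_settings returns an empty list, and the default Dmax interval of 2.000 gives an Rg interval of 0.5.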

Output files: the program will create a subfolder with all the files that it creates. All subfolders have a name in the form GAnum where num is the sequential number for every time you run the program. In each subfolder the following files can be found.

  • GAnum.log is the log file that shows the files used, the parameter values etc. A copy of this file can also be found in the original folder, so that the contents of each subfolder can be seen without having to look inside it.
  • best_curvenum.txt is a file showing the number of times each curve-structure was selected in the final generation of all cycles (curves-structures that were never selected are not shown).
  • selected_ensemnum.txt is a file showing the best 10 ensembles at the final generation of each cycle.
  • profilesZnum.fit is the fit of the selected ensemble of the best cycle to the experimental data Z.dat. It can be opened by Sasplot. If opened with a text editor the χ of all cycles can be seen.
  • Size_distrZnum.dat and Rg_distrZnum.dat are the Dmax/Rg distribution histograms for the experimental data Z.dat compared to the distribution of the pool of all curves-structures created by RanCh. They can be opened using Sasplot and selecting View > AbsY:X to watch them in linear scale (logarithmic Y scale is the default). By opening them with a text editor the average Dmax/Rg values of the pool and the selected structures can also be seen.
  • A subfolder named pdbs which contains the structures that were selected in the cycle with the lowest χ.

If you choose Program mode "1" (Size/Rg Analysis), the Dmax/Rg analysis can be redone using the outputs of a previous GAJOE run. This can be useful in 2 cases: 1) to create new histogram files when there was no appropriate Size_list file when GAJOE finished (and thus no histograms were created) and 2) to change the histogram intervals (in this case the program will overwrite the preexisting files). The questions asked are as follows:

  • Enter the profiles file name ........... < .fit >: The name of the fit file of the GAJOE run you want to analyse (i.e. profilesZnum.fit where Z is the experimental data file name and num is the run number).
  • Experimental file Z
    Enter the Size of the (max. size) histogram intervals < 2.000 >:
    The same question as in the main program ("0") mode, the intervals for the Dmax histogram.
  • Do you want to analyze another construct? (y/n) < No >: If the analysis of another GAJOE run is required yes can be answered (the profiles file name is going to be asked again etc).

Using models made by other methods
Models created by methods or programs other than RanCh can also be used. File format: there is no restriction on the type of models as long as they are in standard PDB format. All the files should have the same common extension, e.g. X.pdb, and be numbered either in the form 1X.pdb, 2X.pdb, ..., 10X.pdb, ..., 100X.pdb, ..., 1000X.pdb, ..., 10000X.pdb, ... (i.e. the length of the file name varies with the size of the number) or 00001X.pdb, 00002X.pdb, ..., 00010X.pdb, ..., 00100X.pdb, ..., 01000X.pdb, ..., 10000X.pdb, ... (i.e. 5 digit number in all cases).
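The two numbering schemes described above can be produced with a few lines of Python. This is a sketch with made-up function names, not a tool that comes with EOM. It assumes numbering starts at 1 and that the existing models should be numbered in alphabetical order of their current names.

```python
def model_file_name(index, common_ext, padded=False):
    """Name for model number `index`, e.g. 10X.pdb or 00010X.pdb."""
    if index < 1:
        raise ValueError("model numbers start at 1")
    if padded and index > 99999:
        raise ValueError("the padded form has room for 5 digits only")
    number = f"{index:05d}" if padded else str(index)
    return number + common_ext


def rename_plan(existing_names, common_ext, padded=False):
    """Map each existing file name to its new numbered name."""
    ordered = sorted(existing_names)
    return {old: model_file_name(i, common_ext, padded)
            for i, old in enumerate(ordered, start=1)}
```

The mapping returned by rename_plan can then be applied with os.rename once it has been inspected.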
Calculation of theoretical intensities: run CRYSOL on the *X.pdb files using the default parameters. (This can be done by typing crysol26.exe *.pdb, which runs CRYSOL on all the pdb files of the current folder.) Three types of files will be created: *X00.alm, *X00.log and *X00.int. The *.alm files are not used by GAJOE, so they can be discarded.
Creation of the intensities master file: while it is possible to use the original *.int files it is advisable to avoid doing so and rather create a master file junX00.int (the same type of file that RanCh creates) using the small program (with the very inspired name) ONEFILE2.exe. In the query input the common extension of the intensity files (eg. X00.int).
Creation of the Size_list file: in the same folder where the *X00.log files are located, run GAJOE and answer "2" to the first question. Enter the name of the experimental data file (e.g. Z.dat) as well as the extension of the log files (e.g. X00.log). The output file will be named Size_listZ.txt (the same type of output file as from RanCh).
Run GAJOE as before: make sure to keep the experimental data file (Z.dat), the intensities master file (junX00.int) and the file containing the Sizes/Rgs of the pool (Size_listZ.txt) in the same folder (there is no RanchX.log file in this case). A subfolder named pdbs containing the models can also be placed in the same folder so that the program automatically copies the selected models to the output subfolder GAnum.
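Before starting GAJOE with externally generated models, it may help to confirm that the three files named above are really in one folder. The Python sketch below is a hypothetical helper, not part of EOM. It assumes X is the common extension without .pdb and Z is the data file name without .dat.

```python
from pathlib import Path


def missing_gajoe_inputs(folder, dat_name, ext_id):
    """Return the names of the expected GAJOE input files that are absent."""
    stem = Path(dat_name).stem                 # Z from Z.dat
    needed = [
        dat_name,                              # experimental data, Z.dat
        f"jun{ext_id}00.int",                  # intensities master file
        f"Size_list{stem}.txt",                # sizes/Rgs of the pool
    ]
    return [name for name in needed if not (Path(folder) / name).is_file()]
```

An empty list means all three inputs were found. The optional pdbs subfolder is not checked here.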



Last edited :

Thursday, 10 August, 2006

© Copyright BioSAXS Group 2004