EOM: Ensemble Optimization Method
Two programs are needed to run the genetic algorithm that selects the ensembles that fit the experimental data.
RanCh (Random Chain) creates the random models of intrinsically unfolded or multidomain proteins with flexible linkers. GAJOE (Genetic Algorithm Judging Optimization of Ensembles) is the selection program. Preferably both should be kept in the same directory (or in an executables directory).
Additionally, the program ONEFILE is provided for those who want to use their own models.
Files required:
- The subtracted experimental data file (the original *.dat, NOT the GNOM *.out file). ASCII format in which the first column contains the s values, the second column the intensities and the third the experimental errors (if they exist).
- A text file with the sequence in FASTA format (*.seq).
- If the protein is multidomain then the *.pdb files of the folded domains. CAUTION: It is absolutely required that the sequence of these domains corresponds exactly to a specific part of the sequence given.
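The CAUTION above can be checked before running anything. What follows is only a minimal sketch, not part of EOM: the helper names are made up for illustration, and it assumes a single-chain domain file with standard residue names and no alternate locations. It reads the one-letter sequence from the CA atoms of a pdb file and looks for it as an exact substring of the FASTA sequence.

```python
# Hypothetical helpers for illustration; only the fixed PDB column layout is relied on.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def fasta_sequence(text):
    """Join all non-header lines of a FASTA text into one upper-case string."""
    lines = (line.strip() for line in text.splitlines())
    return "".join(l for l in lines if l and not l.startswith(">")).upper()


def pdb_sequence(pdb_text):
    """One-letter sequence taken from the CA atoms of the ATOM records.

    Assumes one chain and no alternate locations; unknown residues become 'X'.
    """
    residues = []
    for line in pdb_text.splitlines():
        # Atom name is in columns 13-16, residue name in columns 18-20.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            residues.append(THREE_TO_ONE.get(line[17:20], "X"))
    return "".join(residues)


def domain_position(full_sequence, domain_sequence):
    """0-based start of the domain in the full sequence, or -1 if there is no exact match."""
    if not domain_sequence:
        return -1
    return full_sequence.find(domain_sequence)
```

A result of -1 means the domain sequence does not occur exactly in the *.seq file, which is the situation the CAUTION warns about.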
Quick start guide
A quick explanation of how to run the programs using default parameters. A more detailed explanation of all program questions and parameters follows.
RanCh:
- Type of models to create
[R]andom coil (default) or [N]ative-like < Random >: Keep the default mode.
- Input sequence file name ............... < .seq >: The FASTA sequence file name.
- Number of domains ...................... < 0 >: Number of folded domains. 0 for completely unfolded proteins.
- Domain number X
Input pdb file name .................... <.pdb >: The file name of folded domain X.
- Total number of structures to generate . < 10000 >: Keep the default number.
- Order of harmonics ..................... < 15 >: Leave default value.
- Extension of new pdb files ............. <.pdb >: A sort of "ID" for the created files.
- Experimental data filename ............. < .dat >: The experimental data file name.
GAJOE:
- Program mode?
0 = Ensemble Optimization Method (default) 1 = Size/Rg Analysis
Enter your option (0-2) ................ < 0 >: Choose "0".
- Input the experimental file name ....... <.dat >: The experimental data file name.
- Input the name of the intensities file . <.int >: The junX00.int file name that was created by RanCh where X is the "ID" (extension) RanCh asked for.
- Do you want to change the settings? (y/n) < No >: Choose "n" for default parameters.
- Enter the Size of the (max. size) histogram intervals < 2.000 >: The interval of the Dmax histogram. It should be adjusted according to the number of amino acids in the protein.
DETAILED DESCRIPTION
Running RanCh: it should be pretty straightforward but let's look at the questions asked by the program.
- Type of models to create
[R]andom coil (default) or [N]ative-like < Random >: The Random coil
models are on average more extended and have a random coil CA-angle
distribution, while native-like ones resemble the distribution of folded proteins
(although they are still more extended than a normal folded protein). Change the
default only when the protein's linkers are expected to be relatively compact
(or if the random coil models seem to be too big to fit the data).
- Input sequence file name ............... < .seq >: Input the file name of the
sequence text file.
- Number of domains ...................... < 0 >: Number of folded domains in the
protein, leave 0 if the protein is considered to have no folded domains.
- Total number of structures to generate . < 10000 >: Number of random
structures to generate. Normally default 10000 should be OK.
- Order of harmonics ..................... < 15 >: Using higher harmonics will make
the calculated scattering patterns more accurate especially at higher angles but
it makes the program significantly slower. Maximum 50.
- Extension of new pdb files ............. <.pdb >: This is the extension for the pdb
files that will be created by the program. By default, you will see no pdb files
when the program finishes running but it is used as a sort of "ID" for the
output files that it creates (see below for the explanation of the output).
- Experimental data filename ............. < .dat >: The name of the experimental
data file you are going to use.
RanCh creates three output files:
- junX00.int where X is the extension of the new pdb files. This file contains the
intensities of the random models created by RanCh calculated in the same way
as in CRYSOL.
- RanchX.log where X is the extension of the new pdb files. This is a log file
that is also needed by GAJOE.
- Size_listZ.txt where Z is the experimental data file name as entered in the last
question.
CAUTION: Changing the names of the output files could cause the programs to function
improperly since they rely on naming assumptions. This is especially important for
the fixed part of the file names (e.g. jun, Size_list etc.). Manually changing the varying
part of the names (in this example X and Z) is allowed. The only reason to change
these, though, is in the case of Size_listZ.txt, where you might have mistyped the
experimental file name or you want to use more than one experimental file. In this
case you should create copies of the same Size_list file and name them accordingly, i.e. if
you have the experimental files Z1.dat, Z2.dat and Z3.dat the size list files should be
named Size_listZ1.txt, Size_listZ2.txt and Size_listZ3.txt.
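Making the copies by hand is easy to get wrong, so here is a small sketch of the renaming rule described above. It is not part of EOM; the function names are hypothetical, and it assumes that Z is the experimental file name without its .dat extension, as in the Z1.dat example.

```python
import shutil
from pathlib import Path


def size_list_name(dat_file):
    """Size_listZ.txt for an experimental file Z.dat (naming rule from this guide)."""
    return "Size_list" + Path(dat_file).stem + ".txt"


def copy_size_list(existing, dat_files, folder="."):
    """Copy one existing Size_list file so that every experimental file has its own.

    Files that already exist are left untouched. Returns the expected file names.
    """
    folder = Path(folder)
    source = folder / existing
    names = []
    for dat in dat_files:
        target = folder / size_list_name(dat)
        if target != source and not target.exists():
            shutil.copyfile(source, target)
        names.append(target.name)
    return names
```

For example, copy_size_list("Size_listZ1.txt", ["Z1.dat", "Z2.dat", "Z3.dat"]) would leave Size_listZ1.txt as it is and create the other two copies next to it.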
Extra option: By default RanCh does not save the pdbs it creates. If you want to have
the pdbs you should run the program with the option /t, i.e. in the command line type:
RanCh10.exe /t.
Running RanCh with command line options:
/l: To enable command line mode
/seq *.seq: The sequence file name
/pdb number *.pdb *.pdb...: Number of folded domains and their corresponding
file names. You can omit this option if the protein has no folded domains; the
program will then automatically assume 0.
The following arguments are optional:
/mod r/n: Use r for random coil and n for native-like models. Assumes random coil if
the option is not present.
/ext X.pdb: The extension of the pdb files. Assumes a.pdb if the option is not present.
/dat Z.dat: The experimental data file name. Assumes a.dat if the option is not
present.
/nst number: Number of structures to be generated. Assumes 10000 if the option is
not present.
/lm number: Order of harmonics. Assumes 15 if the option is not present.
/t: The /t option is also compatible with the command line mode (but it should be
AFTER the /l option).
Example: ranch13.exe /l /seq sequence.seq /pdb 3 one.pdb two.pdb three.pdb /mod n
/ext ss.pdb /dat expdata.dat /nst 1000 /lm 30
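When RanCh is run from a script, it can help to assemble the argument list in one place. The sketch below only uses the options listed above; the function itself is hypothetical, and placing /t directly after /l is an assumption (the text only says it must come after /l).

```python
def ranch_command(exe, seq, domains=(), mod=None, ext=None, dat=None,
                  nst=None, lm=None, save_pdbs=False):
    """Build the RanCh argument list for command line mode.

    Optional values left as None are not passed, so RanCh uses its own defaults.
    """
    if mod is not None and mod not in ("r", "n"):
        raise ValueError("mod must be 'r' (random coil) or 'n' (native-like)")
    if lm is not None and int(lm) > 50:
        raise ValueError("order of harmonics is limited to 50")

    args = [exe, "/l"]
    if save_pdbs:
        args.append("/t")  # assumption: right after /l satisfies "AFTER the /l option"
    args += ["/seq", seq]
    if domains:
        args += ["/pdb", str(len(domains)), *domains]
    for flag, value in (("/mod", mod), ("/ext", ext), ("/dat", dat),
                        ("/nst", nst), ("/lm", lm)):
        if value is not None:
            args += [flag, str(value)]
    return args


# Rebuilds the example command line shown above.
example = ranch_command(
    "ranch13.exe", "sequence.seq",
    domains=["one.pdb", "two.pdb", "three.pdb"],
    mod="n", ext="ss.pdb", dat="expdata.dat", nst=1000, lm=30,
)
print(" ".join(example))
```

The returned list can be passed as it is to subprocess.run from the Python standard library.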
Running GAJOE: a look at the questions of GAJOE and their meaning.
- Program mode?
0 = Ensemble Optimization Method (default)
1 = Size/Rg Analysis
2 = Create Size_list file
Enter your option (0-2) ................ < 0 >: Choose the program mode. Let's
first take a look at the default "0" mode.
- Input the experimental file name ....... <.dat >: The experimental data filename.
- Input the name of the intensities file . <.int >: The junX00.int file name.
- Do you want to change the settings? (y/n) < No >: If you choose the default
No the program will not ask any more questions and will start running the
algorithm using default values for all the parameters (which should be OK for
most cases). If you press y/yes then you enter Advanced Mode and you can
tweak the values of each parameter.
- Enter number of generations ............ < 1000 >: Select the number of
generations the genetic algorithm can use to optimize the ensemble. Maximum
10,000. It is NOT recommended to use fewer than 1000 generations.
- Enter number of ensembles .............. < 50 >: Number of ensembles of
theoretical curves (calculated from the structures created by RanCh) to
compare with experimental data. Avoid numbers bigger than 200.
- Enter number of curves per ensemble .... < 20 >: Number of curves that
compose one ensemble. Again, numbers above 200 probably make no
sense. Setting this number very low can help determine the
minimum number of curves-structures that are needed to fit the data.
- Enter maximum number of mutations per ensemble < 10 >: Should be
normally around half of the number of curves per ensemble. Avoid using a
number bigger than the number of curves per ensemble.
- Enter number of crossings per generation < 20 >: This number should have
values from around half the number of ensembles up to the number of
ensembles.
- Enter number of times you want the process repeated < 50 >: This asks how
many times (cycles) the genetic algorithm process is going to be repeated. If it
is just a test run trying different parameter values it should probably be
changed to 1, but for Size/Rg analysis it is advisable to keep it at 50 or maybe
even 100.
- Do you want to allow repetitions? (y/n) < Yes >: If yes, then one
ensemble can contain the same structure-curve more than once. Should be fine
as yes.
- Allow constant subtraction? (y/n) ..... < Yes >: If yes then it is allowed to
adjust the fitting of the experimental data to the theoretical ensemble using a
constant. It probably needs to be kept enabled, especially in the case of
completely unfolded systems. Alternatively program runs with constant
subtraction disabled can be performed and compared with program runs with
it enabled.
- Create convergence file? (y/n) ......... < No >: The creation of the convergence
file should only be enabled when checking at which generation the χ² is
sufficiently small (it is also advisable to use it only in one-cycle runs to avoid
creating a huge convergence file). If, for example, the program is run for
10,000 generations and the χ² improves slightly only 3 times after e.g. 1000
generations, it is probably a waste of time to use 10,000 generations.
- Create analysis files? (y/n) ........... < Yes >: If yes it will create Size/Rg
distribution histograms when the program is finished. If it does not find the
appropriate Size_list file it will create nothing. If it does find it though it will
ask the following question.
- Experimental file Z
Enter the Size of the (max. size) histogram intervals < 2.000 >: The size of the
Dmax (maximum size) histogram intervals can be selected here. The Rg
intervals are going to be 1/4 of the Dmax intervals. This is asked after the run is
finished.
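The limits and rules of thumb given for the Advanced Mode questions can be collected in a short check. This is only a sketch with hypothetical names, not something GAJOE provides. It tests the stated limits only; the "around half" recommendations are left out because the text gives no exact bound for them.

```python
from dataclasses import dataclass


@dataclass
class GajoeSettings:
    # Defaults are the values shown in the GAJOE prompts above.
    generations: int = 1000
    ensembles: int = 50
    curves_per_ensemble: int = 20
    max_mutations: int = 10
    crossings: int = 20
    cycles: int = 50


def check_settings(s):
    """Return a list of warnings for values outside the ranges given in this guide."""
    warnings = []
    if s.generations > 10000:
        warnings.append("generations above the maximum of 10,000")
    if s.generations < 1000:
        warnings.append("fewer than 1000 generations is not recommended")
    if s.ensembles > 200:
        warnings.append("more than 200 ensembles should be avoided")
    if s.curves_per_ensemble > 200:
        warnings.append("more than 200 curves per ensemble probably makes no sense")
    if s.max_mutations > s.curves_per_ensemble:
        warnings.append("more mutations than curves per ensemble")
    if s.crossings > s.ensembles:
        warnings.append("more crossings than ensembles")
    return warnings


def rg_interval(dmax_interval):
    """Rg histogram interval, which is 1/4 of the Dmax interval."""
    return dmax_interval / 4.0
```

With the default Dmax interval of 2.000, rg_interval gives 0.5, and check_settings(GajoeSettings()) returns an empty list for the default values.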
Output files: the program will create a subfolder with all the files that it creates. All subfolders
have a name in the form GAnum where num is the sequential number for every time
you run the program. In each subfolder the following files can be found.
- GAnum.log is the log file that shows the files used, the parameter values etc.
A copy of this file can also be found in the original folder, so that it is easy
to see what is in each subfolder without having to open it.
- best_curvenum.txt is a file showing the number of times each curve-structure
was selected in the final generation of all cycles (curves-structures that were
never selected are not shown).
- selected_ensemnum.txt is a file showing the best 10 ensembles at the final
generation of each cycle.
- profilesZnum.fit is the fit of the selected ensemble of the best cycle to the
experimental data Z.dat. It can be opened by Sasplot. If opened with a text
editor the χ of all cycles can be seen.
- Size_distrZnum.dat and Rg_distrZnum.dat are the Dmax/Rg distribution histograms
for the experimental data Z.dat compared to the distribution of the pool of all
curves-structures created by RanCh. They can be opened using Sasplot and
selecting View > AbsY:X to view them on a linear scale (logarithmic Y scale is
the default). By opening them with a text editor the average Dmax/Rg values of
the pool and the selected structures can also be seen.
- A subfolder named pdbs which contains the structures that were selected in the
cycle with the lowest χ.
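The naming pattern of the output files listed above can be written down as a small lookup, which is handy when collecting results from several runs. This is a sketch with a hypothetical function name. It takes num exactly as it appears in the subfolder name, because this guide does not say whether the number is zero-padded.

```python
def gajoe_output_names(z, num):
    """Expected GAJOE output names for experimental file Z.dat and run label num.

    z   : experimental data file name without the .dat extension
    num : run label as a string, copied from the GAnum subfolder name
    """
    folder = f"GA{num}"
    return {
        "folder": folder,
        "log": f"{folder}.log",
        "best_curve": f"best_curve{num}.txt",
        "selected_ensembles": f"selected_ensem{num}.txt",
        "fit": f"profiles{z}{num}.fit",
        "size_distribution": f"Size_distr{z}{num}.dat",
        "rg_distribution": f"Rg_distr{z}{num}.dat",
        "models": "pdbs",
    }
```

For a run labelled 1 on Z.dat this gives, for instance, profilesZ1.fit and Size_distrZ1.dat inside the subfolder GA1.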
If you choose Program mode "1" (Size/Rg Analysis) the Dmax/Rg
analysis can be redone using the outputs of a previous GAJOE run. This can
be useful in 2 cases: 1) to create new histogram files when there was no
appropriate Size_list file when GAJOE finished (and thus no histograms were
created) and 2) to change the histogram intervals (in this case the program will
overwrite the preexisting files). The questions asked are as follows:
- Enter the profiles file name ........... < .fit >: The name of the fit file of the
GAJOE run you want to analyse (i.e. profilesZnum.fit where Z is the
experimental data file name and num is the run number).
- Experimental file Z
Enter the Size of the (max. size) histogram intervals < 2.000 >: The same
question as in the main program ("0") mode, the intervals for the Dmax
histogram.
- Do you want to analyze another construct? (y/n) < No >: If the analysis of
another GAJOE run is required yes can be answered (the profiles file name is
going to be asked again etc).
Using models made by other methods
Models created by methods or programs other than RanCh can also be used.
File format: there is no restriction on the type of models as long as they are in
standard PDB format. All the files should have the same common extension eg. X.pdb
and be numbered either in the form 1X.pdb, 2X.pdb, ..., 10X.pdb, ..., 100X.pdb, ...,
1000X.pdb, ..., 10000X.pdb, ... (i.e. the length of the file name varies with the size of
the number) or 00001X.pdb, 00002X.pdb, ..., 00010X.pdb, ..., 00100X.pdb, ...,
01000X.pdb, ..., 10000X.pdb, ... (i.e. 5 digit number in all cases).
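The two numbering schemes described above can be produced and checked with a few lines. This is just a sketch with hypothetical helper names; ext stands for the common extension (e.g. X.pdb).

```python
def model_names(count, ext, padded=False):
    """File names 1..count in the plain (1X.pdb) or 5-digit (00001X.pdb) scheme."""
    names = []
    for i in range(1, count + 1):
        number = str(i).zfill(5) if padded else str(i)
        names.append(number + ext)
    return names


def numbering_scheme(names, ext):
    """Return 'padded', 'plain' or None if the names follow neither scheme.

    A set in which every number has exactly 5 digits is reported as 'padded',
    since e.g. 10000X.pdb looks the same in both schemes.
    """
    prefixes = []
    for name in names:
        if not name.endswith(ext):
            return None
        prefix = name[:-len(ext)]
        if not prefix.isdigit():
            return None
        prefixes.append(prefix)
    if all(len(p) == 5 for p in prefixes):
        return "padded"
    if all(not p.startswith("0") for p in prefixes):
        return "plain"
    return None
```

A return value of None points to a mix of the two schemes or to a file that does not carry the common extension, either of which should be fixed before going on.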
Calculation of theoretical intensities: run CRYSOL on the *X.pdb files using the
default parameters. (It can be done by typing crysol26.exe *.pdb - this will run
CRYSOL on all the files of the current folder). Three types of files will be created:
*X00.alm, *X00.log and *X00.int. The *.alm files are not used by GAJOE so they
can be discarded.
Creation of the intensities master file: while it is possible to use the original *.int files
it is advisable to avoid doing so and rather create a master file junX00.int (the same
type of file that RanCh creates) using the small program (with the very inspired name)
ONEFILE2.exe. When prompted, enter the common extension of the intensity files (e.g.
X00.int).
Creation of the Size_list file: in the same folder where the *X00.log files are located
run GAJOE and answer "2" to the first question. Enter the name of the experimental
data file (e.g. Z.dat) as well as the extension of the log files (e.g. X00.log). The
output file will be named Size_listZ.txt (the same type of output file as from RanCh).
Run GAJOE as before: Make sure to keep the experimental data file (Z.dat), the
intensities master file (junX00.int) and the file containing the Sizes/Rgs of the pool
(Size_listZ.txt) in the same folder (there is no RanchX.log file in this case). A
subfolder named pdbs containing the models can also be placed in the same folder so
that the program automatically copies the selected models to the output subfolder
GAnum.
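Before starting GAJOE on externally made models, a quick check that the three files are really in the same folder can save a failed run. This is a sketch with a hypothetical function name; it assumes X and Z are used as in the file names above (junX00.int, Size_listZ.txt).

```python
from pathlib import Path


def missing_inputs(folder, dat_file, ext_id):
    """Return the names of the required GAJOE input files not found in folder.

    dat_file : experimental data file name, e.g. "Z.dat"
    ext_id   : the X part of junX00.int
    """
    folder = Path(folder)
    z = Path(dat_file).stem
    required = [dat_file, f"jun{ext_id}00.int", f"Size_list{z}.txt"]
    return [name for name in required if not (folder / name).is_file()]
```

An empty list means all three files are present. The optional pdbs subfolder is not checked here, since GAJOE is described as running without it.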