The following sections briefly describe how to run the PIPELINE from the command line on any of the supported platforms, the required input files and the output files produced.
If you use results of the PIPELINE in your own publication, please cite:
The PIPELINE may process and analyse data either in an online mode during the experiment or in an offline mode on data already collected. Each data processing step (e.g. background subtraction, Guinier analysis etc.) is represented by a pipeline component that employs one or more tasks, i.e. stand-alone command-line programs, to perform the actual operations. The tasks may be run sequentially or in parallel. The pipeline components communicate with each other by passing messages. A message is sent when a particular event occurs, e.g. when a new file becomes available for processing or when a pipeline component has finished processing a file. A single message may be received by several components. Connecting components in this way makes it possible to adapt the behaviour of the pipeline to different requirements, e.g. by including or excluding certain steps. If a component produces one or more output files, their location is contained in the message; thus, the output files of one component can serve as input files for other components.
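The message passing described above can be pictured with a small Python sketch. It is an illustration only and not the PIPELINE's implementation: the class `Component`, the methods `connect`, `receive` and `emit`, and the layout of the message are all hypothetical.

```python
class Component:
    """Toy pipeline component; all names here are hypothetical."""

    def __init__(self, name):
        self.name = name
        self.receivers = []   # components that get our messages
        self.received = []    # messages this component has seen

    def connect(self, receiver):
        # One sender may have several receivers.
        self.receivers.append(receiver)

    def receive(self, message):
        self.received.append(message)
        # "Process" the file, then announce the output file location.
        output = message["file"] + "." + self.name
        self.emit({"sender": self.name, "file": output})

    def emit(self, message):
        for receiver in self.receivers:
            receiver.receive(message)


watcher = Component("watcher")
subtract = Component("sub")
guinier = Component("guinier")
porod = Component("porod")

watcher.connect(subtract)
subtract.connect(guinier)   # the same message reaches both receivers
subtract.connect(porod)

watcher.emit({"sender": "watcher", "file": "frame_001.dat"})
```

Removing one `connect` call excludes that step without touching the other components, which is the flexibility the text refers to.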
It is possible to add custom components and tasks to the PIPELINE and include them in the configuration with the ATSAS tools. If you are interested in adding such custom components to your pipeline setup, please contact us for the developer documentation.
Working directory. If specified, the PIPELINE process changes to this working directory prior to executing any task. Default: none, i.e. the current directory is used.
The configuration file configures the pipeline components and how the components are connected to each other.
The file format is XML. The root element <pipelinerc> shall contain exactly one <components> element and exactly one <connections> element. The <components> element may contain multiple <component> elements, which in turn may contain <option> elements. The <connections> element may contain multiple <connection> elements.
Here COMPONENT_CLASS refers to the class (type) of the component and COMPONENT_NAME shall be a unique name which may be defined by the author of the configuration file. The following rules apply:
component classes may be used multiple times in the same pipeline configuration;
component names must be unique;
components may have any number of options;
option names must be unique within a component;
names used in connections must be defined in components beforehand;
any component may receive messages from any number of other components.
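Several of these rules can be checked mechanically. The sketch below uses Python's standard `xml.etree.ElementTree`; the element names come from the description above, but the attribute names `class`, `name`, `value`, `sender` and `receiver`, as well as the example configuration itself, are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration: the attribute names "class", "name",
# "value", "sender" and "receiver" are assumptions of this sketch.
EXAMPLE = """
<pipelinerc>
  <components>
    <component class="filesystemwatcher" name="watcher">
      <option name="filter" value="*.dat"/>
    </component>
    <component class="guinier" name="guinier"/>
  </components>
  <connections>
    <connection sender="watcher" receiver="guinier"/>
  </connections>
</pipelinerc>
"""


def check_config(xml_text):
    """Return a list of rule violations found in a configuration."""
    root = ET.fromstring(xml_text)
    errors = []
    names = set()
    for comp in root.iterfind("components/component"):
        name = comp.get("name")
        if name in names:
            errors.append("duplicate component name: %s" % name)
        names.add(name)
        options = set()
        for opt in comp.iterfind("option"):
            if opt.get("name") in options:
                errors.append("duplicate option %s in %s"
                              % (opt.get("name"), name))
            options.add(opt.get("name"))
    # Names used in connections must be defined in components.
    for conn in root.iterfind("connections/connection"):
        for end in ("sender", "receiver"):
            if conn.get(end) not in names:
                errors.append("undefined component: %s" % conn.get(end))
    return errors


print(check_config(EXAMPLE))   # prints []
```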
It is possible to use an XML schema to validate the configuration file. The validation already includes checks for many of the constraints listed above. The validation by schema can be done in two ways, (1) by adding the schema to the root element and letting the pipeline do the validation on startup:
The abinitio component may be used to build one or more ab initio models from regularized data, i.e. it requires the .out file of the distances component as an input.
For each input file, DAMMIF is run repeat times in mode configuration and all its output files are placed in outputdir. If it is known that all input files come in inverse Angstroms or inverse nanometres, the unit may be fixed in the configuration file to rule out incorrect unit estimates.
If more than one model has been computed by DAMMIF, DAMAVER is run to refine the modelling. The resulting DAMSTART model will be used as an input for DAMMIN which is employed to do the final refinement step.
Note that the complete modelling process may take a very long time. For online processing it is recommended to set repeat to 1.
| Option | Default | Accepted Range | Description |
|---|---|---|---|
| repeat | 1 | 1 to 20 | Number of models generated for each input file. |
| group | processed | | Identifier used by the serializer component to group results to output files. |
| outputdir | (none) | | The directory for the output files of DAMMIF and DAMMIN. |
The classifier component attempts to classify experimental data, i.e. determine to which distinct group it probably belongs. Currently only a classification of folded/unfolded is implemented. The output is a binary value, indicating whether the specimen is more likely folded (1) or unfolded (0).
For normalization purposes the radius of gyration (Rg) and the forward scattering (I(0)) are required. The classifier component runs AUTORG to obtain these values. If the Rg/I(0) could not be determined, no classification is attempted.
For comparable results, provide the same values for options here as for the guinier component.
| Option | Default | Accepted Range | Description |
|---|---|---|---|
| mininterval | 10 | 2 to Inf | Minimum Rg interval length in data points. |
| smaxrg | 1.3 | 0.0 to Inf | Maximum Smax*Rg value. |
| sminrg | 1.0 | 0.0 to Inf | Minimum Smin*Rg value. |
| group | processed | | Identifier used by the serializer component to group results to output files. |
The distances component computes the inverse transformation to obtain the p(r) from experimental scattering data by means of GNOM.
As an input it requires a subtracted data file and provides the regularized data and p(r) information as an output.
The file is placed in outputdir.
For normalization purposes the radius of gyration (Rg) is required.
The distances component runs AUTORG to obtain this value.
If the Rg could not be determined, or the estimated quality is below minquality, no indirect transformation is attempted.
For comparable results, provide the same values for options here as for the guinier component.
| Option | Default | Accepted Range | Description |
|---|---|---|---|
| sminrg | 1.0 | 0.0 to Inf | Minimum Smin*Rg value. |
| mininterval | 10 | 2 to Inf | Minimum Rg interval length in data points. |
| minquality | 0.0 | 0.0 to 1.0 | |
| group | processed | | Identifier used by the serializer component to group results to output files. |
The hplcsubtract component receives signals from radaver or datwatcher. When there are no new frames within 3 seconds, it runs chromixs in automatic mode and saves results in the workdir directory.
| Option | Default | Accepted Range | Description |
|---|---|---|---|
| group | processed | | Identifier used by the serializer component to group results to output files. |
| outputdir | None | | The target base directory for the processed files. Default is /analysis/processed/chromixs. |
The filesystemwatcher component monitors a directory of a local file system, the workdir, for files matching the defined globbing filter. For each matching file a message is sent to the connected components, informing them of the file found. Only files not previously reported are flagged, and only up to batch files at a time. This is repeated after every interval. If the interval is set to 0, the workdir is checked only once and no more updates are performed. This corresponds to an offline mode where all input files are available at the start of PIPELINE.
Each filesystemwatcher monitors exactly one directory; subdirectories are not monitored. To watch for files in multiple directories, use multiple instances of filesystemwatcher.
| Option | Default | Accepted Range | Description |
|---|---|---|---|
| filter | * | | A filter applied to the files found in workdir. The default accepts all files; any reasonable globbing pattern may be used, e.g. '*.dat' to select only files with extension '.dat'. |
| workdir | . | | The directory to check for input files. If this is a relative path, it is relative to the working directory of the application. Default is the current working directory. |
| batch | 2147483647 | 0 to Inf | Batch size, i.e. the maximum number of new entries to be reported after each interval. Particularly useful for testing and debugging; set interval to 60000 and batch to 20 to simulate beamline operation with one dataset of 20 frames per minute. |
| interval | 1000 | 0 to Inf | Polling interval of workdir in milliseconds. If 0, polling is disabled and the pipeline works in offline mode, assuming that all files are available at startup. |
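The interplay of filter, batch and interval in the filesystemwatcher can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the component's code: the functions `poll_once` and `watch` are hypothetical, and files are reported in alphabetical order, which the text does not specify.

```python
import fnmatch
import os
import time


def poll_once(workdir, pattern, seen, batch):
    """Return up to `batch` matching files not reported before."""
    new = []
    for entry in sorted(os.listdir(workdir)):
        if len(new) >= batch:
            break
        path = os.path.join(workdir, entry)
        # Skip files already reported and anything that is not a
        # regular file (subdirectories are not monitored).
        if entry in seen or not os.path.isfile(path):
            continue
        if fnmatch.fnmatch(entry, pattern):
            seen.add(entry)
            new.append(path)
    return new


def watch(workdir=".", pattern="*", batch=2147483647, interval=1000):
    """Yield batches of new files; interval is in milliseconds."""
    seen = set()
    while True:
        yield poll_once(workdir, pattern, seen, batch)
        if interval == 0:      # offline mode: check only once
            return
        time.sleep(interval / 1000.0)
```

With `interval=0` the generator yields a single batch and stops, which mirrors the offline mode; with a positive interval it keeps polling until the caller stops iterating.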
The guinier component attempts to determine the radius of gyration (Rg) and the forward scattering (I(0)) from experimental data using AUTORG. As an input it requires a subtracted data file scaled for concentration.
The minquality option is not used by this component, but is forwarded only to the output to facilitate a configurable means to decide what quality is still considered acceptable.
If the forward scattering and molecular mass of a standard protein are provided (standard-i0, standard-mw), then an approximation of the molecular weight of the current specimen is computed as well.
| Option | Default | Accepted Range | Description |
|---|---|---|---|
| sminrg | (none) | 0.0 to Inf | Minimum Smin*Rg value. |
| mininterval | (none) | 2 to Inf | Minimum Rg interval length in data points. |
| minquality | 0.0 | 0.0 to 1.0 | |
| standard-mw | (none) | 0.0 to Inf | |
| standard-i0 | (none) | 0.0 to Inf | |
| group | processed | | Identifier used by the serializer component to group results to output files. |
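The molecular weight approximation mentioned for the guinier component is commonly obtained by scaling the molecular weight of the standard by the ratio of the forward scattering values. The Python sketch below shows this relation; it assumes that both I(0) values are normalized to concentration and measured under the same conditions. The function name and the numbers are hypothetical, and the exact formula used by the component is not given in this text.

```python
def estimate_mw(i0_sample, standard_i0, standard_mw):
    """Scale the standard's molecular weight by the ratio of I(0) values.

    Assumes both I(0) values are normalized to concentration and were
    measured under the same conditions.
    """
    if standard_i0 <= 0:
        raise ValueError("standard I(0) must be positive")
    return i0_sample / standard_i0 * standard_mw


# Hypothetical numbers: a standard with MW 66 kDa and I(0) = 0.05,
# and a sample with I(0) = 0.025, i.e. half the forward scattering.
print(estimate_mw(0.025, 0.05, 66.0))
```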
Time to wait between a data update and its serialization to the database in milliseconds. If there is an update before the interval expired, the wait time starts again.
| Option | Default | Accepted Range | Description |
|---|---|---|---|
| number | (none) | | Proposal number. |
| experimentid | (none) | | ISPyB experiment id where results will be added to. If undefined, a new experiment id will be generated. |
The membraneispyb component performs an automatic analysis of the a priori data available in ISPyB and, depending on the result, runs either MONSA or MEMPROT.
The component is designed for the automatic SEC-SAXS analysis of membrane proteins in detergent solution. It does not work without the hplcsubtract and ispyb components.
The ispyb component only needs to be declared in the pipeline configuration file; no connection to the membraneispyb component is required. All data exchange with the ISPyB database is performed implicitly inside the module.
To run MEMPROT, a high-resolution structure has to be uploaded to the ISPyB database. Otherwise, MONSA will be started with precalculated starting parameters. Knowledge of the detergent's chemical formulas is mandatory for both types of modeling.
Modeling is based on a combination of a priori information (PDB structure, FASTA sequence, detergent head/tail chemical formulas) and a posteriori SAXS data (Rg, Dmax, Vporod).
Please make sure in advance that all the precalculated starting values in the MONSA and MEMPROT configuration files are adequate, as modeling can take a while.
| Option | Default | Accepted Range | Description |
|---|---|---|---|
| group | processed | | Identifier used by the serializer component to group results to output files. |
| outputdir | (none) | | Path to MEMPROT/MONSA models and configuration files. |
The mixture component uses a priori information to create the form-factor file for OLIGOMER. OLIGOMER in turn fits a scattering curve from a multicomponent mixture of proteins in order to obtain the volume fraction of each component in the mixture.
| Option | Default | Accepted Range | Description |
|---|---|---|---|
| lm | 20 | 10 to 50 | Maximum order of harmonics, defines the resolution of the calculated curve. |
| group | processed | | Identifier used by the serializer component to group results to output files. |
| outputdir | apriori | | The directory where the fit and log files of OLIGOMER are stored. |
The oversubtract component collects information on the subtraction results. It determines the longest sequence of consecutive negative intensity values (LCNS) in the data and calculates a p-value which describes the risk of erroneously considering these data as oversubtracted. This procedure is repeated on the same data for multiple n, with every n data points being averaged by using a stepping window of size n. If oversubtraction is detected in any configuration, i.e. the p-value is below a given significance level alpha, this is flagged on the display of the results.
Furthermore the component provides the computed p-value, the start index and length of the LCNS with respect to the non-averaged data, the window size with which oversubtraction was found as well as a constant describing the degree of oversubtraction in relation to the forward scattering I(0).
| Option | Default | Accepted Range | Description |
|---|---|---|---|
| group | processed | | Identifier used by the serializer component to group results to output files. |
| minwnr | 100 | 1 to Inf | Minimal number of windows required: specifies how many data points are at least required to be present after averaging/windowing in order to perform LCNS analysis. |
| alpha | 0.05 | 0 to 1 | Significance level for the p-value. P-values below alpha indicate oversubtraction. |
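The search for the LCNS with a stepping window can be sketched in Python as follows. The sketch interprets the stepping window as consecutive, non-overlapping blocks of n points, which is an assumption, and it leaves out the p-value because its computation is not described here. All function names are hypothetical.

```python
def window_average(values, n):
    """Average consecutive, non-overlapping blocks of n points."""
    blocks = len(values) // n
    return [sum(values[i * n:(i + 1) * n]) / n for i in range(blocks)]


def longest_negative_run(values):
    """Return (start, length) of the longest run of negative values."""
    best_start, best_len = 0, 0
    start, length = 0, 0
    for i, v in enumerate(values):
        if v < 0:
            if length == 0:
                start = i
            length += 1
            if length > best_len:
                best_start, best_len = start, length
        else:
            length = 0
    return best_start, best_len


def lcns(values, n=1):
    """LCNS after averaging with window n, in non-averaged indices."""
    start, length = longest_negative_run(window_average(values, n))
    return start * n, length * n


intensities = [0.5, -0.25, 0.125, -0.5, -0.25, 0.125, -0.5, 0.5]
print(lcns(intensities, 1))   # (3, 2)
print(lcns(intensities, 2))   # (2, 4)
```

In this made-up data set the averaged curve (n = 2) shows a longer negative stretch than the raw data, which is why the analysis is repeated for several window sizes.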
The porod component attempts to compute the Porod volume from regularized data, i.e. it requires the .out file of the distances component as an input and gives the Porod volume as output.
| Option | Default | Accepted Range | Description |
|---|---|---|---|
| group | processed | | Identifier used by the serializer component to group results to output files. |
The rigidbody component invokes SASREF, which performs rigid body modeling of macromolecular complexes formed by multiple subunits (with known atomic structure) against solution scattering data.
| Option | Default | Accepted Range | Description |
|---|---|---|---|
| group | processed | | Identifier used by the serializer component to group results to output files. |
| outputdir | apriori | | The directory where the fit and log files of oligomer are stored. |
The serializer component handles the writing of an output file in XML format. There may be multiple serializers within a PIPELINE configuration, each serializing their own group of components.
Every component with output needs to be connected to a serializer to ensure correct and complete serialization of the data. To avoid rewrites of the output file in quick succession, a waiting interval may be defined. The serializer will wait interval milliseconds before writing out the output. If in this time more messages from other components arrive, the waiting period starts anew.
If a stylesheet is present, its location is included in the resulting output file to facilitate simple viewing of the data in a web browser. Please note that some browsers will not use a stylesheet with an absolute path for security reasons. Thus a stylesheet should be placed in the same directory as the XML file or in a subdirectory. To disable this check in Firefox:
type about:config in the address bar
change security.fileuri.strict_origin_policy to false
| Option | Default | Accepted Range | Description |
|---|---|---|---|
| output | (none) | | The name of the file where the serialized data shall be stored. May include an absolute or relative path. |
| stylesheet | (none) | | Absolute or relative path to the XSL style sheet to be named in the output file. |
| interval | 5000 | 2000 to Inf | Time to wait between a data update and its serialization in milliseconds. If there is an update before the interval expired, the wait time starts again. |
| group | processed | | Serializes data of all components in the same group into one output file. |
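The restartable waiting interval of the serializer behaves like a debounce timer. The Python sketch below illustrates the idea with explicit timestamps instead of a real clock; the class `Debounce` and its methods are hypothetical and are not part of the PIPELINE.

```python
class Debounce:
    """Restartable wait period, driven by explicit timestamps (ms)."""

    def __init__(self, interval):
        self.interval = interval
        self.deadline = None      # None: nothing waiting to be written

    def update(self, now):
        # Every new message restarts the waiting period.
        self.deadline = now + self.interval

    def due(self, now):
        # True exactly once, when the wait expired without new updates.
        if self.deadline is not None and now >= self.deadline:
            self.deadline = None
            return True
        return False


timer = Debounce(interval=5000)
timer.update(now=0)          # first result arrives
timer.update(now=3000)       # second result restarts the wait
print(timer.due(now=5000))   # False: the deadline moved to 8000
print(timer.due(now=8000))   # True: write the output file now
```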
The stat component collects a configurable list of file header/footer information, e.g. description, code, concentration, exposure time, etc. The information may then be serialized together with the computational results to provide the full information.
The subtracter component collects incoming files, sorts them by the run number extracted from the input file and groups them into samples and buffers, respectively. The grouped files are compared, frames that differ significantly are discarded (see alpha) and the remaining frames are averaged. The averaged buffer(s) closest in time (i.e. run number) with the same code as a sample are subtracted from the sample or, if verifyCodeMatch is disabled, the closest buffer(s) regardless of code. The averaged and subtracted files are written to outputdir.
| Option | Default | Accepted Range | Description |
|---|---|---|---|
| verifyCodeMatch | 1 | 0 to 1 | Whether or not the code must match on subtraction. If not (0), it is allowed to subtract a buffer with a code different from that of the sample. |
| alpha | 0.01 | 0.0 to 1.0 | Significance level for frame comparison. If the probability of similarity is less than alpha, the frame is discarded. |
| group | processed | | Identifier used by the serializer component to group results to output files. |
| outputdir | (none) | | The directory where the subtracted and normalized files are stored. |
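The buffer selection performed by the subtracter can be sketched in Python as follows. The dictionary keys `run` and `code` and the function `closest_buffers` are hypothetical. How the PIPELINE combines several equally close buffers is not stated here, so the sketch simply returns all of them.

```python
def closest_buffers(sample, buffers, verify_code_match=True):
    """Return the averaged buffer(s) nearest in run number to sample.

    `sample` and each buffer are dicts with the hypothetical keys
    "run" (run number) and "code".
    """
    if verify_code_match:
        buffers = [b for b in buffers if b["code"] == sample["code"]]
    if not buffers:
        return []
    nearest = min(abs(b["run"] - sample["run"]) for b in buffers)
    return [b for b in buffers if abs(b["run"] - sample["run"]) == nearest]


buffers = [
    {"run": 10, "code": "bsa"},
    {"run": 13, "code": "lys"},
    {"run": 14, "code": "bsa"},
]
sample = {"run": 12, "code": "bsa"}
print([b["run"] for b in closest_buffers(sample, buffers)])         # [10, 14]
print([b["run"] for b in closest_buffers(sample, buffers, False)])  # [13]
```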
The zeroconc component attempts to deal with concentration effects by extrapolating to zero concentration from experimental data using ALMERGE. As an input it requires a subtracted data set scaled for the respective concentrations. If only two concentrations of the same sample are available, zeroconc will merge them; more than two concentrations will be extrapolated to zero concentration. The extrapolated file is written to outputdir.
The extrapolated file may be subjected to the same analysis as any other experimental data set.
| Option | Default | Accepted Range | Description |
|---|---|---|---|
| sminrg | 1.0 | 0.0 to Inf | Minimum Smin*Rg value. |
| mininterval | 10 | 2 to Inf | Minimum Rg interval length in data points. |
| minquality | 0.5 | 0.0 to 1.0 | |
| overlap | 30 | 2 to Inf | Minimum overlap length in angular points. |
| step | 2 | 1 to Inf | Search step in points; step=1 will find the best overlap but takes most time. |
Each component may produce output files, e.g. .dat, .out or .pdb, and values related to the input data, e.g. radius of gyration or molecular weight estimates.
The output files will be stored according to the configuration of the particular component, typically as defined by <option name="outputdir"> in the configuration file.
Please see the components section on how to configure and connect the serializer component.
The PIPELINE may be extended by customized components to perform tasks of the user's choice. This part of the documentation describes how to extend the PIPELINE, i.e. how to add components and tasks. It also includes an API overview.
After initialization, each component is configured and the configure method of each component instance is called. The rawnode parameter contains a pointer to the section of the configuration file corresponding to the instance of this component. Always call the base class implementation of this method before accessing the individual options through option. There should be no reason to use or modify the rawnode directly.
It is good practice to check here that all values are within their valid ranges, that output directories exist or, if they do not, to create them, etc. It may be helpful to report any results using the message module.
Processing Data
def process(self, metadata):
    data = {}
    filename = metadata.getFileName()
    # Fill in 'data', removed counting code for brevity.
    metadata.setData(self, data)
    component.processed(self, metadata)
During processing, components usually accumulate data that needs to be serialized later. The data may be any Python object, e.g. a single value, a list, a dict or a class object. A reference to this data may be stored in the metadata object, where the MetaData acts like a container that associates data and tasks with a particular file being analysed. Please note: while it is technically possible to access data of components other than the current one, it is strongly discouraged. Each component should be stand-alone and not depend on values of any other.
In this example the actual work is done synchronously in Python, i.e. the data is processed as soon as it comes in, and computation is thus serialized. Do this only if the computations involved are short and the overhead of launching a task would be larger than that of doing the computations inline. Otherwise, to allow for parallelization, asynchronous tasks are the preferred mechanism.
When done with processing, the processed signal shall be sent by calling the corresponding class method to allow for connections between components.
Serializing Results
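import libxml2

def serialize(self, metadata, rawnode):
    # Wrap the raw output node and add a child element for this component.
    node = libxml2.xmlNode(rawnode)
    node = node.newChild(None, "oversubtracted", None)
    data = metadata.getData(self)
    # Write one <value name="..."> element per data entry.
    for key in data.keys():
        subnode = node.newChild(None, "value", str(data[key]))
        subnode.newProp("name", key)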
On serialization, the component's serialize method is called for every metadata object which contains data added by this component. As with the input node during configuration, the rawnode here points to the output node in the XML document to which the data should be serialized.
If a component shall be re-activated at some point later in time, set an interval through this method. After expiry of interval milliseconds, the timeout method is invoked. Each component can have only one timer. Please note that the timeout occurs after every interval milliseconds and if not stopped will prevent the PIPELINE from terminating in offline mode due to one component still being active, i.e. waiting for a timeout.
Use configure and option to make the interval user configurable.