SASFLOW PIPELINE manual

Data Analysis Pipeline

Written by Daniel Franke and Alexey Kikhney.
Post all your questions about the Pipeline to the ATSAS Forum.

Manual
Example
- From background subtraction to modelling
Extending pipeline
- Adding Components
- Adding Tasks
- Pipeline API
  - Builtins
  - Component
  - ComponentData
  - MetaData
  - Task
  - Taskmanager
  - Message

Manual

The following sections briefly describe how to run the PIPELINE from the command line on any of the supported platforms, the input files it requires and the output files it produces.

If you use results of the PIPELINE in your own publication, please cite:

Franke, D., Kikhney, A.G. and Svergun, D.I. (2012) Automated Acquisition and Analysis of Small Angle X-ray Scattering Data. Nuc Inst Meth A 689, 52-59

Introduction

The PIPELINE may either process and analyse data in an online mode during the experiment or evaluate previously collected data in an offline mode. Each data processing step (e.g. background subtraction, Guinier analysis, etc.) is represented by a pipeline component that employs one or more tasks, i.e. stand-alone command-line programs, to perform the actual operations. The tasks may be run sequentially or in parallel. The pipeline components communicate with each other by passing messages. A message is sent when a particular event occurs, e.g. when a new file becomes available for processing or when a pipeline component has finished processing a file. A single message may be received by several components. This way of connecting components makes it possible to modify the behaviour of the pipeline to meet different requirements, e.g. by including or excluding certain steps. If a component produces one or more output files, the file location is contained in the message; thus, the output files of one component can serve as input files for other components.
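The message-passing idea can be illustrated with a small stand-alone sketch. The `Component` class and its methods below are hypothetical and are not the PIPELINE API; they only show how one message carrying a file location may reach several receivers, and how the output of one component becomes the input of the next.

```python
# Illustrative sketch only: this class is NOT the PIPELINE API.
class Component:
    def __init__(self, name):
        self.name = name
        self.receivers = []   # components that receive our messages
        self.received = []    # file locations received so far

    def connect(self, receiver):
        self.receivers.append(receiver)

    def output_for(self, filename):
        # Hypothetical naming rule for the file this component produces.
        return "%s/%s" % (self.name, filename)

    def receive(self, filename):
        self.received.append(filename)
        # Processing would happen here; afterwards the location of the
        # output file is passed on to every connected component.
        self.send(self.output_for(filename))

    def send(self, filename):
        for receiver in self.receivers:
            receiver.receive(filename)


watcher = Component("watcher")
subtract = Component("subtract")
guinier = Component("guinier")
gnom = Component("gnom")

watcher.connect(subtract)
subtract.connect(guinier)   # one message, two receivers
subtract.connect(gnom)

watcher.send("run01.dat")
print(guinier.received)
print(gnom.received)
```

Both `guinier` and `gnom` end up with the same file location, the one announced by `subtract`.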

It is possible to add custom components and tasks to the PIPELINE and include them in the configuration together with the ATSAS tools. If you are interested in adding such custom components to your pipeline setup, please contact us for the developer documentation.

Running the Pipeline

Usage:

$ pipeline [OPTIONS] <CONFIGFILE>

The OPTIONS known by the PIPELINE are described in the next section, the mandatory argument CONFIGFILE in the section on input files.

Command-line Arguments and Options

The PIPELINE accepts the following command-line argument:

Argument	Description
`CONFIGFILE`	The filename of a configuration file, possibly with relative or absolute path components.

The PIPELINE recognizes the following command-line options:

Short Option	Long Option	Description
	`--logfile=<FILE>`	Where to write log information to. By default the log information is written to `stderr`.
	`--loglevel=<debug\|info\|warning\|error>`	Minimum log verbosity, one of `debug`, `info`, `warning` or `error`. Default is `info`.
	`--working-directory=<DIR>`	Working directory. If specified, the PIPELINE process changes to this directory before executing any task. Default: none, i.e. the current directory is used.
	`--tasklimit=<N>`	Maximum number of parallel tasks, default: 2, should not exceed the number of available cores.
`-v`	`--version`	Print version information and exit.
`-h`	`--help`	Print a summary of arguments and options and exit.
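When the PIPELINE is started from a script, the options above can be assembled programmatically. The following is a minimal sketch; the helper name `pipeline_command` is an assumption made for illustration, and only the options listed in the table are used.

```python
import os
import shlex


def pipeline_command(configfile, logfile=None, loglevel="info",
                     workdir=None, tasklimit=2):
    """Assemble an argument list for the pipeline executable."""
    if loglevel not in ("debug", "info", "warning", "error"):
        raise ValueError("unknown log level: %r" % loglevel)
    # The task limit should not exceed the number of available cores.
    cores = os.cpu_count() or 1
    tasklimit = max(1, min(tasklimit, cores))
    args = ["pipeline", "--loglevel=%s" % loglevel,
            "--tasklimit=%d" % tasklimit]
    if logfile is not None:
        args.append("--logfile=%s" % logfile)
    if workdir is not None:
        args.append("--working-directory=%s" % workdir)
    args.append(configfile)   # the mandatory CONFIGFILE argument comes last
    return args


cmd = pipeline_command("pipelinerc.xml", logfile="pipeline.log", tasklimit=1)
print(" ".join(shlex.quote(a) for a in cmd))
```

The resulting list could be passed to `subprocess.run(cmd)` on a system where the `pipeline` executable is installed.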

Runtime Output

If no logfile is specified, the PIPELINE prints the log information to stderr. The verbosity of this output is set by the loglevel option.

Pipeline Input Files

The configuration file configures the pipeline components and how the components are connected to each other.

The file format is XML. The root element <pipelinerc> shall contain exactly one <components> element and one <connections> element. The <components> element may contain multiple <component> elements, which in turn may contain <option> elements. The <connections> element may contain multiple <connection> elements.

Example:

<?xml version="1.0" ?>
<pipelinerc version="1.0">
  <components>
    <component class="COMPONENT_CLASS" name="COMPONENT_NAME">
      <option name="OPTION_NAME">OPTION_VALUE</option>
      <!-- more options -->
    </component>
    <!-- more components -->
  </components>
  <connections>
    <connection sender="SENDER_COMPONENT_NAME" receiver="RECEIVER_COMPONENT_NAME" />
    <!-- more connections -->
  </connections>
</pipelinerc>

Here COMPONENT_CLASS refers to the class (type) of the component and COMPONENT_NAME shall be a unique name which may be defined by the author of the configuration file. The following rules apply:

- component classes may be used multiple times in the same pipeline configuration;
- component names must be unique;
- components may have any number of options;
- option names must be unique within a component;
- names used in connections must be defined in components beforehand;
- any component may receive messages from any number of other components.
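Some of these rules can be checked with a few lines of Python before the PIPELINE is started. The sketch below uses only the standard library; the function name `check_config` and the wording of the reported problems are assumptions made for illustration, and the check does not replace validation against the schema.

```python
import xml.etree.ElementTree as ET


def check_config(xml_text):
    """Return a list of rule violations found in a pipelinerc document."""
    root = ET.fromstring(xml_text)
    problems = []
    names = set()
    for comp in root.findall("./components/component"):
        name = comp.get("name")
        if name in names:
            problems.append("duplicate component name: %s" % name)
        names.add(name)
        # Option names must be unique within one component.
        options = [opt.get("name") for opt in comp.findall("option")]
        for opt in sorted(set(options)):
            if options.count(opt) > 1:
                problems.append("duplicate option %s in %s" % (opt, name))
    # Names used in connections must refer to defined components.
    for conn in root.findall("./connections/connection"):
        for role in ("sender", "receiver"):
            if conn.get(role) not in names:
                problems.append("undefined %s: %s" % (role, conn.get(role)))
    return problems


EXAMPLE = """<pipelinerc version="1.0">
  <components>
    <component class="filesystemwatcher" name="watcher"/>
    <component class="guinier" name="rg"/>
    <component class="guinier" name="rg"/>
  </components>
  <connections>
    <connection sender="watcher" receiver="rg"/>
    <connection sender="rg" receiver="result"/>
  </connections>
</pipelinerc>"""

problems = check_config(EXAMPLE)
print(problems)
```

Here the class `guinier` is legitimately used twice, but the repeated name `rg` and the undefined receiver `result` are reported.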

Validating Configuration Files

It is possible to use an XML schema to validate the configuration file. The validation already includes checks for many of the constraints listed above. Validation by schema can be done in two ways: (1) by adding the schema to the root element and letting the pipeline do the validation on startup:

<pipelinerc version="1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="/local/file/path/pipelinerc.xsd">

or (2) by validating manually on the command line, e.g.:

$ xmllint -schema /local/file/path/pipelinerc.xsd pipelinerc.xml

where xmllint is part of the libxml2 package. Other XML processors would also work.

Configuring Components

The PIPELINE comes with a number of ready-made components:

abinitio
avrgframes
avrgradial
classifier
compress
copy
dat2hdf5
distances
hplcsubtract
filesystemwatcher
fit
guinier
ispyb
membraneispyb
mixture
mow
oversubtract
plotdata
plotguinier
plotkratky
plotpdb
plotpofr
porod
razor
rigidbody
serializer
stat
subtracter
superposition
vc

abinitio

The abinitio component may be used to build one or more ab initio models from regularized data, i.e. it requires the .out file of the distances component as an input.

For each input file, DAMMIF is run repeat times in the configured mode and all its output files are placed in outputdir. If it is known that all input files use inverse Angstroms or inverse nanometres, the unit may be fixed in the configuration file to rule out wrong unit estimates.

If more than one model has been computed by DAMMIF, DAMAVER is run to refine the modelling. The resulting DAMSTART model will be used as an input for DAMMIN which is employed to do the final refinement step.

Note that the complete modelling process may take a very long time. For online processing it is recommended to set repeat to 1.

Option	Default	Accepted Range		Description
repeat	1	1	20	Number of models generated for each input file.
group	processed			Identifier used by the serializer component to group results to output files.
outputdir	(none)			The directory for the output files of DAMMIF and DAMMIN.
unit	angstrom	['angstrom', 'nanometer']		Angular units of the input file.
mode	fast	['fast', 'slow']		Configuration of the annealing procedure.

avrgframes

Option	Default	Accepted Range		Description
group	frames			Identifier used by the serializer component to group results to output files.
interval	2000	0	Inf	TODO: docstring
outputdir	(none)			The directory for the output files of DATAVER.
prefix	(none)			TODO: docstring
alpha	0.01	0.0	1.0	Significance level for frame comparison. If the probability of similarity is less than alpha, the frame is discarded.
groupid	(none)			TODO: docstring

avrgradial

Radial averaging of 2D images.

Option	Default	Accepted Range		Description
responsedata	(none)			File name with detector response.
axisdata	(none)			File name with angular axis.
unittime	1.0	0.0	Inf	Unit time scale in seconds.
beamstopmask	(none)			File name with beamstop mask.
beamcentery	(none)	-Inf	Inf	Vertical location of the beam center in image pixel coordinates.
beamcenterx	(none)	-Inf	Inf	Horizontal location of the beam center in image pixel coordinates.
outputdir	(none)			The directory for the output files of the radial averaging application RADAVER.
unit	unknown	['unknown', 'angstrom', 'nanometer']		Units of the angular axis.

classifier

The classifier component attempts to classify experimental data, i.e. determine to which distinct group it probably belongs. Currently only a classification of folded/unfolded is implemented. The output is a binary value, indicating whether the specimen is more likely folded (1) or unfolded (0).

For normalization purposes the radius of gyration (R_g) and the forward scattering (I(0)) are required. The classifier component runs AUTORG to obtain these values. If the R_g/I(0) could not be determined, no classification is attempted.

For comparable results, provide the same values for options here as for the guinier component.

Option	Default	Accepted Range		Description
mininterval	10	2	Inf	Minimum R_g interval length in data points.
smaxrg	1.3	0.0	Inf	Maximum S_max*R_g value.
sminrg	1.0	0.0	Inf	Minimum S_min*R_g value.
group	processed			Identifier used by the serializer component to group results to output files.
minquality	0.0	0.0	1.0

compress

The compress component compresses TIFF images in place using Lempel-Ziv-Welch (LZW) compression.

Option	Default	Accepted Range		Description

dat2hdf5

The dat2hdf5 component converts SECSAXS frames to a data file in hdf5 format and to a json file with corresponding experimental parameters.

The structure of the hdf5 file is identical to that produced by the ESRF pipeline, so it can be directly downloaded and processed by the ISPyB interface.

Option	Default	Description
outputdir	(none)	The target base directory for the hdf5 and json files.
group	processed	Identifier used by the serializer component to group results to output files.
hdf5andjsonfilename	filename	Name of the hdf5 and json files (e.g. a run number).

copy

The copy component copies an incoming file to another place.

Option	Default	Accepted Range		Description
outputdir	(none)			The target base directory for the copied files.

distances

The distances component computes the distance distribution function p(r) from experimental scattering data by an indirect transformation using GNOM. As an input it requires a subtracted data file and provides the regularized data and p(r) information as an output. The output file is placed in outputdir.

For normalization purposes the radius of gyration (R_g) is required. The distances component runs AUTORG to obtain this value. If the R_g could not be determined, or the estimated quality is below minquality, no indirect transformation is attempted.

For comparable results, provide the same values for options here as for the guinier component.

Option	Default	Accepted Range		Description
sminrg	1.0	0.0	Inf	Minimum S_min*R_g value.
mininterval	10	2	Inf	Minimum R_g interval length in data points.
minquality	0.0	0.0	1.0
group	processed			Identifier used by the serializer component to group results to output files.
smaxrg	1.3	0.0	Inf	Maximum S_max*R_g value.
outputdir	(none)			The directory for the output files of GNOM.

hplcsubtract

The hplcsubtract component receives signals from radaver or datwatcher. When there are no new frames within 3 seconds, it runs chromixs in automatic mode and saves results in the workdir directory.

Option	Default	Accepted Range		Description
group	processed			Identifier used by the serializer component to group results to output files.
outputdir	None			The target base directory for the processed files. Default is /analysis/processed/chromixs

filesystemwatcher

The filesystemwatcher component monitors a directory of a local file system, the workdir, for files matching the defined globbing filter. For each matching file a message is sent to the connected components, informing them of the file found. Only files not previously reported are flagged, and only up to batch files at a time. This is repeated after every interval. If the interval is set to 0, the workdir is checked only once and no further updates are performed. This corresponds to an offline mode where all input files are available at the start of the PIPELINE.

Each filesystemwatcher monitors exactly one directory; subdirectories are not monitored. To watch for files in multiple directories, use multiple instances of filesystemwatcher.

Option	Default	Accepted Range		Description
filter	*			A filter applied to the files found in workdir. Default accepts all files, may be any reasonable globbing pattern, e.g. '*.dat' to select only files with extension '.dat'.
workdir	.			The directory to check for input files. If this is a relative path, it is relative to the working directory of the application. Default is the current working directory.
batch	2147483647	0	Inf	Batch size, maximum number of new entries to be reported after each interval. Particularly useful for testing and debugging; set interval to 60000 and batch to 20 to simulate beamline operation with one dataset of 20 frames per minute.
interval	1000	0	Inf	Polling interval of workdir in milliseconds. If 0, polling is disabled and the pipeline works in `offline mode`, assuming that all files are available at startup.
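The polling behaviour described above (glob filter, no repeated reports, at most batch files per interval) can be sketched as follows. This is an illustration under stated assumptions, not the implementation of the component: the function name `poll_once` and the alphabetical reporting order are hypothetical.

```python
import fnmatch
import os
import tempfile


def poll_once(workdir, pattern, seen, batch):
    """Return up to `batch` matching file names not reported before."""
    new = []
    for name in sorted(os.listdir(workdir)):
        if len(new) >= batch:
            break
        path = os.path.join(workdir, name)
        # Subdirectories are not monitored, so only plain files count.
        if not os.path.isfile(path) or name in seen:
            continue
        if fnmatch.fnmatch(name, pattern):
            new.append(name)
    seen.update(new)
    return new


# Simulate three polling intervals on a temporary directory.
with tempfile.TemporaryDirectory() as workdir:
    for name in ("a.dat", "b.dat", "c.dat", "notes.txt"):
        open(os.path.join(workdir, name), "w").close()
    seen = set()
    first = poll_once(workdir, "*.dat", seen, batch=2)
    second = poll_once(workdir, "*.dat", seen, batch=2)
    third = poll_once(workdir, "*.dat", seen, batch=2)

print(first, second, third)
```

With an interval of 0, only the first call would be made, which corresponds to the offline mode.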

fit

The fit component uses a priori information and invokes CRYSOL to fit the theoretical scattering curve to the provided experimental data.

Option	Default	Accepted Range		Description
group	processed			Identifier used by the serializer component to group results to output files.
outputdir	apriori			The directory where the fit and log files of CRYSOL are stored.
lm	20	10	50	Maximum order of harmonics, defines the resolution of the calculated curve.
unit	angstrom	['angstrom', 'nanometer']		Angular units of the input file.

guinier

The guinier component attempts to determine the radius of gyration (R_g) and the forward scattering (I(0)) from experimental data using AUTORG. As an input it requires a subtracted data file scaled for concentration.

The minquality option is not used by this component, but is forwarded only to the output to facilitate a configurable means to decide what quality is still considered acceptable.

If the forward scattering and molecular mass of a standard protein are provided (standard-i0, standard-mw), then an approximation of the molecular weight of the current specimen is computed as well.
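The manual does not state the formula used for this estimate. A common approach, shown here only as a sketch under that assumption, scales the molecular weight of the standard by the ratio of the concentration-normalized forward scattering values. The function name `estimate_mw` is hypothetical.

```python
def estimate_mw(i0_sample, standard_i0, standard_mw):
    """Scale the standard's molecular weight by the ratio of I(0) values.

    Assumes both I(0) values are normalized to concentration and were
    measured under the same conditions.
    """
    if standard_i0 <= 0:
        raise ValueError("standard-i0 must be positive")
    return i0_sample * standard_mw / standard_i0


# Example values are made up: a sample with half the forward scattering
# of a 66 kDa standard.
mw = estimate_mw(i0_sample=0.5, standard_i0=1.0, standard_mw=66.0)
print(mw)
```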

Option	Default	Accepted Range		Description
sminrg	(none)	0.0	Inf	Minimum S_min*R_g value.
mininterval	(none)	2	Inf	Minimum R_g interval length in data points.
minquality	0.0	0.0	1.0
standard-mw	(none)	0.0	Inf
standard-i0	(none)	0.0	Inf
group	processed			Identifier used by the serializer component to group results to output files.
smaxrg	(none)	0.0	Inf	Maximum S_max*R_g value.

ispyb

Connection to the ISPyB database.

Option	Default	Accepted Range		Description
username	(none)
code	(none)			Proposal code, generally 'mx' or 'saxs'.
url	(none)			URL of the ISPyB web services.
interval	5000	2000	Inf	Time to wait between a data update and its serialization to the database in milliseconds. If there is an update before the interval expired, the wait time starts again.
number	(none)			Proposal number.
experimentid	(none)			ISPyB experiment id where results will be added to. If undefined, a new experiment id will be generated.
datadir	(none)			Target base directory for copied files.
password	(none)

membraneispyb

The membraneispyb component performs sophisticated automatic analysis of available a priori data in ISPyB and, depending on the result, runs either MONSA or MEMPROT.

The component is designed for automatic SECSAXS analysis of membrane proteins in detergent solution. It does not work without the hplcsubtract and ispyb components.

Only a declaration of the ispyb component is needed in the pipeline configuration file. No connection with the membraneispyb component is required! All data exchange with the ISPyB database is performed implicitly inside the module.

For running MEMPROT a high resolution structure has to be uploaded to the ISPyB database. Otherwise, MONSA will be started with some precalculated starting parameters. Knowledge of the chemical formulas of the detergent is mandatory for both types of modelling.

Modelling is based on a combination of a priori information (pdb, FASTA sequence, detergent head/tail chemical formulas) and a posteriori SAXS data (Rg, Dmax, Vporod).

Please make sure in advance that all the precalculated starting values for both MONSA/MEMPROT configuration files are adequate, as modeling can take a while.

Option	Default	Accepted Range		Description
group	processed			Identifier used by the serializer component to group results to output files.
outputdir	(none)			Path to MEMPROT/MONSA models and configuration files.

mixture

The mixture component uses a priori information to create the form-factor file for OLIGOMER. OLIGOMER in turn fits a scattering curve from a multicomponent mixture of proteins in order to obtain the volume fractions of each component in the mixture.

Option	Default	Accepted Range		Description
lm	20	10	50	Maximum order of harmonics, defines the resolution of the calculated curve.
group	processed			Identifier used by the serializer component to group results to output files.
outputdir	apriori			The directory where the fit and log files of OLIGOMER are stored.
unit	angstrom	['angstrom', 'nanometer']		Angular units of the input file.

mow

The mow component uses the program DATMOW to estimate the molecular weight of the sample.

Option	Default	Accepted Range		Description
group	processed			Identifier used by the serializer component to group results to output files.

oversubtract

The oversubtract component collects information on the subtraction results. It determines the longest sequence of consecutive negative intensity values (LCNS) in the data and calculates a p-value which describes the risk of erroneously considering these data as oversubtracted. This procedure is repeated on the same data for multiple n, with every n data points being averaged using a stepping window of size n. If oversubtraction is detected in any configuration, i.e. the p-value is below a given significance level alpha, this is flagged on the display of the results. Furthermore, the component provides the computed p-value, the start index and length of the LCNS with respect to the non-averaged data, the window size with which oversubtraction was found, and a constant describing the degree of oversubtraction in relation to the forward scattering I(0).

Option	Default	Accepted Range		Description
group	processed			Identifier used by the serializer component to group results to output files.
minwnr	100	1	Inf	Minimal number of windows required: specifies how many data points are at least required to be present after averaging/windowing in order to perform LCNS analysis.
alpha	0.05	0	1	Significance level for the p-value. P-values below alpha indicate oversubtraction.
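The search for the LCNS and the averaging step can be sketched as follows. This is a simplified illustration: the stepping window is assumed to be non-overlapping, the function names are hypothetical, and the p-value computation is omitted because the manual does not describe it.

```python
def longest_negative_run(values):
    """Return (start, length) of the longest run of consecutive negatives."""
    best_start, best_len = 0, 0
    run_start, run_len = 0, 0
    for i, v in enumerate(values):
        if v < 0:
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_len = 0
    return best_start, best_len


def window_average(values, n):
    """Average every n consecutive points (non-overlapping windows)."""
    return [sum(values[i:i + n]) / n
            for i in range(0, len(values) - n + 1, n)]


# Made-up intensities for illustration.
intensity = [5.0, 2.0, -0.1, 0.3, -0.2, -0.4, -0.1, 0.2, -0.3, 0.1]
raw_run = longest_negative_run(intensity)
averaged_run = longest_negative_run(window_average(intensity, 2))
print(raw_run, averaged_run)
```

In the raw data the longest run starts at index 4 and has length 3; after averaging pairs of points, only isolated negative windows remain.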

plotdata

The plotdata component creates a logarithmic plot of the scattering data.

Option	Default	Accepted Range		Description
smax	(none)	0	Inf	Maximum angle to plot.
group	processed			Identifier used by the serializer component to group results to output files.
gnuplotconfig	/usr/share/atsas/pipeline/tasks/dat2img.cfg			Location of the gnuplot configuration file.
imin	(none)	0	Inf	Minimum intensity to plot.
imax	(none)	0	Inf	Maximum intensity to plot.
outputdir	thumbnails			The output directory for the log I(s) thumbnails.

plotguinier

The plotguinier component creates a Guinier plot based on the output of the guinier component.

Option	Default	Description
group	processed	Identifier used by the serializer component to group results to output files.
outputdir	thumbnails	The output directory for the Guinier plot thumbnails.
gnuplotconfig	/usr/share/atsas/pipeline/tasks/guinier2img.cfg	Location of the gnuplot configuration file.

plotkratky

The plotkratky component receives data from the Guinier analysis and uses them to create a dimensionless Kratky plot.

Option	Default	Accepted Range		Description
issmax	(none)	0	Inf	Maximum s²I(s) to plot.
smax	(none)	0	Inf	Maximum angle to plot.
group	processed			Identifier used by the serializer component to group results to output files.
outputdir	thumbnails			The output directory for the Kratky plot thumbnails.
gnuplotconfig	/usr/share/atsas/pipeline/tasks/kratky2img.cfg			Location of the gnuplot configuration file.

plotpdb

The plotpdb component generates a thumbnail image from a pdb file created by the abinitio component.

Option	Default	Accepted Range		Description
rmax	(none)	0	Inf	Maximum distance to plot (usually pi / Smin - same as in plotpofr).
group	processed			Identifier used by the serializer component to group results to output files.
outputdir	thumbnails			The output directory for the model thumbnails.
gnuplotconfig	/usr/share/atsas/pipeline/tasks/pdb2img.cfg			Location of the gnuplot configuration file.

plotpofr

TODO

Option	Default	Accepted Range		Description
rmax	(none)	0	Inf	Maximum distance to plot (usually pi / Smin).
group	processed			Identifier used by the serializer component to group results to output files.
outputdir	thumbnails			The output directory for the p(r) thumbnails.
gnuplotconfig	/usr/share/atsas/pipeline/tasks/pofr2img.cfg			Location of the gnuplot configuration file.

porod

The porod component attempts to compute the Porod volume from regularized data, i.e. it requires the .out file of the distances component as an input and gives the Porod volume as output.

Option	Default	Accepted Range		Description
group	processed			Identifier used by the serializer component to group results to output files.

razor

TODO

Option	Default	Accepted Range		Description
group	processed			Identifier used by the serializer component to group results to output files.

rigidbody

The rigidbody component invokes SASREF, which performs rigid body modelling of macromolecular complexes formed by multiple subunits (with known atomic structure) against solution scattering data.

Option	Default	Accepted Range	Description
group	processed		Identifier used by the serializer component to group results to output files.
outputdir	apriori		The directory where the fit and log files of SASREF are stored.
unit	angstrom	['angstrom', 'nanometer']	Angular units of the input file.

serializer

The serializer component handles the writing of an output file in XML format. There may be multiple serializers within a PIPELINE configuration, each serializing their own group of components.

Every component with output needs to be connected to a serializer to ensure correct and complete serialization of the data. To avoid rewrites of the output file in quick succession, a waiting interval may be defined. The serializer will wait interval milliseconds before writing out the output. If in this time more messages from other components arrive, the waiting period starts anew.
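The effect of the waiting interval can be illustrated with a short simulation. The function `write_times` is hypothetical and works on made-up message arrival times; it only demonstrates that each new message restarts the waiting period.

```python
def write_times(update_times, interval):
    """Return the times at which the output file would be written.

    `update_times` are message arrival times in milliseconds, sorted.
    A write happens once `interval` ms pass without a further message.
    """
    writes = []
    deadline = None
    for t in update_times:
        if deadline is not None and t >= deadline:
            writes.append(deadline)   # the wait expired before this message
        deadline = t + interval       # every message restarts the wait
    if deadline is not None:
        writes.append(deadline)
    return writes


writes = write_times([0, 1000, 2500, 9000], interval=5000)
print(writes)
```

The first three messages arrive in quick succession and lead to a single write at 7500 ms; the last message causes a second write at 14000 ms.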

If a stylesheet is present, its location is included in the resulting output file to facilitate simple viewing of the data in a web browser. Please note that some browsers will not use a stylesheet with an absolute path for security reasons. Thus a stylesheet should be placed in the same directory as the xml or in a subdirectory. To disable this check in Firefox:

type about:config in the address bar
change security.fileuri.strict_origin_policy to false

Option	Default	Accepted Range		Description
output	(none)			The name of the file where the serialized data shall be stored. May include an absolute or relative path.
stylesheet	(none)			Absolute or relative path to the XSL style sheet to be named in the output file.
interval	5000	2000	Inf	Time to wait between a data update and its serialization in milliseconds. If there is an update before the interval expired, the wait time starts again.
group	processed			Serializes data of all components in the same group into one output file.

stat

The stat component collects a configurable list of file header/footer information, e.g. description, code, concentration, exposure time, etc. The information may then be serialized together with the computational results to provide the full information.

Example usage:

<component class="stat" name="datstat">
  <option name="query">
    <list>
      <item>sample-description</item>
      <item>sample-code</item>
      <item>sample-concentration</item>
    </list>
  </option>
</component>

Option	Default	Accepted Range		Description
query	(none)			The list of keys to extract from the input file(s).
group	processed			Identifier used by the serializer component to group results to output files.

subtracter

The subtracter component collects incoming files, sorts them by the run number extracted from the input file and groups them into samples and buffers, respectively. The files of each group are compared, significantly different frames are discarded (see alpha) and the remaining frames are averaged. The averaged buffer(s) closest in time (i.e. run number) with the same code as a sample are subtracted from the sample; if verifyCodeMatch is disabled, the closest buffer(s) are used regardless of their code. The averaged and subtracted files are written to outputdir.

Option	Default	Accepted Range		Description
verifyCodeMatch	1	0	1	Whether or not the code must match on subtraction. If not (0), it is allowed to subtract a buffer with a code different from that of the sample.
alpha	0.01	0.0	1.0	Significance level for frame comparison. If the probability of similarity is less than alpha, the frame is discarded.
group	processed			Identifier used by the serializer component to group results to output files.
outputdir	(none)			The directory where the subtracted and normalized files are stored.
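The selection of the closest buffer(s) may be pictured as in the following sketch. It is an illustration under assumptions: runs are represented as hypothetical (run number, code) tuples, ties in distance return all tied buffers, and how several buffers are combined before subtraction is not covered.

```python
def closest_buffers(sample, buffers, verify_code_match=True):
    """Pick the buffer run(s) nearest to the sample by run number.

    `sample` and each entry of `buffers` are (run_number, code) tuples.
    """
    candidates = [b for b in buffers
                  if not verify_code_match or b[1] == sample[1]]
    if not candidates:
        return []
    nearest = min(abs(b[0] - sample[0]) for b in candidates)
    return [b for b in candidates if abs(b[0] - sample[0]) == nearest]


# Made-up run numbers and codes.
buffers = [(10, "lys"), (12, "bsa"), (16, "lys")]
matched = closest_buffers((13, "lys"), buffers)
unmatched = closest_buffers((13, "lys"), buffers, verify_code_match=False)
print(matched, unmatched)
```

With code matching enabled, the two `lys` buffers at equal distance are selected; with matching disabled, the nearer `bsa` buffer is chosen instead.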

superposition

TODO

Option	Default	Accepted Range	Description
superposition	backbone	['backbone', 'all']	Selection of atoms to superimpose.
group	processed		Identifier used by the serializer component to group results to output files.
outputdir	models		The directory where the rotated models are stored.
mode	slow	['fast', 'slow']	Superposition algorithm, slow mode is more accurate.

vc

MW Estimate from Volume of Correlation.

Option	Default	Accepted Range		Description
group	processed			Identifier used by the serializer component to group results to output files.

zeroconc

The zeroconc component attempts to deal with concentration effects by extrapolating to zero concentration from experimental data using ALMERGE. As an input it requires a subtracted data set scaled for the respective concentrations. If only two concentrations of the same sample are available, zeroconc will merge them; more than two concentrations will be extrapolated to zero concentration. The extrapolated file is written to outputdir.

The extrapolated file may be subjected to the same analysis as any other experimental data set.

Option	Default	Accepted Range		Description
sminrg	1.0	0.0	Inf	Minimum S_min*R_g value.
mininterval	10	2	Inf	Minimum R_g interval length in data points.
minquality	0.5	0.0	1.0
overlap	30	2	Inf	Minimum overlap length in angular points.
step	2	1	Inf	Search step in points; step=1 will find the best overlap but takes most time.
smaxrg	1.3	0.0	Inf	Maximum S_max*R_g value.
outputdir	(none)			The output directory for the extrapolated files.

Pipeline Output Files

Each component may produce output files, e.g. .dat, .out or .pdb, and values related to the input data, e.g. radius of gyration or molecular weight estimates.

The output files will be stored according to the configuration of the particular component, typically as defined by <option name="outputdir"> in the configuration file.

Please see the components section on how to configure and connect the serializer component.

Example

From background subtraction to modelling

This sample configuration file defines a pipeline that

- waits for new *.dat files in the folder data;
- subtracts the background and writes the subtracted data file to the folder subtracted;
- reads the properties sample-description, sample-code and sample-concentration from the subtracted data files;
- performs the Guinier analysis and evaluates the molecular weight from I(0);
- performs the GNOM analysis (distance distribution function);
- evaluates the Porod volume;
- runs the ab initio modelling;
- every 3000 milliseconds writes the values obtained from these steps (including the sample properties read from the subtracted file) to result.xml.

<?xml version="1.0" encoding="UTF-8"?>
<pipelinerc version="1.0">
  <components>
    <component class="filesystemwatcher" name="datwatcher">
      <option name="interval">1000</option>
      <option name="workdir">data</option>
      <option name="filter">*.dat</option>
    </component>

    <component class="subtracter" name="datsubtract">
      <option name="outputdir">subtracted</option>
    </component>

    <component class="stat" name="datstat">
      <option name="query">
        <list>
          <item>sample-description</item>
          <item>sample-code</item>
          <item>sample-concentration</item>
        </list>
      </option>
    </component>

    <component class="guinier" name="datguinier">
      <option name="mininterval">10</option>
      <option name="minquality">0.5</option>
      <option name="standard-i0">1</option>
      <option name="standard-mw">66</option>
    </component>

    <component class="distances" name="datgnom">
      <option name="outputdir">gnom</option>
      <option name="mininterval">10</option>
      <option name="minquality">0.5</option>
    </component>

    <component class="porod" name="datporod" />

    <component class="abinitio" name="dammif">
      <option name="outputdir">models</option>
      <option name="mode">fast</option>
      <option name="unit">nanometer</option>
      <option name="repeat">1</option>
    </component>

    <component class="serializer" name="datresult">
      <option name="interval">3000</option>
      <option name="outputdir">.</option>
      <option name="output">result.xml</option>
      <option name="stylesheet">/usr/local/atsas/etc/result.xsl</option>
    </component>
  </components>

  <connections>
    <connection sender="datwatcher" receiver="datsubtract" />

    <connection sender="datsubtract" receiver="datstat" />
    <connection sender="datsubtract" receiver="datguinier" />
    <connection sender="datsubtract" receiver="datgnom" />

    <connection sender="datgnom" receiver="datporod" />
    <connection sender="datgnom" receiver="dammif" />

    <connection sender="datstat" receiver="datresult" />
    <connection sender="datguinier" receiver="datresult" />
    <connection sender="datgnom" receiver="datresult" />
    <connection sender="datporod" receiver="datresult" />
    <connection sender="dammif" receiver="datresult" />
  </connections>
</pipelinerc>
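To see which components are affected by the output of a given component, the connections of this example can be followed as a directed graph. The sketch below copies the sender/receiver pairs from the configuration above; the function name `downstream` is an assumption made for illustration and is not part of the PIPELINE.

```python
from collections import deque

# Sender/receiver pairs copied from the <connections> element above.
CONNECTIONS = [
    ("datwatcher", "datsubtract"),
    ("datsubtract", "datstat"),
    ("datsubtract", "datguinier"),
    ("datsubtract", "datgnom"),
    ("datgnom", "datporod"),
    ("datgnom", "dammif"),
    ("datstat", "datresult"),
    ("datguinier", "datresult"),
    ("datgnom", "datresult"),
    ("datporod", "datresult"),
    ("dammif", "datresult"),
]


def downstream(start, connections):
    """Return all components reachable from `start`, breadth first."""
    receivers = {}
    for sender, receiver in connections:
        receivers.setdefault(sender, []).append(receiver)
    order, queue, seen = [], deque([start]), {start}
    while queue:
        current = queue.popleft()
        for nxt in receivers.get(current, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


after_gnom = downstream("datgnom", CONNECTIONS)
print(after_gnom)
```

Every component except the watcher itself is reachable from `datwatcher`, and all branches end in the serializer `datresult`.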

Extending pipeline

The PIPELINE may be extended with custom components to perform tasks of the user's choice. This part of the documentation describes how to extend the PIPELINE, i.e. how to add components and tasks. It also includes an API overview.

Adding Components

We use the oversubtract component as an example to explain how to implement a simple component.

Initialization

from component import component
from metadata import metadata

import message, libxml2

class oversubtract(component):
  def __init__(self):
    component.__init__(self)

Definition of the class oversubtract and initialization of the component base class.

This code should always be included in new components as shown, but with a different class name.

Configuration

  def configure(self, rawnode):
    component.configure(self, rawnode)
    self.maxratio = self.option("maxratio")

After initialization, each component is configured and the configure method of each component instance is called. The rawnode parameter contains a pointer to the section of the configuration file corresponding to the instance of this component. Always call the base class implementation of this method before accessing the individual options through option. There should be no reason to use or modify the rawnode directly.

It is good practice to check here that all values are within their valid ranges and that output directories exist, creating them if they do not. It may be helpful to report the outcome of these checks using the message module.
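The checks described above might look like the following sketch. The component class here is only a minimal stand-in for the pipeline's base class so that the example runs on its own. The outputdir option of this component, the accepted range for maxratio, the assumption that options arrive as text and the use of ValueError are all illustrative assumptions, not part of the documented API.

```python
import os
import tempfile


class component(object):
    """Stand-in for the pipeline's component base class (illustration only)."""

    def __init__(self):
        self._options = {}

    def configure(self, rawnode):
        # The real base class reads the options from rawnode; this stand-in
        # simply accepts a dict of option names and string values.
        self._options = dict(rawnode)

    def option(self, name):
        return self._options.get(name)


class oversubtract(component):
    def __init__(self):
        component.__init__(self)

    def configure(self, rawnode):
        component.configure(self, rawnode)

        # Assumption: options arrive as text, so convert and range-check them.
        raw = self.option("maxratio")
        if raw is None:
            raise ValueError("option 'maxratio' is missing")
        self.maxratio = float(raw)
        if not 0.0 <= self.maxratio <= 1.0:  # hypothetical valid range
            raise ValueError("maxratio must be within [0, 1]")

        # Create the output directory if it does not exist yet.
        self.outputdir = self.option("outputdir")
        if self.outputdir and not os.path.isdir(self.outputdir):
            os.makedirs(self.outputdir)


workdir = os.path.join(tempfile.mkdtemp(), "results")
c = oversubtract()
c.configure({"maxratio": "0.25", "outputdir": workdir})
```

Failing early in configure keeps a misconfigured component from starting to process files.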

Processing Data

  def process(self, metadata):
    data = {}
    filename = metadata.getFileName()

    # Fill in 'data', removed counting code for brevity.

    metadata.setData(self, data)
    component.processed(self, metadata)

During processing, components usually accumulate data that needs to be serialized later. The data may be any Python object, e.g. a single value, a list, a dict or a class instance. A reference to this data may be stored in the metadata object; MetaData acts as a container that associates data and tasks with the particular file being analysed. Please note: while it is technically possible to access the data of components other than the current one, this is strongly discouraged. Each component should be stand-alone and not depend on the values of any other.

In this example the actual work is done synchronously in Python, i.e. the data is processed as soon as it comes in, so computation is serialized. Do this only if the computations involved are short and the overhead of launching a task would be larger than doing the computations inline. Otherwise, to allow for parallelization, asynchronous tasks are the preferred mechanism.

When done with processing, the processed signal shall be sent by calling the corresponding class method to allow for connections between components.

Serializing Results

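  def serialize(self, metadata, rawnode):
    # Wrap the raw output node and create a child node for this component.
    node = libxml2.xmlNode(rawnode)
    node = node.newChild(None, "oversubtracted", None)

    # Write each stored entry as <value name="key">value</value>.
    data = metadata.getData(self)
    for key in data.keys():
      subnode = node.newChild(None, "value", str(data[key]))
      subnode.newProp("name", key)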

On serialization, the component's serialize method is called for every metadata object that contains data added by this component. Just as rawnode points to the input section during configuration, here it points to the output node in the XML document to which the data should be serialized.
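To illustrate the shape of the serialized output, the sketch below builds the same node layout with the standard library module xml.etree.ElementTree instead of libxml2. The helper add_value, the enclosing file element and the data keys are hypothetical and chosen only for this example.

```python
import xml.etree.ElementTree as ET


def add_value(parent, name, value):
    """Hypothetical helper: append <value name="...">text</value> to parent."""
    child = ET.SubElement(parent, "value")
    child.set("name", name)
    child.text = str(value)
    return child


def serialize(data, output_node):
    """Write all entries of data below a new <oversubtracted> node."""
    node = ET.SubElement(output_node, "oversubtracted")
    # Sort the keys so that the output order is reproducible.
    for key in sorted(data):
        add_value(node, key, data[key])
    return node


# Made-up output node and data, for illustration only.
output = ET.Element("file")
serialize({"negative": 3, "total": 120}, output)
xml_text = ET.tostring(output, encoding="unicode")
```

Each entry of the data dict becomes one value element whose name attribute carries the key.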

FIXME: finally refactor above code to use an addValue function or something.

name(self)

Returns the configured name of the component instance self.
configure(self, rawnode)

Reads the configuration for this component from the XML tree. Should be implemented in subclasses like:
```
def configure(self, rawnode):
  component.configure(self, rawnode)
  self.outputdir = self.option('outputdir')
```
to obtain the value of the option "outputdir" for this component.
option(self, name)

Return the value of option name. Given this option configuration:
```
<option name="workdir">data</option>
```
If name is "workdir", "data" is returned; otherwise None is returned.
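The lookup behaviour described above can be reproduced with the standard library, as in the sketch below. It uses xml.etree.ElementTree and is not the pipeline's implementation; the enclosing component element and the function lookup_option are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration fragment; only the <option> line is from the manual.
SNIPPET = """
<component name="example">
  <option name="workdir">data</option>
</component>
"""


def lookup_option(node, name):
    """Return the text of the <option> called name, or None if it is absent."""
    for opt in node.findall("option"):
        if opt.get("name") == name:
            return opt.text
    return None


node = ET.fromstring(SNIPPET)
workdir = lookup_option(node, "workdir")    # option is present
missing = lookup_option(node, "outputdir")  # option is absent
```

Because a missing option yields None rather than an error, a configure implementation has to check for None before converting the value.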
startTimer(self, interval)

If a component needs to be re-activated at a later point in time, set an interval through this method. After interval milliseconds have elapsed, the timeout method is invoked. Each component can have only one timer. Please note that the timeout recurs every interval milliseconds until the timer is stopped. In offline mode, a timer that is never stopped prevents the PIPELINE from terminating, because the component remains active, i.e. waiting for a timeout.

Use configure and option to make the interval user configurable.
killTimer(self)

Stop an active timer. If no timer is active, this method has no effect.
restartTimer(self, interval)

Convenience method.

Stops the current timer and starts it again with the new interval.
timeout(self)

The default implementation does nothing.

Shall be implemented in a subtype if timers are used.
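The interplay of startTimer, timeout and killTimer can be sketched as follows. The component class is a stand-in that only records the timer state, run_timer simulates the event loop without waiting, and the poller component is hypothetical; none of this is the pipeline's actual implementation.

```python
class component(object):
    """Stand-in for the pipeline base class; it only records the timer state."""

    def __init__(self):
        self.interval = None  # None means that no timer is active

    def startTimer(self, interval):
        self.interval = interval

    def killTimer(self):
        self.interval = None

    def restartTimer(self, interval):
        self.killTimer()
        self.startTimer(interval)

    def timeout(self):
        pass


class poller(component):
    """Hypothetical component that polls a fixed number of times."""

    def __init__(self, attempts):
        component.__init__(self)
        self.remaining = attempts

    def timeout(self):
        self.remaining -= 1
        if self.remaining <= 0:
            # Stop the timer, otherwise the component stays active forever.
            self.killTimer()


def run_timer(comp):
    """Simulate the event loop: fire timeout() until the timer is stopped."""
    fired = 0
    while comp.interval is not None:
        comp.timeout()
        fired += 1
    return fired


p = poller(attempts=3)
p.startTimer(500)  # interval in milliseconds
fired = run_timer(p)
```

The important part is the killTimer call inside timeout: without it the simulated loop, like the PIPELINE in offline mode, would never finish.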
process(self, metadata)

The default implementation does nothing.

Shall be implemented in a subtype to process the incoming metadata reference.
processed(self, metadata)

This messaging method should be called to pass on the processed, and possibly supplemented, metadata reference.
serialize(self, metadata, rawnode)

The default implementation does nothing.

Shall be implemented in a subtype to write out results.

MetaData

A collection of data associated with a data file on disk.

Methods:

getFileName(self)

Returns the associated file name of the instance self.
setData(self, component, data)

Associates a component's data with the metadata instance self. May be retrieved using getData or removed using delData.
getData(self, component)

Returns the data set for component.
delData(self, component)

Removes the data set for component from the collection.
addTask(self, component, task)

Associates a component's task with the metadata instance self. Tasks may be retrieved using getTasks, removed using delTask and stopped by killTasks.
getTasks(self, component)

Returns a list of all tasks added for component.
delTask(self, component, task)

Removes the task added for component from the task list.
killTasks(self, component)

Terminates all tasks in the task list.
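The container behaviour of the methods above can be pictured with the following sketch. MetaDataSketch is a hypothetical stand-in written for this example, not the pipeline's MetaData class. Storing entries in dicts keyed by the component instance, and returning None or an empty list when nothing was stored, are assumptions. killTasks is left out because it depends on the task implementation.

```python
class MetaDataSketch(object):
    """Hypothetical stand-in that mimics the documented MetaData methods."""

    def __init__(self, filename):
        self._filename = filename
        self._data = {}   # component -> data object
        self._tasks = {}  # component -> list of tasks

    def getFileName(self):
        return self._filename

    def setData(self, component, data):
        self._data[component] = data

    def getData(self, component):
        # Assumption: None is returned if the component stored nothing.
        return self._data.get(component)

    def delData(self, component):
        self._data.pop(component, None)

    def addTask(self, component, task):
        self._tasks.setdefault(component, []).append(task)

    def getTasks(self, component):
        return list(self._tasks.get(component, []))

    def delTask(self, component, task):
        self._tasks[component].remove(task)


# Two placeholder components store their data independently of each other.
first, second = object(), object()
md = MetaDataSketch("sample_001.dat")
md.setData(first, {"points": 120})
md.setData(second, [1, 2, 3])
md.addTask(first, "task-a")
md.addTask(first, "task-b")
md.delTask(first, "task-a")
md.delData(second)
```

Keying everything by component is what lets each component retrieve only its own entries through getData(self) and getTasks(self).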

Each message is prefixed by a date/time stamp and the logging level. Message levels are:

DEBUG - information for developers
INFO - progress information
WARNING - potentially serious issues, processing continues
ERROR - fatal problems, processing must be stopped

Message output can be filtered by level and redirected into a file. See Command Line Arguments and Options for details.

Functions:

message.debug(message)

Submit the message on level DEBUG.
message.info(message)

Submit the message on level INFO.
message.warning(message)

Submit the message on level WARNING.
message.error(message)

Submit the message on level ERROR. ERRORs are considered fatal.
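The effect of level filtering can be illustrated with Python's standard logging module, which offers the same four levels. This is an analogy only: the pipeline's message module is not used here, and the logger name, the format string and the message texts are assumptions.

```python
import io
import logging

# Collect the output in memory; a logging.FileHandler would redirect it to a file.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
# Assumed prefix layout: date/time stamp, then level, then the message.
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

log = logging.getLogger("pipeline_sketch")
log.addHandler(handler)
log.propagate = False
log.setLevel(logging.INFO)  # filter: drop everything below INFO

log.debug("reading configuration")             # filtered out
log.info("processing sample_001.dat")          # progress information
log.warning("no matching buffer found")        # processing continues
log.error("output directory is not writable")  # considered fatal

lines = stream.getvalue().splitlines()
```

With the level set to INFO, only the last three messages reach the output; lowering it to DEBUG would include the first one as well.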