Parameters

xiRT needs two sets of parameters, supplied via two YAML files. The xiRT parameters define the network architecture and the learning tasks. Different or new types of chromatography (or other separation settings) influence the learning behavior, and these parameters therefore need adjustment. The learning parameters define the learning data (e.g. filtered to a desired confidence limit) and some higher-level learning behavior, for instance the settings that control cross-validation and the loading of pretrained models.

xiRT-Parameters

The xiRT-Parameters can be divided into several categories that reflect either the individual layers of the network or higher-level parameters. Since the input file structure is very dynamic, the xiRT configuration needs to be handled with care. For example, the RT information in the input data is declared in the predictions section, where the column names of the RT data need to be defined. Accordingly, the learning options in the output section must be adapted. Each prediction task needs the parameters x-activation, x-column, x-dimension, x-loss, x-metrics and x-weight, where “x” represents the separation method of interest (a sketch for adding a new method follows the example file below).

Here is an example YAML file including comments (from xiRT v. 1.0.32):

LSTM:
  activation: tanh      # activation function
  activity_regularization: l2       # regularization to use
  activityregularizer_value: 0.001  # lambda value
  bidirectional: true               # if RNN-cell should work bidirectional
  kernel_regularization: l2         # kernel regularization method
  kernelregularizer_value: 0.001    # lambda value
  lstm_bn: true                     # use batch normalization
  nlayers: 1                        # number of layers
  type: GRU                         # RNN layer type to use: GRU, LSTM, CuDNNGRU or CuDNNLSTM
  units: 50                         # number of units in the RNN cell
dense:          # parameters for the dense layers
  activation:   # activation functions to use (one entry per layer)
  - relu
  - relu
  - relu
  dense_bn: # use batch normalization
  - true
  - true
  - true
  dropout:  # dropout usage rate
  - 0.1
  - 0.1
  - 0.1
  kernel_regularizer:   # regularizer for the kernel
  - l2
  - l2
  - l2
  neurons:  # number of neurons per layer
  - 300
  - 150
  - 75
  nlayers: 3    # number of dense layers; the per-layer lists in this section must each have this many entries
  regularization:   # use regularization
  - true
  - true
  - true
  regularizer_value:    # lambda values
  - 0.001
  - 0.001
  - 0.001
embedding:      # parameters for the embedding layer
  length: 50    # embedding vector dimension
learning:       # learning phase parameters
  batch_size: 128   # observations to use per batch
  epochs: 75        # maximal epochs to train
  learningrate: 0.001   # initial learning rate
  verbose: 1        # verbose training information
output:     # important learning parameters
  callback-path: data/results/callbacks/       # network architectures and weights will be stored here
  # the following parameters need to be defined for each chromatography variable
  hsax-activation: sigmoid  # activation function, use linear for regression
  hsax-column: hsax_ordinal # output column name
  hsax-dimension: 10    # equals number of fractions
  hsax-loss: binary_crossentropy    # loss function, must be adapted for regression / classification
  hsax-metrics: mse     # report the following metric
  hsax-weight: 50       # weight to be used in the loss function
  rp-activation: linear
  rp-column: rp
  rp-dimension: 1
  rp-loss: mse
  rp-metrics: mse
  rp-weight: 1
  scx-activation: sigmoid
  scx-column: scx_ordinal
  scx-dimension: 9
  scx-loss: binary_crossentropy
  scx-metrics: mse
  scx-weight: 50
siamese:        # parameters for the siamese part
  use: True         # use siamese
  merge_type: add   # how to combine the outputs of the two siamese branches
  single_predictions: True  # use also single peptide predictions
callbacks:                  # callbacks to use
  check_point: True
  log_csv: True
  early_stopping: True
  early_stopping_patience: 15
  tensor_board: False
  progressbar: True
  reduce_lr: True
  reduce_lr_factor: 0.5
  reduce_lr_patience: 15
predictions:
    # parameters that define how the input variables are treated
    # "continues" means that linear (regression) activation functions are used for the learning.
    # if this should be done, the above parameters must also be adapted (weight, loss, metric, etc)
  continues:
    - rp
  fractions: # simply write fractions: [] if no fraction prediction is desired
    # if (discrete) fraction numbers should be used for the learning, this needs to be
    # indicated here. For fractions, either ordinal regression, classification or plain
    # regression can be used.
    - scx
    - hsax
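For a new separation dimension, the same pattern is extended with an additional set of x-prefixed keys. As a minimal sketch, assuming a hypothetical continuous separation setting named "sec" (the key and column names are illustrative, not part of xiRT), the additions to the output and predictions sections would look like this:

output:
  sec-activation: linear    # linear activation for a regression task
  sec-column: sec           # column in the input data holding the measured values
  sec-dimension: 1          # a regression task predicts a single value
  sec-loss: mse
  sec-metrics: mse
  sec-weight: 1
predictions:
  continues:
    - rp
    - sec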

Apart from the neural network architecture definitions, the YAML also defines how the target variables are encoded.
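For the fraction tasks, the combination of a sigmoid activation, a binary_crossentropy loss and a dimension equal to the number of fractions corresponds to an ordinal encoding of the fraction numbers. Here is a minimal sketch of such a cumulative encoding (illustrative; the exact encoding used internally by xiRT may differ):

import numpy as np

def ordinal_encode(fraction, n_fractions):
    # fraction 3 of 9 -> [1, 1, 1, 0, 0, 0, 0, 0, 0]: position i is 1 if the
    # peptide elutes in fraction i + 1 or later, matching a sigmoid output of
    # dimension n_fractions trained with binary_crossentropy
    vec = np.zeros(n_fractions, dtype=int)
    vec[:fraction] = 1
    return vec

print(ordinal_encode(3, 9))  # [1 1 1 0 0 0 0 0 0]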

Learning-Parameters

Parameters that govern the separation of training and testing data for the learning.

Here is an example YAML file with comments (from xiRT v. 1.0.32):

# preprocessing options:
# le: str, label encoder location. Only needed for transfer learning or the usage of
#   pretrained models
# max_length: int, max length of sequences (-1 = no limit)
# cl_residue: bool, if True crosslinked residues are encoded as Kcl or in modX format clK
preprocessing:
    le: None
    max_length: -1      # -1 = no length filter
    cl_residue: True


# fdr: float, a FDR cutoff for peptide matches to be included in the training process
# ncv: int, number of CV folds to perform to avoid training/prediction on the same data
# mode: str, must be one of: train, crossvalidation, predict
# train and transfer share the same options that are necessary to run xiRT; here is a brief rundown:
# augment: bool, if data augmentation should be performed
# sequence_type: str, must be linear, crosslink, pseudolinear. crosslink uses the siamese network
# pretrained_weights: "None", str location of neural network weights. Only embedding/RNN weights
#   are loaded. pretrained weights can be used with all modes, essentially resembling a transfer
#   learning set-up
# test_frac: float, fraction of the data that is reserved for testing
# sample_frac: float, (0, 1] used for downsampling the input data (e.g. for learning curves).
#   Usually left at 1 if all data should be used for training
# sample_state: int, random state to be used for shuffling the data. Important for recreating
#   results.
# refit: bool, if True the classifier is refit on all the data below the FDR cutoff to
#   predict the RT times for all peptide matches above the FDR cutoff. If False, the
#   already trained CV classifier with the lowest validation loss is chosen
train:
  fdr: 0.01
  ncv: 3
  mode: "crossvalidation" # other modes are: train / crossvalidation / predict
  augment: False
  sequence_type: "crosslink"
  pretrained_weights: "None"
  test_frac: 0.10
  sample_frac: 1
  sample_state: 21
  refit: False
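
The interplay of ncv and refit can be pictured with a minimal sketch of the fold logic, here written with scikit-learn's KFold for illustration (an assumption, not xiRT's actual implementation):

import numpy as np
from sklearn.model_selection import KFold

psm_indices = np.arange(900)  # toy stand-in for PSMs passing the FDR cutoff
kf = KFold(n_splits=3, shuffle=True, random_state=21)  # ncv=3, sample_state=21

for fold, (train_idx, predict_idx) in enumerate(kf.split(psm_indices)):
    # each fold model is trained on two thirds of the confident PSMs and
    # predicts RTs only for the held-out third, so no PSM is predicted by a
    # model that saw it during training
    print(f"fold {fold}: train={len(train_idx)} predict={len(predict_idx)}")

# with refit=False, the fold model with the lowest validation loss is then used
# to predict RTs for the matches above the FDR cutoff; with refit=True, a new
# model is first refit on all matches below the cutoff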

Generally, it is better to supply high-quality data than simply more data. Considerable drops in performance can sometimes be observed when input data filtered at 5% FDR is used instead of 1% FDR. However, there is no general rule of thumb, and this needs to be optimized per run / experiment.

Hyperparameter-Optimization

Neural networks are very sensitive to their hyperparameters. To automate the daunting task of finding the right hyperparameters, two utilities are shipped with xiRT: 1) a convenience function that generates individual YAML files from a grid YAML file, and 2) a snakemake workflow that runs xiRT with each parameter combination.

The grid is generated from all entries where a list of values is passed instead of a single value. This can lead to an enormous search space, so step-wise optimization is sometimes the only viable option.
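
Here is a minimal sketch of the expansion idea, assuming PyYAML is available (illustrative only; the convenience function shipped with xiRT may differ in its interface):

import itertools
import yaml

# a grid YAML passes lists where single values are expected
grid = {"learning": {"learningrate": [0.001, 0.0001], "batch_size": [64, 128]}}

# every (section, key) pair with a list of values becomes one grid axis
axes = [((sec, key), vals)
        for sec, entries in grid.items()
        for key, vals in entries.items()]

for i, combo in enumerate(itertools.product(*(vals for _, vals in axes))):
    params = {}
    for ((sec, key), _), value in zip(axes, combo):
        params.setdefault(sec, {})[key] = value
    with open(f"params_{i:03d}.yaml", "w") as fh:
        yaml.safe_dump(params, fh)  # one single-valued YAML per grid point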