Integration of BioDICE plugin with Taverna Environment

This page presents an extension of the Biological Data Interactive Clustering Explorer (BioDICE) project that brings visual exploration of biological data to the Taverna bioinformatics workflow suite.

The interface is distributed as a plugin for Taverna 2.3 and, without further configuration, runs on any operating system that runs Java.

This page first describes the tool and all its components, then gives complete step-by-step installation instructions. A Taverna workflow for drug discovery analysis is shown as a use case; the concepts used in this workflow should be easily transferable to similar scenarios.

BioDICE Tool Description

The BioDICE tool is composed of five components applied in cascade:

  1. Singular Value Decomposition (SVD)
  2. Fast Learning Self Organizing Maps (FLSOM)
  3. Interactive U-Matrix visualization
  4. Canny Edge Detection
  5. Region Growing Segmentation

After the biological data are loaded, the BioDICE tool reduces the multidimensional vector space representing the input patterns using the SVD algorithm. FLSOM then clusters the input data and shows the results as a U-Matrix visualization. The clusters can be explored interactively, so that relationships among groups of elements can be selected and analysed manually.
In addition, the BioDICE tool provides a "Find Clusters" button that detects all groups of clusters semi-automatically, by means of a customizable version of the Canny edge detection algorithm.
Finally, a region growing algorithm segments the interactive map into clusters, using the cluster boundaries provided by the edge detection algorithm.
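
The last step can be illustrated with a minimal Python sketch of region growing, here a 4-connected flood fill over a boolean edge mask. The function name `grow_regions` and the tiny mask are hypothetical, and this is not BioDICE's actual implementation; it only shows how boundaries found by an edge detector can split a map into labelled regions.

```python
from collections import deque


def grow_regions(edges):
    """Label 4-connected regions of non-edge cells.

    `edges` is a list of rows of booleans (True marks a boundary cell).
    Returns a grid of ints: 0 for boundary cells, 1..n for regions.
    """
    rows, cols = len(edges), len(edges[0])
    labels = [[0] * cols for _ in range(rows)]
    next_label = 0
    for r in range(rows):
        for c in range(cols):
            if edges[r][c] or labels[r][c]:
                continue  # boundary cell, or already assigned to a region
            next_label += 1
            labels[r][c] = next_label
            queue = deque([(r, c)])
            while queue:  # breadth-first growth from the seed cell
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not edges[ny][nx] and not labels[ny][nx]):
                        labels[ny][nx] = next_label
                        queue.append((ny, nx))
    return labels


# Hypothetical 3x5 map: a vertical boundary in column 2 separates two regions.
mask = [[False, False, True, False, False] for _ in range(3)]
segmented = grow_regions(mask)
```

Each non-boundary cell ends up with the label of the region it was grown into, while boundary cells keep the label 0.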

The core component of BioDICE is the FLSOM algorithm with principal component initialization, a machine learning algorithm based on SOMs (also known as Kohonen maps), a family of unsupervised neural networks. More specifically, it belongs to the ESOM (Emergent SOM) family.

FLSOM Initialization

In general, SOM results are influenced by the initialization of the neurons’ weight vectors; this led us to adopt a linear initialization technique, which reduces execution time and improves results. Linear initialization procedures are based on the analogy between SOMs and the principal curve analysis algorithm, where a principal curve is a non-linear generalization of a principal component (from Principal Component Analysis, PCA). The duality between PCA and SVD (Singular Value Decomposition) allows us to replace the principal components with the singular vectors obtained from SVD. SVD is a factor analysis technique aimed at reducing the multidimensional vector space representing the input patterns; in our case this goal is achieved using Truncated Singular Value Decomposition (TSVD). The set of multidimensional input patterns can then be represented by a reduced number of latent factors, which retain most of the information in the original set.

In our case SVD is used to project patterns and features into a new space where both have the same dimensionality, allowing us to place them on the same SOM.
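
In matrix terms, the reduction can be written as follows. This is the standard truncated SVD; the symbols (X for the pattern-by-feature matrix, k for the number of retained factors) and the choice of scaling both sets of coordinates by the singular values are assumptions made for illustration, not notation taken from the BioDICE sources.

```latex
X \approx X_k = U_k \Sigma_k V_k^{\top}, \qquad
X \in \mathbb{R}^{m \times n},\;
U_k \in \mathbb{R}^{m \times k},\;
\Sigma_k \in \mathbb{R}^{k \times k},\;
V_k \in \mathbb{R}^{n \times k}

P = U_k \Sigma_k \in \mathbb{R}^{m \times k} \quad \text{(patterns)}, \qquad
F = V_k \Sigma_k \in \mathbb{R}^{n \times k} \quad \text{(features)}
```

The rows of P and the rows of F are all k-dimensional vectors, which is what allows patterns and features to be presented to the same map.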

FLSOM Learning

In order to improve the clustering of input data, we adopt the FLSOM algorithm, which uses simulated annealing as the evaluation and steering criterion for map evolution. This heuristic improves the quality of the learning process by providing an adaptive learning rate factor. Like the original SOM, the FLSOM is trained in epochs; at the end of each learning epoch the Quantization Error (QE) is calculated as the average distance between each data vector and its best matching unit. QE is used as the “temperature” parameter and its progression describes the evolution of the system.
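
The QE computation can be sketched in a few lines of Python. This is a minimal illustration, assuming Euclidean distance and averaging over all data vectors (the usual convention for SOMs); the function name and the toy data are hypothetical and not taken from the BioDICE code.

```python
import math


def quantization_error(data, codebook):
    """Mean Euclidean distance from each data vector to its best matching unit."""
    total = 0.0
    for vector in data:
        # The best matching unit (BMU) is the neuron with the closest weights.
        total += min(math.dist(vector, weights) for weights in codebook)
    return total / len(data)


# Hypothetical toy data: two neurons and three 2-D input vectors.
codebook = [(0.0, 0.0), (1.0, 1.0)]
data = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
qe = quantization_error(data, codebook)
```

A lower value means the neurons represent the data more closely, which is why the QE trend can be used to track how training is progressing.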

The learning process stops when the difference between the QE calculated at the end of the current epoch and that of the previous epoch is smaller than a threshold value. Each learning epoch generates a perturbation of the neurons’ state (i.e. the evolution of the network); if this perturbation satisfies the low-energy criteria defined by simulated annealing, the current configuration is accepted and a new perturbation is calculated. Otherwise, the first perturbation is used in the next epoch and, if it is better than the current one, the previous configuration is restored.
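
The stopping rule alone can be sketched as below. This covers only the QE threshold test, not the simulated annealing acceptance step; `run_epoch` is a hypothetical callable standing in for one FLSOM learning epoch, and the `max_epochs` safety cap is an assumption added for the example.

```python
def train_until_stable(run_epoch, threshold, max_epochs=1000):
    """Run epochs until the QE changes by less than `threshold`.

    `run_epoch` is a hypothetical callable that performs one learning
    epoch and returns the QE measured at its end.
    """
    history = [run_epoch()]
    while len(history) < max_epochs:
        history.append(run_epoch())
        if abs(history[-1] - history[-2]) < threshold:
            break  # QE has stabilised between consecutive epochs
    return history


# Stand-in for real training: QE values that level off.
fake_qe = iter([0.9, 0.5, 0.3, 0.25, 0.249, 0.2489])
history = train_until_stable(lambda: next(fake_qe), threshold=0.01)
```

With these made-up values, training stops as soon as two consecutive epochs differ by less than 0.01, leaving the last value unread.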

Taverna Plug-In Installation Instructions

The BioDICE plugin, currently available to all Taverna Workbench 2.3 users, provides a local service that can be used as a workflow component.

Installing

Taverna’s official plugin installation guides:
http://dev.mygrid.org.uk/wiki/display/taverna/Finding+plugins
http://dev.mygrid.org.uk/wiki/display/taverna/Installing+plugins

To install the BioDICE plugin, add our update site to the list of update sites in Taverna’s plugin manager. To add the site, press the ‘Add update site’ button and paste this URL [http://biolab.pa.icar.cnr.it/repo] into the ‘URL site’ field of the pop-up dialog (fill in the ‘Site name’ field as you see fit). The BioDICE plugin will then be listed alongside the others, ready to be installed.
A tutorial is available here.


Using the Taverna Plug-In

Once the plugin is installed, an 'icar.cnr.it BioDICE' folder will show up inside Taverna’s service panel (accessible from the ‘Design’ perspective). Open the folder and drag-and-drop the listed component into your workflow.

Configuration

Configuration Panel screenshot

The BioDICE local service can be configured during the workflow design phase by right-clicking the corresponding graph node and selecting ‘Configure BioDICE…’ from the contextual menu; a dialog listing all configurable parameters will pop up.
Parameter descriptions:

  • ‘Number of features’ lets you adjust the dimension of the reduced vector space generated by TSVD; it’s possible to disable space reduction altogether by ticking the adjacent ‘Disable’ checkbox.
  • The next pair of fields sets the height and width of the FLSOM lattice. The tool performs better with large maps (e.g., 100x100, 150x150, 200x200).
  • The last group of settings lets you set the maximum and minimum values of FLSOM’s learning parameter (alpha) and neighbourhood radius (sigma), as well as the radius (init radius) of the initialization algorithm.

 

Input

The BioDICE local service has a single depth-0 input port, named ‘ids’. The input must be a string formatted according to the following rules:

  • Every line must end with a single ‘\n’ character.
  • ‘id:list’ is accepted as the first input line; this header is optional and may be omitted.
  • Every other line must start with a numeric id (representing the feature's id) without leading or trailing whitespace, followed by ‘:’ as the separator from the list.
  • Every id is followed by the list of elements that have that specific feature. The list consists of strings separated by ‘,’; the last one must not be followed by a ‘,’.
  • Any character except ‘\n’, ‘:’ and ‘,’ may be used in strings.

 

Input example:

Input grammar regex: (id:list\n)?([0-9]+:([^,\n:]+,)*[^,\n:]+\n)+
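
As a rough illustration, the rules listed above can be checked and parsed with Python's `re` module. The pattern below is one way to encode those rules, and `parse_ids` and the sample compound names are hypothetical; this is a sketch for preparing or checking input outside Taverna, not the plugin's own parser.

```python
import re

# Optional 'id:list' header, then one or more 'id:elem,elem,...' lines,
# each terminated by a single newline.
IDS_PATTERN = re.compile(r"(id:list\n)?([0-9]+:([^,\n:]+,)*[^,\n:]+\n)+")


def parse_ids(text):
    """Return {feature_id: [elements]} or raise ValueError if malformed."""
    if not IDS_PATTERN.fullmatch(text):
        raise ValueError("input does not follow the 'ids' format")
    features = {}
    # The text is known to end with exactly one newline at this point.
    for line in text[:-1].split("\n"):
        if line == "id:list":
            continue  # optional header line
        feature_id, elements = line.split(":")
        features[int(feature_id)] = elements.split(",")
    return features


# Hypothetical input: feature ids mapped to made-up element names.
sample = "id:list\n1:aspirin,ibuprofen\n2:ibuprofen\n"
parsed = parse_ids(sample)
```

A trailing comma, a missing final newline or a non-numeric id all make `parse_ids` raise `ValueError`, which is a convenient check before passing a string to the ‘ids’ port.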

Plug-in Execution

BioDICE screenshot

Once the workflow is running and the FLSOM component has received its input, a window opens containing the U-Matrix visualization, progress information and a set of controls. Among them you will find:
• A ‘Stop’ button, used to halt the learning phase; you may use it whenever you think the FLSOM is trained enough. Halting the learning procedure does not affect the ability to perform clustering. The learning procedure halts by itself once the QE variation reaches a threshold, so pressing the ‘Stop’ button is entirely optional.
• A Canny edge detection parameters panel with 4 sliders: 1) the maximum threshold, 2) the minimum threshold, 3) the Gaussian kernel radius, 4) the Gaussian kernel width. Moving these sliders replaces the current U-Matrix visualization with a cluster preview.
• A ‘Find Clusters’ button that starts clustering and, when clustering is complete, closes the window and forwards the results to the component’s output port.
• Three alternative visualization options, allowing you to choose the type of element information (position, code or nothing) displayed over the SOM. These settings have no bearing on the learning and clustering phases.
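
In the standard Canny algorithm, a maximum and a minimum threshold drive the hysteresis step: cells above the maximum are always edges, and cells above the minimum are kept only when connected to such an edge. Assuming the two threshold sliders follow that convention, their effect can be sketched as below; the function, the 8-neighbourhood choice and the sample values are illustrative assumptions, not BioDICE's code.

```python
def hysteresis(strength, low, high):
    """Keep cells >= high, plus cells >= low connected to them (8-neighbourhood)."""
    rows, cols = len(strength), len(strength[0])
    edges = [[False] * cols for _ in range(rows)]
    # Start from the strong cells and grow into adjacent weak ones.
    stack = [(r, c) for r in range(rows) for c in range(cols)
             if strength[r][c] >= high]
    for r, c in stack:
        edges[r][c] = True
    while stack:
        r, c = stack.pop()
        for nr in range(max(r - 1, 0), min(r + 2, rows)):
            for nc in range(max(c - 1, 0), min(c + 2, cols)):
                if not edges[nr][nc] and strength[nr][nc] >= low:
                    edges[nr][nc] = True
                    stack.append((nr, nc))
    return edges


# Hypothetical gradient magnitudes: 0.9 is strong, 0.5 is weak, 0.1 is noise.
gradient = [
    [0.9, 0.5, 0.1, 0.5],
    [0.1, 0.1, 0.1, 0.1],
]
edge_mask = hysteresis(gradient, low=0.4, high=0.8)
```

In this toy grid the weak cell next to the strong one is kept, while the isolated weak cell in the last column is discarded; raising the minimum threshold therefore removes faint boundaries, and lowering the maximum threshold adds more seed edges.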

Output

The BioDICE local service produces a depth 1 output, i.e. a list of strings. Each string represents one cluster and is formatted as a comma-separated list of elements.
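
A downstream step that consumes this output only needs to split each string on commas. The sketch below assumes the list has already been obtained from the output port; the function name and the sample values are hypothetical.

```python
def parse_clusters(output):
    """Turn a list of comma-separated strings into a list of element lists."""
    return [cluster.split(",") for cluster in output]


# Hypothetical output-port value with two clusters.
clusters = parse_clusters(["aspirin,ibuprofen", "paracetamol"])
sizes = [len(members) for members in clusters]
```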

Use Case: A Demonstrative Taverna Workflow for Analysis of Chemical Compounds

 

We have created a Taverna Workbench workflow showcasing a possible application of the BioDICE plugin; it can be downloaded from the myExperiment repository (here).

Workflow Overview

This workflow downloads an input set of molecular compounds in SMILES format, using the ChemSpider service. The most frequent molecular fragments are extracted with the MoSS tool, in order to obtain a set of features for each compound. The BioDICE service, which implements the Fast Learning Self-Organizing Map (FLSOM) algorithm, then performs clustering and visual exploration of the input dataset. The final output is a list of compounds for each cluster. The workflow requires as input a text file with a list of compound names (or ChemSpider IDs) and a file with a list of MoSS minimum support in focus values (see the MoSS documentation for a detailed description).
A tutorial is available here.
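
The hand-off from fragment extraction to BioDICE amounts to inverting a compound-to-features mapping into the feature-to-compounds lines expected by the ‘ids’ input port. The sketch below assumes the fragment assignments are already available as a Python dictionary; the function name, the dictionary and the compound names are hypothetical, and the way MoSS results are actually read in the workflow is not shown.

```python
def build_ids_input(compound_features):
    """Invert {compound: [feature ids]} into the 'id:elem,elem' line format."""
    by_feature = {}
    for compound, feature_ids in compound_features.items():
        for feature_id in feature_ids:
            by_feature.setdefault(feature_id, []).append(compound)
    lines = ["id:list"]  # optional header line
    for feature_id in sorted(by_feature):
        lines.append(f"{feature_id}:{','.join(by_feature[feature_id])}")
    return "\n".join(lines) + "\n"


# Hypothetical fragment assignments; real ones would come from MoSS.
fragments = {"compoundA": [1, 2], "compoundB": [2], "compoundC": [1]}
ids_text = build_ids_input(fragments)
```

Compound names must not contain ‘,’, ‘:’ or newline characters, since those are reserved by the input format.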

Sample Input Files

For this use case, the input file is provided by the NCI DTP Discovery Service and consists of a set of 101 FDA-approved anticancer drugs. A flat list of these 101 compounds is available here.
An additional file containing a list of 101 minimum support in focus values (one for each compound), as required by the MoSS tool, is available here.

About MoSS

MoSS (also known as MOFA) is a program that finds frequent molecular substructures and discriminative fragments in a set of molecular descriptions.

Setting up MoSS Beanshell

To execute the workflow properly you’ll have to download the MoSS jar archive moss.jar. The linked jar contains MoSS 6.8, which is compatible with Java ≥ 1.6. The latest MoSS release (version 6.10) is available here and is compatible only with Java 1.7.0. MoSS is available under the terms of the GNU Lesser (Library) General Public License. Copy the jar to the path specified in the ‘Dependencies’ tab, which can be reached by right-clicking the MoSS workflow component and choosing the ‘Edit beanshell script’ option. The jar should then be listed in the ‘Local JAR files’ panel; if it is not, try closing and reopening the window. Once it is displayed, tick the corresponding checkbox and press the ‘Apply’ button to confirm the changes. You should now be able to run the workflow successfully.

Sources

The source of the Taverna plug-in is available, under the terms of the GNU Lesser (Library) General Public License, as a zip file.
The source of the Taverna workflow used for the case study is stored in myExperiment and can be downloaded here.
The source package of the stand-alone service is still being prepared. It will soon be offered here under the terms of the GNU General Public License.

References

How to Cite

A. Fiannaca, G. Di Fatta, R. Rizzo, A. Urso, S. Gaglio (2013) Simulated annealing technique for fast learning of SOM networks. Neural Computing and Applications, 22(5):889-899.

See Also

  • On SOM for biological data

    • A. Fiannaca, G. Di Fatta, A. Urso, R. Rizzo, S. Gaglio (2009) A New Linear Initialization in SOM for Biomolecular Data. LNCS, 5488:177-187.
    • G. Di Fatta, A. Fiannaca, R. Rizzo, A. Urso, M. R. Berthold, S. Gaglio (2006) Context-Aware Visual Exploration of Molecular Databases. ICDM 2006, 136-141.
  • On Workflows

Contact

Antonino Fiannaca, PhD - Post Doc Research Fellow
National Research Council of Italy
High Performance Computing and Networking Institute (CNR-ICAR)
Viale delle Scienze - edificio 11
90128 Palermo
Italy

Phone: +39 091 6809279
Mobile: +39 320 0969201
E-mail: fiannaca AT pa.icar.cnr.it