ELMworkshop2006
Workshop on Machine Learning for BioMedical Informatics 2006
Workshop on Machine Learning for BioMedical Informatics 2006 and the Inaugural Meeting of the ISCB-SC RSG of Singapore
A 1-day hands-on workshop co-organised by the
Association for Medical and Bioinformatics Singapore (AMBIS)
Asia Pacific Bioinformatics Network (APBioNet)
and
the Singapore Regional Student Group,
International Society for Computational Biology Students Council (ISCB-SC)
Registration Information
- Date: Tuesday 21 Nov 2006
- Venue: NUS Mathematics Dept Computer Lab
#01-15 Block S14, Science Drive 2
- Map: Location Map of S14 [1]
NUS Campus Map [2]
- Parking: Carpark 10 [3]
- Time: 9:30am-5:00pm
- Co-Organisers:
- AMBIS http://www.ambis.org.sg/
- APBioNet http://www.apbionet.org/
- Regional Students Group (RSG) of Singapore, International Society for Computational Biology Student Council (ISCB-SC)
- Hosts:
- National University of Singapore Bioinformatics Centre (Resources)
Department of Biochemistry, Yong Loo Lin School of Medicine.
Contact: A/Prof Tan Tin Wee bchtantw_at_nus.edu.sg
- Registration:
- Registration is Free
- Token $10 charge for course material, food and refreshments
- Limited to 30 participants
- Contact: heiny@nus.edu.sg cc to saraswathi55@gmail.com and mark@bic.nus.edu.sg for registration details
- Workshop Trainers:
- Advisor: Prof. Sundararajan, EEE, NTU (67905027)
- A. Course Trainer: Dr. G.-B. Huang, EEE, S. Suresh, EEE, Saraswathi.S., NTU.
- B. Course Trainer: Mr Mohamed Ariffin Kamaludin, CEO and Founder, Retriva Solutions (an NUS Venture Support company)
- Demonstrators: Lin Hong Huang, Mark De Silva, Lim Kuan Siong, Lawrence Wee, Asif Khan, Heiny Tan, Tan Tin Wee
- Circulation of Call for Participation
- NUS - Heiny Tan
- NTU - Saraswathi
- AMBIS - Darran Nathan
- APBioNet - TW Tan
- BIC (resources) - Mark
- NUS All Secs - TW Tan via Biochem Dept
Introduction
Machine Learning in Medical and Bioinformatics
The advent of the Genome Projects combined with the ICT revolution has globally transformed the way research in the biomedical fields is carried out. It has also amplified by several orders of magnitude the output of research data and publications. Accumulated data in the life sciences has reached proportions which require rapid methods for making sense of the information and transforming it into actionable knowledge. One popular approach has been to use Machine Learning techniques in the computing sciences and apply and adapt them to the life sciences. Computing techniques regularly used in medical and bioinformatics include Artificial Neural Networks (ANN), Genetic Algorithms (GA), Hidden Markov Models (HMM), Support Vector Machines (SVM), etc.
More and more life scientists need to acquire knowledge and practice in these techniques in order to improve the quality of their research. AMBIS, APBioNet and ISCB-SC's RSG (Singapore) have partnered with NUS to organise a 1-day hands-on training workshop on two tools: the Extreme Learning Machine (ELM), developed by NTU professors, and PISMEC, a self-learning bibliographic clustering/categorisation system developed by an NUS spin-off company, Retriva Solutions.
All participants will be equipped with a PC in a computer lab, hands-on course material and the necessary software.
Training Program
Morning Session:
9:30am-11:30am
Introduction to Extreme Learning Machine
by Dr Huang Guangbin (Website CV [4])
11:30am-12noon
Introduction to Machine Learning Classification for Bibliographic data
using PISMEC bibliographic categorisation of PubMed using
Medical Subject Heading (MESH) Ontology
with Verity K2 Machine Learning System
by Mr Mohamed Ariffin, CEO Retriva Solutions http://www.retriva.com
12-1pm Lunch Break (not provided)
Afternoon Session:
1:00pm-3:00pm
Hands-on 1: Cancer Detection using Global Cancer Map Using Extreme Learning Machine
By Dr Sundaram Suresh and Ms Saraswathi Sundararajan
3:00pm-3:30pm Tea Break (refreshments provided)
3:30pm-4:30pm
Hands-on 2: Predicting Accessible Surface Area of Amino Acids using Extreme Learning Machine
By Dr Sundaram Suresh and Ms Saraswathi Sundararajan
4:30pm - 5pm
Demo: PISMEC bibliographic categorisation of PubMed
Trainer: Mr Mohamed Ariffin, CEO Retriva Solutions and Taufik.
5pm-6pm
Inaugural Meeting of the First Regional Students Group (Singapore)
International Society for Computational Biology Students Council (ISCB-SC)
Part A. Extreme Learning Machine
In the morning session, we will compare Extreme Learning Machine with other methods. In the afternoon, we will be practising the use of ELM with specific biomedical datasets.
1. Cancer Detection using Global Cancer Map – A Sparse Data Classification Problem
Cancer detection and classification using standard clinical data is a difficult process. Identifying the anatomic site of origin of a cancer is also an important component of cancer treatment. Hence, in recent years biomedical research has focused on identifying the site of origin, type and molecular classification of cancers, and researchers have started using gene expression data to classify human cancer types. The gene expression signature database is a collection of micro-array data for cancer and normal tissue specimens, collected from 6 medical institutions. Each tissue specimen is described by 16,063 genes, and the database has 198 primary samples from 14 types of cancer. The functional relationship between the gene expression data and the cancer type is unknown. Hence, Ramaswamy et al. used a well-known machine learning approach, the support vector machine (SVM), to predict the cancer type. Since the data set is sparse in nature and very few samples are available to estimate the functional relationship, SVM-based classification does not perform well (overall classification accuracy is less than 78%). The SVM also requires more computational effort. In this talk, an Extreme Learning Machine approach for sparse-data multi-category classification problems will be discussed. The Sparse-ELM approach requires less computation time (approximately 20 times faster than SVM) and provides better performance (overall classification accuracy is 8% higher than with the SVM approach). A large number of features combined with few samples is quite common in bioinformatics problems; hence, one can use this Sparse-ELM approach to achieve better predictability of cancer types.
2. Predicting Accessible Surface Area of Amino Acids using Extreme Learning Machine
Protein–protein interactions have a central role in numerous processes in biological cells and are one of the major areas of research in proteomics. Understanding the mechanisms of protein–protein interactions is vital when addressing issues associated with biological function and disease. In addition, predicting protein three-dimensional (3D) structure directly from amino acid sequences remains an open and important problem in the life sciences. Bioinformatics approaches first focus on predicting the secondary structure and/or the solvent accessibility of a protein, which represent 1D projections of the complicated 3D structure. Successful prediction of solvent accessibility helps to elucidate the relationship between protein structure and interactions, and solvent accessibility information leads to numerous insights into the organization of the 3D structure. Hence, in this talk, we address the problem of estimating the solvent accessible surface area of amino acid residues in protein sequences using the extreme learning machine approach. The problem is converted into a classification of amino acids into different types (buried, non-buried) using the scoring matrices obtained from PSI-BLAST. The functional relationship between the scoring matrices and the residue type is estimated using the extreme learning machine. A set of 30 proteins containing 7,545 residues from the Manesh dataset was selected for training the two-stage ELM classifier, and the remaining residues are used for testing it. The results show that the ELM-based approach performs better than the other approaches reported in the literature.
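The conversion of PSI-BLAST scoring matrices into a buried/non-buried classification problem can be sketched as follows. This is only an illustrative sketch, not the trainers' actual pipeline: the function names, the sliding-window encoding, the zero padding at the chain ends, the 5-residue window and the 0.25 accessibility threshold are all assumptions, and the scoring matrix below is made up rather than produced by PSI-BLAST.

```python
import numpy as np


def window_features(pssm, half_width):
    """Turn an (n_residues, 20) scoring matrix into one feature row per residue.

    Each row concatenates the scores of the residue and its neighbours;
    positions beyond the chain ends are zero-padded (an assumed convention).
    """
    n_residues, n_cols = pssm.shape
    padding = np.zeros((half_width, n_cols))
    padded = np.vstack([padding, pssm, padding])
    width = 2 * half_width + 1
    return np.stack([padded[i:i + width].ravel() for i in range(n_residues)])


def burial_labels(relative_accessibility, threshold):
    """Label residues: 0 = buried (below threshold), 1 = non-buried."""
    return (np.asarray(relative_accessibility) >= threshold).astype(int)


# Made-up scores for a 6-residue chain (real values would come from PSI-BLAST).
pssm = np.arange(6 * 20, dtype=float).reshape(6, 20)
features = window_features(pssm, half_width=2)  # hypothetical 5-residue window
labels = burial_labels([0.05, 0.40, 0.10, 0.60, 0.30, 0.20], threshold=0.25)
```

Each row of `features` together with the matching entry of `labels` would form one training sample for a classifier such as ELM.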
Part B. PISMEC from Retriva
The PISMEC system, a Personalised Information System for the bio-Medical Community, has been developed by Retriva Solutions, an NUS spin-off company led by its founding members from the School of Computing, NUS over the past few years under the watchful eye of NUS computer science professor Tan Chew Lim. It has been deployed for Singapore Eye Research Institute (SERI) for annotation and classification of bibliographic references using a medical ontology and demonstrated to the National Cancer Centre (NCC).
With the support from NUS Venture Support and many other individuals, Retriva Solutions is applying this technology to various other knowledge domains and is looking for active collaborators in different domains of scientific endeavour.
Technical specifications and details are found in [5] and [6]
Inaugural Meeting of ISCB-SC RSG-Singapore
First Meeting of the Regional Student Group (RSG) of Singapore
International Society for Computational Biology Student Council (ISCB-SC)
Date: 21 Nov 2006
Time: 5pm-6pm
Venue: Seminar Room M9
Block MD7 Dept of Biochemistry
Yong Loo Lin School of Medicine
National University of Singapore
Medical Drive
Singapore
All students are welcome!
Tentative Agenda
1. Welcome
2. Introduction to the ISCB-SC
3. Introduction to the RSG of Singapore and AMBIS student membership
4. Pro-tem committee members and call for nominations
5. Proposed Elections
6. Forthcoming RSG Activities
7. RSG Projects
Potential RSG Projects
- 2nd ASEAN-India Bioinformatics Workshop, Hotel Samrat, New Delhi. 14-16 December 2006
- Student Assistance and organisers
- Call for Participation in India for formation of RSG-India
- Curriculum Design and Accreditation
- 2nd East Asia Bioinformatics Workshop, KOBIC, Daejon, Korea. March 2007
- BioDataGrid administration
- BioDatabase Monitoring and Registry Project
- 1st World Wide Workflow Grid W3FG Symposium, GridAsia2007, Singapore. June 2007
- Student Helpers and free registration
- Paper submission in life science bioinformatics applications with workflow integration
- ISMB'07 Conference Vienna July 2007
- Student Helpers and subsidised registration
- Call for formation of RSG-Europe
- Annual General Meeting of the ISCB-SC student council meeting
- Submission of papers in bioinformatics
- 4th International Life Science Grid Conference LSGrid2007, at UK eScience All-Hands Meeting, Glasgow, UK. Sep 2007
- Student Helpers and proposed free registration
- Call for formation of RSG-UK
- Submission of papers in life science grid computing
- Student posters
- International Conference on Bioinformatics InCoB2007 tbd
- Student helpers and subsidised registration
- Submission of papers in bioinformatics
- Student posters
- S* Life Science Informatics Alliance Practical Workshop Dec06 - March 2007
- Student Teaching Assistants
- Course design
- 19th Genome Informatics Workshop 2007, Singapore. Dec 2007
- Student helpers and subsidised registration
- Student posters
RSG-Singapore Pro-Tem Committee
- Saraswathi Sundararajan (NTU student rep and RSG founder)
- Heiny Tan (NUS student rep)
- Mynampati Kalyan Chakravarthy (NUS student rep)
- Susan Moore (AMBIS coordinator)
Pre-Workshop Reading for ELM
Extreme Learning Machine and Bioinformatics
Current methods used in Machine Learning
In the last few decades, extensive research has been carried out in developing the theory and applications of Artificial Neural Networks (ANN). ANNs possess an inherent structure suitable for mapping complex characteristics, learning and optimization and have emerged as a powerful tool for solving various practical problems, like pattern classification and recognition, medical imaging, speech recognition and control. Furthermore, from a practical perspective, the massive parallelism and fast adaptability of neural network implementations provide more incentives for further investigation into problems which involve complex mapping with uncertainties.
Many of the neural network architectures proposed have been found to be effective in solving a number of real-world problems. Normally the free parameters of the network are learnt from the given training samples using gradient descent algorithms. Gradient descent algorithms are relatively slow and have many issues related to their convergence. Hence, the learning process is computationally intensive and network training can take several hours or days. Although Support Vector Machine (SVM) algorithms have been widely used in classification problems, they too may require a large amount of training time.
In addition, one has to select proper learning parameters (learning rate and number of epochs/iterations) to avoid local minima. Long training times and the difficulty of selecting learning parameters led to the development of an alternative algorithm which can also be implemented in real time for sequential learning.
Extreme Learning Machine (ELM): An improved method for machine learning
Extreme Learning Machine (ELM), recently proposed by Huang et al. [2], generally works for any type of single-hidden-layer network, with good generalization capability and extremely fast learning speed. Unlike in conventional methods, in ELM the hidden nodes are randomly chosen and the output weights are analytically calculated. The ELM algorithm overcomes many issues encountered in traditional gradient-based algorithms, such as the choice of stopping criterion, learning rate and number of epochs, and the problem of local minima. ELM has been shown to outperform other existing neural network approaches on many real-world problems. The real-time learning capabilities and universal approximation abilities of the Extreme Learning Machine have been illustrated in the literature [1, 4].
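The two steps described above (random hidden nodes, analytically calculated output weights) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the workshop software: the function names, the sigmoid activation, the uniform range of the random parameters and the toy data are assumptions made for this example. The output weights are obtained as beta = pinv(H) T, where H is the hidden-layer output matrix and T the target matrix.

```python
import numpy as np


def elm_train(X, T, n_hidden, rng):
    """Fit a single-hidden-layer network the ELM way.

    X: (n_samples, n_features) inputs; T: (n_samples, n_outputs) targets.
    """
    # Hidden-node parameters are drawn at random and never tuned.
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))  # sigmoid hidden-layer outputs
    # Output weights come from one least-squares solve via the pseudoinverse.
    beta = np.linalg.pinv(H) @ T
    return W, b, beta


def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta


rng = np.random.default_rng(0)
# Toy two-class problem (synthetic, for illustration only).
X = np.vstack([rng.normal(-2.0, 0.5, size=(50, 2)),
               rng.normal(2.0, 0.5, size=(50, 2))])
labels = np.repeat([0, 1], 50)
T = np.eye(2)[labels]  # one-of-K target coding
W, b, beta = elm_train(X, T, n_hidden=20, rng=rng)
predicted = elm_predict(X, W, b, beta).argmax(axis=1)
train_accuracy = (predicted == labels).mean()
```

There is no learning rate, epoch count or stopping criterion to choose here; the only setting is the number of hidden nodes.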
ELM: a well-suited algorithm for solving problems related to Bioinformatics and Medical Informatics
The human genome project and other recent advances have spawned the generation of astronomical amounts of biological data. More and more information is being added to burgeoning biological databases around the world. Hence there is an urgent need to develop algorithms that can process the available information speedily. We need to infer the secrets of life stored in these data in order to provide better understanding and treatment of the diseases that plague humankind.
For instance, huge amounts of micro-array gene expression data are available where the data belong to several groups of classes and there is a need to classify groups of data into their respective classes. If we consider a multi-class human cancer detection problem using micro-array gene expression data, we need to classify data belonging to different types of cancers.
Another area of research is protein structure prediction for the purpose of discerning protein functionality. The accessible surface area of residues (RSA) in a protein, which come into contact with their surrounding liquid environment, plays a major role in predicting the structure of a protein, especially in the absence of closely related sequences with similar structure or functionality. The RSA property of a protein helps in re-engineering proteins by providing information on the selection of the best residues for site-directed mutagenesis.
Likewise, most biological problems can be cast into a classification framework and solved using machine learning techniques. A further issue with these data is sparseness, which is ubiquitous in biological data: very few samples are available for training and testing, and the number of samples is much smaller than the large feature set of each sample (the dimension problem). Using traditional machine learning techniques such as Support Vector Machines (SVM) on the large feature sets of these data leads to longer model development time and poor visualization of the characteristics of the data, resulting in poor classification accuracy.
Neural network algorithms and Extreme Learning Machine (ELM) techniques are well suited to model these problems, since they can be used in uncertain situations, when the relationships between entities are not yet clearly defined. ELM approaches have been proposed for the classification of micro-array gene expression data for cancer classification and for the prediction of the Relative Solvent Accessibility (RSA) of the amino acids present in proteins. These approaches address several of the issues related to the nature of these data mentioned above.
Extreme Learning Machine (ELM) and Online model development
In many real-world problems, as in medical diagnosis and bioinformatics, the available samples increase and change over a period of time, making model development an arduous task. For example, in cancer detection problems using micro-array gene expression data, patient sample data include parameters representing tumor markers in cells taken from different patients. The size and nature of this database will grow as the number of patients with cancer increases over time. In addition, parameters such as the tumor marker label may change after treatment. Adding new samples and changing the class labels and parameters of existing samples require retraining of existing models. The retraining process in turn requires large amounts of monetary and computational resources such as time, memory space, manpower and programming. These issues have created an avid interest among researchers in developing sequential machine learning approaches for classification problems, which initially require fewer resources. These learning machines are capable of building an initial model with the initially available data. Over a period of time, they can incrementally improve the learning model based on the intelligence provided by the newly arriving data. The online model of the Extreme Learning Machine [3] is well suited for these applications.
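The idea of building an initial model and then improving it incrementally can be sketched with a recursive least-squares update of the ELM output weights. This is a hedged sketch in the spirit of the on-line sequential algorithm of [3], not a reproduction of it: the function names, the sigmoid activation, the chunk size and the synthetic data are assumptions, and the small `ridge` term is added here only for numerical stability.

```python
import numpy as np


def hidden_layer(X, W, b):
    """Sigmoid outputs of the fixed, randomly generated hidden nodes."""
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))


def init_model(X0, T0, W, b, ridge=1e-2):
    """Build the initial model from the first batch of samples."""
    H0 = hidden_layer(X0, W, b)
    # `ridge` is an assumed stabiliser, not something the text specifies.
    P = np.linalg.inv(H0.T @ H0 + ridge * np.eye(H0.shape[1]))
    beta = P @ H0.T @ T0
    return P, beta


def update_model(P, beta, X_new, T_new, W, b):
    """Fold a newly arrived chunk into the model without revisiting old data."""
    H = hidden_layer(X_new, W, b)
    gain = np.linalg.inv(np.eye(H.shape[0]) + H @ P @ H.T)
    P = P - P @ H.T @ gain @ H @ P
    beta = beta + P @ H.T @ (T_new - H @ beta)
    return P, beta


rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))                 # synthetic samples
T = np.sin(X.sum(axis=1, keepdims=True))      # synthetic targets
W = rng.uniform(-1.0, 1.0, size=(3, 8))
b = rng.uniform(-1.0, 1.0, size=8)

P, beta = init_model(X[:50], T[:50], W, b)
for start in range(50, 200, 50):              # three later chunks
    chunk = slice(start, start + 50)
    P, beta = update_model(P, beta, X[chunk], T[chunk], W, b)

# Reference: one batch solve over all 200 samples with the same ridge term.
H_all = hidden_layer(X, W, b)
beta_batch = np.linalg.solve(H_all.T @ H_all + 1e-2 * np.eye(8), H_all.T @ T)
max_gap = np.abs(beta - beta_batch).max()
```

In this sketch the sequentially updated weights agree with the batch solution up to rounding error, so earlier samples do not need to be stored or reprocessed when new ones arrive.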
Essence of computation : Speed and Accuracy
Performance comparisons with Support Vector Machine (SVM) algorithms and other existing machine learning techniques on many benchmark data sets indicate that an ELM algorithm can process the data at a much faster rate, while significantly improving training results and giving the same or better testing results than those reported elsewhere in the literature. Considering the large amounts of data available for processing in Bioinformatics-related applications, using ELM algorithms would give a distinct advantage in future processing environments in terms of processing speed and performance.
References
[1] G.-B. Huang, L. Chen, and C.-K. Siew, “Universal Approximation Using Incremental Networks with Random Hidden Computation Nodes,” IEEE Transactions on Neural Networks, vol. 17, no. 4, pp. 879-892, 2006.
[2] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, “Extreme Learning Machine: Theory and Applications,” Neurocomputing (in press), 2006.
[3] N.-Y. Liang, G.-B. Huang, P. Saratchandran, and N. Sundararajan, “A Fast and Accurate On-line Sequential Learning Algorithm for Feedforward Networks,” IEEE Transactions on Neural Networks, vol. 17, no. 6, 2006.
[4] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, “Real-Time Learning Capability of Neural Networks,” IEEE Transactions on Neural Networks, vol. 17, no. 4, pp. 863-878, 2006.
Research Group:
- Prof. Sundararajan, EEE (67905027)
- Dr. G.-B. Huang, EEE
- S. Suresh, EEE
- Saraswathi.S., NTU.
- Contact: saraswathi55@gmail.com, 6791-3053
