Researchers at MIT are using MathWorks tools to advance bioinformatics and proteomics. MIT students are using the same tools to gain hands-on experience in these fields.
In the Lab
Alterovitz and his research group used MATLAB® to develop algorithms for analyzing the mass spectrometry (MS) data and to model the protein interactivity network, which consisted of more than 20,000 nodes and 100,000 edges. Each network node represented a mass associated with a protein, and each edge represented an interaction between nodes.
The researchers also used MATLAB to visualize data, plot results, and access databases shared with other biomedical researchers.
Because MS data resembles the series of peaks and valleys in sound or voice data, researchers can apply signal processing techniques to process the data. MIT researchers used Signal Processing Toolbox™ to process this MS data and applied filters to eliminate noise and irrelevant data, enabling them to concentrate on a more manageable data set.
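The filtering idea can be illustrated with a minimal sketch. It is written in Python rather than MATLAB and is not the team's code. It assumes a spectrum is a plain list of intensities, smooths it with a centered moving average, and keeps only local maxima above a threshold. The function names, window size, threshold, and sample intensities are all hypothetical.

```python
def moving_average(signal, window=3):
    """Smooth a list of intensities with a centered moving average."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        # Shrink the window at the edges instead of padding with zeros.
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        smoothed.append(sum(signal[lo:hi]) / (hi - lo))
    return smoothed


def find_peaks(signal, threshold):
    """Return indices of local maxima whose value exceeds the threshold."""
    peaks = []
    for i in range(1, len(signal) - 1):
        is_local_max = signal[i] > signal[i - 1] and signal[i] >= signal[i + 1]
        if signal[i] > threshold and is_local_max:
            peaks.append(i)
    return peaks


# Hypothetical intensities: low-level noise around one real peak.
spectrum = [0, 1, 0, 1, 9, 12, 9, 1, 0, 1, 0]
smoothed = moving_average(spectrum, window=3)
peaks = find_peaks(smoothed, threshold=5.0)
print(peaks)
```

Smoothing first means the small noise bumps at either end fall below the threshold, so only the main peak survives. A production pipeline would likely use purpose-built filters rather than a simple moving average.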
Bioinformatics Toolbox™ enabled the team to quickly obtain information about proteins from a variety of Internet resources. The team used Bioinformatics Toolbox to calculate molecular weights, obtain amino acid sequences and other properties of specific proteins, and download and parse information into data structures accessible from MATLAB.
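To make the molecular weight calculation concrete, here is a small Python sketch of one common approach: sum the average residue masses of the amino acids in a sequence and add one water molecule for the free termini. This illustrates the general calculation and is not a description of how Bioinformatics Toolbox implements it. The function name is hypothetical, and the masses are approximate, commonly tabulated average values in daltons.

```python
# Approximate average residue masses in daltons (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # one water accounts for the free N- and C-termini


def molecular_weight(sequence):
    """Estimate the average molecular weight of a peptide, in daltons."""
    total = WATER_MASS
    for residue in sequence.upper():
        if residue not in RESIDUE_MASS:
            raise ValueError(f"unknown amino acid code: {residue!r}")
        total += RESIDUE_MASS[residue]
    return total


# Example: the dipeptide glycine-alanine.
print(round(molecular_weight("GA"), 3))
```

The sketch rejects ambiguous or nonstandard residue codes instead of guessing a mass for them.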
MIT researchers used Statistics and Machine Learning Toolbox™ to calculate network properties, including connectivity and power law distributions. They also modeled the number of proteins in a sample, using Statistics and Machine Learning Toolbox to simplify curve fitting and to generate negative binomial, gamma, and exponential distributions.
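As an illustration of what computing connectivity and a power law distribution involves, the following Python sketch derives a degree distribution from an edge list and estimates a power law exponent with a least-squares fit on log-log axes. This is an assumption-laden sketch and not the group's method: the function names and the toy network are hypothetical, and a log-log regression is only a rough heuristic compared with maximum-likelihood fitting.

```python
import math
from collections import Counter


def degree_distribution(edges):
    """Map each degree k to the fraction of nodes that have that degree."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    counts = Counter(degree.values())
    n_nodes = len(degree)
    return {k: c / n_nodes for k, c in sorted(counts.items())}


def fit_power_law_exponent(distribution):
    """Estimate gamma in P(k) ~ k**(-gamma) by least squares in log-log space."""
    if len(distribution) < 2:
        raise ValueError("need at least two distinct degrees to fit a slope")
    xs = [math.log(k) for k in distribution]
    ys = [math.log(p) for p in distribution.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx  # negate the slope so gamma is positive for a decaying tail


# Toy network: eight nodes of degree 1 and two nodes of degree 2.
edges = [("a", "b"), ("b", "c"), ("d", "e"), ("e", "f"), ("g", "h"), ("i", "j")]
dist = degree_distribution(edges)
gamma = fit_power_law_exponent(dist)
print(dist, gamma)
```

On a network of the size described above, the same counting logic applies, but the fit would cover many more degree values and would need care in the sparse tail.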
The group’s research involved millions of MS data points from hundreds of patients. However, because each patient’s data was independent, the task of processing the information was ideal for parallelization. Using Parallel Computing Toolbox™ and MATLAB Parallel Server™, the group executed their MATLAB algorithms concurrently on a large cluster of computers.
The group analyzed each patient’s MS data independently on a different processor. Alterovitz explains, "In addition to significantly reducing computation time, Parallel Computing Toolbox enabled us to program this approach quickly. Instead of learning distributed programming, we used our existing MATLAB code, and made it parallel using Parallel Computing Toolbox."
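The pattern described here, one independent analysis per patient mapped across a pool of workers, can be sketched in a few lines of Python. This is an analogy to the approach and not the group's MATLAB code. The cohort data, function names, and summary statistics are hypothetical. The sketch uses a thread pool to stay portable. For CPU-bound analysis, Python's `ProcessPoolExecutor` offers the same `map` interface with true process-level parallelism.

```python
from concurrent.futures import ThreadPoolExecutor


def summarize_patient(item):
    """Analyze one patient's intensities independently of all other patients."""
    patient_id, intensities = item
    summary = {
        "n_points": len(intensities),
        "max_intensity": max(intensities),
        "mean_intensity": sum(intensities) / len(intensities),
    }
    return patient_id, summary


def analyze_cohort(cohort, max_workers=4):
    """Run the per-patient analysis across a pool of workers."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order and hands each patient to a worker.
        return dict(pool.map(summarize_patient, cohort.items()))


# Hypothetical cohort: patient ID mapped to a short list of MS intensities.
cohort = {
    "patient_a": [1.0, 4.0, 2.0],
    "patient_b": [3.0, 3.0],
}
results = analyze_cohort(cohort)
print(results)
```

Because `summarize_patient` reads only its own input and shares no state, the serial function can be reused unchanged in the parallel version, which mirrors the point Alterovitz makes about keeping existing code.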
The team also used a distributed approach to speed the calculation of network properties and statistics by dividing the network into chunks and running the tasks in parallel.
In the Classroom
For the bioinformatics and proteomics course, Alterovitz and his fellow course instructors chose MATLAB for its ease of use, interoperability with other tools, and ability to present concepts at increasing levels of abstraction.
"About 90% of the class had already used MATLAB," says Alterovitz. "Everyone began using MATLAB immediately—even those with no prior experience—because you do not need to know how to program in order to use it."
In addition, MATLAB provided the students with an easy way to access and learn from leading research conducted at MIT and Harvard.
The course’s teaching approach was based on elaboration theory. It involved using a limited set of concepts and examples, and gradually adding complexity. Alterovitz explains, "MATLAB intrinsically supports different levels of complexity, through various levels of abstraction. In the beginning, students run the code and visualize results. Later, they can explore, update, and even integrate the code with other programming languages to add more detail."
The coursework also mirrored this approach across biological levels. The students first used MathWorks tools to analyze fundamental DNA sequence information. They then progressed to more complex expression data, proteins, and eventually interactions between proteins and other molecules using a network model.