Documentation

Analyzing Gene Expression Profiles

Overview of the Yeast Example

This example demonstrates a number of ways to look for patterns in gene expression profiles, using gene expression data from yeast shifting from fermentation to respiration.

The microarray data for this example is from DeRisi, J.L., Iyer, V.R., and Brown, P.O. (Oct 24, 1997). Exploring the metabolic and genetic control of gene expression on a genomic scale. Science, 278 (5338), 680–686. PMID: 9381177.

The authors used DNA microarrays to study temporal gene expression of almost all genes in Saccharomyces cerevisiae during the metabolic shift from fermentation to respiration. Expression levels were measured at seven time points during the diauxic shift. The full data set can be downloaded from the Gene Expression Omnibus Web site at:

http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE28

Exploring the Data Set

This procedure illustrates how to import data from the Web into the MATLAB® environment. The data for this procedure is available in the MAT-file yeastdata.mat. This file contains the VALUE data or LOG_RAT2N_MEAN, or log2 of ratio of CH2DN_MEAN and CH1DN_MEAN from the seven time steps in the experiment, the names of the genes, and an array of the times at which the expression levels were measured.

  1. Load data into the MATLAB environment.

    load yeastdata.mat
    
  2. Get the size of the data by typing

    numel(genes)
    

    The number of genes in the data set displays in the MATLAB Command Window. The MATLAB variable genes is a cell array of the gene names.

    ans =
            6400
    
  3. Access the entries using cell array indexing.

    genes{15}
    

    This displays the 15th row of the variable yeastvalues, which contains expression levels for the open reading frame (ORF) YAL054C.

    ans =
      YAL054C
    
  4. Use the function web to access information about this ORF in the Saccharomyces Genome Database (SGD).

    url = sprintf(...
            'http://genome-www4.stanford.edu/cgi-bin/SGD/...
             locus.pl?locus=%s',...
            genes{15});
    web(url);
  5. A simple plot can be used to show the expression profile for this ORF.

    plot(times, yeastvalues(15,:))
    xlabel('Time (Hours)');
    ylabel('Log2 Relative Expression Level');
    

    The MATLAB software plots the figure. The values are log2 ratios.

  6. Plot the actual values.

    plot(times, 2.^yeastvalues(15,:))
    xlabel('Time (Hours)');
    ylabel('Relative Expression Level');
    

    The MATLAB software plots the figure. The gene associated with this ORF, ACS1, appears to be strongly up-regulated during the diauxic shift.

  7. Compare other genes by plotting multiple lines on the same figure.

    hold on
    plot(times, 2.^yeastvalues(16:26,:)')
    xlabel('Time (Hours)');
    ylabel('Relative Expression Level');
    title('Profile Expression Levels');
    

    The MATLAB software plots the image.

Filtering Genes

This procedure illustrates how to filter the data by removing genes that are not expressed or do not change. The data set is quite large and a lot of the information corresponds to genes that do not show any interesting changes during the experiment. To make it easier to find the interesting genes, reduce the size of the data set by removing genes with expression profiles that do not show anything of interest. There are 6400 expression profiles. You can use a number of techniques to reduce the number of expression profiles to some subset that contains the most significant genes.

  1. If you look through the gene list you will see several spots marked as 'EMPTY'. These are empty spots on the array, and while they might have data associated with them, for the purposes of this example, you can consider these points to be noise. These points can be found using the strcmp function and removed from the data set with indexing commands.

    emptySpots = strcmp('EMPTY',genes);
    yeastvalues(emptySpots,:) = [];
    genes(emptySpots) = [];
    numel(genes)
    

    The MATLAB software displays:

    ans =
            6314
    

    In the yeastvalues data you will also see several places where the expression level is marked as NaN. This indicates that no data was collected for this spot at the particular time step. One approach to dealing with these missing values would be to impute them using the mean or median of data for the particular gene over time. This example uses a less rigorous approach of simply throwing away the data for any genes where one or more expression levels were not measured.

  2. Use the isnan function to identify the genes with missing data and then use indexing commands to remove the genes.

    nanIndices = any(isnan(yeastvalues),2);
    yeastvalues(nanIndices,:) = [];
    genes(nanIndices) = [];
    numel(genes)
    

    The MATLAB software displays:

    ans =
            6276
    

    If you were to plot the expression profiles of all the remaining profiles, you would see that most profiles are flat and not significantly different from the others. This flat data is obviously of use as it indicates that the genes associated with these profiles are not significantly affected by the diauxic shift. However, in this example, you are interested in the genes with large changes in expression accompanying the diauxic shift. You can use filtering functions in the toolbox to remove genes with various types of profiles that do not provide useful information about genes affected by the metabolic change.

  3. Use the function genevarfilter to filter out genes with small variance over time. The function returns a logical array of the same size as the variable genes with ones corresponding to rows of yeastvalues with variance greater than the 10th percentile and zeros corresponding to those below the threshold.

    mask = genevarfilter(yeastvalues);
    % Use the mask as an index into the values to remove the 
    % filtered genes.
    yeastvalues = yeastvalues(mask,:);
    genes = genes(mask);
    numel(genes)
    

    The MATLAB software displays:

    ans =
            5648
    
  4. The function genelowvalfilter removes genes that have very low absolute expression values. Note that the gene filter functions can also automatically calculate the filtered data and names.

    [mask, yeastvalues, genes] = genelowvalfilter(yeastvalues,genes,...
                                                  'absval',log2(4));
    numel(genes)
    

    The MATLAB software displays:

    ans =
       423
    
  5. Use the function geneentropyfilter to remove genes whose profiles have low entropy:

    [mask, yeastvalues, genes] = geneentropyfilter(yeastvalues,genes,...
                                                   'prctile',15);
    numel(genes)
    

    The MATLAB software displays:

    ans =  310
    

Clustering Genes

Now that you have a manageable list of genes, you can look for relationships between the profiles using some different clustering techniques from the Statistics Toolbox™ software.

  1. For hierarchical clustering, the function pdist calculates the pairwise distances between profiles, and the function linkage creates the hierarchical cluster tree.

    corrDist = pdist(yeastvalues, 'corr');
    clusterTree = linkage(corrDist, 'average');
    
  2. The function cluster calculates the clusters based on either a cutoff distance or a maximum number of clusters. In this case, the 'maxclust' option is used to identify 16 distinct clusters.

    clusters = cluster(clusterTree, 'maxclust', 16);
    
  3. The profiles of the genes in these clusters can be plotted together using a simple loop and the function subplot.

    figure
    for c = 1:16
        subplot(4,4,c);
        plot(times,yeastvalues((clusters == c),:)');
        axis tight
    end
    suptitle('Hierarchical Clustering of Profiles');
    

    The MATLAB software plots the images.

  4. The Statistics Toolbox software also has a K-means clustering function. Again, 16 clusters are found, but because the algorithm is different these are not necessarily the same clusters as those found by hierarchical clustering.

    [cidx, ctrs] = kmeans(yeastvalues, 16,... 
                          'dist','corr',...
                          'rep',5,...
                          'disp','final');
    figure
    for c = 1:16
        subplot(4,4,c);
        plot(times,yeastvalues((cidx == c),:)');
        axis tight
    end
    suptitle('K-Means Clustering of Profiles');
    

    The MATLAB software displays:

    13 iterations, total sum of distances = 11.4042
    14 iterations, total sum of distances = 8.62674
    26 iterations, total sum of distances = 8.86066
    22 iterations, total sum of distances = 9.77676
    26 iterations, total sum of distances = 9.01035
    

  5. Instead of plotting all of the profiles, you can plot just the centroids.

    figure
    for c = 1:16
        subplot(4,4,c);
        plot(times,ctrs(c,:)');
        axis tight
        axis off    % turn off the axis
    end
    suptitle('K-Means Clustering of Profiles');
    

    The MATLAB software plots the figure:

  6. You can use the function clustergram to create a heat map and dendrogram from the output of the hierarchical clustering.

    figure
    clustergram(yeastvalues(:,2:end),'RowLabels',genes,...
                                     'ColumnLabels',times(2:end))
    

    The MATLAB software plots the figure:

Principal Component Analysis

Principal-component analysis (PCA) is a useful technique you can use to reduce the dimensionality of large data sets, such as those from microarray analysis. You can also use PCA to find signals in noisy data.

  1. Use the pca function in the Statistics Toolbox software to calculate the principal components of a data set.

    [pc, zscores, pcvars] = pca(yeastvalues)
    

    The MATLAB software displays:

    pc =
    
      Columns 1 through 4 
    
       -0.0245   -0.3033   -0.1710   -0.2831
        0.0186   -0.5309   -0.3843   -0.5419
        0.0713   -0.1970    0.2493    0.4042
        0.2254   -0.2941    0.1667    0.1705
        0.2950   -0.6422    0.1415    0.3358
        0.6596    0.1788    0.5155   -0.5032
        0.6490    0.2377   -0.6689    0.2601
    
      Columns 5 through 7 
    
       -0.1155    0.4034    0.7887
       -0.2384   -0.2903   -0.3679
       -0.7452   -0.3657    0.2035
       -0.2385    0.7520   -0.4283
        0.5592   -0.2110    0.1032
       -0.0194   -0.0961    0.0667
       -0.0673   -0.0039    0.0521
    
  2. You can use the function cumsum to see the cumulative sum of the variances.

    cumsum(pcvars./sum(pcvars) * 100)
    

    The MATLAB software displays:

    ans =
       78.3719
       89.2140
       93.4357
       96.0831
       98.3283
       99.3203
      100.0000
    

    This shows that almost 90% of the variance is accounted for by the first two principal components.

  3. A scatter plot of the scores of the first two principal components shows that there are two distinct regions. This is not unexpected, because the filtering process removed many of the genes with low variance or low information. These genes would have appeared in the middle of the scatter plot.

    figure
    scatter(zscores(:,1),zscores(:,2));
    xlabel('First Principal Component');
    ylabel('Second Principal Component');
    title('Principal Component Scatter Plot');
    

    The MATLAB software plots the figure:

  4. The gname function from the Statistics Toolbox software can be used to identify genes on a scatter plot. You can select as many points as you like on the scatter plot.

    gname(genes);
    

    When you have finished selecting points, press Enter.

  5. An alternative way to create a scatter plot is with the gscatter function from the Statistics Toolbox software. gscatter creates a grouped scatter plot where points from each group have a different color or marker. You can use clusterdata, or any other clustering function, to group the points.

    figure
    pcclusters = clusterdata(zscores(:,1:2),6);
    gscatter(zscores(:,1),zscores(:,2),pcclusters)
    xlabel('First Principal Component');
    ylabel('Second Principal Component');
    title('Principal Component Scatter Plot with Colored Clusters');
    gname(genes)  % Press enter when you finish selecting genes.
    

    The MATLAB software plots the figure:

Was this topic helpful?