Tall Arrays for Out-of-Memory Data

Tall arrays are used to work with out-of-memory data that is backed by a datastore. Datastores enable you to work with large data sets in small blocks that individually fit in memory, instead of loading the entire data set into memory at once. Tall arrays extend this capability to enable you to work with out-of-memory data using common functions.

What is a Tall Array?

Since the data is not loaded into memory all at once, tall arrays can be arbitrarily large in the first dimension (that is, they can have any number of rows). Instead of writing special code that takes into account the huge size of the data, such as with techniques like MapReduce, tall arrays let you work with large data sets in an intuitive manner that is similar to the way you would work with in-memory MATLAB® arrays. Many core operators and functions work the same with tall arrays as they do with in-memory arrays. MATLAB works with small blocks of the data at a time, handling all of the data chunking and processing in the background, so that common expressions, such as A+B, work with big data sets.
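As a minimal sketch of that last point, the following code applies ordinary arithmetic to tall arrays. It assumes small in-memory data converted with tall purely for illustration (real use would start from a datastore), and the variable names A, B, C, and c are hypothetical.

```matlab
% A small in-memory column stands in for out-of-memory data (illustrative only).
A = tall((1:5)');

% Derive a second tall array from the first so the two are compatible.
B = 10*A;

% The expression is written exactly as it would be for in-memory arrays.
% Evaluation is deferred at this point.
C = A + B;

% gather performs the calculation and returns an in-memory array.
c = gather(C);
```

The same expression would apply unchanged if A came from tall(ds) on a large datastore; only the time taken by gather would differ.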

Benefits of Tall Arrays

Unlike in-memory arrays, tall arrays typically remain unevaluated until you request that the calculations be performed using the gather function. This lazy evaluation allows you to work quickly with large data sets. When you eventually request output using gather, MATLAB combines the queued calculations where possible and takes the minimum number of passes through the data. The number of passes through the data greatly affects execution time, so it is recommended that you request output only when necessary.

Note

Since gather returns results as in-memory MATLAB arrays, standard memory considerations apply. MATLAB might run out of memory if the result returned by gather is too large.

Creating Tall Tables

Tall tables are like in-memory MATLAB tables, except that they can have any number of rows. To create a tall table from a large data set, you first need to create a datastore for the data. If the datastore ds contains tabular data, then tall(ds) returns a tall table or tall timetable containing the data. See Datastore for more information about creating datastores.

Create a tabular text datastore that points to a tabular file of airline flight data. For folders that contain a collection of files, you can specify the entire folder location, or use a wildcard pattern such as '*.csv' to include multiple files with the same file extension in the datastore. Clean the data by treating 'NA' values as missing data so that tabularTextDatastore replaces them with NaN values. Also, set the format of a few text variables to %s so that tabularTextDatastore reads them as cell arrays of character vectors.

ds = tabularTextDatastore('airlinesmall.csv');
ds.TreatAsMissing = 'NA';
ds.SelectedFormats{strcmp(ds.SelectedVariableNames,'TailNum')} = '%s';
ds.SelectedFormats{strcmp(ds.SelectedVariableNames,'CancellationCode')} = '%s';
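The wildcard form mentioned above can be sketched as follows. The folder, the file names part1.csv and part2.csv, and the variables Id and Value are all hypothetical; the files are created in a temporary folder only so that the example is self-contained.

```matlab
% Create a temporary folder with two small CSV files (hypothetical sample data).
folder = tempname;
mkdir(folder);
writetable(table([1;2],[10;20],'VariableNames',{'Id','Value'}), ...
    fullfile(folder,'part1.csv'));
writetable(table([3;4],[30;40],'VariableNames',{'Id','Value'}), ...
    fullfile(folder,'part2.csv'));

% The '*.csv' pattern includes every CSV file in the folder.
dsMulti = tabularTextDatastore(fullfile(folder,'*.csv'));
numFiles = numel(dsMulti.Files);

% readall is reasonable here only because the sample files are tiny.
allRows = readall(dsMulti);
```

With a real data set you would pass dsMulti to tall rather than calling readall, because readall loads everything into memory.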

Create a tall table from the datastore. When you perform calculations on this tall table, the underlying datastore reads blocks of data and passes them to the tall table to process. Neither the datastore nor the tall table retains any of the underlying data.

tt = tall(ds)

tt =

  M×29 tall table 

    Year    Month    DayofMonth    DayOfWeek    DepTime    CRSDepTime    ArrTime    CRSArrTime    UniqueCarrier    FlightNum    TailNum    ActualElapsedTime    CRSElapsedTime    AirTime    ArrDelay    DepDelay    Origin    Dest     Distance    TaxiIn    TaxiOut    Cancelled    CancellationCode    Diverted    CarrierDelay    WeatherDelay    NASDelay    SecurityDelay    LateAircraftDelay
    ____    _____    __________    _________    _______    __________    _______    __________    _____________    _________    _______    _________________    ______________    _______    ________    ________    ______    _____    ________    ______    _______    _________    ________________    ________    ____________    ____________    ________    _____________    _________________

    1987    10       21            3             642        630           735        727          'PS'             1503         'NA'        53                   57               NaN         8          12          'LAX'     'SJC'    308         NaN       NaN        0            'NA'                0           NaN             NaN             NaN         NaN              NaN              
    1987    10       26            1            1021       1020          1124       1116          'PS'             1550         'NA'        63                   56               NaN         8           1          'SJC'     'BUR'    296         NaN       NaN        0            'NA'                0           NaN             NaN             NaN         NaN              NaN              
    1987    10       23            5            2055       2035          2218       2157          'PS'             1589         'NA'        83                   82               NaN        21          20          'SAN'     'SMF'    480         NaN       NaN        0            'NA'                0           NaN             NaN             NaN         NaN              NaN              
    1987    10       23            5            1332       1320          1431       1418          'PS'             1655         'NA'        59                   58               NaN        13          12          'BUR'     'SJC'    296         NaN       NaN        0            'NA'                0           NaN             NaN             NaN         NaN              NaN              
    1987    10       22            4             629        630           746        742          'PS'             1702         'NA'        77                   72               NaN         4          -1          'SMF'     'LAX'    373         NaN       NaN        0            'NA'                0           NaN             NaN             NaN         NaN              NaN              
    1987    10       28            3            1446       1343          1547       1448          'PS'             1729         'NA'        61                   65               NaN        59          63          'LAX'     'SJC'    308         NaN       NaN        0            'NA'                0           NaN             NaN             NaN         NaN              NaN              
    1987    10        8            4             928        930          1052       1049          'PS'             1763         'NA'        84                   79               NaN         3          -2          'SAN'     'SFO'    447         NaN       NaN        0            'NA'                0           NaN             NaN             NaN         NaN              NaN              
    1987    10       10            6             859        900          1134       1123          'PS'             1800         'NA'       155                  143               NaN        11          -1          'SEA'     'LAX'    954         NaN       NaN        0            'NA'                0           NaN             NaN             NaN         NaN              NaN              
    :       :        :             :            :          :             :          :             :                :            :          :                    :                 :          :           :           :         :        :           :         :          :            :                   :           :               :               :           :                :
    :       :        :             :            :          :             :          :             :                :            :          :                    :                 :          :           :           :         :        :           :         :          :            :                   :           :               :               :           :                :

The display indicates that the number of rows, M, is currently unknown. MATLAB displays some of the rows, and the vertical ellipses : indicate that more rows exist in the tall table that are not currently being displayed.

Creating Tall Timetables

If the data you are working with has a time associated with each row of data, then you can use a tall timetable to work on the data. For information on creating tall timetables, see Extended Capabilities (timetable).

In this case, the tall table tt has times associated with each row, but they are broken down into several table variables such as Year, Month, DayofMonth, and so on. Combine all of these pieces of datetime information into a single new tall datetime variable Dates, which is based on the departure times DepTime. Then, create a tall timetable using Dates as the row times. Since Dates is the only datetime variable in the table, the table2timetable function automatically uses it for the row times.

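% Split DepTime (stored as HHMM) into hours and minutes.
hrs = (tt.DepTime - mod(tt.DepTime,100))/100;
mins = mod(tt.DepTime,100);
% Combine the date and time pieces into a single tall datetime variable.
tt.Dates = datetime(tt.Year, tt.Month, tt.DayofMonth, hrs, mins, 0);
% Remove the first eight variables (Year through CRSArrTime).
tt(:,1:8) = [];
% Dates is the only datetime variable, so it becomes the row times.
TT = table2timetable(tt)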

TT =

  M×21 tall timetable

            Dates           UniqueCarrier    FlightNum    TailNum    ActualElapsedTime    CRSElapsedTime    AirTime    ArrDelay    DepDelay    Origin    Dest     Distance    TaxiIn    TaxiOut    Cancelled    CancellationCode    Diverted    CarrierDelay    WeatherDelay    NASDelay    SecurityDelay    LateAircraftDelay
    ____________________    _____________    _________    _______    _________________    ______________    _______    ________    ________    ______    _____    ________    ______    _______    _________    ________________    ________    ____________    ____________    ________    _____________    _________________

    21-Oct-1987 06:42:00    'PS'             1503         'NA'        53                   57               NaN         8          12          'LAX'     'SJC'    308         NaN       NaN        0            'NA'                0           NaN             NaN             NaN         NaN              NaN              
    26-Oct-1987 10:21:00    'PS'             1550         'NA'        63                   56               NaN         8           1          'SJC'     'BUR'    296         NaN       NaN        0            'NA'                0           NaN             NaN             NaN         NaN              NaN              
    23-Oct-1987 20:55:00    'PS'             1589         'NA'        83                   82               NaN        21          20          'SAN'     'SMF'    480         NaN       NaN        0            'NA'                0           NaN             NaN             NaN         NaN              NaN              
    23-Oct-1987 13:32:00    'PS'             1655         'NA'        59                   58               NaN        13          12          'BUR'     'SJC'    296         NaN       NaN        0            'NA'                0           NaN             NaN             NaN         NaN              NaN              
    22-Oct-1987 06:29:00    'PS'             1702         'NA'        77                   72               NaN         4          -1          'SMF'     'LAX'    373         NaN       NaN        0            'NA'                0           NaN             NaN             NaN         NaN              NaN              
    28-Oct-1987 14:46:00    'PS'             1729         'NA'        61                   65               NaN        59          63          'LAX'     'SJC'    308         NaN       NaN        0            'NA'                0           NaN             NaN             NaN         NaN              NaN              
    08-Oct-1987 09:28:00    'PS'             1763         'NA'        84                   79               NaN         3          -2          'SAN'     'SFO'    447         NaN       NaN        0            'NA'                0           NaN             NaN             NaN         NaN              NaN              
    10-Oct-1987 08:59:00    'PS'             1800         'NA'       155                  143               NaN        11          -1          'SEA'     'LAX'    954         NaN       NaN        0            'NA'                0           NaN             NaN             NaN         NaN              NaN              
    :                       :                :            :          :                    :                 :          :           :           :         :        :           :         :          :            :                   :           :               :               :           :                :
    :                       :                :            :          :                    :                 :          :           :           :         :        :           :         :          :            :                   :           :               :               :           :                :

Creating Tall Arrays

When you extract a variable from a tall table or tall timetable, the result is a tall array of the appropriate underlying data type. A tall array can be a numeric, logical, datetime, duration, calendar duration, categorical, string, or cell array. Also, you can convert an in-memory array A into a tall array with tA = tall(A). The in-memory array A must have one of the supported data types.

Extract the arrival delay ArrDelay from the tall timetable TT. This creates a new tall array variable with underlying data type double.

a = TT.ArrDelay

a =

  M×1 tall double column vector

     8
     8
    21
    13
     4
    59
     3
    11
    :
    :

The classUnderlying and isaUnderlying functions are useful to determine the underlying data type of a tall array.
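A minimal sketch of both functions follows, assuming a small tall array built from in-memory data for illustration. The variable names are hypothetical, and each result is passed through gather in case the query itself is returned as an unevaluated tall value.

```matlab
% A small tall array with underlying type double (illustrative data).
tA = tall([1.5; 2.5; 3.5]);

% Name of the underlying data type.
cls = gather(classUnderlying(tA));

% Test the underlying type against a class name or category, as with isa.
tfNumeric = gather(isaUnderlying(tA,'numeric'));
tfChar = gather(isaUnderlying(tA,'char'));
```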

Lazy Evaluation

One important aspect of tall arrays is that as you work with them, most operations are not performed immediately. These operations appear to execute quickly, because the actual computation is deferred until you specifically request that the calculations be performed. You can trigger evaluation of a tall array with either the gather function (to bring the result into memory) or the write function (to write the result to disk). This lazy evaluation is important because even a simple command like size(X) executed on a tall array with a billion rows is not a quick calculation.

As you work with tall arrays, MATLAB keeps track of all of the operations to be carried out. This information is then used to optimize the number of passes through the data that will be required when you request output with the gather function. Thus, it is normal to work with unevaluated tall arrays and request output only when you require it. For more information, see Lazy Evaluation of Tall Arrays.

Calculate the mean and standard deviation of the arrival delay. Use these values to construct the upper and lower thresholds for delays that are within one standard deviation of the mean. Notice that the result of each operation indicates that the array has not been calculated yet.

m = mean(a,'omitnan')

m =

  tall double

    ?

Preview deferred. Learn more.

s = std(a,'omitnan')

s =

  tall

    ?

Preview deferred. Learn more.

one_sigma_bounds = [m-s m m+s]

one_sigma_bounds =

  M×N×... tall array

    ?    ?    ?    ...
    ?    ?    ?    ...
    ?    ?    ?    ...
    :    :    :
    :    :    :

Preview deferred. Learn more.

Evaluation with gather

The benefit of lazy evaluation is that when the time comes for MATLAB to perform the calculations, it is often possible to combine the operations in such a way that the number of passes through the data is minimized. So even if you perform many operations, MATLAB only makes extra passes through the data when absolutely necessary.

The gather function forces evaluation of all queued operations and brings the resulting output into memory. For this reason, you can think of gather as a bridge between tall arrays and in-memory arrays. For example, you cannot control if statements or while loops using a tall logical array, but once the array is evaluated with gather it becomes an in-memory logical array that you can use in these contexts.
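The bridging role of gather can be sketched as follows. The data and the names anyNegative, hasNegative, and msg are hypothetical and chosen only for illustration.

```matlab
% A small tall array built from in-memory data (illustrative only).
tA = tall([3; -1; 4; -2]);

% The result is a tall logical that has not been evaluated yet,
% so it cannot be used directly as an if condition.
anyNegative = any(tA < 0);

% gather returns an in-memory logical scalar.
hasNegative = gather(anyNegative);

% The in-memory value can now control program flow.
if hasNegative
    msg = 'negative values found';
else
    msg = 'all values nonnegative';
end
```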

Since gather returns the entire result in MATLAB, you should make sure that the result will fit in memory.

Use gather to calculate one_sigma_bounds and bring the result into memory. In this case, one_sigma_bounds requires several operations to calculate, but MATLAB combines the operations into one pass through the data. Since the data in this example is small, gather executes quickly. However, the elimination of passes through the data becomes more valuable as the size of your data increases.

sig1 = gather(one_sigma_bounds)

Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: Completed in 1.5 sec
Evaluation completed in 1.8 sec

sig1 =

  -23.4572    7.1201   37.6975

You can specify multiple inputs and outputs to gather if you want to evaluate several tall arrays at once. This technique is faster than calling gather multiple times. For example, calculate the minimum and maximum arrival delay. Computed separately, each value requires a pass through the data to calculate for a total of two passes. However, computing both values simultaneously requires only one pass through the data.

[max_delay, min_delay] = gather(max(a),min(a))

Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: Completed in 1.1 sec
Evaluation completed in 1.1 sec

max_delay =

        1014


min_delay =

   -64

These results indicate that, on average, flights arrive about 7 minutes late. However, an arrival anywhere from about 23 minutes early to about 37 minutes late is within one standard deviation of the mean. The earliest arrival in the data set was about an hour ahead of schedule, and the latest was delayed by 1014 minutes, or nearly 17 hours.

Saving, Loading, and Checkpointing Tall Arrays

The save function saves the state of a tall array, but does not copy any of the data. The resulting .mat file is typically small. However, the original data files must be available in the same location in order to subsequently use load.

The write function makes a copy of the data and saves the copy as a collection of files, which can consume a large amount of disk space. write executes all pending operations on the tall array to calculate the values prior to writing. Once write copies the data, it is independent of the original raw data. Therefore, you can recreate the tall array from the written files even if the original raw data is no longer available.

You can recreate the tall array from the written files by creating a new datastore that points to the location where the files were written. This functionality enables you to create checkpoints or snapshots of tall array data. Creating a checkpoint is a good way to save the results of preprocessing your data, so that the data is in a form that is more efficient to load.

If you have a tall array TA, then you can write it to the folder specified by location with the command:

write(location,TA);

Later, to reconstruct TA from the written files, use the commands:

ds = datastore(location);
TA = tall(ds);

Additionally, you can use the write function to trigger evaluation of a tall array and write the results to disk. This use of write is similar to gather; however, write does not bring any results into memory.
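The checkpoint workflow described in this section can be sketched end to end as follows. The example assumes a small tall array built from in-memory data and uses tempname to pick a hypothetical folder that does not exist yet; the names dsCheckpoint, tB, and total are illustrative.

```matlab
% In-memory data stands in for a preprocessed tall array (illustrative only).
tA = tall((1:100)');
tA = 2*tA;

% Choose a unique folder name that does not exist yet (hypothetical location).
location = tempname;

% write evaluates the pending operations and saves the result to disk.
write(location,tA);

% Rebuild a tall array from the checkpoint files.
dsCheckpoint = datastore(location);
tB = tall(dsCheckpoint);

% Operate on the restored array without needing the original data.
total = gather(sum(tB));
```

Because the written files are a copy of the evaluated data, later sessions can start from tB without repeating the preprocessing steps.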

Supporting Functions

Most core functions work the same way with tall arrays as they do with in-memory arrays. However, in some cases the way that a function works with tall arrays is special or has limitations. You can tell whether a function supports tall arrays, and if it has any limitations, by looking at the bottom of the reference page for the function in the Extended Capabilities section (for an example, see filloutliers).

For a filtered list of all MATLAB functions that support tall arrays, see Function List (Tall Arrays).

Tall arrays also are supported by several toolboxes, enabling you to do things like write machine learning algorithms, deploy standalone apps, and run calculations in parallel or on a cluster. For more information, see Extend Tall Arrays with Other Products.