Technically it appears to be a bug based on the doc at gpuArray doesn't have any explicit restrictions on addressing by either full rows or columns, agreed. However, I suspect this has uncovered limitations on actual useage owing to the unique nature of how the GPU implements sparse storage. Details are lacking in the MATLAB doc's, but I venture it's owing to how the GPU stores the sparse matrix; it uses Compressed Sparse Column (CSC) or Compressed Sparse Row (CSR) layouts and since zeros are not stored, storage is represented by three arrays: - Values: The actual non-zero numbers.
- Inner Indices: The exact row indices (for CSC) or column indices (for CSR).
- Outer Pointers: A map specifying the physical offset index where each new column or row starts in the arrays above.
Because different rows/columns have a varying number of non-zero elements, the distance between elements is completely dynamic. there is no fixed "stride" to allow for direct addressing without the lookup.
Consequently, trying to pull a sub-matrix (e.g., A(2:5, 2:5)), requires thousands of parallel threads to fetch that data simultaneously. To get to the initial element A(2, 2) a GPU thread must:
- Look up the Outer Pointer for column 2.
- Read the start pointer for column 3.
- Linear-search through every non-zero index inside that block just to see if row 2 exists.
Because the data are tightly packed, a thread handling row 3 cannot know where its data begins until the threads handling rows 0, 1, and 2 have resolved how many elements they contain. This creates a strictly sequential data dependency and forcing a massively parallel architecture like a GPU to execute a sequential pointer-chasing search destroys performance, making arbitrary sub-array slicing structurally infeasible.
The reason MATLAB allows full rows or full columns (depending on if it is utilizing CSR or CSC under the hood) is because the starting and ending bounds of a full slice are explicitly defined by the Outer Pointers array. The GPU can read the single pointer for the requested column and instantly copy that entire contiguous chunk of memory into a new vector.
I don't know if the MATLAB implementation documents whether CSR or CSC is used or if it may depend upon the structure of the sparsity, perhaps. I could see that when creating a particular sparse GPU array, it must choose one or the other and it then could run into the indexing problem if one asks for the opposite slice direction.
I wonder if the example might be an illustration of that; if you try the other direction
on the same array it would succeed?
In short, GPUs are good if you can load 'em up and then they can operate on the data in parallel; anything else is likely to be nothing but a bottleneck that totally defeats the purpose.