Sparse gpuArray indexing limitations

Question

0 votes

When trying to extract a sub-matrix of a sparse gpuArray, I get,

>> S=gpuArray(sprand(100,100,0.1));
>> S(1:5,1:5)

Sparse gpuArray matrices only support referencing whole rows or whole columns.

OK, but here, I am indexing a block of complete rows and get the same error,

>> S(1:5,:)

Sparse gpuArray matrices only support referencing whole rows or whole columns.

Aside from this, what is the internal hurdle that makes subsrefing a sparse matrix so hard on the GPU? This has been a limitation for quite a while now.

3 Comments

Matt J on 5 Jul 2026 at 0:48

Yes, single row/column extraction works,

>> S=gpuArray.speye(3);
>> S(1,:)

ans =

1×3 sparse gpuArray double row vector (1 nonzero)

(1,1) 1

>> S(:,1)

ans =

3×1 sparse gpuArray double column vector (1 nonzero)

(1,1) 1

But I wonder why that would be the limit. If you can do one, surely you should be able to do more?

dpb on 5 Jul 2026 at 11:44

Edited: dpb on 5 Jul 2026 at 22:15

Theoretically, yes, it could be done, but doing so creates the complexity issue noted before, so the ability isn't directly supported. Reading the gpuArray doc on <Working With Sparse GPU Arrays> again and parsing the addressing discussion more carefully, the restriction appears to be documented by example, with the implication that the index is a single value, rather than by an explicit statement. "Sparse GPU arrays only support referencing whole rows or columns by index. For example ..." uses 5 in that example but doesn't directly state that "index" cannot be a colon expression. Like you apparently, @Matt J, I initially presumed the index could be generalized as with addressing in-memory arrays, sparse or not.

If this operation is needed, it would have to be implemented by extracting each row/column individually and concatenating them in host memory, since assigning values to sparse GPU arrays by index is not supported. The result would then have to be moved from host memory back to a GPU array.

It's not clear from the documentation whether the writing restrictions preclude something like

B=gpuArray(sparse([]));     % empty sparse GPU array????
for i=1:5
  B=[B;S(i,:)];
end

but I suspect they do/will.

Syntax issues aside, even if it were to work, constructing a new sparse matrix dynamically would require the GPU to rebuild the entire CSR/CSC data structure on every iteration. For toy-sized arrays this might be feasible, but presuming real cases are large enough to actually need sparse instead of dense arrays, the repeated sequential reconstruction would undoubtedly turn out to be prohibitively slow.
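
One way to avoid rebuilding the structure on every iteration would be to pull out the (row, column, value) triplets once, filter them, and build the block with a single sparse call. The following is only a sketch on an ordinary CPU sparse matrix standing in for S. Whether find and sparse accept a sparse gpuArray in the same way is an assumption here and would need checking against the release in use. The variable names are purely illustrative.

```matlab
% CPU stand-in for the sparse gpuArray S from the question.
S = sprand(100, 100, 0.1);
rows = 1:5;                                  % contiguous block of rows to extract

[r, c, v] = find(S);                         % all nonzeros as triplets
keep = (r >= rows(1)) & (r <= rows(end));    % nonzeros that fall inside the block
newRow = r(keep) - rows(1) + 1;              % renumber rows relative to the block

% Build the result in one call instead of growing it row by row.
B = sparse(newRow, c(keep), v(keep), numel(rows), size(S, 2));

disp(isequal(B, S(rows, :)))                 % compare against ordinary CPU indexing
```

The point of the sketch is that the compressed structure is assembled once, from already-filtered triplets, rather than once per appended row.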

It's hard not to expect all the basic indexing flexibility inherent in MATLAB memory access to be available, but the GPU is a completely different architecture, and the software implementation is so drastically different that virtually all the paradigms that are so ingrained just don't carry over effectively to that environment.

Answer 1

dpb on 3 Jul 2026 at 21:41

Edited: dpb on 5 Jul 2026 at 15:30

2 votes

EDIT 5Jul26 -- My initial conclusion below, that it is a bug for the documentation not to restrict the index to a single value rather than a general indexing expression, was in error given a later, more careful reading. The restriction is documented implicitly, by an example that uses a specific index value (5) and by the lack of any example with a multiple-valued index expression. A more explicit statement might be a useful doc enhancement, perhaps with a note in the Tips section on why this is simply not feasible.

--dpb

Technically it appears to be a bug, since the doc for gpuArray doesn't state any explicit restrictions on addressing by either full rows or columns, agreed. However, I suspect this has uncovered limitations on actual usage owing to the unique nature of how the GPU implements sparse storage. Details are lacking in the MATLAB docs, but I venture it's owing to how the GPU stores the sparse matrix: it uses Compressed Sparse Column (CSC) or Compressed Sparse Row (CSR) layouts, and since zeros are not stored, storage is represented by three arrays:

Values: The actual non-zero numbers.
Inner Indices: The exact row indices (for CSC) or column indices (for CSR).
Outer Pointers: A map specifying the physical offset index where each new column or row starts in the arrays above.

Because different rows/columns have a varying number of non-zero elements, the distance between elements is completely dynamic. There is no fixed "stride" to allow for direct addressing without the lookup.
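
To make the three arrays concrete, here is a small sketch that derives them for a made-up 4x4 matrix on the CPU. It uses 1-based positions to match MATLAB indexing and is only an illustration of the CSC idea, not a statement of how MATLAB or the GPU libraries lay the data out internally. The names vals, rowIdx and colPtr are hypothetical.

```matlab
% A small example matrix; column 2 is deliberately empty.
A = sparse([0 0 3 0;
            1 0 0 0;
            0 0 4 0;
            2 0 5 6]);

[rowIdx, colIdx, vals] = find(A);       % nonzeros, listed column by column
n = size(A, 2);

% Count nonzeros per column, then turn the counts into start positions.
% colPtr(j) is where column j starts; colPtr(n+1) is one past the last nonzero.
counts = accumarray(colIdx, 1, [n 1]);
colPtr = [1; 1 + cumsum(counts)];

disp(vals.')      % Values:         1 2 3 4 5 6
disp(rowIdx.')    % Inner Indices:  2 4 1 3 4 4
disp(colPtr.')    % Outer Pointers: 1 3 3 6 7
```

Note that columns 1, 3 and 4 hold 2, 3 and 1 entries respectively and column 2 holds none, so the gaps between consecutive Outer Pointers differ from column to column. That is the missing "stride".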

Consequently, pulling a sub-matrix (e.g., A(2:5, 2:5)) requires many parallel threads to fetch that data simultaneously. To get to the initial element A(2, 2), a GPU thread must:

Look up the Outer Pointer for column 2.
Read the start pointer for column 3.
Linear-search through every non-zero index inside that block just to see if row 2 exists.
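
The three steps above can be sketched in plain MATLAB for a single element, using hand-written CSC arrays for a hypothetical 4x4 matrix (1-based positions, names illustrative only). This is a serial illustration of the lookup, not GPU code.

```matlab
% CSC arrays (1-based) for a hypothetical 4x4 matrix with 6 nonzeros.
vals   = [1 2 3 4 5 6];     % Values
rowIdx = [2 4 1 3 4 4];     % Inner Indices (row of each value)
colPtr = [1 3 3 6 7];       % Outer Pointers (start of each column)

r = 3; c = 3;                      % element to look up
first = colPtr(c);                 % step 1: where column c starts
last  = colPtr(c + 1) - 1;         % step 2: start of the next column, minus one
value = 0;                         % result if (r,c) is not stored
for k = first:last                 % step 3: scan the row indices in that column
    if rowIdx(k) == r
        value = vals(k);
        break
    end
end
disp(value)                        % 4
```

The number of loop iterations depends on how many nonzeros the column happens to hold, which is why the cost of one lookup cannot be known in advance.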

Because the data are tightly packed, a thread handling row 3 cannot know where its data begins until the threads handling rows 0, 1, and 2 have resolved how many elements they contain. This creates a strictly sequential data dependency, and forcing a massively parallel architecture like a GPU to execute a sequential pointer-chasing search destroys performance, making arbitrary sub-array slicing structurally infeasible.

The reason MATLAB allows full rows or full columns (depending on if it is utilizing CSR or CSC under the hood) is because the starting and ending bounds of a full slice are explicitly defined by the Outer Pointers array. The GPU can read the single pointer for the requested column and instantly copy that entire contiguous chunk of memory into a new vector.
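
By contrast, a whole-column read needs no search at all. A sketch with the same kind of hand-written CSC arrays (hypothetical 4x4 matrix, 1-based positions, illustrative names) shows that two pointer reads bound one contiguous range:

```matlab
% CSC arrays (1-based) for a hypothetical 4x4 matrix with 6 nonzeros.
vals   = [1 2 3 4 5 6];
rowIdx = [2 4 1 3 4 4];
colPtr = [1 3 3 6 7];
m = 4;                                        % number of rows

c = 3;                                        % column to extract
span = colPtr(c):(colPtr(c + 1) - 1);         % contiguous range of stored entries

% Copy that range straight into a new m-by-1 sparse column.
col = sparse(rowIdx(span), ones(size(span)), vals(span), m, 1);
disp(full(col).')                             % 3 0 4 5
```

An empty column simply gives an empty span, so the same two reads cover that case as well.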

I don't know if the MATLAB implementation documents whether CSR or CSC is used or if it may depend upon the structure of the sparsity, perhaps. I could see that when creating a particular sparse GPU array, it must choose one or the other and it then could run into the indexing problem if one asks for the opposite slice direction.

I wonder if the example might be an illustration of that; if you try the other direction

S(:,1:5)

on the same array it would succeed?

In short, GPUs are good if you can load 'em up and then they can operate on the data in parallel; anything else is likely to be nothing but a bottleneck that totally defeats the purpose.

2 Comments

Matt J on 4 Jul 2026 at 15:39

It makes me wonder then, how sparse matrix multiplication A*B is done if submatrices are so difficult to extract. Seemingly, each thread multiplies a whole row of A by a whole column of B?

dpb on 4 Jul 2026 at 16:40

Edited: dpb on 5 Jul 2026 at 18:05

It uses the NVIDIA libraries, which have specialized code designed specifically to exploit the capabilities built into the NVIDIA hardware, along with algorithms that calculate partial results using outer products and merge them in shared memory. The architecture also uses internal structured formats that condense storage and align smaller sparse sections with the GPU's dense tensor core hardware. So, as I understand it, the sparse matrices eventually get mapped indirectly from sparse to dense(+) in a zillion little blocks. Then parallelism gives the performance.
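
For what the outer-product view means mathematically, here is a plain CPU sketch. It only illustrates the identity A*B = sum over k of A(:,k)*B(k,:), where each term uses one whole column of A and one whole row of B. It is not a description of what the NVIDIA libraries actually do, and the matrix sizes are arbitrary.

```matlab
A = sprand(6, 4, 0.5);
B = sprand(4, 5, 0.5);

C = sparse(size(A, 1), size(B, 2));    % all-zero sparse accumulator
for k = 1:size(A, 2)
    % Rank-1 partial result from column k of A and row k of B.
    C = C + A(:, k) * B(k, :);
end

disp(norm(full(C - A * B), 'fro'))     % difference from the built-in product
```

Each partial product needs only whole columns and whole rows, which are the slices the compressed formats can hand out cheaply, and the partial results are independent of each other until they are merged.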

<A paper that may provide more insight> than the summaries above, which I pulled and edited down significantly from the NVIDIA developer forum and from just asking Google for background info.

It's a real feat of engineering, indeed, inside there.

(+) I presume but did not find it stated explicitly without more extensive digging that only subarrays with a nonzero element are mapped, not reconstructing the entire array since the others wouldn't add anything to the outer product being calculated.

Answer 2

Ben Tordoff on 6 Jul 2026 at 8:38

1 vote

We have been working on improvements to sparse gpuArray indexing for R2026b. If you have access to the R2026b pre-release please give it a try and send us your feedback. It should address your use-case.

We do try to prioritise features or improvements that people are asking for. If you have other specific requests for improvements to our sparse gpuArray support please let us know: https://supportcases.mathworks.com/mwsupport/s/casecreation?c__caseParameter=Suggest_an_enhancement

7 Comments

dpb on 9 Jul 2026 at 15:10

Maybe it's in the other routines that the concentration should go, then? Have you been able to isolate the specific bottleneck(s)? Is this all sparse array routines or do you see same issues in dense storage routines, too?

Matt J on 9 Jul 2026 at 17:15

Edited: Matt J on 9 Jul 2026 at 17:21

@dpb Dense problems are probably hit even harder by the issue, see for example,

https://www.mathworks.com/matlabcentral/answers/2184297-gpuarray-support-for-optimization-toolbox-solvers?s_tid=srchtitle

I would be happy enough if gpuArray support were limited to dense data types if it is too hard to implement the necessary matrix operations for sparse gpuArrays. However, I have wondered if the reason TMW resists the enhancement is because it would be somehow awkward to partition the solver in such a way that only 3 out of the 4 combinations (sparse vs. full, GPU vs. CPU) are supported.
