Why does an intensive simulation routine run slower in parallel than sequentially on a local multi-core machine (with the parallel toolboxes), while the same code shows a speedup on a cluster?

Question

Giovanni De Luca on 28 Jun 2013


https://ch.mathworks.com/matlabcentral/answers/80595-why-with-running-an-intensive-time-simulation-routine-in-a-local-multi-core-with-the-parallel-toolb

Closed: MATLAB Answer Bot on 20 Aug 2021

Hi, I implemented a code whose most time-consuming subroutine is the backslash operator, used to solve very large linear systems with multiple right-hand sides (the matrices can be either sparse or dense). Part of the code is "embarrassingly" parallel (no communication between worker nodes) over a few iterations (I will show that their small number is not the problem), so I tested the parallel code both on a local multi-core machine and on a cluster, varying the size of the datasets (from 300 KB to 450 MB, some sparse, others dense). On the multi-core machine, with 48 GB of RAM and a given processor, I almost never obtained a speedup; even worse, the parallel code is considerably slower than the sequential one. Running the same algorithm on the cluster (each node has 2 GB of RAM and a processor of about the same power, or even less), however, gives a speedup that is almost linear in the number of nodes! Why? Here are some comments and questions:
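To make the setup concrete, here is a minimal sketch of the kind of comparison described above. It is not the actual code: the sizes `n`, `nRhs` and `nSys`, the random test matrices, and the use of `parfor` (rather than whatever construct the real code uses on the labs) are all assumptions chosen only for illustration.

```matlab
% Hypothetical problem sizes; the real systems are much larger.
n    = 500;   % order of each linear system (assumed)
nRhs = 20;    % right-hand sides per system (assumed)
nSys = 8;     % number of independent systems (assumed)

% Build dense, diagonally dominant test matrices so every system is
% well conditioned. These stand in for the real datasets.
rng(0);
A = cell(1, nSys);
B = cell(1, nSys);
for k = 1:nSys
    A{k} = rand(n) + n*eye(n);
    B{k} = rand(n, nRhs);
end

% Sequential version: one client session, where backslash is free to
% use several computational threads internally.
X_seq = cell(1, nSys);
tic;
for k = 1:nSys
    X_seq{k} = A{k} \ B{k};
end
t_seq = toc;

% Parallel version: the iterations are independent, so parfor can
% hand them to the workers of an open pool. If no pool is available,
% the loop simply runs on the client.
X_par = cell(1, nSys);
tic;
parfor k = 1:nSys
    X_par{k} = A{k} \ B{k};
end
t_par = toc;

fprintf('sequential: %.3f s, parfor: %.3f s, ratio: %.2f\n', ...
        t_seq, t_par, t_seq / t_par);
```

The printed ratio is the measured speedup of the parallel loop over the sequential one for this toy problem only; it says nothing by itself about the real datasets.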

1) On a multi-core machine, the sequential algorithm (which runs in multi-threaded mode) is the fastest non-parallelized algorithm that solves the same problem as the parallelized one (the definition of speedup), whereas in the parallel version every lab works in single-threaded mode. On a cluster, however, after connecting to the master node (which may not be on your local computer), you request the desired number of workers: "n" workers for the parallel code and just 1 worker for the sequential code. This time the sequential algorithm runs in single-threaded mode, since you open a pool of 1 worker from the client, so the definition of speedup is not satisfied (is that right?). If so, this could be one factor behind the evident speedup of the simulations on the cluster. It would mean that with the backslash operator, which is optimized for multi-threaded execution, the best performance on a local multi-core machine comes from running sequentially rather than delegating the computation to the workers, each running in single-threaded mode; on the cluster, on the other hand, the sequential run is single-threaded, so parallelizing the code pays off. Anyway, I don't think this is either the only or the main factor behind the slowdown on the multi-core machine (or the speedup on the cluster)...

2) For sparse matrices on the multi-core machine I can obtain some speedup, but only by a very small factor; for dense matrices, across several multi-core configurations (memory resources and core series), there is almost always a slowdown. In some cases (small datasets) the parallel time equals the sequential one, but parallelization brings no benefit. This points to a memory and bus bandwidth problem (by the way, the startup overhead of the parallel initialization affects both the multi-core machine and the cluster). In fact, on the multi-core machine I tried not gathering the data back after the labs finished their computations, and I observed a time saving (communication overhead). Besides, to verify the benefit of multi-threading in the sequential code for dense datasets (which takes advantage of BLAS), I ran the simulation without opening a matlabpool (sequential) and with a pool of just 1 worker (parallel, but on a single core): the multi-threaded run is clearly faster than the parallel one.

3) On the local multi-core machine, the RAM is partitioned into "n_labs" pieces (including one for the client). With this much RAM, memory should no longer be a bottleneck for shared resources and memory management (the resources, RAM and cores, are comparable between the multi-core machine and the cluster). Do you agree that it is not a problem of hardware resources?

4) Following from point 3), what differs is the bus bandwidth: 10 Gb/s for the cluster's InfiniBand network. At what speed does data travel over the local bus?
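The baseline issue raised in point 1) can be isolated on a single machine without any pool. The following is a minimal sketch, assuming a dense random test matrix with hypothetical sizes `n` and `nRhs`; it times the same backslash solve once with the session's default thread limit and once restricted to one computational thread via `maxNumCompThreads`.

```matlab
% Hypothetical sizes, chosen only so the solve takes measurable time.
n    = 2000;
nRhs = 50;

rng(0);
A = rand(n) + n*eye(n);   % dense, well-conditioned test matrix (assumed)
B = rand(n, nRhs);

% Solve with the session's current (normally multi-threaded) limit.
tic;
X_multi = A \ B;
t_multi = toc;

% Restrict this session to a single computational thread. The call
% returns the previous limit so it can be restored afterwards.
oldLimit = maxNumCompThreads(1);
tic;
X_single = A \ B;
t_single = toc;
maxNumCompThreads(oldLimit);   % restore the original limit

fprintf('threads allowed: %d\n', oldLimit);
fprintf('multi-threaded:  %.3f s\n', t_multi);
fprintf('single-threaded: %.3f s\n', t_single);
```

If `t_single` is noticeably larger than `t_multi`, a speedup measured against a single-threaded baseline (such as a pool of 1 worker) will look better than one measured against the multi-threaded sequential run, which is the comparison the definition of speedup asks for.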

Thank you in advance,

Giovanni