Note that when you turn on profiling, aspects of the JIT (Just In Time Compiler) might be turned off. It can take considerably longer to execute code when profiling is turned on, and you can end up with severely distorted understandings of relative executions time.
It is essentially impossible to deduce computational complexity by measuring execution time on modern computers.
A couple of months ago I posted this with regards to computational complexity of exp():
The question can be talked about from a theory standpoint, or from a practical standpoint.
From a theory standpoint, exp(), being a 1 to 1 function, could theoretically be implemented as a single lookup table, and therefore could be represented as taking constant time.
Some processors implement exp() in hardware, though it does not appear that any of the mainstream general purpose architectures do. NVIDIA offers __device__float __expf for base e exponentiation.
If you examine the theory of what computational complexity measures, it basically measures the growth in the number of times that you need to apply a unary or binary operation to values, measuring how the number of times rises as the size of the input increases. Computational complexity does not care how long it takes to calculate any one operation -- that is all folded into constants of proportionality that do not matter when it comes time to talk about rate of growth. An algorithm that needs to touch every item of input 1073 times will be faster, on large enough inputs, than an algorithm that somehow computes the same thing with 3*n operations per n input (so 3*n*n = 3*n^2 operations). So as far as computational complexity is concerned, how long it takes to compute exp() does not matter: the time is a bounded constant per single input for any given processor. And yet, obviously in practice it makes a notable difference whether you do exp(sum(x)) or you do prod(exp(x)). Computational complexity doesn't care, but real processors do.
Theoretical analysis says that you can calculate a matrix multiplication in time proportional to about n^2.3 when the array is n by n: people have proven it in theory and said explicitly how it could be done. But in practice, the computation takes a lot of memory and needs a lot of initialization, so it is slower in theory than the common n^3 algorithm until you get array sizes somewhere beyond 10^10. What is faster in practice is the "blocked" matrix multiplication algorithms from the BLAS type libraries, that know about processor cache sizes and know to reduce locality of reference and are suitable for execution on multiple hardware threads simultaneously. Thus if you create an algorithm that involves a matrix multiplication, then the computational complexity you need to use for theory analysis is the n^2.3-ish figure, because it could be done in that proportion -- but the complexity you would measure for that step would be n^3, and you would get the results very much sooner than if you use the n^2.3-ish algorithm...