
Is There A Way To Speed Up Numpy Array Calculations When They Only Contain Values In Upper/lower Triangle?

I'm doing some matrix calculations (2D) that only involve values in the upper triangle of the matrices. So far I've found that using NumPy's triu method ('return a copy of a matrix with the elements below the k-th diagonal zeroed') …

Solution 1:

We can simplify things to use broadcasting in as few places as possible, ending up with something like this to get the final output directly from u and m:

np.triu((1-u/u.reshape(N, 1))**3*m)
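To see what the broadcasting does, here is a small self-contained check (N, u and m are random stand-ins for the question's inputs) that the one-liner matches an explicit loop over the upper triangle:

```python
import numpy as np

# Random stand-ins for the question's inputs (u > 0 to avoid division by zero).
N = 5
rng = np.random.default_rng(0)
u = rng.random(N) + 1.0
m = rng.random(N)

# Broadcasted one-liner: element [i, j] is (1 - u[j]/u[i])**3 * m[j].
fast = np.triu((1 - u / u.reshape(N, 1))**3 * m)

# Reference: explicit loop over the upper triangle only.
ref = np.zeros((N, N))
for i in range(N):
    for j in range(i, N):
        ref[i, j] = (1 - u[j] / u[i])**3 * m[j]

assert np.allclose(fast, ref)
```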

Then we can bring in the numexpr module, which performs noticeably better when working with transcendental operations, as is the case here, and is also very memory efficient. Ported to numexpr, it would be:

import numexpr as ne

np.triu(ne.evaluate('(1-u/u2D)**3*m',{'u2D':u.reshape(N, 1)}))

Bringing the masking inside the evaluate call gives a further performance boost:

M = np.tri(N,dtype=bool)
ne.evaluate('(1-M)*(1-u/u2D)**3*m',{'u2D':u.reshape(N, 1)})
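One subtlety worth noting: np.tri(N, dtype=bool) is True on and below the diagonal, so (1 - M) also zeroes the diagonal, which np.triu would keep. The results still agree because the diagonal of (1 - u/u[:, None]) is already zero. A pure-NumPy sketch of this identity (random stand-ins for u and m):

```python
import numpy as np

N = 6
rng = np.random.default_rng(1)
u = rng.random(N) + 1.0
m = rng.random(N)

# M is True on and below the diagonal; (1 - M) keeps strictly-above-diagonal
# entries. The diagonal of the expression is 0 anyway, so triu agrees.
M = np.tri(N, dtype=bool)
expr = (1 - u / u.reshape(N, 1))**3 * m

assert np.allclose(np.triu(expr), (1 - M) * expr)
```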

Timings on the given dataset:

In [25]: %timeit method1()
1000 loops, best of 3: 521 µs per loop

In [26]: %timeit method2()
1000 loops, best of 3: 417 µs per loop

In [27]: %timeit np.triu((1-u/u.reshape(N, 1))**3*m)
1000 loops, best of 3: 408 µs per loop

In [28]: %timeit np.triu(ne.evaluate('(1-u/u2D)**3*m',{'u2D':u.reshape(N, 1)}))
10000 loops, best of 3: 159 µs per loop

In [29]: %timeit ne.evaluate('(1-M)*(1-u/u2D)**3*m',{'u2D':u.reshape(N, 1),'M':np.tri(N,dtype=bool)})
10000 loops, best of 3: 110 µs per loop

Note that another, more idiomatic way to extend u to a 2D version is with np.newaxis/None: u.reshape(N, 1) can be replaced by u[:,None]. This shouldn't change the timings, though.
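A quick sketch of that equivalence (u is a stand-in array):

```python
import numpy as np

N = 4
u = np.arange(1.0, N + 1)

# u[:, None] and u.reshape(N, 1) both turn the 1D array into an (N, 1)
# column, which is what drives the broadcasting above.
assert u[:, None].shape == (N, 1)
assert np.array_equal(u[:, None], u.reshape(N, 1))
```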

Solution 2:

Here is another solution that is faster in some cases and slower in others.

idx = np.triu_indices(N)

def my_method():
    result = np.zeros((N, N))
    t = 1 - u[idx[1]] / u[idx[0]]
    result[idx] = t * t * t * m[idx[1]]
    return result

Here, the computation is done only for the elements in the (flattened) upper triangle. However, there is overhead in the 2D-index-based assignment result[idx] = .... The method is therefore faster when that overhead is less than the computation it saves, which happens when N is small or the per-element computation is relatively expensive (e.g., using t ** 3 instead of t * t * t).

Another variation of the method is to use a 1D index for the assignment, which can give a small speedup.

idx = np.triu_indices(N)
raveled_idx = np.ravel_multi_index(idx, (N, N))
def my_method2():
    result = np.zeros((N, N))
    t = 1 - u[idx[1]] / u[idx[0]]
    result.ravel()[raveled_idx] = t * t * t * m[idx[1]]
    return result
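A self-contained check (small N, random u and m assumed) that the flattened-index assignment reproduces the full broadcasted result:

```python
import numpy as np

N = 8
rng = np.random.default_rng(2)
u = rng.random(N) + 1.0
m = rng.random(N)

# idx gives (rows, cols) of the upper triangle; raveled_idx is the same
# positions as flat offsets into the (N, N) array.
idx = np.triu_indices(N)
raveled_idx = np.ravel_multi_index(idx, (N, N))

result = np.zeros((N, N))
t = 1 - u[idx[1]] / u[idx[0]]
result.ravel()[raveled_idx] = t * t * t * m[idx[1]]

assert np.allclose(result, np.triu((1 - u / u[:, None])**3 * m))
```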

The following are the results of the performance tests. Note that idx and raveled_idx are fixed for each N and do not change with u and m (as long as their shapes remain unchanged), so their values can be precomputed and their times are excluded from the tests. (If you need to call these methods with matrices of many different sizes, there will be added overhead in computing idx and raveled_idx.) For comparison, method4b, method5 and method6 cannot benefit much from any precomputation. For method_ne, the precomputation M = np.tri(N, dtype=bool) is also excluded from the test.

%timeit method4b()
%timeit method5()
%timeit method6()
%timeit method_ne()
%timeit my_method()
%timeit my_method2()

Result (for N = 160):

1.54 ms ± 7.15 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.63 ms ± 11.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
167 µs ± 15.8 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
255 µs ± 14.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
233 µs ± 1.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
177 µs ± 907 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

For N = 32:

89.9 µs ± 880 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
84 µs ± 728 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
25.2 µs ± 223 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
28.6 µs ± 4.68 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
17.6 µs ± 1.56 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
14.3 µs ± 52.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

For N = 1000:

70.7 ms ± 871 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
65.1 ms ± 1.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
21.4 ms ± 642 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
3.03 ms ± 342 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
15.2 ms ± 95.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
12.7 ms ± 217 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Using t ** 3 instead of t * t * t in my_method and my_method2 (N = 160):

1.53 ms ± 14.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.6 ms ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
156 µs ± 1.62 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
235 µs ± 8.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.4 ms ± 4.78 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.32 ms ± 9.07 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Here, my_method and my_method2 outperform method4b and method5 a little bit.

Solution 3:

I think the answer may be quite simple: put zeros in the cells you don't want to calculate, and the overall calculation will be faster. That might explain why method1() was faster than method2().

Here are some tests to illustrate the point.

In [29]: size = (160, 160)                                                      

In [30]: z = np.zeros(size)                                                     

In [31]: r = np.random.random(size) + 1                                         

In [32]: t = np.triu(r)                                                         

In [33]: w = np.ones(size)                                                      

In [34]: %timeit z**3
177 µs ± 1.06 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [35]: %timeit t**3
376 µs ± 2.28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [36]: %timeit r**3
572 µs ± 1.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [37]: %timeit w**3
138 µs ± 548 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [38]: %timeit np.triu(r)**3
427 µs ± 3.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [39]: %timeit np.triu(r**3)                                                  
625 µs ± 3.87 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

I'm not sure how all this works at a low level, but clearly zero or one raised to a power takes much less time to compute than any other value.

Also interesting: with the numexpr computation there is no difference.

In [42]: %timeit ne.evaluate("r**3")                                            
79.2 µs ± 1.32 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [43]: %timeit ne.evaluate("z**3")                                            
79.3 µs ± 1.34 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

So, I think the fastest without using numexpr may be this way:

def method5():
    return np.triu(1 - u/u[:, None])**3*m

assert np.array_equal(method1(), method5())

In [65]: %timeit method1()
656 µs ± 2.78 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [66]: %timeit method5()
587 µs ± 5.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Or, if you are really chasing every micro-second:

def method4b():
    split = N//2
    x = np.zeros((N, N))
    u_mat = np.triu(1 - u/u.reshape(N, 1))
    x[:split, :] = u_mat[:split, :]**3*m
    x[split:, split:] = u_mat[split:, split:]**3*m[split:]
    return x

assert np.array_equal(method1(), method4b())

In [71]: %timeit method4b()
543 µs ± 3.57 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [72]: %timeit method4b()
533 µs ± 7.43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
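Why the split is safe to make: for rows i >= N//2, every column j < N//2 lies strictly below the diagonal, so that whole lower-left block is zero after triu and can be skipped. A self-contained check with random stand-ins for u and m:

```python
import numpy as np

N = 10
split = N // 2
rng = np.random.default_rng(3)
u = rng.random(N) + 1.0
m = rng.random(N)

# Upper rows get the full-width computation; lower rows only need the
# columns from `split` on, since their left block is below the diagonal.
u_mat = np.triu(1 - u / u.reshape(N, 1))
x = np.zeros((N, N))
x[:split, :] = u_mat[:split, :]**3 * m
x[split:, split:] = u_mat[split:, split:]**3 * m[split:]

assert np.allclose(x, np.triu((1 - u / u[:, None])**3 * m))
```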

And @Divakar's answer using numexpr is the fastest overall.

UPDATE

Thanks to @GZ0's comment, if you only need to raise to the power of 3, this is much faster:

def method6(): 
    a = np.triu(1 - u/u[:, None]) 
    return a*a*a*m 
assert np.isclose(method1(), method6()).all()                          

(Though I noticed a slight loss of precision.)
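The t*t*t shortcut can differ from t**3 in the last few bits, since pow may take a different code path, but the difference stays well within np.isclose's default tolerances. A small sketch with a random stand-in array:

```python
import numpy as np

rng = np.random.default_rng(4)
a = rng.random((16, 16)) * 2 - 1  # values in [-1, 1)

# Repeated multiplication vs. the power operator: equal up to tiny
# floating-point rounding differences.
assert np.allclose(a * a * a, a**3)
```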

In [84]: %timeit method6()                                                      
195 µs ± 609 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In fact it is not far off the numexpr methods in @Divakar's answer (185/163 µs on my machine).
