Is there a way to speed up NumPy array calculations when they only contain values in the upper/lower triangle?
Solution 1:
We can simplify things to leverage broadcasting in a minimal number of places. Doing so, we end up with something like this to directly get the final output from `u` and `m` -

```python
np.triu((1-u/u.reshape(N, 1))**3*m)
```
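The question's original `method1`/`method2` aren't reproduced here, but the broadcast expression can be checked against a plain double loop on small made-up inputs (the `N`, `u` and `m` below are assumptions; in the original they come from the question):

```python
import numpy as np

# Hypothetical small inputs standing in for the question's u and m.
N = 5
rng = np.random.default_rng(0)
u = rng.random(N) + 1          # 1-D, length N (kept away from zero for the division)
m = rng.random(N)              # 1-D, length N

# Broadcast version: u/u.reshape(N, 1) is an (N, N) matrix whose (i, j)
# entry is u[j]/u[i]; multiplying by m scales column j by m[j].
fast = np.triu((1 - u/u.reshape(N, 1))**3 * m)

# Explicit-loop reference, computed for the upper triangle only.
ref = np.zeros((N, N))
for i in range(N):
    for j in range(i, N):
        ref[i, j] = (1 - u[j]/u[i])**3 * m[j]

assert np.allclose(fast, ref)
```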
Then, we could leverage the `numexpr` module, which performs noticeably better on transcendental operations, as is the case here, and is also very memory-efficient. Ported to `numexpr`, it would be -

```python
import numexpr as ne

np.triu(ne.evaluate('(1-u/u2D)**3*m', {'u2D': u.reshape(N, 1)}))
```
Bringing the masking part into the `evaluate` call gives a further performance boost -

```python
M = np.tri(N, dtype=bool)
ne.evaluate('(1-M)*(1-u/u2D)**3*m', {'u2D': u.reshape(N, 1)})
```
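As a plain-NumPy sanity check of the masking identity (so it runs even without `numexpr` installed): `np.tri(N, dtype=bool)` is True on *and below* the diagonal, so `(1 - M)` also zeroes the diagonal, while `np.triu` keeps it. The two still agree here only because `1 - u[i]/u[i]` makes the diagonal zero anyway. The inputs below are made up:

```python
import numpy as np

# Hypothetical stand-ins for the question's inputs.
N = 6
rng = np.random.default_rng(1)
u = rng.random(N) + 1
m = rng.random(N)

expr = (1 - u/u.reshape(N, 1))**3 * m   # the full (N, N) expression

M = np.tri(N, dtype=bool)               # True on and below the diagonal
masked = (1 - M) * expr                 # zeroes the diagonal too
triu_version = np.triu(expr)            # keeps the diagonal

# The only cells where the two masks differ are the diagonal, and the
# expression is already zero there (1 - u[i]/u[i] == 0).
assert np.allclose(masked, triu_version)
```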
Timings on the given dataset -

```
In [25]: %timeit method1()
1000 loops, best of 3: 521 µs per loop

In [26]: %timeit method2()
1000 loops, best of 3: 417 µs per loop

In [27]: %timeit np.triu((1-u/u.reshape(N, 1))**3*m)
1000 loops, best of 3: 408 µs per loop

In [28]: %timeit np.triu(ne.evaluate('(1-u/u2D)**3*m',{'u2D':u.reshape(N, 1)}))
10000 loops, best of 3: 159 µs per loop

In [29]: %timeit ne.evaluate('(1-M)*(1-u/u2D)**3*m',{'u2D':u.reshape(N, 1),'M':np.tri(N,dtype=bool)})
10000 loops, best of 3: 110 µs per loop
```
Note that another, more idiomatic way to extend `u` to a 2D version would be with `np.newaxis`/`None`. Hence, `u.reshape(N, 1)` could be replaced by `u[:, None]`. This shouldn't change the timings, though.
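A quick check that the two spellings produce the same `(N, 1)` view without copying:

```python
import numpy as np

u = np.arange(1.0, 6.0)   # any 1-D array
N = u.size

a = u.reshape(N, 1)
b = u[:, None]            # np.newaxis is just an alias for None

assert a.shape == b.shape == (N, 1)
assert np.array_equal(a, b)
# Both are views onto u's buffer, so neither copies the data.
assert a.base is not None and b.base is u
```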
Solution 2:
Here is another solution that is faster in some cases but slower in others.
```python
idx = np.triu_indices(N)

def my_method():
    result = np.zeros((N, N))
    t = 1 - u[idx[1]] / u[idx[0]]
    result[idx] = t * t * t * m[idx[1]]
    return result
```
Here, the computation is done only for the elements in the (flattened) upper triangle. However, there is overhead in the 2D-index-based assignment `result[idx] = ...`, so the method is faster only when that overhead costs less than the computations it saves -- which happens when `N` is small or the computation is relatively expensive (e.g., using `t ** 3` instead of `t * t * t`).
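A small sketch (with made-up `N`, `u` and `m`) confirming that the index-based method matches the full broadcast formula while only touching the `N*(N+1)/2` upper-triangle elements:

```python
import numpy as np

N = 7
rng = np.random.default_rng(2)
u = rng.random(N) + 1
m = rng.random(N)

idx = np.triu_indices(N)   # (row_indices, col_indices) of the upper triangle

def my_method():
    result = np.zeros((N, N))
    t = 1 - u[idx[1]] / u[idx[0]]     # one division per upper-triangle cell
    result[idx] = t * t * t * m[idx[1]]
    return result

# Reference: full (N, N) computation followed by np.triu masking.
reference = np.triu((1 - u/u.reshape(N, 1))**3 * m)
assert np.allclose(my_method(), reference)

# Only about half of the divisions/multiplications are performed:
assert idx[0].size == N * (N + 1) // 2
```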
Another variation of the method is to use a 1D index for the assignment, which can lead to a small speedup.
```python
idx = np.triu_indices(N)
raveled_idx = np.ravel_multi_index(idx, (N, N))

def my_method2():
    result = np.zeros((N, N))
    t = 1 - u[idx[1]] / u[idx[0]]
    result.ravel()[raveled_idx] = t * t * t * m[idx[1]]
    return result
```
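The flat-index variant relies on two facts: `np.ravel_multi_index` maps each `(row, col)` pair to `row * N + col` for a C-ordered `(N, N)` array, and `result.ravel()` returns a writable view when the array is contiguous. A small sketch:

```python
import numpy as np

N = 4
idx = np.triu_indices(N)
raveled_idx = np.ravel_multi_index(idx, (N, N))

# For a C-ordered (N, N) array, flat index = row * N + col.
assert np.array_equal(raveled_idx, idx[0] * N + idx[1])

# Writing through the flat view hits exactly the same cells as result[idx].
a = np.zeros((N, N))
a.ravel()[raveled_idx] = 1.0   # ravel() is a view here, so this writes into a
b = np.zeros((N, N))
b[idx] = 1.0
assert np.array_equal(a, b)
assert np.array_equal(a, np.triu(np.ones((N, N))))
```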
Below are the results of performance tests. Note that `idx` and `raveled_idx` are fixed for a given `N` and do not change with `u` and `m` (as long as their shapes remain unchanged), so they can be precomputed and their cost is excluded from the test. (If you need to call these methods with matrices of many different sizes, there will be added overhead in computing `idx` and `raveled_idx`.) For comparison, `method4b`, `method5` and `method6` cannot benefit much from any precomputation. For `method_ne`, the precomputation `M = np.tri(N, dtype=bool)` is also excluded from the test.
```
%timeit method4b()
%timeit method5()
%timeit method6()
%timeit method_ne()
%timeit my_method()
%timeit my_method2()
```
Result (for `N = 160`):

```
1.54 ms ± 7.15 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.63 ms ± 11.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
167 µs ± 15.8 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
255 µs ± 14.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
233 µs ± 1.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
177 µs ± 907 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```

For `N = 32`:

```
89.9 µs ± 880 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
84 µs ± 728 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
25.2 µs ± 223 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
28.6 µs ± 4.68 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
17.6 µs ± 1.56 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
14.3 µs ± 52.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```

For `N = 1000`:

```
70.7 ms ± 871 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
65.1 ms ± 1.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
21.4 ms ± 642 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
3.03 ms ± 342 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
15.2 ms ± 95.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
12.7 ms ± 217 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

Using `t ** 3` instead of `t * t * t` in `my_method` and `my_method2` (`N = 160`):

```
1.53 ms ± 14.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.6 ms ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
156 µs ± 1.62 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
235 µs ± 8.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.4 ms ± 4.78 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.32 ms ± 9.07 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
Here, `my_method` and `my_method2` outperform `method4b` and `method5` a little.
Solution 3:
I think the answer may be quite simple. Just put zeros in the cells that you don't want calculated, and the overall calculation will be faster. I think that may explain why `method1()` was faster than `method2()`.
Here are some tests to illustrate the point.
```
In [29]: size = (160, 160)

In [30]: z = np.zeros(size)

In [31]: r = np.random.random(size) + 1

In [32]: t = np.triu(r)

In [33]: w = np.ones(size)

In [34]: %timeit z**3
177 µs ± 1.06 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [35]: %timeit t**3
376 µs ± 2.28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [36]: %timeit r**3
572 µs ± 1.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [37]: %timeit w**3
138 µs ± 548 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [38]: %timeit np.triu(r)**3
427 µs ± 3.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [39]: %timeit np.triu(r**3)
625 µs ± 3.87 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
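Note that the last two timings compute the same matrix two ways: because `0**3 == 0`, masking the lower triangle before or after the power gives an identical result, so the only question is which order is cheaper. A quick check with made-up data:

```python
import numpy as np

rng = np.random.default_rng(3)
r = rng.random((8, 8)) + 1

# 0**3 == 0, so zeroing the lower triangle before cubing changes nothing
# in the final matrix; it only changes how much work the power does.
assert np.array_equal(np.triu(r)**3, np.triu(r**3))
```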
I'm not sure how this works at a low level, but clearly zero (or one) raised to a power takes much less time to compute than other values. Also interesting: with the `numexpr` computation there is no difference.
```
In [42]: %timeit ne.evaluate("r**3")
79.2 µs ± 1.32 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [43]: %timeit ne.evaluate("z**3")
79.3 µs ± 1.34 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
So, I think the fastest approach without using `numexpr` may be this:

```python
def method5():
    return np.triu(1 - u/u[:, None])**3*m

assert np.array_equal(method1(), method5())
```

```
In [65]: %timeit method1()
656 µs ± 2.78 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [66]: %timeit method5()
587 µs ± 5.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
Or, if you are really chasing every microsecond:

```python
def method4b():
    split = N//2
    x = np.zeros((N, N))
    u_mat = np.triu(1 - u/u.reshape(N, 1))
    x[:split, :] = u_mat[:split, :]**3*m
    x[split:, split:] = u_mat[split:, split:]**3*m[split:]
    return x

assert np.array_equal(method1(), method4b())
```

```
In [71]: %timeit method4b()
543 µs ± 3.57 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [72]: %timeit method4b()
533 µs ± 7.43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
And @Divakar's answer using `numexpr` is the fastest overall.
UPDATE
Thanks to @GZ0's comment: if you only need to raise to the power of 3, this is much faster (though I noticed a slight loss of precision):

```python
def method6():
    a = np.triu(1 - u/u[:, None])
    return a*a*a*m

assert np.isclose(method1(), method6()).all()
```

```
In [84]: %timeit method6()
195 µs ± 609 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
In fact, it is not far off the `numexpr` methods in @Divakar's answer (185/163 µs on my machine).
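The precision caveat is why the check uses `np.isclose` rather than exact equality: `t ** 3` and `t * t * t` can round differently in the last bits, while still agreeing to within a few ulps. A quick illustration with made-up values:

```python
import numpy as np

rng = np.random.default_rng(4)
t = rng.random(1000) * 2 - 1   # values in [-1, 1)

pow3 = t ** 3
mul3 = t * t * t

# The two orderings of operations round differently, so they may not be
# bitwise identical, but the difference stays tiny.
assert np.allclose(pow3, mul3)
assert np.max(np.abs(pow3 - mul3)) <= 1e-12
```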