
Numpy Performance Differences Between Linux And Windows

I am trying to run sklearn.decomposition.TruncatedSVD() on 2 different computers and understand the performance differences. computer 1 (Windows 7, physical computer) OS Name Micro

Solution 1:

{built-in method dot} is the np.dot function, which is a NumPy wrapper around the CBLAS routines for matrix-matrix, matrix-vector and vector-vector multiplication. Your Windows machine uses the heavily tuned Intel MKL version of CBLAS. The Linux machine is using the slow, old reference implementation.
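To see where a line like {built-in method dot} comes from, here is a minimal sketch that profiles a matrix product with cProfile. The helper names multiply and profile_dot and the matrix size are made up for illustration. The exact label printed for np.dot depends on the NumPy and Python versions, and some versions may not list it as a separate entry, so the product is wrapped in a plain Python function that always shows up in the report.

```python
import cProfile
import io
import pstats

import numpy as np


def multiply(a, b):
    # Thin Python wrapper so the call is always visible in the profile,
    # whatever label this NumPy version gives to np.dot itself.
    return np.dot(a, b)


def profile_dot(n=300, calls=5):
    """Profile `calls` products of two n x n matrices and return the report text."""
    rng = np.random.default_rng(0)
    a = rng.standard_normal((n, n))
    b = rng.standard_normal((n, n))

    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(calls):
        multiply(a, b)
    profiler.disable()

    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("cumulative").print_stats(10)
    return buf.getvalue()


print(profile_dot())
```

Running the same script on both machines gives per-call times that can be compared directly, without the rest of TruncatedSVD in the way.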

If you install ATLAS or OpenBLAS (both available through Linux package managers) or, in fact, Intel MKL, you're likely to see massive speedups. Try sudo apt-get install libatlas-dev, check the NumPy config again to see if it picked up ATLAS, and measure again.
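As a rough way to "check the NumPy config and measure again", the sketch below prints the build configuration with np.show_config() and times np.dot on random square matrices. The helper name time_dot and the matrix sizes are arbitrary choices for illustration, not taken from the question.

```python
import time

import numpy as np


def time_dot(n=1000, repeats=3):
    """Return the best wall-clock time in seconds for one n x n np.dot call."""
    rng = np.random.default_rng(0)
    a = rng.standard_normal((n, n))
    b = rng.standard_normal((n, n))
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        np.dot(a, b)
        best = min(best, time.perf_counter() - start)
    return best


# Shows which BLAS/LAPACK libraries this NumPy build was linked against.
np.show_config()
print(f"best of 3 for a 500x500 np.dot: {time_dot(500):.4f} s")
```

If the configuration output mentions MKL, OpenBLAS or ATLAS, NumPy is using a tuned library. Comparing the printed time before and after installing one should show whether the change took effect.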

Once you've decided on the right CBLAS library, you may want to recompile scikit-learn. Most of it just uses NumPy for its linear algebra needs, but some algorithms (notably k-means) use CBLAS directly.

The OS has little to do with this.

Solution 2:

Notice the {built-in method dot} difference: from 0.035 s/call to 16.058 s/call, roughly 450 times slower!

Clock speeds and cache hit ratio are two big factors to consider. The Xeon E5-2670 has a lot more cache than the Core i7-3770, while the i7-3770 has a higher peak clock speed with turbo mode. Although your Xeon has a big cache in hardware, on EC2 you might effectively be sharing that cache with other customers.

Is there a way I can further debug this performance issue?

Well, you have different measurements (outputs) and multiple differences on the inputs (OS and hardware). Given the differing inputs, these different outputs are likely expected.

CPU performance counters will better isolate the effects of your algorithm's performance on different systems. The Xeons have richer performance counters, but all of your CPUs should have CPU_CLK_UNHALTED and LLC_MISSES. These work by mapping the instruction pointer to events such as clock cycles spent executing code or cache misses, so you can see which parts of the code are CPU bound and which are cache bound. Since the clock speeds and cache sizes differ among your targets, you might find that one is cache bound and the other is CPU bound.

Linux has a tool called perf (sometimes perf_events). See also http://www.brendangregg.com/perf.html
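A minimal sketch of driving perf from Python, assuming perf is installed and that the benchmark lives in a script called svd_benchmark.py (a hypothetical name). It uses perf's generic cycles and cache-misses events, which roughly correspond to the counters mentioned above. On virtual machines the hypervisor may not expose hardware counters, in which case perf reports them as not supported.

```python
import shutil
import subprocess
import sys


def build_perf_command(script, events=("cycles", "cache-misses")):
    """Build a `perf stat` command line that runs `script` with this Python."""
    return ["perf", "stat", "-e", ",".join(events), sys.executable, script]


def run_perf(script):
    """Run the script under `perf stat`; return the summary, or None if perf is missing."""
    if shutil.which("perf") is None:
        return None
    result = subprocess.run(
        build_perf_command(script), capture_output=True, text=True
    )
    # perf stat writes its counter summary to stderr by default.
    return result.stderr


# "svd_benchmark.py" is a placeholder for your own benchmark script.
print(" ".join(build_perf_command("svd_benchmark.py")))
```

Calling run_perf("svd_benchmark.py") on each Linux machine gives cycle and cache-miss totals for the same workload, which helps separate a CPU-bound run from a cache-bound one.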

On Linux and Windows you can also use Intel VTune.
