# Compare lstsq performance in Python

## April 28, 2019

**Edit 2019-05-09: The benchmark has been updated to include the latest CuPy syntax for cupy.linalg.lstsq.**

CuPy is a GPU accelerated version of NumPy that is very easy to use. I just submited a PR which adds `cupy.linalg.lstsq`

, and naturally I wanted to compare the least squares performance to `numpy.linalg.lstsq`

. I also compared `tensorflow.linalg.lstsq`

as a reasonable alternative. Hopefully this will provide some insight into when it will pay off to solve least squares problems with CuPy instead of NumPy.

**I only have one computer that I was able to run this on so far. I’ll update the post if I can run this benchmark on different hardware.**

# Least squares solvers

NumPy and CuPy use singular value decomposition (SVD) to solve least squares problems. NumPy defaults to the *gelsd* lapack routine which is a divide-and-conquer SVD strategy. TensorFlow’s *lstsq* defaults to using a Cholesky decomposition which *should be* faster than SVD.

# Benchmark

I have a small benchmark problem which performs piecewise linear least squares fits to a sine wave. The benchmark performs fits with known breakpoint locations for 6 and 20 line segments (which translates to solving least squares problem for 6 or 20 unknown parameters). The number of data points was varied from 100 to 6,309,573. This is an odd number, because I couldn’t solve 10**7 data points in the 12 Gb of my GPU’s memory. The code to run the benchmark is available here, and the process to run the code was

```
python3 sine_benchmark_fixed_six_break_points.py
python3 sine_benchmark_fixed_six_break_points_TFnoGPU.py
python3 sine_benchmark_fixed_twenty_break_points.py
python3 sine_benchmark_fixed_twenty_break_points_TFnoGPU.py
python3 plot_results.py
```

This code basically runs the following:

```
import tensorflow as tf
import cupy as cp
import numpy as np
import pwlf
from time import time
import os
n_data = 10**6 # number of data points
np.random.seed(256)
# generate sin wave data
x = np.linspace(0, 10, num=n_data)
y = np.sin(x * np.pi / 2)
# add noise to the data
y = np.random.normal(0, 0.05, size=n_data) + y
my_pwlf = pwlf.PiecewiseLinFit(x, y)
A = my_pwlf.assemble_regression_matrix(breaks, my_pwlf.x_data)
Acp = cp.asarray(A)
ycp = cp.asarray(y)
# numpy.linalg.lstsq
t0 = time()
beta_np, _, _, _ = np.linalg.lstsq(A, y, rcond=1e-15)
t1 = time()
# cupy.linalg.lstsq
t2 = time()
beta_cp = cp.linalg.lstsq(Acp, ycp)
t3 = time()
Atf = tf.convert_to_tensor(A)
ytf = tf.convert_to_tensor(y.reshape(-1, 1))
beta_tf_fast = tf.linalg.lstsq(Atf, ytf, fast=True)
beta_tf_not_fast = tf.linalg.lstsq(Atf, ytf, fast=False)
with tf.Session():
# tf.linalg.lstsq fast=True
t4 = time()
beta_tf = beta_tf_fast.eval()
t5 = time()
```

I’ve ran the benchmark on the following computers:

CPU | GPU | OS | TF & NumPy built from source |
---|---|---|---|

AMD FX-8350 | NVIDIA TITAN Xp | Linux | Yes |

# Results

The results below compare the run time of the fits against the number of data points. The vertical error bars denote the 10th and 90th percentile, considering a Normal distribution with 90% confidence (from the 10 replicate runs).

CPU/GPU | lstsq parameters | Data points | CuPy x faster NumPy |
---|---|---|---|

AMD FX-8350/NVIDIA TITAN Xp | 6 | 6,309,573 | 4.84 |

AMD FX-8350/NVIDIA TITAN Xp | 20 | 6,309,573 | 6.54 |

# Disucssion

It appears that the CuPy least squares solver wasn’t faster than NumPy until there was at least 100,000 data points. With 6,309,573 data points `cupy.linalg.lstsq`

was about 6 times faster than `numpy.linalg.lstsq`

.

I should really run this benchmark on a computer with Intel CPU + NVIDIA GPU. Especially since a previous post hinted that TensorFlow performs significantly better on an Intel CPU, and a Cholesky decomposition should be faster than SVD.

NumPy was built from source using Intel MKL on the AMD FX-8350 which isn’t the fastest, however it is the best supported…

# Acknowledgements

The TITAN Xp used for this work was donated by the NVIDIA Corporation.