pwlf update: fitting continuous piecewise linear models to data

April 8, 2018

pwlf improvements in 0.2.0 release

  • much faster at finding optimum break point locations due to new derivation of regression problem
  • pwlf now uses pure numpy instead of Python to solve continuous piecewise linear fit
  • new mathematical derivation (defined below)
  • pep8 style naming for class, variables, and functions
  • old naming convention still usable, but will be depreciated at some time
  • Available soon on github or pypi

New derivation

There was nothing wrong with Golovchenko (2004), which was the basis for the the first release of pwlf. I’ve always knew that the for loops used to assemble the regression matrix were slow, and this was killing the performance for large problems. After reading this, I felt a bit inspired to do something about it. The pwlf library has been modified to solve the linear regression matrix directly, where previously a square matrix was assembled (as oppose to the regression matrix). The result is that more of the code is solved in numpy (which is fast), instead of Python.

Let’s assume we have a one dimensional data set. In this case we will assume \(\mathbf{x} \) is the independent variable and \(\mathbf{y} \) is dependent on \(\mathbf{x} \) such that \(\mathbf{y}(\mathbf{x}) \). Our data is paired as

$$ \begin{bmatrix} x_1 & y_1 \\ x_2 & y_2 \\ x_3 & y_3 \\ \vdots & \vdots \\ x_n & y_n \\ \end{bmatrix} $$

where \((x_1, y_1) \) represents the first data point and the data points have been ordered according to \( x_1 < x_2 < x_3 < \cdots < x_n \) for \(n \) number of data points. A piecewise linear function can be constructed to the function as follows

$$ \mathbf{y}(x) = \begin{cases} \eta_1 + \beta_1(x-b_1) & b_1 < x \leq b_2 \\ \eta_2 + \beta_2(x-b_2) & b_2 < x \leq b_3 \\ \vdots & \vdots \\ \eta_n + \beta_{n_b}(x-b_{n_b-1}) & b_{n-1} < x \leq b_{n_b} \\ \end{cases} $$

where \(b_1 \) is the \(x \) location of the first break point, \(b_2 \) is the \(x \) location of the second break point, and so forth until the last break point \(b_{n_b} \) for \(n_b \) number of break points. The break points are also ordered as \(b_1 < b_2 < \cdots < b_{n_b} \). Additionally the first break point is always \(b_1 = x_1 \), and the last break point is always \(b_{n_b} = x_n \). This initialization of the data seems tedious at first, but some magic will happen later on if you arrange the data this way.

Now if we enforce that the piecewise linear functions be continuous on the domain, we’ll end up with

$$ \mathbf{y}(x) = \begin{cases} \beta_1 + \beta_2(x-b_1) & b_1 \leq x \leq b_2 \\ \beta_1 + \beta_2(x-b_1) + \beta_3(x-b_2) & b_2 < x \leq b_3 \\ \vdots & \vdots \\ \beta_1 + \beta_2(x-b_1) + \beta_3(x-b_2) + \cdots + \beta_{n_b+1}(x-b_{n_b-1}) & b_{n-1} < x \leq b_{n_b} \\ \end{cases} $$

as our continuous piecewise linear function. This can be extended in Matrix form as

$$ \begin{bmatrix} 1 & x_1-b_1 & (x_1-b_2)1_{x_1 > b_2} & (x_1-b_3)1_{x_1 > b_3} & \cdots & (x_1-b_{n_b-1})1_{x_1 > b_{n_b-1}} \\ 1 & x_2-b_1 & (x_2-b_2)1_{x_2 > b_2} & (x_2-b_3)1_{x_2 > b_3} & \cdots & (x_2-b_{n_b-1})1_{x_2 > b_{n_b-1}} \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n-b_1 & (x_n-b_2)1_{x_n > b_2} & (x_n-b_3)1_{x_n > b_3} & \cdots & (x_n-b_{n_b-1})1_{x_n > b_{n_b-1}} \\ \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_{n_b} \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} $$

where \(1_{x_n > b_1} \) represents the piecewise function of form

$$ 1_{x_n > b_2} = \begin{cases} 0 & x_n \leq b_2 \\ 1 & x_n > b_2 \\ \end{cases} $$

and \(1_{x_n > b_3} \) represents

$$ 1_{x_n > b_3} = \begin{cases} 0 & x_n \leq b_3 \\ 1 & x_n > b_3 \\ \end{cases} $$

and so forth. This is where the magic happens! If you’ve ordered your data, the result will be a regression matrix that’s already in a lower triangular style. Since \(\mathbf{x} \) was initially sorted from min to max, the matrix can be assembled quickly by replacing only the non-zero values. This won’t be a big deal if you know the break point locations \(\mathbf{b} \). However if you were running an optimization to find the ideal location for break points (as in pwlf), you may need to assemble the regression matrix thousands of times.

We can express the matrix expression as a linear equation

$$ \mathbf{A} \mathbf{\beta} = \mathbf{y} $$

where \(\mathbf{A} \) is our regression matrix, \(\mathbf{\beta} \) is our unknown set of parameters, and \(\mathbf{y} \) is our vector of \(y \) values. We can use a least squares solver to solve for \(\mathbf{\beta} \). In Python, I like to use the numpy lstsq solver which uses LAPACK to solve the matrix.

Once you have found your set of optimal parameters \(\mathbf{\beta} \), then you can predict for new \(x \) values by assembling your regression matrix \(\mathbf{A} \) for the new \(x \) values. The new \(y \) values are solved by multiplying

$$ \mathbf{A} \mathbf{\beta} = \mathbf{y} $$

the new regression matrix with the determined set of parameters.