# Segmented Least Squares Problem

The least squares problem is described as follows:

Line of Best Fit

Given n points in the plane: (x1, y1), (x2, y2), …, (xn, yn), find a line y = ax + b that minimizes the sum of squared errors:

SSE = sum1 ≤ i ≤ n(yi – axi – b)2

This is a fundamental problem in statistical and numerical analysis and has a nice closed-form solution. The line of minimum error is y = ax + b, where

Multiple segments representing the best fit

The segmented least squares problem is described as follows:

Given a set of points P = {(x1, y1), (x2, y2), …, (xn, yn)}, where x1 ≤ x2 ≤ … ≤ xn, partition P into some number of segments. Each segment is a subset of P that represents a contiguous set of x-coordinates i.e., a subset of the form {(xi, yi), (xi+1, yi+1), …, (xj, yj)} for some i ≤ j. For each segment S, we find the line minimizing the squared error w.r.t the points in S with the formula given above.

The penalty of a partition is defined to be the sum of following two terms:

1. The number of segments multiplied by a given multiplier C > 0.
2. For each segment, the error value of the optimal line through that segment.

The goal is to find a partition of minimum penalty. Increasing the number of segments reduces the penalty term (2) but increases the penalty term (1). The multiplier C is part of the input.

The segmented least squares problem can be solved with dynamic programming.

Let OPT(j) denote the minimum cost for points p1, p2, …, pj. Let e(i, j) denote the minimum squared error for points pi, pi+1, …, pj. The crucial observation is that the last point pj for some subproblem OPT(j) belongs to a single segment in the optimal partition, and that segment begins at some earlier point pi. Thus, if we knew OPT(i-1), we could compute OPT(j). This leads to the following recursive formulation:

OPT(j) = min1 ≤ i ≤ j(e(i,j) + C + OPT[i-1])

The segment pi, pi+1, …, pj is used in an optimal solution for the subproblem if and only if the minimum is obtained using index i.

Pseudocode:

To find the optimal solution, we can store the index i which minimizes the penalty for OPT(j) in an array and backtrack from OPT(n).

Running time: O(n3)
There are O(n2) pairs (i, j) for which we need to compute e(i, j) and each computation of e(i, j) takes O(n) time. Thus, all e(i, j) values can be computed in O(n3) time. Once the e(i, j) values are computed, the OPT array can be filled in O(n2) time.

Take a look at C++ implementation.

Reference: Algorithm Design by Kleinberg and Tardos

## 8 thoughts on “Segmented Least Squares Problem”

1. Anonymous says:

Can I just add a z-axis to your c++ code to have a working 3D implementation to use on meshes modeled for animated film production(s) (which are usually vertices with at least four or more edges)?

• The closed form solution for least squares problem as described in the post has only two dimensions. I don’t know if there is an analogous closed form in three dimensions.

Dynamic programming requires the subproblems to be ordered. In 3D, we would need to sort the points on all 3 coordinates. I think the recursive formulation is general enough to be extended to more than two dimensions but may not be a good fit to your problem. I bet there exist specialized algorithms for your problem.

2. Aaron A G String says:

The running time should really be O(n^2) because you can compute sum(x_k y_k) (etc) over the range i to j+1 from the same sum over the range i to j in O(1) time just by adding the last value.

• Look at the recursive formulation: OPT(j) = min over 1 ≤ i ≤ j (e(i,j) + C + OPT[i-1])

If we could calculate, e(i, j) in O(1), then the total running time would be O(n^2) but since adding a new point changes the ideal line passing through a set of points, we have to iterate over those points to calculate the error, which takes O(n).

All the sums are, in fact, being calculated in O(n^2). It’s just that e(i,j) has to be calculated for all (i,j) pairs and each calculation takes O(n), summing to a total of O(n^3).

3. I am still new with C++. When I ran your code, it seems like its not able to work on a series with over 1000 points (1000 x and 1000 y points). Is there a reason why?

• Yes because I’m using a fixed array of size 1000. If you want to run this program over larger inputs, increase the max size on line 5.

4. If I make the cost 0, shouldn’t I get N segments (N being the number of x or y points)? If not, why? Another question I had was if I wanted the optimal solution for Q segments, how would I do this? Q is a number we will set.

• If the cost of creating a new segment is 0, then the cost of optimal solution will be 0. We may end up with N segments or fewer if some contiguous points lie on a line.

As for the second question, I could come up with this dynamic formulation:
Let i, j be point indices and k be the desired number of segments.
opt(i,j,k) = e(i,j) if k == 1 else min(opt(i,m,n) + opt(m,j,k-n)) where i<m<j and 0<n<k, with some base cases.