A scalable dynamic programming scheme for the computation of optimal k-segments for ordered data
MetadataShow full item record
The optimal kk-segments of an ordered dataset of size nn consists of kk tuples that are obtained by merging consecutive tuples such that a given error metric is minimized. The problem is general and has been studied in various flavors, e.g., piecewise-constant approximation, parsimonious temporal aggregation, and v-optimal histograms. A well-known computation scheme for the optimal kk-segments is based on dynamic programming, which computes a k×nk×n error matrix E and a corresponding split point matrix J of the same size. This yields O(n·k)O(n·k) space and O(n2·k)O(n2·k) runtime complexity. In this paper, we propose three optimization techniques for the runtime complexity and one for the space complexity. First, diagonal pruning identifies regions of the error matrix E that need not to be computed since they cannot lead to a valid solution. Second, for those cells in E that are computed, we provide a heuristic to determine a better seed value, which in turn leads to a tighter lower bound for the potential split points to be considered for the calculation of the minimal error. Third, we show how the algorithm can be effectively parallelized. The space complexity is dominated by the split point matrix J, which needs to be kept till the end. To tackle this problem, we replace the split point matrix by a dynamic split point graph, which eliminates entries that are not needed to retrieve the optimal solution. A detailed experimental evaluation shows the effectiveness of the proposed solutions. Our optimization techniques significantly improve the runtime of state-of-the-art matrix implementations, and they guarantee a comparable performance of an implementation that uses the split point graph. The split point graph reduces the memory consumption up to two orders of magnitude and allows us to process large datasets for which the memory explodes if the matrix is used.