Abstract
Max-sum diversity is a fundamental primitive for web search and data mining. For a given set S of n elements, it returns a subset of k«l n representatives maximizing the sum of their pairwise distances, where distance models dissimilarity. An important variant of the primitive prescribes that the desired subset of representatives satisfies an additional orthogonal requirement, which can be specified as a matroid constraint (i.e., a feasible solution must be an independent set of size k). While unconstrained max-sum diversity admits efficient coreset-based strategies, the only known approaches dealing with the additional matroid constraint are inherently sequential and are based on an expensive local search over the entire input set. We devise the first coreset constructions for max-sum diversity under various matroid constraints, together with efficient sequential, MapReduce and Streaming implementations. By running the local-search on the coreset rather than on the entire input, we obtain the first practical solutions for large instances. Technically, our coresets are subsets of S containing a feasible solution which is no more than a factor 1-ε away from the optimal solution, for any fixed ε <1, and, for spaces of bounded doubling dimension, they have a small size independent of n. Extensive experiments show that, with respect to full-blown local search, our coreset-based approach yields solutions of comparable quality, with improvements of up to two orders of magnitude in the running time, also for input sets of unknown dimensionality.