Abstract
Recommender systems are Integrated Development Environment (IDE) extensions which assist developers in the task of coding. However, since they assist specific aspects of the general activity of programming, their impact is hard to assess. In previous work, we used with success an evaluation strategy using automated benchmarks to automatically and precisely evaluate several recommender systems, based on recording and replaying developer interactions. In this paper, we highlight the challenges we expect to encounter while applying this approach to other recommender systems.