Contrasting offline and online results when evaluating recommendation algorithms
Most evaluations of novel algorithmic contributions assess their accuracy in predicting what was withheld in an offline evaluation scenario. However, several doubts have been raised that standard offline evaluation practices are not appropriate to select the best algorithm for field deployment. The goal of this work is therefore to compare the offline and the online evaluation methodology with the same study participants, i.e. a within-users experimental design. This paper presents empirical evidence that the ranking of algorithms based on offline accuracy measurements clearly contradicts the results from the online study with the same set of users. Thus the external validity of the most commonly applied evaluation methodology is not guaranteed.