Abstract
Statistical language modeling techniques have been successfully applied to large source code corpora, yielding a variety of new software development tools, such as tools for code suggestion, improving readability, and API migration. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. Both large vocabularies and out-of-vocabulary issues severely affect Neural Language Models (NLMs) of source code, degrading their performance and rendering them unable to scale. In this paper, we address this issue by: 1) studying how various modeling choices impact the resulting vocabulary on a large-scale corpus of 13,362 projects; 2) presenting an open vocabulary source code NLM that can scale to such a corpus, 100 times larger than in previous work, and outperforms the state of the art. To our knowledge, this is the largest NLM for code that has been reported.