Abstract
Machine learning techniques have been increasingly applied to source code in recent years, leading to numerous advances in the field of Software Engineering (SE). The rich structure of source code has attracted significant research interest, yet there is still much to be understood about what these machine learning models learn about source code and how best to use this learned information to address a range of SE tasks. In this dissertation, we introduce JEMMA, an Extensible Java Dataset for Machine Learning for Source Code (ML4Code), which is aimed at lowering the barrier to entry in ML4Code. The dataset provides a large-scale, diverse, and high-quality collection of preprocessed information, including metadata, representations (e.g., source code tokens, ASTs, graphs), and properties (e.g., metrics, static analysis results) for multi-granular code entities in 50,000 Java projects. The dataset is also extensible: using the JEMMA Workbench tools, users can add new properties and representations for evaluation. Results from two empirical studies demonstrate the utility of the dataset and highlight the need for context-aware source code models that can reason over a broader network of source code entities in software. Additionally, we present GLUECODE, a benchmark of diverse tasks for evaluating machine learning models of source code, which acknowledges that code is composed of multiple interacting entities and requires models to leverage both local and global reasoning. We trained several baselines on the benchmark. The results show that current models do not achieve convincing performance across all tasks; however, they provide initial evidence that appending global context to the training input improves performance on tasks that require global reasoning. They also show that there is ample room for improvement in building robust source code models that incorporate both local and global reasoning for a range of tasks.
Finally, our experiments on Codex show that large-scale source code models suffer from memorization issues, which motivates us to develop a systematic framework for evaluating the code characteristics encoded in the pre-trained layers of a model. We adopt the probing paradigm and define probing tasks to evaluate several pre-trained source code models, to discover to what extent they learn specific aspects of source code, such as syntax, semantics, and code structure. Results indicate that models that incorporate structural information represent source code characteristics better, but there are still opportunities to improve source-code-specific pre-training on the respective code characteristics. We encourage other researchers to use the probing task suite we provide to evaluate their models and determine the intrinsic code characteristics they encode. Our results demonstrate the importance of considering both local and global reasoning in source code models, as well as the utility of extensible datasets and benchmarks in advancing the field. First, our work provides foundational data and infrastructure for future research in ML4Code; second, it defines and releases a set of SE task datasets that constitute an extrinsic benchmark for source code models; and third, it introduces the probing paradigm to ML4Code, with probing tasks used to evaluate the internal pre-trained embeddings of large-scale models. We release a Workbench tool suite and a probing framework, and encourage others to use them to prepare datasets, experiment with representations, train models, and run their own extrinsic and intrinsic evaluations. Ultimately, our goal is to continue to push the boundary of what is possible with machine learning and source code, leading to more powerful and effective tools for software engineers.