Open-vocabulary models for source code (extended abstract)

RM Karampatsis; Hlib Babii; Romain Pierre Julien Robbes; C Sutton; Andrea Alexander Janes

doi:10.1145/3377812.3390806


Abstract

Peer reviewed


ICSE '20: 42nd International Conference on Software Engineering, Companion Volume, Seoul, South Korea, 27 June - 19 July, 2020, pp.294-295

International Conference on Software Engineering

42nd International Conference on Software Engineering, ICSE-Companion 2020 (Seoul, online, 27/06/2020–19/07/2020)

01/10/2020

DOI: https://doi.org/10.1145/3377812.3390806

Handle: https://hdl.handle.net/10863/32653

Abstract

Naturalness of code

Byte-pair encoding

Neural language models

Statistical language modeling techniques have successfully been applied to large source code corpora, yielding a variety of new software development tools, such as tools for code suggestion, improving readability, and API migration. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. Both large vocabularies and out-of-vocabulary issues severely affect Neural Language Models (NLMs) of source code, degrading their performance and rendering them unable to scale. In this paper, we address this issue by: 1) studying how various modelling choices impact the resulting vocabulary on a large-scale corpus of 13,362 projects; 2) presenting an open vocabulary source code NLM that can scale to such a corpus, 100 times larger than in previous work, and outperforms the state of the art. To our knowledge, this is the largest NLM for code that has been reported.

Files and links (1)

url

https://dl.acm.org/doi/abs/10.1145/3377812.3390806

Details

Title: Open-vocabulary models for source code (extended abstract)
Creators: RM Karampatsis - University of Edinburgh
Hlib Babii - Free University of Bozen-Bolzano
Romain Pierre Julien Robbes
C Sutton - University of Edinburgh
Andrea Alexander Janes - Free University of Bozen-Bolzano
Publication Details: ICSE '20: 42nd International Conference on Software Engineering, Companion Volume, Seoul, South Korea, 27 June - 19 July, 2020, pp.294-295
Editor(s): Rothermel G, Bae D
ISBN: 978-1-72816-528-8
EISBN: 978-1-4503-7122-3
ISSN: 0270-5257
EISSN: 1558-1225
Conference: 42nd International Conference on Software Engineering, ICSE-Companion 2020 (Seoul, online, 27/06/2020–19/07/2020)
Series / Volume: International Conference on Software Engineering
Publisher: IEEE Computer Society
Piscataway, NJ
Number of pages: 2
Identifiers: 9781728165288
(UNIBZ)37175907
991006493898101241
Web of Science ID: 000637244600090
Scopus ID: 2-s2.0-85094110689
Academic Unit: Faculty of Computer Science
Language: English
Resource Type: Abstract
Author Names String: Karampatsis RM, Babii H, Robbes R, Sutton C, Janes A
Additional Description: Editors/Supervisors: Rothermel G, Bae D
