Infoscience
master thesis

Learning Representations of Source Code from Structure and Context

Bourgeois, Dylan  
March 15, 2019

Large codebases are routinely indexed by standard Information Retrieval (IR) systems, starting from the assumption that code written by humans exhibits statistical properties similar to those of natural-language text [Hindle et al., 2012]. While such IR systems remain relatively successful inside companies, where they help developers search proprietary codebases, the same cannot be said of most public platforms: over the years, many notable services (Google Code Search, Koders, Ohloh, etc.) have been shut down. Their limited functionality, combined with the low quality of their results, never attracted a critical mass of users large enough to justify running them. To date, even GitHub (arguably the largest code repository in the world) offers search functionality no more innovative than that of platforms from the past decade. We argue that this failure can be attributed to a fundamental limitation: mining information exclusively from the textual representation of code. A more powerful representation of code would not only enable a new generation of search systems, but would also allow us to explore code by functional similarity, i.e., to search for blocks of code that accomplish similar (rather than strictly equivalent) tasks. In this thesis, we explore the opportunities offered by a multimodal representation of code: (1) hierarchical (in terms of both object and package hierarchy), (2) syntactic (leveraging the Abstract Syntax Tree representation of code), (3) distributional (embeddings built from co-occurrences), and (4) textual (mining the code documentation). Our goal is to distill as much information as possible from the complex nature of code. Recent advances in deep learning provide a new set of techniques that we plan to employ for the different modes, for instance Poincaré embeddings [Nickel and Kiela, 2017] for the (1) hierarchical mode and Gated Graph Neural Networks [Li et al., 2016] for the (2) syntactic mode.
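As a minimal illustration of the syntactic modality (2), Python's standard `ast` module can parse a snippet into its Abstract Syntax Tree and flatten it into a sequence of node types; the helper below is a hypothetical sketch for exposition, not the pipeline built in the thesis.

```python
import ast

def ast_node_types(source: str) -> list[str]:
    """Parse Python source and return the AST node types in walk order.

    Illustrative helper (not from the thesis): sequences of syntax-node
    types like these are one simple structural view of source code.
    """
    tree = ast.parse(source)
    return [type(node).__name__ for node in ast.walk(tree)]

snippet = "def add(a, b):\n    return a + b\n"
print(ast_node_types(snippet))
# starts with 'Module' and includes e.g. 'FunctionDef' and 'BinOp'
```

Two snippets with different identifiers but the same structure yield the same node-type sequence, which hints at why syntactic views help capture functional rather than purely textual similarity.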
Last but not least, learning multimodal similarity [McFee and Lanckriet, 2011] is a further research challenge, especially at the scale of large codebases; we will explore the opportunities offered by a framework like GraphSAGE [Hamilton et al., 2017] to harmonize a large graph with rich feature information.
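To make the GraphSAGE idea concrete, the sketch below implements one mean-aggregation layer [Hamilton et al., 2017]: each node's new representation combines its own features with the mean of its neighbours' features, followed by a nonlinearity and l2 normalisation. The graph, weights, and function name are toy placeholders, not the thesis's actual model.

```python
import numpy as np

def sage_mean_layer(h, adj, W):
    """One GraphSAGE-style layer with mean aggregation (illustrative sketch).

    h:   (n, d) node feature matrix
    adj: adjacency as a list of neighbour-index lists
    W:   (2d, d_out) weight matrix applied to concat(self, mean(neighbours))
    """
    out = []
    for v, neigh in enumerate(adj):
        agg = h[neigh].mean(axis=0) if neigh else np.zeros(h.shape[1])
        z = np.concatenate([h[v], agg]) @ W   # concat(self, neighbourhood mean)
        out.append(np.maximum(z, 0.0))        # ReLU nonlinearity
    z = np.stack(out)
    norms = np.linalg.norm(z, axis=1, keepdims=True)
    return z / np.clip(norms, 1e-12, None)   # l2-normalise each row

# Toy graph: a 3-node path 0-1-2 with random 2-dim features.
rng = np.random.default_rng(0)
h = rng.normal(size=(3, 2))
W = rng.normal(size=(4, 2))
adj = [[1], [0, 2], [1]]
print(sage_mean_layer(h, adj, W).shape)  # (3, 2)
```

Because each layer only touches a node's local neighbourhood, representations can be computed by sampling rather than over the full graph, which is what makes the approach attractive at the scale of large codebases.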

Type
master thesis
Author(s)
Bourgeois, Dylan
Advisors
Catasta, Michele • Defferrard, Michaël • Leskovec, Jure • Vandergheynst, Pierre
Date Issued
2019-03-15
Subjects
Graph Neural Networks • Natural Language Processing • Representation Learning
Written at
EPFL
EPFL units
LTS2
Relation / URL/DOI
HasPart: https://github.com/dtsbourg/BiFocalE
HasPart: https://github.com/dtsbourg/codegraph-fmt
IsSupplementedBy: https://purl.stanford.edu/zj784yy0646
Available on Infoscience
April 23, 2020
Use this identifier to reference this record
https://infoscience.epfl.ch/handle/20.500.14299/168348