Predicting program properties from 'big code'

Authors:
Veselin Raychev, ETH Zurich, Zurich, Switzerland
Martin Vechev, ETH Zurich, Zurich, Switzerland
Andreas Krause, ETH Zurich, Zurich, Switzerland

Communications of the ACM, Volume 62, Issue 3, March 2019, pp. 99–107. https://doi.org/10.1145/3306204

Published: 21 February 2019


Abstract

We present a new approach for predicting program properties from large codebases (aka "Big Code"). Our approach learns a probabilistic model from "Big Code" and uses this model to predict properties of new, unseen programs.

The key idea of our work is to transform the program into a representation that allows us to formulate the problem of inferring program properties as structured prediction in machine learning. This enables us to leverage powerful probabilistic models such as Conditional Random Fields (CRFs) and perform joint prediction of program properties.

As an example of our approach, we built a scalable prediction engine called JSNice for solving two kinds of tasks in the context of JavaScript: predicting (syntactic) names of identifiers and predicting (semantic) type annotations of variables. Experimentally, JSNice predicts correct names for 63% of name identifiers, and its type annotation predictions are correct in 81% of cases. Since its public release at http://jsnice.org, JSNice has become a popular system with hundreds of thousands of uses.

By formulating the problem of inferring program properties as structured prediction, our work opens up the possibility for a range of new "Big Code" applications such as de-obfuscators, decompilers, invariant generators, and others.
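The joint prediction described in the abstract can be illustrated with a toy example. The sketch below is not the JSNice implementation: the dependency graph, the relation names, the candidate labels, and the weights are all hypothetical, and the exhaustive search stands in for whatever inference procedure a real system would use. It only shows the general shape of the idea: unknown identifiers are nodes, relationships between program elements are edges, and the predicted names are the joint assignment that maximizes a sum of pairwise feature weights.

```python
from itertools import product

# Hypothetical "learned" weights for pairwise features of the form
# (label_a, relation, label_b). In a real system these would be
# estimated from a large corpus; here they are made up for illustration.
WEIGHTS = {
    ("i", "index_of", "arr"): 2.0,
    ("i", "less_than", "len"): 1.5,
    ("arr", "has_property", "length"): 1.0,
    ("len", "assigned_from", "length"): 2.5,
    ("j", "index_of", "arr"): 0.5,
    ("n", "assigned_from", "length"): 1.0,
}

# Edges of a toy dependency graph: (node_a, relation, node_b).
# "a", "b", "c" are minified identifiers with unknown names;
# "length" is a known property name and is never relabeled.
EDGES = [
    ("a", "index_of", "b"),
    ("a", "less_than", "c"),
    ("b", "has_property", "length"),
    ("c", "assigned_from", "length"),
]

# Hypothetical candidate names for each unknown node.
CANDIDATES = {
    "a": ["i", "j"],
    "b": ["arr", "obj"],
    "c": ["len", "n"],
}


def score(assignment):
    """Sum the weights of all pairwise features that fire under an assignment."""
    total = 0.0
    for node_a, relation, node_b in EDGES:
        # Known nodes are absent from the assignment and keep their own name.
        label_a = assignment.get(node_a, node_a)
        label_b = assignment.get(node_b, node_b)
        total += WEIGHTS.get((label_a, relation, label_b), 0.0)
    return total


def map_inference(candidates):
    """Return the highest-scoring joint assignment by exhaustive search.

    Exhaustive search is only feasible for toy inputs; it is used here
    purely to make the notion of joint prediction concrete.
    """
    nodes = sorted(candidates)
    best, best_score = None, float("-inf")
    for labels in product(*(candidates[node] for node in nodes)):
        assignment = dict(zip(nodes, labels))
        current = score(assignment)
        if current > best_score:
            best, best_score = assignment, current
    return best, best_score


best, best_score = map_inference(CANDIDATES)
print(best, best_score)
```

Because the score is computed over the whole assignment, the choice of a name for one identifier influences the best choice for the others, which is what distinguishes joint prediction from labeling each identifier independently.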


Index Terms

1. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Source code generation


Published in
Communications of the ACM Volume 62, Issue 3
March 2019
109 pages
ISSN:0001-0782
EISSN:1557-7317
DOI:10.1145/3314328
Editor:
Andrew A. Chien
Association for Computing Machinery, New York, NY
Copyright © 2019 ACM
Qualifiers
- research-article
- Research
- Refereed