Abstract
We present a new approach for predicting program properties from large codebases (aka "Big Code"). Our approach learns a probabilistic model from "Big Code" and uses this model to predict properties of new, unseen programs.
The key idea of our work is to transform the program into a representation that allows us to formulate the problem of inferring program properties as structured prediction in machine learning. This enables us to leverage powerful probabilistic models such as Conditional Random Fields (CRFs) and perform joint prediction of program properties.
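To make the idea of joint prediction concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the JSNice implementation: the node names, relation names, candidate labels, and weights are all hypothetical, and the exhaustive search stands in for whatever inference procedure a scalable engine would use. The program is represented as a graph whose nodes are program elements (some with known properties, some unknown) and whose edges are relations between them. The score of a full assignment is the sum of pairwise feature weights, as in a CRF, and prediction picks the assignment with the highest score (MAP inference).

```python
from itertools import product

# Hypothetical pairwise feature weights: (label_a, relation, label_b) -> weight.
# In a learned model these would come from training on a large corpus;
# the values below are made up for illustration.
WEIGHTS = {
    ("i", "index_of", "items"): 0.9,
    ("i", "less_than", "len"): 0.7,
    ("len", "length_of", "items"): 0.8,
    ("a", "index_of", "items"): 0.1,
    ("b", "length_of", "items"): 0.1,
}

# Hypothetical candidate labels for every unknown node.
CANDIDATES = ["i", "len", "a", "b"]


def score(assignment, edges):
    """Sum the weights of all edges under a full node -> label assignment."""
    total = 0.0
    for src, rel, dst in edges:
        total += WEIGHTS.get((assignment[src], rel, assignment[dst]), 0.0)
    return total


def map_inference(unknown, known, edges):
    """Exhaustive MAP inference: score every joint labeling of the unknown nodes.

    This is exponential in the number of unknown nodes and is only meant
    to show what "joint" prediction means on a tiny graph.
    """
    best, best_score = None, float("-inf")
    for labels in product(CANDIDATES, repeat=len(unknown)):
        assignment = dict(known)
        assignment.update(zip(unknown, labels))
        s = score(assignment, edges)
        if s > best_score:
            best, best_score = assignment, s
    return best, best_score


# Dependency graph for a tiny program: "x" and "y" have unknown (e.g. minified)
# names, while "arr" has the known name "items".
edges = [
    ("x", "index_of", "arr"),
    ("x", "less_than", "y"),
    ("y", "length_of", "arr"),
]
known = {"arr": "items"}

prediction, total = map_inference(["x", "y"], known, edges)
print(prediction, round(total, 2))
```

The labels for `x` and `y` are chosen together: the edge between them contributes to the score only when both labels fit, which is something independent per-node prediction could not exploit.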
As an example of our approach, we built a scalable prediction engine called JSNice for solving two kinds of tasks in the context of JavaScript: predicting (syntactic) names of identifiers and predicting (semantic) type annotations of variables. Experimentally, JSNice predicts correct names for 63% of name identifiers and its type annotation predictions are correct in 81% of cases. Since its public release at http://jsnice.org, JSNice has become a popular system with hundreds of thousands of uses.
By formulating the problem of inferring program properties as structured prediction, our work opens up the possibility for a range of new "Big Code" applications such as de-obfuscators, decompilers, invariant generators, and others.