Some Code Generation papers

Code generation is not new, but it has seen significant advances lately. Coding assistants are a recent addition, although we could say that Microsoft's IntelliSense and JetBrains IDEs were their remote cousins. In any case, machine learning, and more recently large language models, are advancing the field a lot.

This is one of the topics where I'm quite sure more and more changes are going to keep coming for a while. Although I think that the "no-code" concept is quite far away [1], when and where I can use code assist tools, I can already reap some benefits:

  • Easy switching between languages: Less searching, e.g. "how do you concatenate two arrays in language X? Was it .append()? Was it just +? Did it mutate the target, or return a new array?" [2]
  • Speed boost for auto-completions: Quicker imports, quicker autocompletes, quicker docstrings...
  • Easy task bootstrapping: This is still in the early days, but with some prompt engineering and a clear description of your intent, mostly acting as if you were writing a technical specification, it also works... as long as the logic/method is not too complex. e.g. GitHub Copilot is not bad at filling in basic content for you [3]
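The first bullet's confusion is real even within a single language. A quick Python illustration of how "concatenate two arrays" can mean three different things:

```python
# The same "concatenate" idea has different semantics, even in Python.
a = [1, 2]
b = [3, 4]

c = a + b    # returns a NEW list; a is untouched
a.extend(b)  # mutates a in place, returns None
# a.append(b) would instead nest b as a single element: [1, 2, [3, 4]]
```

Exactly the kind of detail a code assistant saves you from re-checking on every language switch.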

But code generation is still not perfect, and depends a lot on the sources. Also, as many LLMs handle multiple languages, there's a tendency to feed them training content split per programming language. This is not bad per se, but depending on the corpus of the training datasets, the model will perform better or worse on different programming languages. Other approaches leverage ASTs (Abstract Syntax Trees), or even analyze the output machine code to seek better solutions.
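To make the AST idea concrete, here is a minimal sketch using Python's standard ast module, which several of the papers below build on in much more sophisticated ways:

```python
import ast

source = "def greet(name):\n    return 'hello ' + name\n"
tree = ast.parse(source)

# Walk the tree and collect node types; AST-based tools reason over
# this structure instead of (or in addition to) the raw token sequence.
node_types = [type(node).__name__ for node in ast.walk(tree)]
print(node_types)
```

A model that sees `FunctionDef`, `Return`, `BinOp`, etc. gets the program's structure for free, instead of having to infer it from a flat character stream.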

Anyway, it is a topic that I find fascinating and I'm learning more about. I recently found that Meta has an in-house coding assistant, and researching about it got me into reading a few papers on the topic. I wanted to keep track of those sources, so why not also sharing them here?

These are the papers and blog posts I read, alongside a few notes:

  • InCoder - A Generative Model for Code Infilling and Synthesis (Meta)
    • works with spans (composed of tokens)
    • left-to-right generation vs infilling (this model combines both)
    • works directly with certain languages
    • runs ASTs for docstring & type hint generation (in Python at least)
    • optimization: when tokenizing, allow tokens to include whitespace (but not newlines), so import a from b becomes a single token
    • [in general] can train models to infill single tokens at a time, or token regions (e.g. 10 tokens)
  • CodeCompose: A Large-Scale Industrial Deployment of AI-assisted Code Authoring (Meta)
    • not too deep technically, mostly information regarding developer feedback
    • good source of related projects (MS Pythia, GH Copilot, ...)
  • ML-Enhanced Code Completion Improves Developer Productivity (Google)
    • lower requirements than Meta models (only 0.5B params)
    • uses cached ASTs to enable a "full" structural understanding
  • Structural Language Models of Code
    • most important points:
      • leverages the strict syntax of programming languages to model a code snippet as a tree-structural language modeling (SLM)
      • SLM estimates the probability of the program’s AST by decomposing it into a product of conditional probabilities over its nodes
      • neural model that computes these conditional probabilities by considering all AST paths leading to a target node
      • while prior work uses AST paths to read programs, this one generates code by predicting the next node along the set of paths, generating the target AST node-by-node
      • the intuition behind this idea is that a language model could generalize better by modeling the tree rather than the sequential form of the program. Further, learning from the AST allows a model to save learning capacity, instead of having to re-learn known syntactic patterns from the text
    • interesting tree generation logic, to control length (apparently) very easily
    • really small model: single GPU, 15M params
    • really nice demo: http://AnyCodeGen.org
  • Pythia: AI-assisted Code Completion System (Microsoft)
    • also uses partial file-level ASTs
    • used in VS Code (IntelliCode extension) for code completion (triggered by . or =)
    • each training sample is an AST serialized into sequence terminated with the . end-of-sequence character
    • able to generalize understanding of import aliases (as being the same as the un-aliased version)
    • normalizes variable names to reduce vocabulary size
    • tiny 38MB model after quantization! and yet top-5 accuracy is 89% (92% before quantization)
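A rough sketch of the infilling idea from the InCoder notes above (illustrative only; the real model uses learned sentinel tokens and a trained LM, this just shows the data-shaping trick that lets a left-to-right model fill in the middle):

```python
def make_infill_example(code: str, start: int, end: int):
    """Mask the span [start:end) and move it to the end, so a
    left-to-right LM can be trained to 'fill in the middle'.
    <MASK:0>, <INFILL> and <EOM> are stand-ins for sentinel tokens."""
    prefix, span, suffix = code[:start], code[start:end], code[end:]
    model_input = prefix + "<MASK:0>" + suffix + "<INFILL>"
    model_target = span + "<EOM>"  # what the LM learns to generate
    return model_input, model_target

# Mask the function body of a tiny snippet:
inp, tgt = make_infill_example("def add(a, b):\n    return a + b\n", 19, 31)
```

Training on both this rearranged form and the plain left-to-right form is how one model can serve both generation and infilling.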
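The SLM decomposition mentioned above boils down to a chain rule over tree nodes: the probability of the whole AST is the product of per-node conditional probabilities. A toy numeric illustration (the probabilities here are made up; a real SLM computes each one from the AST paths leading to the node):

```python
import math

# P(node_i | previously generated nodes), one entry per AST node
node_cond_probs = [0.9, 0.8, 0.95, 0.7]

# Work in log space, as language models do, then exponentiate
log_p_tree = sum(math.log(p) for p in node_cond_probs)
p_tree = math.exp(log_p_tree)  # == 0.9 * 0.8 * 0.95 * 0.7
```

Generating node-by-node along tree paths, rather than token-by-token along text, is what lets the model skip re-learning the language's surface syntax.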
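And a hypothetical minimal take on the variable-name normalization trick from the Pythia notes, using Python's ast module (the real system's scheme is surely more careful about scopes and builtins):

```python
import ast

class NormalizeNames(ast.NodeTransformer):
    """Rename variables to var0, var1, ... in order of appearance,
    shrinking the vocabulary the model has to learn."""
    def __init__(self):
        self.mapping = {}

    def visit_Name(self, node):
        if node.id not in self.mapping:
            self.mapping[node.id] = f"var{len(self.mapping)}"
        node.id = self.mapping[node.id]
        return node

tree = ast.parse("total = price * quantity")
normalized = ast.unparse(NormalizeNames().visit(tree))
print(normalized)  # total/price/quantity become var0/var1/var2
```

After normalization, `total = price * quantity` and `sum = cost * units` look identical to the model, which is exactly the point.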

Other very interesting models, added later in post updates:

  • Code Llama - Open Foundation Models for Code (+ blog post) (Meta)
    • Interesting approach of using Llama 2 as the base, then training and fine-tuning
    • Variants, more suited for certain scenarios, and with 3 "sizes" (7B, 13B, 34B)
    • Up to 100k tokens context 😮
  • StarCoder - may the source be with you! (+ blog post) (Hugging Face)
    • Really nice and detailed explanation of everything. e.g. the training data setup includes in-depth details of the data preparation & curation steps, and precise numbers of consumed compute resources
    • Base training + fine-tuning
    • Everything available, under the Open Responsible AI Model license
    • Thanks to this paper, I learned how Code LLMs can be good at math reasoning (TL;DR: They convert the reasoning problem into a program, and then execute it)
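The math-reasoning trick from the last bullet works roughly like this: instead of asking the model for the numeric answer, you prompt it to emit a program, and the host executes it. A hand-written stand-in for what a code LLM might generate:

```python
# Hypothetical example: for "Ana has 3 boxes of 12 apples and gives
# away 7; how many are left?", the model emits a program rather than
# an answer, and the host runs it.
generated_code = """
boxes = 3
apples_per_box = 12
given_away = 7
answer = boxes * apples_per_box - given_away
"""

namespace = {}
exec(generated_code, namespace)  # run the model's program
print(namespace["answer"])  # prints 29
```

The arithmetic is delegated to the interpreter, which never miscalculates, so the model only has to get the program structure right.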


Footnotes:

[1]: If you ignore the hype and research a bit, or already work in software development, you will know that there is a big leap from building a hello world or on-rails demo to building any real software project. Anybody can bootstrap with create-react-app, but building even a tiny-scale Shopify or Amazon is going to take a lot more effort.

[2]: Search engines are going to change forever very soon (Bing Chat being the earliest example), probably once we solve how to make LLMs safer and less prone to hallucination. Fine-tuned chatbots and assistants are both more natural to use and faster to ask questions of. Instead of going to a search box and thinking about how to formulate your question in generic terms, you just dump it into a chat and it understands you quite well.

[3]: I no longer write file I/O manipulation statements, because it perfectly auto-completes them, and 95% of the time also guesses my desired import statements. Same with capturing any intent of sanitizing input data, sorting arrays, or manipulating dictionaries/arrays/maps (as long as you have laid out the desired schema/structure beforehand). But you'll have lots of "fun" if you let it extend a non-trivial regular expression, or do byte- and bit-level operations (even simple ones cause it to generate absurd but compiling code). This is where you most feel that it still relies way too much on prediction and not enough on real analysis.


UPDATE:

  • Added Code Llama, from Meta, and StarCoder, from Hugging Face