What is CodeBERT:

  • CodeBERT is an extension of the BERT model, developed by Microsoft in 2020.
  • It is a bimodal pre-trained model for programming language (PL) and natural language (NL) that can perform several downstream NL-PL tasks (listed below).
  • The model was trained on NL-PL pairs in six programming languages (Python, Java, JavaScript, PHP, Ruby, Go).

Here, I will be discussing the paper published by Microsoft (Feng et al.): CodeBERT: A Pre-Trained Model for Programming and Natural Languages.

Note: I’ll give only a brief overview here. For more detail, including the mathematics behind the model and the detailed architecture, please refer to the original paper.

Before diving into the paper, let us discuss some of the use cases and downstream tasks CodeBERT can support. It is interesting to note that some of these use cases are being implemented in Microsoft tools such as Visual Studio IntelliCode.

Use Cases of CodeBERT:

  1. Code to code: can be used for code completion or code translation. For example, when a developer wants to write Java code for which Python code already exists, code-to-code translation can help translate that code block.
  2. Code to text: can aid developers with code summarization. For example, when a developer looks at an unfamiliar piece of code, CodeAI models can translate the code into natural language and summarize it.
  3. Text to code: can provide a code-search-like feature, helping a user retrieve relevant code based on a natural language query.
  4. Text to text: can help translate code-domain text into different languages.
use cases summarized

Background of BERT:

BERT (Bidirectional Encoder Representations from Transformers) is a self-supervised model proposed in 2018 by Google.

BERT model architecture

BERT Architecture

  • BERT in essence is a stack of Transformer encoder layers (Vaswani et al., 2017) that consist of multiple self-attention “heads”.
  • Consider the statement “I like you” versus “Huh! As if I like you”. A plain token embedding gives “like” the same vector in both. BERT adds a positional embedding and a segment embedding to each token embedding, and the self-attention layers then make the final representation of “like” depend on its context.
  • For every input token in a sequence, each head computes key, value, and query vectors, used to create a weighted representation/embedding.
  • The outputs of all heads in the same layer are combined and run through a fully connected layer. Each layer is wrapped with a skip connection and followed by layer normalization.
  • The conventional workflow for BERT consists of two stages: pre-training and fine-tuning.
  • Pre-training uses two self-supervised tasks: masked language modelling (MLM, predicting randomly masked input tokens) and next sentence prediction (NSP, predicting whether two input sentences are adjacent to each other).
  • Fine-tuning adapts the model to downstream applications; one or more fully connected layers are typically added on top of the final encoder layer.
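To make the key, value, and query step above concrete, here is a minimal sketch of a single self-attention head in NumPy. The dimensions, the random weights, and the function names are my own assumptions for illustration; a real BERT layer has many heads, learned weights, an output projection, skip connections, and layer normalization around this.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_q, W_k, W_v):
    """One self-attention head over token embeddings X of shape (seq_len, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v   # query, key, value vectors per token
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)    # each row is a distribution over tokens
    return weights @ V, weights           # weighted mix of value vectors

# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 2
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out, weights = attention_head(X, W_q, W_k, W_v)
print(out.shape)             # one d_head-sized vector per input token
print(weights.sum(axis=-1))  # attention weights in each row sum to 1
```

Each output row is a context-dependent mixture of all value vectors, which is why the same word can end up with different representations in different sentences.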

CodeBERT Architecture:

  • BERT is easily extendable to multiple modalities, i.e. training on different types of data.
  • CodeBERT is a bimodal extension of BERT, i.e. it takes both natural language and source code as input (unlike traditional BERT and RoBERTa, which focus primarily on natural language).
CodeBERT architecture, with bimodal inputs

Bimodal NL-PL pairs:

  • The typical input on which CodeBERT is trained is a combination of code and well-defined text comments (image taken from the paper).
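As I read the paper, the two segments are joined into a single sequence with special tokens: [CLS], the NL tokens, [SEP], the PL tokens, and a closing [EOS]. The sketch below only illustrates that layout. The helper name and the whitespace splitting are my own stand-ins; the real model uses a subword tokenizer.

```python
def build_bimodal_input(nl_text, code_text):
    """Join an NL segment and a PL segment into one token sequence."""
    # Whitespace splitting is a placeholder for the real subword tokenizer.
    nl_tokens = nl_text.split()
    pl_tokens = code_text.split()
    return ["[CLS]"] + nl_tokens + ["[SEP]"] + pl_tokens + ["[EOS]"]

tokens = build_bimodal_input(
    "return the larger of two numbers",
    "def max2(a, b): return a if a > b else b",
)
print(tokens)
```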

CodeBERT is pre-trained with two objectives: masked language modelling (MLM) and replaced token detection (RTD).

Training CodeBERT with Masked Language Modelling: a random set of positions in both the NL and PL sequences is selected, and the tokens at those positions are replaced with a special [MASK] token. The MLM objective is to predict the original tokens that were masked out.
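A rough sketch of how MLM training examples could be prepared is shown below. The 15% masking rate follows the usual BERT setup and is an assumption here, as are the function name and the choice to never mask the special tokens. This is not the authors' preprocessing code.

```python
import random

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]", "[EOS]"}

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels).

    labels[i] holds the original token at masked positions and None elsewhere.
    """
    rng = random.Random(seed)
    # Only ordinary NL and PL tokens are candidates for masking.
    candidates = [i for i, t in enumerate(tokens) if t not in SPECIAL]
    n_mask = max(1, round(len(candidates) * mask_prob))
    chosen = set(rng.sample(candidates, n_mask))
    masked = [MASK if i in chosen else t for i, t in enumerate(tokens)]
    labels = [t if i in chosen else None for i, t in enumerate(tokens)]
    return masked, labels

tokens = ["[CLS]", "return", "the", "maximum", "[SEP]",
          "def", "max2", "(", "a", ",", "b", ")", ":", "[EOS]"]
masked, labels = mask_tokens(tokens)
print(masked)
print(labels)
```

The model sees `masked` as input and is trained to recover the non-None entries of `labels`.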

Training CodeBERT with Replaced Token Detection: here, a few tokens in the original NL and PL sequences are randomly masked out. A generator, which is an n-gram-like probabilistic model, is trained to propose a plausible replacement for each masked position. A discriminator is then trained to determine whether each token is the original one or not, which is a binary classification problem.
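The data side of RTD can be sketched as follows. Note the loud assumption: I use a simple unigram frequency sampler as the generator, which is a much cruder stand-in for the n-gram-like generators described above, and the function name and replacement rate are hypothetical. The point is only to show how the binary labels for the discriminator arise.

```python
import random
from collections import Counter

def corrupt_and_label(tokens, corpus, replace_prob=0.15, seed=0):
    """Replace some tokens with generator samples; label 1 = original, 0 = replaced."""
    rng = random.Random(seed)
    # Stand-in generator: sample tokens in proportion to their corpus frequency.
    counts = Counter(corpus)
    vocab, freqs = zip(*counts.items())
    n_replace = max(1, round(len(tokens) * replace_prob))
    positions = set(rng.sample(range(len(tokens)), n_replace))

    corrupted, labels = [], []
    for i, tok in enumerate(tokens):
        new_tok = rng.choices(vocab, weights=freqs, k=1)[0] if i in positions else tok
        corrupted.append(new_tok)
        # If the generator happens to produce the original token, it counts as original.
        labels.append(1 if new_tok == tok else 0)
    return corrupted, labels

tokens = "def max2 ( a , b ) : return a if a > b else b".split()
corpus = tokens + "def min2 ( x , y ) : return x if x < y else y".split()
corrupted, labels = corrupt_and_label(tokens, corpus)
print(corrupted)
print(labels)
```

The discriminator is then trained on `corrupted` to predict `labels` at every position, so it gets a learning signal from all tokens rather than only the masked ones.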

Training details of CodeBERT

  • 125M parameters, 12 layers
  • Takes 250 hours of training on an NVIDIA DGX-2 with FP16


  • The results summary shows that CodeBERT outperforms its predecessors.
  • It is interesting to compare CodeBERT initialized with pre-trained representations from a RoBERTa model (this RoBERTa model has been trained on code from CodeSearchNet) against CodeBERT trained from scratch: initializing with RoBERTa performs better.

How to use CodeBERT:

CodeBERT is available through Hugging Face (repository here):
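A minimal sketch of loading the model with the Hugging Face `transformers` library is given below. It assumes `transformers` and `torch` are installed and that the checkpoint is published under the name `microsoft/codebert-base`. The helper `embed_pair` is my own wrapper, not part of the library, and taking the first-token vector as a pair embedding is one common choice rather than the only one.

```python
MODEL_NAME = "microsoft/codebert-base"  # assumed checkpoint name on the Hugging Face Hub

def embed_pair(nl_text, code_text, model_name=MODEL_NAME):
    """Return the first-token embedding for an NL-PL pair as a (1, hidden_size) tensor."""
    # Imported inside the function so this file loads even without the heavy dependencies.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    # Passing two texts makes the tokenizer encode them as one paired sequence.
    inputs = tokenizer(nl_text, code_text, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Hidden state at position 0 (the sequence-start token).
    return outputs.last_hidden_state[:, 0, :]
```

Calling `embed_pair("return maximum value", "def max(a, b): return a if a > b else b")` downloads the weights on first use and returns one vector for the pair, which can then feed a downstream classifier or a similarity search.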
