Welcome to 🧬 BEND’s documentation!

BEND is a benchmark collection for evaluating the performance of DNA language models (LMs). The BEND codebase serves three purposes:

  • Providing a unified interface for computing embeddings from pretrained DNA LMs.

  • Extracting sequences from reference genomes using coordinates listed in BED files, and computing embeddings for these sequences so that they can be used to train and evaluate models.

  • Training lightweight supervised CNN models that use DNA LM embeddings as input, and evaluating their performance on a variety of tasks.
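
The second purpose, pulling sequences out of a reference genome by BED coordinates, can be sketched in plain Python. This is a minimal illustration, not BEND's implementation: the `extract_sequence` helper and the in-memory `genome` dict are hypothetical, and the sketch assumes the standard BED convention of 0-based, half-open intervals with the strand in the optional sixth column.

```python
# Hypothetical helper for illustration; not part of the BEND API.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def extract_sequence(genome, bed_line):
    """Return the sequence for one BED record.

    Assumes BED's 0-based, half-open coordinates: [start, end).
    `genome` maps chromosome names to sequence strings.
    """
    fields = bed_line.rstrip("\n").split("\t")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    # Strand is the optional sixth BED column; default to the forward strand.
    strand = fields[5] if len(fields) > 5 else "+"
    seq = genome[chrom][start:end]
    if strand == "-":
        # Reverse-complement features annotated on the minus strand.
        seq = seq.translate(COMPLEMENT)[::-1]
    return seq


# Toy genome standing in for a real reference FASTA.
genome = {"chr1": "ACGTACGTAA"}
forward = extract_sequence(genome, "chr1\t2\t6")
reverse = extract_sequence(genome, "chr1\t0\t4\tfeat\t0\t-")
```

In practice the genome would be read from a FASTA file rather than held in a dict; the coordinate handling stays the same.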

The documentation covers the BEND codebase and includes instructions on how to extend it to new LMs and tasks. For a tutorial on how to run BEND on existing tasks, please refer to the README file on GitHub.


This module contains the implementations of the supervised models used in the paper.

  • ConvNetForSupervised: a ResNet trained on one-hot encodings, used as the baseline model when no dedicated baseline architecture is available for a task.

  • CNN: a two-layer CNN used for all downstream tasks.
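
The baseline takes one-hot encodings rather than LM embeddings as input. The sketch below shows what such an encoding might look like. The A, C, G, T channel order and the all-zero row for any other character (such as N) are assumptions made for illustration; this page does not specify the convention BEND uses.

```python
# Illustrative only: channel order and handling of N are assumptions.
ALPHABET = "ACGT"
INDEX = {base: i for i, base in enumerate(ALPHABET)}


def one_hot(sequence):
    """Encode a DNA string as a list of length-4 rows, one per nucleotide."""
    rows = []
    for base in sequence.upper():
        row = [0.0] * len(ALPHABET)
        if base in INDEX:  # unknown bases such as N stay all-zero
            row[INDEX[base]] = 1.0
        rows.append(row)
    return rows


encoded = one_hot("ACGN")
```

Each sequence of length L becomes an L x 4 matrix, which plays the same role for the baseline that an L x D embedding matrix plays for the downstream CNN.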


This module contains a collection of utilities used throughout the project for data processing, model training, and evaluation.

  • Annotation: a class for retrieving sequences from a reference genome based on the coordinates in a BED file.

  • TaskTrainer: a class for training a model on a given task.


I/O module for reading and writing data. This module provides utilities for converting the genome coordinate-based sequence data in BED files into embeddings, and for saving embeddings to and loading them from disk in tar format.
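
To make the idea of storing embeddings in a tar archive concrete, here is a minimal sketch using only the Python standard library. The function names and the choice to store each embedding as one JSON member are assumptions for illustration; the actual on-disk layout used by BEND is not described on this page and may differ.

```python
import io
import json
import os
import tarfile
import tempfile


def save_embeddings(path, embeddings):
    """Write each embedding (a list of float rows) as one JSON member of a tar file."""
    with tarfile.open(path, "w") as tar:
        for name, matrix in embeddings.items():
            payload = json.dumps(matrix).encode("utf-8")
            info = tarfile.TarInfo(name=name + ".json")
            info.size = len(payload)  # tar needs the member size up front
            tar.addfile(info, io.BytesIO(payload))


def load_embeddings(path):
    """Read the archive back into a dict keyed by sample name."""
    loaded = {}
    with tarfile.open(path, "r") as tar:
        for member in tar.getmembers():
            payload = tar.extractfile(member).read()
            key = member.name[: -len(".json")]
            loaded[key] = json.loads(payload)
    return loaded


# Round trip through a temporary archive.
with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, "embeddings.tar")
    original = {"sample_0": [[0.1, 0.2], [0.3, 0.4]]}
    save_embeddings(archive, original)
    restored = load_embeddings(archive)
```

Packing many small samples into one archive keeps the number of files on disk low and lets a reader stream samples sequentially, which is a common reason for choosing tar for large embedding sets.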
