Welcome to 🧬 BEND’s documentation!

BEND is a benchmark collection for evaluating the performance of DNA language models (LMs). The BEND codebase serves three purposes:

  • Providing a unified interface for computing embeddings from pretrained DNA LMs (see the sketch after this list).

  • Extracting sequences from reference genomes using coordinates listed in bed files, and computing embeddings for these sequences for training and evaluating models.

  • Training lightweight supervised CNN models that use DNA LM embeddings as input, and evaluating their performance on a variety of tasks.
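
To make the first of these points concrete, the snippet below sketches how computing embeddings through the unified interface might look. It is a minimal sketch: the import path, embedder class, embed method, and checkpoint identifier are assumptions for illustration, not BEND's documented API; the README and API reference list the embedders that actually ship with the codebase.

    # Hypothetical sketch only: the import path, class name, `embed` method and
    # checkpoint identifier below are assumptions, not BEND's documented API.
    # The general pattern: instantiate one embedder per DNA LM, then map raw
    # nucleotide strings to per-position embedding arrays.
    from bend.utils.embedders import NucleotideTransformerEmbedder  # assumed

    embedder = NucleotideTransformerEmbedder("InstaDeepAI/nucleotide-transformer-500m-human-ref")
    embeddings = embedder.embed(["ACGTACGTACGT", "TTGACCAGGT"])
    print(embeddings[0].shape)  # roughly (sequence_length, hidden_dim)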

The documentation covers the BEND codebase and includes instructions on how to extend it to new LMs and tasks. For a tutorial on how to run BEND on existing tasks, please refer to the README file on GitHub.

bend.models

This module contains the implementations of the supervised models used in the paper.

  • ConvNetForSupervised: a ResNet that we train as a baseline model on one-hot encodings when no dedicated baseline architecture is available for a task.

  • CNN: a two-layer CNN used for all downstream tasks.
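
The CNN itself is defined in bend.models; the block below is only a generic approximation of a lightweight two-layer 1D CNN that maps per-position LM embeddings to per-position labels, shown to illustrate the scale of supervised model that BEND trains on top of frozen embeddings. The hidden size and kernel width are illustrative choices, not BEND's settings.

    import torch
    import torch.nn as nn

    # Generic two-layer 1D CNN over per-position embeddings (illustrative only,
    # not BEND's exact implementation). Input: (batch, length, embed_dim),
    # output: per-position class logits of shape (batch, length, num_classes).
    class TwoLayerCNN(nn.Module):
        def __init__(self, embed_dim: int, num_classes: int, hidden: int = 64, kernel: int = 3):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(embed_dim, hidden, kernel, padding=kernel // 2),
                nn.ReLU(),
                nn.Conv1d(hidden, num_classes, kernel, padding=kernel // 2),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Conv1d expects (batch, channels, length), so transpose in and out.
            return self.net(x.transpose(1, 2)).transpose(1, 2)

    logits = TwoLayerCNN(embed_dim=768, num_classes=9)(torch.randn(2, 128, 768))
    print(logits.shape)  # torch.Size([2, 128, 9])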

bend.utils

This module contains a collection of utilities used throughout the project for data processing, model training, and evaluation.

  • Annotation: a class for retrieving sequences from a reference genome based on the coordinates in a bed file (the operation it wraps is sketched after this list).

  • TaskTrainer: a class for training a model on a given task.
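
To illustrate what the Annotation class takes care of, the snippet below shows the underlying operation using pyfaidx and plain file parsing rather than BEND's own API; the file paths are placeholders. Bed coordinates are 0-based and half-open, so they map directly onto Python slicing.

    from pyfaidx import Fasta

    # Illustration of the operation Annotation wraps (not BEND code): read genomic
    # intervals from a bed file and fetch the corresponding sequences from an
    # indexed reference FASTA. File paths are placeholders.
    genome = Fasta("GRCh38.fa")
    sequences = []
    with open("regions.bed") as bed:
        for line in bed:
            chrom, start, end = line.rstrip("\n").split("\t")[:3]
            sequences.append(genome[chrom][int(start):int(end)].seq)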

bend.io

I/O module for reading and writing data. This module provides utilities for turning genome coordinates in bed files into sequence embeddings, and for saving and loading embedding data to and from disk in tar format.
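
The exact on-disk layout is handled by bend.io; the snippet below only illustrates the general idea of tar-based embedding storage, packing one .npy member per sequence into a single archive that can be read back sequentially. It is not BEND's implementation, and the file names are placeholders.

    import io
    import tarfile
    import numpy as np

    # Generic illustration of tar-based embedding storage (not bend.io's API):
    # write each sequence's embedding as a .npy member of one tar archive.
    embeddings = {"seq_000.npy": np.random.rand(128, 768).astype(np.float32)}

    with tarfile.open("embeddings.tar", "w") as tar:
        for name, array in embeddings.items():
            buffer = io.BytesIO()
            np.save(buffer, array)
            buffer.seek(0)
            member = tarfile.TarInfo(name=name)
            member.size = buffer.getbuffer().nbytes
            tar.addfile(member, buffer)

    # Read the archive back, one member at a time.
    with tarfile.open("embeddings.tar", "r") as tar:
        for member in tar.getmembers():
            array = np.load(io.BytesIO(tar.extractfile(member).read()))
            print(member.name, array.shape)  # seq_000.npy (128, 768)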
