bend.utils.embedders module

embedders.py

Wrapper classes for embedding sequences with pretrained DNA language models through a common interface. Each wrapper loads its model and tokenizer and embeds the sequences. Where possible, models are downloaded automatically. The wrappers also remove special tokens and can optionally upsample the embeddings to the original sequence length.

Embedders can be used as follows. Please check the individual classes for more details on the arguments.

embedder = EmbedderClass(model_name, some_additional_config_argument=6)

embedding = embedder(sequence, remove_special_tokens=True, upsample_embeddings=True)

class bend.utils.embedders.AWDLSTMEmbedder(*args, **kwargs)[source]

Bases: BaseEmbedder

Embed using the AWD-LSTM (https://arxiv.org/abs/1708.02182) baseline LM trained in BEND.

Initialize the embedder. Calls load_model with the given arguments.

Parameters:

*args – Positional arguments. Passed to load_model.
**kwargs – Keyword arguments. Passed to load_model.

embed(sequences: List[str], disable_tqdm: bool = False, upsample_embeddings: bool = False)[source]

Embed sequences using the AWD-LSTM baseline LM trained in BEND.

Parameters:

sequences (List[str]) – List of sequences to embed.
disable_tqdm (bool, optional) – Whether to disable the tqdm progress bar. Defaults to False.
upsample_embeddings (bool, optional) – Whether to upsample the embeddings to the length of the input sequence. Defaults to False. Only provided for compatibility with other embedders. AWD-LSTM embeddings are already the same length as the input sequence.

Returns:

List of embeddings.

Return type:

List[np.ndarray]

load_model(model_path, **kwargs)[source]

Load the AWD-LSTM baseline LM trained in BEND.

Parameters:

model_path (str) – The path to the model directory. If the model path does not exist, it will be downloaded from https://sid.erda.dk/cgi-sid/ls.py?share_id=dbQM0pgSlM&current_dir=pretrained_models&flags=f

class bend.utils.embedders.BaseEmbedder(*args, **kwargs)[source]

Bases: object

Base class for embedders. All embedders should inherit from this class.

Initialize the embedder. Calls load_model with the given arguments.

Parameters:

*args – Positional arguments. Passed to load_model.
**kwargs – Keyword arguments. Passed to load_model.

embed(sequences: str, *args, **kwargs)[source]

Embed a sequence. Should be implemented by the inheriting class.

Parameters:

sequences (str) – The sequences to embed.

load_model(*args, **kwargs)[source]

Load the model. Should be implemented by the inheriting class.
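The contract described for the base class can be sketched as follows. This is a minimal illustration under stated assumptions, not BEND's actual implementation: `SketchBaseEmbedder` and `ToyIndicatorEmbedder` are hypothetical names, plain Python lists stand in for the `np.ndarray` objects the real embedders return, and forwarding `__call__` to `embed` is assumed from the usage example at the top of this page.

```python
from typing import List


class SketchBaseEmbedder:
    """Hypothetical stand-in that follows the documented contract."""

    def __init__(self, *args, **kwargs):
        # The constructor only forwards its arguments to load_model.
        self.load_model(*args, **kwargs)

    def load_model(self, *args, **kwargs):
        raise NotImplementedError("implemented by the inheriting class")

    def embed(self, sequences: List[str], *args, **kwargs):
        raise NotImplementedError("implemented by the inheriting class")

    def __call__(self, sequences: List[str], *args, **kwargs):
        # Assumption: calling the embedder is shorthand for calling embed.
        return self.embed(sequences, *args, **kwargs)


class ToyIndicatorEmbedder(SketchBaseEmbedder):
    """Toy subclass: one indicator vector per nucleotide."""

    def load_model(self, alphabet: str = "ACGT"):
        # A real embedder would load model weights and a tokenizer here.
        self.index = {base: i for i, base in enumerate(alphabet)}

    def embed(self, sequences: List[str], **kwargs) -> List[List[List[float]]]:
        embeddings = []
        for sequence in sequences:
            rows = []
            for base in sequence:
                row = [0.0] * len(self.index)
                row[self.index[base]] = 1.0
                rows.append(row)
            embeddings.append(rows)
        return embeddings


embedder = ToyIndicatorEmbedder(alphabet="ACGT")
result = embedder(["ACGT", "GG"])
print(len(result), len(result[0]), len(result[1]))
```

The point of the pattern is that every subclass shares the same construction and call signature, so downstream code can swap embedders without changing how they are invoked.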

class bend.utils.embedders.CaduceusEmbedder(*args, **kwargs)[source]

Bases: BaseEmbedder

Initialize the embedder. Calls load_model with the given arguments.

Parameters:

*args – Positional arguments. Passed to load_model.
**kwargs – Keyword arguments. Passed to load_model.

embed(sequences: List[str], disable_tqdm: bool = False, remove_special_tokens: bool = True, upsample_embeddings: bool = False)[source]

Embed sequences using the Caduceus model.

Parameters:

sequences (List[str]) – List of sequences to embed.
disable_tqdm (bool, optional) – Whether to disable the tqdm progress bar. Defaults to False.
remove_special_tokens (bool, optional) – Whether to remove the CLS and SEP tokens from the embeddings. Defaults to True. Only provided for compatibility with other embedders.
upsample_embeddings (bool, optional) – Whether to upsample the embeddings to match the length of the input sequences. Defaults to False. Only provided for compatibility with other embedders. Caduceus embeddings are already the same length as the input sequence.

Returns:

List of embeddings.

Return type:

List[np.ndarray]

load_model(model_name: str = 'kuleshov-group/caduceus-ph_seqlen-131k_d_model-256_n_layer-16', return_logits: bool = False, return_loss: bool = False, **kwargs)[source]

Load the Caduceus model (https://arxiv.org/abs/2403.03234).

Parameters:

model_name (str, optional) – The name of the model to load. Defaults to “kuleshov-group/caduceus-ph_seqlen-131k_d_model-256_n_layer-16”. When providing a name, the model will be loaded from the HuggingFace model hub. Alternatively, you can provide a path to a local model directory.
return_logits (bool, optional) – If True, returns logits instead of embeddings. Defaults to False.
return_loss (bool, optional) – If True, returns the unreduced next token prediction loss. Incompatible with return_logits. We trim special tokens from the output so that the loss is only computed on the ACTGN vocabulary. Defaults to False.

class bend.utils.embedders.ConvNetEmbedder(*args, **kwargs)[source]

Bases: BaseEmbedder

Embed using the GPN-inspired ConvNet baseline LM trained in BEND.

Initialize the embedder. Calls load_model with the given arguments.

Parameters:

*args – Positional arguments. Passed to load_model.
**kwargs – Keyword arguments. Passed to load_model.

embed(sequences: List[str], disable_tqdm: bool = False, upsample_embeddings: bool = False)[source]

Embed sequences using the GPN-inspired ConvNet baseline LM trained in BEND.

Parameters:

sequences (List[str]) – List of sequences to embed.
disable_tqdm (bool, optional) – Whether to disable the tqdm progress bar. Defaults to False.
upsample_embeddings (bool, optional) – Whether to upsample the embeddings to the length of the input sequence. Defaults to False. Only provided for compatibility with other embedders. ConvNet embeddings are already the same length as the input sequence.

Returns:

List of embeddings.

Return type:

List[np.ndarray]

load_model(model_path, **kwargs)[source]

Load the GPN-inspired ConvNet baseline LM trained in BEND.

Parameters:

model_path (str) – The path to the model directory. If the model path does not exist, it will be downloaded from https://sid.erda.dk/cgi-sid/ls.py?share_id=dbQM0pgSlM&current_dir=pretrained_models&flags=f

class bend.utils.embedders.DNABert2Embedder(*args, **kwargs)[source]

Bases: BaseEmbedder

Embed using the DNABERT2 model https://arxiv.org/pdf/2306.15006.pdf

Initialize the embedder. Calls load_model with the given arguments.

Parameters:

*args – Positional arguments. Passed to load_model.
**kwargs – Keyword arguments. Passed to load_model.

embed(sequences: List[str], disable_tqdm: bool = False, remove_special_tokens: bool = True, upsample_embeddings: bool = False)[source]

Embed a list of sequences using the DNABERT2 model.

Parameters:

sequences (List[str]) – List of sequences to embed.
disable_tqdm (bool, optional) – Whether to disable the tqdm progress bar. Defaults to False.
remove_special_tokens (bool, optional) – Whether to remove the CLS and SEP tokens from the embeddings. Defaults to True.
upsample_embeddings (bool, optional) – Whether to upsample the embeddings to match the length of the input sequences. Defaults to False.

Returns:

embeddings – List of embeddings.

Return type:

List[np.ndarray]
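Because BPE tokens each cover a variable number of nucleotides, a tokenized sequence yields fewer embeddings than it has nucleotides. One plausible way to upsample, sketched below, is to repeat each token's embedding once per nucleotide the token covers. This is an assumption made for illustration; this page does not state how BEND implements upsampling, and `upsample_by_token_length` and the example tokens are hypothetical.

```python
from typing import List, Sequence


def upsample_by_token_length(
    token_embeddings: Sequence[Sequence[float]], tokens: Sequence[str]
) -> List[List[float]]:
    """Repeat each token embedding once per nucleotide the token covers."""
    if len(token_embeddings) != len(tokens):
        raise ValueError("need exactly one embedding per token")
    upsampled = []
    for embedding, token in zip(token_embeddings, tokens):
        # A token of length n contributes n identical rows.
        upsampled.extend(list(embedding) for _ in range(len(token)))
    return upsampled


tokens = ["ACG", "T", "GGCA"]  # hypothetical tokenization of "ACGTGGCA"
token_embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
upsampled = upsample_by_token_length(token_embeddings, tokens)
print(len(upsampled))  # 8, one row per nucleotide
```

After upsampling, row i of the output can be paired directly with nucleotide i of the input, which is what per-nucleotide downstream tasks need.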

load_model(model_name='zhihan1996/DNABERT-2-117M', return_logits: bool = False, return_loss: bool = False, **kwargs)[source]

Load the DNABERT2 model.

Parameters:

model_name (str, optional) – The name of the model to load. Defaults to “zhihan1996/DNABERT-2-117M”. When providing a name, the model will be loaded from the HuggingFace model hub. Alternatively, you can provide a path to a local model directory.
return_logits (bool, optional) – If True, returns logits instead of embeddings. Defaults to False.
return_loss (bool, optional) – If True, returns the unreduced next token prediction loss. Incompatible with return_logits. If remove_special_tokens is True, the loss is only computed on the BPE vocabulary without the special tokens. Defaults to False.

class bend.utils.embedders.DNABertEmbedder(*args, **kwargs)[source]

Bases: BaseEmbedder

Embed using the DNABert model https://doi.org/10.1093/bioinformatics/btab083

Initialize the embedder. Calls load_model with the given arguments.

Parameters:

*args – Positional arguments. Passed to load_model.
**kwargs – Keyword arguments. Passed to load_model.

embed(sequences: List[str], disable_tqdm: bool = False, remove_special_tokens: bool = True, upsample_embeddings: bool = False)[source]

Embed a list of sequences.

Parameters:

sequences (List[str]) – The sequences to embed.
disable_tqdm (bool, optional) – Whether to disable the tqdm progress bar. Defaults to False.
remove_special_tokens (bool, optional) – Whether to remove the special tokens from the embeddings. Defaults to True.
upsample_embeddings (bool, optional) – Whether to upsample the embeddings to the length of the input sequence. Defaults to False.

Returns:

The embeddings of the sequences.

Return type:

List[np.ndarray]

load_model(model_path: str = '../../external-models/DNABERT/', kmer: int = 6, **kwargs)[source]

Load the DNABert model.

Parameters:

model_path (str) – The path to the model directory. Defaults to “../../external-models/DNABERT/”. The DNABERT models need to be downloaded manually as indicated in the DNABERT repository at https://github.com/jerryji1993/DNABERT.
kmer (int) – The kmer size of the model. Defaults to 6.
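To illustrate what the kmer argument controls: DNABERT represents a sequence as overlapping k-mers taken with a stride of one. The sketch below shows that splitting under this assumption. `to_overlapping_kmers` is a hypothetical helper, not a function of this module or of the DNABERT repository.

```python
from typing import List


def to_overlapping_kmers(sequence: str, kmer: int = 6) -> List[str]:
    """Split a sequence into overlapping k-mers with a stride of one."""
    if kmer < 1:
        raise ValueError("kmer must be a positive integer")
    # A sequence of length L yields L - kmer + 1 k-mers (none if L < kmer).
    return [sequence[i:i + kmer] for i in range(len(sequence) - kmer + 1)]


kmers = to_overlapping_kmers("ACGTACGT", kmer=6)
print(kmers)  # ['ACGTAC', 'CGTACG', 'GTACGT']
```

Since a sequence of length L produces only L - kmer + 1 tokens, the token-level output is shorter than the input, which is why an upsampling option is relevant for this embedder.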

class bend.utils.embedders.GENALMEmbedder(*args, **kwargs)[source]

Bases: BaseEmbedder

Embed using the GENA-LM model https://www.biorxiv.org/content/10.1101/2023.06.12.544594v1.full

Initialize the embedder. Calls load_model with the given arguments.

Parameters:

*args – Positional arguments. Passed to load_model.
**kwargs – Keyword arguments. Passed to load_model.

embed(sequences: List[str], disable_tqdm: bool = False, remove_special_tokens: bool = True, upsample_embeddings: bool = False)[source]

Embed sequences using the GENA-LM model.

Parameters:

sequences (List[str]) – List of sequences to embed.
disable_tqdm (bool, optional) – Whether to disable the tqdm progress bar. Defaults to False.
remove_special_tokens (bool, optional) – Whether to remove the [CLS] and [SEP] tokens from the output. Defaults to True.
upsample_embeddings (bool, optional) – Whether to upsample the embeddings to the length of the input sequence. Defaults to False.

Returns:

List of embeddings.

Return type:

List[np.ndarray]
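A minimal sketch of what removing the special tokens amounts to, assuming the tokenizer places exactly one [CLS] embedding at the start and one [SEP] embedding at the end of each sequence. `strip_cls_and_sep` is a hypothetical helper, and plain lists stand in for the arrays the embedder returns.

```python
from typing import List, Sequence


def strip_cls_and_sep(token_embeddings: Sequence[Sequence[float]]) -> List[List[float]]:
    """Drop the first ([CLS]) and last ([SEP]) rows of a token embedding matrix."""
    if len(token_embeddings) < 2:
        raise ValueError("expected at least a [CLS] and a [SEP] row")
    return [list(row) for row in token_embeddings[1:-1]]


with_special = [[9.0, 9.0], [0.1, 0.2], [0.3, 0.4], [8.0, 8.0]]
without_special = strip_cls_and_sep(with_special)
print(without_special)  # [[0.1, 0.2], [0.3, 0.4]]
```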

load_model(model_name, **kwargs)[source]

Load the GENA-LM model.

Parameters:

model_name (str) – The name of the model to load. When providing a name, the model will be loaded from the HuggingFace model hub. Alternatively, you can provide a path to a local model directory.

class bend.utils.embedders.GPNEmbedder(*args, **kwargs)[source]

Bases: BaseEmbedder

Embed using the GPN model https://www.biorxiv.org/content/10.1101/2022.08.22.504706v1

Initialize the embedder. Calls load_model with the given arguments.

Parameters:

*args – Positional arguments. Passed to load_model.
**kwargs – Keyword arguments. Passed to load_model.

embed(sequences: List[str], disable_tqdm: bool = False, upsample_embeddings: bool = False) → List[ndarray][source]

Embed a list of sequences.

Parameters:

sequences (List[str]) – The sequences to embed.
disable_tqdm (bool, optional) – Whether to disable the tqdm progress bar. Defaults to False.
upsample_embeddings (bool, optional) – Whether to upsample the embeddings to the length of the input sequence. Defaults to False. Only provided for compatibility with other embedders. GPN embeddings are already the same length as the input sequence.

Returns:

The embeddings of the sequences.

Return type:

List[np.ndarray]

load_model(model_name: str = 'songlab/gpn-brassicales', **kwargs)[source]

Load the GPN model.

Parameters:

model_name (str) – The name of the model to load. Defaults to “songlab/gpn-brassicales”. When providing a name, the model will be loaded from the HuggingFace model hub. Alternatively, you can provide a path to a local model directory.

Raises:

ModuleNotFoundError – If the gpn module is not installed.

Notes

The gpn module can be installed with pip install git+https://github.com/songlab-cal/gpn.git

class bend.utils.embedders.GROVEREmbedder(*args, **kwargs)[source]

Bases: BaseEmbedder

Embed using the GROVER model https://www.biorxiv.org/content/10.1101/2023.07.19.549677v2

Initialize the embedder. Calls load_model with the given arguments.

Parameters:

*args – Positional arguments. Passed to load_model.
**kwargs – Keyword arguments. Passed to load_model.

embed(sequences: List[str], disable_tqdm: bool = False, remove_special_tokens: bool = True, upsample_embeddings: bool = False)[source]

Embed a list of sequences using the GROVER model. The BPE tokenizer that GROVER used is not provided; only the vocabulary used for tokenization is available. We therefore tokenize with max match, so that each subsequence is assigned the longest matching token in the vocabulary. It is not certain that this is identical to what a correctly instantiated BPE tokenizer would produce.

Parameters:

sequences (List[str]) – List of sequences to embed.
disable_tqdm (bool, optional) – Whether to disable the tqdm progress bar. Defaults to False.
remove_special_tokens (bool, optional) – Whether to remove the CLS and SEP tokens from the embeddings. Defaults to True.
upsample_embeddings (bool, optional) – Whether to upsample the embeddings to match the length of the input sequences. Defaults to False.

Returns:

embeddings – List of embeddings.

Return type:

List[np.ndarray]

load_model(model_path: str = 'pretrained_models/grover', **kwargs)[source]

Load the GROVER model.

Parameters:

model_path (str) – The path to the model directory. If the model path does not exist, it will be downloaded from https://zenodo.org/records/8373117

max_match_tokenize(sequence: str) → List[str][source]

Tokenize a sequence using max match. We have to do this as we do not have access to the BPE tokenizer used by GROVER. We only have access to the vocabulary, so we find a sequence-to-token assignment that uses the longest possible tokens.

Parameters:

sequence (str) – The sequence to tokenize.

Returns:

The tokenized sequence.

Return type:

List[str]
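Max match tokenization can be sketched as a greedy left-to-right scan that always takes the longest vocabulary entry starting at the current position. This is one common way to realize the idea, not necessarily what max_match_tokenize does internally. The function name, the toy vocabulary, and the `[UNK]` fallback for characters missing from the vocabulary are all assumptions for illustration.

```python
from typing import List, Set


def greedy_max_match(
    sequence: str, vocabulary: Set[str], unknown_token: str = "[UNK]"
) -> List[str]:
    """Tokenize left to right, always taking the longest matching vocabulary entry."""
    if not vocabulary:
        raise ValueError("vocabulary must not be empty")
    max_length = max(len(token) for token in vocabulary)
    tokens = []
    position = 0
    while position < len(sequence):
        longest = min(max_length, len(sequence) - position)
        # Try candidates from longest to shortest.
        for length in range(longest, 0, -1):
            candidate = sequence[position:position + length]
            if candidate in vocabulary:
                tokens.append(candidate)
                position += length
                break
        else:
            # No vocabulary entry starts here: emit a placeholder and move on.
            tokens.append(unknown_token)
            position += 1
    return tokens


toy_vocabulary = {"A", "C", "G", "T", "AC", "ACG", "GT", "TT"}
tokens = greedy_max_match("ACGTTAN", toy_vocabulary)
print(tokens)  # ['ACG', 'TT', 'A', '[UNK]']
```

A greedy scan is not guaranteed to reproduce the merges of a trained BPE tokenizer, which is consistent with the caveat in the embed documentation above.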

class bend.utils.embedders.HyenaDNAEmbedder(*args, **kwargs)[source]

Bases: BaseEmbedder

Embed using the HyenaDNA model https://arxiv.org/abs/2306.15794

Initialize the embedder. Calls load_model with the given arguments.

Parameters:

*args – Positional arguments. Passed to load_model.
**kwargs – Keyword arguments. Passed to load_model.

embed(sequences: List[str], disable_tqdm: bool = False, remove_special_tokens: bool = True, upsample_embeddings: bool = False)[source]

Embeds a list of sequences using the HyenaDNA model.

Parameters:

sequences (List[str]) – List of sequences to embed.
disable_tqdm (bool, optional) – Whether to disable the tqdm progress bar. Defaults to False.
remove_special_tokens (bool, optional) – Whether to remove the CLS and SEP tokens from the embeddings. Defaults to True. Cannot be set to False if the return_loss option of the embedder is True (autoregression forces us to discard the BOS token position either way).
upsample_embeddings (bool, optional) – Whether to upsample the embeddings to match the length of the input sequences. Defaults to False. Only provided for compatibility with other embedders. HyenaDNA embeddings are already the same length as the input sequence.

Returns:

embeddings – List of embeddings.

Return type:

List[np.ndarray]

load_model(model_path='pretrained_models/hyenadna/hyenadna-tiny-1k-seqlen', return_logits: bool = False, return_loss: bool = False, **kwargs)[source]

Load the HyenaDNA model.

Parameters:

model_path (str, optional) – Path to the model checkpoint. Defaults to ‘pretrained_models/hyenadna/hyenadna-tiny-1k-seqlen’. If the path does not exist, the model will be downloaded from HuggingFace. Rather than just downloading the model, HyenaDNA’s from_pretrained method relies on cloning the HuggingFace-hosted repository, and using git lfs to download the model. This requires git lfs to be installed on your system, and will fail if it is not.
return_logits (bool, optional) – If True, returns logits instead of embeddings. Defaults to False.
return_loss (bool, optional) – If True, returns the unreduced next token prediction loss. Incompatible with return_logits. We trim special tokens from the output so that the loss is only computed on the ACTGN vocabulary. Defaults to False.
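An "unreduced" next token prediction loss means one cross-entropy value per position rather than a single average. The sketch below shows the computation for an autoregressive model, assuming the logits at position t are scored against the token at position t + 1. It also shows why the first token never receives a loss value, which matches the note in embed about discarding the BOS position. `per_position_next_token_loss` is a hypothetical helper written with the standard library only; it is not the code used by the embedder.

```python
import math
from typing import List, Sequence


def per_position_next_token_loss(
    logits: Sequence[Sequence[float]], token_ids: Sequence[int]
) -> List[float]:
    """Cross-entropy of predicting token t + 1 from the logits at position t."""
    if len(logits) != len(token_ids):
        raise ValueError("need one row of logits per token")
    losses = []
    for position in range(len(token_ids) - 1):
        row = logits[position]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(row)
        log_normalizer = peak + math.log(sum(math.exp(value - peak) for value in row))
        target = token_ids[position + 1]
        losses.append(log_normalizer - row[target])
    return losses


# Three tokens over a 4-symbol vocabulary with uniform logits.
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
losses = per_position_next_token_loss(uniform_logits, [0, 2, 1])
print(len(losses))  # 2: the first token has no preceding prediction
```

With uniform logits every value equals log(4), the loss of a uniform guess over four symbols, which makes the example easy to check by hand.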

class bend.utils.embedders.NucleotideTransformerEmbedder(*args, **kwargs)[source]

Bases: BaseEmbedder

Embed using the Nucleotide Transformer (NT) model https://www.biorxiv.org/content/10.1101/2023.01.11.523679v2.full

Initialize the embedder. Calls load_model with the given arguments.

Parameters:

*args – Positional arguments. Passed to load_model.
**kwargs – Keyword arguments. Passed to load_model.

embed(sequences: List[str], disable_tqdm: bool = False, remove_special_tokens: bool = True, upsample_embeddings: bool = False)[source]

Embed sequences using the Nucleotide Transformer (NT) model.

Parameters:

sequences (List[str]) – List of sequences to embed.
disable_tqdm (bool, optional) – Whether to disable the tqdm progress bar. Defaults to False.
remove_special_tokens (bool, optional) – Whether to remove the special tokens from the embeddings. Defaults to True.
upsample_embeddings (bool, optional) – Whether to upsample the embeddings to the length of the input sequence. Defaults to False.

Returns:

List of embeddings.

Return type:

List[np.ndarray]

load_model(model_name, return_logits: bool = False, return_loss: bool = False, **kwargs)[source]

Load the Nucleotide Transformer (NT) model.

Parameters:

model_name (str) – The name of the model to load. When providing a name, the model will be loaded from the HuggingFace model hub. Alternatively, you can provide a path to a local model directory. We check whether the model_name contains ‘v2’ to determine whether we need to follow the V2 model API or not.
return_logits (bool, optional) – Whether to return the logits. Note that we do not apply any masking. Defaults to False.
return_loss (bool, optional) – Whether to return the loss. Note that we do not apply any masking. remove_special_tokens also ignores these dimensions when computing the loss. Defaults to False.

class bend.utils.embedders.OneHotEmbedder(nucleotide_categories=['A', 'C', 'G', 'N', 'T'])[source]

Bases: BaseEmbedder

One-hot encode sequences.

Create a one-hot encoder for nucleotide sequences.

Parameters:

nucleotide_categories (List[str], optional) – List of nucleotides in the alphabet. Defaults to [‘A’, ‘C’, ‘G’, ‘N’, ‘T’].

embed(sequences: List[str], disable_tqdm: bool = False, return_onehot: bool = False, upsample_embeddings: bool = False)[source]

One-hot encode sequences.

Parameters:

sequences (List[str]) – List of sequences to embed.
disable_tqdm (bool, optional) – Whether to disable the tqdm progress bar. Defaults to False.
return_onehot (bool, optional) – Whether to return one-hot encoded sequences. Defaults to False. If False, returns integer encoded sequences.
upsample_embeddings (bool, optional) – Whether to upsample the embeddings to match the length of the input sequences. Defaults to False.

Returns:

embeddings – List of one-hot encodings or integer encodings, depending on return_onehot.

Return type:

List[np.ndarray]
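The difference between the two output modes can be sketched as follows. The example assumes that each nucleotide's integer code is its position in nucleotide_categories; this page does not confirm that ordering. `integer_encode` and `onehot_encode` are hypothetical helpers, plain lists stand in for arrays, and characters outside the alphabet raise a KeyError in this sketch.

```python
from typing import List, Sequence

DEFAULT_CATEGORIES = ("A", "C", "G", "N", "T")


def integer_encode(
    sequence: str, categories: Sequence[str] = DEFAULT_CATEGORIES
) -> List[int]:
    """Map each nucleotide to its index in the category list."""
    index = {nucleotide: i for i, nucleotide in enumerate(categories)}
    return [index[nucleotide] for nucleotide in sequence]


def onehot_encode(
    sequence: str, categories: Sequence[str] = DEFAULT_CATEGORIES
) -> List[List[int]]:
    """Expand each integer code into an indicator row of length len(categories)."""
    encoded = []
    for category_index in integer_encode(sequence, categories):
        row = [0] * len(categories)
        row[category_index] = 1
        encoded.append(row)
    return encoded


print(integer_encode("GATN"))  # [2, 0, 4, 3]
print(onehot_encode("GA"))     # [[0, 0, 1, 0, 0], [1, 0, 0, 0, 0]]
```

Integer codes are the more compact form and can be expanded to one-hot rows later, which may be why they are the default output.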