bend.utils.embedders module
embedders.py
Wrapper classes for embedding sequences with pretrained DNA language models through a common interface. The wrappers load the models and tokenizers, embed the sequences, remove special tokens, and optionally upsample the embeddings to the original sequence length. As far as possible, models are downloaded automatically.
Embedders can be used as follows. Please check the individual classes for more details on the arguments.
embedder = EmbedderClass(model_name, some_additional_config_argument=6)
embedding = embedder(sequence, remove_special_tokens=True, upsample_embeddings=True)
- class bend.utils.embedders.AWDLSTMEmbedder(*args, **kwargs)[source]
Bases:
BaseEmbedder
Embed using the AWD-LSTM (https://arxiv.org/abs/1708.02182) baseline LM trained in BEND.
Initialize the embedder. Calls load_model with the given arguments.
- Parameters:
*args – Positional arguments. Passed to load_model.
**kwargs – Keyword arguments. Passed to load_model.
- embed(sequences: List[str], disable_tqdm: bool = False, upsample_embeddings: bool = False)[source]
Embed sequences using the AWD-LSTM baseline LM trained in BEND.
- Parameters:
sequences (List[str]) – List of sequences to embed.
disable_tqdm (bool, optional) – Whether to disable the tqdm progress bar. Defaults to False.
upsample_embeddings (bool, optional) – Whether to upsample the embeddings to the length of the input sequence. Defaults to False. Only provided for compatibility with other embedders. AWD-LSTM embeddings are already the same length as the input sequence.
- Returns:
List of embeddings.
- Return type:
List[np.ndarray]
- load_model(model_path, **kwargs)[source]
Load the AWD-LSTM baseline LM trained in BEND.
- Parameters:
model_path (str) – The path to the model directory. If the model path does not exist, it will be downloaded from https://sid.erda.dk/cgi-sid/ls.py?share_id=dbQM0pgSlM&current_dir=pretrained_models&flags=f
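The download-if-missing behaviour described for model_path can be sketched roughly as follows. This is only an illustration under stated assumptions: ensure_model_dir and the fetch callback are hypothetical names that are not part of bend, and the stand-in fetch below only creates a directory so that the sketch runs offline.

```python
import os
import tempfile
from typing import Callable, List


def ensure_model_dir(model_path: str, fetch: Callable[[str], None]) -> str:
    """Return model_path, calling fetch(model_path) first if it does not exist.

    Hypothetical helper: `fetch` stands in for whatever code downloads the
    pretrained weights into `model_path`.
    """
    if not os.path.exists(model_path):
        fetch(model_path)
    return model_path


# Stand-in "download" that only creates the directory (no network access).
calls: List[str] = []


def fake_fetch(path: str) -> None:
    calls.append(path)
    os.makedirs(path)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "awd_lstm")
    ensure_model_dir(target, fake_fetch)  # directory missing: fetch is called
    ensure_model_dir(target, fake_fetch)  # directory present: fetch is skipped
    fetched_once = len(calls) == 1
```

Passing the download step in as a callback keeps the existence check separate from the transfer logic, which makes the check easy to exercise without a network connection.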
- class bend.utils.embedders.BaseEmbedder(*args, **kwargs)[source]
Bases:
object
Base class for embedders. All embedders should inherit from this class.
Initialize the embedder. Calls load_model with the given arguments.
- Parameters:
*args – Positional arguments. Passed to load_model.
**kwargs – Keyword arguments. Passed to load_model.
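The pattern described here (a constructor that forwards its arguments to load_model, with subclasses supplying load_model and embed) might look roughly like the sketch below. SketchBaseEmbedder and ConstantEmbedder are hypothetical stand-ins rather than the bend classes, and the __call__ behaviour is an assumption inferred from the usage example at the top of this page.

```python
from typing import List


class SketchBaseEmbedder:
    """Hypothetical stand-in for the base-class pattern described above."""

    def __init__(self, *args, **kwargs):
        # The constructor only forwards its arguments to load_model.
        self.load_model(*args, **kwargs)

    def load_model(self, *args, **kwargs):
        raise NotImplementedError

    def embed(self, sequences: List[str], **kwargs):
        raise NotImplementedError

    def __call__(self, sequence: str, **kwargs):
        # Assumption: calling the embedder on one sequence wraps embed().
        return self.embed([sequence], **kwargs)[0]


class ConstantEmbedder(SketchBaseEmbedder):
    """Toy subclass: every nucleotide gets the same one-dimensional vector."""

    def load_model(self, value: float = 1.0):
        self.value = value

    def embed(self, sequences: List[str], **kwargs):
        return [[[self.value] for _ in seq] for seq in sequences]


embedder = ConstantEmbedder(value=2.0)
embedding = embedder("ACGT")
```

Here embedding holds one vector per nucleotide of "ACGT"; a real subclass would load a pretrained model in load_model instead of storing a constant.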
- class bend.utils.embedders.CaduceusEmbedder(*args, **kwargs)[source]
Bases:
BaseEmbedder
Initialize the embedder. Calls load_model with the given arguments.
- Parameters:
*args – Positional arguments. Passed to load_model.
**kwargs – Keyword arguments. Passed to load_model.
- embed(sequences: List[str], disable_tqdm: bool = False, remove_special_tokens: bool = True, upsample_embeddings: bool = False)[source]
Embed sequences using the Caduceus model.
- Parameters:
sequences (List[str]) – List of sequences to embed.
disable_tqdm (bool, optional) – Whether to disable the tqdm progress bar. Defaults to False.
remove_special_tokens (bool, optional) – Whether to remove the CLS and SEP tokens from the embeddings. Defaults to True. Only provided for compatibility with other embedders.
upsample_embeddings (bool, optional) – Whether to upsample the embeddings to match the length of the input sequences. Defaults to False. Only provided for compatibility with other embedders. Caduceus embeddings are already the same length as the input sequence.
- Returns:
List of embeddings.
- Return type:
List[np.ndarray]
- load_model(model_name: str = 'kuleshov-group/caduceus-ph_seqlen-131k_d_model-256_n_layer-16', return_logits: bool = False, return_loss: bool = False, **kwargs)[source]
Load the Caduceus model (https://arxiv.org/abs/2403.03234).
- Parameters:
model_name (str, optional) – The name of the model to load. Defaults to “kuleshov-group/caduceus-ph_seqlen-131k_d_model-256_n_layer-16”. When providing a name, the model will be loaded from the HuggingFace model hub. Alternatively, you can provide a path to a local model directory.
return_logits (bool, optional) – If True, returns logits instead of embeddings. Defaults to False.
return_loss (bool, optional) – If True, returns the unreduced next token prediction loss. Incompatible with return_logits. Special tokens are trimmed from the output so that the loss is only computed on the ACTGN vocabulary. Defaults to False.
- class bend.utils.embedders.ConvNetEmbedder(*args, **kwargs)[source]
Bases:
BaseEmbedder
Embed using the GPN-inspired ConvNet baseline LM trained in BEND.
Initialize the embedder. Calls load_model with the given arguments.
- Parameters:
*args – Positional arguments. Passed to load_model.
**kwargs – Keyword arguments. Passed to load_model.
- embed(sequences: List[str], disable_tqdm: bool = False, upsample_embeddings: bool = False)[source]
Embed sequences using the GPN-inspired ConvNet baseline LM trained in BEND.
- Parameters:
sequences (List[str]) – List of sequences to embed.
disable_tqdm (bool, optional) – Whether to disable the tqdm progress bar. Defaults to False.
upsample_embeddings (bool, optional) – Whether to upsample the embeddings to the length of the input sequence. Defaults to False. Only provided for compatibility with other embedders. ConvNet embeddings are already the same length as the input sequence.
- Returns:
List of embeddings.
- Return type:
List[np.ndarray]
- load_model(model_path, **kwargs)[source]
Load the GPN-inspired ConvNet baseline LM trained in BEND.
- Parameters:
model_path (str) – The path to the model directory. If the model path does not exist, it will be downloaded from https://sid.erda.dk/cgi-sid/ls.py?share_id=dbQM0pgSlM&current_dir=pretrained_models&flags=f
- class bend.utils.embedders.DNABert2Embedder(*args, **kwargs)[source]
Bases:
BaseEmbedder
Embed using the DNABERT2 model https://arxiv.org/pdf/2306.15006.pdf
Initialize the embedder. Calls load_model with the given arguments.
- Parameters:
*args – Positional arguments. Passed to load_model.
**kwargs – Keyword arguments. Passed to load_model.
- embed(sequences: List[str], disable_tqdm: bool = False, remove_special_tokens: bool = True, upsample_embeddings: bool = False)[source]
Embed a list of sequences using the DNABERT2 model.
- Parameters:
sequences (List[str]) – List of sequences to embed.
disable_tqdm (bool, optional) – Whether to disable the tqdm progress bar. Defaults to False.
remove_special_tokens (bool, optional) – Whether to remove the CLS and SEP tokens from the embeddings. Defaults to True.
upsample_embeddings (bool, optional) – Whether to upsample the embeddings to match the length of the input sequences. Defaults to False.
- Returns:
embeddings – List of embeddings.
- Return type:
List[np.ndarray]
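For tokenizers that merge several nucleotides into one token, upsampling to the input length can be pictured as repeating each token embedding once per nucleotide the token covers. The sketch below illustrates that idea only; it assumes this repetition scheme (the page does not spell out how the wrapper upsamples), uses plain lists instead of np.ndarray, and upsample_by_token_length is a hypothetical name.

```python
from typing import List, Sequence


def upsample_by_token_length(
    token_embeddings: Sequence[Sequence[float]], tokens: Sequence[str]
) -> List[List[float]]:
    """Repeat each token embedding once per nucleotide the token covers."""
    if len(token_embeddings) != len(tokens):
        raise ValueError("need exactly one embedding per token")
    upsampled: List[List[float]] = []
    for vector, token in zip(token_embeddings, tokens):
        # A token spanning len(token) nucleotides contributes that many rows.
        upsampled.extend(list(vector) for _ in range(len(token)))
    return upsampled


# Hypothetical tokenization of "ACGTGGCA" into three variable-length tokens.
tokens = ["ACG", "T", "GGCA"]
token_embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
per_nucleotide = upsample_by_token_length(token_embeddings, tokens)
```

After upsampling, per_nucleotide has one row per nucleotide of the input, so it can be aligned with per-nucleotide labels.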
- load_model(model_name='zhihan1996/DNABERT-2-117M', return_logits: bool = False, return_loss: bool = False, **kwargs)[source]
Load the DNABERT2 model.
- Parameters:
model_name (str, optional) – The name of the model to load. Defaults to “zhihan1996/DNABERT-2-117M”. When providing a name, the model will be loaded from the HuggingFace model hub. Alternatively, you can provide a path to a local model directory.
return_logits (bool, optional) – If True, returns logits instead of embeddings. Defaults to False.
return_loss (bool, optional) – If True, returns the unreduced next token prediction loss. Incompatible with return_logits. If remove_special_tokens is True, the loss is only computed on the BPE vocabulary without the special tokens. Defaults to False.
- class bend.utils.embedders.DNABertEmbedder(*args, **kwargs)[source]
Bases:
BaseEmbedder
Embed using the DNABert model https://doi.org/10.1093/bioinformatics/btab083
Initialize the embedder. Calls load_model with the given arguments.
- Parameters:
*args – Positional arguments. Passed to load_model.
**kwargs – Keyword arguments. Passed to load_model.
- embed(sequences: List[str], disable_tqdm: bool = False, remove_special_tokens: bool = True, upsample_embeddings: bool = False)[source]
Embed a list of sequences.
- Parameters:
sequences (List[str]) – The sequences to embed.
disable_tqdm (bool, optional) – Whether to disable the tqdm progress bar. Defaults to False.
remove_special_tokens (bool, optional) – Whether to remove the special tokens from the embeddings. Defaults to True.
upsample_embeddings (bool, optional) – Whether to upsample the embeddings to the length of the input sequence. Defaults to False.
- Returns:
The embeddings of the sequences.
- Return type:
List[np.ndarray]
- load_model(model_path: str = '../../external-models/DNABERT/', kmer: int = 6, **kwargs)[source]
Load the DNABert model.
- Parameters:
model_path (str) – The path to the model directory. Defaults to “../../external-models/DNABERT/”. The DNABERT models need to be downloaded manually as indicated in the DNABERT repository at https://github.com/jerryji1993/DNABERT.
kmer (int) – The kmer size of the model. Defaults to 6.
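The kmer argument refers to the k-mer size the DNABERT checkpoint was trained with. As a rough illustration, a sequence can be split into overlapping k-mers as sketched below. This assumes a stride of 1 between consecutive k-mers; kmer_tokenize is a hypothetical helper and not the tokenizer used by the wrapper.

```python
from typing import List


def kmer_tokenize(sequence: str, k: int = 6) -> List[str]:
    """Split a sequence into overlapping k-mers with stride 1."""
    if k < 1:
        raise ValueError("k must be positive")
    # A sequence of length n yields n - k + 1 k-mers (none if n < k).
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


kmers = kmer_tokenize("ACGTACGT", k=6)
```

Because a sequence of length n yields only n - k + 1 k-mers, the token-level output is shorter than the input, which is one reason an upsampling option is useful for this kind of model.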
- class bend.utils.embedders.GENALMEmbedder(*args, **kwargs)[source]
Bases:
BaseEmbedder
Embed using the GENA-LM model https://www.biorxiv.org/content/10.1101/2023.06.12.544594v1.full
Initialize the embedder. Calls load_model with the given arguments.
- Parameters:
*args – Positional arguments. Passed to load_model.
**kwargs – Keyword arguments. Passed to load_model.
- embed(sequences: List[str], disable_tqdm: bool = False, remove_special_tokens: bool = True, upsample_embeddings: bool = False)[source]
Embed sequences using the GENA-LM model.
- Parameters:
sequences (List[str]) – List of sequences to embed.
disable_tqdm (bool, optional) – Whether to disable the tqdm progress bar. Defaults to False.
remove_special_tokens (bool, optional) – Whether to remove the [CLS] and [SEP] tokens from the output. Defaults to True.
upsample_embeddings (bool, optional) – Whether to upsample the embeddings to the length of the input sequence. Defaults to False.
- Returns:
List of embeddings.
- Return type:
List[np.ndarray]
- class bend.utils.embedders.GPNEmbedder(*args, **kwargs)[source]
Bases:
BaseEmbedder
Embed using the GPN model https://www.biorxiv.org/content/10.1101/2022.08.22.504706v1
Initialize the embedder. Calls load_model with the given arguments.
- Parameters:
*args – Positional arguments. Passed to load_model.
**kwargs – Keyword arguments. Passed to load_model.
- embed(sequences: List[str], disable_tqdm: bool = False, upsample_embeddings: bool = False) → List[ndarray][source]
Embed a list of sequences.
- Parameters:
sequences (List[str]) – The sequences to embed.
disable_tqdm (bool, optional) – Whether to disable the tqdm progress bar. Defaults to False.
upsample_embeddings (bool, optional) – Whether to upsample the embeddings to the length of the input sequence. Defaults to False. Only provided for compatibility with other embedders. GPN embeddings are already the same length as the input sequence.
- Returns:
The embeddings of the sequences.
- Return type:
List[np.ndarray]
- load_model(model_name: str = 'songlab/gpn-brassicales', **kwargs)[source]
Load the GPN model.
- Parameters:
model_name (str) – The name of the model to load. Defaults to “songlab/gpn-brassicales”. When providing a name, the model will be loaded from the HuggingFace model hub. Alternatively, you can provide a path to a local model directory.
- Raises:
ModuleNotFoundError – If the gpn module is not installed.
Notes
The gpn module can be installed with pip install git+https://github.com/songlab-cal/gpn.git
- class bend.utils.embedders.GROVEREmbedder(*args, **kwargs)[source]
Bases:
BaseEmbedder
Embed using the GROVER model https://www.biorxiv.org/content/10.1101/2023.07.19.549677v2
Initialize the embedder. Calls load_model with the given arguments.
- Parameters:
*args – Positional arguments. Passed to load_model.
**kwargs – Keyword arguments. Passed to load_model.
- embed(sequences: List[str], disable_tqdm: bool = False, remove_special_tokens: bool = True, upsample_embeddings: bool = False)[source]
Embed a list of sequences using the GROVER model. Note that the BPE tokenizer used by GROVER is not provided; only the vocabulary used for tokenization is available. Instead, max match is used to tokenize the sequence, so that each subsequence is tokenized as its longest token in the vocabulary. It is not certain that this is identical to what a correctly instantiated BPE tokenizer would do.
- Parameters:
sequences (List[str]) – List of sequences to embed.
disable_tqdm (bool, optional) – Whether to disable the tqdm progress bar. Defaults to False.
remove_special_tokens (bool, optional) – Whether to remove the CLS and SEP tokens from the embeddings. Defaults to True.
upsample_embeddings (bool, optional) – Whether to upsample the embeddings to match the length of the input sequences. Defaults to False.
- Returns:
embeddings – List of embeddings.
- Return type:
List[np.ndarray]
- load_model(model_path: str = 'pretrained_models/grover', **kwargs)[source]
Load the GROVER model.
- Parameters:
model_path (str) – The path to the model directory. If the model path does not exist, it will be downloaded from https://zenodo.org/records/8373117
- max_match_tokenize(sequence: str) → List[str][source]
Tokenize a sequence using max match. We have to do this as we do not have access to the BPE tokenizer used by GROVER. We only have access to the vocabulary, so we find a sequence-to-token assignment that uses the longest possible tokens.
- Parameters:
sequence (str) – The sequence to tokenize.
- Returns:
The tokenized sequence.
- Return type:
List[str]
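One common way to realise max match is a greedy left-to-right scan that takes the longest vocabulary entry at each position. The sketch below shows that variant on a toy vocabulary. It is an assumption that the wrapper works this way (the description above only says the longest possible tokens are used), and greedy_max_match and toy_vocab are hypothetical names.

```python
from typing import Iterable, List


def greedy_max_match(sequence: str, vocabulary: Iterable[str]) -> List[str]:
    """Tokenize left to right, always taking the longest matching token."""
    vocab = set(vocabulary)
    if not vocab:
        raise ValueError("vocabulary must not be empty")
    max_len = max(len(token) for token in vocab)
    tokens: List[str] = []
    position = 0
    while position < len(sequence):
        # Try the longest candidate first and shrink until the vocabulary matches.
        for length in range(min(max_len, len(sequence) - position), 0, -1):
            candidate = sequence[position:position + length]
            if candidate in vocab:
                tokens.append(candidate)
                position += length
                break
        else:
            raise ValueError(f"no token matches at position {position}")
    return tokens


toy_vocab = ["A", "C", "G", "T", "AC", "ACG", "GT"]
tokens = greedy_max_match("ACGTGTA", toy_vocab)
```

A greedy scan is not guaranteed to reproduce the merges a trained BPE tokenizer would apply, which matches the caveat in the embed description for this class.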
- class bend.utils.embedders.HyenaDNAEmbedder(*args, **kwargs)[source]
Bases:
BaseEmbedder
Embed using the HyenaDNA model https://arxiv.org/abs/2306.15794
Initialize the embedder. Calls load_model with the given arguments.
- Parameters:
*args – Positional arguments. Passed to load_model.
**kwargs – Keyword arguments. Passed to load_model.
- embed(sequences: List[str], disable_tqdm: bool = False, remove_special_tokens: bool = True, upsample_embeddings: bool = False)[source]
Embed a list of sequences using the HyenaDNA model.
- Parameters:
sequences (List[str]) – List of sequences to embed.
disable_tqdm (bool, optional) – Whether to disable the tqdm progress bar. Defaults to False.
remove_special_tokens (bool, optional) – Whether to remove the CLS and SEP tokens from the embeddings. Defaults to True. Cannot be set to False if the return_loss option of the embedder is True (autoregression forces us to discard the BOS token position either way).
upsample_embeddings (bool, optional) – Whether to upsample the embeddings to match the length of the input sequences. Defaults to False. Only provided for compatibility with other embedders. HyenaDNA embeddings are already the same length as the input sequence.
- Returns:
embeddings – List of embeddings.
- Return type:
List[np.ndarray]
- load_model(model_path='pretrained_models/hyenadna/hyenadna-tiny-1k-seqlen', return_logits: bool = False, return_loss: bool = False, **kwargs)[source]
Load the HyenaDNA model.
- Parameters:
model_path (str, optional) – Path to the model checkpoint. Defaults to ‘pretrained_models/hyenadna/hyenadna-tiny-1k-seqlen’. If the path does not exist, the model will be downloaded from HuggingFace. Rather than just downloading the model, HyenaDNA’s from_pretrained method relies on cloning the HuggingFace-hosted repository, and using git lfs to download the model. This requires git lfs to be installed on your system, and will fail if it is not.
return_logits (bool, optional) – If True, returns logits instead of embeddings. Defaults to False.
return_loss (bool, optional) – If True, returns the unreduced next token prediction loss. Incompatible with return_logits. Special tokens are trimmed from the output so that the loss is only computed on the ACTGN vocabulary. Defaults to False.
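An unreduced next token prediction loss keeps one cross-entropy value per predicted token instead of averaging or summing over the sequence. The sketch below computes such values from toy logits in plain Python. It is a generic illustration, not the wrapper's implementation: next_token_losses is a hypothetical name, and how the wrapper aligns positions and trims special tokens is not shown here.

```python
import math
from typing import List, Sequence


def next_token_losses(
    logits: Sequence[Sequence[float]], token_ids: Sequence[int]
) -> List[float]:
    """Cross-entropy of predicting token t+1 from the logits at position t."""
    losses: List[float] = []
    for position in range(len(token_ids) - 1):
        row = logits[position]
        # Numerically stable log-sum-exp over the vocabulary dimension.
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        target = token_ids[position + 1]
        losses.append(log_norm - row[target])
    return losses


# Toy input: 3 positions, a vocabulary of 4 tokens, uniform logits.
logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
token_ids = [0, 2, 1]
losses = next_token_losses(logits, token_ids)
```

With uniform logits every prediction is equally uncertain, so each entry equals the natural log of the vocabulary size. The first token has no preceding position to predict it, so the result has one entry fewer than the input.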
- class bend.utils.embedders.NucleotideTransformerEmbedder(*args, **kwargs)[source]
Bases:
BaseEmbedder
Embed using the Nucleotide Transformer (NT) model https://www.biorxiv.org/content/10.1101/2023.01.11.523679v2.full
Initialize the embedder. Calls load_model with the given arguments.
- Parameters:
*args – Positional arguments. Passed to load_model.
**kwargs – Keyword arguments. Passed to load_model.
- embed(sequences: List[str], disable_tqdm: bool = False, remove_special_tokens: bool = True, upsample_embeddings: bool = False)[source]
Embed sequences using the Nucleotide Transformer (NT) model.
- Parameters:
sequences (List[str]) – List of sequences to embed.
disable_tqdm (bool, optional) – Whether to disable the tqdm progress bar. Defaults to False.
remove_special_tokens (bool, optional) – Whether to remove the special tokens from the embeddings. Defaults to True.
upsample_embeddings (bool, optional) – Whether to upsample the embeddings to the length of the input sequence. Defaults to False.
- Returns:
List of embeddings.
- Return type:
List[np.ndarray]
- load_model(model_name, return_logits: bool = False, return_loss: bool = False, **kwargs)[source]
Load the Nucleotide Transformer (NT) model.
- Parameters:
model_name (str) – The name of the model to load. When providing a name, the model will be loaded from the HuggingFace model hub. Alternatively, you can provide a path to a local model directory. We check whether the model_name contains ‘v2’ to determine whether we need to follow the V2 model API or not.
return_logits (bool, optional) – Whether to return the logits. Note that we do not apply any masking. Defaults to False.
return_loss (bool, optional) – Whether to return the loss. Note that we do not apply any masking. remove_special_tokens also ignores these dimensions when computing the loss. Defaults to False.
- class bend.utils.embedders.OneHotEmbedder(nucleotide_categories=['A', 'C', 'G', 'N', 'T'])[source]
Bases:
BaseEmbedder
One-hot encode sequences.
Get a one-hot encoder for nucleotide sequences.
- Parameters:
nucleotide_categories (List[str], optional) – List of nucleotides in the alphabet. Defaults to [‘A’, ‘C’, ‘G’, ‘N’, ‘T’].
- embed(sequences: List[str], disable_tqdm: bool = False, return_onehot: bool = False, upsample_embeddings: bool = False)[source]
One-hot encode sequences.
- Parameters:
sequences (List[str]) – List of sequences to embed.
disable_tqdm (bool, optional) – Whether to disable the tqdm progress bar. Defaults to False.
return_onehot (bool, optional) – Whether to return one-hot encoded sequences. Defaults to False. If False, returns integer-encoded sequences.
upsample_embeddings (bool, optional) – Whether to upsample the embeddings to match the length of the input sequences. Defaults to False.
- Returns:
embeddings – List of one-hot encodings or integer encodings, depending on return_onehot.
- Return type:
List[np.ndarray]
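The difference between the integer and one-hot outputs can be illustrated with the default alphabet. The sketch below is a minimal stand-in using plain lists. It assumes that each nucleotide's integer code is its index in nucleotide_categories, which this page does not state explicitly; integer_encode and onehot_encode are hypothetical helpers.

```python
from typing import List

CATEGORIES = ["A", "C", "G", "N", "T"]


def integer_encode(sequence: str, categories: List[str] = CATEGORIES) -> List[int]:
    """Map each nucleotide to its index in the category list."""
    index = {nucleotide: i for i, nucleotide in enumerate(categories)}
    # Characters outside the alphabet raise a KeyError.
    return [index[nucleotide] for nucleotide in sequence]


def onehot_encode(sequence: str, categories: List[str] = CATEGORIES) -> List[List[int]]:
    """Expand each integer code into a 0/1 vector over the categories."""
    encoded = integer_encode(sequence, categories)
    return [[1 if i == code else 0 for i in range(len(categories))] for code in encoded]


integers = integer_encode("ACGTN")
onehot = onehot_encode("ACGTN")
```

Each row of onehot has exactly one nonzero entry, at the position given by the corresponding integer code.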