openprompt

Contents

prompt_base

class Template(tokenizer: transformers.tokenization_utils.PreTrainedTokenizer, placeholder_mapping: dict = {'<text_a>': 'text_a', '<text_b>': 'text_b'})

Base class for all templates. Most of the methods are abstract, with a few exceptions that hold the methods common to all templates, such as loss_ids, save, and load.

Parameters
  • tokenizer (PreTrainedTokenizer) – A tokenizer to appoint the vocabulary and the tokenization strategy.

  • placeholder_mapping (dict) – A mapping from placeholder tokens to the attributes of the original input text that they represent.
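As an illustrative sketch (not part of the API reference itself), a concrete subclass such as ManualTemplate is typically constructed like this; the model name here is only an example:

    from openprompt.plms import load_plm
    from openprompt.prompts import ManualTemplate

    # Load a PLM together with its tokenizer (model name is illustrative).
    plm, tokenizer, model_config, wrapper_class = load_plm("bert", "bert-base-cased")

    # {"placeholder": "text_a"} is filled from the example's text_a attribute;
    # {"mask"} marks the position the PLM predicts.
    template = ManualTemplate(
        text='{"placeholder": "text_a"} It was {"mask"}.',
        tokenizer=tokenizer,
    )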

classmethod from_config(config: yacs.config.CfgNode, **kwargs)

Load a template from the template’s configuration node.

Parameters
  • config (CfgNode) – the sub-configuration of template, i.e. config[config.template] if config is a global config node.

  • kwargs – Other kwargs that might be used in initializing the template. The actual values should match the arguments of the __init__ function.

from_file(path: str, choice: int = 0)

Read the template from a local file.

Parameters
  • path (str) – The path of the local template file.

  • choice (int) – The index of the line to read from the file (default: 0).
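For instance, assuming a plain-text file with one template per line (the file name and contents here are hypothetical):

    # templates.txt might contain, one template per line:
    #   {"placeholder": "text_a"} It was {"mask"}.
    #   {"placeholder": "text_a"} All in all, it was {"mask"}.
    template = template.from_file("templates.txt", choice=1)  # read the second line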

get_default_loss_ids() List[int]

Get the loss indices for the template using the mask. E.g., when self.text is '{"placeholder": "text_a"}. {"meta": "word"} is {"mask"}.', the output is [0, 0, 0, 0, 1, 0].

Returns

A list of integers in the range [0, 1]:

  • 1 for masked tokens.

  • 0 for ordinary sequence tokens.

Return type

List[int]
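To make the correspondence concrete, here is a minimal sketch of how the default could be computed from the parsed template, one dict per segment (the parsed form shown is an assumption for illustration):

    # Parsed form of '{"placeholder": "text_a"}. {"meta": "word"} is {"mask"}.'
    parsed = [
        {"placeholder": "text_a"},
        {"text": "."},
        {"meta": "word"},
        {"text": "is"},
        {"mask": None},
        {"text": "."},
    ]
    loss_ids = [1 if "mask" in d else 0 for d in parsed]
    print(loss_ids)  # [0, 0, 0, 0, 1, 0]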

get_default_shortenable_ids() List[int]

Every template needs shortenable_ids, denoting which parts of the template can be truncated to fit the language model’s max_seq_length. Default: the input text is shortenable, while the template text and other special tokens are not.

E.g., when self.text is '{"placeholder": "text_a"} {"placeholder": "text_b", "shortenable": False} {"meta": "word"} is {"mask"}.', the output is [1, 0, 0, 0, 0, 0, 0].

Returns

A list of integers in the range [0, 1]:

  • 1 for the input tokens.

  • 0 for the template sequence tokens.

Return type

List[int]
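A minimal sketch of the default rule, reusing the parsed-segment representation from the loss_ids example above:

    def default_shortenable(segment):
        # An explicit "shortenable" flag wins; otherwise only the
        # placeholder (input-text) segments are shortenable.
        if "shortenable" in segment:
            return 1 if segment["shortenable"] else 0
        return 1 if "placeholder" in segment else 0

    shortenable_ids = [default_shortenable(d) for d in parsed]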

get_default_soft_token_ids() List[int]

This function identifies which tokens are soft tokens.

Sometimes tokens in the template are not from the vocabulary but are a sequence of soft tokens. In that case, you need to implement this function.

Raises

NotImplementedError – raised by default; if needed, add soft_token_ids to the registered_inputflag_names attribute of the Template subclass and implement this method.
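A hedged sketch of what such a subclass could look like (the class name and the per-segment "soft" key are illustrative assumptions):

    from typing import List

    class MySoftTemplate(Template):
        # Register the extra flag so it is wrapped alongside loss_ids etc.
        registered_inputflag_names = ["loss_ids", "shortenable_ids", "soft_token_ids"]

        def get_default_soft_token_ids(self) -> List[int]:
            # 1 marks a soft (trainable) token, 0 a normal vocabulary token.
            return [1 if "soft" in d else 0 for d in self.text]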

incorporate_text_example(example: openprompt.data_utils.utils.InputExample)
abstract on_text_set()

A hook called when the template text is set. The designer of the template should know explicitly what should be done when the template text is set.

parse_text(text: str) List[Dict]
post_processing_outputs(outputs)

Post-process the outputs of the language model according to the needs of the template. Most templates don’t need post-processing. A template like SoftTemplate, which adds the soft prompt as a module (rather than as a sequence of input tokens) to the input, should remove the outputs at these positions to keep seq_len unchanged.

abstract process_batch(batch)

A Template subclass should override this method if it needs to process the batch input, e.g., to substitute embeddings.

registered_inputflag_names = ['loss_ids', 'shortenable_ids']
safe_on_text_set() None

With this wrapper function, setting text inside on_text_set() will not trigger on_text_set() again, preventing endless recursion.

save(path: str, **kwargs) None

An API method to save the template to a file.

Parameters

path (str) – A path to save your template.

property text
wrap_one_example(example: openprompt.data_utils.utils.InputExample) List[Dict]

Given an input example containing input text that can be referenced by the values of self.template.placeholder_mapping, this function processes the example into a list of dicts. Each dict functions as a group that shares the same properties, such as whether it is shortenable, whether it is the masked position, and whether it is a soft token. Since the text will be tokenized in the subsequent processing procedure, these attributes are broadcast along the tokenized sentence.

Parameters

example (InputExample) – An InputExample object, which should have attributes that are able to be filled in the template.

Returns

A list of dicts of the same length as self.text, e.g. [{"loss_ids": 0, "text": "It was"}, {"loss_ids": 1, "text": "<mask>"}, ]

Return type

List[Dict]
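A hedged usage sketch, continuing the ManualTemplate example from above (the example text is illustrative):

    from openprompt.data_utils import InputExample

    example = InputExample(guid=0, text_a="Albert Einstein was a physicist.")
    wrapped = template.wrap_one_example(example)
    # Each group carries its text plus flags such as loss_ids and
    # shortenable_ids, e.g. {"text": " It was", "loss_ids": 0, ...},
    # {"text": "<mask>", "loss_ids": 1, ...}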

training: bool
class Verbalizer(tokenizer: Optional[transformers.tokenization_utils.PreTrainedTokenizer] = None, classes: Optional[Sequence[str]] = None, num_classes: Optional[int] = None)

Base class for all the verbalizers.

Parameters
  • tokenizer (PreTrainedTokenizer) – A tokenizer to appoint the vocabulary and the tokenization strategy.

  • classes (Sequence[str]) – A sequence of classes that need to be projected.

  • num_classes (int, optional) – The number of classes; may be given instead of classes.
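As an illustrative sketch, a concrete subclass such as ManualVerbalizer is typically constructed like this (the classes and label words are examples, and tokenizer is the one loaded earlier):

    from openprompt.prompts import ManualVerbalizer

    verbalizer = ManualVerbalizer(
        tokenizer=tokenizer,
        classes=["negative", "positive"],
        label_words={"negative": ["bad"], "positive": ["good", "wonderful"]},
    )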

static aggregate(label_words_logits: torch.Tensor) torch.Tensor

Aggregate the logits of multiple label words into the label’s logits. The basic aggregator takes the mean of each label’s word logits as the label’s logits. Can be re-implemented in an advanced verbalizer.

Parameters

label_words_logits (torch.Tensor) – The logits of the label words only.

Returns

The final logits calculated by the label words.

Return type

torch.Tensor
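A minimal sketch of this default mean aggregation (shapes are illustrative):

    import torch

    # (batch_size, num_classes, num_label_words_per_class)
    label_words_logits = torch.randn(4, 2, 3)
    # Mean over each class's label words -> (batch_size, num_classes)
    label_logits = label_words_logits.mean(dim=-1)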

classmethod from_config(config: yacs.config.CfgNode, **kwargs)

Load a verbalizer from the verbalizer’s configuration node.

Parameters
  • config (CfgNode) – the sub-configuration of verbalizer, i.e. config[config.verbalizer] if config is a global config node.

  • kwargs – Other kwargs that might be used in initializing the verbalizer. The actual values should match the arguments of the __init__ function.

from_file(path: str, choice: Optional[int] = 0)

Load the predefined label words from a verbalizer file. Three file formats are currently supported:

  1. a .jsonl or .json file containing a single verbalizer in dict format;

  2. a .jsonl or .json file containing a list of verbalizers in dict format;

  3. a .txt or .csv file, in which the label words of each class are listed on one line, separated by commas; an empty line begins a new verbalizer. This format is recommended when you don’t know the name of each class.

The details of verbalizer format can be seen in How to Write a Verbalizer?.
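For the .txt format, a hypothetical file and its use might look like this:

    # sentiment_words.txt -- one class per line, words separated by commas;
    # an empty line begins a second verbalizer:
    #
    #   bad,terrible,awful
    #   good,wonderful,great
    #
    #   poor,dull
    #   fine,enjoyable
    verbalizer = verbalizer.from_file("sentiment_words.txt", choice=0)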

Parameters
  • path (str) – The path of the local verbalizer file.

  • choice (int) – The choice of verbalizer in a file containing multiple verbalizers.

Returns

self object

Return type

Verbalizer

gather_outputs(outputs: transformers.file_utils.ModelOutput)

Retrieve the output useful to the verbalizer from the whole model output. By default, only the logits are retrieved.

Parameters

outputs (ModelOutput) –

Returns

The gathered output, of shape (batch_size, seq_len, any).

Return type

torch.Tensor

abstract generate_parameters(**kwargs) List

The verbalizer can be seen as an extra layer on top of the original pre-trained model. In a manual verbalizer, it is a fixed one-hot vector of dimension vocab_size, with the position of the label word being 1 and 0 everywhere else. In other situations, the parameters may be a continuous vector over the vocab, with each dimension representing a weight for that token. Moreover, the parameters may be set trainable to allow label-word selection.

Therefore, this function serves as an abstract method for generating the parameters of the verbalizer, and must be implemented in any derived class.

Note that the parameters need to be registered as part of the PyTorch module; this can be achieved by wrapping a tensor with nn.Parameter().
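A hedged sketch of what a derived class might generate, following the one-hot description above (label_words_ids is an illustrative attribute holding each class's label-word vocabulary ids):

    import torch
    import torch.nn as nn

    def generate_parameters(self):
        # One row per class: 1.0 at each label word's vocabulary id.
        weights = torch.zeros(len(self.classes), self.tokenizer.vocab_size)
        for class_idx, word_ids in enumerate(self.label_words_ids):
            weights[class_idx, word_ids] = 1.0
        # nn.Parameter registers the tensor as part of the module's state;
        # set requires_grad=True to make the label-word weights trainable.
        self.label_words_weights = nn.Parameter(weights, requires_grad=False)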

handle_multi_token(label_words_logits, mask)

Supports multiple methods for handling the multiple tokens produced by the tokenizer. We suggest using ‘first’ or ‘max’ if some parts of the tokenization are not meaningful. Can broadcast to a 3-d tensor.

Parameters

label_words_logits (torch.Tensor) –

Returns

torch.Tensor
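A hedged sketch of the ‘first’ and ‘max’ strategies (padding positions are suppressed via the mask; the function name is illustrative):

    import torch

    def reduce_multi_token(label_words_logits, mask, method="first"):
        # label_words_logits: (..., num_tokens); mask: same shape, 1 for real tokens.
        if method == "first":
            return label_words_logits[..., 0]
        if method == "max":
            masked = label_words_logits - 1000 * (1 - mask)  # push padding far down
            return masked.max(dim=-1).values
        raise ValueError(f"unsupported method: {method}")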

property label_words

Label words are the words in the vocabulary that the labels are projected to. E.g., if we establish a projection in sentiment classification: positive \(\rightarrow\) {wonderful, good}, then wonderful and good are the label words.

normalize(logits: torch.Tensor) torch.Tensor

Given the logits over the entire vocabulary, calculate the probabilities over the label-word set via softmax.

Parameters

logits (Tensor) – The logits of the entire vocab.

Returns

The probability distribution over the label words set.

Return type

Tensor
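A minimal sketch of a softmax normalization over the flattened label-word logits (an assumption about the default behaviour, mirroring the description above):

    import torch.nn.functional as F

    def normalize(logits):
        # logits: (batch_size, ...) over the label-word set.
        batch_size = logits.shape[0]
        probs = F.softmax(logits.reshape(batch_size, -1), dim=-1)
        return probs.reshape(*logits.shape)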

on_label_words_set()

A hook called when the textual label words are set.

process_outputs(outputs: torch.Tensor, batch: Union[Dict, openprompt.data_utils.utils.InputFeatures], **kwargs)

By default, the verbalizer will process the logits of the PLM’s output.

Parameters
  • outputs (torch.Tensor) – The current logits generated by the pre-trained language model.

  • batch (Union[Dict, InputFeatures]) – The input features of the data.

abstract project(logits: torch.Tensor, **kwargs) torch.Tensor

This method receives the logits over the entire vocabulary and uses the parameters of this verbalizer to project them onto the logits of the label words.

Parameters

logits (Tensor) – The logits over the entire vocabulary generated by the pre-trained language model, of shape [batch_size, max_seq_length, vocab_size].

Returns

The logits of the label words.

Return type

Tensor
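In the manual one-hot case described under generate_parameters, the projection reduces to indexing the vocabulary logits at the label-word ids, sketched below (label_words_ids is the same illustrative attribute as above, assumed to be an index tensor):

    def project(self, logits, **kwargs):
        # logits: (..., vocab_size) -> (..., num_classes, num_words_per_class)
        return logits[..., self.label_words_ids]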

register_calibrate_logits(logits: torch.Tensor)

This function registers logits that need to be calibrated, and detaches the original logits from the current graph.

safe_on_label_words_set()
property vocab: Dict
property vocab_size: int
training: bool