Data Processors

Basic Processor

Abstract class that provides methods for loading train/dev/test/unlabeled examples for a given task.

class DataProcessor(labels: Optional[Sequence[Any]] = None, labels_path: Optional[str] = None)

The labels of the dataset are optional.

Here are two examples of loading the labels (a fuller sketch follows the parameter list below):

I: DataProcessor(labels = ['positive', 'negative'])

II: DataProcessor(labels_path = 'datasets/labels.txt'), where the labels file should contain label names separated by any whitespace characters, for example:

positive neutral
negative
Parameters
  • labels (Sequence[Any], optional) – class labels of the dataset. Defaults to None.

  • labels_path (str, optional) – path to a file from which the labels are loaded when labels is None. Defaults to None.
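
A minimal sketch of the two construction styles, mirroring examples I and II above (the import path, temporary directory, and file name are illustrative assumptions; depending on the installed version, a concrete subclass may be required instead of instantiating DataProcessor directly):

import os
import tempfile

# NOTE: the import path is an assumption; adjust it to wherever DataProcessor
# lives in your installed OpenPrompt version.
from openprompt.data_utils import DataProcessor

# I: pass the class labels directly.
processor_a = DataProcessor(labels=['positive', 'negative'])

# II: load the labels from a whitespace-separated file (hypothetical path).
labels_dir = tempfile.mkdtemp()
labels_file = os.path.join(labels_dir, 'labels.txt')
with open(labels_file, 'w') as f:
    f.write("positive neutral\nnegative\n")
processor_b = DataProcessor(labels_path=labels_file)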

get_dev_examples(data_dir: Optional[str] = None) List[openprompt.data_utils.utils.InputExample]

Get dev examples from the development file under data_dir.

Calls get_examples(data_dir, "dev"); see get_examples().

abstract get_examples(data_dir: Optional[str] = None, split: Optional[str] = None) List[openprompt.data_utils.utils.InputExample]

Get the split of the dataset under data_dir.

data_dir is the base path of the dataset; for example, the training file could be located at data_dir/train.txt.

Parameters
  • data_dir (str) – the base path of the dataset

  • split (str) – train / dev / test / unlabeled

Returns

a list of InputExample

Return type

List[InputExample]
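
As a hedged illustration, a concrete subclass might implement get_examples() like the sketch below, assuming a hypothetical dataset laid out as data_dir/{split}.txt with one "label<TAB>text" pair per line (the file format, class name, and import paths are illustrative assumptions, not part of the documented API):

import os
from typing import List, Optional

# NOTE: import paths are assumptions; adjust them for your installed version.
from openprompt.data_utils import DataProcessor, InputExample

class MyTSVProcessor(DataProcessor):
    # Hypothetical processor for a dataset stored as data_dir/{split}.txt,
    # one tab-separated "label<TAB>text" example per line.
    def __init__(self):
        super().__init__(labels=["negative", "positive"])

    def get_examples(self, data_dir: Optional[str] = None, split: Optional[str] = None) -> List[InputExample]:
        path = os.path.join(data_dir, f"{split}.txt")
        examples = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                label, text = line.rstrip("\n").split("\t", 1)
                examples.append(InputExample(text_a=text, label=self.get_label_id(label)))
        return examples

Because each split-specific getter simply calls get_examples(data_dir, split), the inherited get_train_examples(), get_dev_examples(), and get_test_examples() then work without further code; for example, MyTSVProcessor().get_train_examples("datasets/my_dataset") would read datasets/my_dataset/train.txt.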

get_label_id(label: Any) int

Get the label id (index) of the given label.

Parameters

label – a label in the dataset

Returns

the index of the label

Return type

int

get_labels() List[Any]

Get the labels of the dataset.

Returns

labels of the dataset

Return type

List[Any]

get_num_labels()

Get the number of labels in the dataset.

Returns

number of labels in the dataset

Return type

int
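
Taken together, the label accessors behave as in this minimal sketch (mirroring example I above; the import path and the index ordering in the asserts are assumptions, and a concrete subclass may be required if your version enforces the abstract get_examples()):

# NOTE: the import path is an assumption; adjust it for your installed version.
from openprompt.data_utils import DataProcessor

processor = DataProcessor(labels=['positive', 'negative'])
assert processor.get_labels() == ['positive', 'negative']
assert processor.get_num_labels() == 2
assert processor.get_label_id('negative') == 1  # index of 'negative' in get_labels()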

get_test_examples(data_dir: Optional[str] = None) List[openprompt.data_utils.utils.InputExample]

Get test examples from the test file under data_dir.

Calls get_examples(data_dir, "test"); see get_examples().

get_train_examples(data_dir: Optional[str] = None) List[openprompt.data_utils.utils.InputExample]

Get train examples from the training file under data_dir.

Calls get_examples(data_dir, "train"); see get_examples().

get_unlabeled_examples(data_dir: Optional[str] = None) List[openprompt.data_utils.utils.InputExample]

Get unlabeled examples from the unlabeled file under data_dir.

Calls get_examples(data_dir, "unlabeled"); see get_examples().

Text Classification Processor

AgnewsProcessor

class AgnewsProcessor

AG News is a news topic classification dataset.

We use the dataset provided by LOTClass.

Examples:

import os
from openprompt.data_utils.text_classification_dataset import PROCESSORS

base_path = "datasets/TextClassification"

dataset_name = "agnews"
dataset_path = os.path.join(base_path, dataset_name)
processor = PROCESSORS[dataset_name.lower()]()
trainvalid_dataset = processor.get_train_examples(dataset_path)
test_dataset = processor.get_test_examples(dataset_path)

assert processor.get_num_labels() == 4
assert processor.get_labels() == ["World", "Sports", "Business", "Tech"]
assert len(trainvalid_dataset) == 120000
assert len(test_dataset) == 7600
assert test_dataset[0].text_a == "Fears for T N pension after talks"
assert test_dataset[0].text_b == "Unions representing workers at Turner   Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul."
assert test_dataset[0].label == 2

get_examples(data_dir, split)

Get the split of the dataset under data_dir.

data_dir is the base path of the dataset; for example, the training file could be located at data_dir/train.txt.

Parameters
  • data_dir (str) – the base path of the dataset

  • split (str) – train / dev / test / unlabeled

Returns

a list of InputExample

Return type

List[InputExample]

DBpediaProcessor

class DBpediaProcessor

DBpedia is a Wikipedia topic classification dataset.

We use the dataset provided by LOTClass.

Examples:

import os
from openprompt.data_utils.text_classification_dataset import PROCESSORS

base_path = "datasets/TextClassification"

dataset_name = "dbpedia"
dataset_path = os.path.join(base_path, dataset_name)
processor = PROCESSORS[dataset_name.lower()]()
trainvalid_dataset = processor.get_train_examples(dataset_path)
test_dataset = processor.get_test_examples(dataset_path)

assert processor.get_num_labels() == 14
assert len(trainvalid_dataset) == 560000
assert len(test_dataset) == 70000

get_examples(data_dir, split)

Get the split of the dataset under data_dir.

data_dir is the base path of the dataset; for example, the training file could be located at data_dir/train.txt.

Parameters
  • data_dir (str) – the base path of the dataset

  • split (str) – train / dev / test / unlabeled

Returns

a list of InputExample

Return type

List[InputExample]

ImdbProcessor

class ImdbProcessor

IMDB is a movie review sentiment classification dataset.

We use the dataset provided by LOTClass.

Examples:

import os
from openprompt.data_utils.text_classification_dataset import PROCESSORS

base_path = "datasets/TextClassification"

dataset_name = "imdb"
dataset_path = os.path.join(base_path, dataset_name)
processor = PROCESSORS[dataset_name.lower()]()
trainvalid_dataset = processor.get_train_examples(dataset_path)
test_dataset = processor.get_test_examples(dataset_path)

assert processor.get_num_labels() == 2
assert len(trainvalid_dataset) == 25000
assert len(test_dataset) == 25000

get_examples(data_dir, split)

Get the split of the dataset under data_dir.

data_dir is the base path of the dataset; for example, the training file could be located at data_dir/train.txt.

Parameters
  • data_dir (str) – the base path of the dataset

  • split (str) – train / dev / test / unlabeled

Returns

a list of InputExample

Return type

List[InputExample]

SST2Processor

class SST2Processor

SST-2 is a dataset for sentiment analysis. It is a modified version containing only binary labels (negative or somewhat negative vs. somewhat positive or positive, with neutral sentences discarded), built on top of the original 5-label dataset first released in Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank

We use the data released in Making Pre-trained Language Models Better Few-shot Learners (Gao et al. 2020)

Examples:

import os
from openprompt.data_utils.lmbff_dataset import PROCESSORS

base_path = "datasets/TextClassification"

dataset_name = "SST-2"
dataset_path = os.path.join(base_path, dataset_name)
processor = PROCESSORS[dataset_name.lower()]()
train_dataset = processor.get_train_examples(dataset_path)
dev_dataset = processor.get_dev_examples(dataset_path)
test_dataset = processor.get_test_examples(dataset_path)

assert processor.get_num_labels() == 2
assert processor.get_labels() == ['0','1']
assert len(train_dataset) == 6920
assert len(dev_dataset) == 872
assert len(test_dataset) == 1821
assert train_dataset[0].text_a == 'a stirring , funny and finally transporting re-imagining of beauty and the beast and 1930s horror films'
assert train_dataset[0].label == 1

get_examples(data_dir, split)

Get the split of the dataset under data_dir.

data_dir is the base path of the dataset; for example, the training file could be located at data_dir/train.txt.

Parameters
  • data_dir (str) – the base path of the dataset

  • split (str) – train / dev / test / unlabeled

Returns

a list of InputExample

Return type

List[InputExample]

Entity Typing Processor

FewNERDProcessor

class FewNERDProcessor

Few-NERD is a large-scale, fine-grained, manually annotated named entity recognition dataset

It was released together with Few-NERD: Not Only a Few-shot NER Dataset (Ning Ding et al. 2021)

Examples:

import os
from openprompt.data_utils.typing_dataset import PROCESSORS

base_path = "datasets/Typing"

dataset_name = "FewNERD"
dataset_path = os.path.join(base_path, dataset_name)
processor = PROCESSORS[dataset_name.lower()]()
train_dataset = processor.get_train_examples(dataset_path)
dev_dataset = processor.get_dev_examples(dataset_path)
test_dataset = processor.get_test_examples(dataset_path)

assert processor.get_num_labels() == 66
assert processor.get_labels() == [
    "person-actor", "person-director", "person-artist/author", "person-athlete", "person-politician", "person-scholar", "person-soldier", "person-other",
    "organization-showorganization", "organization-religion", "organization-company", "organization-sportsteam", "organization-education", "organization-government/governmentagency", "organization-media/newspaper", "organization-politicalparty", "organization-sportsleague", "organization-other",
    "location-GPE", "location-road/railway/highway/transit", "location-bodiesofwater", "location-park", "location-mountain", "location-island", "location-other",
    "product-software", "product-food", "product-game", "product-ship", "product-train", "product-airplane", "product-car", "product-weapon", "product-other",
    "building-theater", "building-sportsfacility", "building-airport", "building-hospital", "building-library", "building-hotel", "building-restaurant", "building-other",
    "event-sportsevent", "event-attack/battle/war/militaryconflict", "event-disaster", "event-election", "event-protest", "event-other",
    "art-music", "art-writtenart", "art-film", "art-painting", "art-broadcastprogram", "art-other",
    "other-biologything", "other-chemicalthing", "other-livingthing", "other-astronomything", "other-god", "other-law", "other-award", "other-disease", "other-medical", "other-language", "other-currency", "other-educationaldegree",
]
assert dev_dataset[0].text_a == "The final stage in the development of the Skyfox was the production of a model with tricycle landing gear to better cater for the pilot training market ."
assert dev_dataset[0].meta["entity"] == "Skyfox"
assert dev_dataset[0].label == 30
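
Continuing the example above, the integer label can be mapped back to its name with get_labels():

# Continuing the example above: map the integer label back to its name.
assert processor.get_labels()[dev_dataset[0].label] == "product-airplane"  # index 30 in the list above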

Relation Classification Processor

TACREDProcessor

class TACREDProcessor

TAC Relation Extraction Dataset (TACRED) is one of the largest and most widely used datasets for relation classification. It was released together with the paper Position-aware Attention and Supervised Data Improve Slot Filling (Zhang et al. 2017). This processor is also inherited by TACREVProcessor and ReTACREDProcessor.

Examples:

import os
from openprompt.data_utils.relation_classification_dataset import PROCESSORS

base_path = "datasets/RelationClassification"

dataset_name = "TACRED"
dataset_path = os.path.join(base_path, dataset_name)
processor = PROCESSORS[dataset_name.lower()]()
train_dataset = processor.get_train_examples(dataset_path)
dev_dataset = processor.get_dev_examples(dataset_path)
test_dataset = processor.get_test_examples(dataset_path)

assert processor.get_num_labels() == 42
assert processor.get_labels() == ["no_relation", "org:founded", "org:subsidiaries", "per:date_of_birth", "per:cause_of_death", "per:age", "per:stateorprovince_of_birth", "per:countries_of_residence", "per:country_of_birth", "per:stateorprovinces_of_residence", "org:website", "per:cities_of_residence", "per:parents", "per:employee_of", "per:city_of_birth", "org:parents", "org:political/religious_affiliation", "per:schools_attended", "per:country_of_death", "per:children", "org:top_members/employees", "per:date_of_death", "org:members", "org:alternate_names", "per:religion", "org:member_of", "org:city_of_headquarters", "per:origin", "org:shareholders", "per:charges", "per:title", "org:number_of_employees/members", "org:dissolved", "org:country_of_headquarters", "per:alternate_names", "per:siblings", "org:stateorprovince_of_headquarters", "per:spouse", "per:other_family", "per:city_of_death", "per:stateorprovince_of_death", "org:founded_by"]
assert len(train_dataset) == 68124
assert len(dev_dataset) == 22631
assert len(test_dataset) == 15509
assert train_dataset[0].text_a == 'Tom Thabane resigned in October last year to form the All Basotho Convention -LRB- ABC -RRB- , crossing the floor with 17 members of parliament , causing constitutional monarch King Letsie III to dissolve parliament and call the snap election .'
assert train_dataset[0].meta["head"] == "All Basotho Convention"
assert train_dataset[0].meta["tail"] == "Tom Thabane"
assert train_dataset[0].label == 41

TACREVProcessor

class TACREVProcessor

TACRED Revisited (TACREV) is a variant of the TACRED dataset.

It was proposed by the paper TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task (Alt et al. 2020)

This processor inherits TACREDProcessor and can be used similarly.

Examples:

import os
from openprompt.data_utils.relation_classification_dataset import PROCESSORS

base_path = "datasets/RelationClassification"

dataset_name = "TACREV"
dataset_path = os.path.join(base_path, dataset_name)
processor = PROCESSORS[dataset_name.lower()]()
train_dataset = processor.get_train_examples(dataset_path)
dev_dataset = processor.get_dev_examples(dataset_path)
test_dataset = processor.get_test_examples(dataset_path)
assert processor.get_num_labels() == 42
assert processor.get_labels() == ["no_relation", "org:founded", "org:subsidiaries", "per:date_of_birth", "per:cause_of_death", "per:age", "per:stateorprovince_of_birth", "per:countries_of_residence", "per:country_of_birth", "per:stateorprovinces_of_residence", "org:website", "per:cities_of_residence", "per:parents", "per:employee_of", "per:city_of_birth", "org:parents", "org:political/religious_affiliation", "per:schools_attended", "per:country_of_death", "per:children", "org:top_members/employees", "per:date_of_death", "org:members", "org:alternate_names", "per:religion", "org:member_of", "org:city_of_headquarters", "per:origin", "org:shareholders", "per:charges", "per:title", "org:number_of_employees/members", "org:dissolved", "org:country_of_headquarters", "per:alternate_names", "per:siblings", "org:stateorprovince_of_headquarters", "per:spouse", "per:other_family", "per:city_of_death", "per:stateorprovince_of_death", "org:founded_by"]
assert len(train_dataset) == 68124
assert len(dev_dataset) == 22631
assert len(test_dataset) == 15509

ReTACREDProcessor

class ReTACREDProcessor

Re-TACRED is a variant of the TACRED dataset

It was proposed by the paper Re-TACRED: Addressing Shortcomings of the TACRED Dataset (Stoica et al. 2021)

This processor inherits TACREDProcessor and can be used similarly.

Examples:

import os
from openprompt.data_utils.relation_classification_dataset import PROCESSORS

base_path = "datasets/RelationClassification"

dataset_name = "ReTACRED"
dataset_path = os.path.join(base_path, dataset_name)
processor = PROCESSORS[dataset_name.lower()]()
train_dataset = processor.get_train_examples(dataset_path)
dev_dataset = processor.get_dev_examples(dataset_path)
test_dataset = processor.get_test_examples(dataset_path)
assert processor.get_num_labels() == 40
assert processor.get_labels() == ["no_relation", "org:members", "per:siblings", "per:spouse", "org:country_of_branch", "per:country_of_death", "per:parents", "per:stateorprovinces_of_residence", "org:top_members/employees", "org:dissolved", "org:number_of_employees/members", "per:stateorprovince_of_death", "per:origin", "per:children", "org:political/religious_affiliation", "per:city_of_birth", "per:title", "org:shareholders", "per:employee_of", "org:member_of", "org:founded_by", "per:countries_of_residence", "per:other_family", "per:religion", "per:identity", "per:date_of_birth", "org:city_of_branch", "org:alternate_names", "org:website", "per:cause_of_death", "org:stateorprovince_of_branch", "per:schools_attended", "per:country_of_birth", "per:date_of_death", "per:city_of_death", "org:founded", "per:cities_of_residence", "per:age", "per:charges", "per:stateorprovince_of_birth"]
assert len(train_dataset) == 58465
assert len(dev_dataset) == 19584
assert len(test_dataset) == 13418

SemEvalProcessor

class SemEvalProcessor

SemEval-2010 Task 8 is a traditional dataset for relation classification.

It was released together with the paper SemEval-2010 Task 8: Multi-Way Classification of Semantic Relations Between Pairs of Nominals (Hendrickx et al. 2010)

Examples:

import os
from openprompt.data_utils.relation_classification_dataset import PROCESSORS

base_path = "datasets/RelationClassification"

dataset_name = "SemEval"
dataset_path = os.path.join(base_path, dataset_name)
processor = PROCESSORS[dataset_name.lower()]()
train_dataset = processor.get_train_examples(dataset_path)
dev_dataset = processor.get_dev_examples(dataset_path)
test_dataset = processor.get_test_examples(dataset_path)
assert processor.get_num_labels() == 19
assert processor.get_labels() == ["Other", "Member-Collection(e1,e2)", "Entity-Destination(e1,e2)", "Content-Container(e1,e2)", "Message-Topic(e1,e2)", "Entity-Origin(e1,e2)", "Cause-Effect(e1,e2)", "Product-Producer(e1,e2)", "Instrument-Agency(e1,e2)", "Component-Whole(e1,e2)", "Member-Collection(e2,e1)", "Entity-Destination(e2,e1)", "Content-Container(e2,e1)", "Message-Topic(e2,e1)", "Entity-Origin(e2,e1)", "Cause-Effect(e2,e1)", "Product-Producer(e2,e1)", "Instrument-Agency(e2,e1)", "Component-Whole(e2,e1)"]
assert len(train_dataset) == 6507
assert len(dev_dataset) == 1493
assert len(test_dataset) == 2717
assert dev_dataset[0].text_a == 'the system as described above has its greatest application in an arrayed configuration of antenna elements .'
assert dev_dataset[0].meta["head"] == "configuration"
assert dev_dataset[0].meta["tail"] == "elements"
assert dev_dataset[0].label == 18

Language Inference Processor

SNLIProcessor

class SNLIProcessor

The Stanford Natural Language Inference (SNLI) corpus is a dataset for natural language inference. It was first released in A large annotated corpus for learning natural language inference (Bowman et al. 2015)

We use the data released in Making Pre-trained Language Models Better Few-shot Learners (Gao et al. 2020)

Examples:

import os
from openprompt.data_utils.lmbff_dataset import PROCESSORS

base_path = "datasets"

dataset_name = "SNLI"
dataset_path = os.path.join(base_path, dataset_name, '16-13')
processor = PROCESSORS[dataset_name.lower()]()
train_dataset = processor.get_train_examples(dataset_path)
dev_dataset = processor.get_dev_examples(dataset_path)
test_dataset = processor.get_test_examples(dataset_path)

assert processor.get_num_labels() == 3
assert processor.get_labels() == ['entailment', 'neutral', 'contradiction']
assert len(train_dataset) == 549367
assert len(dev_dataset) == 9842
assert len(test_dataset) == 9824
assert train_dataset[0].text_a == 'A person on a horse jumps over a broken down airplane.'
assert train_dataset[0].text_b == 'A person is training his horse for a competition.'
assert train_dataset[0].label == 1

Conditional Generation Processor

WebNLGProcessor

class WebNLGProcessor

WebNLG is a dataset for generating natural-language text from sets of RDF triples.

# TODO citation

Examples:

import os
from openprompt.data_utils.conditional_generation_dataset import PROCESSORS

base_path = "datasets/CondGen"

dataset_name = "webnlg_2017"
dataset_path = os.path.join(base_path, dataset_name)
processor = PROCESSORS[dataset_name.lower()]()
train_dataset = processor.get_train_examples(dataset_path)
valid_dataset = processor.get_train_examples(dataset_path)
test_dataset = processor.get_test_examples(dataset_path)

assert len(train_dataset) == 18025
assert len(valid_dataset) == 18025
assert len(test_dataset) == 4928
assert test_dataset[0].text_a == " | Abilene_Regional_Airport : cityServed : Abilene,_Texas"
assert test_dataset[0].text_b == ""
assert test_dataset[0].tgt_text == "Abilene, Texas is served by the Abilene regional airport."

get_examples(data_dir: str, split: str) List[openprompt.data_utils.utils.InputExample]

Get the split of the dataset under data_dir.

data_dir is the base path of the dataset; for example, the training file could be located at data_dir/train.txt.

Parameters
  • data_dir (str) – the base path of the dataset

  • split (str) – train / dev / test / unlabeled

Returns

a list of InputExample

Return type

List[InputExample]