Train a Hugging Face model on Programmatic Data
You can export Programmatic data directly into formats that are compatible with Hugging Face Transformers.
In this guide you'll take an existing Programmatic project, export its dataset to JSON, and then use that data to train a model from the Hugging Face library.
This guide assumes you already have a Programmatic project. If you're starting from scratch, we recommend following our Build a Legal NER Model tutorial first.
This tutorial is also available as a Google Colab notebook.

1. Install the Hugging Face datasets library and some prerequisites

If you haven't already, install the Hugging Face 🤗 Datasets library by opening your terminal and running (you may wish to do this in a virtual environment):
pip install datasets spacy==3.3.0 transformers torch seqeval numpy pandas
Further installation instructions and documentation can be found here.
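The tokenization step later in this guide (step 6) uses spaCy's en_core_web_sm pipeline, which isn't included in the pip install above. If you don't already have it, download it with:

python -m spacy download en_core_web_sm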

2. Make sure your labelling function results are up-to-date

To make sure you're getting the latest results from all of your labelling functions, run the following command in the terminal:
$ humanloop run-all-functions <your project directory>
This will re-run all of your labelling functions and persist the results to the database. Depending on how many functions you have and how complex they are, this may take some time.

3. Export your data from the command line

This step will create a JSON file containing your entire dataset, including labels from the labelling functions. In a terminal, run the following command:
$ humanloop export <your project directory>
By default, this will create a file named export_<your project name>_<timestamp>.json, but you can override this by passing the --output <filename> option to the command.
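For example, to export to a fixed filename instead of the timestamped default (the filename here is just an illustration):

$ humanloop export <your project directory> --output my_export.json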

4. Load the file using the datasets library

Open a Jupyter notebook or other Python environment, then import the datasets library and load the exported file:

from datasets import load_dataset

dataset = load_dataset(
    'json',
    data_files='<path to your exported file>',
    field='datapoints',
    split="train"
)
Note the specification of field='datapoints'. There is other information contained within the export file, but the datapoints array is the most useful part for training models.
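If you want to see what else the export file contains, you can peek at its top-level keys with the standard library (a minimal sketch, assuming the export is a single JSON object as the field argument above implies):

import json

# Load the raw export file and list its top-level keys
with open('<path to your exported file>') as f:
    raw_export = json.load(f)

print(raw_export.keys())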
To make it easier to work with your datapoints, there is also the .flatten() method which decreases the nesting:
>>> dataset.flatten()
Dataset({
    features: ['id', 'programmatic.aggregateResults', 'programmatic.groundTruths', 'programmatic.results', 'data.intent', 'data.text'],
    num_rows: 2000
})
These operations are only scratching the surface. Via the datasets API you can transform your data in many ways to prepare it for model training. For more details, see the Hugging Face Process documentation.
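For example, a minimal sketch (using the flattened column names shown above) that keeps only datapoints with at least one labelling function result:

# Keep only datapoints that received at least one labelling function result
filtered = dataset.flatten().filter(
    lambda example: len(example["programmatic.results"]) > 0
)
print(len(filtered))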

5. Explore your dataset

By default, your dataset will have the following three columns:
  • Your original data
  • Programmatic-derived data
  • Datapoint IDs
>>> dataset.column_names
['programmatic', 'data', 'id']
You can inspect a single datapoint to see its original fields:
>>> dataset[0]['data']
{
    'intent': 'GetWeather',
    'text': 'What will the weather be this year in Horseshoe Lake State Fish and Wildlife Area?'
}
Or see the results of labelling functions applied to it. Information from the programmatic app is stored under the programmatic key in the dictionary of a single datapoint:
>>> dataset[0]['programmatic']
{
    'aggregateResults': [],
    'groundTruths': [
        {
            'end': 34,
            'label': 'timeRange',
            'labelId': '9',
            'start': 25,
            'text': 'this year'
        },
    ],
    ...
}

6. Prepare the dataset for training

The dataset already has span information in the form of character offsets into a string of text, but to train a model we will first tokenize the text, then convert the spans to BILOU tag format (Beginning, Inside, Last, Unit, Outside) and encode them numerically.
For example, the data might look like this:
{
    "text": "I am going to Nevada",
    "spans": [{"start": 14, "end": 20, "label": "location"}]
}
But after conversion it will look like:
{
    "tokens": ["I", "am", "going", "to", "Nevada"],
    "labels": ["O", "O", "O", "O", "U-location"]
}
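If you'd like to see this conversion in isolation, here is a minimal sketch using spaCy's offsets_to_biluo_tags on the toy example above (a blank English pipeline is enough for tokenization):

import spacy
from spacy.training import offsets_to_biluo_tags

nlp = spacy.blank("en")
doc = nlp.make_doc("I am going to Nevada")

# (start, end, label) character offsets, as in the example above
print(offsets_to_biluo_tags(doc, [(14, 20, "location")]))
# ['O', 'O', 'O', 'O', 'U-location']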
In order to do this, we use spaCy for tokenization (this requires the en_core_web_sm pipeline from step 1) plus some helper methods:
import spacy
from spacy.training import offsets_to_biluo_tags

nlp = spacy.load("en_core_web_sm")


def filter_overlapping_spans(spans):
    """Filter a set of spans so that they don't overlap each other"""
    spans = sorted(spans, key=lambda t: t[0])
    end_char: int = -1
    final_spans = []

    for span in spans:
        if span[0] > end_char:
            final_spans.append(span)
            end_char = span[1]

    return final_spans


def as_offsets(data):
    return (data['start'], data['end'], data['label'])


def expand_to_tokens(doc, offsets):
    """Make sure offset boundaries align with token boundaries"""
    span = doc.char_span(offsets[0], offsets[1], alignment_mode="expand")
    first_token, last_token = span[0], span[-1]

    return (first_token.idx, last_token.idx + len(last_token), offsets[2])


def tokenize_with_labels(example, labels_from):
    """Add tokens / labels fields to a dataset"""
    tokens = nlp.make_doc(example['data.text'])

    offsets = [as_offsets(result) for result in example[labels_from]]
    filtered_offsets = filter_overlapping_spans(offsets)
    filtered_offsets = [expand_to_tokens(tokens, offsets) for offsets in filtered_offsets]

    example["tokens"] = [t.text for t in tokens]
    example["raw_labels"] = offsets_to_biluo_tags(tokens, filtered_offsets)

    return example
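As a quick sanity check, you can try the helper on a single flattened datapoint before mapping it over the whole dataset (column names as shown in step 4):

# Tokenize one datapoint against its ground-truth spans and inspect the tags
flat = dataset.flatten()
checked = tokenize_with_labels(flat[0], "programmatic.groundTruths")
print(list(zip(checked["tokens"], checked["raw_labels"])))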
We apply this transformation and split the datapoints into train and test sets using the Dataset.train_test_split() method:
# Flatten the nested columns so that 'data.text' etc. are available (see step 4)
dataset = dataset.flatten()

datasets = dataset.train_test_split()
train_dataset, test_dataset = datasets["train"], datasets["test"]

train_dataset = train_dataset.map(lambda example: tokenize_with_labels(example, "programmatic.results"))
test_dataset = test_dataset.map(lambda example: tokenize_with_labels(example, "programmatic.groundTruths"))

train_dataset = train_dataset.remove_columns(['programmatic.aggregateResults', 'programmatic.groundTruths', 'programmatic.results', 'data.intent', 'data.text'])
test_dataset = test_dataset.remove_columns(['programmatic.aggregateResults', 'programmatic.groundTruths', 'programmatic.results', 'data.intent', 'data.text'])

7. Train the model

The base model we're using is distilbert-base-uncased, which is available via the Hugging Face 🤗 Transformers library. The first step is to create a tokenizer for the model we want to use, with which we can preprocess the text:
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer, DataCollatorForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
Although the data is now in tokenized and BILOU-tagged format, there's one more stage of conversion: the tags need to be encoded as integers and aligned with the subword tokens the tokenizer produces.
label_list = list({l for datapoint in [*train_dataset, *test_dataset] for l in datapoint["raw_labels"]})
label_encoding_dict = {label: label_index for label_index, label in enumerate(label_list)}


def tokenize_and_align_labels(examples):
    """Encode BILUO labels as integers for model training, and pad start/end of sequences"""
    label_all_tokens = True
    tokenized_inputs = tokenizer(list(examples["tokens"]), truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples["raw_labels"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)
            elif label[word_idx] == 'O':
                label_ids.append(label_encoding_dict['O'])
            elif word_idx != previous_word_idx:
                label_ids.append(label_encoding_dict[label[word_idx]])
            else:
                label_ids.append(label_encoding_dict[label[word_idx]] if label_all_tokens else -100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs


train_tokenized_dataset = train_dataset.map(tokenize_and_align_labels, batched=True)
test_tokenized_dataset = test_dataset.map(tokenize_and_align_labels, batched=True)
The datasets now have a labels column containing labels encoded as numeric values, plus an input_ids column containing the numeric IDs the tokenizer assigns to different words. This is all we need in order to feed the dataset into a model:
from datasets import load_metric
import numpy as np

model = AutoModelForTokenClassification.from_pretrained("distilbert-base-uncased", num_labels=len(label_list))
batch_size = 16

args = TrainingArguments(
    "programmatic-ner",
    evaluation_strategy="epoch",
    learning_rate=1e-4,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=1e-5,
)

data_collator = DataCollatorForTokenClassification(tokenizer)
metric = load_metric("seqeval")


def compute_metrics(p):
    """Compute some quick metrics for model performance"""
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Skip -100 (the padding token for start and end of sequences)
    true_predictions = [[label_list[p] for (p, l) in zip(prediction, label) if l != -100] for prediction, label in zip(predictions, labels)]
    true_labels = [[label_list[l] for (p, l) in zip(prediction, label) if l != -100] for prediction, label in zip(predictions, labels)]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {"precision": results["overall_precision"], "recall": results["overall_recall"], "f1": results["overall_f1"], "accuracy": results["overall_accuracy"]}


trainer = Trainer(
    model,
    args,
    train_dataset=train_tokenized_dataset,
    eval_dataset=test_tokenized_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer.train()
trainer.evaluate()
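Optionally, you can save the fine-tuned model and tokenizer so they can be reloaded later without retraining (the output directory name here is just an example):

# Persist the trained model and its tokenizer to a local directory
trainer.save_model("programmatic-ner-model")
tokenizer.save_pretrained("programmatic-ner-model")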

8. Test the model

Performance stats are nice, but it's also good to see how the model feels qualitatively. Let's see how it handles individual texts:
import random

import torch
import pandas as pd

cuda_is_available = torch.cuda.is_available()
device = torch.device("cuda:0" if cuda_is_available else "cpu")


def predict(text: str):
    tokens = tokenizer(text)

    input_ids = torch.tensor(tokens['input_ids']).unsqueeze(0).to(device)
    attention_mask = torch.tensor(tokens['attention_mask']).unsqueeze(0).to(device)

    predictions = model.forward(input_ids=input_ids, attention_mask=attention_mask)
    predictions = torch.argmax(predictions.logits.squeeze(), axis=1)
    predictions = [label_list[i] for i in predictions]

    words = tokenizer.batch_decode(tokens['input_ids'])
    return pd.DataFrame({'words': words, 'ner': predictions})


predict(random.choice(dataset['data.text']))
Calls to the predict() function with different texts should give you back your model's predicted tags for each word in the text. Try it out! 🎉
    words      ner
0   [CLS]      O
1   need       O
2   weather    O
3   forecast   O
4   for        O
5   stacy      O
6   in         O
7   vanuatu    U-geographic_poi
8   [SEP]      O
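You can also pass your own text to predict(); for example (this query is just an illustration):

predict("What will the weather be like in Nevada this weekend?")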