The Label Model
This section explains how Programmatic turns your noisy rules into high-quality training data.
One of the core benefits of programmatic labelling is that you don't have to write perfect labelling functions. You can write many approximate labelling functions, and Programmatic learns how to combine them. How does it work, though?

A simple example

Imagine that you are labelling data for named entity recognition (NER) and have created a handful of rules with varying accuracy and coverage. Shown below is an example passage that has been tagged by six different weak labelling functions.
For each word or token there is a correct label, but we might never observe it. What we observe instead is the sequence of labelling function outputs. Programmatic combines the observed outputs to produce a probability distribution over the true label. Shown in green above is the label that results from combining the labelling functions.
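To make the idea of "combining outputs into a probability" concrete, here is a minimal sketch for a single token. It is not Programmatic's actual model: it assumes each labelling function has a known accuracy, that functions err independently, and that errors are spread evenly over the other labels. The names `combine_votes`, `LABELS` and the accuracy values are made up for illustration.

```python
import math

LABELS = ["O", "PER", "ORG"]  # hypothetical label set
ABSTAIN = None                # a function that does not fire returns None

def combine_votes(votes, accuracies, labels=LABELS):
    """Return P(true label | votes) for one token under a naive Bayes assumption.

    votes:      one output per labelling function (a label, or None to abstain).
    accuracies: assumed probability that each function is right when it fires.
    """
    # Start from a uniform prior and accumulate evidence in log space.
    log_scores = {label: 0.0 for label in labels}
    for vote, acc in zip(votes, accuracies):
        if vote is ABSTAIN:
            continue  # an abstaining function contributes no evidence
        wrong = (1.0 - acc) / (len(labels) - 1)  # errors spread evenly
        for label in labels:
            log_scores[label] += math.log(acc if vote == label else wrong)
    # Convert the log scores back into normalised probabilities.
    top = max(log_scores.values())
    unnorm = {label: math.exp(s - top) for label, s in log_scores.items()}
    total = sum(unnorm.values())
    return {label: v / total for label, v in unnorm.items()}

# Six functions vote on one token; two of them abstain.
votes = ["PER", "PER", None, "ORG", None, "PER"]
accuracies = [0.9, 0.7, 0.8, 0.6, 0.5, 0.8]  # illustrative values only
posterior = combine_votes(votes, accuracies)
best = max(posterior, key=posterior.get)
print(best, posterior)
```

Three agreeing votes from fairly accurate functions outweigh the single dissenting vote, so the combined label is PER. In practice the accuracies are not known in advance and have to be learned from the data.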

The Sequence Model

For NER, Programmatic combines the noisy outputs using a simple unsupervised Bayesian model called a Hidden Markov Model (HMM). The HMM is well established in machine learning applications.
Diagram showing the structure of the HMM. For each token there is an unobserved true label. The model assumes that the true labels appear in sequence and that the labelling functions themselves are noisy corruptions of the true label.
The HMM is a generative model that assumes that for each token there is a true unobserved label and that the labelling functions are noisy corruptions of that true label. After observing the labelling function results, we perform Bayesian inference in this model to find the probabilities over the unobserved labels. The model learns both the class-conditioned accuracies of the labelling functions and which labels typically follow one another.
When you press "Run all" in Programmatic, we use the skweak library to train an HMM and show you the results of taking the most likely label according to this label model.
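The inference step can be sketched with the standard forward-backward algorithm. This is an illustration only, not the skweak implementation: the parameters below are fixed by hand rather than learned, abstentions are treated as uninformative, and the function name `forward_backward` and all numbers are assumptions made for the example.

```python
def forward_backward(observations, start, trans, emit):
    """Posterior P(true label at t | all labelling function outputs).

    observations:  per token, one output per labelling function
                   (a state index, or None when the function abstains).
    start[s]:      P(first label = s)
    trans[r][s]:   P(next label = s | current label = r)
    emit[j][s][o]: P(function j outputs o | true label = s)
    """
    if not observations:
        return []
    n_states = len(start)
    n_tokens = len(observations)

    def likelihood(obs, s):
        # Functions are assumed conditionally independent given the true label.
        p = 1.0
        for j, o in enumerate(obs):
            if o is not None:  # abstentions are treated as uninformative
                p *= emit[j][s][o]
        return p

    def normalise(vec):
        total = sum(vec)
        return [v / total for v in vec]

    # Forward pass: alpha[t][s] is proportional to P(label_t = s, outputs up to t).
    alpha = [normalise([start[s] * likelihood(observations[0], s)
                        for s in range(n_states)])]
    for t in range(1, n_tokens):
        prev = alpha[-1]
        alpha.append(normalise([
            likelihood(observations[t], s)
            * sum(prev[r] * trans[r][s] for r in range(n_states))
            for s in range(n_states)
        ]))

    # Backward pass: beta[t][s] is proportional to P(outputs after t | label_t = s).
    beta = [[1.0] * n_states for _ in range(n_tokens)]
    for t in range(n_tokens - 2, -1, -1):
        beta[t] = normalise([
            sum(trans[s][r] * likelihood(observations[t + 1], r) * beta[t + 1][r]
                for r in range(n_states))
            for s in range(n_states)
        ])

    # Combine both passes and normalise per token.
    return [normalise([a * b for a, b in zip(alpha[t], beta[t])])
            for t in range(n_tokens)]

STATES = ["O", "PER"]                      # hypothetical two-label problem
start = [0.8, 0.2]
trans = [[0.9, 0.1], [0.4, 0.6]]           # which labels follow one another
emit = [
    [[0.9, 0.1], [0.1, 0.9]],              # function 0: right 90% of the time
    [[0.7, 0.3], [0.3, 0.7]],              # function 1: right 70% of the time
]
# Four tokens, two labelling functions each (None = abstain).
observations = [[0, None], [1, 1], [1, None], [0, 0]]

posteriors = forward_backward(observations, start, trans, emit)
decoded = [STATES[max(range(len(p)), key=p.__getitem__)] for p in posteriors]
print(decoded)  # ['O', 'PER', 'PER', 'O']
```

The `emit` tables play the role of the class-conditioned accuracies and `trans` captures which labels tend to follow one another. A trained label model would estimate both from the unlabelled data (for example with expectation-maximisation) instead of taking them as given.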

Using Your Own Label Model

Programmatic ships with two default label models that are configured for you automatically. You're not forced to use our choices though. If you want more control over how the labelling functions are aggregated, you can have it.
To get the raw labelling function results, click the "Export" button in the bottom left, click "Export data & functions", then click "Export JSON". This gives you the raw labelling function results as one large JSON document in which each datapoint contains its labelling function results under the "results" key. This information is the input to any de-noising algorithm, so you can use it with other methods.
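As a starting point for your own aggregation, the sketch below applies a simple majority vote to exported data. Only the "results" key comes from the description above. Everything else is an assumption made for illustration: that the export is a list of datapoints, that "results" maps each labelling function name to a list of spans, and that spans carry `start`, `end` and `label` fields. The function names in the sample data are invented. Check your own export and adjust the field names accordingly.

```python
import json
from collections import Counter

def majority_spans(datapoint, min_votes=2):
    """Keep spans proposed by at least `min_votes` labelling functions."""
    counts = Counter()
    for function_name, spans in datapoint["results"].items():
        # Count each distinct span at most once per labelling function.
        for span in {(s["start"], s["end"], s["label"]) for s in spans}:
            counts[span] += 1
    return sorted(span for span, n in counts.items() if n >= min_votes)

# Hypothetical export contents. A real export would be read from the
# downloaded file with json.load(...) and may be shaped differently.
exported = json.loads("""
[
  {
    "text": "Ada Lovelace worked with Charles Babbage.",
    "results": {
      "capitalised_pairs": [
        {"start": 0, "end": 12, "label": "PER"},
        {"start": 25, "end": 40, "label": "PER"}
      ],
      "known_names": [{"start": 0, "end": 12, "label": "PER"}],
      "org_suffix": []
    }
  }
]
""")

for datapoint in exported:
    print(majority_spans(datapoint))  # [(0, 12, 'PER')]
```

Majority voting ignores how reliable each function is, so it is best seen as a baseline. The same loaded structure can be fed to any other de-noising method.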