# The Workflow
Programmatic labelling is a three-step process that lets you train supervised machine learning models starting from very little annotated data.

## 1. You create a set of “labelling functions” that can be applied across your dataset

Labelling functions are simple rules that guess the correct label for a data point. For example, if you were trying to classify news headlines as "clickbait" or "legitimate news", you might use a rule that says:
"If the headline starts with a number, predict that the label is clickbait; if not, don't predict a label."
or in code:
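```python
# Reconstructed sketch: the function name and the digit check are assumed
# from the rule stated above; only the return statements survive from the original.
def starts_with_number(headline: str) -> int:
    if headline[:1].isdigit():
        return 1  # vote "clickbait"
    else:
        return 0  # no label predicted
```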
It's clear that this rule won't be 100% accurate and won't capture all clickbait headlines. That's ok. What's important is that you can come up with a few different rules and that each rule is pretty good. Once you have a collection of rules, we can then learn which ones to trust and when.
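
To make "a few different rules" concrete, here is a minimal sketch of three labelling functions applied to a handful of headlines. The rule names, the keyword lists, and the example headlines are made up for illustration. The convention of using -1 for a "news" vote (alongside 1 for clickbait and 0 for no label) is also an assumption rather than something Programmatic prescribes.

```python
import re

# Assumed vote encoding: -1 = news, 0 = no label predicted, 1 = clickbait.
NEWS, ABSTAIN, CLICKBAIT = -1, 0, 1


def _words(headline: str) -> set:
    """Lower-cased alphabetic words in the headline."""
    return set(re.findall(r"[a-z']+", headline.lower()))


def lf_starts_with_number(headline: str) -> int:
    # Listicle-style headlines often start with a digit.
    return CLICKBAIT if re.match(r"\d", headline) else ABSTAIN


def lf_second_person(headline: str) -> int:
    # Hypothetical rule: addressing the reader directly hints at clickbait.
    return CLICKBAIT if _words(headline) & {"you", "your"} else ABSTAIN


def lf_reporting_verb(headline: str) -> int:
    # Hypothetical rule: reporting verbs hint at a regular news story.
    return NEWS if _words(headline) & {"says", "announces", "reports"} else ABSTAIN


LABELLING_FUNCTIONS = [lf_starts_with_number, lf_second_person, lf_reporting_verb]


def apply_lfs(headlines):
    """One row per headline, one column per labelling function."""
    return [[lf(h) for lf in LABELLING_FUNCTIONS] for h in headlines]


headlines = [
    "10 Tricks You Need to Try Today",
    "Central bank announces interest rate decision",
    "Local council approves new budget",
]
votes = apply_lfs(headlines)
for headline, row in zip(headlines, votes):
    print(row, headline)
```

Notice that the last headline gets no votes at all. That is expected: each rule only covers part of the data, which is why you want several of them.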

## 2. Programmatic uses a Bayesian model to work out the most likely label for each data point

The good news for a practitioner is that all of the hardest math is taken care of by Programmatic or other open source packages.
It's helpful to keep an example in mind, so let's stick to thinking about classifying news headlines into one of two categories. Imagine that you have 10,000 headlines and 5 labelling functions. Each labelling function tries to guess whether the headline is "news" or "clickbait". If you wanted to visualise it, you could put it together into a large table with one row per headline and one column for each labelling function. If you did that, you'd get a table like this with 10,000 rows and 5 columns:
*Figure: Labelling functions with votes for either news or clickbait.*
The goal now is to work out the most likely label for each datapoint. The simplest thing you could do would be to just take the majority vote in each row. So if four labelling functions voted for clickbait, we'd assume the label is clickbait.
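
The majority-vote baseline can be sketched in a few lines. The vote encoding (-1 for news, 0 for no vote, 1 for clickbait), the tie-breaking behaviour, and the example rows are assumptions made for this illustration.

```python
from collections import Counter

# Assumed vote encoding: -1 = news, 0 = no vote, 1 = clickbait.
NEWS, ABSTAIN, CLICKBAIT = -1, 0, 1


def majority_vote(row):
    """Most common non-abstaining vote; ABSTAIN on a tie or when nobody voted."""
    counts = Counter(v for v in row if v != ABSTAIN)
    if counts[CLICKBAIT] > counts[NEWS]:
        return CLICKBAIT
    if counts[NEWS] > counts[CLICKBAIT]:
        return NEWS
    return ABSTAIN


# Each row holds the votes of 5 labelling functions for one headline.
votes = [
    [1, 1, 1, 1, -1],   # four votes for clickbait, one for news
    [-1, 0, -1, 1, 0],  # two for news, one for clickbait, two abstentions
    [0, 0, 0, 0, 0],    # no rule fired
    [1, -1, 0, 0, 0],   # tie
]
labels = [majority_vote(row) for row in votes]
print(labels)
```

Every vote counts equally here, which is exactly the weakness discussed next.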
The problem with this approach is that we know that some of the rules are going to be bad and some are going to be good. We also know that the rules might be correlated. The way we get around this is to first train a probabilistic model that learns an estimate for the accuracy of each rule. Once we've trained this model we can calculate the distribution
$p\left(y=\textsf{clickbait} \, \middle\vert \, \textsf{labelling functions}\right)$
for each of our datapoints. This is a more intelligent weighting of the five votes we get for each datapoint.
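
As a rough illustration of how accuracies turn votes into a probability, here is a sketch under a simple naive Bayes assumption: each rule that votes is correct with a fixed probability, independently of the other rules, and abstentions carry no information. The accuracies, the prior, and the vote encoding are all made-up inputs for this example. This is not necessarily the model Programmatic uses, and it ignores correlations between rules.

```python
import math

# Assumed vote encoding: -1 = news, 0 = no vote, 1 = clickbait.
NEWS, ABSTAIN, CLICKBAIT = -1, 0, 1


def posterior_clickbait(row, accuracies, prior=0.5):
    """p(y = clickbait | votes), assuming independent rules with known accuracies.

    Each accuracy must lie strictly between 0 and 1.
    """
    log_odds = math.log(prior / (1 - prior))
    for vote, acc in zip(row, accuracies):
        if vote == ABSTAIN:
            continue  # abstentions are treated as uninformative
        weight = math.log(acc / (1 - acc))  # accurate rules get larger weights
        log_odds += weight if vote == CLICKBAIT else -weight
    return 1 / (1 + math.exp(-log_odds))


# One strong rule (90% accurate) votes news; two weak rules (60%) vote clickbait.
accuracies = [0.9, 0.6, 0.6]
p = posterior_clickbait([NEWS, CLICKBAIT, CLICKBAIT], accuracies)
print(round(p, 3))
```

In this example a majority vote would say clickbait, but the weighted posterior leans towards news because the single accurate rule outweighs the two weaker ones.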
Intuitively, a model can tell that a rule is accurate if it consistently votes with the majority. Conversely, a rule that is erratic and only sometimes votes with the majority is less likely to be good. The more approximate rules we have, the better we can clean them up.
There is a lot of freedom in what model you choose to clean up the labels, and how you train it. Most of the research in the early days of weak supervision went into improving the model used and the efficiency of training. The original paper used a naive Bayes model and SGD with Gibbs sampling to learn the accuracy of each rule. Later methods were developed that can also learn the correlations between labelling functions, and recent work has used matrix completion methods to train the model efficiently.

## 3. You train a normal machine learning model on the dataset produced by steps one and two

Once you have your estimates for the probability of each label, you can either threshold that to get a hard label for each datapoint or you can use the probability distribution as the target for a model. Either way, you now have a labelled dataset and you can use it just like any other labelled dataset!
You can go ahead and train a machine learning model and get good performance, even though you might have had no annotated data when you started!
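
The two options for using the probabilities, thresholding versus soft targets, might look like this in code. The helper names, the 0.5 threshold, and the example probabilities are assumptions. The loss shown is the standard binary cross-entropy with a probabilistic target, which is one common way to train on soft labels.

```python
import math


def to_hard_labels(probs, threshold=0.5):
    """Threshold p(clickbait) into a hard label for each datapoint."""
    return ["clickbait" if p >= threshold else "news" for p in probs]


def soft_cross_entropy(target_p, predicted_q, eps=1e-12):
    """Binary cross-entropy where the target is a probability, not 0 or 1."""
    q = min(max(predicted_q, eps), 1 - eps)  # avoid log(0)
    return -(target_p * math.log(q) + (1 - target_p) * math.log(1 - q))


# Made-up label-model outputs: p(clickbait) for three headlines.
probs = [0.2, 0.93, 0.5]

# Option 1: hard labels, usable with any off-the-shelf classifier.
hard = to_hard_labels(probs)
print(hard)

# Option 2: keep the probabilities and penalise the downstream model's
# prediction q against them. The loss is lowest when q matches the target.
print(soft_cross_entropy(0.2, 0.2), soft_cross_entropy(0.2, 0.8))
```

Hard labels are simpler to plug in, while soft targets keep the information about how confident the label model was for each datapoint.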