Build a Clickbait Headline Classifier
In this tutorial you'll use Programmatic to train a machine learning model to classify news headlines as legitimate news vs. clickbait. You'll create labelling functions, de-noise your data, and then export your labelled data to train a machine learning model. In just 10 minutes you'll do what would have taken hours with manual annotation alone.

1. Install

If you haven't already, install Programmatic by opening your terminal and running:
```shell
pip install --upgrade pip
pip install --upgrade programmatic --extra-index-url "https://pypi.humanloop.com/simple"
```
This will install the Humanloop Command Line Interface (CLI) as well as the Programmatic app. To test your installation, run `humanloop --help`.

2. Create your Project

You can create a new project in three easy steps. Open your terminal and then:
Step 1: Type `humanloop init clickbait_project`
Step 2: Select classification, then select the clickbait project template.
Step 3: Type `humanloop run clickbait_project`
This will start a web server running and open your browser on the Humanloop Programmatic app. It will be loaded with the example dataset of clickbait and news headlines and ready for you to iterate on creating labelling functions.
In this tutorial we're going to work with an existing demo dataset but if you want to work on your own data you can learn how to do that here.

3. Explore your Dataset

The dataset we're working with contains news headlines from sites such as the New York Times and the Guardian, as well as clickbait headlines from sites such as BuzzFeed, Upworthy, and ViralStories.
There are 3000 headlines in our dataset, of which 50 have existing labels.
  1. On the right side of the page, click on the tab that says Misses.
  2. At the top of the page, select Filter -> Labels -> Clickbait.
Look at the highlighted ground truths. Can you spot any patterns? Are there keywords that would help you write a labelling rule? Use the search bar at the top left to explore further.
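If you'd like to poke at the data outside the app too, a quick word-frequency comparison is one way to surface candidate keywords for rules. The small inline headlines below are a hypothetical stand-in for the demo dataset:

```python
from collections import Counter

# Hypothetical sample standing in for the demo dataset.
clickbait = [
    "You Won't Believe What Happened Next",
    "Which Disney Character Are You",
    "10 Reasons You Should Quit Your Job",
]
news = [
    "President Signs New Trade Agreement",
    "EU Ministers Meet To Discuss Sanctions",
]

def word_freq(headlines):
    """Count lowercased word occurrences across a list of headlines."""
    counts = Counter()
    for headline in headlines:
        counts.update(headline.lower().split())
    return counts

clickbait_freq = word_freq(clickbait)
news_freq = word_freq(news)

# Words much more common in clickbait than news are good rule candidates.
candidates = {
    w: clickbait_freq[w] - news_freq[w]
    for w in clickbait_freq
    if clickbait_freq[w] > news_freq[w]
}
print(sorted(candidates, key=candidates.get, reverse=True)[:5])
```

On this toy sample, "you" comes out on top, which is exactly the kind of pattern you'll turn into a labelling function below.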

4. Create your Labelling Functions

Once you have a good idea of your dataset, you're ready to write some labelling functions.
  1. On the left panel, click on the "Clickbait" label to open a dropdown, then click on the "New function" button.
  2. Replace <your string here> with the word You. This labelling function will tag any headline containing the word "You". Press the Run button or press [Ctrl + Enter] ([Cmd + Enter] on Mac).
Well done! You've just run your first labelling function. Look at the results. How well does it work? What has it missed?
  3. Try to come up with a few more labelling functions of your own, or paste in some of the examples below (one at a time). Look at the results. Which labelling functions are good? Which are less good?
The first time you run a labelling function that uses spaCy it might take a while, but the processed `spacy.Doc` objects are cached, so subsequent runs will be faster.
Searching for Love

```python
import re

search_pattern = re.compile(r"\bLove\b")

def regex_love(row: Datapoint) -> Optional[bool]:
    """Look for a regex pattern within the document text."""
    # `Datapoint` and `Optional` are available in the Programmatic editor.
    match = re.search(search_pattern, row.text)
    if match:
        start = match.start()
        end = match.end()
        print(f"Found regex match from {start} to {end}:")
        print(row.text[max(0, start - 12): end + 12])
        return True
```

Quiz

```python
import re

search_pattern = re.compile("Quiz")

def regex_quiz(row: Datapoint) -> Optional[bool]:
    match = re.search(search_pattern, row.text)
    if match:
        start = match.start()
        end = match.end()
        print(f"Found regex match from {start} to {end}:")
        print(row.text[max(0, start - 12): end + 12])
        return True
```

Who, what, when, where, how, and why

```python
def who_what_when_where_how_why(row: Datapoint) -> Optional[bool]:
    # str.startswith accepts a tuple, so one check covers all six prefixes.
    if row.text.startswith(("Who ", "What ", "When ", "Where ", "How ", "Why ")):
        return True
```

Has "You"

```python
import re

search_pattern = re.compile(r"\bYou\b")

def has_you(row: Datapoint) -> Optional[bool]:
    match = re.search(search_pattern, row.text)
    if match:
        start = match.start()
        end = match.end()
        print(f"Found regex match from {start} to {end}:")
        print(row.text[max(0, start - 12): end + 12])
        return True
```

Profanity Check

```python
def alt_profanity_check(row: Datapoint) -> Optional[bool]:
    """
    Returns True if the text contains profanity.

    This labelling function requires an external dependency:
    https://pypi.org/project/alt-profanity-check/
    Install it with `pip install alt-profanity-check`.
    """
    from profanity_check import predict_prob

    score = predict_prob([row.text])[0]
    return bool(score > 0.5)
```

News Subjects

```python
import re

words = [
    "Prime Minister",
    "PM",
    "President",
    "U.S.",
    "US",
    "EU",
    "Governor",
    "war", "murder", "hospital", "weapons",
]

search_patterns = [
    re.compile(rf"\b{word}\b", flags=re.IGNORECASE) for word in words
]

def keywords_match(row: Datapoint) -> Optional[bool]:
    for search_pattern in search_patterns:
        match = re.search(search_pattern, row.text)
        if match:
            start = match.start()
            end = match.end()
            print(f"Found regex match from {start} to {end}:")
            print(row.text[max(0, start - 12): end + 12])
            return True
```
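To reason about which labelling functions are "good", it helps to think in terms of two numbers: coverage (how often a rule fires) and precision (how often it's right when it fires). The sketch below computes both by hand against a toy stand-in for the 50 ground-truth headlines; the function and data here are illustrative, not the app's internals:

```python
def has_you(text):
    """Toy labelling function: fires (True) when "you" appears, else abstains."""
    return True if "you" in text.lower().split() else None

# Hypothetical stand-in for the ground-truth-labelled subset of the dataset.
ground_truth = [
    ("You Won't Believe What Happened Next", True),   # clickbait
    ("10 Things You Need To Know", True),             # clickbait
    ("President Signs Trade Agreement", False),       # news
    ("Why You Should Care About Rates", False),       # news
]

predictions = [(has_you(text), label) for text, label in ground_truth]
fired = [(pred, label) for pred, label in predictions if pred is not None]

coverage = len(fired) / len(ground_truth)
precision = sum(pred == label for pred, label in fired) / len(fired)
print(f"coverage={coverage:.2f} precision={precision:.2f}")
```

Here the rule fires on 3 of 4 headlines (coverage 0.75) but is right on only 2 of those 3 (precision 0.67): high coverage, imperfect precision, which is exactly the trade-off the de-noising step below is designed to handle.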

5. De-noise your Rules

Once you have written a handful of high-precision, high-coverage rules, you can de-noise and export your dataset.
  1. Press the button that says Run all in the lower left corner. This will train the label model.
  2. Look at the results of combining all the labelling functions for clickbait. Try toggling a labelling function on or off and running again. You do this by pressing "Disable"/"Enable" in the function menu.
Disable a function and it will not be included when the rules are combined together
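Programmatic's label model estimates how accurate each function is and weighs their votes accordingly, but the core intuition of combining noisy rules can be sketched with a simple majority vote. Everything below (the mini labelling functions and the `combine` helper) is an illustrative toy, not the app's actual algorithm:

```python
# Each toy labelling function returns True (clickbait), False (news),
# or None (abstain) -- the same convention as the functions above.
def has_you(text):
    return True if " you " in f" {text.lower()} " else None

def has_quiz(text):
    return True if "quiz" in text.lower() else None

def news_subject(text):
    subjects = {"president", "eu", "governor"}
    return False if subjects & set(text.lower().split()) else None

functions = [has_you, has_quiz, news_subject]

def combine(text):
    """Naive de-noising: majority vote over the functions that fired."""
    votes = [f(text) for f in functions]
    votes = [v for v in votes if v is not None]  # abstentions drop out
    if not votes:
        return None  # no function fired on this headline
    return sum(votes) > len(votes) / 2  # True == clickbait

print(combine("Which Quiz Should You Take"))  # True (clickbait)
print(combine("President Signs Trade Deal"))  # False (news)
```

Disabling a function in the app is equivalent to dropping it from the `functions` list here: its votes simply stop contributing to the combined label.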

6. Export your Data

Now that you've built a set of labelling functions and de-noised them, it's time to get them out of Programmatic and into a model. There are a variety of export formats but in this tutorial we're going to export to the Humanloop platform.
  1. Press the button that says Export in the lower left hand corner of the app homepage.
  2. Click Export to Humanloop. You'll be prompted to log in with your Humanloop account. If you don't have one, you can create one here.
  3. Back in the Programmatic app, you'll be asked to choose how much data should be marked for review and how much should be used immediately as training data. For now, set 0% of the data for review. The ground-truth data we have will be automatically selected as test data.
  4. Press Create project.
This will create a Humanloop project for you and start training a model on your data. Since we have ground-truth data you'll get test statistics as soon as your model trains.
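Humanloop trains the model for you, but the same exported labels could feed any classifier. As a flavour of that final step, here is a self-contained toy: a tiny naive Bayes trained on a hypothetical inline sample standing in for your exported headlines (in practice you'd load the actual export and reach for a library like scikit-learn):

```python
import math
from collections import Counter

# Hypothetical stand-in for the labelled headlines exported above.
train = [
    ("You Won't Believe What Happened Next", "clickbait"),
    ("10 Things Only 90s Kids Will Understand", "clickbait"),
    ("Which Disney Character Are You", "clickbait"),
    ("President Signs New Trade Agreement", "news"),
    ("Governor Announces Hospital Funding Plan", "news"),
    ("EU Ministers Meet To Discuss Sanctions", "news"),
]

def tokenize(text):
    return text.lower().split()

# Per-class word counts for a naive Bayes model.
class_counts = Counter(label for _, label in train)
word_counts = {label: Counter() for label in class_counts}
for text, label in train:
    word_counts[label].update(tokenize(text))

vocab = {w for counts in word_counts.values() for w in counts}

def predict(text):
    scores = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        # log prior + sum of add-one-smoothed log likelihoods
        score = math.log(class_counts[label] / len(train))
        for word in tokenize(text):
            score += math.log(
                (word_counts[label][word] + 1) / (total + len(vocab))
            )
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("You Won't Believe This Quiz"))   # clickbait
print(predict("President Meets EU Ministers"))  # news
```

The hosted Humanloop model is of course far stronger than this toy, but the pipeline is the same shape: labelling functions produce training labels, and a supervised model generalises from them.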

7. Review your Model

Have a look at your model performance in the Humanloop app. You can explore your annotations or add new ones to further improve your model by clicking Label/Train in the top right.
Model Overview on Humanloop

8. Go Home Happy!

If everything has gone to plan, you should now have a good-sized labelled dataset for machine learning development. With Programmatic, you've done this in less than an hour, compared to the weeks or months manual annotation can take. You also have a hosted model that is ready to go and can be rapidly improved by adding ground-truth labels.
It's likely that you'll want to keep iterating on and improving your dataset (perhaps using active learning) but for now, take a break and go home happy!