Build a Legal NER Model
In this tutorial you'll use Programmatic to train a machine learning model to extract clauses from legal contracts. You'll create labelling functions, de-noise your data and then export your labelled data to train a machine learning model. In just 30 minutes you'll do what would have taken weeks with just manual annotation.

1. Install

If you haven't already, install Programmatic by opening your terminal and running:
pip install --upgrade pip
pip install --upgrade programmatic --extra-index-url "https://pypi.humanloop.com/simple"
This will install the Humanloop Command Line Interface (CLI) as well as the Programmatic app. To test your installation, run the command humanloop --help.

2. Create your Project

You can create a new project in three easy steps. Open your terminal and then:
Step 1: Type humanloop init legal_contract_project
Step 2: Select span extraction, then select the CUAD project template.
Step 3: Type humanloop run legal_contract_project
This will start a web server and open the Humanloop Programmatic app in your browser. It will be loaded with the example CUAD dataset (a collection of legal contracts) and ready for you to start iterating on labelling functions.
In this tutorial we're going to work with an existing demo dataset, but if you want to work on your own data, you can learn how to do that here.

3. Explore your Dataset

The dataset we're working with is known as CUAD; it's a public academic dataset created by The Atticus Project. Labelling it required significant domain expertise, tens of hours of annotator training, and very detailed annotation guides.
In this tutorial we're going to focus on training a model to extract the "Governing Law Clause".
There are 510 annotated documents in the CUAD dataset, and we've included about 20% of the manual labels in the example dataset. To start with, let's look at some example ground truth datapoints.
1. On the right side of the page, click on the tab that says Misses.
2. At the top of the page, select Filter -> Labels -> Governing Law
Look at the highlighted ground truths. Can you spot any patterns? Are there keywords that would help you write a labelling rule? Use the search bar at the top left to explore further.
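If you prefer to explore the raw text outside the app, a quick script along these lines can surface how often candidate phrases appear. This is just a sketch: the contracts/ directory is hypothetical and assumes you've saved some contract texts locally as plain .txt files.
from pathlib import Path

# Candidate phrases that might signal a "Governing Law" clause.
phrases = [
    "shall be governed",
    "shall be construed",
    "governing law",
    "the laws of",
]

counts = {phrase: 0 for phrase in phrases}
for path in Path("contracts").glob("*.txt"):  # hypothetical local copies of the contracts
    text = path.read_text(errors="ignore").lower()
    for phrase in phrases:
        counts[phrase] += text.count(phrase)

# Print the phrases from most to least frequent.
for phrase, n in sorted(counts.items(), key=lambda item: -item[1]):
    print(f"{phrase!r}: {n}")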

4. Create your Labelling Functions

Once you have a good idea of your dataset, you're ready to write some labelling functions. For CUAD we've created a couple of examples to help you get started.
1. On the left panel, click on the "Governing Law" label and this will open a dropdown. Click on the labelling function agreement_shall_be....
2. This labelling function will tag any sentence containing the phrase "This Agreement shall be construed". Press the Run button or press [Ctrl + Enter] (or [Cmd + Enter] on Mac).
Well done! You've just run your first labelling function. Look at the results. How well does it work? What have you still missed?
3. Try to come up with a few more labelling functions of your own, or paste in some of the examples below (one at a time). Look at the results. Which labelling functions work well? Which are less good?
The first time you run a labelling function that uses spaCy it might take a while, but the processed spacy.Doc objects are cached, so subsequent runs will be faster.
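Before diving into the examples below, if you're curious what iterating over row.doc.sents with the spaCy sentencizer actually yields, here's a minimal standalone sketch using plain spaCy outside of Programmatic (the example text is made up):
import spacy

# Build a lightweight pipeline with only a rule-based sentence splitter.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

text = (
    "This Agreement shall be governed by the laws of the State of Delaware. "
    "Any dispute shall be settled by arbitration."
)

doc = nlp(text)
for sentence in doc.sents:
    # Each sentence exposes its character offsets, like the spans used in the
    # labelling functions below.
    print(sentence.start_char, sentence.end_char, sentence.text)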
Construed search
def agreement_shall_be_construed(row: Datapoint) -> List[Span]:
    # Mark any sentence starting with 'This Agreement shall be construed'
    # as "Governing Law"

    search_term = "This Agreement shall be construed"

    start = row.text.find(search_term)

    if start != -1:
        end = start + 1
        while end < len(row.text) and row.text[end] != '.':
            end += 1
        end += 1

        return Span(start=start, end=end)
Governed search
def agreement_shall_be_governed(row: Datapoint) -> List[Span]:
    # Mark any sentence starting with 'This Agreement shall be governed'
    # as "Governing Law"

    search_term = "This Agreement shall be governed"

    start = row.text.find(search_term)

    if start != -1:
        end = start + 1
        while end < len(row.text) and row.text[end] != '.':
            end += 1
        end += 1

        return Span(start=start, end=end)
Governed in spaCy sentences
def shall_be_governed_spacy(row: Datapoint) -> List[Span]:
    # Use the spaCy sentencizer to iterate over sentences and
    # check for 'shall be governed'.

    for sentence in row.doc.sents:
        if "shall be governed" in sentence.text:
            return sentence.start_char, sentence.end_char
Governing law search
def contains_governing_law(row: Datapoint) -> List[Span]:
    # Mark the sentence that follows the phrase 'governing law.'
    # as "Governing Law"

    search_term = "governing law."

    start = row.text.lower().find(search_term)

    if start != -1:
        end = start + len(search_term) + 1
        while end < len(row.text) and row.text[end] != '.':
            end += 1
        end += 1

        return Span(start=start + len(search_term) + 1, end=end)
Law application search
def starts_with_law_application(row: Datapoint) -> List[Span]:
    # Mark the sentence that follows the heading 'law application'
    # as "Governing Law"

    search_term = "law application"

    start = row.text.lower().find(search_term)

    if start != -1:
        end = start + len(search_term) + 1
        while end < len(row.text) and row.text[end] != '.':
            end += 1
        end += 1

        return Span(start=start + len(search_term) + 1, end=end)
Law-of search
def contains_the_law_of(row: Datapoint) -> List[Span]:
    # Use the spaCy sentencizer to iterate over sentences and
    # check for 'the law of' or 'the laws of'.

    for sentence in row.doc.sents:
        if "the law of" in sentence.text or "the laws of" in sentence.text:
            return sentence.start_char, sentence.end_char
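If you want to try writing a function of your own, here's one more sketch in the same style as the examples above. The search phrase is just an illustrative guess rather than one of the built-in examples, so adjust it to whatever patterns you spotted in the data.
def governed_by_the_laws_of(row: Datapoint) -> List[Span]:
    # Sketch of a custom labelling function, following the same pattern as
    # the examples above. Mark the sentence containing the (illustrative)
    # phrase 'governed by the laws of' as "Governing Law".

    search_term = "governed by the laws of"

    start = row.text.lower().find(search_term)

    if start != -1:
        end = start + 1
        while end < len(row.text) and row.text[end] != '.':
            end += 1
        end += 1

        return Span(start=start, end=end)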

5. De-noise your Rules

Once you have written a handful of high-precision, high-coverage rules, you can de-noise and export your dataset.
1. Press the button that says Run all in the lower left corner. This will train the label model.
2. Look at the results of combining all the labelling functions for governing law. Try toggling a labelling function on or off and running again. You do this by pressing "disable"/"enable" in the function menu.
Disable a function in the function menu and it will not be included when the rules are combined together.

6. Export your Data

Now that you've built a set of labelling functions and de-noised them, it's time to get them out of Programmatic and into a model. There are a variety of export formats, but in this tutorial we're going to export to the Humanloop platform.
1. Press the button that says Export in the lower left-hand corner of the app homepage.
2. Click Export to Humanloop. You'll be prompted to log in with your Humanloop account. If you don't have one, you can create one here.
3. Back in the Programmatic app you'll be asked to choose how much data should be marked for review and how much should be used immediately as training data. For now, set 0% of the data for review. The ground truth data we have will be automatically selected as test data.
4. Press Create project.
This will create a Humanloop project for you and start training a model on your data. Since we have ground-truth data you'll get test statistics as soon as your model trains. The CUAD dataset is large so this may take a while the first time. Feel free to grab a coffee.

7. Review your Model

Have a look at your model performance in the Humanloop app. You can explore your annotations or add new ones to further improve your model by clicking Label/Train in the top right.
Model Overview on Humanloop

8. Go Home Happy!

If everything has gone to plan, you should now have a good-sized labelled dataset for machine learning development. With Programmatic, you've done this in less than an hour, compared to the weeks or months it would take with manual annotation. You also have a hosted model that is ready to go and can be rapidly improved by adding some ground truth labels.
It's likely that you'll want to keep iterating on and improving your dataset (perhaps using active learning) but for now, take a break and go home happy!