How to Import and Export Data
You can load both labelled and unlabelled data easily

Loading your data

To load a dataset, read it into a pandas DataFrame. The only mandatory column is text; beyond that you can include any other data you want in this DataFrame, such as metadata or extra features that you think might be useful.
When you run humanloop init <my-project-name> at the command line, we create a folder called "my-project-name". Inside that folder is a "main.py" file like the one shown below. To use your own dataset in the application, replace the dataset definition that starts on line 12 (and runs to line 22) with a dataset of your choice. For example, you could load it from a CSV file by replacing that definition with dataset = pd.read_csv('path_to_your_file/your_data.csv').
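As a minimal sketch of the CSV route, the snippet below reads a CSV, renames the column that holds your text to text, and casts it to the "string" dtype that the generated main.py asks for. The helper name load_dataset, the headline and id columns, and the in-memory stand-in file are all assumptions made for illustration; none of them are part of the Humanloop CLI.

```python
import io

import pandas as pd


def load_dataset(csv_source, text_column="text"):
    """Read a CSV and return a DataFrame with a string-typed `text` column.

    `csv_source` is anything pd.read_csv accepts (a path or a file-like object).
    `text_column` is the name of the CSV column that holds the text (hypothetical).
    """
    df = pd.read_csv(csv_source)
    if text_column not in df.columns:
        raise ValueError(f"CSV has no column named {text_column!r}")
    # The app expects the text under the name `text`; other columns are kept as-is.
    df = df.rename(columns={text_column: "text"})
    # Drop rows with missing text, then reset the index so rows are numbered 0..n-1.
    df = df.dropna(subset=["text"]).reset_index(drop=True)
    df["text"] = df["text"].astype("string")
    return df


# Stand-in for a real file so the example runs anywhere. In main.py you would
# pass a path such as 'path_to_your_file/your_data.csv' instead.
example_csv = io.StringIO(
    "headline,id\n"
    "TikTok considers London and other locations for headquarters,101\n"
    "Disney cuts ad spending on Facebook amid growing boycott: WSJ,102\n"
)
dataset = load_dataset(example_csv, text_column="headline")
```

Resetting the index keeps row positions contiguous, which matters if you later refer to rows by number in ground_truths.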
It's very common in machine learning development to split a dataset into a training set, a validation set, and a test set. It's best not to look at any test data when developing labelling functions, so we recommend doing this split before loading the data into the Programmatic app.
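One way to do the split up front, sketched with pandas only. The function name split_dataset, the fractions, and the seed are illustrative choices rather than anything the app requires.

```python
import pandas as pd


def split_dataset(df, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle `df` and split it into train, validation and test DataFrames."""
    if val_frac < 0 or test_frac < 0 or val_frac + test_frac >= 1:
        raise ValueError("val_frac and test_frac must be >= 0 and sum to less than 1")
    # sample(frac=1.0) returns all rows in a random, reproducible order.
    shuffled = df.sample(frac=1.0, random_state=seed)
    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    test = shuffled.iloc[:n_test]
    val = shuffled.iloc[n_test:n_test + n_val]
    train = shuffled.iloc[n_test + n_val:]
    # Reset indices so row positions start at 0 in each split.
    return tuple(part.reset_index(drop=True) for part in (train, val, test))


all_data = pd.DataFrame({"text": [f"example sentence {i}" for i in range(10)]})
train, val, test = split_dataset(all_data, val_frac=0.2, test_frac=0.2)

# Load only the training split into the app; leave `test` untouched until evaluation.
dataset = train
```

Fixing the seed makes the split reproducible, so the same rows stay out of sight each time you regenerate the dataset.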

Specifying Label Classes

To specify what labels you want to tag your data with, replace the label list on line 28 with a list of strings that are the names of your labels.

Specifying Ground Truths

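If you have some manually annotated data that you are confident in, you can also load it into the app by editing ground_truths, which is defined on line 38 (its columns are documented in the docstring starting on line 30). For token-level classification, the annotations should be a DataFrame in which each row describes one labelled span. The generated main.py also stores the span's text in a text column; the other 4 columns are: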
  • datapoint_id: an integer that specifies the row of the dataset this label applies to
  • label: a string corresponding to the label you want to apply
  • start: an integer specifying the character offset at which the span starts
  • end: an integer specifying the character offset at which the span ends (exclusive: in the example below, "Disney" has start 0 and end 6)
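Offsets are easy to get wrong by one character, so it can help to check the annotations before starting the app. The sketch below is an assumption-laden helper, not part of the Programmatic package: check_ground_truths is a hypothetical name, and it assumes that datapoint_id is a 0-based row position and that end is exclusive, as the generated example suggests.

```python
import pandas as pd


def check_ground_truths(dataset, labels, ground_truths):
    """Return a list of human-readable problems found in `ground_truths`."""
    problems = []
    for row in ground_truths.itertuples(index=True):
        where = f"ground_truths row {row.Index}"
        if not 0 <= row.datapoint_id < len(dataset):
            problems.append(f"{where}: datapoint_id {row.datapoint_id} is out of range")
            continue
        if row.label not in labels:
            problems.append(f"{where}: unknown label {row.label!r}")
        text = dataset["text"].iloc[row.datapoint_id]
        if not 0 <= row.start < row.end <= len(text):
            problems.append(f"{where}: span [{row.start}, {row.end}) is outside the text")
        elif hasattr(row, "text") and text[row.start:row.end] != row.text:
            # Only checked when the optional `text` column is present.
            found = text[row.start:row.end]
            problems.append(f"{where}: expected {row.text!r} but offsets give {found!r}")
    return problems


dataset = pd.DataFrame(
    {"text": ["Disney cuts ad spending on Facebook amid growing boycott: WSJ"]}
)
labels = ["organisation", "geo-political entity", "money"]
ground_truths = pd.DataFrame(
    [
        {"datapoint_id": 0, "text": "Disney", "label": "organisation", "start": 0, "end": 6},
        # Deliberately shifted by one character to show what a failure looks like.
        {"datapoint_id": 0, "text": "Facebook", "label": "organisation", "start": 28, "end": 36},
    ]
)
problems = check_ground_truths(dataset, labels, ground_truths)
for problem in problems:
    print(problem)
```

An empty list means every span lies inside its text, uses a known label, and (when a text column is present) matches the characters between start and end.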
It can be very helpful to have a small amount of manually annotated "ground truth" data in your training set. This gives you better feedback on how good your labelling functions are, because you can estimate their precision and recall.
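To make the precision and recall idea concrete, here is a minimal sketch that compares spans produced by a labelling function with ground-truth spans using exact matching. How the app computes its own estimates is not described here, so treat the matching rule, the tuple layout (datapoint_id, start, end, label), and the function name precision_recall as assumptions.

```python
def precision_recall(predicted, truth):
    """Exact-match precision and recall over (datapoint_id, start, end, label) spans."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    # Precision: what fraction of predicted spans are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: what fraction of ground-truth spans were found.
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Ground truth for "Disney cuts ad spending on Facebook amid growing boycott: WSJ".
truth = [
    (0, 0, 6, "organisation"),    # Disney
    (0, 27, 35, "organisation"),  # Facebook
    (0, 58, 61, "organisation"),  # WSJ
]
# Output of a hypothetical labelling function: two hits, one spurious span, one miss.
predicted = [
    (0, 0, 6, "organisation"),    # Disney (correct)
    (0, 27, 35, "organisation"),  # Facebook (correct)
    (0, 36, 40, "organisation"),  # "amid" (wrong)
]
precision, recall = precision_recall(predicted, truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Exact matching is strict: a span that is off by one character counts as both a false positive and a miss.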
main.py
"""This is an autogenerated main.py created by the Humanloop CLI tool"""
import pandas as pd

from programmatic.main import start_humanloop

# Edit the lines below to use your own dataset!

"""
Dataset is a pd.DataFrame. The only mandatory column is `text`,
which should have a `"string"` dtype. The rest is up to you.
"""
dataset = pd.DataFrame(
    {
        "text": [
            "Disney cuts ad spending on Facebook amid growing boycott: WSJ",
            "TikTok considers London and other locations for headquarters",
            "KPMG hit by Hong Kong High Court in $400 million China Medical fraud",
            "Abu Dhabi's Etihad Airways restructures Airbus, Boeing jet orders",
            "Trail of missing Wirecard executive leads to Belarus, Der Spiegel reports",
        ]
    }
)


"""
Labels is a simple list of strings
"""
labels = ["organisation", "geo-political entity", "money"]

"""
ground_truths is another pd.DataFrame. It should have these columns:
`datapoint_id`: row index into `dataset`.
`text`: text of the highlighted span (between `start` and `end` in dataset `text` column)
`label`: string matching something in the `labels` list
`start`: text start offset
`end`: text end offset
"""
ground_truths = pd.DataFrame(
    [
        # First datapoint
        {"datapoint_id": 0, "text": "Disney", "label": "organisation", "start": 0, "end": 6},
        {"datapoint_id": 0, "text": "Facebook", "label": "organisation", "start": 27, "end": 35},
        {"datapoint_id": 0, "text": "WSJ", "label": "organisation", "start": 58, "end": 61},
        # Second datapoint
        {"datapoint_id": 1, "text": "TikTok", "label": "organisation", "start": 0, "end": 6},
        {"datapoint_id": 1, "text": "London", "label": "geo-political entity", "start": 17, "end": 23},
        # Third datapoint
        {"datapoint_id": 2, "text": "KPMG", "label": "organisation", "start": 0, "end": 4},
        {"datapoint_id": 2, "text": "Hong Kong High Court", "label": "organisation", "start": 12, "end": 32},
        {"datapoint_id": 2, "text": "$400 million", "label": "money", "start": 36, "end": 48},
        {"datapoint_id": 2, "text": "China Medical", "label": "organisation", "start": 49, "end": 62},
        # Fourth datapoint
        {"datapoint_id": 3, "text": "Abu Dhabi's", "label": "geo-political entity", "start": 0, "end": 11},
        {"datapoint_id": 3, "text": "Etihad Airways", "label": "organisation", "start": 12, "end": 26},
        {"datapoint_id": 3, "text": "Airbus", "label": "organisation", "start": 40, "end": 46},
        {"datapoint_id": 3, "text": "Boeing", "label": "organisation", "start": 48, "end": 54},
        # Fifth datapoint
        {"datapoint_id": 4, "text": "Wirecard", "label": "organisation", "start": 17, "end": 25},
        {"datapoint_id": 4, "text": "Belarus", "label": "geo-political entity", "start": 45, "end": 52},
        {"datapoint_id": 4, "text": "Der Spiegel", "label": "organisation", "start": 54, "end": 65},
    ]
)


def start(host=None, port=None, log_level="warning", debug=False):
    start_humanloop(
        host=host,
        port=port,
        dataset=dataset,
        labels=labels,
        ground_truths=ground_truths,
        log_level=log_level,
        debug=debug,
    )

Exporting your data

You can export your data as JSON or export it directly to the Humanloop platform.
To export your data or functions from the app, click the "Export" button in the bottom left of the home screen.
You will be given the option to continue to Humanloop or to export all your data as JSON.
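The structure of the exported JSON is not described on this page, so a cautious first step is to load the file and look at its top-level shape before writing any processing code. The sketch below does only that. The file name export.json, the helper summarise_export, and the stand-in contents are assumptions and do not reflect the real export schema.

```python
import json
import tempfile
from pathlib import Path


def summarise_export(path):
    """Load a JSON file and describe its top-level structure."""
    with Path(path).open(encoding="utf-8") as f:
        data = json.load(f)
    if isinstance(data, dict):
        return {"type": "object", "keys": sorted(data)}
    if isinstance(data, list):
        return {"type": "array", "length": len(data)}
    return {"type": type(data).__name__}


# Stand-in file so the example runs; its contents are NOT the real export schema.
with tempfile.TemporaryDirectory() as tmp:
    export_path = Path(tmp) / "export.json"  # hypothetical file name
    export_path.write_text(json.dumps({"placeholder": []}), encoding="utf-8")
    summary = summarise_export(export_path)

print(summary)
```

Point summarise_export at your own exported file to see which keys or how many records it contains.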

Exporting your functions

You can export all of your labelling functions as Python modules. You can check these into your version control system or use them as you wish.
  1. Click Export at the bottom left of the top-level label explorer panel.
  2. Click 'next' under Export data & functions.
  3. Create a snapshot of your functions.
A snapshot of all of your functions, as Python modules, will be written to the export directory inside your project directory.
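After taking a snapshot, you may want to confirm which modules were written before committing them. The sketch below lists the .py files under the export directory of a project. The directory name export comes from the text above; how files are arranged inside it is an assumption (the search is recursive to be safe), and list_exported_modules and the placeholder file are hypothetical.

```python
import tempfile
from pathlib import Path


def list_exported_modules(project_dir):
    """Return the relative paths of all .py files under <project_dir>/export."""
    export_dir = Path(project_dir) / "export"
    if not export_dir.is_dir():
        return []
    return sorted(p.relative_to(export_dir).as_posix() for p in export_dir.rglob("*.py"))


# Stand-in project so the example runs; replace with your real project folder.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "my-project-name"
    (project / "export").mkdir(parents=True)
    placeholder = project / "export" / "example_function.py"  # hypothetical file
    placeholder.write_text("# placeholder\n", encoding="utf-8")
    modules = list_exported_modules(project)

print(modules)
```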