How to Use spaCy in Labelling Functions
You can use features from spaCy pipelines to build rules more easily.
By default, Programmatic augments your dataset with a column named doc containing the spaCy tokenization of the text column, using the en_core_web_sm model. This allows for more complex labelling functions that use the doc object to decide which spans to label.
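For illustration, here is a minimal sketch of a labelling function that inspects the doc column. The function name, the span-tuple return format and the way a row is passed in are assumptions made for this example rather than Programmatic's documented API; the point is simply that the doc column holds an ordinary spaCy Doc, so attributes such as ents, pos_ and like_num are available when deciding what to label.

def label_prices(row):
    # row["doc"] is the spaCy Doc produced from the row's text column
    spans = []
    for ent in row["doc"].ents:
        # keep only monetary amounts recognised by en_core_web_sm
        if ent.label_ == "MONEY":
            spans.append((ent.start_char, ent.end_char, "PRICE"))
    return spans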
The use of spaCy can be customised by declaring the doc column yourself using the SpacyTokenization object. To do this, add the following lines to main.py in the project directory.
from programmatic.integrations import SpacyTokenization

...

my_dataset["doc"] = SpacyTokenization(model="de_core_news_md")
For example, this will override the default tokenization to use the de_core_news_md model.
N.B.: The SpacyTokenization object is a placeholder object and does not perform tokenization itself. Internally, Programmatic will use the corresponding spaCy model with smart caching behaviour to make accessing the tokenized representation of your documents as fast as possible.
spaCy environment variables
The behaviour of the spaCy integration can be controlled by the following environment variables:
  • HUMANLOOP_MAX_SPACY_RAM_DOC_LENGTH, the maximum size of a document before it will be cached only on disk and not in memory. Default 20_000 characters.
  • HUMANLOOP_MAX_SPACY_RAM_DOC_COUNT, the maximum number of spaCy doc objects to cache in RAM simultaneously. Default 10_000 documents.
  • HUMANLOOP_MAX_SPACY_DOC_LENGTH, the maximum size of a document that spaCy will attempt to parse. See https://spacy.io/api/language. Default 10_000_000 characters.
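As a sketch, the limits above can be raised or lowered by setting the variables before Programmatic initialises the spaCy integration. Setting them from Python via os.environ in main.py is an assumption about when the values are read; exporting them in your shell before launching Programmatic works equally well.

import os

# Cache documents of up to 50,000 characters in RAM, but hold at most
# 5,000 Doc objects in memory at once; larger documents fall back to disk.
os.environ["HUMANLOOP_MAX_SPACY_RAM_DOC_LENGTH"] = "50000"
os.environ["HUMANLOOP_MAX_SPACY_RAM_DOC_COUNT"] = "5000"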