By default, the `doc` column contains the spaCy tokenization of the `text` column, produced using the `en_core_web_sm` model. This allows for more complex labelling functions which use the `doc` object to decide which spans to label.

You can also specify the `doc`
column yourself in `main.py` using the `SpacyTokenization` object. To do this, add the `SpacyTokenization` lines to the `main.py` in your project directory. You can choose a different spaCy model this way: the `de_core_news_md` model, for example.

The `SpacyTokenization`
object is a placeholder and does not perform tokenization itself. Internally, Programmatic uses the corresponding spaCy model, with smart caching, to make access to the tokenized representation of your documents as fast as possible.

The following settings control caching and parsing limits:

- `HUMANLOOP_MAX_SPACY_RAM_DOC_LENGTH`: the maximum size of a document before it will be cached only on disk and not in memory. Default 20_000 characters.
- `HUMANLOOP_MAX_SPACY_RAM_DOC_COUNT`: the maximum number of spaCy `doc` objects to cache in RAM simultaneously. Default 10_000 documents.
- `HUMANLOOP_MAX_SPACY_DOC_LENGTH`: the maximum size of a document that spaCy will attempt to parse. See https://spacy.io/api/language. Default 10_000_000 characters.
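
As a rough illustration of how these three limits fit together, here is a minimal sketch in plain Python. It assumes the names are read as environment variables, which the text above does not state explicitly. The helpers `read_limits` and `ram_eligible` are hypothetical names invented for this example; they are not part of Programmatic and do not describe its internal implementation. Treating the RAM length limit as inclusive is also an assumption.

```python
import os
from typing import Mapping

# Documented defaults for the three limits listed above.
DEFAULTS = {
    "HUMANLOOP_MAX_SPACY_RAM_DOC_LENGTH": 20_000,
    "HUMANLOOP_MAX_SPACY_RAM_DOC_COUNT": 10_000,
    "HUMANLOOP_MAX_SPACY_DOC_LENGTH": 10_000_000,
}


def read_limits(env: Mapping[str, str] = os.environ) -> dict:
    """Hypothetical helper: take each limit from `env`, else use the documented default."""
    return {name: int(env.get(name, default)) for name, default in DEFAULTS.items()}


def ram_eligible(doc_length: int, limits: dict) -> bool:
    """Hypothetical helper: True if a document of this length may be cached in memory.

    Assumption: a document exactly at the limit still counts as eligible.
    """
    return doc_length <= limits["HUMANLOOP_MAX_SPACY_RAM_DOC_LENGTH"]


if __name__ == "__main__":
    limits = read_limits()
    for name, value in limits.items():
        print(f"{name} = {value}")
```

Under the same assumption, raising a limit would mean setting the variable in the shell before starting Programmatic, for example `HUMANLOOP_MAX_SPACY_RAM_DOC_LENGTH=50000`.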