Named Entity Recognition (NER)

You can use the textacy_parsing script to generate document metadata keys automatically. The scripts are a modified version of the textacy code, updated to run with the current spaCy version. The script uses a spaCy embeddings model to process text documents and produce a JSON metadata keyfile. Keys are parsed according to a config file, either run_files/parse_configs/ner_types.json or run_files/parse_configs/ner_types_full.json. You can also supply your own config file.

The new parse script uses multiprocessing to improve performance. The default process pool size is 6. Adjust the number of processes to match the number of cores on your machine.
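As a rough guide for picking that number, the sketch below derives a pool size from the core count reported by the operating system. The helper name `suggested_pool_size` and the choice to leave one core free are assumptions made for illustration; they are not part of the parse script.

```python
import os


def suggested_pool_size(reserve: int = 1, fallback: int = 6) -> int:
    """Suggest a process pool size that leaves `reserve` cores free.

    Hypothetical helper: `fallback` mirrors the script's documented
    default of 6 and is used only when the core count is unknown.
    """
    # os.cpu_count() reports logical cores and may return None.
    cores = os.cpu_count() or fallback
    # Never go below one worker, even on small machines.
    return max(1, cores - reserve)


if __name__ == "__main__":
    print(suggested_pool_size())
```

The resulting value can be passed to the script's --threads option.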

The available config values are:

| Ngrams | Description |
| --- | --- |
| PROPN | Proper noun |
| ADJ | Adjective |
| NNP | Noun, proper singular |
| NN | Noun, singular or mass |
| AUX | Auxiliary |
| VBZ | Verb, 3rd person singular present |
| ADP | Adposition |
| SYM | Symbol |
| NUM | Numeral |
| CD | Cardinal number |
| VBG | Verb, gerund or present participle |

| Entities | Description |
| --- | --- |
| FAC | Buildings, airports, highways, bridges, etc. |
| NORP | Nationalities or religious or political groups |
| GPE | Countries, cities, states |
| PRODUCT | Objects, vehicles, foods, etc. (not services) |
| EVENT | Named hurricanes, battles, wars, sports events, etc. |
| PERSON | People, including fictional |
| ORG | Companies, agencies, institutions, etc. |
| LOC | Non-GPE locations, mountain ranges, bodies of water |
| DATE | Absolute or relative dates or periods |
| TIME | Times smaller than a day |
| WORK_OF_ART | Titles of books, songs, etc. |

| Extract type | Description |
| --- | --- |
| orth | Terms are represented by their text exactly as written |
| lower | Lowercased form of the text |
| lemma | Base form without inflectional suffixes |
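To make the three extract types concrete, here is a small stand-alone sketch of how each one changes the set of keys produced from the same tokens. The `Token` records and the `extract_keys` function are hypothetical and hand-written for illustration; in the real script the lemma would come from the spaCy model rather than from hard-coded data.

```python
from collections import namedtuple

# Hypothetical token record: the surface text and its (precomputed) lemma.
Token = namedtuple("Token", ["orth", "lemma"])

tokens = [
    Token("Houses", "house"),
    Token("houses", "house"),
    Token("house", "house"),
]


def extract_keys(tokens, extract_type):
    """Return the sorted, de-duplicated keys for one extract type."""
    if extract_type == "orth":
        # Text exactly as written, so case variants stay distinct.
        return sorted({t.orth for t in tokens})
    if extract_type == "lower":
        # Lowercasing merges case variants but keeps inflections apart.
        return sorted({t.orth.lower() for t in tokens})
    if extract_type == "lemma":
        # The base form merges inflected variants as well.
        return sorted({t.lemma for t in tokens})
    raise ValueError(f"unknown extract type: {extract_type}")


if __name__ == "__main__":
    for kind in ("orth", "lower", "lemma"):
        print(kind, extract_keys(tokens, kind))
```

In this toy input, orth yields three keys, lower yields two, and lemma yields one, so the choice of extract type directly controls how many near-duplicate keys end up in the keyfile.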

For details, see the spaCy linguistic features and model NER labels documentation. These instructions assume the en model, but spaCy supports a wide range of models. You can also specify noun chunks. A noun chunk setting of 2, for example, would create keys like "Yellow House" or "Blond Hair".

You can create the NER metadata list with:

python -m document_parsing.parse_ner

| Optional param | Description |
| --- | --- |
| --data-directory | The directory where your text files are stored. Default "./run_files/documents/skynet" |
| --collection-name | The name of the collection. It is used as the name and location of the keyfile. Default "skynet" |
| --key-storage | The directory for the collection metadata keys. Default "./run_files/key_storage/" |
| --threads | The number of processes in the multiprocessing pool. Default 6. |
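
If you launch the parser from another Python program, the options above can be assembled into an argument list. This is a minimal sketch: the helper name `build_parse_ner_command` is an assumption, and only the module name, flags, and defaults listed above are taken from this document.

```python
import sys


def build_parse_ner_command(
    data_directory="./run_files/documents/skynet",
    collection_name="skynet",
    key_storage="./run_files/key_storage/",
    threads=6,
):
    """Build the argv list for `python -m document_parsing.parse_ner`.

    Hypothetical helper; the defaults mirror the documented ones.
    """
    return [
        sys.executable, "-m", "document_parsing.parse_ner",
        "--data-directory", data_directory,
        "--collection-name", collection_name,
        "--key-storage", key_storage,
        "--threads", str(threads),
    ]


if __name__ == "__main__":
    # Print the command; pass the list to subprocess.run(cmd, check=True)
    # from the repository root to actually start the parser.
    print(" ".join(build_parse_ner_command(threads=4)))
```

Passing the arguments as a list avoids shell quoting problems when directory names contain spaces.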