Named Entity Recognition(NER)
You can use textacy_parsing script for generating document metadata keys automatically. The scripts are a modified version of textacy code updated to run with the current spacy version. The script uses a spacy embeddings model to process a text document for a json metadata keyfile. The keys are parsed based on a config file in run_files/parse_configs/ner_types.json or run_files/parse_configs/ner_types_full.json. You can give your own config file if you want.
The new parse script uses multiprocess to improve performance. The default process pool number is 6. You should change the process number based on the number of cores your machine has.
The available configs are
Ngrams | Description |
---|---|
PROPN | Proper Noun |
NOUN | Noun |
ADJ | Adjective |
NNP | Noun proper singular |
NN | Noun, singular or mass |
AUX | Auxiliary |
VBZ | Verb, 3rd person singular present |
VERB | Verb |
ADP | Adposition |
SYM | Symbol |
NUM | Numeral |
CD | Cardinal number |
VBG | verb, gerund or present participle |
ROOT | Root |
Entities | Description |
---|---|
FAC | Buildings, airports, highways, bridges, etc. |
NORP | Nationalities or religious or political groups |
GPE | Countries, cities, states |
PRODUCT | Objects, vehicles, foods, etc. (not services) |
EVENT | Named hurricanes, battles, wars, sports events, etc. |
PERSON | People, including fictional |
ORG | Companies, agencies, institutions, etc. |
LOC | Non-GPE locations, mountain ranges, bodies of water |
DATE | Absolute or relative dates or periods |
TIME | Times smaller than a day |
WORK_OF_ART | Titles of books, songs, etc. |
Extract type | Description |
---|---|
orth | Terms are represented by their text exactly as written |
lower | Lowercased form of the text |
lemma | Base form w/o inflectional suffixes |
For details see Spacy linguistic features and Model NER labels. The instructions expect en model, but spacy supports a wide range of models. You can also specify Noun chunks. Noun chunk of 2 for example would create keys like "Yellow House" or "Blond Hair".
You can create ner metadata list with
python -m document_parsing.parse_ner
Optional param | Description |
---|---|
--data-directory | The directory where your text files are stored. Default "./run_files/documents/skynet" |
--collection-name | The name of the collection Will be used as name and location for the keyfile. Default "skynet" |
--key-storage | The directory for the collection metadata keys. Default "./run_files/key_storage/" |
--threads | The number of multiprocess threads. Default 6. |