Information Extraction

Key Phrase Extraction

Graph algorithms are used in unsupervised fashion. The nodes are weighted according to the frequency of the words and their connection to other words in the text. The top N nodes which are important are returned as key phrases. textacy is a package which can be used for key phrase extraction.

Named Entity Recognition (NER)

It is a sequence labeling challenge. The context of the surronding words and their POS tags are considered for creating features. conditional random fields (CRFs) algorithm is one of the popular choices for training NER. BIO scheme is used to annotate the text. B - Beggining, I - Intermediate and O - other. For example, First Name and last Name are B and I respectively. MITIE is a library to train NER systems. Stanford NER, spaCy and AllenNLP have pre-trained NER models.

To customize NER model to a domain or use case, use customized heuristics for the problem domain (using tools such as RegexNER and EntityRuler) or use active learning tools like Prodigy

Named Entity Disambiguation (NED) and Linking

NER and NED together are known as named entity linking (NEL). NEL needs to go beyond POS tagging and require parsing to identify items like subject, verb and object. It also requires coreference resolution to resolve and link multiple references to the same entity. This is modeled as supervised ML problem. It is common to use off-the-shelf services for NEL.

DBpedia Spotlight is a popular tool for entity linking.

Relationship Extraction

It is the task of extracting entities and relationshops between them from text documents.

It can be treated as supervised classification. It can be modeled as a two step classification problem

  • If two entities are related
  • If they are related what is the relation between them

Temporal Information Extraction

Duckling library can be used to extract temporal events. when we run the sentence “Let us meet at 3 p.m. today and decide on what to present at the meeting on Friday” through Duckling. It’s able to map “3 p.m. today” to the correct time on a given day.