Abstract

Recent advances in natural language processing have produced libraries that extract low-level features from a collection of raw texts. These features, known as annotations, are usually stored internally in hierarchical, tree-based data structures. This paper proposes a data model to represent annotations as a collection of normalized relational data tables optimized for exploratory data analysis and predictive modeling. The R package cleanNLP, which calls one of two state-of-the-art NLP libraries (CoreNLP or spaCy), is presented as an implementation of this data model. It takes raw text as input and returns a list of normalized tables. Specific annotations provided include tokenization, part-of-speech tagging, named entity recognition, sentiment analysis, dependency parsing, coreference resolution, and word embeddings. The package currently supports input text in English, German, French, and Spanish.
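
As a rough illustration of what "a list of normalized tables" can look like, the sketch below builds two toy tables from raw text: one row per document and one row per token, linked by a document key. It is written in Python using only the standard library and is not the cleanNLP API. The function names (`annotate`, `join_tokens_to_documents`), the column names (`doc_id`, `sid`, `tid`, `word`, `n_chars`), and the naive regular-expression tokenizer are all assumptions made for illustration, standing in for a real CoreNLP or spaCy backend.

```python
import re
from typing import Dict, List


def annotate(texts: List[str]) -> Dict[str, List[dict]]:
    """Return a toy annotation object: a dict of tables, each a list of row dicts.

    Hypothetical schema: 'document' has one row per input text and 'token'
    has one row per token, keyed by (doc_id, sid, tid).
    """
    documents: List[dict] = []
    tokens: List[dict] = []
    for doc_id, text in enumerate(texts, start=1):
        documents.append({"doc_id": doc_id, "n_chars": len(text)})
        # Naive sentence split after ., ! or ? (illustrative only).
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
        for sid, sentence in enumerate(sentences, start=1):
            # A token is a run of word characters or a single punctuation mark.
            words = re.findall(r"\w+|[^\w\s]", sentence)
            for tid, word in enumerate(words, start=1):
                tokens.append(
                    {"doc_id": doc_id, "sid": sid, "tid": tid, "word": word}
                )
    return {"document": documents, "token": tokens}


def join_tokens_to_documents(tables: Dict[str, List[dict]]) -> List[dict]:
    """Join the token table to the document table on the shared doc_id key."""
    by_id = {row["doc_id"]: row for row in tables["document"]}
    return [{**tok, **by_id[tok["doc_id"]]} for tok in tables["token"]]


if __name__ == "__main__":
    tables = annotate(["Hello world. Bye!", "One."])
    for row in join_tokens_to_documents(tables):
        print(row)
```

Because each table holds one kind of observation and shares explicit keys with the others, document-level and token-level information can be combined with an ordinary join rather than by walking a tree, which is the property the abstract highlights for exploratory analysis.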

Document Type

Article

Publication Date

12-2017

Publisher Statement

Please note that downloads of the article are for private/personal use only.

Recommended Citation

Arnold, Taylor. "A Tidy Data Model for Natural Language Processing Using CleanNLP." The R Journal, 9:2 (2017): 248-267.

Included in

Computer Sciences Commons

Department of Math & Statistics Faculty Publications

A Tidy Data Model for Natural Language Processing Using CleanNLP
