Spark NLP

License

To the extent possible under law, Spark NLP has waived all copyright and related or neighboring rights to this work.


What is Spark NLP?

John Snow Labs' Spark NLP is a natural language processing library built on top of Apache Spark ML. It provides simple, performant, and accurate NLP annotations for machine learning pipelines that scale easily in a distributed environment.


Features

  • Tokenization
  • Stop Words Removal
  • Normalizer
  • Stemmer
  • Lemmatizer
  • NGrams
  • Regex Matching
  • Text Matching
  • Chunking
  • Date Matcher
  • Sentence Detector
  • Part-of-speech tagging
  • Sentiment Detection (ML models)
  • Spell Checker (ML and DL models)
  • Word Embeddings (GloVe and Word2Vec)
  • BERT Embeddings (TF Hub models)
  • ELMO Embeddings (TF Hub models)
  • Universal Sentence Encoder (TF Hub models)
  • Sentence Embeddings
  • Chunk Embeddings
  • Multi-class Text Classification (Deep learning)
  • Named entity recognition (Deep learning)
  • Dependency parsing (labeled/unlabeled)
  • Easy TensorFlow integration
  • Full integration with Spark ML functions (see the sketch after this list)
  • 30+ pre-trained models in six languages (English, French, German, Italian, Spanish, and Russian)
  • 30+ pre-trained pipelines!
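
As a minimal sketch of the Spark ML integration mentioned above (not an official example; the annotator and column names are our own choices, following the 2.4.5 Python API used in the Quickstart below), Spark NLP annotators compose like any other pyspark.ml stage:

from pyspark.ml import Pipeline
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, Normalizer

spark = sparknlp.start()

# Each annotator reads one or more input columns and writes one output column
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normalized")

# Annotators are regular Spark ML stages, so they mix freely with
# any other pyspark.ml transformer or estimator
pipeline = Pipeline(stages=[documentAssembler, tokenizer, normalizer])

data = spark.createDataFrame([["Spark NLP runs on Apache Spark ML."]]).toDF("text")
model = pipeline.fit(data)
model.transform(data).select("normalized.result").show(truncate=False)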

Requirements

To use Spark NLP, you need:

  • Java 8
  • Apache Spark 2.4.x

Installation

$ java -version
# should be Java 8 (Oracle or OpenJDK)
$ conda create -n sparknlp python=3.6 -y
$ conda activate sparknlp
$ pip install spark-nlp==2.4.5 pyspark==2.4.4
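
As a quick sanity check of the install (a sketch assuming the version pins above; exact output may differ on your machine):

$ python
>>> import sparknlp
>>> spark = sparknlp.start()   # starts a local SparkSession with Spark NLP on the classpath
>>> sparknlp.version()
'2.4.5'
>>> spark.version
'2.4.4'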

Quickstart

# Import Spark NLP
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.pretrained import PretrainedPipeline
import sparknlp

# Start Spark Session with Spark NLP
spark = sparknlp.start()

# Download a pre-trained pipeline
pipeline = PretrainedPipeline('explain_document_dl', lang='en')

# Your testing dataset
text = """
The Mona Lisa is a 16th century oil painting created by Leonardo.
It's held at the Louvre in Paris.
"""

# Annotate your testing dataset
result = pipeline.annotate(text)

# What's in the pipeline
list(result.keys())
# Output: ['entities', 'stem', 'checked', 'lemma', 'document',
#          'pos', 'token', 'ner', 'embeddings', 'sentence']

# Check the results
result['entities']
# Output: ['Mona Lisa', 'Leonardo', 'Louvre', 'Paris']
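
Note that annotate() runs the pipeline in-process on a single string. For larger inputs, the same pretrained pipeline can also score a Spark DataFrame in a distributed fashion (a sketch, reusing the spark and pipeline objects above):

# Build a DataFrame with a 'text' column and annotate it at scale
data = spark.createDataFrame([[text]]).toDF("text")
annotated = pipeline.transform(data)
annotated.select("entities.result").show(truncate=False)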

Spark NLP Workshop