Spark NLP

License

To the extent possible under law, Spark NLP has waived all copyright and related or neighboring rights to this work.


What is Spark NLP?

John Snow Labs' Spark NLP is a natural language processing library built on top of Apache Spark ML. It provides simple, performant, and accurate NLP annotations for machine learning pipelines that scale easily in a distributed environment.


Features

  • Tokenization
  • Stop Words Removal
  • Normalizer
  • Stemmer
  • Lemmatizer
  • NGrams
  • Regex Matching
  • Text Matching
  • Chunking
  • Date Matcher
  • Sentence Detector
  • Part-of-speech tagging
  • Sentiment Detection (ML models)
  • Spell Checker (ML and DL models)
  • Word Embeddings (GloVe and Word2Vec)
  • BERT Embeddings (TF Hub models)
  • ELMO Embeddings (TF Hub models)
  • Universal Sentence Encoder (TF Hub models)
  • Sentence Embeddings
  • Chunk Embeddings
  • Multi-class Text Classification (Deep learning)
  • Named entity recognition (Deep learning)
  • Dependency parsing (labeled/unlabeled)
  • Easy TensorFlow integration
  • Full integration with Spark ML functions (see the sketch after this list)
  • 30+ pre-trained models in six languages (English, French, German, Italian, Spanish, and Russian)
  • 30+ pre-trained pipelines!
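
As a minimal sketch of the Spark ML integration mentioned above (not an official example; the annotator and column names are our own choices, following the 2.4.5 Python API used in the Quickstart below), Spark NLP annotators compose like any other pyspark.ml stage:

from pyspark.ml import Pipeline
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, Normalizer

spark = sparknlp.start()

# Each annotator reads one or more input columns and writes one output column
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normalized")

# Annotators are regular Spark ML stages, so they mix freely with
# any other pyspark.ml transformer or estimator
pipeline = Pipeline(stages=[documentAssembler, tokenizer, normalizer])

data = spark.createDataFrame([["Spark NLP runs on Apache Spark ML."]]).toDF("text")
model = pipeline.fit(data)
model.transform(data).select("normalized.result").show(truncate=False)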

Requirements

To use Spark NLP, you need:

  • Java 8
  • Apache Spark 2.4.x

Installation

$ java -version
# should be Java 8 (Oracle or OpenJDK)
$ conda create -n sparknlp python=3.6 -y
$ conda activate sparknlp
$ pip install spark-nlp==2.4.5 pyspark==2.4.4
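
As a quick sanity check of the install (a sketch assuming the version pins above; exact output may differ on your machine):

$ python
>>> import sparknlp
>>> spark = sparknlp.start()   # starts a local SparkSession with Spark NLP on the classpath
>>> sparknlp.version()
'2.4.5'
>>> spark.version
'2.4.4'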

Quickstart

# Import Spark NLP
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.pretrained import PretrainedPipeline
import sparknlp

# Start Spark Session with Spark NLP
spark = sparknlp.start()

# Download a pre-trained pipeline
pipeline = PretrainedPipeline('explain_document_dl', lang='en')

# Your testing dataset
text = """
The Mona Lisa is a 16th century oil painting created by Leonardo.
It's held at the Louvre in Paris.
"""

# Annotate your testing dataset
result = pipeline.annotate(text)

# What's in the pipeline
list(result.keys())
# Output: ['entities', 'stem', 'checked', 'lemma', 'document',
#          'pos', 'token', 'ner', 'embeddings', 'sentence']

# Check the results
result['entities']
# Output: ['Mona Lisa', 'Leonardo', 'Louvre', 'Paris']
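
Note that annotate() runs the pipeline in-process on a single string. For larger inputs, the same pretrained pipeline can also score a Spark DataFrame in a distributed fashion (a sketch, reusing the spark and pipeline objects above):

# Build a DataFrame with a 'text' column and annotate it at scale
data = spark.createDataFrame([[text]]).toDF("text")
annotated = pipeline.transform(data)
annotated.select("entities.result").show(truncate=False)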

Spark NLP Workshop