Spark NLP - Features

Text Preprocessing

  • Tokenization
  • Trainable Word Segmentation
  • Stop Words Removal
  • Token Normalizer
  • Document Normalizer
  • Document & Text Splitter
  • Stemmer
  • Lemmatizer
  • NGrams
  • Regex Matching
  • Text Matching
  • Spell Checker (ML and DL models)
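
To make a few of the preprocessing steps above concrete, here is a minimal, library-free sketch of tokenization, stop-word removal, and n-gram generation. It is a conceptual illustration only and does not use the Spark NLP API: the function names, the regular expression, and the tiny stop-word list are all assumptions made for this example.

```python
import re

# Hypothetical stop-word list for illustration; not the list Spark NLP ships.
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}


def tokenize(text):
    """Lowercase the text and split it into runs of letters, digits, or apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop-word set."""
    return [t for t in tokens if t not in stop_words]


def ngrams(tokens, n=2):
    """Return all contiguous n-token windows, each joined by a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize("The quick brown fox jumps over the lazy dog")
filtered = remove_stop_words(tokens)
bigrams = ngrams(filtered, 2)
print(filtered)
print(bigrams)
```

In a real pipeline these steps would be handled by the corresponding annotators, which also track character offsets and run distributed over a DataFrame; the sketch only shows the order of operations on a single string.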

Parsing and Analysis

  • Chunking
  • Date Matcher
  • Sentence Detector
  • Deep Sentence Detector (Deep learning)
  • Dependency parsing (Labeled/unlabeled)
  • SpanBertCorefModel (Coreference Resolution)
  • Part-of-speech tagging
  • Named entity recognition (Deep learning)
  • Unsupervised keywords extraction
  • Language Detection & Identification (up to 375 languages)

Sentiment and Classification

  • Sentiment Detection (ML models)
  • Multi-class & Multi-label Sentiment analysis (Deep learning)
  • Multi-class Text Classification (Deep learning)
  • Zero-Shot NER Model
  • Zero-Shot Text Classification by Transformers (ZSL)

Embeddings

  • Word Embeddings (GloVe and Word2Vec)
  • Doc2Vec (based on Word2Vec)
  • BERT Embeddings (TF Hub & HuggingFace models)
  • DistilBERT Embeddings (HuggingFace models)
  • CamemBERT Embeddings (HuggingFace models)
  • RoBERTa Embeddings (HuggingFace models)
  • DeBERTa Embeddings (HuggingFace v2 & v3 models)
  • XLM-RoBERTa Embeddings (HuggingFace models)
  • Longformer Embeddings (HuggingFace models)
  • ALBERT Embeddings (TF Hub & HuggingFace models)
  • XLNet Embeddings
  • ELMo Embeddings (TF Hub models)
  • Universal Sentence Encoder (TF Hub models)
  • BERT Sentence Embeddings (TF Hub & HuggingFace models)
  • RoBERTa Sentence Embeddings (HuggingFace models)
  • XLM-RoBERTa Sentence Embeddings (HuggingFace models)
  • INSTRUCTOR Embeddings (HuggingFace models)
  • E5 Embeddings (HuggingFace models)
  • MPNet Embeddings (HuggingFace models)
  • UAE Embeddings (HuggingFace models)
  • OpenAI Embeddings
  • Sentence & Chunk Embeddings
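
As a rough illustration of how word-level vectors can be combined into a sentence-level vector and then compared, the sketch below averages toy word vectors and computes cosine similarity. It is a generic, library-free example: the three-dimensional vectors are made up, the function names are hypothetical, and average pooling is assumed here as one common strategy rather than a statement of how any particular model listed above works.

```python
import math


def average_pool(vectors):
    """Average a non-empty list of equal-length vectors, dimension by dimension."""
    if not vectors:
        raise ValueError("need at least one vector to pool")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional "word embeddings", for illustration only.
toy_vectors = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.8, 0.2, 0.0],
    "car": [0.0, 0.0, 1.0],
    "sleeps": [0.0, 1.0, 0.0],
}

sentence_a = average_pool([toy_vectors["cat"], toy_vectors["sleeps"]])
sentence_b = average_pool([toy_vectors["dog"], toy_vectors["sleeps"]])
sentence_c = average_pool([toy_vectors["car"], toy_vectors["car"]])

print(cosine_similarity(sentence_a, sentence_b))
print(cosine_similarity(sentence_a, sentence_c))
```

With these toy values the two animal sentences score much closer to each other than either does to the "car" sentence, which is the behavior sentence embeddings are meant to capture at scale.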

Classification and Question Answering Models

  • BERT for Token & Sequence Classification & Question Answering
  • DistilBERT for Token & Sequence Classification & Question Answering
  • CamemBERT for Token & Sequence Classification & Question Answering
  • ALBERT for Token & Sequence Classification & Question Answering
  • RoBERTa for Token & Sequence Classification & Question Answering
  • DeBERTa for Token & Sequence Classification & Question Answering
  • XLM-RoBERTa for Token & Sequence Classification & Question Answering
  • Longformer for Token & Sequence Classification & Question Answering
  • MPNet for Token & Sequence Classification & Question Answering
  • XLNet for Token & Sequence Classification

Machine Translation and Generation

  • Neural Machine Translation (MarianMT)
  • Many-to-Many multilingual translation model (Facebook M2M100)
  • Table Question Answering (TAPAS)
  • Text-To-Text Transfer Transformer (Google T5)
  • Generative Pre-trained Transformer 2 (OpenAI GPT2)
  • Seq2Seq for NLG, Translation, and Comprehension (Facebook BART)
  • Chat and Conversational LLMs (Facebook Llama-2)

Image and Speech

  • Vision Transformer (Google ViT)
  • Swin Image Classification (Microsoft Swin Transformer)
  • ConvNext Image Classification (Facebook ConvNext)
  • Vision Encoder Decoder for image-to-text tasks such as captioning
  • Zero-Shot Image Classification by OpenAI’s CLIP
  • Automatic Speech Recognition (Wav2Vec2)
  • Automatic Speech Recognition (HuBERT)
  • Automatic Speech Recognition (OpenAI Whisper)

Integration and Interoperability

Pre-trained Models

  • 31,000+ pre-trained models in 200+ languages
  • 6,000+ pre-trained pipelines in 200+ languages

See the Models Hub for the full list of pre-trained models, with examples, demos, benchmarks, and more.

Multi-lingual Support

  • Multi-lingual NER models: Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, German, Hebrew, Italian, Japanese, Korean, Norwegian, Persian, Polish, Portuguese, Russian, Spanish, Swedish, Urdu, and more.