Text Preprocessing
- Tokenization
- Trainable Word Segmentation
- Stop Words Removal
- Token Normalizer
- Document Normalizer
- Document & Text Splitter
- Stemmer
- Lemmatizer
- NGrams
- Regex Matching
- Text Matching
- Spell Checker (ML and DL models)
Parsing and Analysis
- Chunking
- Date Matcher
- Sentence Detector
- Deep Sentence Detector (Deep learning)
- Dependency parsing (Labeled/unlabeled)
- SpanBertCorefModel (Coreference Resolution)
- Part-of-speech tagging
- Named entity recognition (Deep learning)
- Unsupervised keywords extraction
- Language Detection & Identification (up to 375 languages)
Sentiment and Classification
- Sentiment Detection (ML models)
- Multi-class & Multi-label Sentiment analysis (Deep learning)
- Multi-class Text Classification (Deep learning)
- Zero-Shot NER Model
- Zero-Shot Text Classification by Transformers (ZSL)
Embeddings
- Word Embeddings (GloVe and Word2Vec)
- Doc2Vec (based on Word2Vec)
- BERT Embeddings (TF Hub & HuggingFace models)
- DistilBERT Embeddings (HuggingFace models)
- CamemBERT Embeddings (HuggingFace models)
- RoBERTa Embeddings (HuggingFace models)
- DeBERTa Embeddings (HuggingFace v2 & v3 models)
- XLM-RoBERTa Embeddings (HuggingFace models)
- Longformer Embeddings (HuggingFace models)
- ALBERT Embeddings (TF Hub & HuggingFace models)
- XLNet Embeddings
- ELMo Embeddings (TF Hub models)
- Universal Sentence Encoder (TF Hub models)
- BERT Sentence Embeddings (TF Hub & HuggingFace models)
- RoBERTa Sentence Embeddings (HuggingFace models)
- XLM-RoBERTa Sentence Embeddings (HuggingFace models)
- INSTRUCTOR Embeddings (HuggingFace models)
- E5 Embeddings (HuggingFace models)
- MPNet Embeddings (HuggingFace models)
- UAE Embeddings (HuggingFace models)
- OpenAI Embeddings
- Sentence & Chunk Embeddings
Classification and Question Answering Models
- BERT for Token & Sequence Classification & Question Answering
- DistilBERT for Token & Sequence Classification & Question Answering
- CamemBERT for Token & Sequence Classification & Question Answering
- ALBERT for Token & Sequence Classification & Question Answering
- RoBERTa for Token & Sequence Classification & Question Answering
- DeBERTa for Token & Sequence Classification & Question Answering
- XLM-RoBERTa for Token & Sequence Classification & Question Answering
- Longformer for Token & Sequence Classification & Question Answering
- MPNet for Token & Sequence Classification & Question Answering
- XLNet for Token & Sequence Classification
Machine Translation and Generation
- Neural Machine Translation (MarianMT)
- Many-to-Many multilingual translation model (Facebook M2M100)
- Table Question Answering (TAPAS)
- Text-To-Text Transfer Transformer (Google T5)
- Generative Pre-trained Transformer 2 (OpenAI GPT2)
- Seq2Seq for NLG, Translation, and Comprehension (Facebook BART)
- Chat and Conversational LLMs (Facebook Llama-2)
Image and Speech
- Vision Transformer (Google ViT)
- Swin Image Classification (Microsoft Swin Transformer)
- ConvNeXt Image Classification (Facebook ConvNeXt)
- Vision Encoder Decoder for image-to-text tasks such as captioning
- Zero-Shot Image Classification by OpenAI’s CLIP
- Automatic Speech Recognition (Wav2Vec2)
- Automatic Speech Recognition (HuBERT)
- Automatic Speech Recognition (OpenAI Whisper)
Integration and Interoperability
- Easy ONNX, OpenVINO, and TensorFlow integrations
- Full integration with Spark ML functions
- GPU Support
Pre-trained Models
- 31,000+ pre-trained models in 200+ languages!
- 6,000+ pre-trained pipelines in 200+ languages!
Please check out our Models Hub for the full list of pre-trained models with examples, demos, benchmarks, and more.
Multi-lingual Support
- Multi-lingual NER models: Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, German, Hebrew, Italian, Japanese, Korean, Norwegian, Persian, Polish, Portuguese, Russian, Spanish, Swedish, Urdu, and more.