Basic General Purpose Pipeline for Catalan

Description

General-purpose pipeline for Catalan language processing, built on models from the Barcelona Supercomputing Center and the AINA project (Generalitat de Catalunya), and following the POS and tokenization guidelines of the AnCora Universal Dependencies corpus.


How to use

from sparknlp.pretrained import PretrainedPipeline

pipeline = PretrainedPipeline("pipeline_md", "ca", "@cayorodriguez")

result = pipeline.annotate("El català ja és a SparkNLP.")
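
The annotate() call above returns a plain Python dict; for larger inputs the pipeline can also be run over a Spark DataFrame. A minimal sketch, assuming an active Spark session started with sparknlp.start() and an input column named text (the selected output column names are taken from the annotate() result shown below):

import sparknlp
from sparknlp.pretrained import PretrainedPipeline

spark = sparknlp.start()

# Hypothetical input DataFrame; the pipeline's DocumentAssembler reads the "text" column.
df = spark.createDataFrame([("El català ja és a SparkNLP.",)], ["text"])

pipeline = PretrainedPipeline("pipeline_md", "ca", "@cayorodriguez")

# transform() runs every stage of the pretrained pipeline and returns an annotated DataFrame.
annotated = pipeline.transform(df)
annotated.select("token.result", "pos.result", "ner.result").show(truncate=False)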

Results

{'chunk': ['El català ja', 'SparkNLP', 'és'],
 'entities': ['SparkNLP'],
 'lemma': ['el', 'català', 'ja', 'ser', 'a', 'sparknlp', '.'],
 'document': ['El català ja és a SparkNLP.'],
 'pos': ['DET', 'NOUN', 'ADV', 'AUX', 'ADP', 'PROPN', 'PUNCT'],
 'sentence_embeddings': ['El català ja és a SparkNLP.'],
 'cleanTokens': ['català', 'SparkNLP', '.'],
 'token': ['El', 'català', 'ja', 'és', 'a', 'SparkNLP', '.'],
 'ner': ['O', 'O', 'O', 'O', 'O', 'B-ORG', 'O'],
 'embeddings': ['El', 'català', 'ja', 'és', 'a', 'SparkNLP', '.'],
 'form': ['el', 'català', 'ja', 'és', 'a', 'sparknlp', '.'],
 'sentence': ['El català ja és a SparkNLP.']}
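
The token, pos, and ner lists in the annotate() output are parallel (one entry per token), so they can be zipped together; a minimal sketch based on the result shown above:

# Pair each token with its POS tag and NER label.
for token, pos, ner in zip(result["token"], result["pos"], result["ner"]):
    print(f"{token}\t{pos}\t{ner}")

# Keep only tokens carrying a named-entity label (anything other than "O").
entities = [t for t, n in zip(result["token"], result["ner"]) if n != "O"]
print(entities)  # ['SparkNLP'] for the example sentence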

Model Information

Model Name: pipeline_md
Type: pipeline
Compatibility: Spark NLP 3.4.4+
License: Open Source
Edition: Community
Language: ca
Size: 756.1 MB

Included Models

  • DocumentAssembler
  • SentenceDetectorDLModel
  • TokenizerModel
  • NormalizerModel
  • StopWordsCleaner
  • RoBertaEmbeddings
  • SentenceEmbeddings
  • EmbeddingsFinisher
  • LemmatizerModel
  • PerceptronModel
  • RoBertaForTokenClassification
  • NerConverter
  • Chunker
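
To confirm this list at runtime, the loaded pipeline exposes the underlying Spark ML PipelineModel; a minimal sketch, assuming the pipeline object created in the usage example above:

# Print the class name of each annotator stage bundled in the pretrained pipeline.
for stage in pipeline.model.stages:
    print(type(stage).__name__)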