Description
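CamemBERT is a state-of-the-art language model for French based on the RoBERTa model. This variant, `camembert_base_ccnet`, uses the base architecture and was trained on CCNet (135 GB of text). For further information or requests, see the CamemBERT website.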
How to use
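# Python
from sparknlp.annotator import CamemBertEmbeddings

embeddings = CamemBertEmbeddings.pretrained("camembert_base_ccnet", "fr") \
    .setInputCols("sentence", "token") \
    .setOutputCol("embeddings")

// Scala
import com.johnsnowlabs.nlp.embeddings.CamemBertEmbeddings

val embeddings = CamemBertEmbeddings.pretrained("camembert_base_ccnet", "fr")
  .setInputCols("sentence", "token")
  .setOutputCol("embeddings")

# Python (NLU)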
import nlu
nlu.load("fr.embed.camembert_base_ccnet").predict("""Put your text here.""")
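The annotator snippets above expect `sentence` and `token` columns to exist already. As a minimal sketch of how those columns might be produced, the following Python code wraps the embeddings stage in a full Spark NLP pipeline. The helper names `build_pipeline` and `embed_texts`, the input column name `text`, and the use of `SentenceDetector` and `Tokenizer` as upstream stages are assumptions for illustration and are not part of this model card. Imports are deferred into the function so the module can be loaded before Spark is available.

```python
MODEL_NAME = "camembert_base_ccnet"  # model name taken from this card
LANG = "fr"                          # language code taken from this card


def build_pipeline():
    """Assemble document -> sentence -> token -> CamemBERT embeddings stages.

    Hypothetical helper: the upstream stages are one common choice, not the
    only way to produce the `sentence` and `token` columns.
    """
    # Deferred imports: Spark and Spark NLP are only needed when this runs.
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import (
        CamemBertEmbeddings,
        SentenceDetector,
        Tokenizer,
    )

    # Turn the raw `text` column (assumed name) into document annotations.
    document_assembler = (
        DocumentAssembler().setInputCol("text").setOutputCol("document")
    )
    # Split documents into sentences, then sentences into tokens.
    sentence_detector = (
        SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
    )
    tokenizer = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
    # Same configuration as the snippet in this card.
    embeddings = (
        CamemBertEmbeddings.pretrained(MODEL_NAME, LANG)
        .setInputCols(["sentence", "token"])
        .setOutputCol("embeddings")
    )
    return Pipeline(
        stages=[document_assembler, sentence_detector, tokenizer, embeddings]
    )


def embed_texts(spark, texts):
    """Run the pipeline over a list of strings; one output row per token."""
    data = spark.createDataFrame([[t] for t in texts]).toDF("text")
    result = build_pipeline().fit(data).transform(data)
    # Each embeddings annotation carries the token text and its vector.
    return result.selectExpr("explode(embeddings) AS e").selectExpr(
        "e.result AS token", "e.embeddings AS vector"
    )
```

With a session created by `sparknlp.start()`, a call such as `embed_texts(spark, ["J'aime le camembert."])` should return a DataFrame with `token` and `vector` columns. The first call downloads the pretrained model, so it needs network access.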
Model Information
| Property | Value |
|----------------|----------------------|
| Model Name | camembert_base_ccnet |
| Compatibility | Spark NLP 3.4.4+ |
| License | Open Source |
| Edition | Official |
| Input Labels | [token, sentence] |
| Output Labels | [embeddings] |
| Language | fr |
| Size | 266.1 MB |
| Case sensitive | true |
References
https://huggingface.co/camembert/camembert-base-ccnet
Benchmarking
| Model | #params | Arch. | Training data |
|--------------------------------|--------------------------------|-------|-----------------------------------|
| `camembert-base` | 110M | Base | OSCAR (138 GB of text) |
| `camembert/camembert-large` | 335M | Large | CCNet (135 GB of text) |
| `camembert/camembert-base-ccnet` | 110M | Base | CCNet (135 GB of text) |
| `camembert/camembert-base-wikipedia-4gb` | 110M | Base | Wikipedia (4 GB of text) |
| `camembert/camembert-base-oscar-4gb` | 110M | Base | Subsample of OSCAR (4 GB of text) |
| `camembert/camembert-base-ccnet-4gb` | 110M | Base | Subsample of CCNet (4 GB of text) |