Description
CamemBERT is a state-of-the-art language model for French based on the RoBERTa architecture. For further information or requests, see the CamemBERT website.
How to use
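from sparknlp.annotator import CamemBertEmbeddings

# Python: "sentence" and "token" are columns produced by upstream annotators
embeddings = CamemBertEmbeddings.pretrained("camembert_base", "fr") \
    .setInputCols("sentence", "token") \
    .setOutputCol("embeddings")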
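import com.johnsnowlabs.nlp.embeddings.CamemBertEmbeddings

// Scala: "sentence" and "token" are columns produced by upstream annotators
val embeddings = CamemBertEmbeddings.pretrained("camembert_base", "fr")
  .setInputCols("sentence", "token")
  .setOutputCol("embeddings")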
import nlu
nlu.load("fr.embed.camembert_base").predict("""Put your text here.""")
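The snippets above configure only the embeddings stage. As a rough sketch of how that stage might sit in a complete pipeline, the following assumes Spark NLP 5.5.0+ and PySpark are installed and that the input DataFrame has a column named `text`. The helper names `build_pipeline` and `embed_texts` are hypothetical and chosen for illustration, and the `document` input column follows the Input Labels listed under Model Information rather than the `sentence` column used above.

```python
# A minimal sketch, assuming Spark NLP 5.5.0+ and PySpark are installed.
# Helper names (build_pipeline, embed_texts) are illustrative, not part of Spark NLP.

MODEL_NAME = "camembert_base"  # model name from this card
LANGUAGE = "fr"                # language code from this card


def build_pipeline():
    """Assemble the stages: raw text -> document -> token -> embeddings."""
    # Imports are local so the module can be loaded without a Spark install.
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import Tokenizer, CamemBertEmbeddings

    # Wrap the raw "text" column into Spark NLP document annotations.
    document_assembler = (
        DocumentAssembler().setInputCol("text").setOutputCol("document")
    )
    # Split each document into tokens.
    tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
    # Download (or load from cache) the pretrained model and wire its inputs.
    embeddings = (
        CamemBertEmbeddings.pretrained(MODEL_NAME, LANGUAGE)
        .setInputCols(["document", "token"])
        .setOutputCol("embeddings")
    )
    return Pipeline(stages=[document_assembler, tokenizer, embeddings])


def embed_texts(texts):
    """Return a DataFrame with one row per token and its embedding vector."""
    import sparknlp

    spark = sparknlp.start()  # starts or reuses a local Spark session
    data = spark.createDataFrame([[t] for t in texts]).toDF("text")
    result = build_pipeline().fit(data).transform(data)
    # Each annotation carries the token text in `result` and its vector
    # in `embeddings`; explode gives one row per token.
    return result.selectExpr("explode(embeddings) as e").selectExpr(
        "e.result as token", "e.embeddings as vector"
    )
```

Calling `embed_texts(["J'aime le camembert."]).show(truncate=80)` should then list each token next to its vector. The first call downloads the model, so it needs network access.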
Model Information
| Model Name: | camembert_base | 
| Compatibility: | Spark NLP 5.5.0+ | 
| License: | Open Source | 
| Edition: | Official | 
| Input Labels: | [document, token] | 
| Output Labels: | [embeddings] | 
| Language: | fr | 
| Size: | 263.6 MB | 
| Case sensitive: | true | 
| Max sentence length: | 512 | 
References
https://huggingface.co/almanach/camembert-base
Benchmarking
| Model                                    | Parameters | Architecture | Training data                     |
|------------------------------------------|------------|--------------|-----------------------------------|
| `camembert-base`                         | 110M       | Base         | OSCAR (138 GB of text)            |
| `camembert/camembert-large`              | 335M       | Large        | CCNet (135 GB of text)            |
| `camembert/camembert-base-ccnet`         | 110M       | Base         | CCNet (135 GB of text)            |
| `camembert/camembert-base-wikipedia-4gb` | 110M       | Base         | Wikipedia (4 GB of text)          |
| `camembert/camembert-base-oscar-4gb`     | 110M       | Base         | Subsample of OSCAR (4 GB of text) |
| `camembert/camembert-base-ccnet-4gb`     | 110M       | Base         | Subsample of CCNet (4 GB of text) |