Description
WordSegmenterModel (WSM) uses a maximum entropy probability model to detect word boundaries in Chinese text. Chinese is written without white space between words, so a computer-based application cannot know a priori which sequences of ideograms form a word. Many natural language processing tasks, such as part-of-speech (POS) tagging and named entity recognition (NER), require word segmentation as an initial step.
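To make the idea of boundary detection concrete, here is a minimal sketch in plain Python. It is not the model's implementation: it only assumes that a segmenter produces one yes/no decision per character ("does a word end here?") and shows how such decisions turn a run of characters into words. The function name `words_from_boundaries` and the example flags are hypothetical.

```python
from typing import List


def words_from_boundaries(text: str, boundary_after: List[bool]) -> List[str]:
    """Split `text` into words given one boundary flag per character.

    boundary_after[i] is True when a word ends right after text[i].
    """
    if len(text) != len(boundary_after):
        raise ValueError("need exactly one boundary flag per character")

    words: List[str] = []
    current = ""
    for ch, is_end in zip(text, boundary_after):
        current += ch
        if is_end:
            words.append(current)
            current = ""
    if current:
        # Trailing characters with no closing boundary still form a word.
        words.append(current)
    return words


# Hypothetical boundary decisions for a four-character sentence.
flags = [True, True, False, True]
print(words_from_boundaries("我爱北京", flags))  # ['我', '爱', '北京']
```

A trained segmenter supplies the flags; the decoding step itself is this simple, which is why downstream quality depends almost entirely on how well the boundaries are predicted.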
How to use
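from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import WordSegmenterModel

document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
word_segmenter = WordSegmenterModel.pretrained("wordseg_pku", "zh").setInputCols(["document"]).setOutputCol("token")
pipeline = Pipeline(stages=[document_assembler, word_segmenter])
ws_model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
example = spark.createDataFrame([['从John Snow Labs你好! ']], ["text"])
result = ws_model.transform(example)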
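import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.ws.WordSegmenterModel
import org.apache.spark.ml.Pipeline
import spark.implicits._

val document_assembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")
val word_segmenter = WordSegmenterModel.pretrained("wordseg_pku", "zh")
  .setInputCols(Array("document"))
  .setOutputCol("token")
val pipeline = new Pipeline().setStages(Array(document_assembler, word_segmenter))
val data = Seq("从John Snow Labs你好! ").toDF("text")
val result = pipeline.fit(data).transform(data)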
import nlu
text = ["从John Snow Labs你好! "]
token_df = nlu.load('zh.segment_words.pku').predict(text)
token_df
Results
0 从
1 Jo
2 hn
3 Sn
4 ow
5 La
6 bs
7 你
8 好
9 !
Name: token, dtype: object
Model Information
Model Name: wordseg_pku
Compatibility: Spark NLP 3.0.0+
License: Open Source
Edition: Official
Input Labels: [document]
Output Labels: [words_segmented]
Language: zh