Automatic Speech Recognition (ASR) is the technology that enables computers to recognize human speech and convert it into text. ASR plays a vital role in numerous applications, from voice-activated assistants to transcription services, making it an essential part of modern natural language processing (NLP) solutions. Spark NLP provides powerful tools for implementing ASR systems effectively.
In this context, ASR involves converting spoken language into text by analyzing audio signals. Common use cases include:
- Voice Assistants: Enabling devices like smartphones and smart speakers to understand and respond to user commands.
- Transcription Services: Automatically converting audio recordings from meetings, interviews, or lectures into written text.
- Accessibility: Helping individuals with disabilities interact with technology through voice commands.
By leveraging ASR, organizations can enhance user experience, improve accessibility, and streamline workflows that involve audio data.
Picking a Model
When selecting a model for Automatic Speech Recognition, it’s essential to evaluate several factors to ensure optimal performance for your specific use case. Begin by analyzing the nature of your audio data, considering the accent, language, and quality of the recordings. Determine if your task requires real-time transcription or if batch processing is sufficient, as some models excel in specific scenarios.
Next, assess the model complexity; simpler models may suffice for straightforward tasks, while more sophisticated models are better suited for nuanced speech recognition. Consider the availability of diverse audio data for training, as larger datasets can significantly enhance model performance. Define key performance metrics (e.g., word error rate, accuracy) to guide your choice, and ensure the model’s interpretability meets your requirements. Finally, account for resource constraints, as advanced models typically demand more memory and processing power.
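To make the word error rate metric concrete, the sketch below computes WER as the word-level edit distance between a reference transcript and a hypothesis, divided by the reference length. This is a plain-Python illustration for evaluation purposes, not a Spark NLP API.
# Minimal word error rate (WER) sketch: edit distance over words,
# divided by the number of words in the reference transcript.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
# Example: one substitution in a four-word reference -> WER = 0.25
print(wer("we are glad to", "we were glad to"))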
To explore and select from a variety of models, visit Spark NLP Models, where you can find models tailored for different ASR tasks and languages.
Recommended Models for Automatic Speech Recognition Tasks
- General Speech Recognition: Use models like asr_wav2vec2_large_xlsr_53_english_by_jonatasgrosman for general-purpose transcription (see the loading sketch below).
- Multilingual Support: For applications requiring support for multiple languages, consider models like asr_wav2vec2_large_xlsr_53_portuguese_by_jonatasgrosman, based on the Wav2Vec2ForCTC transformer.
By thoughtfully considering these factors and using the right models, you can enhance your ASR applications significantly.
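To load one of these checkpoints explicitly instead of the default model, pass its name to pretrained(). A minimal sketch, assuming the English model listed above and "en" as its language code:
from sparknlp.annotator import Wav2Vec2ForCTC
# Load a specific pretrained checkpoint by name
# (model name from the list above; "en" is the assumed language code)
speechToText = Wav2Vec2ForCTC.pretrained(
        "asr_wav2vec2_large_xlsr_53_english_by_jonatasgrosman", "en") \
    .setInputCols(["audio_assembler"]) \
    .setOutputCol("text")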
How to use
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
# Step 1: Assemble the raw audio content into a suitable format
audioAssembler = AudioAssembler() \
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")
# Step 2: Load a pre-trained Wav2Vec2 model for automatic speech recognition (ASR)
speechToText = Wav2Vec2ForCTC \
.pretrained() \
.setInputCols(["audio_assembler"]) \
.setOutputCol("text")
# Step 3: Define the pipeline with audio assembler and speech-to-text model
pipeline = Pipeline().setStages([audioAssembler, speechToText])
# Step 4: Create a DataFrame containing the raw audio content (as floats)
# `rawFloats` is assumed to be a list of float audio samples (mono waveform);
# see the note after the output below for one way to obtain it from a .wav file
processedAudioFloats = spark.createDataFrame([[rawFloats]]).toDF("audio_content")
# Step 5: Fit the pipeline and transform the audio data
result = pipeline.fit(processedAudioFloats).transform(processedAudioFloats)
# Step 6: Display the transcribed text from the audio
result.select("text.result").show(truncate = False)
+------------------------------------------------------------------------------------------+
|result |
+------------------------------------------------------------------------------------------+
|[MISTER QUILTER IS THE APOSTLE OF THE MIDLE CLASES AND WE ARE GLAD TO WELCOME HIS GOSPEL ]|
+------------------------------------------------------------------------------------------+
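The Python snippet above assumes rawFloats already holds the audio samples as a list of floats. One common way to obtain such an array from a .wav file (not part of the original example) is the librosa library; the sketch below resamples to 16 kHz, the rate Wav2Vec2 checkpoints typically expect, and the file path is only a placeholder.
import librosa
# Read a mono waveform and resample it to 16 kHz
# ("path/to/audio.wav" is a placeholder path)
audio, sampling_rate = librosa.load("path/to/audio.wav", sr=16000)
# Spark NLP expects plain Python floats in the "audio_content" column
rawFloats = [float(sample) for sample in audio]
processedAudioFloats = spark.createDataFrame([[rawFloats]]).toDF("audio_content")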
import spark.implicits._
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotators._
import com.johnsnowlabs.nlp.annotators.audio.Wav2Vec2ForCTC
import org.apache.spark.ml.Pipeline
// Step 1: Assemble the raw audio content into a suitable format
val audioAssembler: AudioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")
// Step 2: Load a pre-trained Wav2Vec2 model for automatic speech recognition (ASR)
val speechToText: Wav2Vec2ForCTC = Wav2Vec2ForCTC
.pretrained()
.setInputCols("audio_assembler")
.setOutputCol("text")
// Step 3: Define the pipeline with audio assembler and speech-to-text model
val pipeline: Pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
// Step 4: Load raw audio floats from a CSV file
val bufferedSource =
scala.io.Source.fromFile("src/test/resources/audio/csv/audio_floats.csv")
// Step 5: Extract raw audio floats from CSV and convert to an array of floats
val rawFloats = bufferedSource
.getLines()
.map(_.split(",").head.trim.toFloat)
.toArray
bufferedSource.close
// Step 6: Create a DataFrame with raw audio content (as floats)
val processedAudioFloats = Seq(rawFloats).toDF("audio_content")
// Step 7: Fit the pipeline and transform the audio data
val result = pipeline.fit(processedAudioFloats).transform(processedAudioFloats)
// Step 8: Display the transcribed text from the audio
result.select("text.result").show(truncate = false)
+------------------------------------------------------------------------------------------+
|result |
+------------------------------------------------------------------------------------------+
|[MISTER QUILTER IS THE APOSTLE OF THE MIDLE CLASES AND WE ARE GLAD TO WELCOME HIS GOSPEL ]|
+------------------------------------------------------------------------------------------+
Try Real-Time Demos!
If you want to see the outputs of ASR models in real time, visit our interactive demos:
- Wav2Vec2ForCTC – Try this powerful model for real-time speech-to-text from raw audio.
- WhisperForCTC – Test speech recognition in multiple languages and noisy environments (see the sketch after this list for using it in a pipeline).
- HubertForCTC – Experience quick and accurate voice command recognition.
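The annotators behind these demos are drop-in replacements for the Wav2Vec2 stage in the pipeline shown earlier. A minimal sketch with WhisperForCTC, assuming the default checkpoint that pretrained() downloads:
from sparknlp.annotator import WhisperForCTC
# Swap the speech-to-text stage; the rest of the pipeline is unchanged
speechToText = WhisperForCTC.pretrained() \
    .setInputCols(["audio_assembler"]) \
    .setOutputCol("text")
pipeline = Pipeline().setStages([audioAssembler, speechToText])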
Useful Resources
Want to dive deeper into Automatic Speech Recognition with Spark NLP? Here are some curated resources to help you get started and explore further:
Articles and Guides
- Converting Speech to Text with Spark NLP and Python
- Simplify Your Speech Recognition Workflow with SparkNLP
- Vision Transformers and Automatic Speech Recognition in Spark NLP
Notebooks