LLM4S Speech Module
A comprehensive speech recognition and text-to-speech synthesis module for the LLM4S project, built with Scala and functional programming principles.
Features
Speech Recognition (STT)
Vosk: Lightweight, offline speech recognition engine
Whisper: High-accuracy transcription via CLI integration
Audio Preprocessing: Resampling, channel conversion, silence trimming
Multiple Input Formats: File, bytes, and stream audio support
Text-to-Speech (TTS)
Tacotron2: Neural speech synthesis via CLI integration
Voice Customization: Language, speaking rate, pitch, volume control
Output Formats: WAV and raw PCM16 audio support
Cross-platform: Works on Windows, Linux, and macOS
Architecture
The module follows functional programming principles with:
Result Types: Either[LLMError, T] for error handling
Pure Functions: Immutable audio transformations
ADTs: Algebraic Data Types for type-safe modeling
Composition: Functional composition for audio processing pipelines
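The principles above can be illustrated with a small, self-contained sketch. Every name in it (SpeechError, InvalidAudio, Pcm, halveVolume, dropLeadingSilence) is a hypothetical stand-in chosen for illustration; the module's real LLMError hierarchy and audio types may look different.

```scala
object ArchitectureSketch {
  // Hypothetical error ADT standing in for the module's LLMError.
  sealed trait SpeechError { def message: String }
  final case class InvalidAudio(message: String) extends SpeechError

  // Result type: Left carries an error, Right carries a value.
  type Result[A] = Either[SpeechError, A]

  // Immutable audio value (hypothetical; not the module's own type).
  final case class Pcm(samples: Vector[Short], sampleRate: Int)

  // Pure transformations: each returns a new Pcm and leaves its input untouched.
  val dropLeadingSilence: Pcm => Pcm =
    pcm => pcm.copy(samples = pcm.samples.dropWhile(_ == 0))

  val halveVolume: Pcm => Pcm =
    pcm => pcm.copy(samples = pcm.samples.map(s => (s / 2).toShort))

  // Functional composition: trim first, then attenuate.
  val transform: Pcm => Pcm = dropLeadingSilence.andThen(halveVolume)

  // Validation is expressed in the return type instead of by throwing.
  def process(pcm: Pcm): Result[Pcm] =
    if (pcm.samples.isEmpty) Left(InvalidAudio("audio contains no samples"))
    else Right(transform(pcm))
}
```

Because each step is an ordinary function value, new steps can be appended with andThen without changing the existing ones.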
Quick Start
Basic Usage
import java.nio.file.Paths

import org.llm4s.speech._
import org.llm4s.speech.stt.{ WhisperSpeechToText, STTOptions }
import org.llm4s.speech.tts.{ Tacotron2TextToSpeech, TTSOptions }
import org.llm4s.speech.util.PlatformCommands

// Speech Recognition
val stt        = new WhisperSpeechToText(PlatformCommands.mockSuccess)
val audioInput = AudioInput.FileAudio(Paths.get("audio.wav"))
val options    = STTOptions(language = Some("en"))
val result     = stt.transcribe(audioInput, options)

// Text-to-Speech
val tts = new Tacotron2TextToSpeech(PlatformCommands.echo)
val ttsOptions = TTSOptions(
  voice = Some("en-female"),
  language = Some("en"),
  speakingRate = Some(1.2)
)
val audio = tts.synthesize("Hello, world!", ttsOptions)
Audio Preprocessing
import org.llm4s.speech._
import org.llm4s.speech.processing.AudioPreprocessing

val audioBytes: Array[Byte] = ??? // ... your audio data
val audioMeta = AudioMeta(sampleRate = 44100, numChannels = 2, bitDepth = 16)

// Convert to mono, resample to 16kHz for STT
val processed = AudioPreprocessing.standardizeForSTT(
  audioBytes,
  audioMeta,
  targetRate = 16000
)
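To make the channel-conversion step concrete, here is a minimal sketch of downmixing interleaved stereo PCM16 to mono by averaging each left/right pair. It is not the module's standardizeForSTT implementation: the function name stereoToMono is hypothetical, little-endian sample order is assumed, and resampling is left out entirely.

```scala
import java.nio.{ ByteBuffer, ByteOrder }

object DownmixSketch {
  /** Averages interleaved stereo PCM16 (assumed little-endian) into mono PCM16. */
  def stereoToMono(stereo: Array[Byte]): Array[Byte] = {
    // View the raw bytes as 16-bit samples.
    val shorts  = ByteBuffer.wrap(stereo).order(ByteOrder.LITTLE_ENDIAN).asShortBuffer()
    val samples = new Array[Short](shorts.remaining())
    shorts.get(samples)

    // Each frame is (left, right); an incomplete trailing frame is dropped.
    val mono = samples
      .grouped(2)
      .collect { case Array(left, right) => ((left + right) / 2).toShort }
      .toArray

    // Serialize the mono samples back to little-endian bytes.
    val out = ByteBuffer.allocate(mono.length * 2).order(ByteOrder.LITTLE_ENDIAN)
    mono.foreach(s => out.putShort(s))
    out.array()
  }
}
```

Averaging in Int before narrowing back to Short avoids overflow when both channels are near full scale.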
Configuration
Vosk Configuration
import org.llm4s.speech.stt.VoskSpeechToText

// Use default English model
val stt = new VoskSpeechToText()

// Use custom model path
val customStt = new VoskSpeechToText(modelPath = Some("/path/to/vosk-model"))
Environment Variables
# Vosk Model Path (optional)
VOSK_MODEL_PATH=/path/to/vosk-model
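One plausible way to combine the constructor argument with the environment variable is sketched below. The precedence (explicit path first, then VOSK_MODEL_PATH, then the built-in default) is an assumption made for illustration, and resolveModelPath is a hypothetical helper rather than part of the module.

```scala
import java.nio.file.{ Path, Paths }

object ModelPathSketch {
  // Treat blank strings as "not set".
  private def nonBlank(value: String): Option[String] =
    Option(value).map(_.trim).filter(_.nonEmpty)

  /**
   * Assumed precedence: explicit argument, then VOSK_MODEL_PATH.
   * None means the caller should fall back to the default model.
   * The environment is passed in as a Map so the function stays pure and testable.
   */
  def resolveModelPath(
    explicit: Option[String],
    env: Map[String, String] = sys.env
  ): Option[Path] =
    explicit
      .flatMap(nonBlank)
      .orElse(env.get("VOSK_MODEL_PATH").flatMap(nonBlank))
      .map(p => Paths.get(p))
}
```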
File Structure
src/main/scala/org/llm4s/speech/
├── Audio.scala                      # Core audio data structures
├── stt/                             # Speech-to-Text implementations
│   ├── SpeechToText.scala           # STT trait interface
│   ├── VoskSpeechToText.scala       # Vosk integration
│   └── WhisperSpeechToText.scala    # Whisper CLI integration
├── tts/                             # Text-to-Speech implementations
│   ├── TextToSpeech.scala           # TTS trait interface
│   └── Tacotron2TextToSpeech.scala  # Tacotron2 CLI integration
├── processing/                      # Audio preprocessing utilities
│   └── AudioPreprocessing.scala     # Audio transformation functions
├── io/                              # Audio I/O operations
│   └── AudioIO.scala                # File saving utilities
└── util/                            # Cross-platform utilities
    └── PlatformCommands.scala       # OS-agnostic command helpers
The PlatformCommands utility automatically provides the right commands:
| Platform | Echo        | File Reader | Directory Listing |
|----------|-------------|-------------|-------------------|
| Windows  | cmd /c echo | cmd /c type | cmd /c dir        |
| POSIX    | echo        | cat         | ls                |
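The table can be read as a simple dispatch on the operating system name. The sketch below shows one way such a dispatch might be written; PlatformSketch and its method names are hypothetical and do not reflect the actual PlatformCommands source.

```scala
import java.util.Locale

object PlatformSketch {
  // Match on the prefix: a plain contains("win") would also match "Darwin".
  def isWindows(osName: String): Boolean =
    osName.toLowerCase(Locale.ROOT).startsWith("windows")

  private def hostOs: String = System.getProperty("os.name")

  // Each helper returns the command as an argument vector.
  def echo(osName: String = hostOs): Seq[String] =
    if (isWindows(osName)) Seq("cmd", "/c", "echo") else Seq("echo")

  def fileReader(osName: String = hostOs): Seq[String] =
    if (isWindows(osName)) Seq("cmd", "/c", "type") else Seq("cat")

  def directoryListing(osName: String = hostOs): Seq[String] =
    if (isWindows(osName)) Seq("cmd", "/c", "dir") else Seq("ls")
}
```

Taking the OS name as a parameter with a default keeps the helpers testable on any host.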
Whisper
Installation: pip install openai-whisper
Usage: The module integrates with Whisper CLI for transcription
Models: Supports various model sizes (tiny, base, small, medium, large)
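As a rough illustration of what a CLI integration involves, the sketch below assembles a Whisper command line as an argument vector. The flags used (--model, --language, --output_format) are options of the openai-whisper CLI; the whisperCommand helper is hypothetical, and the arguments the module actually passes may differ.

```scala
object WhisperCommandSketch {
  /**
   * Builds the argument vector for a Whisper CLI call.
   * Omitting the language lets Whisper detect it automatically.
   */
  def whisperCommand(
    audioPath: String,
    model: String = "base",
    language: Option[String] = None
  ): Seq[String] =
    Seq("whisper", audioPath, "--model", model, "--output_format", "txt") ++
      language.toSeq.flatMap(lang => Seq("--language", lang))
}
```

Building the command as a Seq[String] instead of a single string avoids shell quoting problems with paths that contain spaces; the result can be handed to scala.sys.process.Process to run it.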
Tacotron2
Installation: Requires Tacotron2 CLI tool
Usage: The module integrates with Tacotron2 CLI for synthesis
Features: Voice customization, language support, audio output
Error Handling
The module uses Result[T] (alias for Either[LLMError, T]) for robust error handling:
val result: Result[Transcription] = stt.transcribe(audioInput, options)

result match {
  case Right(transcription) =>
    println(s"Transcript: ${transcription.text}")
  case Left(error) =>
    println(s"Error: ${error.formatted}")
}
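Because Result is an Either, several fallible steps can also be chained in a for-comprehension, which stops at the first Left. The sketch below is self-contained and uses hypothetical stand-ins (SpeechError, ProcessingError, and placeholder preprocess and transcribe functions) instead of the module's real types.

```scala
object ResultChainingSketch {
  // Hypothetical stand-ins for the module's error and result types.
  sealed trait SpeechError { def formatted: String }
  final case class ProcessingError(formatted: String) extends SpeechError

  type Result[A] = Either[SpeechError, A]

  final case class Transcription(text: String)

  // Placeholder step: rejects empty input.
  def preprocess(audio: Array[Byte]): Result[Array[Byte]] =
    if (audio.isEmpty) Left(ProcessingError("empty audio")) else Right(audio)

  // Placeholder step: a real implementation would call an STT engine.
  def transcribe(audio: Array[Byte]): Result[Transcription] =
    Right(Transcription(s"${audio.length} bytes transcribed"))

  // The first Left short-circuits the remaining steps.
  def run(audio: Array[Byte]): Result[String] =
    for {
      clean         <- preprocess(audio)
      transcription <- transcribe(clean)
    } yield transcription.text.toUpperCase

  // fold handles both outcomes in one expression.
  def describe(result: Result[String]): String =
    result.fold(error => s"Error: ${error.formatted}", text => s"Transcript: $text")
}
```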
Testing
The module includes comprehensive tests that work cross-platform:
# Run all tests
sbt test
# Run specific test suites
sbt "testOnly org.llm4s.speech.*"
# Cross-compile and test
sbt +test
See Also