LLM4S Speech Module
LLM4S Speech Module
A comprehensive speech recognition and text-to-speech synthesis module for the LLM4S project, built with Scala and functional programming principles.
๐ฏ Features
Speech Recognition (STT)
- Vosk: Lightweight, offline speech recognition engine (replaces Sphinx4)
- Whisper: High-accuracy transcription via CLI integration
- Audio Preprocessing: Resampling, channel conversion, silence trimming
- Multiple Input Formats: File, bytes, and stream audio support
Text-to-Speech (TTS)
- Tacotron2: Neural speech synthesis via CLI integration
- Voice Customization: Language, speaking rate, pitch, volume control
- Output Formats: WAV and raw PCM16 audio support
- Cross-platform: Works on Windows, Linux, and macOS
๐๏ธ Architecture
The module follows functional programming principles with:
- Result Types:
Either[LLMError, T]
for error handling - Pure Functions: Immutable audio transformations
- ADTs: Algebraic Data Types for type-safe modeling
- Composition: Functional composition for audio processing pipelines
๐ File Structure
src/main/scala/org/llm4s/speech/
โโโ Audio.scala # Core audio data structures
โโโ stt/ # Speech-to-Text implementations
โ โโโ SpeechToText.scala # STT trait interface
โ โโโ VoskSpeechToText.scala # Vosk integration (replaces Sphinx4)
โ โโโ WhisperSpeechToText.scala # Whisper CLI integration
โโโ tts/ # Text-to-Speech implementations
โ โโโ TextToSpeech.scala # TTS trait interface
โ โโโ Tacotron2TextToSpeech.scala # Tacotron2 CLI integration
โโโ processing/ # Audio preprocessing utilities
โ โโโ AudioPreprocessing.scala # Audio transformation functions
โโโ io/ # Audio I/O operations
โ โโโ AudioIO.scala # File saving utilities
โโโ util/ # Cross-platform utilities
โโโ PlatformCommands.scala # OS-agnostic command helpers
๐ Quick Start
Basic Usage
import org.llm4s.speech._
import org.llm4s.speech.stt.{WhisperSpeechToText, STTOptions}
import org.llm4s.speech.tts.{Tacotron2TextToSpeech, TTSOptions}
import org.llm4s.speech.util.PlatformCommands
// Speech Recognition
val stt = new WhisperSpeechToText(PlatformCommands.mockSuccess)
val audioInput = AudioInput.FileAudio(Paths.get("audio.wav"))
val options = STTOptions(language = Some("en"))
val result = stt.transcribe(audioInput, options)
// Text-to-Speech
val tts = new Tacotron2TextToSpeech(PlatformCommands.echo)
val ttsOptions = TTSOptions(
voice = Some("en-female"),
language = Some("en"),
speakingRate = Some(1.2)
)
val audio = tts.synthesize("Hello, world!", ttsOptions)
Audio Preprocessing
import org.llm4s.speech.processing.AudioPreprocessing
val audioBytes = // ... your audio data
val audioMeta = AudioMeta(sampleRate = 44100, numChannels = 2, bitDepth = 16)
// Convert to mono, resample to 16kHz for STT
val processed = AudioPreprocessing.standardizeForSTT(
audioBytes,
audioMeta,
targetRate = 16000
)
๐ง Configuration
Vosk Configuration
import org.llm4s.speech.stt.VoskSpeechToText
// Use default English model
val stt = new VoskSpeechToText()
// Use custom model path
val stt = new VoskSpeechToText(modelPath = Some("/path/to/vosk-model"))
Environment Variables
# Vosk Model Path (optional)
VOSK_MODEL_PATH=/path/to/vosk-model
๐งช Testing
The module includes comprehensive tests that work cross-platform:
# Run all tests
sbt test
# Run specific test suites
sbt "testOnly org.llm4s.speech.*"
# Cross-compile and test
sbt +test
Test Features
- Cross-platform Commands: Automatically detects OS and uses appropriate commands
- Mock Commands: Safe testing without external dependencies
- Audio Processing Tests: Validates audio transformation functions
- Integration Tests: End-to-end STT/TTS workflow testing
๐ Cross-Platform Support
The PlatformCommands
utility automatically provides the right commands:
Platform | Echo | File Reader | Directory Listing |
---|---|---|---|
Windows | cmd /c echo |
cmd /c type |
cmd /c dir |
POSIX | echo |
cat |
ls |
๐ฆ Dependencies
// Vosk for lightweight STT (replaces Sphinx4)
"net.java.dev.jna" % "jna" % "5.13.0"
"com.alphacephei" % "vosk" % "0.3.45"
// Core dependencies
"org.typelevel" %% "cats-core" % "2.9.0"
"com.lihaoyi" %% "upickle" % "3.1.0"
"ch.qos.logback" % "logback-classic" % "1.4.7"
๐ External Tools
Whisper
- Installation:
pip install openai-whisper
- Usage: The module integrates with Whisper CLI for transcription
- Models: Supports various model sizes (tiny, base, small, medium, large)
Tacotron2
- Installation: Requires Tacotron2 CLI tool
- Usage: The module integrates with Tacotron2 CLI for synthesis
- Features: Voice customization, language support, audio output
๐ Examples
See samples/src/main/scala/org/llm4s/samples/SpeechSamples.scala
for complete usage examples.
๐จ Error Handling
The module uses Result[T]
(alias for Either[LLMError, T]
) for robust error handling:
val result: Result[Transcription] = stt.transcribe(audioInput, options)
result match {
case Right(transcription) =>
println(s"Transcript: ${transcription.text}")
case Left(error) =>
println(s"Error: ${error.formatted}")
}
๐ฎ Future Enhancements
- Real-time audio streaming support
- Additional STT backends (Google Speech, Azure Speech)
- Additional TTS backends (Festival, eSpeak)
- Audio format conversion utilities
- Voice cloning capabilities
- Multi-language model support
๐ค Contributing
- Follow Scala functional programming principles
- Use
Result
types for error handling - Write comprehensive tests
- Ensure cross-platform compatibility
- Update documentation for new features
๐ License
Part of the LLM4S project - see main project license for details.