November 5, 2025

Introducing Continue-TTS

High-quality text-to-speech model based on Continue-1-OSS, featuring 8 unique voices, emotional expression, and real-time generation. Fully open source.


Today, SVECTOR announces Continue-TTS, a fine-tuned text-to-speech model based on the Continue-1-OSS architecture. This model is specifically trained for high-quality speech synthesis and delivers exceptional voice generation capabilities.

Continue-TTS provides natural speech with human-like intonation, emotion, and rhythm. With 8 unique voices, real-time generation capabilities (~200ms latency), and built-in emotional expression support, this model rivals commercial solutions while remaining fully open source.

Released under the Apache 2.0 license, Continue-TTS is freely available for both research and commercial applications, bringing professional-grade speech synthesis to everyone.

Model Specifications

Continue-TTS combines the Continue-1-OSS language model with the SNAC neural audio codec to generate exceptionally natural speech from text. The model features 8 professionally designed voices with distinct personalities and characteristics.

Model Architecture

  • Base Model: Continue-1-OSS
  • Type: Text-to-Speech (TTS)
  • Parameters: 3 Billion
  • Audio Codec: SNAC (24kHz)
  • License: Apache 2.0

Performance

  • Latency: ~200ms (GPU streaming)
  • Sample Rate: 24kHz
  • Memory: ~7GB GPU (FP16)
  • Voices: 8 unique voices
  • Vocabulary: 156,940 tokens
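
The ~7GB FP16 figure is roughly consistent with simple arithmetic: 3 billion parameters at 2 bytes each come to about 6GB of weights, with the remainder covering the audio codec and runtime overhead.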

Available Voices

Continue-TTS includes 8 professionally designed voices with unique personalities:

Nova (Female)

Conversational and natural, perfect for general use

Aurora (Female)

Warm and friendly, excellent for storytelling

Stellar (Female)

Energetic and bright, great for upbeat content

Atlas (Male)

Deep and authoritative, ideal for narration

Orion (Male)

Friendly and casual, perfect for conversational content

Luna (Female)

Soft and gentle, excellent for calm narration

Phoenix (Male)

Dynamic and expressive, great for engaging content

Ember (Female)

Warm and engaging, perfect for emotional expression

Key Features

Natural Speech Synthesis

Continue-TTS generates human-like speech with natural intonation, emotion, and rhythm. The model produces high-quality audio that rivals commercial text-to-speech solutions while remaining fully open source and accessible.

Each voice is carefully trained to maintain consistent personality and characteristics across different types of content, from conversational dialogue to professional narration.

Emotional Expression

Built-in support for natural emotions adds depth and authenticity to generated speech. Continue-TTS understands emotion tags and seamlessly integrates them into speech output:

  • <laugh> - Natural laughter
  • <chuckle> - Light laugh
  • <sigh> - Expressive sigh
  • <gasp> - Surprised gasp
  • <cough>, <yawn>, <groan>, <sniffle> - Additional natural sounds
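
Tags are written inline with the text to be spoken. A minimal sketch using the same generate_speech call shown under Easy Integration below:

from continue_tts import Continue1Model

model = Continue1Model(
  model_name="SVECTOR-CORPORATION/Continue-TTS"
)

# Emotion tags are embedded directly in the prompt text.
audio = model.generate_speech(
  prompt="That actually worked! <laugh> Okay... <sigh> back to testing.",
  voice="phoenix"
)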

Real-Time Generation

With streaming support and low latency (~200ms on GPU), Continue-TTS enables interactive applications like voice assistants, real-time narration, and conversational AI. The model generates audio chunks progressively, allowing immediate playback without waiting for complete generation.
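
As an illustration, a streaming loop might look like the sketch below. The stream_speech method name and the chunk format are assumptions made here for illustration; consult the package documentation for the actual streaming interface.

import numpy as np

from continue_tts import Continue1Model

model = Continue1Model(
  model_name="SVECTOR-CORPORATION/Continue-TTS"
)

chunks = []
# stream_speech is a hypothetical method assumed to yield audio chunks as they
# are generated; the real API may expose streaming differently.
for chunk in model.stream_speech(
  prompt="Live narration starts now.",
  voice="atlas"
):
  chunks.append(chunk)  # in an interactive app, hand each chunk to the audio device here

audio = np.concatenate(chunks)  # full 24kHz waveform once the stream completes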

Easy Integration

Continue-TTS ships as a simple Python package for quick integration into applications:

pip install continue-speech

from continue_tts import Continue1Model

# Load the Continue-TTS model
model = Continue1Model(
  model_name="SVECTOR-CORPORATION/Continue-TTS"
)

# Generate speech with one of the 8 built-in voices
audio = model.generate_speech(
  prompt="Hello from Continue-TTS!",
  voice="nova"
)
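
The return type of generate_speech is not shown above. Assuming it is a one-dimensional float waveform, it can be written to disk with a standard audio library such as soundfile, using the 24kHz sample rate from the specifications:

import soundfile as sf  # third-party: pip install soundfile

# Assumes `audio` is a 1-D float array; if the package returns bytes or a file
# path instead, adapt accordingly.
sf.write("hello.wav", audio, samplerate=24000)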

Use Cases

Continue-TTS excels across diverse applications requiring natural speech synthesis:

Audiobook Narration

Natural storytelling with emotional expression, perfect for creating engaging audiobooks with professional narration quality and character voices.

Virtual Assistants

Conversational AI with personality for voice-enabled applications, customer service bots, and interactive digital assistants.

Accessibility

Text-to-speech for visually impaired users, screen readers, and accessibility tools requiring natural, easy-to-understand speech.

Content Creation

Voiceovers for videos, podcasts, presentations, and multimedia content with multiple voice options for diverse characters.

Gaming

Dynamic character voices and dialogue generation for interactive gaming experiences with emotional and contextual speech.

Education

Interactive learning materials with voice, language learning applications, and automated tutoring systems with clear pronunciation.

Customer Service

Natural-sounding automated responses for phone systems, chatbots, and support applications requiring professional voice interaction.

Language Learning

Clear pronunciation models for language education, vocabulary training, and conversational practice with natural intonation.

Model Architecture

Continue-TTS combines the Continue-1-OSS language model with the SNAC multi-scale neural audio codec. The model generates audio tokens autoregressively, which are then decoded into waveforms using the neural codec for high-quality 24kHz audio output.
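
Conceptually, generation is a two-stage pipeline. The sketch below uses stand-in stub functions purely to show the structure; it is not the real model or codec code.

# Structural sketch only: both functions are placeholders, not real API calls.

def generate_audio_tokens(text: str) -> list[int]:
  # Stage 1: the Continue-1-OSS backbone autoregressively emits audio tokens,
  # 7 per frame, drawn from the audio-specific part of the vocabulary.
  return [0] * 7  # placeholder: one all-zero frame

def decode_with_snac(tokens: list[int]) -> list[float]:
  # Stage 2: the SNAC codec decodes the token frames into a 24kHz waveform.
  return [0.0] * 24000  # placeholder: one second of silence

waveform = decode_with_snac(generate_audio_tokens("Hello from Continue-TTS!"))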

Training Process

Continue-TTS was fine-tuned from Continue-1-OSS using:

  • High-Quality Datasets: Diverse speech datasets covering multiple accents, speaking styles, and emotional expressions
  • Multi-Speaker Training: Recordings from multiple speakers to create distinct voice personalities
  • Emotional Data: Specialized training on emotional speech for authentic expression
  • Multi-Stage Process: Base model pretraining followed by voice-specific fine-tuning

Technical Details

The model uses 7 audio tokens per frame in a hierarchical encoding structure, with a total vocabulary of 156,940 tokens (including 28,672 audio-specific tokens). This architecture enables efficient, high-quality speech generation with natural prosody and emotional expressiveness.
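
A quick back-of-the-envelope view of those figures (illustrative arithmetic only):

# Vocabulary split and per-frame token count, taken from the numbers above.
total_vocab = 156_940
audio_vocab = 28_672
text_vocab = total_vocab - audio_vocab     # 128,268 non-audio tokens

tokens_per_frame = 7                       # hierarchical SNAC encoding
frames = 100
tokens_needed = frames * tokens_per_frame  # 700 audio tokens for 100 frames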

Ethical Considerations & Responsible Use

SVECTOR is committed to responsible AI development. Continue-TTS users should follow ethical guidelines:

Transparency & Disclosure

Always disclose when audio is AI-generated. Users should clearly indicate that content is synthesized and not recorded from real human voices to maintain transparency and trust.

Consent & Voice Cloning

Do not clone voices without explicit permission from the individual. Voice impersonation without consent is unethical and may be illegal in many jurisdictions.

Misinformation Prevention

Implement safeguards against deepfakes, misinformation, and deceptive content. Verify important audio content and use authentication mechanisms where appropriate.

Responsible Content

Avoid generating harmful, deceptive, or illegal content. Users are responsible for ensuring their applications comply with applicable laws and regulations.

Limitations

As with any text-to-speech model, Continue-TTS has certain limitations:

  • Pronunciation: May struggle with unusual names, technical terms, non-English words, or specialized vocabulary without phonetic guidance.
  • Consistency: Long-form generation may have minor quality variations. Consider breaking very long texts into segments (see the segmentation sketch after this list).
  • Accents: Primarily trained on specific accent patterns. Performance may vary for regional or non-native accents.
  • Compute Requirements: Requires GPU for real-time generation. CPU inference is significantly slower and not suitable for interactive applications.
  • Language Support: Currently optimized for English. Performance in other languages is limited and not officially supported.
  • Context Understanding: While emotion tags work well, complex contextual emotional cues may not always be interpreted accurately.
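
For long-form input, a simple pre-processing pass that splits text on sentence boundaries and groups sentences into bounded segments keeps each generation short. A minimal sketch; the 400-character threshold is an arbitrary choice:

import re

def split_into_segments(text: str, max_chars: int = 400) -> list[str]:
  """Group sentences into segments of roughly max_chars characters at most."""
  sentences = re.split(r"(?<=[.!?])\s+", text.strip())
  segments, current = [], ""
  for sentence in sentences:
    if current and len(current) + len(sentence) + 1 > max_chars:
      segments.append(current)
      current = sentence
    else:
      current = f"{current} {sentence}".strip()
  if current:
    segments.append(current)
  return segments

# Each segment can then be passed to generate_speech separately and the
# resulting audio concatenated in order.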
