skillZsskillZsskillZs
HAND-TAGGED >>> 991 SKILLS LIVE <<<* OPEN SOURCE *NO LOGIN, NO TRACKING FRESH DROPS WEEKLY HAND-TAGGED >>> 991 SKILLS LIVE <<<* OPEN SOURCE *NO LOGIN, NO TRACKING FRESH DROPS WEEKLY HAND-TAGGED >>> 991 SKILLS LIVE <<<* OPEN SOURCE *NO LOGIN, NO TRACKING FRESH DROPS WEEKLY HAND-TAGGED >>> 991 SKILLS LIVE <<<* OPEN SOURCE *NO LOGIN, NO TRACKING FRESH DROPS WEEKLY HAND-TAGGED >>> 991 SKILLS LIVE <<<* OPEN SOURCE *NO LOGIN, NO TRACKING FRESH DROPS WEEKLY HAND-TAGGED >>> 991 SKILLS LIVE <<<* OPEN SOURCE *NO LOGIN, NO TRACKING FRESH DROPS WEEKLY
← back to zine
voice-agentsSKILL #ENTS
Coding

voice-agents

Voice agents represent the frontier of AI interaction - humans speaking naturally with AI systems. The challenge isn't just speech recognition and synthesis, it's achieving natural conversation flow with sub-800ms latency while handling interruptions, background noise, and emotional nuance. This skill covers two architectures: speech-to-speech (OpenAI Realtime API, lowest latency, most natural) and pipeline (STT→LLM→TTS, more control, easier to debug). Key insight: latency is the constraint. Hu

↗ github · ★ 27k·src: davila7/claude-code-templates

the manual

Voice Agents

You are a voice AI architect who has shipped production voice agents handling millions of calls. You understand the physics of latency - every component adds milliseconds, and the sum determines whether conversations feel natural or awkward.

Your core insight: Two architectures exist. Speech-to-speech (S2S) models like OpenAI Realtime API preserve emotion and achieve lowest latency but are less controllable. Pipeline architectures (STT→LLM→TTS) give you control at each step but add latency. Mos

Capabilities

  • voice-agents
  • speech-to-speech
  • speech-to-text
  • text-to-speech
  • conversational-ai
  • voice-activity-detection
  • turn-taking
  • barge-in-detection
  • voice-interfaces

Patterns

Speech-to-Speech Architecture

Direct audio-to-audio processing for lowest latency

Pipeline Architecture

Separate STT → LLM → TTS for maximum control

Voice Activity Detection Pattern

Detect when user starts/stops speaking

Anti-Patterns

❌ Ignoring Latency Budget

❌ Silence-Only Turn Detection

❌ Long Responses

⚠️ Sharp Edges

IssueSeveritySolution
Issuecritical# Measure and budget latency for each component:
Issuehigh# Target jitter metrics:
Issuehigh# Use semantic VAD:
Issuehigh# Implement barge-in detection:
Issuemedium# Constrain response length in prompts:
Issuemedium# Prompt for spoken format:
Issuemedium# Implement noise handling:
Issuemedium# Mitigate STT errors:

Related Skills

Works well with: agent-tool-builder, multi-agent-orchestration, llm-architect, backend

more coding

Request code reviews to catch issues early
Coding
NEWHOT
Request code reviews to catch issues early
requesting-code-review
1@ 1 181k
Debug systematically to save time
Coding
NEWHOT
Debug systematically to save time
systematic-debugging
0@ 0 181k
Verify feedback before you implement changes
Coding
NEWHOT
Verify feedback before you implement changes
receiving-code-review
0@ 0 181k
Write tests first, code with confidence
Coding
NEWHOT
Write tests first, code with confidence
test-driven-development
0@ 0 181k
Execute plans flawlessly and efficiently
Coding
NEWHOT
Execute plans flawlessly and efficiently
executing-plans
0@ 0 181k
Create and optimize skills effortlessly
Coding
NEWHOT
Create and optimize skills effortlessly
skill-creator
0@ 0 129k
Transform messy data into clean spreadsheets
Coding
NEWHOT
Transform messy data into clean spreadsheets
xlsx
0@ 0 129k
Build powerful MCP servers fast
Coding
NEWHOT
Build powerful MCP servers fast
mcp-builder
0@ 0 129k