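Building an AI Phone Receptionist with ElevenLabs Voice Cloning
Most AI phone receptionists sound robotic because they use generic TTS voices. ElevenLabs instant voice cloning can clone a real human voice in 30 seconds, which changes that completely.
This tutorial shows how to combine a cloned voice with Twilio inbound calls and VAPI to build an AI receptionist that sounds like a real person. Full technical architecture: ElevenLabs voice cloning → VAPI conversation engine → Twilio call routing.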
How to Set Up ElevenLabs Voice Cloning for AI Phone Receptionists - DEV Community
Originally published at callstack.tech
Most AI receptionists sound robotic because they use generic TTS voices. ElevenLabs instant voice cloning fixes this: clone a real voice in 30 seconds, then route Twilio inbound calls through VAPI with that cloned voice as your assistant's voice. The result is that callers hear a consistent, professional receptionist instead of a synthesized bot. Setup needs an ElevenLabs API key, a voice ID, a VAPI assistant config, and a Twilio webhook, and takes under 10 minutes.
You need active accounts with three services: ElevenLabs (voice cloning), Twilio (phone infrastructure), and VAPI (orchestration). Generate an API key in each dashboard and store the keys in a .env file; never hardcode them. ElevenLabs requires a paid tier (Starter or higher) for voice cloning; the free tier does not include instant voice cloning.
You also need Node.js 16+ with npm or yarn, a machine with at least 512MB of free RAM for session management, and an HTTPS endpoint (ngrok or a production domain) for webhook callbacks, because Twilio and VAPI reject plain HTTP.
For stable voice output, provide 1-2 minutes of reference audio in WAV or MP3 format (16kHz mono, free of noise). Background noise significantly degrades cloning quality.
Credentials to Gather
ElevenLabs API key and Voice ID (generated after cloning)
Twilio Account SID, Auth Token, and phone number
VAPI API key and assistant configuration access
Step-by-Step Tutorial
Configuration & Setup
Voice cloning breaks when you skip the recording quality check. ElevenLabs needs noise-free audio samples (minimum 1 minute, ideally 5-10 minutes) recorded at 44.1kHz or higher. Background hum, keyboard clicks, or mouth sounds degrade the clone's stability and make your AI receptionist sound robotic.
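A header-level check catches the easy mistakes (wrong sample rate, too-short clip) before you upload. The sketch below parses a PCM WAV file's `fmt ` and `data` chunks with Node's `Buffer` and flags samples below a minimum duration and sample rate. `inspectWav` and `checkSample` are hypothetical helpers, and the default thresholds (60 s, 44.1 kHz) are taken from the paragraph above; since this article also mentions 16kHz elsewhere, they are parameters. It cannot detect background hum or clicks, so a listen-through is still needed.

```javascript
// Hypothetical pre-upload check for a PCM WAV reference sample.
function inspectWav(buf) {
  if (buf.toString('ascii', 0, 4) !== 'RIFF' || buf.toString('ascii', 8, 12) !== 'WAVE') {
    throw new Error('Not a RIFF/WAVE file');
  }
  let offset = 12;
  let fmt = null;
  let dataBytes = null;
  // Walk the chunk list: 4-byte id, 4-byte little-endian size, then the payload
  while (offset + 8 <= buf.length) {
    const id = buf.toString('ascii', offset, offset + 4);
    const size = buf.readUInt32LE(offset + 4);
    if (id === 'fmt ') {
      fmt = {
        channels: buf.readUInt16LE(offset + 10),
        sampleRate: buf.readUInt32LE(offset + 12),
        byteRate: buf.readUInt32LE(offset + 16),
      };
    } else if (id === 'data') {
      dataBytes = size;
    }
    offset += 8 + size + (size % 2); // chunks are padded to an even length
  }
  if (!fmt || dataBytes === null) throw new Error('Missing fmt or data chunk');
  return { ...fmt, durationSec: dataBytes / fmt.byteRate };
}

// Returns a list of problems; an empty list means the header looks acceptable
function checkSample(info, { minSeconds = 60, minSampleRate = 44100 } = {}) {
  const problems = [];
  if (info.durationSec < minSeconds) {
    problems.push(`too short: ${info.durationSec.toFixed(1)}s < ${minSeconds}s`);
  }
  if (info.sampleRate < minSampleRate) {
    problems.push(`sample rate ${info.sampleRate} Hz < ${minSampleRate} Hz`);
  }
  return problems;
}

// Usage: checkSample(inspectWav(require('fs').readFileSync('sample.wav')))
```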
Critical environment variables:
# .env - Production secrets
VAPI_API_KEY=your_vapi_private_key
ELEVENLABS_API_KEY=your_elevenlabs_api_key
TWILIO_ACCOUNT_SID=your_twilio_sid
TWILIO_AUTH_TOKEN=your_twilio_token
TWILIO_PHONE_NUMBER=+1234567890
WEBHOOK_SECRET=generate_random_32_char_string
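Because a missing key usually surfaces only as a confusing 401 mid-call, it can help to fail fast at startup. The sketch below checks that every variable from the .env file is set, that `TWILIO_PHONE_NUMBER` is in E.164 form, and shows one way to produce the 32-character `WEBHOOK_SECRET` (16 random bytes as hex). `validateEnv` and `generateWebhookSecret` are hypothetical helpers, not part of any library.

```javascript
// Hypothetical startup check: fail fast if a credential from .env is missing.
const crypto = require('crypto');

const REQUIRED_ENV = [
  'VAPI_API_KEY', 'ELEVENLABS_API_KEY', 'TWILIO_ACCOUNT_SID',
  'TWILIO_AUTH_TOKEN', 'TWILIO_PHONE_NUMBER', 'WEBHOOK_SECRET',
];

function validateEnv(env) {
  const errors = REQUIRED_ENV.filter((name) => !env[name]).map((name) => `${name} is not set`);
  // E.164: "+", a non-zero leading digit, at most 15 digits in total
  if (env.TWILIO_PHONE_NUMBER && !/^\+[1-9]\d{1,14}$/.test(env.TWILIO_PHONE_NUMBER)) {
    errors.push('TWILIO_PHONE_NUMBER must be in E.164 format, e.g. +1234567890');
  }
  if (env.WEBHOOK_SECRET && env.WEBHOOK_SECRET.length < 32) {
    errors.push('WEBHOOK_SECRET should be at least 32 characters');
  }
  return errors;
}

// 16 random bytes encode to exactly 32 hex characters
function generateWebhookSecret() {
  return crypto.randomBytes(16).toString('hex');
}

// Usage (after require('dotenv').config()):
// const errors = validateEnv(process.env);
// if (errors.length) { console.error(errors.join('\n')); process.exit(1); }
```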
Install dependencies for the webhook server and API calls (node-fetch is pinned to v2 because the server code loads it with require(), and v3 is ESM-only):
npm install express body-parser dotenv node-fetch@2
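Node 18 and later ship a global `fetch`, so node-fetch is only strictly needed on Node 16. A small sketch of a fallback that prefers the built-in and loads node-fetch v2 otherwise; `resolveFetch` is a hypothetical helper name.

```javascript
// Prefer Node's built-in fetch (Node 18+); fall back to node-fetch v2 on Node 16.
// node-fetch v3 is ESM-only, so require() would fail with it.
function resolveFetch() {
  if (typeof globalThis.fetch === 'function') {
    return globalThis.fetch.bind(globalThis);
  }
  return require('node-fetch'); // needs node-fetch@2 installed
}

const fetchFn = resolveFetch();
```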
graph LR
A[Caller] -->|Dials Number| B[Twilio]
B -->|Webhook POST| C[Your Server]
C -->|Create Assistant| D[VAPI]
D -->|Voice Config| E[ElevenLabs API]
E -->|Cloned Voice Audio| D
The flow separates responsibilities: Twilio handles telephony, VAPI manages conversation state, and ElevenLabs synthesizes the cloned voice. Your server bridges them via webhooks. Do NOT both configure VAPI to call ElevenLabs directly and build your own server-side synthesis: doing both produces double audio, with the bot talking over itself.
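Since your server is the bridge that receives Twilio's webhook POSTs, it is worth verifying they actually come from Twilio. Twilio signs each request in the `X-Twilio-Signature` header: a base64-encoded HMAC-SHA1, keyed with your auth token, over the full request URL followed by each POST parameter name and value, sorted by name. The sketch below uses only Node's `crypto` module. It assumes `url` is exactly the public URL Twilio called (including the ngrok host) and that the body has no repeated parameter names. The official `twilio` npm package also provides `twilio.validateRequest` if you'd rather not hand-roll this.

```javascript
// Sketch of Twilio webhook signature validation with Node's crypto module.
const crypto = require('crypto');

function computeTwilioSignature(authToken, url, params) {
  // URL, then each parameter name immediately followed by its value, sorted by name
  const data = Object.keys(params)
    .sort()
    .reduce((acc, key) => acc + key + params[key], url);
  return crypto.createHmac('sha1', authToken).update(data, 'utf8').digest('base64');
}

function isValidTwilioRequest(authToken, signatureHeader, url, params) {
  const expected = Buffer.from(computeTwilioSignature(authToken, url, params));
  const received = Buffer.from(signatureHeader || '');
  // timingSafeEqual throws on length mismatch, so compare lengths first
  return expected.length === received.length && crypto.timingSafeEqual(expected, received);
}

// Usage inside an Express handler (body parsed by bodyParser.urlencoded):
// const url = `https://${req.headers.host}${req.originalUrl}`;
// if (!isValidTwilioRequest(process.env.TWILIO_AUTH_TOKEN, req.get('X-Twilio-Signature'), url, req.body)) {
//   return res.status(403).send('Invalid signature');
// }
```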
Step-by-Step Implementation
Step 1: Clone the target voice in ElevenLabs
Record clean audio samples (no background noise, consistent tone). In the ElevenLabs dashboard, go to Voice Lab → Add Instant Voice Clone and upload them. Note the resulting voice_id; you'll need it for the VAPI configuration.
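If you'd rather script the clone than click through the dashboard, ElevenLabs exposes an add-voice endpoint. The sketch below assumes Node 18+ (global `fetch`, `FormData`, `Blob`) and the `POST https://api.elevenlabs.io/v1/voices/add` endpoint with an `xi-api-key` header, a `name` field, and repeated `files` fields, returning JSON with a `voice_id`; check the current ElevenLabs API reference before relying on it. `buildCloneRequest` and `cloneVoice` are hypothetical helpers.

```javascript
// Sketch: create an Instant Voice Clone via the ElevenLabs REST API (Node 18+).
const fs = require('fs');
const path = require('path');

function buildCloneRequest(apiKey, name, samples) {
  const form = new FormData();
  form.append('name', name);
  for (const { filename, data, mimeType } of samples) {
    // Each reference recording goes into a repeated "files" field
    form.append('files', new Blob([data], { type: mimeType }), filename);
  }
  return {
    url: 'https://api.elevenlabs.io/v1/voices/add',
    // No manual Content-Type: fetch sets the multipart boundary itself
    init: { method: 'POST', headers: { 'xi-api-key': apiKey }, body: form },
  };
}

async function cloneVoice(apiKey, name, filePaths) {
  const samples = filePaths.map((p) => ({
    filename: path.basename(p),
    data: fs.readFileSync(p),
    mimeType: p.endsWith('.wav') ? 'audio/wav' : 'audio/mpeg',
  }));
  const { url, init } = buildCloneRequest(apiKey, name, samples);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`ElevenLabs returned ${res.status}: ${await res.text()}`);
  const { voice_id } = await res.json();
  return voice_id; // use as voiceId in the VAPI assistant config
}

// Usage: cloneVoice(process.env.ELEVENLABS_API_KEY, 'Acme Receptionist', ['sample1.wav'])
```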
Step 2: Configure VAPI assistant with cloned voice
// assistantConfig.js - VAPI assistant with ElevenLabs voice
const assistantConfig = {
  model: {
    provider: "openai", // Assumption: any VAPI-supported LLM works here
    model: "gpt-4o",
    messages: [{ role: "system", content: "You are a professional receptionist for Acme Corp. Greet callers warmly, ask how you can help, and route calls appropriately." }]
  },
  voice: {
    provider: "11labs", // VAPI's provider id for ElevenLabs
    voiceId: "your_cloned_voice_id_here", // From ElevenLabs Voice Lab
    stability: 0.75, // Higher = more consistent, lower = more expressive
    similarityBoost: 0.85, // Higher = closer to original voice
    model: "eleven_turbo_v2" // Low-latency model suited to phone calls
  },
  transcriber: {
    provider: "deepgram",
    model: "nova-2-phonecall" // Deepgram model tuned for phone audio
  },
  firstMessage: "Thank you for calling Acme Corp. How may I assist you today?"
};
module.exports = assistantConfig;
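Before registering the assistant, a quick sanity check on the voice settings avoids shipping the placeholder voice ID or an out-of-range `stability`/`similarityBoost` (both are 0-1 sliders in ElevenLabs). The sketch assumes the ElevenLabs settings live under a `voice` key, as in VAPI's assistant schema, that VAPI's create endpoint is `POST https://api.vapi.ai/assistant` with Bearer auth, and that Node 18+ provides a global `fetch`. `validateVoiceSettings` and `createAssistant` are hypothetical helpers; confirm the endpoint against VAPI's API reference.

```javascript
// Sketch: validate the voice block, then register the assistant with VAPI.
function validateVoiceSettings(voice) {
  const errors = [];
  if (!voice || !voice.voiceId || voice.voiceId === 'your_cloned_voice_id_here') {
    errors.push('voice.voiceId must be the voice_id from your ElevenLabs clone');
  }
  for (const key of ['stability', 'similarityBoost']) {
    const v = voice && voice[key];
    // Only check values that are present; both must fall in [0, 1]
    if (v !== undefined && (typeof v !== 'number' || v < 0 || v > 1)) {
      errors.push(`voice.${key} must be a number between 0 and 1`);
    }
  }
  return errors;
}

async function createAssistant(config, apiKey) {
  const errors = validateVoiceSettings(config.voice);
  if (errors.length) throw new Error(errors.join('; '));
  const res = await fetch('https://api.vapi.ai/assistant', {
    method: 'POST',
    headers: { Authorization: `Bearer ${apiKey}`, 'Content-Type': 'application/json' },
    body: JSON.stringify(config),
  });
  if (!res.ok) throw new Error(`VAPI returned ${res.status}: ${await res.text()}`);
  return res.json(); // the created assistant object, including its id
}

// Usage: createAssistant(require('./assistantConfig'), process.env.VAPI_API_KEY)
```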
Step 3: Set up webhook server for Twilio integration
// server.js - Express webhook handler
const express = require('express');
const bodyParser = require('body-parser');
const fetch = require('node-fetch'); // node-fetch v2; v3 is ESM-only and cannot be require()d
require('dotenv').config();
const app = express();
app.use(bodyParser.json());
app.use(bodyParser.urlencoded({ extended: true }));