Text-to-Speech integration with Fish Audio's API, supporting multiple voices, streaming, and real-time playback
An MCP (Model Context Protocol) server that provides seamless integration between Fish Audio's Text-to-Speech API and LLMs like Claude, enabling natural language-driven speech synthesis.
Fish Audio is a cutting-edge Text-to-Speech platform, and this MCP server brings its capabilities directly into your LLM workflows.
You can run this MCP server directly using npx:

```bash
npx @alanse/fish-audio-mcp-server
```
Or install it globally:

```bash
npm install -g @alanse/fish-audio-mcp-server
```
1. Get your Fish Audio API key from Fish Audio
2. Set up environment variables:

```bash
export FISH_API_KEY=your_fish_audio_api_key_here
```
Add the server to your MCP client configuration:

```json
{
  "mcpServers": {
    "fish-audio": {
      "command": "npx",
      "args": ["-y", "@alanse/fish-audio-mcp-server"],
      "env": {
        "FISH_API_KEY": "your_fish_audio_api_key_here",
        "FISH_MODEL_ID": "speech-1.6",
        "FISH_REFERENCE_ID": "your_voice_reference_id_here",
        "FISH_OUTPUT_FORMAT": "mp3",
        "FISH_STREAMING": "false",
        "FISH_LATENCY": "balanced",
        "FISH_MP3_BITRATE": "128",
        "FISH_AUTO_PLAY": "false",
        "AUDIO_OUTPUT_DIR": "~/.fish-audio-mcp/audio_output"
      }
    }
  }
}
```
To configure multiple voice references (note that the `FISH_REFERENCES` value must be a JSON string, so the inner quotes are escaped):

```json
{
  "mcpServers": {
    "fish-audio": {
      "command": "npx",
      "args": ["-y", "@alanse/fish-audio-mcp-server"],
      "env": {
        "FISH_API_KEY": "your_fish_audio_api_key_here",
        "FISH_MODEL_ID": "speech-1.6",
        "FISH_REFERENCES": "[{\"reference_id\":\"id1\",\"name\":\"Alice\",\"tags\":[\"female\",\"english\"]},{\"reference_id\":\"id2\",\"name\":\"Bob\",\"tags\":[\"male\",\"japanese\"]},{\"reference_id\":\"id3\",\"name\":\"Carol\",\"tags\":[\"female\",\"japanese\",\"anime\"]}]",
        "FISH_DEFAULT_REFERENCE": "id1",
        "FISH_OUTPUT_FORMAT": "mp3",
        "FISH_STREAMING": "false",
        "FISH_LATENCY": "balanced",
        "FISH_MP3_BITRATE": "128",
        "FISH_AUTO_PLAY": "false",
        "AUDIO_OUTPUT_DIR": "~/.fish-audio-mcp/audio_output"
      }
    }
  }
}
```
| Variable | Description | Default | Required |
|---|---|---|---|
| `FISH_API_KEY` | Your Fish Audio API key | - | Yes |
| `FISH_MODEL_ID` | TTS model to use (`s1`, `speech-1.5`, `speech-1.6`) | `s1` | Optional |
| `FISH_REFERENCE_ID` | Default voice reference ID (single reference mode) | - | Optional |
| `FISH_REFERENCES` | Multiple voice references (see below) | - | Optional |
| `FISH_DEFAULT_REFERENCE` | Default reference ID when using multiple references | - | Optional |
| `FISH_OUTPUT_FORMAT` | Default audio format (`mp3`, `wav`, `pcm`, `opus`) | `mp3` | Optional |
| `FISH_STREAMING` | Enable streaming mode (HTTP/WebSocket) | `false` | Optional |
| `FISH_LATENCY` | Latency mode (`normal`, `balanced`) | `balanced` | Optional |
| `FISH_MP3_BITRATE` | MP3 bitrate (`64`, `128`, `192`) | `128` | Optional |
| `FISH_AUTO_PLAY` | Auto-play audio and enable real-time playback | `false` | Optional |
| `AUDIO_OUTPUT_DIR` | Directory for audio file output | `~/.fish-audio-mcp/audio_output` | Optional |
You can configure multiple voice references in two ways.

Use the `FISH_REFERENCES` environment variable with a JSON array:

```bash
FISH_REFERENCES='[
  {"reference_id":"id1","name":"Alice","tags":["female","english"]},
  {"reference_id":"id2","name":"Bob","tags":["male","japanese"]},
  {"reference_id":"id3","name":"Carol","tags":["female","japanese","anime"]}
]'
FISH_DEFAULT_REFERENCE="id1"
```
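One detail worth noting: the value must be strict JSON, so keys and strings need double quotes. A small sketch (the `VoiceReference` shape is illustrative, not taken from the package) shows how the value parses, and why a single-quoted variant fails:

```typescript
// Illustrative: FISH_REFERENCES must be strict JSON — JSON.parse accepts
// double-quoted keys/strings and rejects single-quoted ones.
interface VoiceReference {
  reference_id: string;
  name: string;
  tags: string[];
}

const raw =
  '[{"reference_id":"id1","name":"Alice","tags":["female","english"]},' +
  '{"reference_id":"id2","name":"Bob","tags":["male","japanese"]}]';

const refs: VoiceReference[] = JSON.parse(raw);
console.log(refs.map((r) => r.name)); // names recovered from the env value
```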
Alternatively, use numbered environment variables:

```bash
FISH_REFERENCE_1_ID=id1
FISH_REFERENCE_1_NAME=Alice
FISH_REFERENCE_1_TAGS=female,english
FISH_REFERENCE_2_ID=id2
FISH_REFERENCE_2_NAME=Bob
FISH_REFERENCE_2_TAGS=male,japanese
```
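The numbered-variable convention can be read with a small loop. This is an illustrative sketch, not the server's actual code, and the assumption that numbering is contiguous (stopping at the first missing index) is mine:

```typescript
// Illustrative sketch: collect FISH_REFERENCE_N_* variables into a list.
interface VoiceReference {
  reference_id: string;
  name: string;
  tags: string[];
}

function loadNumberedReferences(
  env: Record<string, string | undefined>
): VoiceReference[] {
  const refs: VoiceReference[] = [];
  for (let i = 1; ; i++) {
    const id = env[`FISH_REFERENCE_${i}_ID`];
    if (!id) break; // assumption: stop at the first gap in numbering
    refs.push({
      reference_id: id,
      name: env[`FISH_REFERENCE_${i}_NAME`] ?? id, // fall back to the ID
      tags: (env[`FISH_REFERENCE_${i}_TAGS`] ?? "")
        .split(",")
        .filter((t) => t.length > 0),
    });
  }
  return refs;
}
```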
Once configured, the Fish Audio MCP server provides two tools to LLMs.

**`fish_audio_tts`**

Generates speech from text using Fish Audio's TTS API.

Parameters:

- `text` (required): Text to convert to speech (max 10,000 characters)
- `reference_id` (optional): Voice model reference ID
- `reference_name` (optional): Select a voice by name
- `reference_tag` (optional): Select a voice by tag
- `streaming` (optional): Enable streaming mode
- `format` (optional): Output format (`mp3`, `wav`, `pcm`, `opus`)
- `mp3_bitrate` (optional): MP3 bitrate (`64`, `128`, `192`)
- `normalize` (optional): Enable text normalization (default: `true`)
- `latency` (optional): Latency mode (`normal`, `balanced`)
- `output_path` (optional): Custom output file path
- `auto_play` (optional): Automatically play the generated audio
- `websocket_streaming` (optional): Use WebSocket streaming instead of HTTP
- `realtime_play` (optional): Play audio in real time during WebSocket streaming

Voice selection priority: `reference_id` > `reference_name` > `reference_tag` > default
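The documented priority order can be sketched as a small resolver. The `VoiceReference` shape and the `resolveVoice` helper are illustrative assumptions, not the package's actual API:

```typescript
// Illustrative sketch of the documented voice selection priority:
// reference_id > reference_name > reference_tag > configured default.
interface VoiceReference {
  reference_id: string;
  name: string;
  tags: string[];
}

function resolveVoice(
  refs: VoiceReference[],
  opts: { reference_id?: string; reference_name?: string; reference_tag?: string },
  defaultId?: string
): VoiceReference | undefined {
  const { reference_id, reference_name, reference_tag } = opts;
  if (reference_id) return refs.find((r) => r.reference_id === reference_id);
  if (reference_name) return refs.find((r) => r.name === reference_name);
  if (reference_tag) return refs.find((r) => r.tags.includes(reference_tag));
  return refs.find((r) => r.reference_id === defaultId);
}
```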
**`fish_audio_list_references`**

Lists all configured voice references. No parameters required.
```
User: "Generate speech saying 'Hello, world! Welcome to Fish Audio TTS.'"

Claude: I'll generate speech for that text using Fish Audio TTS.
[Uses fish_audio_tts tool with text parameter]

Result: Audio file saved to ./audio_output/tts_2025-01-03T10-30-00.mp3
```

```
User: "Generate speech with voice model xyz123 saying 'This is a custom voice test'"

Claude: I'll generate speech using the specified voice model.
[Uses fish_audio_tts tool with text and reference_id parameters]

Result: Audio generated with custom voice model xyz123
```

```
User: "Use Alice's voice to say 'Hello from Alice'"

Claude: I'll generate speech using Alice's voice.
[Uses fish_audio_tts tool with reference_name: "Alice"]

Result: Audio generated with Alice's voice
```

```
User: "Generate Japanese speech saying 'こんにちは' with an anime voice"

Claude: I'll generate Japanese speech with an anime-style voice.
[Uses fish_audio_tts tool with reference_tag: "anime"]

Result: Audio generated with anime voice style
```

```
User: "What voices are available?"

Claude: I'll list all configured voice references.
[Uses fish_audio_list_references tool]

Result:
- Alice (id: id1) - Tags: female, english [Default]
- Bob (id: id2) - Tags: male, japanese
- Carol (id: id3) - Tags: female, japanese, anime
```

```
User: "Generate a long speech in streaming mode about the benefits of AI"

Claude: I'll generate the speech in streaming mode for faster response.
[Uses fish_audio_tts tool with streaming: true]

Result: Streaming audio saved to ./audio_output/tts_2025-01-03T10-35-00.mp3
```

```
User: "Stream and play in real-time: 'Welcome to the future of AI'"

Claude: I'll stream the speech via WebSocket and play it in real-time.
[Uses fish_audio_tts tool with websocket_streaming: true, realtime_play: true]

Result: Audio streamed and played in real-time via WebSocket
```
Clone the repository and install dependencies:

```bash
git clone https://github.com/da-okazaki/mcp-fish-audio-server.git
cd mcp-fish-audio-server
npm install
```

Create a `.env` file and add your API key:

```bash
cp .env.example .env
# Edit .env with your API key
```

Build the project, or run it in development mode:

```bash
npm run build
npm run dev
```

Run the test suite:

```bash
npm test
```
```
mcp-fish-audio-server/
├── src/
│   ├── index.ts          # MCP server entry point
│   ├── tools/
│   │   └── tts.ts        # TTS tool implementation
│   ├── services/
│   │   └── fishAudio.ts  # Fish Audio API client
│   ├── types/
│   │   └── index.ts      # TypeScript definitions
│   └── utils/
│       └── config.ts     # Configuration management
├── tests/                # Test files
├── audio_output/         # Default audio output directory
├── package.json
├── tsconfig.json
└── README.md
```
The service provides two main methods:
- `generateSpeech`: Standard TTS generation
- `generateSpeechStream`: Streaming TTS generation
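To make the distinction concrete, here is a hypothetical shape for the two methods, with an in-memory stub showing how a caller consumes each one. The `TTSRequest` fields and the method signatures are assumptions for illustration, not the package's actual API:

```typescript
// Hypothetical service shape: generateSpeech resolves with the complete
// audio buffer, while generateSpeechStream yields chunks as they arrive.
interface TTSRequest {
  text: string;
  reference_id?: string;
  format?: "mp3" | "wav" | "pcm" | "opus";
}

interface FishAudioService {
  generateSpeech(req: TTSRequest): Promise<Uint8Array>;
  generateSpeechStream(req: TTSRequest): AsyncIterable<Uint8Array>;
}

// In-memory stub (no network) demonstrating both call patterns.
const stubService: FishAudioService = {
  async generateSpeech(req) {
    return new TextEncoder().encode(`audio:${req.text}`);
  },
  async *generateSpeechStream(req) {
    for (const word of req.text.split(" ")) {
      yield new TextEncoder().encode(word); // one chunk per word
    }
  },
};

async function demo(): Promise<number> {
  await stubService.generateSpeech({ text: "hi" }); // whole buffer at once
  let chunks = 0;
  for await (const _ of stubService.generateSpeechStream({ text: "a b c" })) {
    chunks++; // a real caller would write each chunk to a file or player
  }
  return chunks;
}
```

The streaming variant is what makes lower time-to-first-audio possible: a consumer can start playback as soon as the first chunk arrives instead of waiting for the full buffer.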
The server handles various error scenarios:

- `"FISH_API_KEY environment variable is required"` — set the `FISH_API_KEY` environment variable
- `"Network error: Unable to reach Fish Audio API"` — check your network connection
- `"Text length exceeds maximum limit"` — keep input text under 10,000 characters
- Audio files not appearing — check that the `AUDIO_OUTPUT_DIR` path exists

Contributions are welcome! Please feel free to submit a Pull Request.
1. Create your feature branch (`git checkout -b feature/AmazingFeature`)
2. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
3. Push to the branch (`git push origin feature/AmazingFeature`)
4. Open a Pull Request

This project is licensed under the MIT License - see the LICENSE file for details.
For issues, questions, or contributions, please visit the GitHub repository.
See CHANGELOG.md for a detailed list of changes.