
We are thrilled to announce a new collaboration between Dify and Open Audio. The versatile Fish Audio toolset plugin from Open Audio is now available on the Dify Marketplace. This integration enables Dify users to seamlessly incorporate high-quality text-to-speech and voice cloning into their AI applications.
Core Functions of Fish Audio
Fish Audio excels in speech generation and processing, offering the following key capabilities:

Speech Generation (TTS): Fish Audio provides robust real-time text-to-speech conversion. It features a WebSocket API for streaming audio output, giving users control over parameters like speed and volume. It supports common audio formats including Opus, MP3, and WAV.
Voice Cloning: Fish Audio also supports voice cloning. Fast cloning needs only 30-45 seconds of voice samples. For higher fidelity, advanced cloning trains on 30-180 minutes of high-quality audio. Cloning supports multiple languages and emotional expressions, and the quality of the output depends on input audio that meets strict quality standards.
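As a rough illustration of the speech-generation settings described above, here is a minimal Python sketch that validates a format, speed, and volume before they are sent anywhere. The `TTSOptions` class, its defaults, and the payload field names (`format`, `prosody`, `speed`, `volume`) are assumptions made for this example, not confirmed Fish Audio API fields; check the Fish Audio documentation for the actual parameter names and ranges.

```python
from dataclasses import dataclass

# Audio formats named in this post.
SUPPORTED_FORMATS = {"opus", "mp3", "wav"}


@dataclass
class TTSOptions:
    """Hypothetical container for speech-generation settings."""

    audio_format: str = "mp3"
    speed: float = 1.0   # assumed multiplier; 1.0 means normal pace
    volume: float = 0.0  # assumed relative adjustment; 0.0 means unchanged

    def to_payload(self) -> dict:
        """Validate the settings and return them as a plain dictionary."""
        fmt = self.audio_format.lower()
        if fmt not in SUPPORTED_FORMATS:
            raise ValueError(f"unsupported format: {self.audio_format!r}")
        if self.speed <= 0:
            raise ValueError("speed must be positive")
        # Field names are illustrative assumptions, not a documented schema.
        return {
            "format": fmt,
            "prosody": {"speed": self.speed, "volume": self.volume},
        }


options = TTSOptions(audio_format="wav", speed=1.2)
payload = options.to_payload()
```

Validating locally like this catches an unsupported format before any request is made.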
Getting Started
To begin using Fish Audio tools in Dify, find and install the "Fish Audio" plugin from the Dify Marketplace.

Next, configure the plugin with your Fish Audio API key and endpoint URL, which you can obtain from Fish Audio. You'll also need to select the balance mode during this setup.

Using the Fish Audio TTS Tool in a Dify Chatflow
For example, you can build a Dify chatflow in which a Large Language Model (LLM) generates text, then use the Fish Audio Text-to-Speech (TTS) tool node to convert that text output into an audio segment automatically.
To configure the Fish Audio TTS node within your workflow:
Input Text: Specify the text you want to convert to speech. In this case, you would link the text output from the LLM node to the input field of the TTS node.
Select Voice: Choose the desired voice by selecting the appropriate Voice ID.
Output Format: Set your preferred output audio file type.
This setup allows the workflow to seamlessly generate speech from the LLM's written response using the specific voice and format you've chosen.
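For readers who want to picture what the TTS node does with its three inputs, the following Python sketch assembles an HTTP request from the LLM's text, a Voice ID, and an output format. It is an illustration only and not the plugin's actual implementation: the function name `build_tts_request`, the JSON field names (`text`, `reference_id`, `format`), the bearer-token header, and the placeholder endpoint are all assumptions. The request is built but never sent.

```python
import json
import urllib.request


def build_tts_request(llm_text, voice_id, output_format, api_key, endpoint):
    """Assemble (but do not send) a text-to-speech request.

    Mirrors the three node inputs: input text, Voice ID, output format.
    """
    text = llm_text.strip()
    if not text:
        raise ValueError("LLM output is empty; nothing to synthesize")

    # Field names are illustrative assumptions, not a documented schema.
    body = {"text": text, "reference_id": voice_id, "format": output_format}

    return urllib.request.Request(
        endpoint,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values; substitute the key, endpoint, and Voice ID you configured.
request = build_tts_request(
    llm_text="Thanks for reaching out. Your order has shipped.",
    voice_id="YOUR_VOICE_ID",
    output_format="mp3",
    api_key="YOUR_API_KEY",
    endpoint="https://example.invalid/tts",
)
# Sending it would look like: urllib.request.urlopen(request).read()
```

Inside Dify, the node handles this step for you; the sketch only shows how the inputs map onto a single call.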


Understanding Voice ID
A Voice ID is the unique identifier for a specific voice model on the Fish Audio platform. It essentially represents a distinct voice profile that you can select for text-to-speech generation.

Creating and Using Custom Voices
You aren't limited to the standard voices. You can train your own voice model using the "Build Voice" feature in Fish Audio. Once training is complete, your custom voice appears under "My Library". Copy its Voice ID from there to use it in your Dify workflows.

Real-World Use Cases
Multilingual Customer Support Scenarios: Using Fish Audio's voice cloning feature, businesses can create custom voice models based on recordings of their top customer service representatives. The system then automatically turns written customer service replies into natural-sounding audio using these custom voices. It can even switch to the appropriate voice and language automatically based on the customer's language. This whole process leverages Fish Audio's core capabilities: voice cloning, automatic speech recognition (ASR), and text-to-speech (TTS), leading to more natural and efficient customer interactions.
Creating Educational and Training Content: For education and training, Fish Audio helps quickly create standardized course materials. For instance, in language learning, it can clone the voices of native speakers to provide clear pronunciation examples, while also using ASR technology to give real-time feedback on a learner's pronunciation. Furthermore, TTS technology can generate consistent audio explanations for course content. This streamlines both the creation and delivery of educational materials, ensuring consistency.
Podcast and Media Content Creation: Fish Audio offers media creators a flexible solution for producing content. Creators can use samples of their own voice to create a personalized digital voice and then use this model to turn written scripts into audio recordings. In post-production, the ASR feature can quickly generate transcripts and subtitles, making the content more accessible. The platform also allows creators to adjust speaking speed and emotional tone so the final audio fits their creative needs.
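The language-based voice switching mentioned in the customer support scenario could be wired up in a workflow along these lines. This is a minimal sketch under stated assumptions: the `VOICE_BY_LANGUAGE` table, the placeholder Voice IDs, and the `pick_voice` helper are hypothetical, and in practice the IDs would be copied from your own Fish Audio library.

```python
# Hypothetical mapping from primary language code to a custom Voice ID.
# Replace the placeholder values with Voice IDs from your own library.
VOICE_BY_LANGUAGE = {
    "en": "VOICE_ID_ENGLISH_AGENT",
    "ja": "VOICE_ID_JAPANESE_AGENT",
    "es": "VOICE_ID_SPANISH_AGENT",
}


def pick_voice(language_tag, fallback="en"):
    """Return the Voice ID for a language tag such as 'ja-JP' or 'es_MX'.

    Only the primary subtag is used; unknown languages get the fallback voice.
    """
    primary = language_tag.strip().lower().replace("_", "-").split("-")[0]
    return VOICE_BY_LANGUAGE.get(primary, VOICE_BY_LANGUAGE[fallback])


voice_for_customer = pick_voice("ja-JP")
```

The selected Voice ID can then be passed to the TTS node's voice setting, so each reply is spoken in a voice that matches the customer's language.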
About Open Audio
Open Audio is a research lab belonging to Hanabi AI Inc., dedicated to building better audio projects for the open-source community. Its product, Fish Audio, offers audio synthesis and speech recognition capabilities that are competitive with leading open-source and closed-source systems.
About Dify.AI
Dify.AI is revolutionizing AI-native application development by providing an open-source platform that simplifies the entire lifecycle of AI application creation, deployment, and management. With its extensible plugin ecosystem, Dify.AI enables developers and businesses to seamlessly integrate AI capabilities, customize workflows, and accelerate innovation. By lowering the barriers to AI adoption, Dify.AI empowers users to build intelligent applications with greater efficiency and flexibility.