How to

Adding Multimodal Capabilities to DeepSeek R1 Using Dify

On Dify, you can quickly build a bidirectional collaborative system based on DeepSeek R1 and multi-modal models through visual workflow design.

Steven

Technical Writer

Written on

Feb 8, 2025



Introduction

Less than a month after the DeepSeek V3 model sparked heated discussions in the industry, DeepSeek has once again launched a new model, R1, setting off another wave in the global artificial intelligence field. If V3 demonstrated that top-tier model performance could be achieved with low-cost training thanks to its impressive cost-effectiveness, R1 represents a qualitative leap in terms of technology. This open-source model not only inherits the characteristic of high cost-effectiveness but also attracts the attention of leading AI researchers worldwide with its unique training methods and emergent reasoning abilities.

In many tests, DeepSeek R1 has demonstrated remarkable reasoning capabilities. DeepSeek R1-Zero's accuracy on the AIME math competition climbed from an initial 15.6% to 71.0%, with multiple attempts reaching 86.7%. In another test, the model also exhibited strong transfer learning ability, outperforming 96.3% of human participants on the programming contest platform Codeforces. These results clearly demonstrate that R1-Zero isn't simply memorizing problem-solving patterns, but has genuinely mastered deep mathematical intuition and universal reasoning capabilities.

While DeepSeek R1 is powerful, it currently has some pain points, such as its lack of multimodal capabilities. The DeepSeek web version offers file upload and web access, but these two features cannot be enabled simultaneously.

To address the aforementioned issues, we leveraged Dify, an open-source LLM Ops tool, for low-code development. When developing LLM products with Dify, you only need to focus on product design without worrying about code implementation. By simply dragging and adding nodes, you can quickly transform ideas into runnable products and deploy them.

Rather than using DeepSeek R1 directly as the output model, we will use its output as a pre-processing reasoning step to enhance a more powerful multimodal model that lacks deep reasoning capabilities. Furthermore, we will use Dify's beta Plugin feature to package the resulting LLM application as an OpenAI-formatted API, allowing it to be integrated with other tools.
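Once the application is packaged as an OpenAI-formatted API, any OpenAI-compatible client can call it. The sketch below only assembles the request payload; the endpoint path and the `dify-app` model name are placeholders for whatever your plugin exposes, not real identifiers:

```python
def build_chat_request(question: str, model: str = "dify-app") -> dict:
    """Assemble an OpenAI-format chat completion payload.

    The model name is a placeholder for the name your Dify
    plugin registers; the payload shape follows the OpenAI
    Chat Completions format.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "stream": False,
    }

payload = build_chat_request("Summarize the attached report.")
# This payload could then be POSTed to <your-endpoint>/chat/completions
# with an "Authorization: Bearer <app-key>" header.
```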

Dify: Low-Code Integration and Development of DeepSeek Applications

On Dify, you can quickly build a bidirectional collaborative system based on DeepSeek R1 and multi-modal models through visual workflow design.

First, you need to log in to Dify and select "Create Blank Application" -> "Chatflow".

File Upload and Doc Extractor

Dify v0.10.0 added a file upload feature, which works together with a Doc Extractor node to parse uploaded files into text that LLMs can read.

You can enable and set file types in "Features" -> "File Upload".
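Conceptually, the Doc Extractor's job is simply to turn an uploaded file into plain text the LLM node can consume. For a plain-text file, that step amounts to the following simplified illustration (not Dify's actual implementation, which also handles PDF, DOCX, and other formats):

```python
from pathlib import Path

def extract_text(path: str) -> str:
    """Read an uploaded file into plain text for the LLM.

    This sketch covers only plain-text files; a real extractor
    dispatches on file type (PDF, DOCX, etc.) before producing text.
    """
    return Path(path).read_text(encoding="utf-8")
```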

DeepSeek R1 Node (LLM Node): "The Top Student's" In-Depth Reasoning

First, you need to obtain and add your DeepSeek API Key in "Settings" -> "Model Providers".

If you are using the community or enterprise version, please ensure that Dify is the latest version.

DeepSeek R1 plays the role of the "top student", focusing on problem breakdown and logical reasoning. Its core task is to output the complete thought process rather than directly providing answers.

When writing system prompts, it is recommended to use a structured format such as XML, which helps the model decompose the problem into subtasks.

<Role>
You are an LLM with reasoning capabilities.
Unlike other LLMs, you can output your complete thinking process.
</Role>
<Task>
Your task is to assist other LLMs that lack reasoning capabilities.
You need to output complete thinking processes for other LLMs based on user questions.
<Steps>
"Step 1": "Receive questions from users."
"Step 2": "Conduct deep reasoning and analysis on user questions."
"Step 3": "Elaborate on the reasoning process and logic, ensuring the process is complete and easy to understand."
"Step 4": "Output the complete reasoning process, no final answer needed."
</Steps>
</Task>
<Limitations>
Do not output the final answer, only output the thinking process.
Do not explain your own capabilities or limitations.
</Limitations>
In addition, we need to adjust the user input content, adding the content from the doc extractor:
<User Query>
{{Start}}
</User Query>
<file>
{{text}}
</file>

Note that the two input variables are wrapped in XML tags, which helps the LLM distinguish the user query from the file content. You can reference a previous node's variables by typing { or /.
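The same assembly can be sketched in code. The helper below mirrors the XML template above, wrapping the user query and the Doc Extractor output in their respective tags; the function itself is illustrative, not part of Dify:

```python
def build_r1_input(user_query: str, file_text: str) -> str:
    """Wrap the user query and the doc-extractor output in XML tags
    so the reasoning model can tell the two sources apart."""
    return (
        f"<User Query>\n{user_query}\n</User Query>\n"
        f"<file>\n{file_text}\n</file>"
    )

prompt = build_r1_input(
    "What are the key risks in this report?",
    "Q3 revenue fell 12% year over year...",
)
```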

Gemini Node (LLM Node): Multi-Modal Implementation

Gemini is a multimodal model with strong visual capabilities. Here it relies on R1's reasoning output to combine multimodal data and generate the final answer; its strengths lie in image parsing and refining results.

The system prompt is as follows:

<Role>
You are an LLM that excels at learning.
</Role>
<Task>
You need to learn from others' thinking processes about problems, enhance your results with their thinking, and then provide your answer.
<Steps>
"Step 1": "Receive thinking process from DeepSeek-R1 model."
"Step 2": "Carefully study and understand DeepSeek-R1's reasoning logic and steps."
"Step 3": "Generate final answer based on DeepSeek-R1's thinking, combined with image capabilities."
"Step 4": "Output the final answer, no need to explain the thinking process."
</Steps>
</Task>
<Limitations>
Do not repeat DeepSeek-R1's thinking process, only output the final answer.
Do not explain your own capabilities or learning process.
Ensure the answer is accurate and relevant to the question.
</Limitations>

In addition, you need to enable Vision in this node so that the model can accept image inputs.
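Conceptually, the Chatflow chains the two nodes: R1 produces a reasoning trace, and Gemini consumes that trace together with the question (and any image). A pure-Python sketch of that hand-off, where the `call_r1` and `call_gemini` callables are placeholders for the actual model calls Dify makes:

```python
from typing import Callable

def two_stage_answer(
    question: str,
    call_r1: Callable[[str], str],      # placeholder: DeepSeek R1 call
    call_gemini: Callable[[str], str],  # placeholder: Gemini call (with Vision)
) -> str:
    """First obtain a reasoning trace from R1, then feed it to Gemini."""
    reasoning = call_r1(question)
    gemini_input = (
        f"<Thinking Process>\n{reasoning}\n</Thinking Process>\n"
        f"<Question>\n{question}\n</Question>"
    )
    return call_gemini(gemini_input)

# Stubbed example with dummy model calls:
answer = two_stage_answer(
    "What does the chart show?",
    call_r1=lambda q: "Step 1: identify the axes...",
    call_gemini=lambda p: "The chart shows rising revenue.",
)
```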

Try it Now

You can now immediately pull this demo from the Explore page to your application list:

English: Deploy to Dify

Chinese: Deploy to Dify
