Release

Introducing Knowledge Pipeline

An adaptable, scalable, and observable RAG data processing pipeline that converts enterprise unstructured data into high-quality context usable by LLMs.

Leilei

Product Marketing

Written on

Sep 23, 2025


Today we are introducing the new Knowledge Pipeline, a visual pipeline that turns messy enterprise data into high-quality context for LLMs.

In most enterprises, the bottleneck is not the model. It is context engineering on unstructured data. Critical information sits in PDFs, PPT, Excel, images, HTML, and more. The challenge is to convert scattered, heterogeneous, and constantly changing internal data into reliable context that LLMs can consume. This is not simple import work. It is an engineered process that needs design, tuning, and observability.

Traditional RAG often struggles on enterprise data due to three issues:

  1. Fragmented sources. Data lives across ERP systems, wikis, email, and drives, each with its own auth and format, making point-to-point integration costly.

  2. Parsing loss. After parsing, documents become unstructured text with charts and formulas dropped, and when naive chunking further breaks document logic, LLMs end up answering from incomplete fragments.

  3. Black box processing. Little visibility into each step makes it hard to tell whether failures come from parsing, chunking, or embedding, and reproducing errors is painful.

Knowledge Pipeline provides the missing data infrastructure for context engineering. With a visual pipeline, teams control the entire path from raw sources to trustworthy context.

Visual and orchestrated Knowledge Pipeline

Knowledge Pipeline inherits Dify Workflow’s canvas experience and makes the RAG ETL path visible. Each step is a node. From source connection and document parsing to chunking strategies, you choose the right plugin for text, images, tables, and scans. Backed by the Dify Marketplace, teams assemble document processing lines like building blocks and tailor flows by industry and data type.

When needed, you can embed Workflow nodes such as If-else, Code, and LLM into the pipeline. Use a model for content enrichment and code for rule-based cleaning to achieve true flexibility.
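To make the node model concrete, here is a minimal sketch (hypothetical names, not the actual Dify API) of a pipeline as an ordered chain of node callables that each transform a shared context:

```python
from typing import Any, Callable

# A node is just a callable that transforms the running context dict.
Node = Callable[[dict[str, Any]], dict[str, Any]]

def parse_pdf(ctx: dict[str, Any]) -> dict[str, Any]:
    # Stub parsing node: in practice this would call a parser plugin.
    ctx["text"] = f"parsed:{ctx['source']}"
    return ctx

def clean_rules(ctx: dict[str, Any]) -> dict[str, Any]:
    # Rule-based cleaning, the kind of logic a "Code" node would hold.
    ctx["text"] = ctx["text"].strip().lower()
    return ctx

def run_pipeline(nodes: list[Node], ctx: dict[str, Any]) -> dict[str, Any]:
    """Execute nodes in order, each seeing the previous node's output."""
    for node in nodes:
        ctx = node(ctx)
    return ctx

result = run_pipeline([parse_pdf, clean_rules], {"source": "Report.PDF"})
print(result["text"])  # parsed:report.pdf
```

Swapping a plugin then amounts to replacing one callable in the list while the rest of the chain stays untouched.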

  1. Enterprise-grade data source integrations

Knowledge Pipeline brings Data Source as a new plugin type, letting each knowledge base connect to multiple unstructured sources without custom adapters or auth code. Grab what you need from the Marketplace, or use the standard interfaces to build connectors for your own systems.

Covered sources include:

  • Local files: 30 plus formats such as PDF, Word, Excel, PPT, Markdown

  • Cloud storage: Google Drive, AWS S3, Azure Blob, Box, OneDrive, Dropbox

  • Online docs: Notion, Confluence, SharePoint, GitLab, GitHub

  • Web crawling: Firecrawl, Jina, Bright Data, Tavily
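The standard connector contract can be imagined along these lines (a minimal sketch with hypothetical class and method names; the real Dify plugin interface differs):

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterator

@dataclass
class Document:
    uri: str
    content: bytes
    mime_type: str

class DataSource(ABC):
    """Hypothetical connector interface: authenticate once, then stream documents."""

    @abstractmethod
    def authenticate(self, credentials: dict) -> None: ...

    @abstractmethod
    def list_documents(self) -> Iterator[Document]: ...

class LocalFolderSource(DataSource):
    """Toy connector for local files, implementing the same contract."""

    def __init__(self, paths: list[str]):
        self.paths = paths

    def authenticate(self, credentials: dict) -> None:
        pass  # local files need no auth

    def list_documents(self) -> Iterator[Document]:
        for path in self.paths:
            yield Document(uri=path, content=b"...", mime_type="application/pdf")

src = LocalFolderSource(["a.pdf", "b.pdf"])
src.authenticate({})
print([d.uri for d in src.list_documents()])  # ['a.pdf', 'b.pdf']
```

Because every connector exposes the same two operations, the downstream pipeline never needs to know whether documents came from a drive, a wiki, or a crawler.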

  2. Pluggable data processing pipeline

We've broken processing into standard nodes to make the pipeline predictable and extensible. You can swap plugins based on your scenario.

  • Extract

Ingest from many sources. Downstream steps adapt to the upstream output type, whether file objects or page content, including text and images.

  • Transform

The core of the pipeline, composed of four stages:

  1. Parse: Choose the optimal parser per file type and extract text and structured metadata. For scans, tables, or PPT text-box ordering, run multiple parsers in parallel to avoid loss.

  2. Enrich: Use LLM and Code nodes for entity extraction, summarization, classification, redaction, and more.

  3. Chunk: Three strategies are available: General, Parent-Child, and Q&A, covering common documents, long technical files, and structured table queries.

  4. Embed: Choose embeddings by cost, language, and dimension from different providers.

  • Load

Write vectors and metadata into the knowledge base and build efficient indexes. Both high-quality vector indexes and cost-efficient inverted indexes are supported. Configure metadata tags for precise filtering and access control.
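As a rough illustration of the Parent-Child chunking strategy mentioned in the Transform stage (a toy character-based sketch; real chunkers split on semantic boundaries rather than fixed offsets):

```python
def parent_child_chunks(text: str, parent_size: int = 200, child_size: int = 50):
    """Toy Parent-Child chunking: large parent chunks preserve global
    context, while small child chunks are what gets embedded and matched.
    At query time, a matching child is expanded to its parent for the LLM."""
    parents = [text[i:i + parent_size] for i in range(0, len(text), parent_size)]
    index = []
    for pid, parent in enumerate(parents):
        for i in range(0, len(parent), child_size):
            index.append({"parent_id": pid, "child": parent[i:i + child_size]})
    return parents, index

parents, index = parent_child_chunks("x" * 450)
print(len(parents), len(index))  # 3 9
```

The two-level structure is what lets retrieval stay precise (small children) without handing the model incomplete fragments (it reads the parent).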

After processing, retrieval supports vector, full text, or hybrid strategies. Use metadata filters and reranking to return precise results with original citations, and an LLM organizes the final answer with mixed text and images for better accuracy and experience.
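A toy sketch of the hybrid retrieval step (illustrative scoring only; the score weights, filter field, and function names here are assumptions, not Dify's implementation):

```python
def hybrid_search(query_vec, query_terms, docs, tag=None, top_k=3):
    """Toy hybrid retrieval: blend vector and keyword scores,
    apply a metadata filter, then rank by the fused score."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    def keyword(terms, text):
        words = text.lower().split()
        return sum(words.count(t) for t in terms) / max(len(words), 1)

    scored = []
    for d in docs:
        if tag and d["tag"] != tag:  # metadata filter before scoring
            continue
        score = 0.6 * cosine(query_vec, d["vec"]) + 0.4 * keyword(query_terms, d["text"])
        scored.append((score, d))
    scored.sort(key=lambda s: s[0], reverse=True)  # rank by fused score
    return [d for _, d in scored[:top_k]]

docs = [
    {"text": "refund policy details", "vec": [1, 0], "tag": "policy"},
    {"text": "engineering roadmap", "vec": [0, 1], "tag": "eng"},
]
hits = hybrid_search([1, 0], ["refund"], docs, tag="policy")
print([h["text"] for h in hits])  # ['refund policy details']
```

In production the fused candidates would then go through a dedicated reranker model before citation and answer assembly.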

  3. Observable debugging

Legacy pipelines behave like a black box. With Knowledge Pipeline you can Test Run the entire flow step by step and inspect inputs and outputs at each node. The Variable Inspect panel shows intermediate variables and context in real time, so you can quickly locate parsing errors, chunking issues, or missing metadata.

Once validation is complete, publish the pipeline and move into standardized processing.

See the docs for detailed guidance.

  4. Templates for common scenarios

Seven built in templates help you start fast:

  • General document processing - General Mode (ECO)

    Split into general paragraph chunks with economical indexing. Good for large batches.

  • Long document processing - Parent‑Child (HQ)

    Hierarchical parent‑child chunking that preserves local precision and global context. Ideal for long technical docs and reports.

  • Table data extraction - Simple Q&A

    Extract selected columns from tables and build structured question answer pairs for natural language queries.

  • Complex PDF parsing - Complex PDF with Images & Tables

    Targeted extraction of images and tables from PDFs for downstream multimodal search.

  • Multimodal enrichment - Contextual Enrichment Using LLM

    Let an LLM describe images and tables to improve retrieval.

  • Document format conversion - Convert to Markdown

    Convert Office formats to Markdown for speed and compatibility.

  • Intelligent Q&A generation - LLM Generated Q&A

    Produce key question-answer sets from long documents to create precise knowledge points.

RAG plugin ecosystem

Dify provides an open plugin ecosystem built by the team, partners, and community. With a plugin based architecture, enterprises select tools that fit their needs.

  • Connector: Google Drive, Notion, Confluence, and many more

  • Ingestion: LlamaParse, Unstructured, OCR tools

  • Storage: Qdrant, Weaviate, Milvus, Oracle and other leading vector databases, customizable for enterprise and open-source deployments

Why Knowledge Pipeline

Knowledge Pipeline operationalizes context engineering. It converts unstructured enterprise data into high quality context that powers retrieval, reasoning, and applications.

Three core benefits

Bridge business and data engineering

Visual orchestration and real-time debugging let business teams participate directly. They can see how data is processed and help troubleshoot retrieval, while engineering focuses on growth-critical work.

Lower build and maintenance cost

Many RAG projects are one-off builds. Knowledge Pipeline turns processing into reusable assets. Contract review, support knowledge bases, and technical docs become templates that teams can copy and adapt, reducing rebuilds and ongoing effort.

Adopt best-of-breed vendors

No need to choose between a full in-house build and a single vendor. Swap OCR, parsing, structured extraction, vector stores, and rerankers at any time while keeping the overall architecture stable.

What’s next

In the latest version, we rebuilt the Workflow engine around a queued graph execution model. The new engine removes limits in complex parallel scenarios and enables more flexible node wiring and control. Pipelines can now start from any node and pause and resume mid-run, laying the groundwork for breakpoints, human-in-the-loop, and trigger-based execution.
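The idea behind queued graph execution can be sketched as follows (a toy scheduler with hypothetical names, not the actual engine): a ready-queue drives dispatch, so any node can be the entry point, and pausing is just a matter of persisting the queue.

```python
from collections import deque

def run_graph(nodes, edges, start):
    """Toy queued graph execution: nodes become ready when their
    predecessors finish, and a queue dispatches whatever is ready."""
    indegree = {name: 0 for name in nodes}
    for src, dst in edges:
        indegree[dst] += 1

    queue = deque([start])  # any node can serve as the entry point
    results = []
    while queue:
        current = queue.popleft()
        results.append(nodes[current]())  # execute the node
        for src, dst in edges:
            if src == current:
                indegree[dst] -= 1
                if indegree[dst] == 0:
                    queue.append(dst)  # successor is now ready
    return results

nodes = {"extract": lambda: "E", "transform": lambda: "T", "load": lambda: "L"}
edges = [("extract", "transform"), ("transform", "load")]
print(run_graph(nodes, edges, "extract"))    # ['E', 'T', 'L']
print(run_graph(nodes, edges, "transform"))  # ['T', 'L']
```

Starting from "transform" skips the upstream node entirely, which is the property that makes mid-pipeline entry, resume, and breakpoints natural in a queue-driven design.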

Start orchestrating enterprise-grade Knowledge Pipelines.

© 2025 LangGenius, Inc.