RAG-Anything: All-in-One Multimodal RAG Framework
Retrieval-Augmented Generation (RAG) has transformed how large language models access external knowledge, but most RAG systems handle only text. Real-world documents contain images, tables, equations, and charts – multimodal content that text-only RAG pipelines simply discard. RAG-Anything from HKUDS solves this problem with a modular, end-to-end framework that processes every modality in your documents and makes them retrievable through a unified knowledge graph.
With over 17,000 stars on GitHub, RAG-Anything has quickly become the go-to open-source solution for building multimodal RAG systems that go beyond plain text retrieval.
How It Works
RAG-Anything implements a 5-stage pipeline that transforms raw multimodal documents into a queryable knowledge graph, then retrieves the most relevant multimodal chunks to augment LLM responses.
Stage 1: Document Parsing
The pipeline begins with MineRU-based document parsing, which extracts structured content from PDFs, images, and other document formats. This stage identifies text blocks, figures, tables, equations, and charts, preserving their spatial relationships and semantic context.
Stage 2: Multi-Modal Understanding
Each extracted element is processed by specialized modality processors:
- Image Processor – Uses vision-language models to generate rich textual descriptions of images, making them searchable alongside text
- Table Processor – Converts tables into structured text representations that capture both content and layout
- Equation Processor – Transforms mathematical equations into LaTeX and natural language descriptions
- Chart Processor – Analyzes charts and graphs, extracting data trends and visual patterns
Stage 3: Multimodal Analysis Engine
The analysis engine orchestrates all modality processors, ensuring that each content type receives specialized treatment while maintaining cross-modal relationships. This is where RAG-Anything’s modular design shines – you can add or replace processors without touching the rest of the pipeline.
Stage 4: Knowledge Graph Index
Processed content is indexed into a knowledge graph built on LightRAG, enabling both entity-level and relationship-level retrieval. The knowledge graph captures connections between concepts across different modalities, so a query about “revenue trends” can retrieve both the relevant text paragraph and the chart that visualizes those trends.
Stage 5: Modality-Aware Retrieval
When a query arrives, RAG-Anything performs modality-aware retrieval from the knowledge graph. This means it doesn’t just find text matches – it retrieves the most relevant content regardless of modality, whether that’s a paragraph, a table, an equation, or an image description. The retrieved chunks are then assembled into a context-rich prompt for the LLM.
Installation
From PyPI
The simplest way to install RAG-Anything is via pip:
pip install raganything
From Source
For the latest development version, clone the repository and install with uv:
git clone https://github.com/HKUDS/RAG-Anything.git
cd RAG-Anything
uv sync
Requirements: Python 3.10 or higher.
Usage
Basic Query
After installation, you can start querying with RAG-Anything using its Python API:
from raganything import RAGAnything
# Initialize with your LLM and embedding model
rag = RAGAnything(
working_dir="./rag_workspace",
llm_model="your-llm-model",
embedding_model="your-embedding-model"
)
# Insert documents
rag.insert("path/to/your/document.pdf")
# Query with multimodal retrieval
result = rag.query("What are the key findings in this report?")
print(result)
Multimodal Document Processing
RAG-Anything automatically detects and processes different content types within your documents:
# Insert a document containing images, tables, and equations
rag.insert("research_paper.pdf")
# The framework will:
# 1. Parse the document structure
# 2. Process each modality with specialized processors
# 3. Build a knowledge graph from all content types
# 4. Enable cross-modal retrieval
Custom Modality Processors
You can extend RAG-Anything with custom modality processors for specialized content types:
from raganything import ModalProcessor
class CustomProcessor(ModalProcessor):
def process(self, content):
# Your custom processing logic
return processed_result
# Register the custom processor
rag.register_processor("custom_type", CustomProcessor())
Features
| Feature | Description |
|---|---|
| Multimodal RAG | Process and retrieve from images, tables, equations, and charts alongside text |
| Knowledge Graph Index | Built on LightRAG for entity and relationship-level retrieval |
| Modular Architecture | Pluggable modality processors that can be added or replaced independently |
| MineRU Parsing | Advanced document parsing that preserves spatial and semantic structure |
| Cross-Modal Retrieval | Queries retrieve relevant content regardless of original modality |
| LLM Agnostic | Works with any LLM backend through configurable model interfaces |
| Incremental Updates | Add new documents without rebuilding the entire knowledge graph |
Architecture Deep Dive
Understanding the 5-Stage Pipeline
The RAG-Anything architecture follows a clear data flow pattern where each stage builds on the output of the previous one:
Document Parsing to Multi-Modal Understanding: Raw documents enter the pipeline and are decomposed into their constituent elements. The parser identifies regions of text, images, tables, equations, and charts, extracting both content and metadata like page position and bounding boxes.
Multi-Modal Understanding to Analysis Engine: Each extracted element is routed to its corresponding modality processor. The analysis engine coordinates this routing and ensures that cross-modal references (like “see Figure 3” in text) are preserved and linked.
Analysis Engine to Knowledge Graph: The processed outputs from all modality processors are unified into a single knowledge graph. Text descriptions, table summaries, equation representations, and image captions all become nodes and edges in the graph, connected by semantic relationships.
Knowledge Graph to Retrieval: At query time, the retrieval system traverses the knowledge graph to find the most relevant content across all modalities. This modality-aware approach ensures that the LLM receives the richest possible context, regardless of whether the answer comes from text, a table, or a visual element.
Why Knowledge Graphs for RAG?
Traditional RAG systems use vector similarity search over text chunks, which works well for text-only content but fails when the answer is encoded in a chart, table, or equation. Knowledge graphs solve this by:
- Capturing relationships between entities across modalities
- Enabling multi-hop reasoning through connected concepts
- Supporting both local and global retrieval – find specific facts or understand broad themes
- Preserving document structure so retrieved content maintains its original context
The LightRAG Foundation
RAG-Anything builds on LightRAG, which provides efficient graph-based indexing and retrieval. Key advantages include:
- Dual-level retrieval – entity-level for specific facts, relationship-level for broader context
- Incremental updates – add new documents without rebuilding the entire graph
- Low resource overhead – optimized for production deployments with minimal memory footprint
Troubleshooting
| Issue | Solution |
|---|---|
ModuleNotFoundError: No module named 'raganything' | Ensure you installed with pip install raganything and activated the correct virtual environment |
| MineRU parsing errors | Install the full MineRU dependencies: pip install raganything[core] |
| OOM during knowledge graph construction | Reduce batch size or use a smaller embedding model |
| Slow retrieval on large corpora | Consider increasing the number of graph traversal hops or adjusting the retrieval threshold |
| Image processing failures | Verify your vision-language model is properly configured and accessible |
Conclusion
RAG-Anything represents a significant step forward for Retrieval-Augmented Generation by treating multimodal content as first-class citizens in the retrieval pipeline. Rather than stripping away images, tables, and equations, it processes each modality with specialized processors and indexes them into a unified knowledge graph. This approach produces richer, more accurate LLM responses that leverage the full content of your documents.
The framework’s modular architecture makes it easy to extend with custom processors, and its LightRAG foundation ensures efficient graph-based retrieval at scale. Whether you are building research assistants, document Q&A systems, or enterprise knowledge bases, RAG-Anything provides the infrastructure to handle real-world documents in all their multimodal complexity.
Links
Related Posts
- LightRAG: Simple and Fast Retrieval-Augmented Generation
- Ollama: Run Large Language Models Locally
- Open WebUI: User-Friendly AI Interface
Enjoyed this post? Never miss out on future posts by following us