MODE (Mixture of Document Experts) is a novel framework designed to enhance
Retrieval-Augmented Generation (RAG) by organizing documents into semantically coherent clusters
and utilizing centroid-based retrieval. Unlike traditional RAG pipelines that rely on large vector databases and
re-rankers, MODE offers a scalable, interpretable, and efficient alternative, particularly suited for specialized
or small to medium-sized datasets. It provides two primary classes: `ModeIngestion` for clustering and data preparation, and `ModeInference` for efficient semantic search and response generation.
Installation:

pip install mode_rag
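Conceptually, retrieval in MODE narrows the search to the best-matching cluster before looking at individual chunks. The sketch below illustrates that idea in plain NumPy; it is not `mode_rag`'s internal implementation, and the array shapes and the mean-as-centroid shortcut are assumptions for illustration (MODE itself identifies medoid-style central points).

# Illustrative sketch of centroid-based retrieval; NOT mode_rag internals.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 3 clusters of 5 chunks each, 384-dimensional embeddings.
cluster_embeddings = {label: rng.normal(size=(5, 384)) for label in range(3)}

# One representative vector per cluster (a mean here; MODE uses medoids).
centroids = np.stack([emb.mean(axis=0) for emb in cluster_embeddings.values()])

def cosine_sim(query: np.ndarray, matrix: np.ndarray) -> np.ndarray:
    query = query / np.linalg.norm(query)
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix @ query

query_embedding = rng.normal(size=384)

# Step 1: compare the query against one centroid per cluster (cheap).
best_cluster = int(np.argmax(cosine_sim(query_embedding, centroids)))

# Step 2: search only inside the best-matching cluster, not the whole corpus.
scores = cosine_sim(query_embedding, cluster_embeddings[best_cluster])
best_chunk_index = int(np.argmax(scores))
print(best_cluster, best_chunk_index)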
mode_rag.ModeIngestion

Description: Clusters text data using HDBSCAN and identifies centroids (central points) within each cluster. Results are persisted for inference.

Parameters:
- `chunks` (`List[str]`): List of text documents or chunks.
- `embedding` (`Union[torch.Tensor, List[List[float]], np.ndarray]`): Embeddings corresponding to the text chunks.
- Minimum samples per cluster (`int`, optional). Default: `10`.
- Maximum samples per cluster (`int`, optional). Default: `30`.
- `persist_directory` (`str`, optional): Directory where clustering results will be saved.

Method: `process_data(parallel: bool = True) -> Tuple[Dict[int, List[str]], Dict[int, np.ndarray]]`
- `parallel` (`bool`, optional): Whether to compute centroids in parallel. Defaults to `True`.

Returns:
- `Tuple[Dict[int, List[str]], Dict[int, np.ndarray]]`: Dictionaries mapping cluster labels to text chunks and to medoid embeddings, respectively.
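As a minimal sketch of the call shape based on the parameters above, `ModeIngestion` can be exercised with synthetic data; the chunk texts, embedding dimension, and directory name here are placeholders, and a full PDF-based example follows below.

# Minimal sketch with synthetic data; placeholders only, not a real pipeline.
import numpy as np
from mode_rag import ModeIngestion

chunks = [f"example chunk {i}" for i in range(100)]  # placeholder text chunks
embeddings = np.random.rand(100, 384)                # random embeddings, only to show the expected types

processor = ModeIngestion(
    chunks=chunks,
    embedding=embeddings,
    persist_directory="demo_clusters",               # placeholder directory name
)
# Clusters the chunks with HDBSCAN and persists the results for inference.
processor.process_data(parallel=False)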
# ========================================
# 📄 Sample Code:
# ========================================
#
# 1. Load the PDF with `PyPDFLoader`.
# 2. Chunk it with `RecursiveCharacterTextSplitter`.
# 3. Generate embeddings with `langchain_huggingface`.
#
# This sample uses `RecursiveCharacterTextSplitter` and `EmbeddingGenerator`;
# you can use your **own chunking/embedding** logic.
# The main inputs to `ModeIngestion` are `chunks` and `embeddings`:
## requirements
# pip install langchain_huggingface==0.1.2
# pip install langchain_community==0.3.4
# pip install pypdf==5.1.0
import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"

from mode_rag import ModeIngestion, EmbeddingGenerator
## PDF loader
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("https://arxiv.org/pdf/1706.03762")
docs = loader.load()
print("Downloaded the PDF")
## Chunking
from langchain.text_splitter import RecursiveCharacterTextSplitter

print("Chunking the PDF document")
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
documents = text_splitter.split_documents(docs)
chunks = [doc.page_content for doc in documents]
print("doing embedding")
embed_gen = EmbeddingGenerator()
embeddings = embed_gen.generate_embeddings(chunks)
print("embedding done")
# Cluster the chunks and persist the results to the "attention" directory.
main_processor = ModeIngestion(
    chunks=chunks,
    embedding=embeddings,
    persist_directory="attention",
)
main_processor.process_data(parallel=False)
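Since `process_data` also returns the clustered chunks and medoid embeddings documented above, the final call could instead capture them for a quick sanity check; the variable names below are illustrative:

# Optionally capture the documented return value and inspect the clusters.
cluster_texts, cluster_medoids = main_processor.process_data(parallel=False)
for label, texts in cluster_texts.items():
    print(f"cluster {label}: {len(texts)} chunks")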
mode_rag.ModeInference

Description: ModeInference allows you to perform fast and efficient semantic search on pre-clustered text embeddings. It matches a query to the most relevant clusters using centroids, then retrieves and synthesizes responses.

Parameters:
- `persist_directory` (`str`): Path to the directory containing pre-clustered text data and centroids.

Method: `invoke(query: str, query_embedding: torch.Tensor, prompt: ModelPrompt, model_input: dict = {}, parallel: bool = True, top_n_model: int = 1) -> str`
- `query` (`str`): The search query text.
- `query_embedding` (`torch.Tensor`): The embedding of the search query.
- `prompt` (`ModelPrompt`): A prompt object that helps format the model's output.
- `model_input` (`dict`, optional): Additional parameters for the LLM such as `temperature`, `max_tokens`, `top_p`, `stream`, `model` (default: `{"model": "openai/gpt-4o"}`).
- `parallel` (`bool`, optional): Whether to perform computations in parallel. Defaults to `True`.
- `top_n_model` (`int`, optional): Number of top matching results to retrieve. Defaults to `1`.

Returns:
- `str`: A response generated from the retrieved search results.
# ========================================
# 📄 Sample Code:
# ========================================
#
# 1. Load the clustered data (`ModeInference`).
# 2. Generate the query embedding (replaceable with your own embedding logic).
# 3. Retrieve context and synthesize a response with `ModelPrompt`.
import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"

from mode_rag import (
    EmbeddingGenerator,
    ModeInference,
    ModelPrompt,
)
main_processor = ModeInference(
    persist_directory="attention",
)
print("====== start ======")

query = "What are the key mathematical operations involved in computing self-attention?"
embed_gen = EmbeddingGenerator()
embedding = embed_gen.generate_embedding(query)

# Create a ModelPrompt instance
prompts = ModelPrompt(
    ref_sys_prompt="Use the following pieces of context to answer the user's question. \nIf you don't know the answer, just return you don't know.",
    ref_usr_prompt="context: ",
    syn_sys_prompt="You have been provided with a set of responses from various models to the latest user query. Your task is to synthesize these responses into a single, high-quality response. It is crucial to critically evaluate the information provided in these responses, recognizing that some of it may be biased or incorrect. Your response should not simply replicate the given answers but should offer a refined, accurate, and comprehensive reply to the instruction. Ensure your response is well-structured, coherent, and adheres to the highest standards of accuracy and reliability.\nResponses from models:",
    syn_usr_prompt="responses:",
)
response = main_processor.invoke(
    query,
    embedding,
    prompts,
    model_input={"temperature": 0.3, "model": "openai/gpt-4o-mini"},
    top_n_model=2,
)
print(response)
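The `model_input` dictionary can also carry the other documented LLM parameters (`max_tokens`, `top_p`, `stream`); the values in this variant call are illustrative only:

# Variant invoke call using the documented model_input keys (values are examples).
response = main_processor.invoke(
    query,
    embedding,
    prompts,
    model_input={
        "model": "openai/gpt-4o",  # documented default model
        "temperature": 0.2,
        "max_tokens": 512,
        "top_p": 0.9,
        "stream": False,
    },
    parallel=True,
    top_n_model=1,
)
print(response)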
MIT License. See LICENSE for details.
Contributions are welcome! Please submit a pull request or open an issue in the GitHub repository.
Developed and maintained by Rahul Anand.