🚀 MODE Documentation

Overview

MODE (Mixture of Document Experts) is a novel framework designed to enhance Retrieval-Augmented Generation (RAG) by organizing documents into semantically coherent clusters and utilizing centroid-based retrieval. Unlike traditional RAG pipelines that rely on large vector databases and re-rankers, MODE offers a scalable, interpretable, and efficient alternative, particularly suited for specialized or small to medium-sized datasets. It provides two primary classes: ModeIngestion for clustering and data preparation, and ModeInference for efficient semantic search and response generation.
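To make the retrieval idea concrete, the sketch below shows one way centroid-based retrieval can be pictured: the query embedding is compared against each cluster centroid by cosine similarity, and the chunks of the best-matching clusters become the context. This is an illustrative sketch only, not the mode_rag implementation; `retrieve_by_centroid`, `centroids`, and `cluster_chunks` are placeholder names.

import numpy as np
from typing import Dict, List

def retrieve_by_centroid(
    query_embedding: np.ndarray,
    centroids: Dict[int, np.ndarray],
    cluster_chunks: Dict[int, List[str]],
    top_clusters: int = 2,
) -> List[str]:
    # Illustrative sketch, not mode_rag internals: score every cluster centroid
    # against the query and return the chunks of the top-ranked clusters.
    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    ranked = sorted(centroids, key=lambda cid: cosine(query_embedding, centroids[cid]), reverse=True)
    context: List[str] = []
    for cluster_id in ranked[:top_clusters]:
        context.extend(cluster_chunks[cluster_id])
    return context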

Installation

pip install mode_rag

Modules

ModeIngestion

Description: Clusters text data using HDBSCAN and identifies centroids (central points) within each cluster. Results are persisted for inference.
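The clustering step can be pictured as follows: chunk embeddings are grouped with HDBSCAN, and each cluster's centroid is the mean of its member embeddings. The snippet below is a minimal sketch using the hdbscan and numpy packages directly; it is not the ModeIngestion implementation, and min_cluster_size is an arbitrary illustrative value.

import numpy as np
import hdbscan  # pip install hdbscan

def cluster_with_centroids(embeddings: np.ndarray):
    # Minimal sketch of the idea: HDBSCAN assigns a cluster label to each
    # embedding (-1 means noise), and each centroid is the mean embedding
    # of its cluster.
    labels = hdbscan.HDBSCAN(min_cluster_size=5).fit_predict(embeddings)
    centroids = {}
    for label in set(labels):
        if label == -1:
            continue
        centroids[int(label)] = embeddings[labels == label].mean(axis=0)
    return labels, centroids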

Class: ModeIngestion

mode_rag.ModeIngestion(chunks, embedding, persist_directory)
Parameters

- chunks (List[str]): the text chunks to cluster.
- embedding: embeddings aligned with chunks, e.g. produced by EmbeddingGenerator or your own embedding model.
- persist_directory (str): directory where the clustering results are saved for later inference.

Method: process_data

process_data(parallel: bool = True) -> Tuple[Dict[int, List[str]], Dict[int, np.ndarray]]
Parameters

- parallel (bool, default True): whether to run the processing in parallel.

Returns

- Tuple[Dict[int, List[str]], Dict[int, np.ndarray]]: two dictionaries keyed by cluster ID, mapping each cluster to its text chunks and to its centroid embedding.
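
For reference, the return value can be unpacked and inspected directly; `main_processor` here refers to the ModeIngestion instance built in the example below, and the loop is only illustrative:

clusters, centroids = main_processor.process_data(parallel=False)
for cluster_id, cluster_texts in clusters.items():
    print(cluster_id, len(cluster_texts), centroids[cluster_id].shape)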

Example


# ========================================
# 📄 Sample Code: Ingestion
# ========================================
#
# 1. Load a PDF using PyPDFLoader.
# 2. Chunk it using `RecursiveCharacterTextSplitter`.
# 3. Embed the chunks using langchain_huggingface via `EmbeddingGenerator`.
#
# The chunking and embedding steps are interchangeable; you can use your
# **own chunking/embedding** logic. The main inputs to `ModeIngestion`
# are `chunks` and `embeddings`.


## Requirements
# pip install langchain_huggingface==0.1.2
# pip install langchain_community==0.3.4
# pip install pypdf==5.1.0
import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"

from mode_rag import ModeIngestion, EmbeddingGenerator

## PDF loader
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("https://arxiv.org/pdf/1706.03762")
docs = loader.load()

print("Downloaded the PDF")

from langchain.text_splitter import RecursiveCharacterTextSplitter

print("Chunking the pdf:doc")
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
documents = text_splitter.split_documents(docs)
chunks = []
for doc in documents:
    chunks.append(doc.page_content)

print("doing embedding")
embed_gen = EmbeddingGenerator()
embeddings = embed_gen.generate_embeddings(chunks)
print("embedding done")
# Cluster the chunks and persist the results to the "attention" directory
main_processor = ModeIngestion(
    chunks=chunks,
    embedding=embeddings,
    persist_directory="attention",
)
main_processor.process_data(parallel=False)



ModeInference

Description: ModeInference performs fast and efficient semantic search on pre-clustered text embeddings. It matches a query to the most relevant clusters using their centroids, then retrieves the matching chunks as context and synthesizes a response.
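The prompt flow implied by ModelPrompt can be pictured as two stages: a reference stage, where the retrieved context is combined with the ref_* prompts to obtain one answer per model, and a synthesis stage, where those answers are combined with the syn_* prompts to produce the final response. The sketch below is illustrative only; `build_reference_messages` and `build_synthesis_messages` are hypothetical helpers and do not reflect mode_rag internals.

from typing import Dict, List

def build_reference_messages(ref_sys_prompt: str, ref_usr_prompt: str,
                             context_chunks: List[str], query: str) -> List[Dict[str, str]]:
    # Stage 1 (illustrative): ask a model to answer the query from the retrieved context.
    context = "\n\n".join(context_chunks)
    return [
        {"role": "system", "content": ref_sys_prompt},
        {"role": "user", "content": f"{ref_usr_prompt}{context}\n\nquestion: {query}"},
    ]

def build_synthesis_messages(syn_sys_prompt: str, syn_usr_prompt: str,
                             model_answers: List[str], query: str) -> List[Dict[str, str]]:
    # Stage 2 (illustrative): merge the individual answers into one final response.
    answers = "\n\n".join(model_answers)
    return [
        {"role": "system", "content": syn_sys_prompt},
        {"role": "user", "content": f"{syn_usr_prompt}\n{answers}\n\nquestion: {query}"},
    ]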

Class: ModeInference

mode_rag.ModeInference(persist_directory)
Parameters

- persist_directory (str): directory containing the clustered data produced by ModeIngestion.

Method: invoke

invoke(query: str, query_embedding: torch.Tensor, prompt: ModelPrompt, model_input: dict = {}, parallel: bool = True, top_n_model: int = 1) -> str
Parameters

- query (str): the user question.
- query_embedding (torch.Tensor): embedding of the query, e.g. from EmbeddingGenerator.generate_embedding.
- prompt (ModelPrompt): the reference and synthesis prompts to use.
- model_input (dict, optional): settings forwarded to the model call, e.g. {"temperature": 0.3, "model": "openai/gpt-4o-mini"}.
- parallel (bool, default True): whether to run processing in parallel.
- top_n_model (int, default 1): number of model responses to generate and then synthesize into the final answer.

Returns

- The synthesized response as a string.

Example



# ========================================
# 📄 Sample Code: Inference
# ========================================
#
# 1. Load the clustered data with `ModeInference`.
# 2. Generate the query embedding (replaceable with your own embedding logic).
# 3. Retrieve context and synthesize a response with `ModelPrompt`.

import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"

from mode_rag import (
    EmbeddingGenerator,
    ModeInference,
    ModelPrompt,
)


# Load the clustered data persisted by ModeIngestion
main_processor = ModeInference(
    persist_directory="attention",
)

print("====start====")

# Define the query and generate its embedding
query = "What are the key mathematical operations involved in computing self-attention?"

embed_gen = EmbeddingGenerator()
embedding = embed_gen.generate_embedding(query)

# Create a ModelPrompt instance with the reference and synthesis prompts
prompts = ModelPrompt(
    ref_sys_prompt="Use the following pieces of context to answer the user's question. \nIf you don't know the answer, just say that you don't know.",
    ref_usr_prompt="context: ",
    syn_sys_prompt="You have been provided with a set of responses from various models to the latest user query. Your task is to synthesize these responses into a single, high-quality response. It is crucial to critically evaluate the information provided in these responses, recognizing that some of it may be biased or incorrect. Your response should not simply replicate the given answers but should offer a refined, accurate, and comprehensive reply to the instruction. Ensure your response is well-structured, coherent, and adheres to the highest standards of accuracy and reliability.\nResponses from models:",
    syn_usr_prompt="responses:",
)

# Retrieve context from the matched clusters and synthesize the final answer
response = main_processor.invoke(
    query,
    embedding,
    prompts,
    model_input={"temperature": 0.3, "model": "openai/gpt-4o-mini"},
    top_n_model=2,
)
print(response)

License

MIT License. See LICENSE for details.

Contributing

Contributions are welcome! Please submit a pull request or open an issue in the GitHub repository.

Author

Developed and maintained by Rahul Anand.