Alain Airom
Bob Strikes Again: ‘PageIndex’ Test and Implementation

A test on ‘PageIndex’ using IBM Bob

Introduction

Once again I spotted an intriguing GitHub repository, PageIndex, and decided to build a test application with IBM Bob, specifically leveraging the power of local LLMs (such as Granite through Ollama) to see if I could bring local intelligence to document processing.

I wanted to bridge the gap between high-end document indexing and the privacy of local execution. The result is the PageIndex Testing Application, a Streamlit-based powerhouse designed to put PageIndex through its paces without ever sending a prompt to the cloud.

🏗️ The Vision: Local RAG, Visual Context, and Beyond

The goal was simple but ambitious: create a “laboratory” where anyone could upload a complex PDF and watch as a local LLM dissects it. By integrating Ollama as the backbone, this application bypasses traditional API costs while offering three distinct ways to interact with data:

  • 💬 Chat Quickstart: A streamlined Q&A interface using the PageIndex Chat API for rapid-fire insights.
  • 🔍 Simple RAG: A deeper dive that uses reasoning-based retrieval. It pulls the entire hierarchical “tree” structure of a document, allowing the Granite model to navigate nodes and find the precise context needed for an answer.
  • 👁️ Vision RAG: Perhaps the most exciting mode: it performs visual document analysis without traditional OCR. By extracting PDF pages as images, it allows vision-capable models like Llava to “see” charts, diagrams, and layouts directly.

🛠️ Built for Developers and Testers

Under the hood, I’ve organized the project to be as modular as possible. From a custom Ollama Client that handles streaming and vision tasks to automated scripts that manage the environment, the architecture ensures that the transition from a GitHub repo to a working local demo takes less than five minutes.


What is PageIndex? Excerpt from GitHub repository ⤵️

PageIndex: Vectorless, Reasoning-based RAG
Are you frustrated with vector database retrieval accuracy for long professional documents? Traditional vector-based RAG relies on semantic similarity rather than true relevance. But similarity ≠ relevance — what we truly need in retrieval is relevance, and that requires reasoning. When working with professional documents that demand domain expertise and multi-step reasoning, similarity search often falls short.
Inspired by AlphaGo, we propose PageIndex — a vectorless, reasoning-based RAG system that builds a hierarchical tree index from long documents and uses LLMs to reason over that index for agentic, context-aware retrieval. It simulates how human experts navigate and extract knowledge from complex documents through tree search, enabling LLMs to think and reason their way to the most relevant document sections. PageIndex performs retrieval in two steps:
Generate a “Table-of-Contents” tree structure index of documents
Perform reasoning-based retrieval through tree search
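To make the two steps concrete, here is a hedged sketch of what a single node in such a “Table-of-Contents” tree might look like. The field names (node_id, title, page_index, summary, children) follow those used later in this post; the exact schema is defined by the PageIndex API, and the values here are hypothetical.

```python
# Hypothetical example of one node in a PageIndex "Table-of-Contents" tree.
# Field names mirror those referenced elsewhere in this post; the real
# schema is defined by the PageIndex service.
toc_node = {
    "node_id": "0006",
    "title": "3. Model Architecture",
    "page_index": 3,
    "summary": "Describes the encoder-decoder structure ...",
    "children": [
        {
            "node_id": "0007",
            "title": "3.1 Encoder and Decoder Stacks",
            "page_index": 3,
            "summary": "Stacked self-attention and feed-forward layers ...",
            "children": [],
        }
    ],
}

def count_nodes(node: dict) -> int:
    """Walk the tree depth-first and count nodes (step 2's search space)."""
    return 1 + sum(count_nodes(child) for child in node.get("children", []))
```

Step 2 is then an LLM reasoning over exactly this kind of structure instead of over embedding vectors.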

🎯 Core Features

Compared to traditional vector-based RAG, PageIndex features:
No Vector DB: Uses document structure and LLM reasoning for retrieval, instead of vector similarity search.
No Chunking: Documents are organized into natural sections, not artificial chunks.
Human-like Retrieval: Simulates how human experts navigate and extract knowledge from complex documents.
Better Explainability and Traceability: Retrieval is based on reasoning — traceable and interpretable, with page and section references. No more opaque, approximate vector search (“vibe retrieval”).
PageIndex powers a reasoning-based RAG system that achieved state-of-the-art 98.7% accuracy on FinanceBench, demonstrating superior performance over vector-based RAG solutions in professional document analysis.


🛠️ From Concept to Code: Building the PageIndex Local Laboratory

The application is a comprehensive Streamlit-based UI designed to bridge the gap between advanced document indexing and local privacy by integrating the PageIndex SDK with a local Ollama instance. It serves as a testing ground for three distinct document interaction strategies, replacing cloud-based LLMs with the granite model to ensure cost-effective, private data processing.

Core Functionalities and Demo Modes

The synthesis of the application reveals a highly modular architecture focused on three primary use cases:

  • 💬 Chat Quickstart: Utilizes the PageIndex Chat API directly to allow users to ask rapid-fire questions and receive streaming responses from processed PDFs.
  • 🔍 Simple RAG: Implements a reasoning-based retrieval system. It retrieves the document’s hierarchical tree structure, which the local model then searches to find the most relevant nodes before generating a context-aware answer.
  • 👁️ Vision RAG: Enables visual document analysis without traditional OCR. By extracting PDF pages as images, it allows vision-capable models like Llava to interpret charts, diagrams, and complex layouts.

Rapid Setup and Integration

To go from a GitHub clone to a functional application, the project utilizes a streamlined “Quick Start” workflow:

  1. Environment Configuration: Sensitive credentials like the PageIndex API Key are managed through a .env file to prevent accidental exposure.
  2. Model Orchestration: Users pull required models (e.g., granite and granite-vision/llama-vision/qwen…) via Ollama to serve as the local intelligence backbone.
  3. Automated Launch: A dedicated launch.sh script automates virtual environment creation, dependency installation, and application startup, making the lab accessible at http://localhost:8501 within minutes.
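As an illustration, a minimal `.env` might look like the fragment below. The variable names are taken from the application’s configuration code; the API key value is a placeholder.

```ini
# .env — local configuration (do not commit this file)
PAGEINDEX_API_KEY=your_pageindex_api_key_here
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=granite3-dense:8b
```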

Technical Stack Summary

The application leverages a robust set of tools to achieve its goals:

  • Frontend: Streamlit for a clean, interactive, and Python-native UI.
  • Intelligence: Ollama integration for local, OpenAI-compatible LLM operations.
  • Document Processing: PageIndex SDK for sophisticated document submission, status tracking, and tree-structure retrieval.
  • Utilities: PyMuPDF for page-to-image extraction and a custom OllamaClient for handling async chat completions and vision tasks.

🔍 Deep Dive: How the Simple RAG Reasoning Process Works

The Simple RAG (Retrieval-Augmented Generation) demo is the “brain” of the application, moving beyond basic keyword matching to a sophisticated, tree-based reasoning search. Instead of just searching for text fragments, it uses the structural intelligence of PageIndex to understand how a document is organized.

Here is the step-by-step breakdown of the logic Bob implemented:

  • Tree Retrieval: The application first calls client.get_tree(doc_id) to pull the hierarchical PageIndex tree, where each node contains a title, page index, and summary.
  • Metadata Mapping: It uses utils.create_node_mapping() to build a fast-lookup dictionary of every node in the document.
  • Search Optimization: To save on context window space and speed up the local LLM, the app uses utils.remove_fields() to strip out heavy raw text, leaving only titles and summaries for the "reasoning" phase.
  • The LLM Search Prompt: Bob sends this slimmed-down tree to granite with a specific instruction: “Find nodes relevant to this query”.
  • Intelligent Extraction: Once the LLM identifies the relevant node_id list, the application reaches back into the full map to extract the actual high-density text content.
  • Contextual Synthesis: Finally, the query and the retrieved content are bundled into a final prompt, allowing the local LLM to generate an answer backed by verifiable document evidence.
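The steps above can be sketched in pure Python. This is a simplified illustration, not the application’s actual module: the `tree` dict is a hand-written stand-in for what `client.get_tree(doc_id)` returns, and `fake_llm_search` stands in for the call to granite.

```python
import json

# Stand-in for the tree returned by client.get_tree(doc_id) (hypothetical data).
tree = {
    "node_id": "0001", "title": "Report", "summary": "Annual report.",
    "text": "FULL RAW TEXT ...",
    "children": [
        {"node_id": "0002", "title": "Revenue", "summary": "Revenue by segment.",
         "text": "Revenue grew 12% year over year, driven by cloud.", "children": []},
        {"node_id": "0003", "title": "Risks", "summary": "Risk factors.",
         "text": "Currency exposure remains the main risk.", "children": []},
    ],
}

def build_node_map(node, mapping=None):
    """Flatten the tree into {node_id: node} (the role of create_node_mapping)."""
    mapping = {} if mapping is None else mapping
    mapping[node["node_id"]] = node
    for child in node.get("children", []):
        build_node_map(child, mapping)
    return mapping

def strip_text(node):
    """Drop heavy 'text' fields before the reasoning phase (like remove_fields)."""
    slim = {k: v for k, v in node.items() if k != "text"}
    slim["children"] = [strip_text(c) for c in node.get("children", [])]
    return slim

def fake_llm_search(query, slim_tree):
    """Stand-in for the granite tree-search call: returns relevant node ids."""
    return json.dumps({"thinking": "Revenue question -> Revenue node",
                       "node_list": ["0002"]})

node_map = build_node_map(tree)
slim = strip_text(tree)
result = json.loads(fake_llm_search("How did revenue change?", slim))
context = "\n".join(node_map[nid]["text"] for nid in result["node_list"])
# `context` now holds the high-density text bundled into the final prompt.
```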
# app.py
"""
PageIndex Testing Application
A Streamlit UI for testing PageIndex samples with local Ollama
"""

import streamlit as st
import os
import sys
from pathlib import Path
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Add modules to path
sys.path.append(str(Path(__file__).parent))

from modules.chat_quickstart import ChatQuickstartDemo
from modules.simple_rag import SimpleRAGDemo
from modules.vision_rag import VisionRAGDemo

# Page configuration
st.set_page_config(
    page_title="PageIndex Testing Application",
    page_icon="📚",
    layout="wide",
    initial_sidebar_state="expanded"
)

# Custom CSS
st.markdown("""
<style>
    .main-header {
        font-size: 2.5rem;
        font-weight: bold;
        color: #1f77b4;
        text-align: center;
        margin-bottom: 1rem;
    }
    .sub-header {
        font-size: 1.2rem;
        color: #666;
        text-align: center;
        margin-bottom: 2rem;
    }
    .demo-card {
        padding: 1.5rem;
        border-radius: 0.5rem;
        background-color: #f8f9fa;
        margin-bottom: 1rem;
    }
    .success-box {
        padding: 1rem;
        border-radius: 0.5rem;
        background-color: #d4edda;
        border: 1px solid #c3e6cb;
        color: #155724;
    }
    .error-box {
        padding: 1rem;
        border-radius: 0.5rem;
        background-color: #f8d7da;
        border: 1px solid #f5c6cb;
        color: #721c24;
    }
    .info-box {
        padding: 1rem;
        border-radius: 0.5rem;
        background-color: #d1ecf1;
        border: 1px solid #bee5eb;
        color: #0c5460;
    }
</style>
""", unsafe_allow_html=True)

def check_configuration():
    """Check if required configuration is present"""
    api_key = os.getenv("PAGEINDEX_API_KEY")
    ollama_url = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
    ollama_model = os.getenv("OLLAMA_MODEL", "granite3-dense:8b")

    issues = []
    if not api_key or api_key == "your_pageindex_api_key_here":
        issues.append("⚠️ PageIndex API key not configured")

    return {
        "api_key": api_key,
        "ollama_url": ollama_url,
        "ollama_model": ollama_model,
        "issues": issues
    }

def main():
    # Header
    st.markdown('<div class="main-header">📚 PageIndex Testing Application</div>', unsafe_allow_html=True)
    st.markdown('<div class="sub-header">Test PageIndex samples with local Ollama integration</div>', unsafe_allow_html=True)

    # Check configuration
    config = check_configuration()

    # Sidebar
    with st.sidebar:
        st.image("https://pageindex.ai/static/images/pageindex_banner.jpg", use_container_width=True)
        st.markdown("---")

        st.subheader("⚙️ Configuration")

        if config["issues"]:
            for issue in config["issues"]:
                st.warning(issue)
            st.info("💡 Copy `.env.example` to `.env` and configure your API key")
        else:
            st.success("✅ Configuration loaded")

        st.markdown("---")
        st.subheader("🤖 Ollama Model Selection")

        # Get available models
        from modules.ollama_client import OllamaClient
        ollama_client = OllamaClient(config['ollama_url'], config['ollama_model'])

        if ollama_client.check_connection():
            available_models = ollama_client.list_models()

            if available_models:
                # Model selector
                selected_model = st.selectbox(
                    "Select Model:",
                    available_models,
                    index=available_models.index(config['ollama_model']) if config['ollama_model'] in available_models else 0,
                    help="Choose from your installed Ollama models"
                )

                # Update session state with selected model
                st.session_state['selected_model'] = selected_model

                # Show model info
                st.info(f"🎯 Using: {selected_model}")

                # Detect vision models
                vision_keywords = ['vision', 'llava', 'bakllava', 'qwen', 'minicpm']
                is_vision = any(keyword in selected_model.lower() for keyword in vision_keywords)

                if is_vision:
                    st.success("👁️ Vision-capable model detected")
                else:
                    st.warning("⚠️ Text-only model (Vision RAG may not work)")
            else:
                st.error("No models found. Run: `ollama pull <model>`")
        else:
            st.error("Cannot connect to Ollama")

        st.markdown("---")
        st.subheader("📊 System Info")
        st.text(f"Ollama URL: {config['ollama_url']}")

        st.markdown("---")
        st.subheader("📖 About")
        st.markdown("""
        This application demonstrates three PageIndex samples:

        1. **Chat Quickstart**: Simple document Q&A
        2. **Simple RAG**: Reasoning-based retrieval
        3. **Vision RAG**: Vision-based document analysis

        All samples use local Ollama with granite3-dense model instead of OpenAI.
        """)

        st.markdown("---")
        st.markdown("🔗 [PageIndex Docs](https://docs.pageindex.ai)")
        st.markdown("🔗 [GitHub Repo](https://github.com/VectifyAI/PageIndex)")

    # Main content
    tab1, tab2, tab3 = st.tabs([
        "💬 Chat Quickstart",
        "🔍 Simple RAG",
        "👁️ Vision RAG"
    ])

    with tab1:
        st.markdown("### 💬 Chat Quickstart Demo")
        st.markdown("""
        This demo shows how to:
        - Upload a document to PageIndex
        - Check processing status
        - Ask questions about the document
        """)

        if config["issues"]:
            st.error("⚠️ Please configure your API key in `.env` file to use this demo")
        else:
            selected_model = st.session_state.get('selected_model', config["ollama_model"])
            demo1 = ChatQuickstartDemo(config["api_key"], config["ollama_url"], selected_model)
            demo1.render()

    with tab2:
        st.markdown("### 🔍 Simple RAG Demo")
        st.markdown("""
        This demo demonstrates:
        - Building a PageIndex tree structure
        - Reasoning-based retrieval with tree search
        - Answer generation from retrieved context
        """)

        if config["issues"]:
            st.error("⚠️ Please configure your API key in `.env` file to use this demo")
        else:
            selected_model = st.session_state.get('selected_model', config["ollama_model"])
            demo2 = SimpleRAGDemo(config["api_key"], config["ollama_url"], selected_model)
            demo2.render()

    with tab3:
        st.markdown("### 👁️ Vision RAG Demo")
        st.markdown("""
        This demo showcases:
        - Vision-based document analysis
        - PDF page image extraction
        - Visual context reasoning without OCR
        """)

        if config["issues"]:
            st.error("⚠️ Please configure your API key in `.env` file to use this demo")
        else:
            selected_model = st.session_state.get('selected_model', config["ollama_model"])

            # Check if model is vision-capable
            vision_keywords = ['vision', 'llava', 'bakllava', 'qwen', 'minicpm']
            is_vision = any(keyword in selected_model.lower() for keyword in vision_keywords)

            if not is_vision:
                st.warning(f"⚠️ Model '{selected_model}' may not support vision. Consider using: llava, qwen-vl, or llama3.2-vision")

            demo3 = VisionRAGDemo(config["api_key"], config["ollama_url"], selected_model)
            demo3.render()

if __name__ == "__main__":
    main()

# Made with Bob

📚 API Reference Highlights

If you’re looking to extend Bob’s work, the PageIndexClient provides several core methods to manage this pipeline:

| Method                 | Purpose                                        | Key Parameter         |
| ---------------------- | ---------------------------------------------- | --------------------- |
| `submit_document()`    | Uploads a PDF for indexing.                    | `file_path` (str)     |
| `get_document()`       | Checks if the document status is "completed".  | `doc_id` (str)        |
| `is_retrieval_ready()` | Returns `True` when the tree is ready for RAG. | `doc_id` (str)        |
| `get_tree()`           | Returns the hierarchical JSON structure.       | `node_summary` (bool) |
| `chat_completions()`   | Streams a response directly from PageIndex.    | `messages` (list)     |


👁️ Deep Dive: Vision RAG & Multimodal Intelligence

While the Simple RAG process relies on text hierarchies, the Vision RAG demo is designed for documents where the “story” is told through visuals — think architecture blueprints, financial charts, or complex technical diagrams. Bob implemented this to prove that local AI can “see” without the need for expensive, error-prone OCR (Optical Character Recognition).

The Visual Workflow

  • Automatic Image Extraction: As soon as a PDF is uploaded, the application uses PyMuPDF (fitz) to convert every page into a high-resolution image.
  • Visual Tree Mapping: The PageIndex tree is retrieved, but this time Bob’s code maps each structural node (like “Quarterly Revenue Chart”) to its specific page number.
  • Vision-Based Search: When you ask a question like “Describe the trend in this bar chart,” the local model (e.g., Llava) performs a “vision search” across the tree to identify which page contains the relevant visual element.
  • Base64 Transmission: The identified page images are encoded into Base64 strings and sent directly to the local Ollama vision endpoint.
  • Visual Synthesis: The model analyzes the actual pixels of the chart or diagram to generate an answer that text-only RAG would likely miss.
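The Base64 step in the workflow above can be illustrated in a few lines. This mirrors what the custom OllamaClient does conceptually (Ollama’s chat API accepts Base64-encoded images in a message’s `images` field); the helper names here are my own, not the module’s actual functions.

```python
import base64

def encode_image_for_ollama(image_path: str) -> str:
    """Read an image file and return the Base64 string that Ollama's
    vision endpoint expects in a chat message's 'images' field."""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def build_vision_message(prompt: str, image_paths: list) -> dict:
    """Assemble an Ollama-style chat message carrying the page images."""
    return {
        "role": "user",
        "content": prompt,
        "images": [encode_image_for_ollama(p) for p in image_paths],
    }
```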
# vision_rag.py
"""
Vision RAG Demo Module
Based on vision_RAG_pageindex.ipynb
"""

import streamlit as st
import json
import asyncio
import fitz  # PyMuPDF
import base64
from pathlib import Path
from pageindex import PageIndexClient
import pageindex.utils as utils
from .ollama_client import OllamaClient


class VisionRAGDemo:
    """Demo for Vision-based Vectorless RAG with PageIndex"""

    def __init__(self, api_key: str, ollama_url: str, ollama_model: str):
        """
        Initialize Vision RAG Demo

        Args:
            api_key: PageIndex API key
            ollama_url: Ollama base URL
            ollama_model: Ollama model name (should support vision)
        """
        self.api_key = api_key
        self.ollama_url = ollama_url
        self.ollama_model = ollama_model
        self.pi_client = PageIndexClient(api_key=api_key)
        self.ollama_client = OllamaClient(base_url=ollama_url, model=ollama_model)

    def extract_pdf_page_images(self, pdf_path: str, output_dir: str = "pdf_images") -> tuple:
        """Extract page images from PDF"""
        Path(output_dir).mkdir(exist_ok=True)
        pdf_document = fitz.open(pdf_path)
        page_images = {}
        total_pages = len(pdf_document)

        for page_number in range(len(pdf_document)):
            page = pdf_document.load_page(page_number)
            mat = fitz.Matrix(2.0, 2.0)  # 2x zoom for better quality
            pix = page.get_pixmap(matrix=mat)
            img_data = pix.tobytes("jpeg")
            image_path = Path(output_dir) / f"page_{page_number + 1}.jpg"

            with open(image_path, "wb") as image_file:
                image_file.write(img_data)

            page_images[page_number + 1] = str(image_path)

        pdf_document.close()
        return page_images, total_pages

    def get_page_images_for_nodes(self, node_list: list, node_map: dict, page_images: dict) -> list:
        """Get PDF page images for retrieved nodes"""
        image_paths = []
        seen_pages = set()

        for node_id in node_list:
            if node_id not in node_map:
                continue

            node_info = node_map[node_id]
            start_page = node_info.get('start_index', node_info.get('page_index', 1))
            end_page = node_info.get('end_index', start_page)

            for page_num in range(start_page, end_page + 1):
                if page_num not in seen_pages and page_num in page_images:
                    image_paths.append(page_images[page_num])
                    seen_pages.add(page_num)

        return image_paths

    async def call_vlm(self, prompt: str, image_paths: list = None) -> str:
        """Call Vision Language Model using Ollama"""
        return self.ollama_client.chat_with_vision(prompt, image_paths)

    def render(self):
        """Render the demo UI"""
        st.markdown("---")

        # Check Ollama connection
        if not self.ollama_client.check_connection():
            st.error("⚠️ Cannot connect to Ollama. Please ensure Ollama is running.")
            st.info(f"Expected URL: {self.ollama_url}")
            return

        st.success(f"✅ Connected to Ollama ({self.ollama_model})")
        st.warning("⚠️ Note: Vision RAG requires a vision-capable model like llava or bakllava")

        # Step 1: Document Upload
        st.subheader("📄 Step 1: Upload Document")

        col1, col2 = st.columns([2, 1])

        with col1:
            uploaded_file = st.file_uploader(
                "Choose a PDF file",
                type=['pdf'],
                key="vision_rag_upload",
                help="Upload a PDF document for vision-based RAG analysis"
            )

        with col2:
            use_sample = st.checkbox("Use sample document", value=False, key="vision_rag_sample")
            if use_sample:
                st.info("Using 'Attention Is All You Need' paper")

        if st.button("📤 Submit Document", type="primary", key="vision_rag_submit", disabled=not (uploaded_file or use_sample)):
            with st.spinner("Processing document..."):
                try:
                    if use_sample:
                        # Use sample document from input folder
                        sample_path = Path("input/1706.03762v7.pdf")
                        if sample_path.exists():
                            pdf_path = str(sample_path)
                        else:
                            st.error("Sample document not found in input folder")
                            return
                    else:
                        # Save uploaded file
                        temp_path = Path("temp") / uploaded_file.name
                        temp_path.parent.mkdir(exist_ok=True)

                        with open(temp_path, "wb") as f:
                            f.write(uploaded_file.getbuffer())

                        pdf_path = str(temp_path)

                    # Extract page images
                    st.info("Extracting page images from PDF...")
                    page_images, total_pages = self.extract_pdf_page_images(pdf_path)
                    st.success(f"✅ Extracted {len(page_images)} page images from {total_pages} pages")

                    # Submit to PageIndex
                    st.info("Submitting document to PageIndex...")
                    doc_id = self.pi_client.submit_document(pdf_path)["doc_id"]

                    st.session_state['vision_doc_id'] = doc_id
                    st.session_state['vision_page_images'] = page_images
                    st.session_state['vision_total_pages'] = total_pages
                    st.session_state['vision_pdf_path'] = pdf_path

                    st.success(f"✅ Document submitted! Document ID: `{doc_id}`")

                except Exception as e:
                    st.error(f"❌ Error processing document: {str(e)}")

        # Step 2: Get Tree Structure
        if 'vision_doc_id' in st.session_state:
            st.markdown("---")
            st.subheader("🌳 Step 2: Get PageIndex Tree Structure")

            doc_id = st.session_state['vision_doc_id']

            if st.button("🔄 Get Tree Structure", key="vision_get_tree"):
                with st.spinner("Fetching tree structure..."):
                    try:
                        if self.pi_client.is_retrieval_ready(doc_id):
                            tree = self.pi_client.get_tree(doc_id, node_summary=True)['result']
                            st.session_state['vision_tree'] = tree

                            st.success("✅ Tree structure retrieved!")

                            # Display simplified tree
                            with st.expander("📋 View Tree Structure"):
                                tree_str = self._format_tree(tree)
                                st.code(tree_str, language="text")
                        else:
                            st.warning("⏳ Document is still processing. Please wait and try again.")

                    except Exception as e:
                        st.error(f"❌ Error getting tree: {str(e)}")

        # Step 3: Vision-Based Retrieval
        if 'vision_tree' in st.session_state:
            st.markdown("---")
            st.subheader("👁️ Step 3: Vision-Based Retrieval")

            # Sample queries
            sample_queries = [
                "What is the last operation in the Scaled Dot-Product Attention figure?",
                "Describe the architecture diagram in this paper.",
                "What are the key visual elements in the methodology section?"
            ]

            selected_query = st.selectbox(
                "Choose a sample query or write your own:",
                ["Custom query"] + sample_queries,
                key="vision_query_select"
            )

            if selected_query == "Custom query":
                query = st.text_area(
                    "Enter your query:",
                    height=80,
                    key="vision_custom_query",
                    placeholder="Type your query here..."
                )
            else:
                query = st.text_area(
                    "Enter your query:",
                    value=selected_query,
                    height=80,
                    key="vision_query"
                )

            if st.button("🔍 Perform Vision Search", type="primary", key="vision_search", disabled=not query):
                with st.spinner("Performing vision-based retrieval..."):
                    try:
                        tree = st.session_state['vision_tree']

                        # Remove text field for tree search
                        tree_without_text = utils.remove_fields(tree.copy(), fields=['text'])

                        # Create search prompt
                        search_prompt = f"""
You are given a question and a tree structure of a document.
Each node contains a node id, node title, and a corresponding summary.
Your task is to find all tree nodes that are likely to contain the answer to the question.

Question: {query}

Document tree structure:
{json.dumps(tree_without_text, indent=2)}

Please reply in the following JSON format:
{{
    "thinking": "<Your thinking process on which nodes are relevant to the question>",
    "node_list": ["node_id_1", "node_id_2", ..., "node_id_n"]
}}

Directly return the final JSON structure. Do not output anything else.
"""

                        # Call VLM for tree search
                        tree_search_result = asyncio.run(self.call_vlm(search_prompt))

                        # Parse result
                        try:
                            result_json = json.loads(tree_search_result)
                            st.session_state['vision_search_result'] = result_json
                            st.session_state['vision_current_query'] = query

                            # Display reasoning
                            st.markdown("**🧠 Reasoning Process:**")
                            st.info(result_json['thinking'])

                            # Display retrieved nodes
                            st.markdown("**📑 Retrieved Nodes:**")
                            total_pages = st.session_state['vision_total_pages']
                            node_map = utils.create_node_mapping(tree, include_page_ranges=True, max_page=total_pages)

                            for node_id in result_json["node_list"]:
                                if node_id in node_map:
                                    node_info = node_map[node_id]
                                    node = node_info['node']
                                    start_page = node_info.get('start_index', node.get('page_index', 1))
                                    end_page = node_info.get('end_index', start_page)
                                    page_range = start_page if start_page == end_page else f"{start_page}-{end_page}"

                                    st.markdown(f"- **Node ID:** `{node['node_id']}` | **Pages:** {page_range} | **Title:** {node['title']}")

                            # Get page images for retrieved nodes
                            page_images = st.session_state['vision_page_images']
                            retrieved_images = self.get_page_images_for_nodes(
                                result_json["node_list"],
                                node_map,
                                page_images
                            )

                            st.session_state['vision_retrieved_images'] = retrieved_images
                            st.success(f"✅ Retrieved {len(retrieved_images)} PDF page image(s) for visual context")

                        except json.JSONDecodeError:
                            st.error("❌ Failed to parse VLM response as JSON")
                            st.code(tree_search_result)

                    except Exception as e:
                        st.error(f"❌ Error during vision search: {str(e)}")

        # Step 4: Answer Generation with Visual Context
        if 'vision_search_result' in st.session_state and 'vision_retrieved_images' in st.session_state:
            st.markdown("---")
            st.subheader("💡 Step 4: Generate Answer with Visual Context")

            # Display retrieved images
            retrieved_images = st.session_state['vision_retrieved_images']

            with st.expander(f"🖼️ View Retrieved Page Images ({len(retrieved_images)} pages)"):
                if len(retrieved_images) > 0:
                    cols = st.columns(min(3, len(retrieved_images)))
                    for idx, img_path in enumerate(retrieved_images[:6]):  # Show max 6 images
                        with cols[idx % 3]:
                            st.image(img_path, caption=f"Page {Path(img_path).stem.split('_')[1]}", use_container_width=True)
                else:
                    st.info("No page images were retrieved. This might happen if the search didn't find relevant pages.")

            if st.button("✨ Generate Answer with Vision", type="primary", key="vision_generate_answer"):
                with st.spinner("Generating answer using visual context..."):
                    try:
                        query = st.session_state['vision_current_query']
                        retrieved_images = st.session_state['vision_retrieved_images']

                        # Generate answer using VLM with visual context
                        answer_prompt = f"""
Answer the question based on the images of the document pages as context.

Question: {query}

Provide a clear, concise answer based only on the context provided in the images.
"""

                        answer = asyncio.run(self.call_vlm(answer_prompt, retrieved_images))

                        st.markdown("**📝 Generated Answer (Vision-based):**")
                        st.success(answer)

                        # Store in history
                        if 'vision_history' not in st.session_state:
                            st.session_state['vision_history'] = []

                        st.session_state['vision_history'].append({
                            'query': query,
                            'reasoning': st.session_state['vision_search_result']['thinking'],
                            'nodes': st.session_state['vision_search_result']['node_list'],
                            'num_images': len(retrieved_images),
                            'answer': answer
                        })

                    except Exception as e:
                        st.error(f"❌ Error generating answer: {str(e)}")

            # Display history
            if 'vision_history' in st.session_state and st.session_state['vision_history']:
                st.markdown("---")
                st.subheader("📜 Vision RAG History")

                for i, item in enumerate(reversed(st.session_state['vision_history'])):
                    with st.expander(f"Query {len(st.session_state['vision_history']) - i}: {item['query'][:50]}..."):
                        st.markdown(f"**Query:** {item['query']}")
                        st.markdown(f"**Reasoning:** {item['reasoning']}")
                        st.markdown(f"**Retrieved Nodes:** {', '.join(item['nodes'])}")
                        st.markdown(f"**Images Used:** {item['num_images']} pages")
                        st.markdown(f"**Answer:** {item['answer']}")

    def _format_tree(self, tree, indent=0):
        """Format tree structure for display"""
        result = []
        prefix = "  " * indent

        # The tree root may be a single node (dict) or a list of top-level nodes.
        if isinstance(tree, list):
            for item in tree:
                result.append(self._format_tree(item, indent))
        elif isinstance(tree, dict):
            node_id = tree.get('node_id', 'N/A')
            title = tree.get('title', 'Untitled')
            page = tree.get('page_index', 'N/A')

            result.append(f"{prefix}[{node_id}] {title} (Page: {page})")

            for child in tree.get('children', []):
                result.append(self._format_tree(child, indent + 1))

        return "\n".join(result)

# Made with Bob

🛠️ Bob’s Toolbox: Key Utility Functions

To make these complex workflows reliable, the application relies on a set of utility functions found in the pageindex.utils module:

  • create_node_mapping(): This is the "GPS" of the app: it creates a dictionary keyed by node ID, allowing the code to jump instantly from an LLM's recommendation to the actual content or page image.
  • remove_fields(): Essential for local LLM performance. It strips out the "heavy" full text from the tree during the reasoning phase, ensuring the granite3-dense model doesn't run out of memory (VRAM).
  • print_tree(): A developer-friendly tool that prints a clean, indented version of the document's structure to the terminal for debugging.
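As an example of the kind of output print_tree gives, here is a hedged stand-in. The real function lives in pageindex.utils; this simplified version only reproduces the indented-outline idea, with field names matching the rest of this post.

```python
def print_tree(node, indent=0):
    """Print an indented outline of a PageIndex-style tree (illustrative
    stand-in for pageindex.utils.print_tree, not the library's source)."""
    print("  " * indent + f"[{node['node_id']}] {node['title']}")
    for child in node.get("children", []):
        print_tree(child, indent + 1)

doc = {"node_id": "0001", "title": "Report",
       "children": [{"node_id": "0002", "title": "Revenue", "children": []}]}
print_tree(doc)
# Output:
# [0001] Report
#   [0002] Revenue
```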

🎇 Bonus: Presentation creation using Bob 🎇


Last but not least, I also tested a new mode from Bob: Presentation Creation. This powerful addition allows for the automated generation of HTML, PowerPoint, and PDF presentations. This expands Bob’s utility beyond the high-quality architecture documents it was already capable of producing, such as the comprehensive technical maps and component overviews seen in this project. By integrating this feature, Bob further streamlines the transition from raw technical implementation to stakeholder-ready communication, ensuring that the insights gained from testing PageIndex can be shared as effectively as they were developed.


Conclusion

In conclusion, “Bob” has brought a profound shift to my development workflow. I must admit that since I started using Bob, each time I spot interesting code or technology, like the PageIndex repository, the effort to test and implement it has dropped from days to hours, sometimes to less than an hour. This efficiency is driven by the modular architecture Bob provides, which allows rapid setup of complex environments, including local Ollama integration and Streamlit interfaces. Having said that, would I use PageIndex for every project? Maybe yes, maybe not; one thing is certain, however: Bob gives me real agility to test significantly more technologies and codebases with minimal friction. By automating the heavy lifting of environment configuration and model orchestration, Bob has transformed what used to be a daunting task into a seamless, high-speed exploration of the latest tech.

>>> Thanks for reading <<<
