DEV Community

Pratyay Banerjee


Kaizen — Let your focus follow you! 🎯


Disclaimer: This is a submission for the Built with Google Gemini: Writing Challenge


Presenting Kaizen 🦄

Kaizen is a multi-agent Chromium extension that quietly tracks where your attention actually settles while you browse. It doesn't block anything or try to push productivity. Instead, it helps you notice when your mind has drifted and guides you back to the thread you were following. Our brains are wired to conserve energy, so the moment we pause to ponder, we begin to wander. This is why long browsing sessions often feel scattered. Kaizen steps in at that exact point, especially for people who experience attention drift or ADHD-like patterns, helping the web feel connected again rather than fragmented.

🔗 Try it out here: https://kaizen.apps.sandipan.dev


Product Demo ▶️

The name "Kaizen" comes from the Japanese concept of continuous improvement (改善). Small, steady gains. That's the whole philosophy — not perfection, just awareness.


Inspiration 💡

If you write code or study on the web, you've likely lived this moment — a tab for documentation leads to a blog post, then a video, then a forum thread, and somewhere between the scrolls, the thread of your original question frays. Minutes later, you know you saw something useful, but you can't quite recall where, or what 😭

"Distraction is the modern poverty. Focus is the new wealth." — James Clear.


The truth is, the internet has made information abundant, but our ability to retain and build on that knowledge hasn't kept pace. Our brains are wired to conserve energy, so the moment we pause to ponder, we begin to wander, which is why long browsing sessions often feel scattered.

I suffer from ADHD. My co-builder @sandipndev does too. We've both tried the usual focus apps, especially the ones that block websites or guilt-trip you with timers. They made us feel worse, not better.


So, around New Year's, we decided that our resolution for 2026 would be to actually fix this problem: not with another blocker or pomodoro clone, but with something that genuinely understands how attention works! That motivation landed us at the Commit To Change: An AI Agents Hackathon 2026 hosted by Encode Club, and that's where Kaizen was born.


We asked ourselves: what if we could utilize AI smartly to understand where your attention actually settles, be it the content you read, the figure you saw, or the video you watched, and turn those moments into a private learning loop? We wanted to turn scattered web browsing into genuine learning, and that too without blocking sites or nagging you to focus, but by understanding what you're actually paying attention to and helping you build on it.


Built on Google gemini-3.1-flash, Kaizen supercharges your browser activity while keeping your data private. It notices, gently reflects, and helps you remember: the kind of help that keeps users rooted in the activity, because progress is felt, not forced!


Codebase / App Repository 🔗

GitHub: anikvox/kaizen — "overcome adhd"

Kaizen · kaizen.apps.sandipan.dev · Focus that Follows You

A privacy-first browser extension that tracks where your attention actually goes and gently helps you stay on track โ€” without blocking content or enforcing rigid workflows.

Built by CS students with ADHD who wanted a tool that understands attention patterns, not one that locks you out.


Screenshots

Extension Side Panel | Focus Guardian Nudge

Left: Extension side panel with focus tracking and growing tree. Right: Gentle nudge when you drift.


Features

  • Cognitive Attention Tracking — Knows what you're reading, watching, and listening to — not just which tabs are open
  • Focus Guardian Agent — Detects doomscrolling and distraction patterns, sends supportive nudges instead of blocking
  • AI Chat with Memory — Ask "What was I reading about today?" and get context-aware answers from your browsing history
  • Auto-Generated Quizzes — Turn passive reading into active recall with knowledge verification quizzes
  • Insights & Achievements — Track streaks, milestones, and focus patterns over time
  • …


What it does 🤔

Kaizen acts as your AI co-pilot for focused learning on the web. It runs silently in the background, tracking what you actually pay attention to, including what you read, watch, and explore, and utilizes Google Gemini & Opik to help you stay focused, remember what matters, and test your understanding.


As you browse, Kaizen gradually turns your natural attention into learning. When your focus slips, it offers gentle nudges to bring you back. When you finish reading or watching something, it surfaces quick recall prompts to reinforce what you just absorbed. It occasionally slips in short, well-timed quizzes to check your understanding while the idea is still fresh. And when you want to go deeper, its context-aware chat remembers where you've been, helping you connect ideas and build knowledge over time.

Kaizen is supercharged with a plethora of awesome features:

  • 🧠 Cognitive Attention Tracking — tracks where your mind actually settles across text, images, audio & YouTube
  • 🤖 Multi-Agent AI System — four coordinated agents (Focus Guardian, Chat, Focus Clustering, Mental Health) powered by Gemini
  • 💬 Agentic Co-Pilot Chat — tool-calling assistant that synthesizes your reading sessions with context-aware insights
  • 🌊 Supportive Pulse Nudges — gentle reminders when you drift, never blocking — with self-calibrating sensitivity
  • 📝 Knowledge Quizzes — auto-generated verification quizzes from your actual browsing content
  • 📊 Cognitive Analytics Dashboard — attention entropy, browsing fragmentation, late-night patterns over 7–90 day windows
  • 🌱 Growing Plant Gamification — a virtual plant that grows with your focus time
  • 🔐 Privacy-First Engine — PII anonymization, encrypted API keys (AES-256-GCM), GDPR-compliant with full data export/deletion
  • 🔭 Full Opik Observability — every LLM call, tool invocation, and agent decision traced end-to-end
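One metric from that list, attention entropy, is easy to formalize. Here is one plausible formulation as Shannon entropy over per-domain dwell time; the exact formula Kaizen uses is not shown in the post, so treat this as our assumption:

```typescript
// Sketch (assumed formula): Shannon entropy, in bits, over how browsing
// time is distributed across domains. Higher entropy means attention is
// spread thinly across many sites; 0 means a single sustained focus.
function attentionEntropy(dwellMsByDomain: Record<string, number>): number {
  const times = Object.values(dwellMsByDomain).filter((t) => t > 0);
  const total = times.reduce((a, b) => a + b, 0);
  if (total === 0) return 0;
  return -times.reduce((sum, t) => {
    const p = t / total; // share of total dwell time on this domain
    return sum + p * Math.log2(p);
  }, 0);
}
```

A session spent entirely on one domain scores 0 bits, while time split evenly across four domains scores 2 bits, so a rising value over a window is a cheap fragmentation signal.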

For more transparency, we also present a comparative analysis against available solutions:

| Traditional Approach | Kaizen Approach |
| --- | --- |
| 🟠 Blocks websites entirely | 🟢 Supportive pulse nudges — zero blocking |
| 🟠 Binary "focused" / "distracted" state | 🟢 Granular cognitive attention sensing |
| 🟠 Punishes distraction | 🟢 Understands attention patterns and gently guides |
| 🟠 No understanding of what you're learning | 🟢 Tracks reading, images, audio, video — builds context |
| 🟠 Cloud-locked data silos | 🟢 Privacy-first, PII-anonymized AI |

We wanted something that understands where your attention actually goes and gently helps you stay on track — without locking you out of anything.


How Gemini powers the system ⚡

"We used Gemini" tells you nothing. So let me be specific about how deeply it's woven into every layer of Kaizen.

Gemini is the system default provider throughout Kaizen, integrated via the Vercel AI SDK (@ai-sdk/google v3.0.22) alongside the direct Google SDK (@google/genai v1.40.0). Every agent, every summarization call, every quiz — Gemini handles it unless the user explicitly switches to another provider (Anthropic Claude or OpenAI GPT-4 are available as alternatives).

Here's the core provider resolution logic:

// service.ts — LLM Provider Resolution
export class LLMService {
  getProvider(): LLMProvider {
    // 1. Check user's custom provider + encrypted API key
    if (this.settings?.llmProvider) {
      const provider = this.tryCreateUserProvider();
      if (provider) return provider;
    }
    // 2. Fall back to system Gemini
    return this.createSystemProvider(); // → gemini-2.5-flash-lite
  }
}

// models.ts — System Defaults
export const SYSTEM_DEFAULT_PROVIDER: LLMProviderType = "gemini";
export const SYSTEM_DEFAULT_MODEL = "gemini-2.5-flash-lite";

And the GeminiProvider class wraps the Vercel AI SDK with full tool-calling, multimodal content (text + base64 images), and streaming support:

// providers/gemini.ts
export class GeminiProvider implements LLMProvider {
  readonly providerType = "gemini" as const;

  constructor(config: LLMProviderConfig) {
    this.google = createGoogleGenerativeAI({
      apiKey: config.apiKey,
    });
  }

  async generate(options: LLMGenerateOptions): Promise<LLMResponse> {
    // `messages` is built from options.messages (formatting elided)
    const result = await generateText({
      model: this.google(this.model),
      system: options.systemPrompt,
      messages,
      tools: options.tools,
      experimental_telemetry: getTelemetrySettings({
        name: `gemini-${this.model}`,
        userId: this.userId,
      }),
    });
    // Extract toolCalls and toolResults from response...
  }

  async stream(options: LLMStreamOptions): Promise<void> {
    const result = streamText({
      model: this.google(this.model),
      system: options.systemPrompt,
      messages,
      tools: options.tools,
    });
    // `fullContent` accumulates the streamed text (elided)
    for await (const chunk of result.textStream) {
      await options.callbacks.onToken(chunk, fullContent);
    }
  }
}


The four agents 🤖

Kaizen isn't the usual GPT wrapper with a focus timer bolted on. It's a coordinated multi-agent system where each agent has a specific job, its own set of tools, and its own Gemini-powered decision loop.

| Agent | What it does | How Gemini is used |
| --- | --- | --- |
| 🛡️ Focus Guardian | Monitors your browsing every 60 seconds. Detects doomscrolling, distraction, and focus drift. Sends nudges when confidence is high enough. | Gemini analyzes 15 minutes of activity context (domain switches, dwell times, social media ratio) and returns a structured JSON decision at temperature: 0.1 for consistency. |
| 💬 Chat Agent | Conversational AI with tool-calling. You can ask "what was I reading about today?" and it queries your actual attention data. | Gemini streams responses via streamText() with up to 5 agentic steps. It autonomously selects from 11 tools to ground answers in real data. |
| 🎯 Focus Agent | Clusters your attention into focus sessions. Figures out what topics you're working on and tracks evolution. | Gemini runs an agentic loop (up to 10 iterations) calling tools like create_focus, merge_focuses, update_focus to organize attention data into coherent sessions. |
| 🧘 Mental Health Agent | Generates cognitive wellness reports — fragmentation, sleep patterns, media balance, quiz retention. | Gemini runs another agentic loop with specialized tools (analyze_sleep_patterns, analyze_focus_quality, analyze_media_balance, think_aloud) and produces a full report in supportive, non-clinical language. |

◉ Temperature tuning across tasks 🌡️

Different tasks need different levels of creativity. We tuned Gemini's temperature for each use case:

// config.ts — LLM Configuration Presets
export const LLM_CONFIG = {
  decision:        { temperature: 0.1, maxTokens: 10   },  // Should we nudge? Yes/no.
  summarization:   { temperature: 0.3, maxTokens: 200  },  // Factual, deterministic
  focusAnalysis:   { temperature: 0.3, maxTokens: 50   },  // Concise clustering
  imageDescription:{ temperature: 0.3, maxTokens: 150  },  // Vision captions
  titleGeneration: { temperature: 0.7, maxTokens: 20   },  // Creative but short
  agent:           { temperature: 0.7, maxTokens: 4096 },  // Chat โ€” balanced
  quizGeneration:  { temperature: 0.9, maxTokens: 2000 },  // We *want* variety!
};

At 0.1, Gemini is disciplined — it gives consistent nudge decisions. At 0.9, it generates creative quiz question phrasing without going off the rails. That predictability across the temperature range was one of the reasons we kept Gemini as the default over other providers.

◉ Tool-calling in practice 🔧

The Chat Agent's tool-calling is where Gemini's structured output really shines. When you ask "what have I been focusing on?", here's what actually happens:

User message arrives
  → Gemini evaluates available tools
  → Selects: get_active_focus
  → Tool executes Prisma query against PostgreSQL
  → Results returned to Gemini
  → Gemini composes a response grounded in your data
  → Response streamed back via SSE

The 11 tools available to the Chat Agent:

  • get_attention_data โ€” recent text/image/audio/YouTube attention
  • get_active_website โ€” what tab you're on right now
  • get_active_focus โ€” your current focus topics
  • search_browsing_history โ€” search past activity
  • get_reading_activity โ€” reading session data
  • get_youtube_history โ€” YouTube watch history
  • get_focus_history โ€” past focus sessions
  • get_current_time โ€” current time in user's timezone
  • get_current_weather โ€” weather at user's location
  • set_user_location โ€” remember location (geocoding via OpenMeteo)
  • set_translation_language โ€” language preferences

Gemini picks which tools to call, interprets the results, and sometimes chains multiple tool calls in a single turn. We capped it at 5 steps per message to prevent runaway loops. Here's the actual execution from agent.ts:

// chat/agent.ts — Agentic Chat Execution
const result = streamText({
  model: provider(modelId),
  system: systemPrompt,  // Fetched from Opik prompt library
  messages: coreMessages, // Supports multimodal (text + images)
  tools,
  maxSteps: 5,
  onStepFinish: (step) => {
    // Create Opik span for each tool call
    if (step.toolCalls && step.toolCalls.length > 0) {
      for (const toolCall of step.toolCalls) {
        const toolSpan = trace?.span({
          name: `tool:${toolCall.toolName}`,
          type: "tool",
          input: { args: toolCall.args },
        });
        toolSpan?.end({ result: /* tool output */ });
      }
    }
  },
});

We tested Gemini, Claude, and GPT-4 for this pipeline. Gemini's tool selection was the most reliable for our use case — it rarely picked the wrong tool or returned malformed tool calls across 11 different tool schemas. That's why it became the default.

◉ Multimodal attention — Gemini Vision 👁️

When you linger on an image while browsing, the extension tracks your hover duration and confidence score. If you're actually paying attention, Kaizen sends the image as base64-encoded data directly to Gemini for caption generation:

// providers/gemini.ts — Multimodal content formatting
private formatUserContent(content: LLMMessageContent) {
  return content.map((part) => {
    if (part.type === "image") {
      return {
        type: "image" as const,
        image: `data:${part.mimeType};base64,${part.data}`,
      };
    }
    return { type: "text" as const, text: part.text };
  });
}

This means the Chat Agent can later tell you "you were looking at a diagram of TCP handshakes" instead of just "you visited a networking article." The image summaries + text summaries together form Kaizen's memory layer.

◉ Quiz generation from real attention 📝

When you hit "Generate Quiz," a pg-boss background job fires. The Quiz Agent pulls your recent attention data, feeds it to Gemini at temperature: 0.9, and generates 10 multiple-choice questions based on what you've been reading. A content hash prevents duplicate questions across sessions. The quiz stays valid for 24 hours.
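The dedup idea can be sketched as follows; the helper names and normalization step are our assumptions, since the post only tells us a content hash prevents duplicates:

```typescript
import { createHash } from "node:crypto";

// Hypothetical sketch of quiz dedup: hash the attention snippets that would
// feed a quiz, and skip generation when that hash has been seen before.
function contentHash(snippets: string[]): string {
  // Light normalization (assumed) so trivial whitespace/case changes
  // don't defeat deduplication.
  const normalized = snippets.map((s) => s.trim().toLowerCase()).join("\n");
  return createHash("sha256").update(normalized).digest("hex");
}

function shouldGenerateQuiz(snippets: string[], seenHashes: Set<string>): boolean {
  const hash = contentHash(snippets);
  if (seenHashes.has(hash)) return false; // already quizzed on this content
  seenHashes.add(hash);
  return true;
}
```

In Kaizen the check would sit in the pg-boss worker before the Gemini call, so duplicate jobs cost a hash instead of an LLM request.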

This is probably the feature I'm most proud of. Passive reading becomes active recall, and you didn't have to do anything extra. You just browsed normally, and now there's a quiz waiting for you. 🎯

◉ Focus Guardian — the self-learning nudge engine 🛡️

The Focus Guardian runs autonomously, analyzing your last 15 minutes of activity. Here's what actually happens in the decision loop (from focus-agent.ts):

// agent/focus-agent.ts — Focus Guardian Decision
const prompt = `${promptData.content}

RECENT ACTIVITY (last 15 minutes):
- Domains visited: ${context.recentDomains.join(", ")}
- Number of different sites: ${context.domainSwitchCount}
- Average time per page: ${Math.round(context.averageDwellTime / 1000)}s
- Social media/entertainment time: ${Math.round(context.socialMediaTime / 1000)}s
- Reading time (estimated): ${Math.round(context.readingTime / 1000)}s
- Has active focus: ${context.hasActiveFocus ? `Yes (${context.focusTopics.join(", ")})` : "No"}

USER FEEDBACK HISTORY:
- False positive rate: ${(feedback.falsePositiveRate * 100).toFixed(0)}%
- Acknowledged rate: ${(feedback.acknowledgedRate * 100).toFixed(0)}%
- Sensitivity setting: ${sensitivity}`;

const response = await provider.generate({
  messages: [{ role: "user", content: prompt }],
});

Nudge types: doomscroll, distraction, break, focus_drift, encouragement, and all_clear. There's a configurable cooldown between nudges so it never feels like nagging.
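The cooldown gate itself is simple to sketch. The function names and the 10-minute default below are our assumptions, not Kaizen's actual constants:

```typescript
// Hypothetical cooldown gate: suppress a nudge if the previous one fired
// too recently, regardless of what the agent decided.
const NUDGE_COOLDOWN_MS = 10 * 60 * 1000; // assumed 10-minute default

function canNudge(
  lastNudgeAt: number | null, // epoch ms of the last nudge, or null if none
  now: number,
  cooldownMs: number = NUDGE_COOLDOWN_MS,
): boolean {
  if (lastNudgeAt === null) return true;
  return now - lastNudgeAt >= cooldownMs;
}
```

Keeping the gate outside the LLM decision means even a confidently wrong model can only annoy you once per cooldown window.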

The system self-calibrates. Every nudge records whether you acknowledged it, dismissed it, or marked it as a false positive:

// Sensitivity auto-adjustment from user feedback
if (response === "false_positive") {
  newSensitivity = Math.max(0.1, newSensitivity - 0.05); // fewer nudges
} else if (response === "acknowledged") {
  newSensitivity = Math.min(0.9, newSensitivity + 0.02); // nudge was helpful
}

Over time, the agent learns your patterns. If it keeps getting it wrong, it backs off. If it's on point, it stays the course.
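For reference, the feedback rule can be restated as a pure, testable function with the same step sizes and clamps as the snippet above; the "dismissed" branch (no change) is our assumption:

```typescript
// Runnable restatement of the sensitivity feedback rule. Step sizes and
// clamps match the snippet above; the dismissed case is assumed to be a
// no-op.
type NudgeResponse = "false_positive" | "acknowledged" | "dismissed";

function adjustSensitivity(current: number, response: NudgeResponse): number {
  if (response === "false_positive") return Math.max(0.1, current - 0.05); // back off
  if (response === "acknowledged") return Math.min(0.9, current + 0.02);   // lean in
  return current;
}
```

Note the asymmetry: false positives walk sensitivity down more than twice as fast as acknowledgements raise it, biasing the agent toward silence when in doubt.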


The Tech Stack ⚙️

Everything runs on a TypeScript monorepo (pnpm workspaces):

kaizen/
├── apps/
│   ├── api/          # Hono backend — agents, data ingestion, SSE
│   ├── extension/    # Plasmo browser extension — attention sensors
│   └── web/          # Next.js dashboard — analytics, chat, settings
├── packages/
│   ├── api-client/   # Shared typed API client
│   └── ui/           # Shared component library
└── docker-compose.yml
| Layer | What we used |
| --- | --- |
| Runtime | Node.js 22+ |
| Backend | Hono v4.6.14, Prisma ORM v6.2.1, PostgreSQL 16 |
| Job Queue | pg-boss v12 (single-concurrency, resource-aware) |
| Real-time | Custom SSE (Server-Sent Events) for cross-device sync |
| Auth | Clerk v1.21.4 (web), device token handshake (extension) |
| AI | Google Gemini via Vercel AI SDK v6.0.77 (@ai-sdk/google + @google/genai) |
| Observability | Comet Opik v1.0.6 — tracing, prompt library, anonymizers |
| Extension | Plasmo, React, TypeScript |
| Dashboard | Next.js 15, Tailwind CSS, Lucide Icons |
| Encryption | AES-256-GCM for API key storage |
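Since the stack mentions AES-256-GCM for API key storage, here is a minimal sketch of that scheme with Node's crypto module. The payload layout (iv:tag:ciphertext, hex) is our own illustrative choice, not necessarily Kaizen's format:

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Sketch of AES-256-GCM secret storage. GCM gives both confidentiality and
// an authentication tag, so tampered ciphertext fails to decrypt.
function encrypt(plaintext: string, key: Buffer): string {
  const iv = randomBytes(12); // 96-bit nonce, the recommended size for GCM
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  const tag = cipher.getAuthTag();
  // Illustrative payload layout: iv:tag:ciphertext, hex-encoded
  return [iv, tag, ciphertext].map((b) => b.toString("hex")).join(":");
}

function decrypt(payload: string, key: Buffer): string {
  const [iv, tag, ciphertext] = payload.split(":").map((h) => Buffer.from(h, "hex"));
  const decipher = createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag); // must be set before final() or decryption throws
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}
```

A fresh random nonce per encryption is essential with GCM; reusing one under the same key breaks the scheme.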

Attention sensors 📡

The extension runs separate monitors for different content types:

| Sensor | File | What it tracks |
| --- | --- | --- |
| 📖 Text | monitor-text.ts | Paragraphs read, words processed, reading progress, sustained attention duration |
| 🖼️ Image | monitor-image.ts | Hover duration, confidence score → triggers Gemini Vision for caption generation |
| 🔊 Audio | monitor-audio.ts | Playback duration, active listening time |
| 📺 YouTube | background scripts | Watch time, captions ingestion, video context |

Each sensor generates a confidence score (0–100) based on hover duration, scroll velocity, and viewport position. A quick skim doesn't count as learning. Sustained attention does.
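A toy version of such a score might look like the following; the weights, saturation point, and scroll penalty are invented for illustration, not taken from the sensors' code:

```typescript
// Hypothetical confidence model combining the three signals the sensors
// use: dwell time, scroll velocity, and viewport visibility.
function confidenceScore(opts: {
  dwellMs: number;              // how long the element held attention
  scrollPxPerSec: number;       // fast scrolling suggests skimming
  viewportVisibleRatio: number; // 0..1 fraction of the element on screen
}): number {
  const dwell = Math.min(opts.dwellMs / 10_000, 1);     // saturates at 10s (assumed)
  const calm = 1 / (1 + opts.scrollPxPerSec / 500);     // fast scroll -> low score
  const visible = Math.max(0, Math.min(opts.viewportVisibleRatio, 1));
  return Math.round(100 * dwell * calm * visible);
}
```

With multiplicative factors, any one weak signal (a half-second glance, frantic scrolling, an element barely on screen) drags the whole score down, which matches the "skims don't count" goal.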


Database Schema 🗄️

// Core attention tracking
TextAttention    → text, wordsRead, confidence, timestamp
ImageAttention   → src, alt, hoverDuration, summary (AI-generated)
AudioAttention   → playbackDuration, activeTime
YoutubeAttention → captions, activeWatchTime

// Agentic features
Focus            → item, keywords[], isActive, lastActivityAt
AgentNudge       → type, message, confidence, reasoning, response
Pulse            → userId, message (short nudges)

// Quiz system
Quiz             → questions (JSON), contentHash (deduplication)
QuizAnswer       → selectedIndex, isCorrect
QuizResult       → totalQuestions, correctAnswers

// User settings (encrypted API keys)
UserSettings     → geminiApiKeyEncrypted, llmProvider, llmModel


Real-time SSE events 📡

Custom Server-Sent Events sync state across browser extension + dashboard:

  • pomodoro-tick — Timer updates
  • chat-message-created/updated — Chat streaming
  • active-tab-changed — Tab context sync
  • focus-changed — Focus session state
  • settings-updated — Cross-device settings sync
  • pulses-updated — Nudge notifications
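On the receiving end, a dashboard typically funnels all of these through one dispatcher keyed by event name. A minimal sketch of that wiring (our own illustration, not Kaizen's code):

```typescript
// Sketch: route named SSE events to handlers. Unknown events are ignored
// so the client tolerates new server-side event types.
type SSEHandler = (data: unknown) => void;

function createEventRouter(handlers: Record<string, SSEHandler>) {
  return (eventName: string, rawData: string): boolean => {
    const handler = handlers[eventName];
    if (!handler) return false;
    handler(JSON.parse(rawData)); // SSE data lines carry JSON payloads (assumed)
    return true;
  };
}

// Browser usage (sketch):
//   const es = new EventSource("/api/events");
//   es.addEventListener("focus-changed", (e) => route("focus-changed", e.data));
```

Returning false for unknown names keeps extension and dashboard deployable independently: an older client simply drops events it doesn't understand.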


Observability with Opik 🔭

We integrated Comet Opik for full observability across the entire agent system. This turned out to be one of the best decisions we made, as you can't evaluate what you can't see.


What we instrumented

🔗 Tracing — Every LLM call, every tool invocation, every agent decision is traced end-to-end. Traces are grouped by thread ID so you can follow the full decision flow:

// telemetry.ts — Opik Trace Hierarchy
const trace = client.trace({
  name: options.name,
  input: options.input ? anonymizeInput(options.input) : undefined,
  metadata: { ...options.metadata, environment: process.env.NODE_ENV },
  tags: options.tags || ["kaizen"],
  threadId: options.threadId,
});

// Nested spans for each step
const span = trace.span({
  name: "tool:get_attention_data",
  type: "tool",
  input: anonymizeInput({ args: toolCall.args }),
});
span.update({ output: processedOutput, endTime: new Date() });

The resulting trace hierarchy looks like:

Trace: chat-agent
├── Span: streamText [type: llm]
│   ├── Span: tool:get_active_website [type: tool]
│   ├── Span: tool:get_attention_data [type: tool]
│   └── Span: tool:search_browsing_history [type: tool]
└── Span: followUp-streamText [type: llm]

📚 Prompt Library — All 11 system prompts live in Opik under named entries, fetched fresh on every call with local fallbacks:

// prompt-provider.ts — Opik-first, local fallback
export async function getPromptWithMetadata(name: PromptName) {
  if (isOpikPromptsEnabled()) {
    const opikPrompt = await getPromptFromOpik(name);
    if (opikPrompt?.content) {
      return { content: opikPrompt.content, source: "opik",
               promptVersion: opikPrompt.commit };
    }
  }
  return { content: LOCAL_PROMPT_MAP[name], source: "local" };
}

This let us iterate on prompts without redeploying code. We'd see a bad nudge in a trace, tweak the prompt in Opik's dashboard, and the fix was live immediately.

🔒 Anonymizers — Before anything gets logged to Opik, we strip PII using @cdssnc/sanitize-pii:

// anonymizer.ts — PII Protection
function isSensitiveKey(key: string): boolean {
  const sensitivePatterns = [
    /^userId$/i, /password/i, /secret/i,
    /token/i, /api[_-]?key/i, /auth/i,
    /credential/i, /private[_-]?key/i,
  ];
  return sensitivePatterns.some((pattern) => pattern.test(key));
}

// User inputs → anonymized. LLM outputs → preserved for debugging.
export function anonymizeInput<T>(data: T): T { return anonymizeData(data); }
export function anonymizeOutput<T>(data: T): T { /* only redact sensitive keys */ }
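The anonymizeData walk that anonymizeInput delegates to isn't shown in the excerpt. A plausible recursive implementation looks like this (our sketch; the real pipeline additionally runs @cdssnc/sanitize-pii over string values, which is omitted here):

```typescript
// Sketch: recursively walk objects/arrays and redact values whose keys
// match the sensitive patterns from isSensitiveKey above.
const SENSITIVE = [
  /^userId$/i, /password/i, /secret/i,
  /token/i, /api[_-]?key/i, /auth/i,
  /credential/i, /private[_-]?key/i,
];

function anonymizeData<T>(data: T): T {
  if (Array.isArray(data)) {
    return data.map((item) => anonymizeData(item)) as T;
  }
  if (data !== null && typeof data === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, value] of Object.entries(data as Record<string, unknown>)) {
      out[key] = SENSITIVE.some((p) => p.test(key))
        ? "[REDACTED]"            // drop the value, keep the key for debugging
        : anonymizeData(value);   // recurse into nested structures
    }
    return out as T;
  }
  return data; // primitives pass through
}
```

Redacting by key rather than deleting the key keeps traces structurally comparable across requests, which helps when diffing two decision flows in Opik.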

๐Ÿ›ก๏ธ Guardrails โ€” Agents validate inputs before tool execution. The Focus Guardian only fires a nudge when confidence exceeds a dynamically adjusted threshold. The Chat Agent validates tool arguments before running Prisma queries.


Why Opik mattered 🎯

Early on, the Focus Guardian was nudging people during legitimate deep dives. Someone would be reading a 30-minute technical article, and the agent would flag it as distraction because the domain switching pattern looked similar to aimless browsing.

Without Opik, we'd have said "the AI is dumb" and guessed at fixes.

With tracing, we could pull up the exact decision chain: here's the 15 minutes of context the agent saw, here's the domain switch count, here's the confidence score, here's the prompt, here's the output. The problem was obvious — the prompt didn't have a strong enough signal for sustained single-topic browsing. We tweaked the prompt in Opik, the fix deployed without a code change, and false positives dropped.

That cycle — trace the failure → find the root cause → fix the prompt → verify in production — happened dozens of times.


What worked well ✅

  • Tool-calling was reliable. We tested Gemini, Claude, and GPT-4 for our agent pipelines, and Gemini's structured output parsing was the most consistent for our use case. The Chat Agent makes autonomous tool selections across 11 different tools, and Gemini rarely picked the wrong one or returned malformed tool calls. This is why it became our system default.
  • The million-token context window was a real advantage. Gemini 3.1-flash and 3.1-flash-lite both support 1M-token context windows. For the Focus Agent's clustering loop, which sometimes processes hours of attention data across many topics, having that headroom meant we didn't have to aggressively truncate context. We could pass in a richer activity history and get better clustering decisions.
  • Temperature control behaved predictably. From 0.1 for binary decisions to 0.9 for quiz generation, Gemini responded consistently. At 0.1 it was disciplined; at 0.9 it got creative without going off the rails. That predictability across the full range was a real win.
  • Multimodal input worked out of the box. We send base64-encoded images directly to Gemini for caption generation. The quality of image descriptions was good enough that the Chat Agent could later reference them meaningfully. No separate vision pipeline needed.
  • The model fetcher dynamically discovers new models. We use @google/genai to fetch the live model list from the Gemini API (filtering for generateContent support), with sorting priority baked in for newer Gemini model families, so when a new model lands, Kaizen picks it up automatically.


Research 📚

We don't remember things just because we saw them. We remember them when we bring them back to mind. A small, well-timed reminder can turn a passing moment online into something that actually sticks. And it works better when the reminder supports your intention rather than trying to control your behavior. When your browser can quietly keep track of the ideas you spent time on and surface them again when you need them, the pressure of "trying to hold everything in your head" eases up.

This is especially supportive for people with ADHD, where working memory and task switching can feel heavy, and for people who experience early memory decline, where gentle spaced recall helps keep learning active. Kaizen helps keep the thread. Small nudges, quick check-ins, and context that stays with you, so you don't have to start from scratch every time you return to a thought.

  1. Distractibility trait linked to ADHD
  2. Fighting ADHD along the hustle culture: How can employees keep their mental health in check
  3. Study of Internet addiction in children with attention-deficit hyperactivity disorder and normal control
  4. ADHD Youth and Digital Media Use
  5. Attention Spans — Podcast (APA)
  6. Digital Distractions (ADDitude Magazine – tag archive)
  7. What if the Attention Crisis Is All a Distraction? (The New Yorker)
  8. Distraction fatigue vs ADHD: How technology is reshaping our attention spans
  9. Retrieval practice helps strengthen memory by actively recalling information (FPSYG, 2019)
  10. Spaced retrieval improves retention by revisiting information over time (Mavmatrix, art. 1160)
  11. Small external cues and short recall prompts support working memory in ADHD
  12. Gentle reminders and cueing tools help reduce memory load in dementia (NCBI, 2017)
  13. Retrieval reactivates and stabilizes memory traces when spaced and repeated
  14. Everyday memory aids help maintain independence and reduce strain in dementia

Kaizen keeps attention anchored to meaning, not effort! ✨


Challenges we ran into 😤

We did run into a few challenges along the way. Since we were working from different time zones, coordinating calls and staying in sync took some extra effort. Most of our collaboration happened asynchronously, which meant we had to be very clear about decisions and hand-offs.

On the technical side, figuring out what "real attention" meant was something we had to refine multiple times. We experimented with how much weight to give scroll patterns, mouse movement, viewport position, and reading pace so that quick skims didn't count as learning. Handling different types of content also took care, especially images and YouTube videos, since the context needed to stay meaningful, not noisy.

Some of the other challenges we faced include:

  • Multi-turn coherence degraded over long conversations. After 10+ turns with interleaved tool calls, the Chat Agent would sometimes lose track of earlier context or repeat information. We partially fixed this by injecting a conversation summary into the system prompt, but it meant extra token usage. Not unique to Gemini, but noticeable.
  • Streaming with tool calls needed careful handling. When Gemini decides mid-stream to call a tool, the handoff between text chunks and tool-call events required state management in our SSE layer. The Vercel AI SDK abstracted most of it, but edge cases (tool call at the very start, multiple rapid tool calls) needed explicit handling.
  • Occasional overconfidence in Focus Guardian decisions. At temperature: 0.1, when the Focus Guardian is wrong, it's confidently wrong. A few times it classified focused research (lots of Stack Overflow tabs) as aimless browsing. The fix was better prompting + the feedback loop, not a model change.
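The conversation-summary mitigation from the first bullet can be sketched as a windowing step. The helper below is hypothetical; in Kaizen the summarizer itself would be another Gemini call, stubbed out here as a plain function:

```typescript
// Sketch of the long-conversation fix: keep the last N turns verbatim and
// fold everything older into one summary destined for the system prompt.
type Turn = { role: "user" | "assistant"; content: string };

function buildPromptWindow(
  turns: Turn[],
  summarize: (older: Turn[]) => string, // in practice, an LLM call
  keepLast = 6,                          // assumed window size
): { summary: string | null; recent: Turn[] } {
  if (turns.length <= keepLast) return { summary: null, recent: turns };
  const older = turns.slice(0, turns.length - keepLast);
  return { summary: summarize(older), recent: turns.slice(-keepLast) };
}
```

The trade-off the bullet mentions is visible here: the summary text is extra tokens on every subsequent request, paid in exchange for bounded message history.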


What we learned 🙌

Proper sleep is very important! 😛

Well, a lot of things, on both the technical and non-technical sides. We learned that it's one thing to get the AI features working, and another to make them feel good while someone is actually browsing. Most of our time went into small details: when to nudge, when to stay quiet, how to store attention history without slowing down the browser, and how to keep things calm instead of distracting. Shipping Kaizen from a bare-bones idea into something stable took a lot of iteration, testing, and rethinking. It reminded us that real products are built in the tiny decisions, not the big demos! 🤗

There are a few more items that I'd love to share with the community,

  • 🟦 Building for attention requires restraint. The hardest design decisions weren't technical. They were about when not to act. Our early Focus Guardian nudged aggressively and mostly felt like a backseat driver. The lesson: if your tool annoys people, they'll uninstall it. Being right isn't enough; you have to be right at the right moment.
  • 🟦 Agents need structure, not freedom. We initially gave the Chat Agent broad instructions. The results were inconsistent. What worked was constraining each agent to a narrow job with specific tools and clear decision boundaries. The Focus Guardian doesn't chat. The Chat Agent doesn't nudge. In short: specialization + coordination > generalization.
  • 🟦 Observability isn't optional for agent systems. Without Opik traces, we'd still be guessing why nudges misfired. We stopped treating the AI as a black box and started treating it like any other system component with logs and metrics.
  • 🟦 The real product is the quiet moments. Nobody remembers the quiz that worked perfectly. They remember the time the extension stayed silent for 45 minutes during a Wikipedia deep dive they genuinely cared about, and then gently reminded them about the assignment they'd originally set out to work on. Getting those moments right took dozens of prompt iterations and hundreds of traced decisions.
  • 🟦 Gemini as a default provider was the right call. After benchmarking all three providers, Gemini's combination of reliable tool-calling, 1M context window, and consistent temperature behavior made it the best fit. Our system makes potentially dozens of Gemini calls per user per hour — attention summaries, focus clustering, guardian checks — and reliability at that volume mattered more than peak performance on any single call.


What's next? 🚀

We're continuing to develop Kaizen and planning the next release cycle:

  • โฐ Spaced repetition โ€” surfacing what you read at the moment you're most likely to forget it
  • ๐Ÿ•ธ๏ธ Topic relationship mapping โ€” showing how things you learn connect across sessions
  • โšก Better batching โ€” optimized Gemini call grouping during long browsing sessions
  • ๐Ÿ“ค Export to note-taking tools โ€” so learning doesn't stay trapped in the extension
  • ๐Ÿ‘ฅ Shareable study threads โ€” lightweight collaboration for shared focus sessions

For Gemini specifically, we're interested in structured output (JSON mode) for agent responses. Right now we parse freeform text from several agent pipelines, and guaranteed JSON would let us simplify those parsing layers.
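Until then, the freeform-parsing layer has to be defensive. A sketch of the kind of code structured output would let us delete (our illustration, not the actual Kaizen parser):

```typescript
// Sketch: pull the first JSON object out of a freeform model response and
// validate the fields we expect before trusting it.
function parseNudgeDecision(raw: string): { shouldNudge: boolean; confidence: number } | null {
  const match = raw.match(/\{[\s\S]*\}/); // grab the outermost {...} span
  if (!match) return null;
  try {
    const obj = JSON.parse(match[0]);
    if (typeof obj.shouldNudge !== "boolean" || typeof obj.confidence !== "number") {
      return null; // shape mismatch: treat as a failed decision
    }
    return { shouldNudge: obj.shouldNudge, confidence: obj.confidence };
  } catch {
    return null; // model wrapped the JSON in prose badly enough to break it
  }
}
```

With guaranteed JSON output, the regex, the try/catch, and the shape check all collapse into a single schema-validated call.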

End notes 🙌🏻

As CS students who struggle with ADHD, we primarily built Kaizen because we needed it ourselves. Traditional blockers felt like punishment. Our New Year's resolution was to build the tool we wished existed: something that doesn't lock you out, doesn't judge, just watches where your attention goes, learns your patterns, and gently, continuously helps you get better. That's what kaizen (改善) stands for: continuous improvement.

Huge thanks to DEV and MLH for hosting this writing challenge, and to the Google Gemini team for building models that actually hold up under real multi-agent workloads! 🙌

Permissive License ⚖️

MIT License
