Disclaimer: This is a submission for the Built with Google Gemini: Writing Challenge
Presenting Kaizen 🌱
Kaizen is a multi-agent Chromium extension that quietly tracks where your attention actually settles while you browse. It doesn't block anything or try to push productivity. Instead, it helps you notice when your mind has drifted and guides you back to the thread you were following. Our brains are wired to conserve energy, so the moment we pause to ponder, we begin to wander; this is why long browsing sessions often feel scattered. Kaizen steps in at exactly that point, especially for people who experience attention drift or ADHD-like patterns, helping the web feel connected again rather than fragmented.
👉 Try it out here: https://kaizen.apps.sandipan.dev
Product Demo ▶️
The name "Kaizen" comes from the Japanese concept of continuous improvement (改善). Small, steady gains. That's the whole philosophy – not perfection, just awareness.
Inspiration 💡
If you write code or study on the web, you've likely lived this moment – a tab for documentation leads to a blog post, then a video, then a forum thread, and somewhere between the scrolls, the thread of your original question frays. Minutes later, you know you saw something useful, but you can't quite recall where, or what 💭
"Distraction is the modern poverty. Focus is the new wealth." – James Clear
The truth is, the internet has made information abundant, but our ability to retain and build on that knowledge hasn't kept pace. Our brains are wired to conserve energy, so the moment we pause to ponder, we begin to wander – which is why long browsing sessions often feel scattered.
I suffer from ADHD. My co-builder @sandipndev does too. We've both tried the usual focus apps, especially the ones that block websites or guilt-trip you with timers. They made us feel worse, not better.
So, around New Year's, we decided that our resolution for 2026 would be to actually fix this problem – not with another blocker or Pomodoro clone, but with something that genuinely understands how attention works! That motivation landed us at the Commit To Change: An AI Agents Hackathon 2026, hosted by Encode Club, and that's where Kaizen was born.
We asked ourselves: what if we could use AI to understand where your attention actually settles – the content you read, the figure you saw, the video you watched – and turn those moments into a private learning loop? We wanted to turn scattered web browsing into genuine learning, not by blocking sites or nagging you to focus, but by understanding what you're actually paying attention to and helping you build on it.
Built on Google gemini-3.1-flash, Kaizen supercharges your browser activity while keeping your data private. It notices, gently reflects, and helps you remember – the kind of help that keeps users rooted in the activity, because progress is felt, not forced!
Codebase / App Repository 📂
- Kaizen 👉 github.com/anikvox/kaizen [Open Source]
- Live App 👉 kaizen.apps.sandipan.dev
Kaizen
kaizen.apps.sandipan.dev · Focus that Follows You
A privacy-first browser extension that tracks where your attention actually goes and gently helps you stay on track – without blocking content or enforcing rigid workflows.
Built by CS students with ADHD who wanted a tool that understands attention patterns, not one that locks you out.
Screenshots
Left: Extension side panel with focus tracking and growing tree. Right: Gentle nudge when you drift.
Features
- Cognitive Attention Tracking – Knows what you're reading, watching, and listening to, not just which tabs are open
- Focus Guardian Agent – Detects doomscrolling and distraction patterns, sends supportive nudges instead of blocking
- AI Chat with Memory – Ask "What was I reading about today?" and get context-aware answers from your browsing history
- Auto-Generated Quizzes – Turn passive reading into active recall with knowledge verification quizzes
- Insights & Achievements – Track streaks, milestones, and focus patterns over time
- …
What it does 🤔
Kaizen acts as your AI co-pilot for focused learning on the web. It runs silently in the background, tracking what you actually pay attention to – what you read, watch, and explore – and uses Google Gemini & Opik to help you stay focused, remember what matters, and test your understanding.
As you browse, Kaizen gradually turns your native attention into learning. When your focus slips, it offers gentle nudges to bring you back. When you finish reading or watching something, it surfaces quick recall prompts to reinforce what you just absorbed. It occasionally slips in short, well-timed quizzes to check your understanding while the idea is still fresh. And when you want to go deeper, its context-aware chat remembers where you've been, helping you connect ideas and build knowledge over time.
Kaizen comes packed with features:
- 🧠 Cognitive Attention Tracking – tracks where your mind actually settles across text, images, audio & YouTube
- 🤖 Multi-Agent AI System – four coordinated agents (Focus Guardian, Chat, Focus Clustering, Mental Health) powered by Gemini
- 💬 Agentic Co-Pilot Chat – tool-calling assistant that synthesizes your reading sessions with context-aware insights
- 💚 Supportive Pulse Nudges – gentle reminders when you drift, never blocking, with self-calibrating sensitivity
- 📝 Knowledge Quizzes – auto-generated verification quizzes from your actual browsing content
- 📊 Cognitive Analytics Dashboard – attention entropy, browsing fragmentation, late-night patterns over 7–90 day windows
- 🌱 Growing Plant Gamification – a virtual plant that grows with your focus time
- 🔒 Privacy-First Engine – PII anonymization, encrypted API keys (AES-256-GCM), GDPR-compliant with full data export/deletion
- 🔭 Full Opik Observability – every LLM call, tool invocation, and agent decision traced end-to-end
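To make the encrypted-key storage concrete, here's a minimal AES-256-GCM sketch in Node.js. This is illustrative only – the helper names (`encryptApiKey`, `decryptApiKey`) and the in-memory master key are assumptions for the example, not Kaizen's actual implementation:

```typescript
// Illustrative AES-256-GCM round trip for storing an API key.
// Hypothetical names; in practice the master key would come from a
// server-side secret, not be generated per process.
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

const masterKey = randomBytes(32); // 256-bit key

function encryptApiKey(plaintext: string): string {
  const iv = randomBytes(12); // GCM's standard 96-bit nonce
  const cipher = createCipheriv("aes-256-gcm", masterKey, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  const tag = cipher.getAuthTag(); // 16-byte authentication tag
  // Store iv + tag + ciphertext together, base64-encoded
  return Buffer.concat([iv, tag, ciphertext]).toString("base64");
}

function decryptApiKey(stored: string): string {
  const buf = Buffer.from(stored, "base64");
  const iv = buf.subarray(0, 12);
  const tag = buf.subarray(12, 28);
  const ciphertext = buf.subarray(28);
  const decipher = createDecipheriv("aes-256-gcm", masterKey, iv);
  decipher.setAuthTag(tag); // a tampered ciphertext makes final() throw
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}
```

The GCM auth tag is what makes this more than obfuscation: any modification of the stored blob fails authentication instead of silently decrypting to garbage.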
For transparency, here's how Kaizen compares with the usual focus tools:
| Typical focus apps | Kaizen |
|---|---|
| 🔴 Blocks websites entirely | 🟢 Supportive pulse nudges – zero blocking |
| 🔴 Binary "focused" / "distracted" state | 🟢 Granular cognitive attention sensing |
| 🔴 Punishes distraction | 🟢 Understands attention patterns and gently guides |
| 🔴 No understanding of what you're learning | 🟢 Tracks reading, images, audio, video – builds context |
| 🔴 Cloud-locked data silos | 🟢 Privacy-first, PII-anonymized AI |
We wanted something that understands where your attention actually goes and gently helps you stay on track – without locking you out of anything.
How Gemini powers the system ⚡
"We used Gemini" tells you nothing. So let me be specific about how deeply it's woven into every layer of Kaizen.
Gemini is the system default provider throughout Kaizen, integrated via the Vercel AI SDK (@ai-sdk/google v3.0.22) alongside the direct Google SDK (@google/genai v1.40.0). Every agent, every summarization call, every quiz – Gemini handles it unless the user explicitly switches to another provider (Anthropic Claude or OpenAI GPT-4 are available as alternatives).
Here's the core provider resolution logic:
// service.ts โ LLM Provider Resolution
export class LLMService {
getProvider(): LLMProvider {
// 1. Check user's custom provider + encrypted API key
if (this.settings?.llmProvider) {
const provider = this.tryCreateUserProvider();
if (provider) return provider;
}
// 2. Fall back to system Gemini
return this.createSystemProvider(); // → gemini-2.5-flash-lite
}
}
// models.ts โ System Defaults
export const SYSTEM_DEFAULT_PROVIDER: LLMProviderType = "gemini";
export const SYSTEM_DEFAULT_MODEL = "gemini-2.5-flash-lite";
And the GeminiProvider class wraps the Vercel AI SDK with full tool-calling, multimodal content (text + base64 images), and streaming support:
// providers/gemini.ts
export class GeminiProvider implements LLMProvider {
readonly providerType = "gemini" as const;
constructor(config: LLMProviderConfig) {
this.google = createGoogleGenerativeAI({
apiKey: config.apiKey,
});
}
async generate(options: LLMGenerateOptions): Promise<LLMResponse> {
const result = await generateText({
model: this.google(this.model),
system: options.systemPrompt,
messages,
tools: options.tools,
experimental_telemetry: getTelemetrySettings({
name: `gemini-${this.model}`,
userId: this.userId,
}),
});
// Extract toolCalls and toolResults from response...
}
async stream(options: LLMStreamOptions): Promise<void> {
const result = streamText({
model: this.google(this.model),
system: options.systemPrompt,
messages,
tools: options.tools,
});
for await (const chunk of result.textStream) {
await options.callbacks.onToken(chunk, fullContent);
}
}
}
The four agents 🤖
Kaizen isn't your usual GPT wrapper with a focus timer bolted on. It's a coordinated multi-agent system where each agent has a specific job, its own set of tools, and its own Gemini-powered decision loop.
| Agent | What it does | How Gemini is used |
|---|---|---|
| 🛡️ Focus Guardian | Monitors your browsing every 60 seconds. Detects doomscrolling, distraction, and focus drift. Sends nudges when confidence is high enough. | Gemini analyzes 15 minutes of activity context (domain switches, dwell times, social media ratio) and returns a structured JSON decision at `temperature: 0.1` for consistency. |
| 💬 Chat Agent | Conversational AI with tool-calling. You can ask "what was I reading about today?" and it queries your actual attention data. | Gemini streams responses via `streamText()` with up to 5 agentic steps. It autonomously selects from 11 tools to ground answers in real data. |
| 🎯 Focus Agent | Clusters your attention into focus sessions. Figures out what topics you're working on and tracks evolution. | Gemini runs an agentic loop (up to 10 iterations) calling tools like `create_focus`, `merge_focuses`, `update_focus` to organize attention data into coherent sessions. |
| 🧠 Mental Health Agent | Generates cognitive wellness reports – fragmentation, sleep patterns, media balance, quiz retention. | Gemini runs another agentic loop with specialized tools (`analyze_sleep_patterns`, `analyze_focus_quality`, `analyze_media_balance`, `think_aloud`) and produces a full report in supportive, non-clinical language. |
✅ Temperature tuning across tasks 🌡️
Different tasks need different levels of creativity. We tuned Gemini's temperature for each use case:
// config.ts โ LLM Configuration Presets
export const LLM_CONFIG = {
decision: { temperature: 0.1, maxTokens: 10 }, // Should we nudge? Yes/no.
summarization: { temperature: 0.3, maxTokens: 200 }, // Factual, deterministic
focusAnalysis: { temperature: 0.3, maxTokens: 50 }, // Concise clustering
imageDescription:{ temperature: 0.3, maxTokens: 150 }, // Vision captions
titleGeneration: { temperature: 0.7, maxTokens: 20 }, // Creative but short
agent: { temperature: 0.7, maxTokens: 4096 }, // Chat โ balanced
quizGeneration: { temperature: 0.9, maxTokens: 2000 }, // We *want* variety!
};
At 0.1, Gemini is disciplined – it gives consistent nudge decisions. At 0.9, it generates creative quiz question phrasing without going off the rails. That predictability across the temperature range was one of the reasons we kept Gemini as the default over other providers.
✅ Tool-calling in practice 🔧
The Chat Agent's tool-calling is where Gemini's structured output really shines. When you ask "what have I been focusing on?", here's what actually happens:
User message arrives
↓ Gemini evaluates available tools
↓ Selects: get_active_focus
↓ Tool executes Prisma query against PostgreSQL
↓ Results returned to Gemini
↓ Gemini composes a response grounded in your data
↓ Response streamed back via SSE
The 11 tools available to the Chat Agent:
- `get_attention_data` – recent text/image/audio/YouTube attention
- `get_active_website` – what tab you're on right now
- `get_active_focus` – your current focus topics
- `search_browsing_history` – search past activity
- `get_reading_activity` – reading session data
- `get_youtube_history` – YouTube watch history
- `get_focus_history` – past focus sessions
- `get_current_time` – current time in the user's timezone
- `get_current_weather` – weather at the user's location
- `set_user_location` – remember location (geocoding via OpenMeteo)
- `set_translation_language` – language preferences
Gemini picks which tools to call, interprets the results, and sometimes chains multiple tool calls in a single turn. We capped it at 5 steps per message to prevent runaway loops. Here's the actual execution from agent.ts:
// chat/agent.ts โ Agentic Chat Execution
const result = streamText({
model: provider(modelId),
system: systemPrompt, // Fetched from Opik prompt library
messages: coreMessages, // Supports multimodal (text + images)
tools,
maxSteps: 5,
onStepFinish: (step) => {
// Create Opik span for each tool call
if (step.toolCalls && step.toolCalls.length > 0) {
for (const toolCall of step.toolCalls) {
const toolSpan = trace?.span({
name: `tool:${toolCall.toolName}`,
type: "tool",
input: { args: toolCall.args },
});
toolSpan?.end({ result: /* tool output */ });
}
}
},
});
We tested Gemini, Claude, and GPT-4 for this pipeline. Gemini's tool selection was the most reliable for our use case – it rarely picked the wrong tool or returned malformed tool calls across 11 different tool schemas. That's why it became the default.
✅ Multimodal attention – Gemini Vision 👁️
When you linger on an image while browsing, the extension tracks your hover duration and confidence score. If you're actually paying attention, Kaizen sends the image as base64-encoded data directly to Gemini for caption generation:
// providers/gemini.ts โ Multimodal content formatting
private formatUserContent(content: LLMMessageContent) {
return content.map((part) => {
if (part.type === "image") {
return {
type: "image" as const,
image: `data:${part.mimeType};base64,${part.data}`,
};
}
return { type: "text" as const, text: part.text };
});
}
This means the Chat Agent can later tell you "you were looking at a diagram of TCP handshakes" instead of just "you visited a networking article." The image summaries + text summaries together form Kaizen's memory layer.
✅ Quiz generation from real attention 📝
When you hit "Generate Quiz," a pg-boss background job fires. The Quiz Agent pulls your recent attention data, feeds it to Gemini at temperature: 0.9, and generates 10 multiple-choice questions based on what you've been reading. A content hash prevents duplicate questions across sessions. The quiz stays valid for 24 hours.
This is probably the feature I'm most proud of. Passive reading becomes active recall, and you didn't have to do anything extra. You just browsed normally, and now there's a quiz waiting for you. ๐ฏ
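For the curious, the dedup idea can be sketched in a few lines. These helper names and shapes are hypothetical (the real pipeline runs inside the pg-boss job), but the mechanism is the same: hash the content the quiz came from, and skip generation while a quiz with that hash is still within its 24-hour window:

```typescript
// Rough sketch of quiz deduplication via content hashing (assumed shapes,
// not the actual Quiz Agent code).
import { createHash } from "node:crypto";

interface QuizRecord {
  contentHash: string;
  createdAt: number; // epoch ms
}

const QUIZ_TTL_MS = 24 * 60 * 60 * 1000; // quizzes stay valid for 24 hours

// Normalize then hash the attention snippets the quiz was built from,
// so the same content never produces a second quiz.
function contentHash(snippets: string[]): string {
  const normalized = snippets.map((s) => s.trim().toLowerCase()).sort().join("\n");
  return createHash("sha256").update(normalized).digest("hex");
}

function shouldGenerateQuiz(snippets: string[], existing: QuizRecord[], now: number): boolean {
  const hash = contentHash(snippets);
  // Generate only if no unexpired quiz exists for this exact content
  return !existing.some(
    (q) => q.contentHash === hash && now - q.createdAt < QUIZ_TTL_MS
  );
}
```

Normalizing before hashing means trivial reorderings or whitespace differences in the captured snippets don't defeat the dedup.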
✅ Focus Guardian – the self-learning nudge engine 🛡️
The Focus Guardian runs autonomously, analyzing your last 15 minutes of activity. Here's what actually happens in the decision loop (from focus-agent.ts):
// agent/focus-agent.ts โ Focus Guardian Decision
const prompt = `${promptData.content}
RECENT ACTIVITY (last 15 minutes):
- Domains visited: ${context.recentDomains.join(", ")}
- Number of different sites: ${context.domainSwitchCount}
- Average time per page: ${Math.round(context.averageDwellTime / 1000)}s
- Social media/entertainment time: ${Math.round(context.socialMediaTime / 1000)}s
- Reading time (estimated): ${Math.round(context.readingTime / 1000)}s
- Has active focus: ${context.hasActiveFocus ? `Yes (${context.focusTopics.join(", ")})` : "No"}
USER FEEDBACK HISTORY:
- False positive rate: ${(feedback.falsePositiveRate * 100).toFixed(0)}%
- Acknowledged rate: ${(feedback.acknowledgedRate * 100).toFixed(0)}%
- Sensitivity setting: ${sensitivity}`;
const response = await provider.generate({
messages: [{ role: "user", content: prompt }],
});
Nudge types: doomscroll, distraction, break, focus_drift, encouragement, and all_clear. There's a configurable cooldown between nudges so it never feels like nagging.
The system self-calibrates. Every nudge records whether you acknowledged it, dismissed it, or marked it as a false positive:
// Sensitivity auto-adjustment from user feedback
if (response === "false_positive") {
newSensitivity = Math.max(0.1, newSensitivity - 0.05); // fewer nudges
} else if (response === "acknowledged") {
newSensitivity = Math.min(0.9, newSensitivity + 0.02); // nudge was helpful
}
Over time, the agent learns your patterns. If it keeps getting it wrong, it backs off. If it's on point, it stays the course.
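The calibration loop above can be simulated end-to-end. The constants match the snippet earlier in this section; everything else (the function name, the dismissed branch) is an illustrative assumption:

```typescript
// Sketch of the self-calibrating sensitivity loop. The 0.05/0.02 steps
// and 0.1/0.9 clamps mirror the snippet above; names are illustrative.
type NudgeResponse = "false_positive" | "acknowledged" | "dismissed";

function adjustSensitivity(sensitivity: number, response: NudgeResponse): number {
  if (response === "false_positive") {
    return Math.max(0.1, sensitivity - 0.05); // back off: fewer nudges
  }
  if (response === "acknowledged") {
    return Math.min(0.9, sensitivity + 0.02); // nudge was helpful
  }
  return sensitivity; // plain dismissals leave sensitivity unchanged
}

// Simulate a user who keeps flagging false positives: sensitivity
// floors out at 0.1 instead of drifting toward zero or negative.
let s = 0.5;
for (let i = 0; i < 20; i++) s = adjustSensitivity(s, "false_positive");
```

The asymmetric step sizes are the interesting design choice: backing off (0.05) is faster than ramping up (0.02), so an annoyed user gets relief quickly while trust is rebuilt slowly.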
The Tech Stack ⚙️
Everything runs on a TypeScript monorepo (pnpm workspaces):
kaizen/
├── apps/
│   ├── api/         # Hono backend – agents, data ingestion, SSE
│   ├── extension/   # Plasmo browser extension – attention sensors
│   └── web/         # Next.js dashboard – analytics, chat, settings
├── packages/
│   ├── api-client/  # Shared typed API client
│   └── ui/          # Shared component library
└── docker-compose.yml
| Layer | What we used |
|---|---|
| Runtime | Node.js 22+ |
| Backend | Hono v4.6.14, Prisma ORM v6.2.1, PostgreSQL 16 |
| Job Queue | pg-boss v12 (single-concurrency, resource-aware) |
| Real-time | Custom SSE (Server-Sent Events) for cross-device sync |
| Auth | Clerk v1.21.4 (web), device token handshake (extension) |
| AI | Google Gemini via Vercel AI SDK v6.0.77 (@ai-sdk/google + @google/genai) |
| Observability | Comet Opik v1.0.6 – tracing, prompt library, anonymizers |
| Extension | Plasmo, React, TypeScript |
| Dashboard | Next.js 15, Tailwind CSS, Lucide Icons |
| Encryption | AES-256-GCM for API key storage |
Attention sensors 📡
The extension runs separate monitors for different content types:
| Sensor | File | What it tracks |
|---|---|---|
| 📖 Text | `monitor-text.ts` | Paragraphs read, words processed, reading progress, sustained attention duration |
| 🖼️ Image | `monitor-image.ts` | Hover duration, confidence score – triggers Gemini Vision for caption generation |
| 🎧 Audio | `monitor-audio.ts` | Playback duration, active listening time |
| 📺 YouTube | background scripts | Watch time, captions ingestion, video context |
Each sensor generates a confidence score (0โ100) based on hover duration, scroll velocity, and viewport position. A quick skim doesn't count as learning. Sustained attention does.
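A plausible shape for that score, heavily simplified and entirely hypothetical (the real sensors weigh more signals than this), is: sustained, slow, on-screen engagement scores high; fast fly-bys score near zero:

```typescript
// Hypothetical sketch of an attention confidence score in 0–100.
// Signal names and weights are illustrative, not Kaizen's actual model.
interface AttentionSignals {
  dwellMs: number;          // how long the element held attention
  scrollPxPerSec: number;   // fast scrolling suggests skimming
  viewportCoverage: number; // 0..1, how visible/centered the element was
}

function confidenceScore(sig: AttentionSignals): number {
  // Dwell saturates at ~30s: anything longer counts as full engagement.
  const dwell = Math.min(sig.dwellMs / 30_000, 1);
  // Scroll-velocity penalty: 0 at rest, approaching 1 when flying past.
  const skimPenalty = Math.min(sig.scrollPxPerSec / 1_000, 1);
  const raw = dwell * (1 - 0.6 * skimPenalty) * sig.viewportCoverage;
  return Math.round(100 * Math.max(0, Math.min(1, raw)));
}
```

Under this sketch, a 40-second read at rest in the middle of the viewport scores 100, while a two-second scroll-past scores in the low single digits – which is exactly the skim-versus-learning distinction the sensors are after.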
Database Schema 🗄️
// Core attention tracking
TextAttention → text, wordsRead, confidence, timestamp
ImageAttention → src, alt, hoverDuration, summary (AI-generated)
AudioAttention → playbackDuration, activeTime
YoutubeAttention → captions, activeWatchTime
// Agentic features
Focus → item, keywords[], isActive, lastActivityAt
AgentNudge → type, message, confidence, reasoning, response
Pulse → userId, message (short nudges)
// Quiz system
Quiz → questions (JSON), contentHash (deduplication)
QuizAnswer → selectedIndex, isCorrect
QuizResult → totalQuestions, correctAnswers
// User settings (encrypted API keys)
UserSettings → geminiApiKeyEncrypted, llmProvider, llmModel
Real-time SSE events 📡
Custom Server-Sent Events sync state across browser extension + dashboard:
- `pomodoro-tick` – Timer updates
- `chat-message-created/updated` – Chat streaming
- `active-tab-changed` – Tab context sync
- `focus-changed` – Focus session state
- `settings-updated` – Cross-device settings sync
- `pulses-updated` – Nudge notifications
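On the wire, SSE is just newline-delimited text, which is part of why it's so easy to fan out across devices. A minimal encoder for events like these might look as follows (an illustrative sketch, not Kaizen's actual SSE layer):

```typescript
// Minimal SSE frame encoder. Per the SSE format, each frame is an
// "event:" line, one "data:" line per payload line, then a blank line.
function encodeSSE(event: string, data: unknown): string {
  const payload = JSON.stringify(data);
  // Multi-line payloads need "data: " prefixed on every line
  const dataLines = payload.split("\n").map((line) => `data: ${line}`).join("\n");
  return `event: ${event}\n${dataLines}\n\n`;
}

// e.g. encodeSSE("pomodoro-tick", { remaining: 1499 })
```

On the client, a standard `EventSource` with `addEventListener("pomodoro-tick", …)` would receive exactly these named events, which keeps the extension and dashboard handlers symmetric.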
Observability with Opik 🔭
We integrated Comet Opik for full observability across the entire agent system. This turned out to be one of the best decisions we made, as you can't evaluate what you can't see.
What we instrumented
🔍 Tracing – Every LLM call, every tool invocation, every agent decision is traced end-to-end. Traces are grouped by thread ID so you can follow the full decision flow:
// telemetry.ts โ Opik Trace Hierarchy
const trace = client.trace({
name: options.name,
input: options.input ? anonymizeInput(options.input) : undefined,
metadata: { ...options.metadata, environment: process.env.NODE_ENV },
tags: options.tags || ["kaizen"],
threadId: options.threadId,
});
// Nested spans for each step
const span = trace.span({
name: "tool:get_attention_data",
type: "tool",
input: anonymizeInput({ args: toolCall.args }),
});
span.update({ output: processedOutput, endTime: new Date() });
The resulting trace hierarchy looks like:
Trace: chat-agent
├── Span: streamText [type: llm]
│   ├── Span: tool:get_active_website [type: tool]
│   ├── Span: tool:get_attention_data [type: tool]
│   └── Span: tool:search_browsing_history [type: tool]
└── Span: followUp-streamText [type: llm]
📚 Prompt Library – All 11 system prompts live in Opik under named entries, fetched fresh on every call with local fallbacks:
// prompt-provider.ts โ Opik-first, local fallback
export async function getPromptWithMetadata(name: PromptName) {
if (isOpikPromptsEnabled()) {
const opikPrompt = await getPromptFromOpik(name);
if (opikPrompt?.content) {
return { content: opikPrompt.content, source: "opik",
promptVersion: opikPrompt.commit };
}
}
return { content: LOCAL_PROMPT_MAP[name], source: "local" };
}
This let us iterate on prompts without redeploying code. We'd see a bad nudge in a trace, tweak the prompt in Opik's dashboard, and the fix was live immediately.
🔒 Anonymizers – Before anything gets logged to Opik, we strip PII using @cdssnc/sanitize-pii:
// anonymizer.ts โ PII Protection
function isSensitiveKey(key: string): boolean {
const sensitivePatterns = [
/^userId$/i, /password/i, /secret/i,
/token/i, /api[_-]?key/i, /auth/i,
/credential/i, /private[_-]?key/i,
];
return sensitivePatterns.some((pattern) => pattern.test(key));
}
// User inputs โ anonymized. LLM outputs โ preserved for debugging.
export function anonymizeInput<T>(data: T): T { return anonymizeData(data); }
export function anonymizeOutput<T>(data: T): T { /* only redact sensitive keys */ }
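The recursive walk behind that redaction can be sketched as follows. This is a simplified stand-in (the real anonymizer also runs PII pattern sanitization on string values via @cdssnc/sanitize-pii, which this sketch omits):

```typescript
// Simplified recursive sensitive-key redaction, mirroring the patterns
// above. Illustrative only; string-value PII scrubbing is omitted.
const SENSITIVE_PATTERNS = [
  /^userId$/i, /password/i, /secret/i, /token/i,
  /api[_-]?key/i, /auth/i, /credential/i, /private[_-]?key/i,
];

function matchesSensitiveKey(key: string): boolean {
  return SENSITIVE_PATTERNS.some((p) => p.test(key));
}

function redact(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(redact); // recurse into arrays
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [k, v] of Object.entries(value as Record<string, unknown>)) {
      out[k] = matchesSensitiveKey(k) ? "[REDACTED]" : redact(v); // recurse into objects
    }
    return out;
  }
  return value; // primitives pass through untouched
}
```

Redacting by key name rather than by value means a leaked `apiKey` is caught even when its value doesn't match any known secret format.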
🛡️ Guardrails – Agents validate inputs before tool execution. The Focus Guardian only fires a nudge when confidence exceeds a dynamically adjusted threshold. The Chat Agent validates tool arguments before running Prisma queries.
Why Opik mattered 🎯
Early on, the Focus Guardian was nudging people during legitimate deep dives. Someone would be reading a 30-minute technical article, and the agent would flag it as distraction because the domain switching pattern looked similar to aimless browsing.
Without Opik, we'd have said "the AI is dumb" and guessed at fixes.
With tracing, we could pull up the exact decision chain: here's the 15 minutes of context the agent saw, here's the domain switch count, here's the confidence score, here's the prompt, here's the output. The problem was obvious – the prompt didn't have a strong enough signal for sustained single-topic browsing. We tweaked the prompt in Opik, the fix deployed without a code change, and false positives dropped.
That cycle – trace the failure → find the root cause → fix the prompt → verify in production – happened dozens of times.
What worked well ✅
- Tool-calling was reliable. We tested Gemini, Claude, and GPT-4 for our agent pipelines, and Gemini's structured output parsing was the most consistent for our use case. The Chat Agent makes autonomous tool selections across 11 different tools, and Gemini rarely picked the wrong one or returned malformed tool calls. This is why it became our system default.
- The million-token context window was a real advantage. Gemini `3.1-flash` and `3.1-flash-lite` both support 1M-token context windows. For the Focus Agent's clustering loop, which sometimes processes hours of attention data across many topics, having that headroom meant we didn't have to aggressively truncate context. We could pass in a richer activity history and get better clustering decisions.
- Temperature control behaved predictably. From `0.1` for binary decisions to `0.9` for quiz generation, Gemini responded consistently. At `0.1` it was disciplined; at `0.9` it got creative without going off the rails. That predictability across the full range was a real win.
- Multimodal input worked out of the box. We send base64-encoded images directly to Gemini for caption generation. The quality of image descriptions was good enough that the Chat Agent could later reference them meaningfully. No separate vision pipeline needed.
- The model fetcher dynamically discovers new models. We use `@google/genai` to fetch the live model list from the Gemini API (filtering for `generateContent` support), with sorting priority baked in for the gemini-series family, so when a new model lands, Kaizen picks it up automatically.
Research 📚
We don't remember things just because we saw them. We remember them when we bring them back to mind. A small, well-timed reminder can turn a passing moment online into something that actually sticks. And it works better when the reminder supports your intention rather than trying to control your behavior. When your browser can quietly keep track of the ideas you spent time on and surface them again when you need them, the pressure of "trying to hold everything in your head" eases up.
This is especially supportive for people with ADHD, where working memory and task switching can feel heavy, and for people who experience early memory decline, where gentle spaced recall helps keep learning active. Kaizen helps keep the thread. Small nudges, quick check-ins, and context that stays with you, so you don't have to start from scratch every time you return to a thought.
- Distractibility trait linked to ADHD
- Fighting ADHD along the hustle culture: How can employees keep their mental health in check
- Study of Internet addiction in children with attention-deficit hyperactivity disorder and normal control
- ADHD Youth and Digital Media Use
- Attention Spans โ Podcast (APA)
- Digital Distractions (ADDitude Magazine โ tag archive)
- What if the Attention Crisis Is All a Distraction? (The New Yorker)
- Distraction fatigue vs ADHD: How technology is reshaping our attention spans
- Retrieval practice helps strengthen memory by actively recalling information (FPSYG, 2019)
- Spaced retrieval improves retention by revisiting information over time (Mavmatrix, art. 1160)
- Small external cues and short recall prompts support working memory in ADHD
- Gentle reminders and cueing tools help reduce memory load in dementia (NCBI, 2017)
- Retrieval reactivates and stabilizes memory traces when spaced and repeated
- Everyday memory aids help maintain independence and reduce strain in dementia
Kaizen keeps attention anchored to meaning, not effort! ✨
Challenges we ran into 🤔
We did run into a few challenges along the way. Since we were working from different time zones, coordinating calls and staying in sync took some extra effort. Most of our collaboration happened asynchronously, which meant we had to be very clear about decisions and hand-offs.
On the technical side, figuring out what "real attention" meant was something we had to refine multiple times. We experimented with how much weight to give scroll patterns, mouse movement, viewport position, and reading pace so that quick skims didn't count as learning. Handling different types of content also took care, especially images and YouTube videos, since the context needed to stay meaningful, not noisy.
Some of the other challenges we faced include:
- Multi-turn coherence degraded over long conversations. After 10+ turns with interleaved tool calls, the Chat Agent would sometimes lose track of earlier context or repeat information. We partially fixed this by injecting a conversation summary into the system prompt, but it meant extra token usage. Not unique to Gemini, but noticeable.
- Streaming with tool calls needed careful handling. When Gemini decides mid-stream to call a tool, the handoff between text chunks and tool-call events required state management in our SSE layer. The Vercel AI SDK abstracted most of it, but edge cases (tool call at the very start, multiple rapid tool calls) needed explicit handling.
- Occasional overconfidence in Focus Guardian decisions. At `temperature: 0.1`, when the Focus Guardian is wrong, it's confidently wrong. A few times it classified focused research (lots of Stack Overflow tabs) as aimless browsing. The fix was better prompting + the feedback loop, not a model change.
What we learned 📚
Proper sleep is very important! 😴
Well, a lot of things, on both the technical and non-technical sides. We learned that it's one thing to get the AI features working, and another to make them feel good while someone is actually browsing. Most of our time went into small details: when to nudge, when to stay quiet, how to store attention history without slowing down the browser, and how to keep things calm instead of distracting. Shipping Kaizen from a barebones idea into something stable took a lot of iteration, testing, and rethinking. It reminded us that real products are built in the tiny decisions, not the big demos! 🤗
There are a few more items that I'd love to share with the community,
- 🌱 Building for attention requires restraint. The hardest design decisions weren't technical. They were about when not to act. Our early Focus Guardian nudged aggressively; it mostly felt like a backseat driver. The lesson: if your tool annoys people, they'll uninstall it. Being right isn't enough; you have to be right at the right moment.
- 🌱 Agents need structure, not freedom. We initially gave the Chat Agent broad instructions. The results were inconsistent. What worked was constraining each agent to a narrow job with specific tools and clear decision boundaries. The Focus Guardian doesn't chat. The Chat Agent doesn't nudge. In short: Specialization + Coordination > Generalization.
- 🌱 Observability isn't optional for agent systems. Without Opik traces, we'd still be guessing why nudges misfired. We stopped treating the AI as a black box and started treating it like any other system component with logs and metrics.
- 🌱 The real product is the quiet moments. Nobody remembers the quiz that worked perfectly. They remember the time the extension stayed silent for 45 minutes during a Wikipedia deep dive they genuinely cared about, and then gently reminded them about the assignment they'd originally set out to work on. Getting those moments right took dozens of prompt iterations and hundreds of traced decisions.
- 🌱 Gemini as a default provider was the right call. After benchmarking all three providers, Gemini's combination of reliable tool-calling, 1M context window, and consistent temperature behavior made it the best fit. Our system makes potentially dozens of Gemini calls per user per hour – attention summaries, focus clustering, guardian checks – and reliability at that volume mattered more than peak performance on any single call.
What's next? 🚀
We're continuing to develop Kaizen and planning the next release cycle:
- ⏰ Spaced repetition – surfacing what you read at the moment you're most likely to forget it
- 🕸️ Topic relationship mapping – showing how things you learn connect across sessions
- ⚡ Better batching – optimized Gemini call grouping during long browsing sessions
- 📤 Export to note-taking tools – so learning doesn't stay trapped in the extension
- 👥 Shareable study threads – lightweight collaboration for shared focus sessions
For Gemini specifically, we're interested in structured output (JSON mode) for agent responses. Right now we parse freeform text from several agent pipelines, and guaranteed JSON would let us simplify those parsing layers.
End notes 🙏🏻
As CS students who struggle with ADHD, we primarily built Kaizen because we needed it ourselves. Traditional blockers felt like punishment. Our New Year's resolution was to build the tool we wished existed: something that doesn't lock you out, doesn't judge, just watches where your attention goes, learns your patterns, and gently, continuously helps you get better. That's what kaizen (改善) stands for: continuous improvement.
Huge thanks to DEV and MLH for hosting this writing challenge, and to the Google Gemini team for building models that actually hold up under real multi-agent workloads! 🙌