Muhammad Ali Nasir
I Built an AI System That Makes 4 Agents Debate Scientific Papers, and Then Tells You Where They Disagree

How GraphRAG + a multi-LLM council produces more trustworthy answers than any single AI model


There is a quiet crisis in AI-assisted research that nobody talks about.

Every tool you've used — ChatGPT, Perplexity, Copilot, Elicit — does the same thing: it reads papers and gives you one confident answer.

The problem is that science doesn't work that way.

Take BRCA1's role in triple-negative breast cancer. Ask any AI tool and you'll get a confident, well-written paragraph. What you won't get is this:

  • Study A says BRCA1 mutations are associated with increased aggressiveness
  • Study B says the same patients show better response rates to chemotherapy
  • Study C shows shorter progression-free survival despite better initial response
  • Three studies report conflicting BRCA1 expression levels

These aren't edge cases. These are the real contradictions buried across 200 papers that a researcher needs to know before designing an experiment, filing an IND, or trusting a conclusion.

A single AI model smooths these over. It synthesizes them into a confident answer. And in doing so, it hides exactly the information a scientist needs most.

This is why I built Research Council.


The Core Idea: Deliberation Over Confidence

The insight that drove this project came from an unlikely place: Andrej Karpathy's llm-council repo — a simple Saturday hack that, instead of asking one LLM a question, routes the question to multiple LLMs and has them review each other's answers.

The key insight: cross-model review catches things a single model misses.

I wanted to take this further. What if instead of generic LLMs reviewing each other, you had specialized agents — each trained to look at the evidence from a fundamentally different angle — deliberating over a structured knowledge graph of papers?

That's Research Council.


What It Actually Does

When you ask Research Council a research question, here's what happens:

Step 1: The Knowledge Graph

Before you ask anything, papers are ingested from PubMed, arXiv, Semantic Scholar, or uploaded as PDFs. Research Council doesn't just chunk them into vectors. It builds a Neo4j knowledge graph:

  • Nodes: Paper, Gene, Drug, Disease, Protein, Pathway, Author, Conclusion
  • Relationships: CONTRADICTS, SUPPORTS, CITES, MENTIONS, STUDIES, TARGETS

This is the critical difference from standard RAG. Traditional RAG asks: "which chunks are semantically similar to my query?" GraphRAG asks: "what is the structural relationship between the entities in my query?"

When you ask about BRCA1 and TNBC, the graph doesn't just return the most similar text chunks — it returns the neighborhood of BRCA1: every paper that mentions it, every drug that targets related proteins, and critically, every paper that contradicts another paper on the topic.
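A neighborhood query of this kind can be sketched in Cypher. The node labels and relationship types below follow the schema listed above, but the property names (`symbol`, `id`) and the exact query shape are assumptions, not the project's actual code:

```python
# Sketch of the neighborhood query GraphRAG could run against Neo4j.
# Labels/relationships match the schema above; property names are assumed.
def neighborhood_query(gene_symbol: str) -> tuple[str, dict]:
    """Build a Cypher query for the 1-hop neighborhood of a gene,
    including CONTRADICTS edges between papers that mention it."""
    cypher = """
    MATCH (g:Gene {symbol: $symbol})<-[:MENTIONS]-(p:Paper)
    OPTIONAL MATCH (p)-[c:CONTRADICTS]-(q:Paper)
    OPTIONAL MATCH (d:Drug)-[:TARGETS]->(:Protein)<-[:STUDIES]-(p)
    RETURN g,
           collect(DISTINCT p) AS papers,
           collect(DISTINCT [p, c, q]) AS contradictions,
           collect(DISTINCT d) AS drugs
    """
    return cypher, {"symbol": gene_symbol}

query, params = neighborhood_query("BRCA1")
# With the official neo4j Python driver this would run as:
#   with driver.session() as session:
#       records = session.run(query, **params)
```

The point is structural: the `CONTRADICTS` edges come back alongside the papers themselves, so contradictions are first-class retrieval results rather than something a model has to infer from chunk text.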

Step 2: Token-Efficient Context Assembly

Here's a technical detail that matters a lot at scale.

Naive multi-agent systems load every available tool into the prompt upfront. With 50+ tools, that's 25,000+ tokens before the agent does anything useful.

Research Council uses langgraph-bigtool: tools are embedded with SentenceTransformers at startup and retrieved semantically at query time. Only 2-4 relevant tools are loaded per query.

The result: a full 4-agent deliberation on a complex biomedical question uses 3,118 tokens total. About $0.002.
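The retrieval step can be illustrated with a toy version. Here a bag-of-words cosine stands in for the SentenceTransformers embeddings that langgraph-bigtool actually uses, and the tool names and descriptions are hypothetical:

```python
# Minimal stand-in for semantic tool retrieval: embed each tool description
# once at startup, then load only the top-k tools closest to the query.
# A bag-of-words cosine replaces the real SentenceTransformers embeddings.
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

TOOLS = {  # hypothetical tool names and descriptions
    "pubmed_search": "search pubmed for biomedical papers by keyword",
    "graph_neighborhood": "fetch the knowledge graph neighborhood of a gene or drug",
    "citation_lookup": "resolve a pmid to full citation metadata",
    "arxiv_search": "search arxiv preprints in physics and cs",
}
TOOL_VECS = {name: embed(desc) for name, desc in TOOLS.items()}  # at startup

def retrieve_tools(query: str, k: int = 2) -> list[str]:
    qv = embed(query)
    ranked = sorted(TOOL_VECS, key=lambda n: cosine(qv, TOOL_VECS[n]), reverse=True)
    return ranked[:k]

top = retrieve_tools("which papers mention the BRCA1 gene in the knowledge graph")
```

Only the selected 2-4 tool schemas enter the prompt, which is where the token savings come from.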

Step 3: The Council Deliberates

Four specialized agents receive the same knowledge graph subgraph and analyze it in parallel via asyncio.gather():

🔬 Evidence Agent

"What does the data actually show? Be precise about sample sizes, study types, and effect sizes. Never speculate beyond what the data shows."

⚔️ Skeptic Agent

"Find the weaknesses: biased study designs, underpowered samples, conflicting results, publication bias. Be constructively critical."

🔗 Connector Agent

"Find non-obvious links — drug repurposing opportunities, analogous mechanisms from other diseases, techniques from adjacent fields."

📋 Methodology Agent

"Evaluate whether experimental designs are appropriate, controls are adequate, statistical methods are sound, and whether conclusions are justified by the methods used."

Each agent produces an independent response. Then the real work begins.
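The fan-out above can be sketched with stdlib asyncio; `call_llm` here is a placeholder for the real Groq client call:

```python
# Sketch of Stage 1: the four role prompts run concurrently over the same
# subgraph context via asyncio.gather(). call_llm is a stand-in for the
# actual async Groq chat-completion call.
import asyncio

ROLES = ["evidence", "skeptic", "connector", "methodology"]

async def call_llm(role: str, context: str) -> dict:
    await asyncio.sleep(0)  # placeholder for network I/O
    return {"agent": role, "answer": f"[{role} analysis of context]"}

async def run_council(context: str) -> list[dict]:
    tasks = [call_llm(role, context) for role in ROLES]
    return await asyncio.gather(*tasks)  # all four agents run concurrently

responses = asyncio.run(run_council("BRCA1/TNBC subgraph ..."))
```

Because the four calls are independent, wall-clock latency is roughly that of the slowest single agent rather than the sum of all four.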

Step 4: 12 Cross-Reviews

Every agent reviews every other agent's response — anonymized, to prevent model bias. That's 4 × 3 = 12 peer evaluations.

Each review produces:

  • An agreement score (0.0 to 1.0)
  • Specific points of disagreement
  • Constructive critique

The aggregate agreement score becomes a signal for confidence. High agreement → higher confidence. Persistent disagreement → lower confidence, and the Chairman must explain why the agents disagreed.
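One way this aggregation could work, sketched with an illustrative threshold (the actual aggregation and cutoff in Research Council may differ):

```python
# Sketch: fold 12 cross-reviews into one agreement signal via the mean,
# with low agreement forcing the Chairman to explain the disagreement.
# The 0.7 threshold is illustrative, not the project's.
from statistics import mean

def aggregate_agreement(reviews: list[dict]) -> float:
    """reviews: one dict per cross-review, each with 'agreement' in [0, 1]."""
    return mean(r["agreement"] for r in reviews)

def needs_explanation(agreement: float, threshold: float = 0.7) -> bool:
    # below the threshold the Chairman must surface why agents disagreed
    return agreement < threshold

reviews = [{"agreement": a} for a in
           [0.9, 0.8, 0.85, 0.7, 0.9, 0.75, 0.8, 0.8, 0.7, 0.85, 0.8, 0.75]]
score = aggregate_agreement(reviews)  # 12 reviews for a 4-agent council
```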

Step 5: The Chairman Synthesizes

A Chairman agent (running on OpenRouter with the best available model) receives all four original responses plus all twelve cross-reviews. It produces:

{
  "summary": "...",
  "confidence_score": 0.65,
  "key_findings": ["...", "..."],
  "contradictions": ["...", "..."],
  "citations": [{"claim": "...", "paper_id": "PMID:..."}],
  "methodology_notes": "...",
  "agent_agreement": 0.80
}


Notice the confidence score is 0.65, not 0.95. That's intentional. The system doesn't inflate confidence. If the evidence is contested, the score reflects that.
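Since the Chairman emits structured JSON, it is worth validating before anything downstream trusts it. A minimal sketch with stdlib dataclasses, mirroring the field names above (the range checks are my assumption, not necessarily what the project does):

```python
# Sketch: parse and sanity-check the Chairman's JSON output.
# Field names mirror the schema above; the validation rules are illustrative.
import json
from dataclasses import dataclass

@dataclass
class Synthesis:
    summary: str
    confidence_score: float
    key_findings: list
    contradictions: list
    citations: list
    methodology_notes: str
    agent_agreement: float

def parse_synthesis(raw: str) -> Synthesis:
    s = Synthesis(**json.loads(raw))
    assert 0.0 <= s.confidence_score <= 1.0, "confidence out of range"
    assert 0.0 <= s.agent_agreement <= 1.0, "agreement out of range"
    return s

raw = json.dumps({
    "summary": "Contested evidence on BRCA1 in TNBC.",
    "confidence_score": 0.65,
    "key_findings": ["..."],
    "contradictions": ["aggressiveness vs. prognosis"],
    "citations": [{"claim": "...", "paper_id": "PMID:12345"}],
    "methodology_notes": "Several studies underpowered.",
    "agent_agreement": 0.80,
})
synthesis = parse_synthesis(raw)
```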

Step 6: The Write-Back Loop

Every conclusion the Chairman produces gets written back to Neo4j as a new Conclusion node — linked to every paper it references. The graph compounds over time. Each query makes future queries more informed.
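The write-back can be sketched as a single Cypher statement that creates the `Conclusion` node and links it to its source papers. Labels follow the schema above; the property names (`id`, `summary`, `confidence`) and the `CITES` link from conclusion to paper are assumptions for illustration:

```python
# Sketch of the write-back step: each Chairman synthesis becomes a
# Conclusion node linked to every paper it cites. Property names assumed.
def writeback_query(conclusion_id: str, summary: str, confidence: float,
                    paper_ids: list[str]) -> tuple[str, dict]:
    cypher = """
    MERGE (c:Conclusion {id: $id})
    SET c.summary = $summary, c.confidence = $confidence
    WITH c
    UNWIND $paper_ids AS pid
    MATCH (p:Paper {id: pid})
    MERGE (c)-[:CITES]->(p)
    """
    return cypher, {"id": conclusion_id, "summary": summary,
                    "confidence": confidence, "paper_ids": paper_ids}

query, params = writeback_query("concl-001", "BRCA1 evidence is contested.",
                                0.65, ["PMID:12345", "PMID:67890"])
# Executed via the neo4j driver:  session.run(query, **params)
```

Using `MERGE` rather than `CREATE` keeps re-runs idempotent, so replaying a query doesn't duplicate conclusion nodes.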


The Result

Here's the actual output on "Are there contradictions in BRCA1's role in triple-negative breast cancer?":

Confidence: 65%

Not 95%. Not "based on multiple sources." A calibrated 65% because the agents genuinely disagreed on two points and the methodology agent flagged three studies as underpowered.

4 Contradictions surfaced:

  • BRCA1 mutations associated with both increased tumor aggressiveness AND better prognosis
  • Higher treatment response rates but shorter progression-free survival
  • Conflicting reports on BRCA1 expression levels across studies
  • Variable associations between BRCA1 mutations and TNBC significance

6 Key findings, each cited to a specific PubMed ID.

8 Methodology concerns — variable TNBC definitions, selection bias, small sample sizes, retrospective designs.

Agent agreement: 80% — two agents disagreed on whether the survival paradox was explained by tumor heterogeneity or methodological inconsistency.

Compare this to what ChatGPT gives you: a confident, well-written paragraph that mentions none of the contradictions.


The Architecture

Researcher Query
      │
      ▼
GraphRAG Layer
  Neo4j: entities + relationships
  ChromaDB: vector embeddings (CPU-only, MiniLM)
  Hybrid retrieval: ~2,000 token context
      │
      ▼
LangGraph Orchestrator
  BigTool: 2-4 tools loaded dynamically
  Hybrid retrieval node
  Context assembly node
      │
      ▼
LLM Council (Groq + OpenRouter)
  Stage 1: 4 parallel agents
  Stage 2: 12 cross-reviews
  Stage 3: Chairman synthesis
      │
      ▼
Answer + Neo4j Writeback
  Confidence · Citations · Contradictions · Provenance

Full stack:

  • LangGraph + langgraph-bigtool (orchestration)
  • Neo4j 5 Community (knowledge graph, local)
  • ChromaDB (vector store, local, CPU)
  • all-MiniLM-L6-v2 (embeddings, 80MB, CPU-only)
  • Groq llama-3.3-70b + llama-3.1-8b (council agents, fast)
  • OpenRouter claude-sonnet (Chairman, best synthesis quality)
  • FastAPI + React + Vite

Hardware requirements: 16GB RAM, 4GB VRAM. No beefy GPU needed. Embeddings run entirely on CPU.


What I Learned Building This

1. The write-back loop is the most underappreciated part

Most RAG systems are stateless. Query in, answer out. Research Council writes every conclusion back to the graph as a new node linked to source papers. After 50 queries, the graph has 50 validated conclusions that inform future answers. This is the difference between a tool and a system that compounds knowledge.

2. Confidence calibration is harder than it sounds

Getting agents to express genuine uncertainty rather than inflated confidence required careful prompt engineering. The current approach — deriving confidence from agent agreement scores — works but isn't theoretically principled. There's real research to be done here.

3. 12 cross-reviews might be overkill

n × (n-1) cross-reviews scales quadratically. With 4 agents that's 12, manageable. With 8 agents that's 56 — too slow. A smarter aggregation strategy (maybe pairwise disagreement sampling) would make larger councils viable.
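One possible shape for that sampling strategy, purely illustrative: enumerate all ordered (reviewer, author) pairs, then cap them at a fixed budget for larger councils.

```python
# Sketch of capped cross-review sampling: full n*(n-1) pairs for small
# councils, a fixed random budget for larger ones. Illustrative only.
import random
from itertools import permutations

def review_pairs(agents: list[str], budget=None, seed: int = 0):
    pairs = list(permutations(agents, 2))  # every (reviewer, author) pair
    if budget is None or budget >= len(pairs):
        return pairs
    return random.Random(seed).sample(pairs, budget)

four = ["evidence", "skeptic", "connector", "methodology"]
eight = four + ["stats", "clinical", "chemistry", "ethics"]

full_4 = review_pairs(four)               # 12 reviews, as in the article
full_8 = review_pairs(eight)              # 56 reviews, too slow
sampled = review_pairs(eight, budget=16)  # capped budget for a larger council
```

A smarter variant would bias the sample toward pairs whose Stage 1 answers diverge most, rather than sampling uniformly.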

4. The skeptic agent is the most valuable

After dozens of test queries, the Skeptic Agent consistently surfaces the most useful information — not because skepticism is inherently valuable, but because existing AI tools have a strong bias toward presenting positive findings. The explicit adversarial role corrects for this.


What's Next

Research Council V1 is live and open source (MIT licensed).

V2 plans:

  • Community detection on the knowledge graph (Louvain clustering)
  • Temporal analysis — track how understanding of a topic evolves year by year
  • MCP server integrations — Zotero, PubMed MCP, Neo4j MCP
  • HuggingFace Space for live demos
  • Export council output as formatted PDF for lab notes

What I'd love help with:

  • Better confidence calibration methodology
  • Async optimization of the cross-review loop
  • Additional paper sources (bioRxiv, ChemRxiv, Europe PMC)
  • Domain-specific agent specializations beyond biomedicine

Try It

GitHub: https://github.com/al1-nasir/Research_council

MIT licensed. Runs on a laptop. Groq has a free tier. OpenRouter costs fractions of a cent per query.

If you work in research, drug discovery, or AI for science — I'd genuinely love to know what question you'd throw at it first.


Built by Muhammad Ali Nasir. Inspired by karpathy/llm-council, extended with domain-specific GraphRAG and biomedical agent specialization.


Tags: Artificial Intelligence · Machine Learning · Biomedical Research · GraphRAG · Open Source · LLM · Python · Drug Discovery · Research Tools
