📋 Table of Contents — click to expand / collapse
Onboarding onto a new codebase takes days — sometimes weeks. Developers waste hours tracing function calls, reading stale docs, and asking colleagues questions that interrupt everyone's flow.
The Codebase Onboarding Agent eliminates that friction. It ingests any Git repository, builds a triple index (semantic vectors + BM25 keyword + call graph), and exposes an autonomous RAG agent that answers precise, citation-verified questions about the codebase in seconds.
This is not a chatbot wrapper around an LLM. It is a production-grade agentic pipeline with:
- 🔬 AST-level chunking — logic is never split mid-function
- 🕸️ Graph-augmented retrieval — understand who calls what
- 🛡️ Hallucination gating — every citation is verified against the index before it reaches the user
- ⚡ Semantic answer caching — repeat questions are served in milliseconds
- 🔄 Webhook auto-sync — the agent updates itself on every GitHub push
| Old Way | This Agent |
|---|---|
| Read docs that may be stale | Answers from the actual, live code |
| Grep and trace calls manually | Graph traversal finds callers/callees instantly |
| Ask a senior dev (interrupt their flow) | Ask the agent — it cites the exact file and line |
| Re-onboard after every big PR | Webhook re-ingests on merge, cache is purged |
| Feature | Description | |
|---|---|---|
| 🤖 | Autonomous Agent Loop | LLM decides which tools to call — no hardcoded router |
| 🔍 | Hybrid Search | Vector (ChromaDB) + Keyword (BM25) fused via Reciprocal Rank Fusion |
| 🕸️ | Call Graph Engine | NetworkX graph of imports and function calls — traversed via BFS (3-hop limit) |
| 🛡️ | Hallucination Guard | Citation validation gates any response with confidence < 4.0 / 10 |
| ⚡ | Semantic Cache | 95% similarity threshold — identical questions skip the LLM entirely |
| 📊 | Mermaid Diagrams | Auto-generates call-graph diagrams that render in the Streamlit UI |
| 🔄 | Webhook Auto-Sync | HMAC-verified GitHub webhooks trigger re-ingestion on every push |
| 🌐 | REST API | FastAPI backend with X-API-Key auth and sliding-window rate limiting |
| 📓 | Ragas Evaluation | Automated faithfulness, relevancy, and recall scoring via golden set |
Developer / GitHub Webhook
|
v
Streamlit App (frontend/)
| HTTP REST
v
FastAPI Router (app/api/)
|
|──[Ingestion Pipeline]─────────────────────────────────┐
| 1. Git Clone / Fetch (app/ingestion/clone.py) |
| 2. File Filter (file_filter.py) |
| 3. Tree-Sitter Parser (app/parsing/) |
| 4. AST Chunker (chunker.py) |
| 5a. Vector Store (ChromaDB) |
| 5b. BM25 Index (app/retrieval/) |
| 5c. Call Graph (NetworkX) |
| |
└──[Chat Agentic Loop]──────────────────────────────────┘
Query ──> Semantic Answer Cache
| (miss)
v
RAG Agent Loop (app/agent/)
|
v
Agent Tools (tools.py)
| |
v v
Hybrid Search Graph Queries
(RRF + Reranker) (app/graph/)
\ /
\ /
v v
Hallucination Guard (confidence.py)
|
v
Cache & Return → API
User / Webhook
|
| POST /ingest
v
API Layer ──── 202 Accepted {job_id}
|
| [Background Task]
v
clone_repo()
|
v
filter_repo_files() ← drops binaries, minified files, images
|
v
parse_file() ← Tree-Sitter AST extraction
|
v
┌──────────────────────────────────────┐
│ [Parallel Indexing] │
│ embed → ChromaDB (semantic) │
│ tokenize → BM25 (keyword) │
│ analyze → NetworkX (call graph) │
└──────────────────────────────────────┘
|
v
metadata_store.update(status="synced")
Key steps:
| Step | What Happens |
|---|---|
| Fast Return | API returns job_id immediately — client is never blocked |
| Safe Fetch | Falls back to bundled dummy repo if network is unavailable |
| Filtering | Drops binaries, minified JS, images — saves LLM tokens |
| AST Chunking | Splits at class/function boundaries — context never chopped mid-logic |
| Triple Indexing | Semantic (ChromaDB) + Keyword (BM25) + Relational (NetworkX) simultaneously |
User Question
|
v
Semantic Cache ──── [HIT: similarity > 0.95] ──── Return cached answer
|
[MISS]
|
v
Agent Loop (max 4 iterations)
|
| think → decide tool → call tool → observe result
v
┌─────────────────────────────────────────┐
│ search_code → Hybrid Search (RRF) │
│ read_file → Raw file contents │
│ get_callers → Graph BFS upstream │
│ get_callees → Graph BFS downstream │
│ generate_diagram → Mermaid output │
└─────────────────────────────────────────┘
|
| [context near limit → compress_older_tool_results()]
v
Final Answer
|
v
Hallucination Guard
|
├── confidence >= 4.0 → cache + return answer
└── confidence < 4.0 → return safe fallback, gated=true
GitHub Push / Merge Event
|
v
POST /webhook
|
| Verify X-Hub-Signature-256 (HMAC)
v
Acquire write lock for repo_id
|
v
Re-run full ingestion pipeline (force_reindex=True)
|
v
invalidate_cache(repo_id) ← purges ALL cached answers for this repo
|
v
metadata_store.update(commit_hash=new_hash)
Why this matters: HMAC-verified webhooks + write-locking + aggressive cache invalidation ensures the agent's answers always reflect the latest merged commit. Stale architectural explanations are never served.
🌐 API Layer — app/api/ & app/main.py
| File | Responsibility |
|---|---|
app/main.py |
FastAPI bootstrap — middleware, exception handlers, router mount |
app/config.py |
Pydantic BaseSettings — single source of truth for env variables |
app/api/router.py |
REST endpoints: /ingest, /status, /chat, /diagram |
app/api/auth.py |
X-API-Key header authentication |
app/api/rate_limiter.py |
Sliding-window memory-based rate limiting on /chat |
⚙️ Ingestion Pipeline — app/ingestion/
| File | Responsibility |
|---|---|
clone.py |
Git clone with size limits; offline fallback to dummy repo |
file_filter.py |
Allowlist: .py, .js, .ts — drops binaries and minified files |
metadata_store.py |
Persists sync state (pending / synced / failed) and commit hashes |
locking.py |
Thread-level locking per repo_id — prevents concurrent ingestion collisions |
🔬 Parsing & Chunking — app/parsing/
| File | Responsibility |
|---|---|
tree_sitter_parser.py |
AST extraction of classes, functions, and imports via Tree-sitter grammars |
chunker.py |
Splits files at logical (class/function) boundaries — never mid-logic |
🗄️ Retrieval & Storage — app/retrieval/
| File | Responsibility |
|---|---|
vector_store.py |
ChromaDB interface — stores and queries chunk embeddings |
bm25_store.py |
Pure-Python BM25 index — catches exact variable/function name matches |
embeddings.py |
SentenceTransformers wrapper for converting text to vectors |
hybrid_search.py |
Reciprocal Rank Fusion (RRF) combining vector + BM25 results |
reranker.py |
Cross-Encoder reranking of top-K hybrid search results |
query_expansion.py |
LLM-generated synonym/keyword expansion before hitting the index |
🕸️ Graph Operations — app/graph/ & app/diagrams/
| File | Responsibility |
|---|---|
builder.py |
Builds a NetworkX directed graph of imports and internal function calls |
queries.py |
Timeout-safe BFS traversal for callers/callees (3-hop limit) |
mermaid_generator.py |
Converts a graph subgraph into a Mermaid diagram string for the frontend |
🤖 Agentic Loop — app/agent/
| File | Responsibility |
|---|---|
loop.py |
Core execution loop — LLM call → tool intercept → tool run → feed back → repeat |
tools.py |
JSON schema definitions + retry logic for all 5 agent tools |
system_prompt.py |
LLM instructions: persona, rules, and strict markdown citation format |
semantic_cache.py |
95% similarity threshold cache — same question, same commit → instant return |
context_manager.py |
Monitors token usage; condenses old tool results via secondary LLM |
confidence.py |
Hallucination Guard — validates every cited file/line; gates below score 4.0 |
🧪 Evaluation Suite — eval/
| File | Responsibility |
|---|---|
run_eval.py |
Ragas automated evaluation: Faithfulness, Answer Relevancy, Context Precision, Context Recall |
compare_runs.py |
Compares new eval runs against previous baselines to detect regressions |
All routes require the X-API-Key header.
// Request
{ "repo_url": "https://github.com/psf/requests", "ref": "main", "force_reindex": false }
// Response — 202 Accepted
{ "job_id": "abc-123", "status": "pending" }{ "sync_status": "synced", "commit_hash": "a1b2c3d", "error": null, "has_circular_dependencies": false }// Request
{ "repo_id": "psf_requests", "question": "How does session handling work?", "session_id": "opt-123" }
// Response
{
"answer": "Session handling works by...",
"sources": [
{ "file_path": "requests/sessions.py", "function_name": "Session.request", "start_line": 400, "end_line": 450 }
],
"confidence_score": 9.5,
"gated": false
}// Request
{ "repo_id": "psf_requests", "entry_point": "Session.request", "direction": "both" }
// Response
{ "mermaid_markdown": "graph TD\n..." }| Method | Path | Purpose |
|---|---|---|
POST |
/ingest |
Start background ingestion of a repository |
GET |
/status/{repo_id} |
Check ingestion / sync status |
POST |
/chat |
Run the agentic RAG query loop |
POST |
/diagram |
Generate a Mermaid call-graph diagram |
The Hallucination Guard (confidence.py) is the most critical safety feature. It parses the final LLM response for markdown citations (e.g. `src/auth.py:10-15`) and validates every single one against the actual index.
Final LLM Answer
|
v
Parse markdown citations (`file:start-end`)
|
v
For each citation:
- Does file exist in index? → No → penalize score
- Are line numbers within bounds? → No → penalize score
|
v
Compute deterministic confidence score (0–10)
|
├── score >= 4.0 → return answer, gated=false
└── score < 4.0 → strip answer, gated=true, return safe fallback
| Tool | Purpose |
|---|---|
search_code |
Hybrid (vector + BM25) search with RRF fusion and Cross-Encoder reranking |
read_file |
Read raw file contents from the indexed repository |
get_callers |
Graph traversal: find functions that call a given function (3-hop limit) |
get_callees |
Graph traversal: find functions called by a given function (3-hop limit) |
generate_diagram |
Produce a Mermaid diagram from a graph subgraph |
| Layer | Technology | Role |
|---|---|---|
| Runtime | Python 3.12 | Core language |
| Backend | FastAPI | REST API server, background tasks, auth |
| Frontend | Streamlit | Developer-facing chat and diagram UI |
| LLM | Groq (LLaMA 3) | Agent reasoning and answer generation |
| Vector Store | ChromaDB | Semantic embedding storage and search |
| Keyword Index | BM25 (pure Python) | Exact variable/function name matching |
| Graph Engine | NetworkX | Call graph — callers, callees, import chains |
| AST Parser | Tree-sitter | Language-aware code chunking |
| Embeddings | SentenceTransformers | Text-to-vector conversion |
| Reranker | Cross-Encoder | Relevance reranking of hybrid search results |
| Evaluation | Ragas | Automated faithfulness and recall scoring |
| Webhooks | HMAC SHA-256 | Secure GitHub push verification |
📦 codebase-onboarding-agent/
│
├── 📂 app/
│ ├── 📂 api/
│ │ ├── 🔐 auth.py ← X-API-Key authentication
│ │ ├── 🚦 rate_limiter.py ← Sliding-window rate limiting
│ │ └── 🛣️ router.py ← /ingest /status /chat /diagram
│ │
│ ├── 📂 agent/
│ │ ├── 💬 loop.py ← Core agentic execution loop
│ │ ├── 🔧 tools.py ← Tool schemas + retry logic
│ │ ├── 📝 system_prompt.py ← LLM instruction set
│ │ ├── ⚡ semantic_cache.py ← 95% similarity answer cache
│ │ ├── 📏 context_manager.py ← Token usage + compression
│ │ └── 🛡️ confidence.py ← Hallucination guard + gating
│ │
│ ├── 📂 ingestion/
│ │ ├── 📥 clone.py ← Git clone + offline fallback
│ │ ├── 🔍 file_filter.py ← Allowlist filtering
│ │ ├── 🗃️ metadata_store.py ← Sync state + commit hash tracking
│ │ └── 🔒 locking.py ← Thread-level repo locks
│ │
│ ├── 📂 parsing/
│ │ ├── 🌳 tree_sitter_parser.py ← AST extraction
│ │ └── ✂️ chunker.py ← Function/class boundary splitting
│ │
│ ├── 📂 retrieval/
│ │ ├── 🧮 vector_store.py ← ChromaDB interface
│ │ ├── 🔡 bm25_store.py ← BM25 keyword index
│ │ ├── 🧬 embeddings.py ← SentenceTransformers wrapper
│ │ ├── 🔀 hybrid_search.py ← RRF fusion
│ │ ├── 🎯 reranker.py ← Cross-Encoder reranking
│ │ └── 💡 query_expansion.py ← LLM synonym expansion
│ │
│ ├── 📂 graph/
│ │ ├── 🏗️ builder.py ← NetworkX graph construction
│ │ └── 🔎 queries.py ← BFS traversal (3-hop limit)
│ │
│ ├── 📂 diagrams/
│ │ └── 📊 mermaid_generator.py ← Graph → Mermaid markdown
│ │
│ ├── ⚙️ config.py ← Pydantic BaseSettings (all env vars)
│ └── 🚀 main.py ← FastAPI bootstrap
│
├── 📂 frontend/
│ └── 🌐 streamlit_app.py ← Chat UI + diagram rendering
│
├── 📂 eval/
│ ├── 🧪 run_eval.py ← Ragas evaluation runner
│ └── 📈 compare_runs.py ← Regression detection
│
├── 📂 tests/ ← Full pytest suite
├── 🔧 .env.example ← Environment variable template
├── 🚀 run_local.bat ← One-command local startup
├── 📋 requirements.txt ← All Python dependencies
└── 📖 README.md
python --version # 3.12 required
git --version # any recent version① Clone the repository
git clone https://github.com/HurairaMaqbool/HurairaMaqbool.git
cd codebase-onboarding-agent② Create and activate a virtual environment
python -m venv .venv
# Windows
.venv\Scripts\activate
# macOS / Linux
source .venv/bin/activate③ Install all dependencies
pip install -r requirements.txt④ Configure environment variables
cp .env.example .envEdit .env:
GROQ_API_KEY="your_groq_api_key"
LLM_PROVIDER="groq"
API_KEY="dev-secret-key"
WEBHOOK_SECRET="your_github_webhook_secret"⑤ Start the system
run_local.bat| Service | URL |
|---|---|
| FastAPI Backend | http://localhost:8000 |
| Streamlit Frontend | http://localhost:8501 |
| API Docs (Swagger) | http://localhost:8000/docs |
# Run the full test suite
pytest tests/
# Run a specific module
pytest tests/test_module_12.py -v
# Run with coverage report
pytest tests/ --cov=app --cov-report=htmlNote:
tests/test_golden_set.pyperforms live API calls against real repositories. It is skipped by default to preserve API quotas — run explicitly when needed.
python eval/run_eval.py # Run full Ragas eval
python eval/compare_runs.py # Compare against previous baselineMetrics tracked: Faithfulness · Answer Relevancy · Context Precision · Context Recall
✅ Phase 1 — Core Pipeline
[x] Git ingestion + AST chunking
[x] Triple indexing (ChromaDB + BM25 + NetworkX)
[x] Hybrid search with RRF + reranker
[x] Autonomous agentic loop (5 tools)
✅ Phase 2 — Safety & Reliability
[x] Hallucination guard with citation validation
[x] Semantic answer cache (95% threshold)
[x] Context compression (token overflow protection)
[x] HMAC-verified webhook auto-sync
🔄 Phase 3 — Evaluation & Observability (in progress)
[ ] Ragas golden set scoring
[ ] LangSmith tracing integration
[ ] Regression detection pipeline
🔮 Phase 4 — Extensions
[ ] VS Code extension
[ ] Multi-repo cross-codebase queries
[ ] Docker containerisation + CI/CD
[ ] Fine-tuned reranker on code-specific data
[ ] Support for Go, Rust, Java (Tree-sitter grammar expansion)
Contributions, ideas, and bug reports are warmly welcomed!
# 1. Fork the repository
# 2. Clone your fork
git clone https://github.com/YOUR-USERNAME/codebase-onboarding-agent.git
# 3. Create a feature branch
git checkout -b feature/your-feature-name
# 4. Make your changes and commit
git add .
git commit -m "feat: describe your change clearly"
# 5. Push and open a Pull Request
git push origin feature/your-feature-nameContribution ideas:
- 🐛 Fix edge cases in
tree_sitter_parser.pyfor multi-language repos - ➕ Add Tree-sitter grammars for Go, Rust, or Java
- 🧪 Expand the Ragas golden set with new Q&A pairs
- 🐳 Add Docker + Docker Compose support
- 📊 Build a LangSmith observability dashboard
This project is distributed under the MIT License.
MIT License — free to use, modify, and distribute with attribution.
See the LICENSE file for full terms.
AI Engineer · LangChain · LangGraph · RAG Pipelines · Agentic Systems
If this project saved you hours of onboarding time or taught you something new — a ⭐ star on GitHub means a lot and helps other developers find it.
