GitHub - HurairaMaqbool/CodeNavigator: AI-powered code navigation tool — intelligently understands, explains, and navigates codebases using machine learning. Built with Python.

📋 Table of Contents

#	Section
1	🧠 About the Project
2	✨ Features at a Glance
3	🏗️ High-Level Architecture
4	⚙️ Workflow Breakdowns
5	📦 Module Reference
6	🔌 API Contracts
7	🛡️ Agentic Gating & Hallucination Guard
8	🛠️ Tech Stack
9	📁 Project Structure
10	🚀 Setup & Local Execution
11	🧪 Testing
12	🛣️ Roadmap
13	🤝 Contributing
14	📄 License
15	👤 Author

🧠 About the Project

Onboarding onto a new codebase takes days — sometimes weeks. Developers waste hours tracing function calls, reading stale docs, and asking colleagues questions that interrupt everyone's flow.

The Codebase Onboarding Agent eliminates that friction. It ingests any Git repository, builds a triple index (semantic vectors + BM25 keyword + call graph), and exposes an autonomous RAG agent that answers precise, citation-verified questions about the codebase in seconds.

This is not a chatbot wrapper around an LLM. It is a production-grade agentic pipeline with:

🔬 AST-level chunking — logic is never split mid-function
🕸️ Graph-augmented retrieval — understand who calls what
🛡️ Hallucination gating — every citation is verified against the index before it reaches the user
⚡ Semantic answer caching — repeat questions are served in milliseconds
🔄 Webhook auto-sync — the agent updates itself on every GitHub push

The core problem

Old Way	This Agent
Read docs that may be stale	Answers from the actual, live code
Grep and trace calls manually	Graph traversal finds callers/callees instantly
Ask a senior dev (interrupt their flow)	Ask the agent — it cites the exact file and line
Re-onboard after every big PR	Webhook re-ingests on merge, cache is purged

✨ Features at a Glance

	Feature	Description
🤖	Autonomous Agent Loop	LLM decides which tools to call — no hardcoded router
🔍	Hybrid Search	Vector (ChromaDB) + Keyword (BM25) fused via Reciprocal Rank Fusion
🕸️	Call Graph Engine	NetworkX graph of imports and function calls — traversed via BFS (3-hop limit)
🛡️	Hallucination Guard	Citation validation gates any response with confidence < 4.0 / 10
⚡	Semantic Cache	0.95 similarity threshold: identical and near-identical questions skip the LLM entirely
📊	Mermaid Diagrams	Auto-generates call-graph diagrams that render in the Streamlit UI
🔄	Webhook Auto-Sync	HMAC-verified GitHub webhooks trigger re-ingestion on every push
🌐	REST API	FastAPI backend with X-API-Key auth and sliding-window rate limiting
📓	Ragas Evaluation	Automated faithfulness, relevancy, and recall scoring via golden set

🏗️ High-Level Architecture

Developer / GitHub Webhook
         |
         v
 Streamlit App (frontend/)
         | HTTP REST
         v
 FastAPI Router (app/api/)
         |
         |──[Ingestion Pipeline]─────────────────────────────────┐
         |    1. Git Clone / Fetch  (app/ingestion/clone.py)      |
         |    2. File Filter        (file_filter.py)              |
         |    3. Tree-Sitter Parser (app/parsing/)                |
         |    4. AST Chunker        (chunker.py)                  |
         |    5a. Vector Store      (ChromaDB)                    |
         |    5b. BM25 Index        (app/retrieval/)              |
         |    5c. Call Graph        (NetworkX)                    |
         |                                                        |
         └──[Chat Agentic Loop]──────────────────────────────────┘
              Query ──> Semantic Answer Cache
                            | (miss)
                            v
                    RAG Agent Loop (app/agent/)
                            |
                            v
                    Agent Tools (tools.py)
                       |            |
                       v            v
               Hybrid Search    Graph Queries
              (RRF + Reranker)  (app/graph/)
                       \            /
                        \          /
                         v        v
                  Hallucination Guard (confidence.py)
                            |
                            v
                    Cache & Return → API

⚙️ Workflow Breakdowns

Workflow 1 — Ingestion Pipeline

User / Webhook
     |
     | POST /ingest
     v
  API Layer ──── 202 Accepted {job_id}
     |
     | [Background Task]
     v
  clone_repo()
     |
     v
  filter_repo_files()   ← drops binaries, minified files, images
     |
     v
  parse_file()          ← Tree-Sitter AST extraction
     |
     v
  ┌──────────────────────────────────────┐
  │  [Parallel Indexing]                 │
  │  embed → ChromaDB (semantic)         │
  │  tokenize → BM25 (keyword)           │
  │  analyze → NetworkX (call graph)     │
  └──────────────────────────────────────┘
     |
     v
  metadata_store.update(status="synced")

Key steps:

Step	What Happens
Fast Return	API returns `job_id` immediately — client is never blocked
Safe Fetch	Falls back to bundled dummy repo if network is unavailable
Filtering	Drops binaries, minified JS, images — saves LLM tokens
AST Chunking	Splits at class/function boundaries — context never chopped mid-logic
Triple Indexing	Semantic (ChromaDB) + Keyword (BM25) + Relational (NetworkX) simultaneously

Workflow 2 — Agentic Chat (RAG)

User Question
     |
     v
  Semantic Cache  ──── [HIT: similarity > 0.95] ──── Return cached answer
     |
  [MISS]
     |
     v
  Agent Loop (max 4 iterations)
     |
     | think → decide tool → call tool → observe result
     v
  ┌─────────────────────────────────────────┐
  │  search_code   → Hybrid Search (RRF)    │
  │  read_file     → Raw file contents      │
  │  get_callers   → Graph BFS upstream     │
  │  get_callees   → Graph BFS downstream   │
  │  generate_diagram → Mermaid output      │
  └─────────────────────────────────────────┘
     |
     | [context near limit → compress_older_tool_results()]
     v
  Final Answer
     |
     v
  Hallucination Guard
     |
     ├── confidence >= 4.0 → cache + return answer
     └── confidence < 4.0  → return safe fallback, gated=true
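
The cache check at the top of this flow can be sketched as a nearest-neighbour lookup over stored question embeddings. This is a minimal sketch, not the project's `semantic_cache.py`: the `SemanticCache` class, its method names, and the toy 3-dimensional vectors are assumptions for illustration. Only the 0.95 threshold comes from the diagram above.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Hypothetical in-memory cache keyed by question embedding."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self._entries = []  # list of (embedding, answer) pairs

    def put(self, embedding, answer):
        self._entries.append((embedding, answer))

    def get(self, embedding):
        """Return the best cached answer above the threshold, else None."""
        best_score, best_answer = 0.0, None
        for cached_embedding, answer in self._entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score > self.threshold else None


cache = SemanticCache()
cache.put([1.0, 0.0, 0.0], "Sessions are handled in requests/sessions.py")
hit = cache.get([0.99, 0.05, 0.0])   # nearly the same direction: cache hit
miss = cache.get([0.0, 1.0, 0.0])    # orthogonal vector: cache miss
```

In a real deployment the vectors would come from the embedding model, and the cache key would presumably also include the repository and commit so that answers from an older commit are never reused.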

Workflow 3 — Webhook Auto-Sync

GitHub Push / Merge Event
     |
     v
  POST /webhook
     |
     | Verify X-Hub-Signature-256 (HMAC)
     v
  Acquire write lock for repo_id
     |
     v
  Re-run full ingestion pipeline (force_reindex=True)
     |
     v
  invalidate_cache(repo_id)   ← purges ALL cached answers for this repo
     |
     v
  metadata_store.update(commit_hash=new_hash)

Why this matters: HMAC-verified webhooks, write-locking, and aggressive cache invalidation together keep the agent's answers aligned with the latest merged commit. Once re-ingestion completes, stale architectural explanations are no longer served from the cache.
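
The signature check in this workflow can be sketched with the standard library. GitHub sends the header as `sha256=` followed by the hex HMAC-SHA256 of the raw request body. The function name `verify_github_signature` and the sample secret are assumptions for illustration, not the project's actual code.

```python
import hashlib
import hmac


def verify_github_signature(secret: str, payload: bytes, signature_header: str) -> bool:
    """Check a GitHub X-Hub-Signature-256 header against the raw request body."""
    if not signature_header or not signature_header.startswith("sha256="):
        return False
    digest = hmac.new(secret.encode("utf-8"), payload, hashlib.sha256).hexdigest()
    expected = "sha256=" + digest
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected, signature_header)


# Simulate what GitHub would send for a push event body.
body = b'{"ref": "refs/heads/main"}'
good_header = "sha256=" + hmac.new(b"my-secret", body, hashlib.sha256).hexdigest()

accepted = verify_github_signature("my-secret", body, good_header)
rejected = verify_github_signature("wrong-secret", body, good_header)
```

The check must run over the exact bytes received. Re-serialising parsed JSON before hashing can change whitespace and break the comparison.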

📦 Module Reference

🌐 API Layer — app/api/ & app/main.py

File	Responsibility
`app/main.py`	FastAPI bootstrap — middleware, exception handlers, router mount
`app/config.py`	Pydantic `BaseSettings` — single source of truth for env variables
`app/api/router.py`	REST endpoints: `/ingest`, `/status`, `/chat`, `/diagram`
`app/api/auth.py`	X-API-Key header authentication
`app/api/rate_limiter.py`	Sliding-window memory-based rate limiting on `/chat`
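
A sliding-window, in-memory limiter of the kind described for `rate_limiter.py` might look like the following. This is a sketch under assumptions: the class name, the per-key limit of 2, and the 60-second window are illustrative values, and the injectable clock exists only to make the example deterministic.

```python
import time
from collections import defaultdict, deque


class SlidingWindowRateLimiter:
    """Hypothetical limiter: at most `limit` calls per `window_seconds` per key."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self._hits = defaultdict(deque)  # api_key -> timestamps of recent calls

    def allow(self, api_key):
        now = self.clock()
        hits = self._hits[api_key]
        # Drop timestamps that have slid out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


# Fake clock so the behaviour is deterministic.
fake_now = [0.0]
limiter = SlidingWindowRateLimiter(limit=2, window_seconds=60, clock=lambda: fake_now[0])
results = [limiter.allow("dev-secret-key") for _ in range(3)]  # third call is rejected
fake_now[0] = 61.0
after_window = limiter.allow("dev-secret-key")  # old hits expired, allowed again
```

Because the state lives in process memory, limits of this kind reset on restart and are not shared across multiple worker processes.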

⚙️ Ingestion Pipeline — app/ingestion/

File	Responsibility
`clone.py`	Git clone with size limits; offline fallback to dummy repo
`file_filter.py`	Allowlist: `.py`, `.js`, `.ts` — drops binaries and minified files
`metadata_store.py`	Persists sync state (`pending / synced / failed`) and commit hashes
`locking.py`	Thread-level locking per `repo_id` — prevents concurrent ingestion collisions
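
The allowlist behaviour attributed to `file_filter.py` can be illustrated as follows. The extension set is taken from the table above. Detecting minified files by a `.min.` filename suffix is an assumption, and the real filter may use other heuristics such as line length.

```python
from pathlib import PurePosixPath

ALLOWED_EXTENSIONS = {".py", ".js", ".ts"}   # from the table above
MINIFIED_SUFFIXES = (".min.js", ".min.ts")   # assumption: naming convention only


def is_indexable(path: str) -> bool:
    """Keep source files on the allowlist, skip anything that looks minified."""
    name = PurePosixPath(path).name.lower()
    if name.endswith(MINIFIED_SUFFIXES):
        return False
    return PurePosixPath(name).suffix in ALLOWED_EXTENSIONS


paths = ["app/main.py", "static/app.min.js", "docs/logo.png", "web/index.ts"]
kept = [p for p in paths if is_indexable(p)]
```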

🔬 Parsing & Chunking — app/parsing/

File	Responsibility
`tree_sitter_parser.py`	AST extraction of classes, functions, and imports via Tree-sitter grammars
`chunker.py`	Splits files at logical (class/function) boundaries — never mid-logic
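
To show what splitting at logical boundaries means in practice, here is a small sketch that chunks Python source by top-level function and class. Note the swap: it uses Python's built-in `ast` module as a stand-in for Tree-sitter so that the example runs without extra dependencies. The function name and the chunk dictionary layout are assumptions.

```python
import ast


def chunk_python_source(source: str):
    """Return one chunk per top-level function or class, with 1-based line ranges."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                "text": "\n".join(lines[node.lineno - 1:node.end_lineno]),
            })
    return chunks


SAMPLE = '''import os

def load(path):
    with open(path) as fh:
        return fh.read()

class Session:
    def request(self, url):
        return url
'''

chunks = chunk_python_source(SAMPLE)
```

Each chunk carries its own line range, which is what later makes `file:start-end` citations checkable against the index.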

🗄️ Retrieval & Storage — app/retrieval/

File	Responsibility
`vector_store.py`	ChromaDB interface — stores and queries chunk embeddings
`bm25_store.py`	Pure-Python BM25 index — catches exact variable/function name matches
`embeddings.py`	SentenceTransformers wrapper for converting text to vectors
`hybrid_search.py`	Reciprocal Rank Fusion (RRF) combining vector + BM25 results
`reranker.py`	Cross-Encoder reranking of top-K hybrid search results
`query_expansion.py`	LLM-generated synonym/keyword expansion before hitting the index
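
Reciprocal Rank Fusion scores each result as the sum of `1 / (k + rank)` over every ranked list it appears in. The sketch below illustrates the technique and is not the contents of `hybrid_search.py`. The constant `k=60` is the commonly used default, and the chunk IDs are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of chunk IDs into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for a stable order.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical results from the two retrievers.
vector_hits = ["sessions.py::Session.request", "adapters.py::HTTPAdapter.send", "api.py::get"]
bm25_hits = ["api.py::get", "sessions.py::Session.request", "utils.py::to_key_val_list"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

Results that both retrievers agree on rise to the top, and no score normalisation between vector distances and BM25 scores is needed because only ranks are used.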

🕸️ Graph Operations — app/graph/ & app/diagrams/

File	Responsibility
`builder.py`	Builds a NetworkX directed graph of imports and internal function calls
`queries.py`	Timeout-safe BFS traversal for callers/callees (3-hop limit)
`mermaid_generator.py`	Converts a graph subgraph into a Mermaid diagram string for the frontend
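
The 3-hop breadth-first traversal can be sketched on a plain adjacency dictionary. This is an illustration only: the project uses NetworkX, the function signature here is an assumption, and the timeout handling mentioned in the table is omitted.

```python
from collections import deque


def get_callees(call_graph, start, max_hops=3):
    """Breadth-first search over a dict-based call graph, bounded by hop count."""
    seen = {start}
    found = {}  # function name -> hops from start
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for callee in call_graph.get(node, []):
            if callee not in seen:
                seen.add(callee)
                found[callee] = hops + 1
                queue.append((callee, hops + 1))
    return found


# Toy graph with a cycle: a -> b -> c -> d -> e -> a
CALL_GRAPH = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["e"], "e": ["a"]}
callees = get_callees(CALL_GRAPH, "a")
```

The `seen` set keeps cycles from looping forever, and the hop limit keeps results local to the function being asked about. Finding callers is the same search over the reversed graph.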

🤖 Agentic Loop — app/agent/

File	Responsibility
`loop.py`	Core execution loop — LLM call → tool intercept → tool run → feed back → repeat
`tools.py`	JSON schema definitions + retry logic for all 5 agent tools
`system_prompt.py`	LLM instructions: persona, rules, and strict markdown citation format
`semantic_cache.py`	95% similarity threshold cache — same question, same commit → instant return
`context_manager.py`	Monitors token usage; condenses old tool results via secondary LLM
`confidence.py`	Hallucination Guard — validates every cited file/line; gates below score 4.0

🧪 Evaluation Suite — eval/

File	Responsibility
`run_eval.py`	Ragas automated evaluation: Faithfulness, Answer Relevancy, Context Precision, Context Recall
`compare_runs.py`	Compares new eval runs against previous baselines to detect regressions

🔌 API Contracts

All routes require the X-API-Key header.

`POST /ingest`

// Request
{ "repo_url": "https://github.com/psf/requests", "ref": "main", "force_reindex": false }

// Response — 202 Accepted
{ "job_id": "abc-123", "status": "pending" }

`GET /status/{repo_id}`

{ "sync_status": "synced", "commit_hash": "a1b2c3d", "error": null, "has_circular_dependencies": false }

`POST /chat`

// Request
{ "repo_id": "psf_requests", "question": "How does session handling work?", "session_id": "opt-123" }

// Response
{
  "answer": "Session handling works by...",
  "sources": [
    { "file_path": "requests/sessions.py", "function_name": "Session.request", "start_line": 400, "end_line": 450 }
  ],
  "confidence_score": 9.5,
  "gated": false
}

`POST /diagram`

// Request
{ "repo_id": "psf_requests", "entry_point": "Session.request", "direction": "both" }

// Response
{ "mermaid_markdown": "graph TD\n..." }

Method	Path	Purpose
`POST`	`/ingest`	Start background ingestion of a repository
`GET`	`/status/{repo_id}`	Check ingestion / sync status
`POST`	`/chat`	Run the agentic RAG query loop
`POST`	`/diagram`	Generate a Mermaid call-graph diagram
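
As a usage illustration, a `/chat` request can be built with the standard library as shown below. The base URL and API key are the local placeholder values used in the setup section of this README. The helper name `build_chat_request` is an assumption, and the network call is left commented out so the snippet runs without a live backend.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"   # local backend URL
API_KEY = "dev-secret-key"           # placeholder value from .env


def build_chat_request(repo_id, question, session_id):
    """Build (but do not send) a POST /chat request with the X-API-Key header."""
    payload = json.dumps(
        {"repo_id": repo_id, "question": question, "session_id": session_id}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/chat",
        data=payload,
        method="POST",
        headers={"X-API-Key": API_KEY, "Content-Type": "application/json"},
    )


request = build_chat_request("psf_requests", "How does session handling work?", "opt-123")

# With the backend running, send it like this:
# with urllib.request.urlopen(request) as response:
#     body = json.loads(response.read())
#     print(body["answer"], body["confidence_score"], body["gated"])
```

Clients should check the `gated` flag before displaying `answer`, since a gated response carries a fallback message and not a verified answer.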

🛡️ Agentic Gating & Hallucination Guard

The Hallucination Guard (confidence.py) is the most critical safety feature. It parses the final LLM response for markdown citations (e.g. `src/auth.py:10-15`) and validates every single one against the actual index.

Final LLM Answer
       |
       v
Parse markdown citations (`file:start-end`)
       |
       v
For each citation:
  - Does file exist in index?        → No → penalize score
  - Are line numbers within bounds?  → No → penalize score
       |
       v
Compute deterministic confidence score (0–10)
       |
       ├── score >= 4.0 → return answer,  gated=false
       └── score < 4.0  → strip answer,   gated=true, return safe fallback
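
The flow above can be sketched as a regex pass followed by bounds checks. The scoring formula here (10 times the fraction of valid citations) is an assumption chosen for simplicity, as is treating an answer with no citations as gated. The real `confidence.py` may weight penalties differently. The `file_line_counts` mapping stands in for the index.

```python
import re

CITATION_RE = re.compile(r"([\w./-]+\.\w+):(\d+)-(\d+)")
GATE_THRESHOLD = 4.0  # from the flow above


def score_answer(answer, file_line_counts):
    """Return (score, gated). Score = 10 * fraction of citations that check out."""
    citations = CITATION_RE.findall(answer)
    if not citations:
        return 0.0, True  # assumption: an uncited answer is unverifiable
    valid = 0
    for path, start, end in citations:
        total = file_line_counts.get(path)
        if total is not None and 1 <= int(start) <= int(end) <= total:
            valid += 1
    score = 10.0 * valid / len(citations)
    return score, score < GATE_THRESHOLD


# Hypothetical index: file path -> number of lines in that file.
INDEX = {"src/auth.py": 120, "src/db.py": 40}

ok = score_answer("Login is in src/auth.py:10-15 and src/db.py:5-9.", INDEX)
bad = score_answer("See src/ghost.py:1-5 and src/db.py:90-99.", INDEX)
```

Because the score depends only on the answer text and the index, the same answer always receives the same score, which matches the deterministic behaviour described above.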

Available Agent Tools

Tool	Purpose
`search_code`	Hybrid (vector + BM25) search with RRF fusion and Cross-Encoder reranking
`read_file`	Read raw file contents from the indexed repository
`get_callers`	Graph traversal: find functions that call a given function (3-hop limit)
`get_callees`	Graph traversal: find functions called by a given function (3-hop limit)
`generate_diagram`	Produce a Mermaid diagram from a graph subgraph
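
For `generate_diagram`, turning call-graph edges into Mermaid text is mostly string formatting. The sketch below assumes edges arrive as `(caller, callee)` pairs and assigns synthetic node IDs. The function name and ID scheme are illustrative, not the project's `mermaid_generator.py`.

```python
def to_mermaid(edges):
    """Render (caller, callee) pairs as a top-down Mermaid flowchart."""
    lines = ["graph TD"]
    ids = {}
    for caller, callee in edges:
        for name in (caller, callee):
            if name not in ids:
                ids[name] = f"n{len(ids)}"
                # Quote labels so dots in qualified names do not break parsing.
                lines.append(f'    {ids[name]}["{name}"]')
        lines.append(f"    {ids[caller]} --> {ids[callee]}")
    return "\n".join(lines)


diagram = to_mermaid([
    ("Session.request", "Session.send"),
    ("Session.send", "HTTPAdapter.send"),
])
```

The resulting string has the same shape as the `mermaid_markdown` field returned by `POST /diagram`.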

🛠️ Tech Stack

Layer	Technology	Role
Runtime	Python 3.12	Core language
Backend	FastAPI	REST API server, background tasks, auth
Frontend	Streamlit	Developer-facing chat and diagram UI
LLM	Groq (LLaMA 3)	Agent reasoning and answer generation
Vector Store	ChromaDB	Semantic embedding storage and search
Keyword Index	BM25 (pure Python)	Exact variable/function name matching
Graph Engine	NetworkX	Call graph — callers, callees, import chains
AST Parser	Tree-sitter	Language-aware code chunking
Embeddings	SentenceTransformers	Text-to-vector conversion
Reranker	Cross-Encoder	Relevance reranking of hybrid search results
Evaluation	Ragas	Automated faithfulness and recall scoring
Webhooks	HMAC SHA-256	Secure GitHub push verification

📁 Project Structure

📦 codebase-onboarding-agent/
│
├── 📂 app/
│   ├── 📂 api/
│   │   ├── 🔐 auth.py                  ← X-API-Key authentication
│   │   ├── 🚦 rate_limiter.py          ← Sliding-window rate limiting
│   │   └── 🛣️ router.py               ← /ingest /status /chat /diagram
│   │
│   ├── 📂 agent/
│   │   ├── 💬 loop.py                  ← Core agentic execution loop
│   │   ├── 🔧 tools.py                 ← Tool schemas + retry logic
│   │   ├── 📝 system_prompt.py         ← LLM instruction set
│   │   ├── ⚡ semantic_cache.py        ← 95% similarity answer cache
│   │   ├── 📏 context_manager.py       ← Token usage + compression
│   │   └── 🛡️ confidence.py           ← Hallucination guard + gating
│   │
│   ├── 📂 ingestion/
│   │   ├── 📥 clone.py                 ← Git clone + offline fallback
│   │   ├── 🔍 file_filter.py           ← Allowlist filtering
│   │   ├── 🗃️ metadata_store.py       ← Sync state + commit hash tracking
│   │   └── 🔒 locking.py              ← Thread-level repo locks
│   │
│   ├── 📂 parsing/
│   │   ├── 🌳 tree_sitter_parser.py    ← AST extraction
│   │   └── ✂️ chunker.py              ← Function/class boundary splitting
│   │
│   ├── 📂 retrieval/
│   │   ├── 🧮 vector_store.py          ← ChromaDB interface
│   │   ├── 🔡 bm25_store.py            ← BM25 keyword index
│   │   ├── 🧬 embeddings.py            ← SentenceTransformers wrapper
│   │   ├── 🔀 hybrid_search.py         ← RRF fusion
│   │   ├── 🎯 reranker.py              ← Cross-Encoder reranking
│   │   └── 💡 query_expansion.py       ← LLM synonym expansion
│   │
│   ├── 📂 graph/
│   │   ├── 🏗️ builder.py              ← NetworkX graph construction
│   │   └── 🔎 queries.py              ← BFS traversal (3-hop limit)
│   │
│   ├── 📂 diagrams/
│   │   └── 📊 mermaid_generator.py     ← Graph → Mermaid markdown
│   │
│   ├── ⚙️ config.py                   ← Pydantic BaseSettings (all env vars)
│   └── 🚀 main.py                     ← FastAPI bootstrap
│
├── 📂 frontend/
│   └── 🌐 streamlit_app.py            ← Chat UI + diagram rendering
│
├── 📂 eval/
│   ├── 🧪 run_eval.py                 ← Ragas evaluation runner
│   └── 📈 compare_runs.py             ← Regression detection
│
├── 📂 tests/                          ← Full pytest suite
├── 🔧 .env.example                    ← Environment variable template
├── 🚀 run_local.bat                   ← One-command local startup
├── 📋 requirements.txt                ← All Python dependencies
└── 📖 README.md

🚀 Setup & Local Execution

Prerequisites

python --version   # 3.12 required
git --version      # any recent version

Step-by-Step Installation

① Clone the repository

git clone https://github.com/HurairaMaqbool/CodeNavigator.git codebase-onboarding-agent
cd codebase-onboarding-agent

② Create and activate a virtual environment

python -m venv .venv

# Windows
.venv\Scripts\activate

# macOS / Linux
source .venv/bin/activate

③ Install all dependencies

pip install -r requirements.txt

④ Configure environment variables

cp .env.example .env

Edit .env:

GROQ_API_KEY="your_groq_api_key"
LLM_PROVIDER="groq"
API_KEY="dev-secret-key"
WEBHOOK_SECRET="your_github_webhook_secret"

⑤ Start the system

run_local.bat

Service	URL
FastAPI Backend	http://localhost:8000
Streamlit Frontend	http://localhost:8501
API Docs (Swagger)	http://localhost:8000/docs

🧪 Testing

# Run the full test suite
pytest tests/

# Run a specific module
pytest tests/test_module_12.py -v

# Run with coverage report
pytest tests/ --cov=app --cov-report=html

Note: tests/test_golden_set.py performs live API calls against real repositories. It is skipped by default to preserve API quotas — run explicitly when needed.

Evaluation (Ragas)

python eval/run_eval.py        # Run full Ragas eval
python eval/compare_runs.py    # Compare against previous baseline

Metrics tracked: Faithfulness · Answer Relevancy · Context Precision · Context Recall

🛣️ Roadmap

 ✅  Phase 1 — Core Pipeline
     [x] Git ingestion + AST chunking
     [x] Triple indexing (ChromaDB + BM25 + NetworkX)
     [x] Hybrid search with RRF + reranker
     [x] Autonomous agentic loop (5 tools)

 ✅  Phase 2 — Safety & Reliability
     [x] Hallucination guard with citation validation
     [x] Semantic answer cache (95% threshold)
     [x] Context compression (token overflow protection)
     [x] HMAC-verified webhook auto-sync

 🔄  Phase 3 — Evaluation & Observability  (in progress)
     [ ] Ragas golden set scoring
     [ ] LangSmith tracing integration
     [ ] Regression detection pipeline

 🔮  Phase 4 — Extensions
     [ ] VS Code extension
     [ ] Multi-repo cross-codebase queries
     [ ] Docker containerisation + CI/CD
     [ ] Fine-tuned reranker on code-specific data
     [ ] Support for Go, Rust, Java (Tree-sitter grammar expansion)

🤝 Contributing

Contributions, ideas, and bug reports are warmly welcomed!

# 1. Fork the repository

# 2. Clone your fork
git clone https://github.com/YOUR-USERNAME/CodeNavigator.git

# 3. Create a feature branch
git checkout -b feature/your-feature-name

# 4. Make your changes and commit
git add .
git commit -m "feat: describe your change clearly"

# 5. Push and open a Pull Request
git push origin feature/your-feature-name

Contribution ideas:

🐛 Fix edge cases in tree_sitter_parser.py for multi-language repos
➕ Add Tree-sitter grammars for Go, Rust, or Java
🧪 Expand the Ragas golden set with new Q&A pairs
🐳 Add Docker + Docker Compose support
📊 Build a LangSmith observability dashboard

📄 License

This project is distributed under the MIT License.

MIT License — free to use, modify, and distribute with attribution.
See the LICENSE file for full terms.

👤 Author

Huraira Maqbool

AI Engineer · LangChain · LangGraph · RAG Pipelines · Agentic Systems

If this project saved you hours of onboarding time or taught you something new — a ⭐ star on GitHub means a lot and helps other developers find it.

Built with 🐍 Python · ⚡ FastAPI · 🧠 LangChain · 🕸️ NetworkX · ❤️ Passion for Developer Tooling
