
How It Works

A technical overview of the CodebaseAtlas analysis pipeline — what each step does, what the LLM is and isn't responsible for, and how your data is handled.

The analysis pipeline

When you submit a GitHub URL to Atlas, the backend runs through four deterministic stages before the LLM ever sees any data, then hands the assembled evidence to a final LLM stage:

01

GitHub API fetch

The backend calls the GitHub REST API to fetch the repository file tree and the content of high-priority files (package.json, requirements.txt, Dockerfile, README, etc.). Only file names and manifest contents are read — source code is not sent anywhere.
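The file-selection step can be sketched as a simple filter over the tree listing. This is a minimal illustration; the function name and the exact priority list are assumptions, not Atlas's actual implementation:

```python
# Hypothetical sketch: selecting high-priority files from a GitHub tree
# listing. Only basenames matching known manifest/doc files are kept,
# so arbitrary source files are never fetched.

HIGH_PRIORITY = {
    "package.json", "requirements.txt", "pyproject.toml",
    "Cargo.toml", "go.mod", "Dockerfile", "README.md",
}

def select_priority_files(tree_paths: list[str]) -> list[str]:
    """Keep only paths whose basename is a known manifest or doc file."""
    return [p for p in tree_paths if p.rsplit("/", 1)[-1] in HIGH_PRIORITY]

paths = ["src/app.py", "package.json", "docs/README.md", "infra/Dockerfile"]
priority = select_priority_files(paths)  # source files like src/app.py are excluded
```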

02

Manifest parsing

manifest_parser.py deterministically extracts dependency lists from package.json, requirements.txt, pyproject.toml, Cargo.toml, go.mod, and similar files. No AI is involved — this is pattern matching and JSON parsing.
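The parsing logic for two of these formats might look like the following sketch. The function names are illustrative, not the actual `manifest_parser.py` API; the point is that extraction is plain JSON parsing and string handling, with no model in the loop:

```python
import json

def parse_package_json(text: str) -> list[str]:
    """Extract dependency names (not versions) from a package.json payload."""
    data = json.loads(text)
    deps = {**data.get("dependencies", {}), **data.get("devDependencies", {})}
    return sorted(deps)

def parse_requirements_txt(text: str) -> list[str]:
    """Extract package names from requirements.txt, ignoring comments and pins."""
    names = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # strip inline comments
        if not line:
            continue
        for sep in ("==", ">=", "<=", "~=", ">", "<"):  # drop version specifiers
            line = line.split(sep, 1)[0]
        names.append(line.strip())
    return names
```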

03

Framework detection

framework_detector.py uses heuristic rules to classify the tech stack: frontend framework, backend framework, database, infrastructure tooling, and testing setup. Evidence is collected for each inference.
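A heuristic rule table of this kind can be sketched as a mapping from stack slots to dependency markers, with the matching package names kept as evidence. The rule set below is a small illustrative sample, not the real `framework_detector.py` rules:

```python
def detect_frameworks(deps: set[str]) -> dict[str, tuple[str, list[str]]]:
    """Map stack slots to (framework, evidence) using simple dependency rules."""
    RULES = {
        "frontend": [("React", {"react"}), ("Vue", {"vue"})],
        "backend": [("FastAPI", {"fastapi"}), ("Express", {"express"})],
        "testing": [("pytest", {"pytest"}), ("Jest", {"jest"})],
    }
    detected = {}
    for slot, candidates in RULES.items():
        for name, markers in candidates:
            hits = sorted(markers & deps)  # which dependencies triggered the rule
            if hits:
                detected[slot] = (name, hits)
                break  # first matching rule wins for this slot
    return detected
```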

04

Evidence object construction

A structured evidence object is assembled from the parsed manifests and detected frameworks. This is what gets sent to the LLM — not raw file contents. The LLM never invents dependencies or frameworks; it only describes what the evidence shows.
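The evidence object's shape might resemble the dataclass below. The field names are assumptions for illustration; the key property is that it carries names, counts, and structure summaries rather than source code:

```python
from dataclasses import dataclass, asdict

@dataclass
class Evidence:
    """Illustrative shape of the evidence object sent to the LLM."""
    frameworks: dict[str, str]   # detected stack, e.g. slot -> framework name
    dependencies: list[str]      # package names only, no versions
    file_count: int
    top_level_dirs: list[str]    # directory structure summary

evidence = Evidence(
    frameworks={"frontend": "React", "backend": "FastAPI"},
    dependencies=["react", "fastapi", "uvicorn"],
    file_count=142,
    top_level_dirs=["src", "backend", "docs"],
)
payload = asdict(evidence)  # JSON-serializable; no source code included
```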

05

LLM analysis (Claude)

Claude receives the evidence object and generates: a Mermaid architecture diagram, a developer-focused technical summary, and a non-technical summary for hiring managers. Tool-use (structured output) ensures the response is machine-parseable.
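A tool-use request of this kind can be sketched as follows. The tool name, schema, and model id are placeholders, not CodebaseAtlas's actual prompts; the pattern of forcing a single tool via `tool_choice` is how the Anthropic Messages API guarantees machine-parseable output:

```python
# Sketch of a tool-use (structured output) request body for the Anthropic
# Messages API. Tool name and schema are illustrative placeholders.
analysis_tool = {
    "name": "report_analysis",
    "description": "Return the repository analysis as structured fields.",
    "input_schema": {
        "type": "object",
        "properties": {
            "mermaid_diagram": {"type": "string"},
            "technical_summary": {"type": "string"},
            "non_technical_summary": {"type": "string"},
        },
        "required": ["mermaid_diagram", "technical_summary",
                     "non_technical_summary"],
    },
}

request_body = {
    "model": "claude-sonnet-4-20250514",  # placeholder model id
    "max_tokens": 2048,
    "tools": [analysis_tool],
    # Forcing this tool makes the response arrive as validated JSON fields.
    "tool_choice": {"type": "tool", "name": "report_analysis"},
    "messages": [{"role": "user", "content": "Evidence: <structured evidence JSON>"}],
}
```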

What Claude is and isn't responsible for

LLM is used for

  • Architecture diagram (Mermaid syntax)
  • Technical summary
  • Non-technical summary
  • Grouping and describing API endpoints
  • Quality finding descriptions

LLM is NOT used for

  • Detecting dependencies
  • Identifying frameworks
  • Fetching or reading file content
  • Calculating confidence scores
  • Extracting API routes

This separation means that framework detection and dependency parsing are fully deterministic and testable — results are reproducible. The LLM adds human-readable interpretation on top of verified evidence.
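Because the deterministic side is plain code, it can be unit-tested directly. As one illustration, extracting Express-style API routes (a task the LLM is explicitly not used for) reduces to a regular expression; this sketch is hypothetical, not the actual extractor:

```python
import re

# Matches declarations like app.get('/path', handler) or router.post("/path", h)
ROUTE_RE = re.compile(
    r"""(?:app|router)\.(get|post|put|delete|patch)\(\s*['"]([^'"]+)['"]"""
)

def extract_routes(source: str) -> list[tuple[str, str]]:
    """Return (METHOD, path) pairs from Express-style route declarations."""
    return [(m.group(1).upper(), m.group(2)) for m in ROUTE_RE.finditer(source)]

sample = """
app.get('/api/users', listUsers)
router.post('/api/users', createUser)
"""
routes = extract_routes(sample)  # [('GET', '/api/users'), ('POST', '/api/users')]
```

The same input always produces the same routes, so regressions are caught by ordinary tests rather than by inspecting LLM output.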

Confidence scores

Each detected stack item (e.g., “React” in the frontend slot) carries a confidence score between 0 and 1. This score reflects how much file evidence supports the inference:

0.80 – 1.00 (High): Multiple manifest files confirm the framework
0.50 – 0.79 (Medium): Some evidence present, possibly inferred from file patterns
0.00 – 0.49 (Low): Weak signal; treat as a best guess

Data handling and security

Public repos only

CodebaseAtlas only works with public repositories. It uses the unauthenticated GitHub API by default, which can only access public data. No authentication to private repos is possible through this service.

GitHub tokens

CodebaseAtlas accepts an optional GitHub personal access token to raise the API rate limit. A token is used only for the single request that supplies it and is never stored, logged, or otherwise persisted.

What is sent to Claude

The Anthropic API receives a structured prompt containing: detected framework names, dependency lists (package names only, not versions), file counts, and directory structure summaries. Source code is never sent to any LLM.

Full transparency

CodebaseAtlas is fully open source under the AGPL-3.0 license. The complete source code — including the analysis pipeline, framework detection logic, manifest parsers, and LLM prompts — is available on GitHub.