How We Score

Our scoring system measures how engineers work with AI across 7 calibrated dimensions in 3 tiers. Every score is backed by telemetry evidence — not subjective judgment. The same behavior is scored differently depending on what the task demands.

Process Over Output

Two engineers can produce identical code — one by thoughtfully guiding AI with good context and verification, the other by accepting the third AI attempt after two failures. We measure the process, not just the result. Base scores are deterministic from telemetry; optional LLM enhancement adds qualitative evidence (bounded to ±15 points per dimension).
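As a minimal sketch of the enhancement step described above (function and parameter names here are illustrative, not the actual implementation), the bounded LLM adjustment amounts to a clamp applied on top of the deterministic base:

```python
def enhanced_score(base: float, llm_adjustment: float) -> float:
    """Apply an optional LLM adjustment to a deterministic base score.

    The adjustment is bounded to +/-15 points per dimension, as stated
    above; names and the 0-100 clamp are illustrative assumptions.
    """
    clamped = max(-15.0, min(15.0, llm_adjustment))
    # Keep the final dimension score within the 0-100 range.
    return max(0.0, min(100.0, base + clamped))
```

For example, a proposed +20 adjustment on a base of 72 is capped at +15, yielding 87.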

TIER 1

Calibrated AI Judgment

(65% of score)

Where the gap between strong and weak AI-augmented engineers is widest. Scores are calibrated per task, so expectations shift with what each task demands.

Calibrated Trust

25% weight (hybrid)

Does the engineer's level of verification match what the task demands? Trusting a simple AI rename is smart. Trusting a complex architectural suggestion without testing is dangerous.

Signals Measured

  • Whether verification intensity matches the risk level of the change
  • Appropriate trust calibration — accepting low-risk suggestions efficiently while scrutinizing high-risk ones
  • Critical evaluation of AI output relative to task complexity
  • Evidence of independent judgment in accepting or modifying AI suggestions

Example

An engineer who accepts a simple import fix without testing but runs full test suites after an AI-suggested architectural change scores 85+. Testing everything equally — or testing nothing — both score lower.

Context Engineering

20% weight (deterministic)

How effectively does the engineer select and provide context to the AI? Quality over quantity — 200 relevant lines beat 2000 lines of noise. Scored against task-specific rubric files and key context targets.

Signals Measured

  • Quality of investigation before seeking AI assistance
  • Relevance and precision of context provided to the AI
  • Whether prompts include meaningful architectural and behavioral constraints
  • Context quality evolution as understanding deepens

Example

Attaching the relevant config file and test patterns before asking for a fix scores higher than pasting entire files without filtering.

Problem Decomposition

20% weight (deterministic)

Does the engineer think before prompting? Exploration time is calibrated per task — a production triage task expects faster orientation than a complex refactor.

Signals Measured

  • Exploration depth appropriate to the situation
  • Evidence of structured thinking before acting
  • Prompt specificity and targeted problem framing
  • Whether complex problems are broken into manageable, scoped steps

Example

A candidate who spends 3 minutes exploring a debugging task before prompting scores well. The same 3 minutes on a production triage (where speed matters) would score lower.

TIER 2

Technical Execution

(25% of score)

Core engineering skills that remain essential regardless of AI assistance.

Debugging & Recovery

12% weight (hybrid)

When things go wrong, can the engineer find the root cause and fix it — not just re-prompt until something works?

Signals Measured

  • Systematic approach to root-cause identification
  • Use of appropriate debugging techniques vs. blind re-prompting
  • Quality of recovery when an approach fails
  • Speed of recognizing dead ends and pivoting

Example

Spotting that the AI's fix passes tests but introduces a subtle race condition, then systematically narrowing the root cause.

Architectural Judgment

8% weight (deterministic)

Does the engineer respect the existing codebase architecture? Scored against task-specific rubric expectations for scope and pattern adherence.

Signals Measured

  • Respect for existing codebase patterns and conventions
  • Deliberate decisions about scope and placement of changes
  • Resistance to unnecessary AI-introduced complexity

Example

A focused fix in 2-3 files that follows existing patterns scores higher than letting AI scatter changes across 10 files.

Code Review Quality

5% weight (hybrid)

Can they critically evaluate code — whether AI-generated or human-written?

Signals Measured

  • Specificity and actionability of review feedback
  • Ability to distinguish critical issues from minor style concerns
  • Quality of suggested alternatives and explanations

Example

Identifying SQL injection AND suggesting parameterized queries scores higher than noting 'security issue'.

TIER 3

Efficiency

(10% of score)

Speed without quality is negative value — this is intentionally low-weighted.

Workflow Efficiency

10% weight (deterministic)

Measures productive workflow and effective tool usage. Explicitly does NOT measure total time to completion or number of prompts.

Signals Measured

  • Effective use of available development tools
  • Productive momentum without unnecessary context switches
  • Appropriate tool selection for the task at hand
  • Read-before-write patterns indicating thoughtful workflow

Example

Using the AI to read and understand code before writing, running tests in the terminal, and maintaining productive flow throughout. Taking time to be thorough is not penalized.

Our scoring uses a combination of deterministic telemetry analysis and AI-assisted evaluation. Signal weighting includes per-session randomization to prevent pattern memorization.
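A sketch of how seeded per-session weight randomization might work, assuming the nominal dimension weights listed above. The jitter size, seeding scheme, and the fact that the real system randomizes at the signal level (not the dimension level) make this an illustration only:

```python
import random

# Nominal dimension weights from the tiers above (sum to 1.0).
NOMINAL_WEIGHTS = {
    "calibrated_trust": 0.25,
    "context_engineering": 0.20,
    "problem_decomposition": 0.20,
    "debugging_recovery": 0.12,
    "architectural_judgment": 0.08,
    "code_review_quality": 0.05,
    "workflow_efficiency": 0.10,
}

def session_weights(session_id: str, jitter: float = 0.02) -> dict:
    """Jitter each nominal weight by up to +/-jitter, then renormalize.

    Assumption-laden sketch: jitter size and per-session seeding are
    invented here for illustration.
    """
    rng = random.Random(session_id)  # same session id -> same weights
    raw = {name: max(0.0, w + rng.uniform(-jitter, jitter))
           for name, w in NOMINAL_WEIGHTS.items()}
    total = sum(raw.values())
    return {name: v / total for name, v in raw.items()}
```

Because the randomization is seeded by the session, the same session always reproduces the same weights, which is what keeps the base scores deterministic despite the jitter.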

Behavioral Patterns

Beyond individual dimensions, we detect overall session patterns — how the engineer approaches the problem as a whole. Patterns apply a small modifier to the overall score.

Calibrated Expert

+5 to +8

Behavior intensity matches task demands — light verification for simple changes, deep testing for complex ones. The strongest signal of experience.

Methodical Verifier

+2 to +5

Systematic, thorough verification on every change. Always a positive signal — may over-verify on simple tasks, but never misses real issues.

Explore-Plan-Execute

+3 to +5

High orientation time → specific prompts → targeted verification. Strong signal of structured problem-solving.

Recovery Pivot

+3 to +6

Initial approach fails → recognizes dead end → pivots strategy → succeeds. Shows resilience, experience, and intellectual honesty.

Context Blind

-5 to -10

Demonstrates skills but ignores task-specific context — doesn't read rubric files, misses key architectural patterns, applies generic solutions to specific problems.

Spray and Pray

-8 to -15

Immediate vague prompts → accept first output → no verification. The weakest signal of engineering judgment.
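The pattern modifiers above could be applied roughly as follows. How the real detector picks a point within each range is not documented; here it is scaled by a hypothetical detection strength between 0 and 1:

```python
# Modifier ranges from the pattern list above, as (weakest, strongest).
PATTERN_MODIFIERS = {
    "calibrated_expert":    (5, 8),     # +5 to +8
    "methodical_verifier":  (2, 5),     # +2 to +5
    "explore_plan_execute": (3, 5),     # +3 to +5
    "recovery_pivot":       (3, 6),     # +3 to +6
    "context_blind":        (-5, -10),  # -5 to -10
    "spray_and_pray":       (-8, -15),  # -8 to -15
}

def apply_pattern(score: float, pattern: str, strength: float = 1.0) -> float:
    """Shift the overall score by a pattern modifier.

    `strength` (0..1) is a hypothetical detection-confidence knob that
    interpolates between the weak and strong ends of each range.
    """
    weak, strong = PATTERN_MODIFIERS[pattern]
    modifier = weak + (strong - weak) * max(0.0, min(1.0, strength))
    return max(0.0, min(100.0, score + modifier))
```

For instance, a confidently detected Calibrated Expert pattern lifts an 80 to 88, while a confidently detected Context Blind pattern drops it to 70.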

Comprehension Checks

After submission, candidates answer 3-5 targeted questions about their work. This prevents the "AI-generated code you don't understand" problem — research shows developers using AI score 17% lower on comprehension tests than those coding manually.

  • Multiple choice: Root cause identification — can they explain what was actually wrong?
  • Multiple choice: Approach justification — why did they choose this approach over alternatives?
  • Multiple choice: Code comprehension — what does this specific code do and what trade-offs does it make?
  • Open-ended: Reflection — what would they do differently with more time?

Grade Scale

The overall score (0-100) maps to a letter grade and performance band.

Grade | Score Range | Band | What It Means
S | 90-100 | Exceptional | Top-tier performance across all dimensions. Rare.
A | 80-89 | Strong | Consistently strong AI collaboration. Recommended hire.
B | 70-79 | Competent | Solid fundamentals with room to grow.
C | 60-69 | Developing | Some good patterns, but significant gaps.
D | 50-59 | Needs Work | Below expectations for the role.
F | <50 | Significant Gaps | Fundamental skills not demonstrated.
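The grade bands reduce to simple threshold checks; a minimal sketch of the mapping:

```python
def grade(score: float) -> str:
    """Map a 0-100 overall score to its letter grade per the table above."""
    bands = [(90, "S"), (80, "A"), (70, "B"), (60, "C"), (50, "D")]
    for cutoff, letter in bands:
        if score >= cutoff:
            return letter
    return "F"  # below 50
```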

Scoring Reliability

Zero Rater Variance

4 of 7 dimensions are fully deterministic — computed from telemetry signals with per-session weight randomization to prevent pattern memorization. The same session always produces the same base scores. No human rater, no subjectivity.

Bounded LLM Enhancement

For 3 hybrid dimensions, LLM analysis can adjust scores by at most ±15 points. Each adjustment includes confidence level (high/medium/low) and written justification.

Evidence Trail

Every score links to specific timestamped events in the session. You can review the raw evidence — prompts, edits, test runs — that produced each number.

Outcome Modifier

After dimension scoring, a small modifier adjusts the overall score based on final test results. Process matters most, but outcomes matter too.

Final Test Pass Rate | Score Adjustment
100% | No penalty
70-99% | -5 points
30-69% | -10 points
1-29% | -15 points
0% | -20 points
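The outcome modifier is a lookup over the pass-rate bands above; a sketch, where exact boundary handling between bands is an assumption:

```python
def outcome_penalty(pass_rate: float) -> int:
    """Score adjustment from the final test pass rate (a fraction in [0, 1])."""
    if pass_rate >= 1.0:
        return 0      # all tests pass: no penalty
    if pass_rate >= 0.70:
        return -5
    if pass_rate >= 0.30:
        return -10
    if pass_rate > 0.0:
        return -15
    return -20        # nothing passes
```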

See It In Action

View a complete sample scorecard with all dimensions scored, behavioral patterns detected, and evidence linked.