How We Score
Our scoring system measures how engineers work with AI across 7 calibrated dimensions in 3 tiers. Every score is backed by telemetry evidence — not subjective judgment. The same behavior is scored differently depending on what the task demands.
Process Over Output
Two engineers can produce identical code — one by thoughtfully guiding AI with good context and verification, the other by accepting the third AI attempt after two failures. We measure the process, not just the result. Base scores are deterministic from telemetry; optional LLM enhancement adds qualitative evidence (bounded to ±15 points per dimension).
Calibrated AI Judgment
(65% of score) This tier is where the gap between strong and weak AI-augmented engineers is widest. Scores are calibrated per task — the same behavior is scored differently depending on what the task demands.
Calibrated Trust
25% weight · hybrid. Does the engineer's level of verification match what the task demands? Trusting a simple AI rename is smart. Trusting a complex architectural suggestion without testing is dangerous.
Signals Measured
- Whether verification intensity matches the risk level of the change
- Appropriate trust calibration — accepting low-risk suggestions efficiently while scrutinizing high-risk ones
- Critical evaluation of AI output relative to task complexity
- Evidence of independent judgment in accepting or modifying AI suggestions
Context Engineering
20% weight · deterministic. How effectively does the engineer select and provide context to the AI? Quality over quantity — 200 relevant lines beat 2000 lines of noise. Scored against task-specific rubric files and key context targets.
Signals Measured
- Quality of investigation before seeking AI assistance
- Relevance and precision of context provided to the AI
- Whether prompts include meaningful architectural and behavioral constraints
- Context quality evolution as understanding deepens
Problem Decomposition
20% weight · deterministic. Does the engineer think before prompting? Exploration time is calibrated per task — a production triage task expects faster orientation than a complex refactor.
Signals Measured
- Exploration depth appropriate to the situation
- Evidence of structured thinking before acting
- Prompt specificity and targeted problem framing
- Whether complex problems are broken into manageable, scoped steps
Technical Execution
(25% of score) Core engineering skills that remain essential regardless of AI assistance.
Debugging & Recovery
12% weight · hybrid. When things go wrong, can the engineer find the root cause and fix it — not just re-prompt until something works?
Signals Measured
- Systematic approach to root-cause identification
- Use of appropriate debugging techniques vs. blind re-prompting
- Quality of recovery when an approach fails
- Speed of recognizing dead ends and pivoting
Architectural Judgment
8% weight · deterministic. Does the engineer respect the existing codebase architecture? Scored against task-specific rubric expectations for scope and pattern adherence.
Signals Measured
- Respect for existing codebase patterns and conventions
- Deliberate decisions about scope and placement of changes
- Resistance to unnecessary AI-introduced complexity
Code Review Quality
5% weight · hybrid. Can they critically evaluate code — whether AI-generated or human-written?
Signals Measured
- Specificity and actionability of review feedback
- Ability to distinguish critical issues from minor style concerns
- Quality of suggested alternatives and explanations
Efficiency
(10% of score) Speed without quality is negative value — this is intentionally low-weighted.
Workflow Efficiency
10% weight · deterministic. Measures productive workflow and effective tool usage. Explicitly does NOT measure total time to completion or number of prompts.
Signals Measured
- Effective use of available development tools
- Productive momentum without unnecessary context switches
- Appropriate tool selection for the task at hand
- Read-before-write patterns indicating thoughtful workflow
Our scoring uses a combination of deterministic telemetry analysis and AI-assisted evaluation. Signal weighting includes per-session randomization to prevent pattern memorization.
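The seven dimension weights above (25 + 20 + 20 + 12 + 8 + 5 + 10) sum to 100%, so the base score reduces to a weighted average. A minimal sketch, assuming each dimension has already been scored 0-100 — the dimension keys and the `combine` helper are illustrative names, not the real API, and the sketch omits the per-session weight randomization:

```python
# Illustrative dimension weights from the section above (fractions of 1.0).
WEIGHTS = {
    "calibrated_trust": 0.25,
    "context_engineering": 0.20,
    "problem_decomposition": 0.20,
    "debugging_recovery": 0.12,
    "architectural_judgment": 0.08,
    "code_review_quality": 0.05,
    "workflow_efficiency": 0.10,
}

def combine(dimension_scores: dict[str, float]) -> float:
    """Weighted average of per-dimension scores (each 0-100)."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights cover 100%
    return sum(WEIGHTS[d] * dimension_scores[d] for d in WEIGHTS)
```

For example, an engineer scoring 80 on every dimension gets a base score of 80 regardless of weighting; the weights only matter when dimension scores diverge.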
Behavioral Patterns
Beyond individual dimensions, we detect overall session patterns — how the engineer approaches the problem as a whole. Patterns apply a small modifier to the overall score.
Calibrated Expert
+5 to +8. Behavior intensity matches task demands — light verification for simple changes, deep testing for complex ones. The strongest signal of experience.
Methodical Verifier
+2 to +5. Systematic, thorough verification on every change. Always a positive signal — may over-verify on simple tasks, but never misses real issues.
Explore-Plan-Execute
+3 to +5. High orientation time → specific prompts → targeted verification. Strong signal of structured problem-solving.
Recovery Pivot
+3 to +6. Initial approach fails → recognizes dead end → pivots strategy → succeeds. Shows resilience, experience, and intellectual honesty.
Context Blind
-5 to -10. Demonstrates skills but ignores task-specific context — doesn't read rubric files, misses key architectural patterns, applies generic solutions to specific problems.
Spray and Pray
-8 to -15. Immediate vague prompts → accept first output → no verification. The weakest signal of engineering judgment.
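The pattern modifiers above apply within fixed ranges. One way to sketch this — purely as an assumption, since the source does not say how a point within each range is chosen — is to interpolate by detection strength, with mild evidence landing at the near end of the range and strong evidence at the far end. The pattern keys, the `strength` parameter, and `apply_pattern` are illustrative:

```python
# (mild, severe) modifier endpoints from the pattern list above.
PATTERN_MODIFIERS = {
    "calibrated_expert": (5, 8),
    "methodical_verifier": (2, 5),
    "explore_plan_execute": (3, 5),
    "recovery_pivot": (3, 6),
    "context_blind": (-5, -10),
    "spray_and_pray": (-8, -15),
}

def apply_pattern(score: float, pattern: str, strength: float) -> float:
    """Interpolate the modifier by detection strength in [0, 1] (assumed
    mechanism), then clamp the result back to the 0-100 scale."""
    mild, severe = PATTERN_MODIFIERS[pattern]
    modifier = mild + strength * (severe - mild)
    return max(0.0, min(100.0, score + modifier))
```

Under this sketch, a clear "Spray and Pray" session at score 70 drops to 55, while a borderline one drops only to 62.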
Comprehension Checks
After submission, candidates answer 3-5 targeted questions about their work. This prevents the "AI-generated code you don't understand" problem — research shows developers using AI score 17% lower on comprehension tests than those coding manually.
Grade Scale
The overall score (0-100) maps to a letter grade and performance band.
| Grade | Score Range | Band | What It Means |
|---|---|---|---|
| S | 90-100 | Exceptional | Top-tier performance across all dimensions. Rare. |
| A | 80-89 | Strong | Consistently strong AI collaboration. Recommended hire. |
| B | 70-79 | Competent | Solid fundamentals with room to grow. |
| C | 60-69 | Developing | Some good patterns, but significant gaps. |
| D | 50-59 | Needs Work | Below expectations for the role. |
| F | <50 | Significant Gaps | Fundamental skills not demonstrated. |
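The grade table is a simple threshold mapping. A minimal sketch — `to_grade` is an illustrative name, not the real API:

```python
def to_grade(score: float) -> str:
    """Map a 0-100 overall score to the letter grades in the table above."""
    if score >= 90: return "S"   # Exceptional
    if score >= 80: return "A"   # Strong
    if score >= 70: return "B"   # Competent
    if score >= 60: return "C"   # Developing
    if score >= 50: return "D"   # Needs Work
    return "F"                   # Significant Gaps
```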
Scoring Reliability
Zero Rater Variance
4 of 7 dimensions are fully deterministic — computed from telemetry signals with per-session weight randomization to prevent pattern memorization. The same session always produces the same base scores. No human rater, no subjectivity.
Bounded LLM Enhancement
For 3 hybrid dimensions, LLM analysis can adjust scores by at most ±15 points. Each adjustment includes confidence level (high/medium/low) and written justification.
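The ±15-point bound amounts to clamping whatever adjustment the LLM proposes before applying it. A minimal sketch, assuming the bound works as a simple clamp — `apply_llm_adjustment` is a hypothetical name, and the confidence level and justification mentioned above are omitted here:

```python
def apply_llm_adjustment(base: float, proposed_delta: float) -> float:
    """Clamp the LLM's proposed adjustment to [-15, +15], then keep the
    adjusted dimension score inside the 0-100 range."""
    delta = max(-15.0, min(15.0, proposed_delta))
    return max(0.0, min(100.0, base + delta))
```

An over-enthusiastic +40 proposal on a base of 70 thus lands at 85, never higher.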
Evidence Trail
Every score links to specific timestamped events in the session. You can review the raw evidence — prompts, edits, test runs — that produced each number.
Outcome Modifier
After dimension scoring, a small modifier adjusts the overall score based on final test results. Process matters most, but outcomes matter too.
| Final Test Pass Rate | Score Adjustment |
|---|---|
| 100% | No penalty |
| 70-99% | -5 points |
| 30-69% | -10 points |
| 1-29% | -15 points |
| 0% | -20 points |
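The outcome modifier is a lookup over the pass-rate bands above. A minimal sketch — `outcome_penalty` is an illustrative name, and `pass_rate` is assumed to be a fraction in [0, 1]:

```python
def outcome_penalty(pass_rate: float) -> int:
    """Score adjustment for the final test pass rate, per the table above."""
    if pass_rate >= 1.0:  return 0    # 100%: no penalty
    if pass_rate >= 0.70: return -5   # 70-99%
    if pass_rate >= 0.30: return -10  # 30-69%
    if pass_rate > 0.0:   return -15  # 1-29%
    return -20                        # 0%: nothing passes
```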
See It In Action
View a complete sample scorecard with all dimensions scored, behavioral patterns detected, and evidence linked.