How We Score

Our scoring system measures how engineers work with AI across 7 calibrated dimensions in 3 tiers. Every score is backed by telemetry evidence — not subjective judgment. The same behavior is scored differently depending on what the task demands.

Process Over Output

Two engineers can produce identical code — one by thoughtfully guiding AI with good context and verification, the other by accepting the third AI attempt after two failures. We measure the process, not just the result. Base scores are deterministic from telemetry; optional LLM enhancement adds qualitative evidence (bounded to ±15 points per dimension).
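As a minimal sketch of the enhancement step described above (function and parameter names here are illustrative, not the actual implementation), the bounded LLM adjustment amounts to a clamp applied on top of the deterministic base:

```python
def enhanced_score(base: float, llm_adjustment: float) -> float:
    """Apply an optional LLM adjustment to a deterministic base score.

    The adjustment is bounded to +/-15 points per dimension, as stated
    above; names and the 0-100 clamp are illustrative assumptions.
    """
    clamped = max(-15.0, min(15.0, llm_adjustment))
    # Keep the final dimension score within the 0-100 range.
    return max(0.0, min(100.0, base + clamped))
```

For example, a proposed +20 adjustment on a base of 72 is capped at +15, yielding 87.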

TIER 1

Calibrated AI Judgment

(65% of score)

Where the gap between strong and weak AI-augmented engineers is widest. Scores are calibrated per task, so expectations shift with what each task demands.

Calibrated Trust

25% weight (hybrid)

Does the engineer's level of verification match what the task demands? Trusting a simple AI rename is smart. Trusting a complex architectural suggestion without testing is dangerous.

Signals Measured

  • Whether verification intensity matches the risk level of the change
  • Appropriate trust calibration — accepting low-risk suggestions efficiently while scrutinizing high-risk ones
  • Critical evaluation of AI output relative to task complexity
  • Evidence of independent judgment in accepting or modifying AI suggestions

Example

An engineer who accepts a simple import fix without testing but runs full test suites after an AI-suggested architectural change scores 85+. Testing everything equally — or testing nothing — both score lower.

Context Engineering

20% weight (deterministic)

How effectively does the engineer select and provide context to the AI? Quality over quantity — 200 relevant lines beat 2000 lines of noise. Scored against task-specific rubric files and key context targets.

Signals Measured

  • Quality of investigation before seeking AI assistance
  • Relevance and precision of context provided to the AI
  • Whether prompts include meaningful architectural and behavioral constraints
  • Context quality evolution as understanding deepens

Example

Attaching the relevant config file and test patterns before asking for a fix scores higher than pasting entire files without filtering.

Problem Decomposition

20% weight (deterministic)

Does the engineer think before prompting? Exploration time is calibrated per task — a production triage task expects faster orientation than a complex refactor.

Signals Measured

  • Exploration depth appropriate to the situation
  • Evidence of structured thinking before acting
  • Prompt specificity and targeted problem framing
  • Whether complex problems are broken into manageable, scoped steps

Example

A candidate who spends 3 minutes exploring a debugging task before prompting scores well. The same 3 minutes on a production triage (where speed matters) would score lower.

TIER 2

Technical Execution

(25% of score)

Core engineering skills that remain essential regardless of AI assistance.

Debugging & Recovery

12% weight (hybrid)

When things go wrong, can the engineer find the root cause and fix it — not just re-prompt until something works?

Signals Measured

  • Systematic approach to root-cause identification
  • Use of appropriate debugging techniques vs. blind re-prompting
  • Quality of recovery when an approach fails
  • Speed of recognizing dead ends and pivoting

Example

Spotting that the AI's fix passes tests but introduces a subtle race condition, then systematically narrowing the root cause.

Architectural Judgment

8% weight (deterministic)

Does the engineer respect the existing codebase architecture? Scored against task-specific rubric expectations for scope and pattern adherence.

Signals Measured

  • Respect for existing codebase patterns and conventions
  • Deliberate decisions about scope and placement of changes
  • Resistance to unnecessary AI-introduced complexity

Example

A focused fix in 2-3 files that follows existing patterns scores higher than letting AI scatter changes across 10 files.

Code Review Quality

5% weight (hybrid)

Can they critically evaluate code — whether AI-generated or human-written?

Signals Measured

  • Specificity and actionability of review feedback
  • Ability to distinguish critical issues from minor style concerns
  • Quality of suggested alternatives and explanations

Example

Identifying SQL injection AND suggesting parameterized queries scores higher than noting 'security issue'.

TIER 3

Efficiency

(10% of score)

Speed without quality is negative value — this is intentionally low-weighted.

Workflow Efficiency

10% weight (deterministic)

Measures productive workflow and effective tool usage. Explicitly does NOT measure total time to completion or number of prompts.

Signals Measured

  • Effective use of available development tools
  • Productive momentum without unnecessary context switches
  • Appropriate tool selection for the task at hand
  • Read-before-write patterns indicating thoughtful workflow

Example

Using the AI to read and understand code before writing, running tests in the terminal, and maintaining productive flow throughout. Taking time to be thorough is not penalized.

Our scoring uses a combination of deterministic telemetry analysis and AI-assisted evaluation. Signal weighting includes per-session randomization to prevent pattern memorization.
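A sketch of how seeded per-session weight randomization might work, assuming the nominal dimension weights listed above. The jitter size, seeding scheme, and the fact that the real system randomizes at the signal level (not the dimension level) make this an illustration only:

```python
import random

# Nominal dimension weights from the tiers above (sum to 1.0).
NOMINAL_WEIGHTS = {
    "calibrated_trust": 0.25,
    "context_engineering": 0.20,
    "problem_decomposition": 0.20,
    "debugging_recovery": 0.12,
    "architectural_judgment": 0.08,
    "code_review_quality": 0.05,
    "workflow_efficiency": 0.10,
}

def session_weights(session_id: str, jitter: float = 0.02) -> dict:
    """Jitter each nominal weight by up to +/-jitter, then renormalize.

    Assumption-laden sketch: jitter size and per-session seeding are
    invented here for illustration.
    """
    rng = random.Random(session_id)  # same session id -> same weights
    raw = {name: max(0.0, w + rng.uniform(-jitter, jitter))
           for name, w in NOMINAL_WEIGHTS.items()}
    total = sum(raw.values())
    return {name: v / total for name, v in raw.items()}
```

Because the randomization is seeded by the session, the same session always reproduces the same weights, which is what keeps the base scores deterministic despite the jitter.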

Behavioral Patterns

Beyond individual dimensions, we detect overall session patterns — how the engineer approaches the problem as a whole. Patterns apply a small modifier to the overall score.

Calibrated Expert

+5 to +8

Behavior intensity matches task demands — light verification for simple changes, deep testing for complex ones. The strongest signal of experience.

Methodical Verifier

+2 to +5

Systematic, thorough verification on every change. Always a positive signal — may over-verify on simple tasks, but never misses real issues.

Explore-Plan-Execute

+3 to +5

High orientation time → specific prompts → targeted verification. Strong signal of structured problem-solving.

Recovery Pivot

+3 to +6

Initial approach fails → recognizes dead end → pivots strategy → succeeds. Shows resilience, experience, and intellectual honesty.

Context Blind

-5 to -10

Demonstrates skills but ignores task-specific context — doesn't read rubric files, misses key architectural patterns, applies generic solutions to specific problems.

Spray and Pray

-8 to -15

Immediate vague prompts → accept first output → no verification. The weakest signal of engineering judgment.
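The pattern modifiers above could be applied roughly as follows. How the real detector picks a point within each range is not documented; here it is scaled by a hypothetical detection strength between 0 and 1:

```python
# Modifier ranges from the pattern list above, as (weakest, strongest).
PATTERN_MODIFIERS = {
    "calibrated_expert":    (5, 8),     # +5 to +8
    "methodical_verifier":  (2, 5),     # +2 to +5
    "explore_plan_execute": (3, 5),     # +3 to +5
    "recovery_pivot":       (3, 6),     # +3 to +6
    "context_blind":        (-5, -10),  # -5 to -10
    "spray_and_pray":       (-8, -15),  # -8 to -15
}

def apply_pattern(score: float, pattern: str, strength: float = 1.0) -> float:
    """Shift the overall score by a pattern modifier.

    `strength` (0..1) is a hypothetical detection-confidence knob that
    interpolates between the weak and strong ends of each range.
    """
    weak, strong = PATTERN_MODIFIERS[pattern]
    modifier = weak + (strong - weak) * max(0.0, min(1.0, strength))
    return max(0.0, min(100.0, score + modifier))
```

For instance, a confidently detected Calibrated Expert pattern lifts an 80 to 88, while a confidently detected Context Blind pattern drops it to 70.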

Comprehension Checks

After submission, candidates answer 3-5 targeted questions about their work. This prevents the "AI-generated code you don't understand" problem — research shows developers using AI score 17% lower on comprehension tests than those coding manually.

  • Multiple choice: Root cause identification — can they explain what was actually wrong?
  • Multiple choice: Approach justification — why did they choose this approach over alternatives?
  • Multiple choice: Code comprehension — what does this specific code do and what trade-offs does it make?
  • Open-ended: Reflection — what would they do differently with more time?

Grade Scale

The overall score (0-100) maps to a letter grade and performance band.

Grade | Score Range | Band | What It Means
S | 90-100 | Exceptional | Top-tier performance across all dimensions. Rare.
A | 80-89 | Strong | Consistently strong AI collaboration. Recommended hire.
B | 70-79 | Competent | Solid fundamentals with room to grow.
C | 60-69 | Developing | Some good patterns, but significant gaps.
D | 50-59 | Needs Work | Below expectations for the role.
F | <50 | Significant Gaps | Fundamental skills not demonstrated.
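The grade bands reduce to simple threshold checks; a minimal sketch of the mapping:

```python
def grade(score: float) -> str:
    """Map a 0-100 overall score to its letter grade per the table above."""
    bands = [(90, "S"), (80, "A"), (70, "B"), (60, "C"), (50, "D")]
    for cutoff, letter in bands:
        if score >= cutoff:
            return letter
    return "F"  # below 50
```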

Scoring Reliability

Zero Rater Variance

4 of 7 dimensions are fully deterministic — computed from telemetry signals with per-session weight randomization to prevent pattern memorization. The same session always produces the same base scores. No human rater, no subjectivity.

Bounded LLM Enhancement

For 3 hybrid dimensions, LLM analysis can adjust scores by at most ±15 points. Each adjustment includes confidence level (high/medium/low) and written justification.

Evidence Trail

Every score links to specific timestamped events in the session. You can review the raw evidence — prompts, edits, test runs — that produced each number.

Outcome Modifier

After dimension scoring, a small modifier adjusts the overall score based on final test results. Process matters most, but outcomes matter too.

Final Test Pass Rate | Score Adjustment
100% | No penalty
70-99% | -5 points
30-69% | -10 points
1-29% | -15 points
0% | -20 points
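The outcome modifier is a lookup over the pass-rate bands above; a sketch, where exact boundary handling between bands is an assumption:

```python
def outcome_penalty(pass_rate: float) -> int:
    """Score adjustment from the final test pass rate (a fraction in [0, 1])."""
    if pass_rate >= 1.0:
        return 0      # all tests pass: no penalty
    if pass_rate >= 0.70:
        return -5
    if pass_rate >= 0.30:
        return -10
    if pass_rate > 0.0:
        return -15
    return -20        # nothing passes
```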

See It In Action

View a complete sample scorecard with all dimensions scored, behavioral patterns detected, and evidence linked.