How does Future Work Academy grade student essays?

Essays are evaluated on 4 criteria worth 25 points each: Evidence Quality, Reasoning Coherence, Trade-off Analysis, and Stakeholder Consideration. GPT-4o scores every submission, and instructors can override any score with a full audit trail.

Can students attach charts and visualizations to their submissions?

Yes. Students can attach up to 5 image visualizations per weekly response. These are evaluated by GPT-4o vision and contribute to the Evidence Quality criterion score.

What are the scoring bands for the rubric?

Each of the 4 criteria is scored 0–25 points in five bands: Thorough (24–25), Solid (21–23), General (15–20), Basic (10–14), and Insufficient (below 10). The total possible score is 100 points per week.

Do instructors see the AI's reasoning or just the score?

Instructors see the full AI rationale for each criterion score and the evidence the AI cited from the student's essay, and they can override any score with their own judgment. All overrides are logged.

Is student submission data used to train AI models?

No. Future Work Academy explicitly opts out of model training. Student submissions are sent to OpenAI for grading evaluation only and are not used to train any AI model.

Radical Transparency

How We Grade

The AI sees the same rubric you see. No hidden criteria, no secret formulas. This page documents exactly how every essay is evaluated — and the same methodology applies across every simulation on the platform.

Theoretical Foundation

Grounded in research.
Not just built — designed.

Every design decision in this assessment system maps to established pedagogical research, ensuring the grading methodology serves learning — not just measurement.

Black & Wiliam (1998)

Formative Assessment

Rubric criteria are displayed while students write — not hidden until after submission. This implements Black & Wiliam's seminal finding that achievement improves when students understand evaluation criteria before performing tasks.

Hattie & Timperley (2007)

Feed-Forward Feedback

Per-criterion AI feedback after each weekly submission informs the next performance rather than merely evaluating the last. Each cycle's feedback becomes input for the next week's decision-making.

Kolb (1984); Kayes (2002)

Iterative Experiential Cycles

Each multi-week simulation repeats Kolb's experiential learning cycle — experience, reflection, conceptualization, experimentation — with compounding consequences, directly addressing Kayes's critique of single-iteration designs.

Kapur (2008, 2016)

Productive Failure

Students who struggle with complex problems before receiving instruction outperform those who receive instruction first. The compounding metric system — where early decisions create downstream consequences — embodies this principle.

Wood, Bruner & Ross (1976)

Scaffolded Complexity

Three difficulty tiers progressively reduce scaffolding as student capability increases — fewer advisor uses, tighter crisis thresholds — implementing Vygotsky's Zone of Proximal Development through structured support withdrawal.

Shermis & Burstein (2013)

Automated Essay Scoring

AI-assisted evaluation can reach agreement with human raters comparable to the agreement between human raters themselves when rubric criteria are clearly defined. Our transparent, criterion-level rubric design is built on this established AES research foundation.

Situated Cognition

Brown, Collins & Duguid (1989) established that knowledge is most effectively acquired within authentic contexts. The simulation's CEO role, stakeholders with quantified traits, and industry-sourced articles create a situated learning environment where strategic reasoning is embedded in realistic organizational dynamics.

Stakeholder Salience

The stakeholder system's influence, hostility, flexibility, and risk-tolerance dimensions mirror Mitchell, Agle & Wood's (1997) stakeholder salience framework of power, urgency, and legitimacy — creating the organizational complexity that makes decision-making consequential and grading contextually rich.

The Rubric

Four criteria.
100 points total.

Every essay is scored on these four dimensions — the same criteria displayed on the decision page while students write their responses.

Evidence Quality

25pts

Cite specific data, statistics, or case studies from Intel articles using source codes (AIM, APX, WFT)

What the AI evaluates:

Does the response reference concrete data points, industry benchmarks, or research findings? Are sources identified by code? Vague references to 'studies' without specifics score lower.

Reasoning Coherence

25pts

Present a logical argument connecting chosen strategy settings to evidence and outcomes

What the AI evaluates:

Is there a clear thesis or decision rationale? Do the paragraphs build on each other logically? Are cause-and-effect relationships articulated, not just asserted?

Trade-off Analysis

25pts

Acknowledge sacrifices, identify biggest risks, and explain contingency plans

What the AI evaluates:

Does the response name what is being given up? Are risks specific (not generic)? Is there a 'Plan B' or mitigation strategy? One-sided arguments score lower.

Stakeholder Consideration

25pts

Address how decisions affect 2-3+ stakeholder groups and balance competing interests

What the AI evaluates:

Are at least two distinct stakeholder perspectives named? Does the response acknowledge tension between groups? Are trade-offs between stakeholders addressed rather than ignored?

Scoring Bands

Clear expectations.
Published thresholds.

Every score maps to a published band — students know exactly where they stand and what it takes to improve.

Per-Criterion Bands (25 points each)

24–25

Thorough

Specific, well-supported, addresses the criterion comprehensively with depth and nuance

21–23

Solid

Strong work with minor gaps — may lack one citation or one stakeholder perspective

15–20

General

Demonstrates understanding of general concepts but limited depth, specificity, or evidence

10–14

Basic

Shows basic awareness of the topic but lacks citations, reasoning, or stakeholder analysis

< 10

Insufficient

No evidence of research use, off-topic, or too brief to evaluate meaningfully
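The band table above can be expressed as a simple lookup. This is a minimal sketch: the boundaries come from the published table, while the function name and data layout are illustrative assumptions, not the platform's actual code.

```python
# Lower bounds and labels copied from the published per-criterion table.
# The names CRITERION_BANDS and criterion_band are hypothetical.
CRITERION_BANDS = [
    (24, "Thorough"),
    (21, "Solid"),
    (15, "General"),
    (10, "Basic"),
    (0, "Insufficient"),
]


def criterion_band(score: int) -> str:
    """Return the band label for one criterion score (0 to 25)."""
    if not 0 <= score <= 25:
        raise ValueError("criterion score must be between 0 and 25")
    # Bands are listed from highest to lowest, so the first match wins.
    for lower_bound, label in CRITERION_BANDS:
        if score >= lower_bound:
            return label
    raise AssertionError("unreachable: every valid score matches a band")
```

For example, `criterion_band(22)` falls in the Solid band and `criterion_band(9)` in the Insufficient band.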

Overall Quality Thresholds

Excellent: 93–100%

Exceptional depth with specific data citations, multi-stakeholder analysis, and risk mitigation

Good: 72–92%

Solid analysis with clear reasoning, relevant evidence, and recognition of competing interests

Adequate: 52–71%

General understanding demonstrated but missing depth, specificity, or balanced perspective

Poor: < 52%

Insufficient evidence, unclear reasoning, or does not address the prompt requirements
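Because the four criteria total 100 points, the weekly total doubles as the percentage. The sketch below sums the criterion scores and applies the published thresholds; the function name and the dictionary input are assumptions made for illustration.

```python
# Thresholds copied from the published overall quality table.
# overall_quality is a hypothetical helper, not the platform's API.
QUALITY_THRESHOLDS = [
    (93, "Excellent"),
    (72, "Good"),
    (52, "Adequate"),
    (0, "Poor"),
]


def overall_quality(criterion_scores: dict) -> tuple:
    """Sum four 0-25 criterion scores and return (total, quality label)."""
    if len(criterion_scores) != 4:
        raise ValueError("expected exactly four criterion scores")
    for name, score in criterion_scores.items():
        if not 0 <= score <= 25:
            raise ValueError(f"{name} must be between 0 and 25")
    total = sum(criterion_scores.values())
    # Thresholds are ordered from highest to lowest, so the first match wins.
    for lower_bound, label in QUALITY_THRESHOLDS:
        if total >= lower_bound:
            return total, label
    raise AssertionError("unreachable: every valid total matches a label")
```

A response scoring 24, 23, 24, and 23 totals 94 and is labeled Excellent; one scoring 13 on three criteria and 12 on the fourth totals 51 and is labeled Poor.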

The Process

How the AI evaluates
your essay.

From submission to final grade — a transparent, five-step process where every decision point is visible.

Step 1

Student Submits Essay

The response is submitted through the weekly decision page, where the rubric and recommended sources are always visible.

Step 2

AI Evaluates Against Rubric

The AI independently scores each of the four criteria using the same rubric published to students — no hidden criteria, no secret formulas.

Step 3

Per-Criterion Feedback

Each criterion receives a numeric score (out of 25) and written feedback identifying specific strengths and areas for improvement.

Step 4

Overall Quality Assessment

Scores are totaled and mapped to a quality label (Excellent, Good, Adequate, Poor) using the published thresholds.

Step 5

Instructor Reviews

The instructor sees every AI score alongside the original essay. They can adjust scores, add comments, and override any grade before finalizing.

Calibration

Calibrated against
exemplar responses.

The grading engine is calibrated so that the scores match what experienced faculty would assign. We validate against exemplar essays to ensure consistency.

Excellent responses (93–96%)

Cite specific statistics, address 3+ stakeholder groups with contingency plans, and provide multi-layered risk analysis. These consistently score in the 93–96 range across repeated evaluations.

Good responses (72–88%)

Present clear reasoning with relevant evidence but may miss a stakeholder group or provide generic rather than specific risk mitigation.

Adequate responses (52–68%)

Show understanding of the topic but rely on general statements without specific data, acknowledge fewer trade-offs, or skip stakeholder analysis.

Quality Assurance

Consistent and
reproducible.

The AI evaluates each criterion independently, reducing the halo effect common in holistic grading. The same essay produces consistent scores across multiple evaluations.

Independent criterion scoring

Each of the four criteria is evaluated separately to prevent one strong area from inflating the others.

Granular scoring bands

Five distinct bands per criterion (not just pass/fail) reduce variance and reward nuanced work.

Written feedback per dimension

Students receive specific comments on each criterion — not just a number, but actionable guidance.

Human Authority

AI assists.
Instructors decide.

AI scores are formative — they give students immediate feedback and help instructors work efficiently. But the instructor always has the final word.

Review every AI-generated score alongside the original essay

Adjust individual criterion scores up or down

Add written comments and qualitative feedback

Override the overall score with a single click

Finalize grades on your timeline, not the AI's
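One way to picture the override workflow is a grade record that keeps the AI score intact and logs every instructor change. This is a sketch under stated assumptions: the class names, field names, and log format are hypothetical and do not describe the platform's actual data model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class CriterionGrade:
    """AI score for one criterion plus an optional instructor override."""
    ai_score: int
    ai_rationale: str
    instructor_score: Optional[int] = None

    @property
    def final_score(self) -> int:
        # The instructor's score, when present, always takes precedence.
        if self.instructor_score is not None:
            return self.instructor_score
        return self.ai_score


@dataclass
class GradeRecord:
    """All criterion grades for one submission, with an append-only log."""
    grades: dict
    audit_log: list = field(default_factory=list)

    def override(self, criterion: str, new_score: int, instructor: str,
                 comment: str = "") -> None:
        if not 0 <= new_score <= 25:
            raise ValueError("criterion score must be between 0 and 25")
        grade = self.grades[criterion]
        # Record the change before applying it so the log shows the old value.
        self.audit_log.append({
            "criterion": criterion,
            "previous": grade.final_score,
            "new": new_score,
            "instructor": instructor,
            "comment": comment,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
        grade.instructor_score = new_score

    def total(self) -> int:
        return sum(g.final_score for g in self.grades.values())
```

Keeping the AI score and the override in separate fields means the original evaluation is never lost, which is what makes the audit trail useful.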

Optional Feature

Curved scoring.
Opt-in only.

When different weeks have different difficulty levels, curved scoring normalizes results so students aren't penalized for tackling harder scenarios.

How it works

1. Statistical normalization centers the class around a target mean

2. Requires a minimum number of submissions to activate (avoids distortion with small samples)

3. Curved scores are bounded to prevent extreme outliers

4. Off by default — instructors enable it via a toggle in the grading module

Note: When curved scoring is disabled, all curved score columns, chart datasets, and PDF references are hidden. Raw scores are the only scores displayed.
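The four rules above can be sketched as a z-score rescale. The page does not publish the normalization method or its parameters, so everything numeric here is an assumption: the target mean of 80, target spread of 10, minimum of 10 submissions, and the 0 to 100 bounds are placeholder values, and `curve_scores` is a hypothetical name.

```python
from statistics import mean, pstdev


def curve_scores(raw_scores, enabled=False, target_mean=80.0, target_std=10.0,
                 min_submissions=10, lower=0.0, upper=100.0):
    """Return curved scores, or the raw scores when curving does not apply."""
    # Rule 4: off by default. Rule 2: skip small samples.
    if not enabled or len(raw_scores) < min_submissions:
        return list(raw_scores)

    class_mean = mean(raw_scores)
    class_std = pstdev(raw_scores)

    curved = []
    for score in raw_scores:
        # Rule 1: re-center around the target mean. If every score is
        # identical there is no spread to rescale, so z is zero.
        z = 0.0 if class_std == 0 else (score - class_mean) / class_std
        value = target_mean + z * target_std
        # Rule 3: clamp to the allowed range.
        curved.append(min(upper, max(lower, value)))
    return curved
```

With the toggle off, or with fewer submissions than the minimum, the function returns the raw scores unchanged, which matches the behavior described in the note above.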

Student Experience

What students see
at every step.

Transparency isn't just about publishing criteria — it's about making them visible in the moment they matter most.

While Writing

The weekly decision page displays the full rubric and recommended source articles alongside the essay input. Students can reference criteria and source codes while composing their response.

Rubric criteria with point values

Recommended reading panel with source codes

Word count tracking against minimums

3-part structured prompts for depth

After Submission

The week results page shows a per-criterion score breakdown with the same quality labels (Excellent, Good, Adequate, Poor) and written feedback on each dimension.

Per-criterion scores out of 25

Written feedback per dimension

Overall quality label with color coding

Comparison to published scoring bands

Integrity Safeguards

How we keep AI grades honest.

Every submission is anchored to a vetted exemplar, scored by two independent raters, and fact-checked against a curated evidence corpus before any score reaches the student.

95-point exemplar anchor

A vetted human-graded response sets the 95 mark. Scores above 95 require a written justification from the model.
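A minimal sketch of the anchor rule follows. The page states that scores above 95 require a written justification but not what happens when one is missing; clamping to 95 is an assumption made here for illustration, and `apply_exemplar_anchor` is a hypothetical name.

```python
EXEMPLAR_ANCHOR = 95  # the mark set by the vetted exemplar response


def apply_exemplar_anchor(total: int, justification: str = "") -> int:
    """Allow totals above the anchor only when a justification is supplied."""
    if total > EXEMPLAR_ANCHOR and not justification.strip():
        # Assumed fallback: hold the score at the anchor.
        return EXEMPLAR_ANCHOR
    return total
```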

Two-rater consensus

Two independent passes at different temperatures must agree within 3 points per criterion. Divergent scores trigger a tiebreaker pass.
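The consensus check might look like the sketch below. The 3-point tolerance comes from the text above; the temperatures, the way agreeing scores are combined (integer average), and the tiebreaker resolution (per-criterion median of three passes) are all assumptions. The `rate` callable is a hypothetical stand-in for one grading pass at a given temperature.

```python
from statistics import median
from typing import Callable, Dict

Scores = Dict[str, int]
MAX_DIVERGENCE = 3  # points per criterion, from the published rule


def consensus_scores(rate: Callable[[float], Scores],
                     temperatures=(0.2, 0.7),
                     tiebreak_temperature=0.4) -> Scores:
    """Run two passes; add a third pass if any criterion diverges."""
    first = rate(temperatures[0])
    second = rate(temperatures[1])

    diverged = [name for name in first
                if abs(first[name] - second[name]) > MAX_DIVERGENCE]
    if not diverged:
        # Assumed combination rule: integer average of the two passes.
        return {name: (first[name] + second[name]) // 2 for name in first}

    # Assumed tiebreaker rule: median of three passes for every criterion.
    third = rate(tiebreak_temperature)
    return {name: median([first[name], second[name], third[name]])
            for name in first}
```

Taking the median of three passes means a single outlying pass cannot set the final score on its own.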

Evidence verification

Cited statutes, codes, and case studies are checked against a whitelist. Evidence Quality is capped by the count of verified citations.
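A citation cap could be sketched as follows. The page does not say how citations are formatted or how the verified count maps to a cap, so the `AIM-01` style identifiers, the whitelist contents, and the cap table are all hypothetical values chosen only to make the example runnable.

```python
import re

# Hypothetical whitelist and citation format (source code, dash, number).
VERIFIED_SOURCES = frozenset({"AIM-01", "APX-07", "WFT-03"})
CITATION_PATTERN = re.compile(r"\b(?:AIM|APX|WFT)-\d+\b")

# Hypothetical mapping: verified citation count -> maximum Evidence Quality.
CAP_BY_VERIFIED_COUNT = {0: 9, 1: 14, 2: 20}
MAX_CRITERION_SCORE = 25


def cap_evidence_quality(essay: str, ai_score: int,
                         whitelist=VERIFIED_SOURCES) -> int:
    """Limit the Evidence Quality score by the number of verified citations."""
    cited = set(CITATION_PATTERN.findall(essay))
    # Only citations that appear on the whitelist count as verified.
    verified = cited & whitelist
    cap = CAP_BY_VERIFIED_COUNT.get(len(verified), MAX_CRITERION_SCORE)
    return min(ai_score, cap)
```

Under these assumed values, an essay that cites only sources missing from the whitelist cannot score above 9 on Evidence Quality, regardless of the AI's initial score.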

AI-writing screening

Heuristic stylometric analysis flags responses with hallmark AI patterns for instructor review before any grade is released.

Withholding & review queue

Flagged submissions are confirmed received but their scores are withheld pending instructor release in the Needs Review queue.

Cohort calibration report

Instructors see distribution statistics, divergence rates, and flag breakdowns to detect grade inflation across a cohort.

References


Black, P., & Wiliam, D. (1998). Assessment and classroom learning. Assessment in Education: Principles, Policy & Practice, 5(1), 7-74. doi:10.1080/0969595980050102

Brown, J. S., Collins, A., & Duguid, P. (1989). Situated cognition and the culture of learning. Educational Researcher, 18(1), 32-42. doi:10.3102/0013189X018001032

Hattie, J., & Timperley, H. (2007). The power of feedback. Review of Educational Research, 77(1), 81-112. doi:10.3102/003465430298487

Kapur, M. (2016). Examining productive failure, productive success, and unproductive failure in learning. Educational Psychologist, 51(2), 289-299. doi:10.1080/00461520.2016.1155457

Kayes, D. C. (2002). Experiential learning and its critics: Preserving the role of experience in management learning and education. Academy of Management Learning & Education, 1(2), 137-149. doi:10.5465/amle.2002.8509336

Kolb, D. A. (1984). Experiential learning: Experience as the source of learning and development. Prentice-Hall.

Mitchell, R. K., Agle, B. R., & Wood, D. J. (1997). Toward a theory of stakeholder identification and salience. Academy of Management Review, 22(4), 853-886. doi:10.5465/amr.1997.9711022105

Shermis, M. D., & Burstein, J. (Eds.). (2013). Handbook of automated essay evaluation: Current applications and new directions. Routledge.

Wood, D., Bruner, J. S., & Ross, G. (1976). The role of tutoring in problem solving. Journal of Child Psychology and Psychiatry, 17(2), 89-100. doi:10.1111/j.1469-7610.1976.tb00381.x

See it in action.

Request a 30-day demo to experience the full grading workflow — submit essays, review AI feedback, and explore the instructor dashboard.
