GPT-5 Capabilities: The Complete Technical Guide for Developers & Founders
Everything that actually matters about OpenAI’s flagship model — benchmarks, pricing, hallucinations, and what it means for your product in 2025–2026.
On August 7, 2025, OpenAI didn’t just release a new model. It collapsed its entire model portfolio into one, and then the flagship feature broke on launch day. Nine months later, GPT-5 is the engine behind 900 million weekly active users and a $25 billion revenue run rate. This guide separates what GPT-5 actually delivers from what OpenAI wants you to believe it delivers.
What Is GPT-5?
GPT-5 is OpenAI’s flagship large language model, released on August 7, 2025 at 10AM PT. It’s available across ChatGPT (all user tiers), the OpenAI API platform, and the GitHub Models Playground.
The defining architectural move: GPT-5 is a unified system rather than a single monolithic model. It houses a fast conversational sub-model for routine queries and a deep reasoning sub-model (“GPT-5 Thinking”) for complex tasks. A real-time router decides which one engages, based on query complexity, tool requirements, and explicit signals such as a user typing “think carefully about this.”
Before GPT-5, users had to manually choose between the GPT-4o series (fast, conversational) and the o-series reasoning models (o1 and o3: slower, but more accurate on math and science). GPT-5 eliminates that decision entirely. Or it was supposed to: the router malfunctioned on launch day, which we’ll get to.
“It’s like talking to an expert. A legitimate, PhD-level expert in any area you need.”
Sam Altman, CEO, OpenAI, Pre-recorded press briefing, August 7, 2025
That PhD-level framing maps to specific benchmarks: 88.4% on GPQA Diamond (graduate-level science) and 67.2% on HealthBench (medical conversations). The claim isn’t hype without data. Whether the data holds up in your production environment is a different question.
GPT-5 Benchmark Scores: The Complete Breakdown
Benchmarks are the language enterprises use to justify procurement and the numbers engineers use to set expectations. Here’s what GPT-5 scored, using the figures OpenAI published at launch.
| Benchmark | GPT-5 Score | What It Measures | Why It Matters |
|---|---|---|---|
| AIME 2025 | 94.6% | High school olympiad mathematics | Stumps most adults. Signals deep reasoning without tools. |
| SWE-bench Verified | 74.9% | Real-world software engineering (bug-fixing) | GPT-4.1 scored 54.6% four months earlier — a 20-point jump. |
| Aider Polyglot | 88% | Cross-language coding ability | Multi-language production relevance for full-stack teams. |
| GPQA Diamond | 88.4% | PhD-level physics, chemistry, biology | Curated to be hard even for the PhDs who wrote the questions. |
| MMMU | 84.2% | Multimodal understanding | Image + text reasoning for document-heavy workflows. |
| HealthBench | 67.2% | Clinical conversation quality | Benchmark for medical AI deployments in regulated settings. |
The SWE-bench figure deserves special attention. OpenAI’s developer page documents the trajectory: GPT-4o scored 33.2%, GPT-4.1 reached 54.6%, and GPT-5 hit 74.9%, all within a 12-month window. For engineering teams, that isn’t a benchmark number. That’s the delta between “AI assists with code” and “AI autonomously closes GitHub issues.”
GPT-5’s token efficiency is a hidden financial story. OpenAI reports 50–80% fewer output tokens than o3 for equivalent performance. If your pipeline previously ran on o3, switching to GPT-5 can cut output-token spend by half or more before factoring in any price-per-token differences.
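The arithmetic behind that claim can be sketched in a few lines. This is an illustration only: the 50–80% range is the figure OpenAI reports, while the monthly token volume and the $10 per 1M price are assumptions chosen for the example. The price is held fixed so that only the change in token volume is measured.

```python
def output_cost_usd(output_tokens: int, price_per_million_usd: float) -> float:
    """Cost of generating `output_tokens` at a given per-million-token price."""
    return output_tokens / 1_000_000 * price_per_million_usd


def savings_range(baseline_tokens: int, price_per_million_usd: float,
                  low_reduction: float = 0.50, high_reduction: float = 0.80):
    """Return (min_saving, max_saving) in USD for a token-reduction range.

    The default 50-80% range is the reduction OpenAI reports versus o3.
    The per-token price is held constant on purpose.
    """
    baseline = output_cost_usd(baseline_tokens, price_per_million_usd)
    return baseline * low_reduction, baseline * high_reduction


# Hypothetical workload: 200M output tokens per month at an assumed $10 per 1M.
low, high = savings_range(200_000_000, 10.00)
print(f"Monthly output-token saving: ${low:,.0f} to ${high:,.0f}")
```

Input tokens are deliberately left out: the reported efficiency gain applies to generated tokens, so a pipeline dominated by long prompts would see a smaller share of its bill change.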
How GPT-5 Differs from GPT-4o and o3
The simplest framing: GPT-5 is what you’d get if GPT-4o and o3 had a child that also knew when to think slowly.
GPT-4o was fast and conversational. o3 was slow and brilliant at math and science. Users had to choose between them depending on the task, a friction point that caused constant miscategorization. GPT-5’s real-time router eliminates that choice.
Three concrete differences that change day-to-day developer experience:
- No manual model selection. The router decides whether to engage fast or deep reasoning based on query complexity. In practice, it tends to handle ambiguous tasks better than users did when picking a model themselves.
- 45% fewer factual errors than GPT-4o in OpenAI’s internal testing. In reasoning mode, the figure climbs to 80% fewer errors versus o3. (Independent validation is mixed; see “Does GPT-5 Still Hallucinate?” below.)
- Stronger front-end web development. GPT-5 outperforms o3 70% of the time in OpenAI’s internal evaluations. For developers doing full-stack work, that’s not marginal; it’s a genuine first-pass quality shift.
The routing feature, GPT-5’s central innovation, malfunctioned on August 7, 2025: the flagship technical differentiator did not work correctly on day one. OpenAI also published benchmark bar charts that visually contradicted their own numerical data. On the “coding deception” chart, a metric where a lower number indicates better performance, GPT-5 was drawn with a shorter bar than o3 in a way that did not match the printed values. InfoQ documented both issues in detail, and OpenAI issued corrections. Both errors raised legitimate questions about internal quality control for the company’s most important launch in two years.
GPT-5 API Pricing: What You’ll Actually Pay
This is the section that should be pinned to every startup’s engineering Slack. GPT-5 launched at a price point that made it seem like the cost curve was finally working in developers’ favor. What happened next was not that.
| Model Version | Release Date | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|---|
| GPT-5 (launch) | August 7, 2025 | $1.25 | $10.00 |
| GPT-5.4 | ~March 2026 | $2.50 | — |
| GPT-5.5 (“Spud”) | April 23, 2026 | $5.00 | $30.00 |
API input pricing quadrupled in eight months. Output pricing tripled. During the same period, NVIDIA CEO Jensen Huang stated that hardware costs per inference token dropped approximately 35×. OpenAI’s pricing trajectory is not following infrastructure economics. It’s following market demand and competitive positioning.
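To see what those multipliers mean for a real bill, the sketch below prices one workload under both rows of the table. The prices are copied from the table above; the monthly token volumes are hypothetical and should be replaced with your own usage data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Per-million-token API prices in USD."""
    input_per_m: float
    output_per_m: float


# Prices copied from the pricing table in this section.
GPT5_LAUNCH = Pricing(input_per_m=1.25, output_per_m=10.00)
GPT5_5 = Pricing(input_per_m=5.00, output_per_m=30.00)


def monthly_cost(p: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Total monthly spend in USD for a given token volume."""
    return (input_tokens * p.input_per_m + output_tokens * p.output_per_m) / 1_000_000


# Hypothetical workload: 2B input tokens and 200M output tokens per month.
IN_TOKENS, OUT_TOKENS = 2_000_000_000, 200_000_000

before = monthly_cost(GPT5_LAUNCH, IN_TOKENS, OUT_TOKENS)
after = monthly_cost(GPT5_5, IN_TOKENS, OUT_TOKENS)
print(f"launch: ${before:,.0f}  GPT-5.5: ${after:,.0f}  multiplier: {after / before:.2f}x")
```

Because input rose 4× and output rose 3×, the blended multiplier always lands between those two values; where exactly depends on the input-to-output mix of the workload.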
Any product with significant token throughput that was budgeted at $1.25/M input is now facing 4× the cost if it has migrated to current models. That’s not a price increase, it’s a category change in unit economics.
NeuralWired Research Desk analysis, May 2026
For ChatGPT users: Plus ($20/month) includes GPT-5 with usage limits on thinking-mode messages. Pro ($100–$200/month, restructured from launch’s $200 flat) includes GPT-5 Pro with extended reasoning and no token budget restriction. Ed Zitron, tech critic and writer, framed the launch bluntly:
“Meaningful functionality… is being completely removed for ChatGPT Plus and Team subscribers.”
Ed Zitron, Technology Critic — “Where’s Your Ed At” newsletter, August 2025, via Voiceflow
Our read: Zitron’s critique is specifically about model-selection removal and rate limits, not raw capability. Both things can be true: GPT-5 is technically more capable than GPT-4o, and Plus users received fewer choices with the upgrade. Whether that trade is acceptable depends entirely on your use case.
GPT-5 Context Window and Technical Specs
| Parameter | GPT-5 (August 2025) | GPT-5.5 (April 2026) |
|---|---|---|
| Context Window | 400,000 tokens | 1,050,000 tokens (1M+) |
| Max Output | 128,000 tokens | — |
| Knowledge Cutoff | September 2024 | — |
| Output speed (tokens/sec) | ~77.7 (Artificial Analysis) | — |
| Training Infrastructure | Microsoft Azure AI supercomputers | |
| Distribution at Launch | ChatGPT, OpenAI API, GitHub Models, Agents SDK | |
The 400K context window matters for enterprise document workflows: processing full legal contracts, entire codebases, or multi-year financial filings in a single call. GPT-5.5’s 1M+ token context is available via the API and makes whole-repository code analysis practically viable for the first time in the OpenAI stack.
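A quick pre-flight check can tell you whether a document is likely to fit before you send it. The sketch below is a rough estimate, not a tokenizer: the 4-characters-per-token ratio is an assumed average for English text, the dictionary keys are illustrative labels rather than confirmed API model IDs, and the idea that reserved output counts against the same window is also an assumption. The window sizes come from the spec table above.

```python
# Context window sizes from the spec table; keys are illustrative labels only.
CONTEXT_WINDOWS = {
    "gpt-5": 400_000,
    "gpt-5.5": 1_050_000,
}

# ASSUMPTION: crude average for English prose. Use a real tokenizer for billing.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Rough token estimate using ceiling division on character count."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_in_context(text: str, model: str, reserved_output_tokens: int = 0) -> bool:
    """True if the estimated prompt size fits after reserving room for output."""
    budget = CONTEXT_WINDOWS[model] - reserved_output_tokens
    return estimate_tokens(text) <= budget


# A hypothetical 2-million-character corpus (about 500K estimated tokens).
corpus = "x" * 2_000_000
for label in CONTEXT_WINDOWS:
    print(label, fits_in_context(corpus, label))
```

Code and non-English text usually tokenize less efficiently than English prose, so treat a result close to the limit as a "no" until it is measured with the actual tokenizer.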
GPT-5 vs Claude and Gemini
The short answer: no single model is comprehensively superior. Benchmark leadership is task-specific, and it’s shifting faster than procurement cycles can track.
| Benchmark | GPT-5.5 (Apr 2026) | Claude Opus 4.7 (Apr 2026) | Leader |
|---|---|---|---|
| Terminal-Bench 2.0 | 82.7% | 69.4% | GPT-5.5 |
| ARC-AGI-2 | 85.0% | 75.8% | GPT-5.5 |
| SWE-Bench Pro | 58.6% | 64.3% | Claude Opus 4.7 |
The competitive moat OpenAI held during the GPT-4 era has narrowed materially. Artificial Analysis scores GPT-5 at 45/100 on their Intelligence Index — above most models but not the categorical lead OpenAI commanded in 2023. ChatGPT’s US mobile app daily active user share fell from 69.1% in January 2025 to 38.7% by May 2026. Anthropic’s Claude app went from under 2% to 10% DAU share in three months.
GPT-5 is still the market leader by revenue and user count. It isn’t the unchallenged technical leader on every dimension.
Does GPT-5 Still Hallucinate?
Yes. Less than before — but the gap between what OpenAI claims and what independent testers find is real and worth understanding before you deploy in a regulated environment.
OpenAI’s claim: 45% fewer factual errors versus GPT-4o; 80% fewer errors in reasoning mode versus o3.
Independent testing: Vectara found GPT-5.2 had an 8.4% hallucination rate in their methodology, trailing DeepSeek. OpenAI’s own figure for GPT-5.2 was a reduction from 8.8% to 6.2%: a more modest 30% improvement, not the dramatic leap marketing suggested.
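The "30%" figure is a relative reduction, which is easy to confuse with the 2.6 percentage-point drop it is derived from. A minimal worked check, using only the two rates quoted above:

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from `before` to `after` (0.30 means 30% lower)."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Rates quoted in the text: 8.8% down to 6.2%.
print(f"absolute drop: {8.8 - 6.2:.1f} percentage points")
print(f"relative drop: {relative_reduction(8.8, 6.2):.1%}")
```

When comparing vendor claims, check which of the two numbers is being quoted; the same improvement can be presented as "2.6 points" or "about 30%".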
PCMag’s Ruben Circelli, who reviewed GPT-5 against real-world production tasks rather than benchmark conditions, was direct:
Circelli called GPT-5 an “insignificant update.” In his assessment, while it has some upgrades, it “doesn’t solve the problems that actually matter,” and he has not “noticed a significant improvement” in areas like hallucination reduction.
Ruben Circelli, Senior Analyst, PCMag — August 2025, via Voiceflow
That’s the practitioner gap: benchmark-measured hallucination uses controlled scenarios with defined correct answers. Production use involves open-ended, ambiguous queries where the model can’t know what it doesn’t know. GPT-5 is more reliable than GPT-4o. It’s not hallucination-free. Deploy accordingly.
One genuinely encouraging signal: a peer-reviewed study by Polat et al. (six MDs across four Turkish hospitals, published November 2025 in Letters to the Editor, NCBI) concluded that GPT-5’s measurable reduction in hallucination rates represents a meaningful milestone for medical and scientific writing, one of the first published academic assessments from clinical practitioners in a domain where errors cost lives. That’s cautious optimism, not a blanket clearance.
GPT-5 for Developers: Coding, Agents, and the Agents SDK
If you’re building software with or on AI, GPT-5 changes three things materially, and creates one significant risk.
What changes in practice
74.9% SWE-bench means autonomous issue resolution, not just code suggestions. At GPT-4o’s 33.2%, AI-assisted coding meant “AI suggests, human implements.” At 74.9%, the model can autonomously close real GitHub issues in verified test conditions. Combined with the Agents SDK (which provides orchestration, tracing, and MCP connectivity to external tools like CRM, payment, and support systems), multi-step autonomous pipelines are production-grade for the first time.
GPT-5 beats o3 at front-end web development 70% of the time in OpenAI’s internal evaluations. For developers doing full-stack work, that’s not marginal assistance; it’s a real gain in first-pass output quality. The net result is that senior engineering time spent on routine implementation patterns (API integrations, UI scaffolding, documentation) can shift toward architecture and review.
What to do right now
Audit your current stack for tasks that consume disproportionate senior engineering time but follow a pattern: bug triage, code review, documentation, API integration. These are GPT-5’s highest-ROI targets. Evaluate the Agents SDK as an integration layer before building a custom orchestration system from scratch.
The risk you need to price in
API input pricing quadrupled and output pricing tripled from August 2025 to April 2026. Any product budgeted at GPT-5 launch pricing with significant token throughput now costs three to four times as much if it has migrated to current models. Build pricing escalation assumptions into any business case that relies on the GPT-5 stack. A multi-vendor or open-source fallback strategy isn’t optional caution at this point; it’s basic financial hygiene.
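One way to keep a fallback strategy cheap to adopt is to hide every vendor behind the same small interface from day one. The sketch below shows that pattern under stated assumptions: `Provider`, `complete_with_fallback`, and the two stub functions are hypothetical names for illustration, and no real vendor SDK is called. In practice each provider would wrap whichever client library you use.

```python
from typing import Callable, Sequence

# A provider is any callable that takes a prompt and returns text.
# Real implementations would wrap a vendor SDK; none is assumed here.
Provider = Callable[[str], str]


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain raised an exception."""


def complete_with_fallback(
    prompt: str, providers: Sequence[tuple[str, Provider]]
) -> tuple[str, str]:
    """Try providers in order and return (provider_name, completion)."""
    errors: list[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Stub providers for illustration only.
def primary(prompt: str) -> str:
    raise TimeoutError("simulated outage")


def open_weights_fallback(prompt: str) -> str:
    return f"[fallback] {prompt}"


used, text = complete_with_fallback(
    "Summarize this ticket.",
    [("primary", primary), ("fallback", open_weights_fallback)],
)
print(used, text)
```

The same ordered list can be reordered by cost rather than by preference, which turns a resilience mechanism into a pricing lever when one vendor raises rates.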
GPT-5 for Founders: What Changes in Your Build-vs-Buy Decisions
The uncomfortable truth: GPT-5 compressed the moat of a large class of AI startups in a single launch. If your competitive advantage was “we built a better AI wrapper,” that advantage has narrowed to the point where you need to name what specifically you still do better than the base model.
The opportunity is real too. Enterprise deployments at GPT-5 launch included Morgan Stanley (financial workflows), Amgen (scientific research), and T-Mobile (customer operations). Fortune 500 procurement of AI tools has accelerated. If you serve any of those verticals, GPT-5 integration is now a procurement requirement, not a differentiator.
42% of new SaaS platforms with AI capabilities launched in 2025 relied on OpenAI models. That means GPT-5 is infrastructure. The differentiation layer has shifted up the stack, to proprietary data, domain-specific fine-tuning, and integration quality. Prompt engineering alone isn’t a moat anymore. It arguably never was, but GPT-5 made that unavoidable.
Invest now in proprietary data pipelines and fine-tuning infrastructure. The competitive question for any AI-native product is no longer “is our model good?” but “do we have data the base model doesn’t?” That’s where defensible differentiation now lives.
The Skeptic’s Case: What GPT-5 Doesn’t Solve
Balanced coverage means saying the things OpenAI’s press releases don’t.
The AGI framing is marketing
Sam Altman’s description of GPT-5 as offering “PhD-level expertise” maps directly to one benchmark: GPQA Diamond. In controlled academic tests with defined answers, GPT-5 performs at a PhD level on scientific knowledge retrieval. On open-ended reasoning chains involving novel problems, ambiguous real-world data, or multi-domain synthesis, it remains significantly below expert human performance.
GPT-5 performs comparably to or better than human experts in roughly half of cases across 40+ occupations. That means it performs worse than human experts in the other half. At NeurIPS 2025, only 2 of 5,000 papers mentioned AGI. Prominent researchers including Demis Hassabis have emphasized that scaling transformers alone hits a wall: reaching genuine general intelligence will require paradigm-level innovation, not just larger models.
Agentic reliability isn’t solved yet
GPT-5’s agentic capabilities are real. The reliability math is not flattering for complex pipelines. A 95% success rate per tool call yields approximately 60% end-to-end success over 10 sequential steps. Enterprises deploying GPT-5 agents in customer-facing workflows without robust human-in-the-loop checkpoints are assuming a reliability threshold the model doesn’t yet consistently meet.
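The compounding effect is worth checking with your own numbers. The sketch below assumes each step succeeds or fails independently, which is a simplification (real pipelines often have correlated failures and retries), and uses the 95% and 10-step figures from the paragraph above.

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that all `steps` succeed, assuming independent steps."""
    return per_step ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step success rate needed to reach `target` end to end."""
    return target ** (1 / steps)


# Figures from the text: 95% per tool call over 10 sequential steps.
print(f"end-to-end at 95% per step: {end_to_end_success(0.95, 10):.1%}")

# Inverse question: what per-step rate gives 95% end to end over 10 steps?
print(f"per-step rate needed for 95% overall: {required_per_step(0.95, 10):.2%}")
```

The inverse calculation is the more useful one for planning: it shows how close to perfect each individual tool call has to be before a long chain can run without a human checkpoint.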
Regulatory exposure in regulated sectors
GPT-5’s use in healthcare, legal, and financial services creates EU AI Act exposure. OpenAI hasn’t published a conformity assessment for GPT-5 under the Act’s high-risk provisions. Companies deploying it in these domains are accepting compliance risk that OpenAI itself hasn’t fully addressed publicly. If you’re a CTO in a regulated vertical, that’s not a footnote; it’s a procurement risk factor that belongs in your security review.
The GPT-5 Model Family: From 5.1 to 5.5
GPT-5 is not a single model; it’s an ongoing release cadence. Counting the launch release, five significant versions shipped within roughly nine months.
| Version | Release Date | Key Changes |
|---|---|---|
| GPT-5 | August 7, 2025 | Flagship launch — unified routing system, 400K context |
| GPT-5.1 | ~November 2025 | Incremental refinements |
| GPT-5.2 | December 11, 2025 | 400K context confirmed, 3 variants (Instant / Thinking / Pro), ARC-AGI-1 >90% |
| GPT-5.4 | ~March 2026 | Coding and agentic focus, front-end design improvements |
| GPT-5.5 “Spud” | April 23, 2026 | 1M+ token context, Terminal-Bench 2.0 at 82.7%, API pricing doubled from 5.4 |
The pace is deliberate. Sam Altman reportedly referred to GPT-5.5 as “the last big milestone before AGI” in internal remarks reported by the Financial Times in April 2026. Read carefully: that statement describes the current training paradigm having one or two more generations of runway before requiring a fundamental architectural shift, not a claim that AGI is imminent. It’s being read by many outlets as a promise it isn’t.
Our read: the GPT-5 series demonstrates that OpenAI has internalized the launch-iterate model from consumer software. The implication for anyone building on it is that the model you ship against today may be meaningfully different in six months, for better (capability) and worse (pricing).
Frequently Asked Questions
What is GPT-5?
GPT-5 is OpenAI’s flagship large language model, released August 7, 2025. It’s a unified system combining a fast conversational sub-model and a deep reasoning sub-model, with an automatic router that selects the right mode per query. It powers ChatGPT by default and is available via the OpenAI API. GPT-5 sets leading benchmarks in math (94.6% AIME 2025), coding (74.9% SWE-bench), and science (88.4% GPQA Diamond).
How is GPT-5 different from GPT-4o?
GPT-5 unifies GPT-4o’s conversational speed with the o-series reasoning models into one system, eliminating manual model selection. OpenAI reports 45% fewer factual errors than GPT-4o. On SWE-bench Verified it scores 74.9%, about 20 percentage points above GPT-4.1’s 54.6% and more than 40 points above GPT-4o’s 33.2%. It also introduces a real-time routing system that decides when to engage deeper reasoning without user input.
What are GPT-5’s benchmark scores?
GPT-5’s official benchmark scores: 94.6% on AIME 2025 (advanced math), 74.9% on SWE-bench Verified (software engineering), 88% on Aider Polyglot (coding), 84.2% on MMMU (multimodal), 88.4% on GPQA Diamond (PhD-level science, Pro reasoning), and 67.2% on HealthBench (medical). Published by OpenAI at launch, August 2025.
How much does GPT-5 cost?
GPT-5 launched at $1.25/M input tokens and $10/M output tokens (August 2025). Pricing escalated significantly: GPT-5.4 (March 2026) costs $2.50/M input; GPT-5.5 (April 2026) costs $5.00/M input and $30/M output, a 4× input increase in eight months. ChatGPT Plus ($20/month) includes access with usage limits; ChatGPT Pro ($100–$200/month) includes GPT-5 Pro with full extended reasoning.
What is GPT-5’s context window?
GPT-5 launched with a 400,000-token context window and a maximum output of 128,000 tokens per response. Knowledge cutoff is September 2024. GPT-5.5 (April 2026) extended the context window to 1,050,000 tokens (1M+) via the API, making whole-repository code analysis and large-document processing viable in a single call.
Is GPT-5 better than Claude?
It depends on the task. GPT-5.5 leads Claude Opus 4.7 on Terminal-Bench 2.0 (82.7% vs. 69.4%) and ARC-AGI-2 (85.0% vs. 75.8%). Claude Opus 4.7 leads on SWE-Bench Pro (64.3% vs. 58.6%). Neither model is comprehensively superior, and benchmark leadership is shifting faster than it has at any prior point in the LLM competitive cycle.
Does GPT-5 still hallucinate?
Yes, less than before, but not eliminated. OpenAI reports 45% fewer errors versus GPT-4o. Independent testing by Vectara found an 8.4% hallucination rate in GPT-5.2. A PCMag reviewer reported no significant improvement in real-world use. The gap between benchmark hallucination and production hallucination is real; GPT-5 is more reliable than its predecessors but not hallucination-free.
What is GPT-5 Pro?
GPT-5 Pro is the maximum-compute reasoning variant of GPT-5, exclusive to ChatGPT Pro subscribers ($100–$200/month as of April 2026). It enables extended “thinking” reasoning with no token budget restriction, producing more thorough answers on complex tasks. It scores higher than standard GPT-5 on GPQA Diamond (88.4%) and FrontierMath benchmarks.
When was GPT-5 released?
GPT-5 was officially released on August 7, 2025, at 10 AM PT. OpenAI teased the launch the previous day via a post on X embedding the number “5” in the announcement text. The model launched simultaneously on ChatGPT (all user tiers), the OpenAI API platform, and the GitHub Models Playground.
What You Now Know, and Where This Goes Next
GPT-5 is the most commercially successful AI model ever released. It is also an imperfect product that malfunctioned on launch day, shipped benchmark charts that contradicted their own data, and has since quadrupled its API input pricing while hardware costs per inference token reportedly fell 35×.
All of these things are true at once. The model is genuinely capable: 74.9% on SWE-bench and 88.4% on GPQA Diamond are not noise. The commercial moat is real: $25B+ ARR and 900 million weekly users are not accidents. And the operational risks are real: pricing escalation, benchmark-to-production hallucination gaps, regulatory exposure in high-risk sectors, and compounding error rates in agentic pipelines.
Three things to watch over the next 6–18 months:
- The competitive parity story. Claude Opus 4.7 already leads on SWE-Bench Pro. Gemini 3.1 competes on multimodal benchmarks. ChatGPT’s US mobile market share is below 40% for the first time. GPT-5 may not hold the benchmark lead across all dimensions by the end of 2026.
- The pricing ceiling. There’s no economic argument for API pricing increasing 4× in 8 months when inference costs are dropping. OpenAI is pricing against demand, not against cost. Watch for whether competition forces a reversal, or whether the market absorbs it.
- Agentic deployment reliability. The gap between GPT-5’s agentic capabilities and production-grade reliability in multi-step autonomous pipelines is the defining technical question for enterprise AI in 2026. The teams that figure out human-in-the-loop architectures that are fast enough to be useful will define what enterprise AI actually becomes.
GPT-5 is infrastructure now, the same way GPT-4 became infrastructure. The question isn’t whether to use it. It’s how to build on it without being entirely at the mercy of OpenAI’s pricing decisions, and where to differentiate above the model layer.