Meta's Muse Spark emerges from the newly formed Superintelligence Labs — a closed-source frontier model that leads on vision and health reasoning while trailing rivals on coding tasks.

Meta Muse Spark: What the Benchmarks Actually Mean, Where It Falls Short, and Who Should Pay Attention

Meta’s first model from its Superintelligence Labs is genuinely impressive on vision, health reasoning, and token efficiency. It’s also not the coding model you want. Here’s the unvarnished picture.

Published: April 9, 2026 · Reading time: ~14 minutes · Category: AI Model Analysis

On April 7, 2026, Meta released a model it had been building for months inside a newly formed internal unit called Meta Superintelligence Labs. The model is called Muse Spark. It runs Meta AI on the Meta AI app and meta.ai right now, with WhatsApp, Instagram, Facebook, Messenger, and the Ray-Ban Meta AI glasses to follow in the coming weeks.

The launch generated the usual wave of breathless coverage mixed with instant skepticism, which is roughly what you’d expect whenever a company with Meta’s reach announces a new frontier model. But if you’re a developer assessing whether to integrate it, a CTO deciding whether to move budget, or a researcher tracking the competitive dynamics of the frontier model race, the breathless/skeptical binary isn’t particularly useful. You need actual numbers, an honest accounting of where the model fits and where it doesn’t, and some sense of what the broader strategy actually is.

That’s what this piece is for.

The organizational context you need to understand first

Muse Spark didn’t emerge from Meta’s existing AI research pipeline. It came from a new unit, Meta Superintelligence Labs, that was stood up specifically because Mark Zuckerberg was reportedly dissatisfied with the progress of Meta’s Llama program. That’s not a minor footnote. It signals that Zuckerberg looked at where Llama was heading and concluded it wasn’t going to get Meta where it needed to be fast enough.

To lead the new lab, Meta recruited Alexandr Wang, co-founder and former CEO of Scale AI. Shortly before the launch, Meta also invested $14.3 billion in Scale AI for a 49% stake, securing not just Wang’s leadership but a massive data labeling pipeline. That kind of capital commitment tells you something about how seriously Meta is treating this bet. Analyst commentary frames Meta’s total AI spend, including infrastructure and partnerships, somewhere in the $115–135 billion range across the coming years.

There’s one more structural fact worth registering: unlike Llama, Muse Spark is closed-source. Meta says it hopes to open-source future versions, but for now the model is proprietary. That’s a deliberate pivot away from the open-source positioning that made Llama popular with researchers and developers worldwide. Whether that’s a strategic shift or just a temporary posture for the flagship line is an open question, but for anyone who built their stack on the assumption that Meta’s models would remain open, it’s a significant change.

What Muse Spark actually is

The clearest way to describe Muse Spark is as a natively multimodal model designed to be small, fast, and capable at reasoning tasks, especially those involving images, charts, health information, and scientific content. Meta describes it as “small and fast by design, yet capable enough to reason through complex questions in science, math, and health.”

“Small” here is relative, and Meta hasn’t disclosed exact parameter counts. But the design philosophy is deliberate: rather than scaling up a single massive model, Muse Spark uses what Meta’s team calls “thought compression” — a test-time scaling approach where multiple parallel subagents collaborate to solve hard problems. The idea is to spend more compute at inference time without making the base model grotesquely large. Alexandr Wang has framed this as a new scaling regime focused on efficient reasoning rather than brute-force parameter growth, a contrarian thesis relative to the prevailing assumption that bigger models always win.

In practice, this manifests as two modes in the consumer product: an Instant mode for quick answers and a Contemplating mode that spins up the multi-agent reasoning pipeline for harder queries. The latter is where Muse Spark’s reasoning capabilities show up most clearly, and it’s also the mode that carries higher infrastructure cost — something developers will need to account for when thinking about scale.
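
Meta hasn't published the internals of Contemplating mode, so code can only illustrate the general pattern. The sketch below shows the shape of parallel test-time scaling: fan a query out to several independent samples ("subagents"), then reduce the candidates to one answer. The call_model stub and the majority-vote reduction are assumptions for illustration, not Meta's implementation.

```python
import asyncio

# Illustrative sketch of parallel test-time scaling, NOT Meta's actual
# "thought compression" pipeline (which is undisclosed). call_model is
# a stand-in for any LLM completion API.

async def call_model(prompt: str, temperature: float) -> str:
    # Placeholder: a real implementation would call a model API here.
    return f"candidate answer for: {prompt!r}"

async def contemplate(prompt: str, n_agents: int = 4) -> str:
    # Fan out: each "subagent" is an independent, diversified sample.
    candidates = await asyncio.gather(
        *(call_model(prompt, temperature=0.8) for _ in range(n_agents))
    )
    # Reduce: majority vote over candidates. A production system might
    # use a judge model or cross-agent debate instead.
    return max(set(candidates), key=candidates.count)

if __name__ == "__main__":
    print(asyncio.run(contemplate("Why is the sky blue?")))
```

The practical point for cost modeling: a query in this kind of mode consumes roughly n_agents times the output tokens of a single pass, before any reduction overhead.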

Natively multimodal means the model was built from the ground up to handle images, not retrofitted with a vision adapter. It can read charts, parse scientific diagrams, analyze product images, interpret health-related visuals, and process visual data in ways that are architecturally integrated rather than bolted on.

The benchmark picture, unvarnished

At a glance:

  • 52: AI Intelligence Index (Artificial Analysis)
  • 58M: output tokens for the Index run (vs 157M for Claude Opus)
  • 86.4: CharXiv visual reasoning (beats GPT-5.4 at 82.8)
  • 42.8: HealthBench Hard (leads all models)

Artificial Analysis’s independent evaluation gives Muse Spark a score of 52 on their AI Intelligence Index — a composite measure running across reasoning, coding, multimodal understanding, and knowledge tasks. GPT-5.4 and Claude Opus 4.6 sit around 57–58; Gemini 3.1 Pro falls around 54–55. That 5–6 point gap is real but not catastrophic. The more interesting number is what it costs to get there.

Muse Spark used 58 million output tokens to complete the Intelligence Index evaluation. Claude Opus 4.6 used 157 million tokens for the same run. GPT-5.4 used 120 million. Gemini 3.1 Pro Preview came in at 57 million — essentially tied with Muse Spark. For teams running high-volume inference at scale, this efficiency gap has real cost implications. A model that gets you most of the way there at less than half the token count of its nearest competitor on raw intelligence deserves serious consideration.

Benchmark | Muse Spark | GPT-5.4 | Claude Opus 4.6 | Gemini 3.1 Pro
AI Intelligence Index | 52 | ~57–58 | ~57–58 | ~54–55
Output tokens (Index run) | 58M | 120M | 157M | 57M
MMMU-Pro (multimodal) | 80.5% | ~78–79% | ~77–78% | 82.4% (leads)
CharXiv visual reasoning | 86.4 (leads) | 82.8 | ~80 | 80.2
HealthBench Hard | 42.8 (leads) | high 30s–low 40s | similar band | slightly lower
GDPval-AA (agentic) | 1427 | 1676 (leads) | 1648 | 1320
TerminalBench Hard (coding) | below leaders | 75.1 (Terminal-Bench 2.0) | 80.8% (SWE-bench Verified) | 68.5
τ²-Bench Telecom | 92% (top tier) | n/a | n/a | n/a
CritPT (hard physics) | 11% (leads) | n/a | 3% | 9% (Flash)

Sources: Artificial Analysis, LushBinary, Meta AI blog. Competitor figures are approximate ranges from independent sources. All benchmarks reflect April 2026 evaluations.

Muse Spark is the second-most capable vision model we have benchmarked. Agentic performance does not stand out; it scores 1427 on GDPval-AA, behind Claude Sonnet 4.6 and GPT-5.4, but ahead of Gemini 3.1 Pro Preview at 1320.
Artificial Analysis — Independent AI benchmarking, April 7, 2026

The overall pattern is consistent across sources. The New York Times noted that Muse Spark “performed better than Meta’s previous AI models but lags rivals on coding ability.” That framing is accurate as far as it goes, though it undersells the multimodal and health performance story.

Where Muse Spark genuinely leads

Visual reasoning and multimodal understanding

This is the clearest competitive advantage. On CharXiv, a benchmark for reading charts, figures, and scientific diagrams, Muse Spark scores 86.4. GPT-5.4 comes in at 82.8, Gemini at 80.2, Claude Opus at around 80. That’s a meaningful lead, not a rounding error. For any workflow that involves parsing research papers, analyzing dashboards, extracting data from medical imaging reports, or reading technical schematics, Muse Spark has a real edge right now.

On MMMU-Pro, which tests broader multimodal understanding across academic disciplines, Muse Spark scores 80.5%, just behind Gemini 3.1 Pro’s 82.4%, ahead of GPT and Claude. Artificial Analysis labeled it the second-most capable vision model they’ve evaluated, which tracks with these numbers.

The key word is “natively.” Because multimodal processing is built into the architecture rather than added as a separate module, the model handles complex visual inputs with less prompt engineering overhead. Developers building visual Q&A systems, document parsing pipelines, or science-adjacent applications will find this integration practically useful, not just benchmark-impressive.
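
Muse Spark's API is still in private preview and Meta hasn't published a request spec, so the sketch below only illustrates the shape of a native image-plus-text call. The endpoint URL, model id, payload fields, and response field are all hypothetical placeholders, not real Meta endpoints.

```python
import base64
import requests  # assumes the requests package is installed

# Hypothetical request shape for a natively multimodal chat endpoint.
# Meta has not published Muse Spark's API spec; everything named here
# is a placeholder for illustration.
API_URL = "https://api.example.com/v1/chat"  # placeholder endpoint

def ask_about_chart(image_path: str, question: str) -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    payload = {
        "model": "muse-spark",  # placeholder model id
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image", "data": image_b64},
                {"type": "text", "text": question},
            ],
        }],
    }
    resp = requests.post(API_URL, json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()["output_text"]  # placeholder response field
```

Usage would look like ask_about_chart("q3_revenue.png", "Which quarter shows the steepest decline?"), with the image travelling in the same message as the question rather than through a separate vision pipeline.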

Health reasoning

Muse Spark leads HealthBench Hard with a score of 42.8, outperforming all major competitors on this evaluation. Meta has explicitly positioned health as a priority, noting that health questions represent one of the top reasons people turn to AI assistants. The benchmark performance backs this up.

Important caveat for builders: HealthBench Hard measures question-answering accuracy, not clinical safety. Deploying Muse Spark in contexts that inform real medical decisions requires regulatory compliance, validation against clinical standards, and guardrails that are entirely beyond what any benchmark measures. The score tells you the model is good at health Q&A. It doesn’t tell you it’s ready for a clinical workflow without substantial additional work.
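
To make "guardrails" concrete at the code level, here is a minimal sketch of the kind of layer that sits between raw model output and users. The patterns and messages are illustrative placeholders; a real deployment needs validated classifiers, clinical escalation paths, and regulatory review, none of which a keyword check provides.

```python
import re

# Minimal illustration of a guardrail layer between model output and
# users. The regex and messages are placeholders, not a clinical
# safety system.
EMERGENCY = re.compile(
    r"\b(chest pain|overdose|suicid\w*|can't breathe|stroke)\b", re.I
)

def guard_health_answer(user_query: str, model_answer: str) -> str:
    if EMERGENCY.search(user_query):
        # Escalate instead of answering: benchmarks don't test this path.
        return ("This may be a medical emergency. Please contact local "
                "emergency services or a clinician immediately.")
    return model_answer + "\n\nInformational only; not medical advice."
```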

Token efficiency

The token efficiency picture is one of the most practically significant findings from independent evaluations. At 58 million output tokens to complete the Intelligence Index, less than half of Claude Opus 4.6’s 157 million, and less than half of GPT-5.4’s 120 million, Muse Spark offers a materially different cost profile at scale. If you’re running millions of reasoning queries per day, this number translates directly into infrastructure budgets.
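
To make the cost implication concrete, here's a back-of-envelope comparison that scales the Index-run totals down to per-query averages and applies a flat, assumed price. Meta hasn't disclosed Muse Spark pricing, and real per-token prices differ across vendors, so treat the dollar figures as illustrating the ratio, not the bill.

```python
# Back-of-envelope output-token costs at scale. Per-query averages come
# from dividing each model's Intelligence Index run (58M / 120M / 157M
# tokens) by a nominal 100,000 queries; the flat $10 per million output
# tokens is an illustrative assumption, not real pricing.
QUERIES_PER_DAY = 1_000_000
AVG_OUTPUT_TOKENS = {          # same ratios as the Index run
    "Muse Spark": 580,
    "GPT-5.4": 1_200,
    "Claude Opus 4.6": 1_570,
}
PRICE_PER_M_TOKENS = 10.00     # USD per million output tokens (assumed)

for model, tokens in AVG_OUTPUT_TOKENS.items():
    monthly_usd = QUERIES_PER_DAY * 30 * tokens / 1e6 * PRICE_PER_M_TOKENS
    print(f"{model:>16}: ${monthly_usd:,.0f}/month in output tokens")
```

At identical per-token prices, the verbosity gap alone nearly triples the monthly bill between Muse Spark and Claude Opus; real pricing differences would shift, but not erase, that spread.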

A 5-point gap from the leaders on raw intelligence is meaningful but not insurmountable, especially given Muse Spark’s strong cost-efficiency profile.

LushBinary benchmark analysis, April 2026

Domain-specific reasoning

On τ²-Bench Telecom, Muse Spark scores 92%, placing it among the highest-performing models on telecom-domain reasoning tasks. On CritPT, a hard physics benchmark where every model scores in single or low double digits, Muse Spark reaches 11% against Claude’s 3% and Gemini Flash’s 9%. These numbers are low in absolute terms because the tasks are genuinely hard, but the relative gaps suggest Muse Spark carries an advantage on scientific reasoning that may generalize to other technical domains.

Where it falls short, and why that matters

Coding and software engineering

This is the cleanest weakness in the profile. On TerminalBench Hard, a benchmark that evaluates models on real coding tasks interacting with a terminal environment, Muse Spark trails Claude Sonnet 4.6, GPT-5.4, and Gemini 3.1 Pro. Claude’s performance on SWE-bench Verified, the standard benchmark for software engineering tasks, sits at 80.8%. GPT-5.4 scores 75.1 on Terminal-Bench 2.0. Muse Spark’s specific score hasn’t been consistently reported, but the direction is clear across sources.

For teams building coding copilots, automated code review pipelines, or software engineering agents, this isn’t a minor limitation. The gap is large enough that defaulting to Claude or GPT-5.x for these use cases is the rational choice, not a matter of preference. Muse Spark’s test-time scaling advantage through multi-agent Contemplating mode may close this gap on complex reasoning-heavy coding problems, but on general software engineering tasks, it’s behind today.

Agentic and multi-step work

On GDPval-AA, a benchmark designed to evaluate models on real-world, multi-step office workflows, Muse Spark scores 1427, against GPT-5.4’s 1676 and Claude Sonnet 4.6’s 1648. It beats Gemini 3.1 Pro Preview at 1320, but the gap with the top performers is significant. For anyone building long-running agents that need to orchestrate multi-step workflows (research automation, enterprise task execution, complex data pipelines), the top two choices are still GPT and Claude.

The irony here is partially structural: Muse Spark’s own Contemplating mode uses multi-agent orchestration. But that architecture is optimized for single complex queries, not for sustained multi-step task execution of the kind GDPval-AA is testing.

Closed-source means lock-in

For organizations that have built their AI strategies partly around open-source models (using Llama as a foundation, running fine-tuned versions on their own infrastructure, controlling data flows and model behavior), Muse Spark’s closed-source design is a structural problem. You can’t fine-tune it, you can’t self-host it, and you’re entirely dependent on Meta’s API access decisions. Meta has said it hopes to open-source future versions, but “hopes to” is not a roadmap commitment.

This is a legitimate concern for enterprises in regulated sectors, for research institutions with data governance requirements, and for any team that has learned to be cautious about single-vendor dependencies. The developer community that embraced Llama explicitly because it was open now faces a different proposition.

Decision framework: who should actually use this

Choose Muse Spark when

Your workloads are vision-heavy or health-adjacent

  • Parsing charts, figures, scientific diagrams
  • Health Q&A at scale (with appropriate guardrails)
  • Document intelligence on mixed text-image content
  • Cost-sensitive high-volume reasoning inference
  • Deep integration with Meta’s social surfaces

Stick with GPT-5.x or Claude when

Coding quality and agentic execution are the priority

  • Software engineering copilots and code review
  • Long-running multi-step agent pipelines
  • Enterprise stacks needing mature governance tooling
  • Open-source flexibility and fine-tuning requirements
  • Mission-critical agentic workflow execution

Choose Gemini when

Google Workspace integration and search grounding matter

  • Tight integration with Google Cloud or Workspace
  • Top-tier MMMU-Pro multimodal score (82.4%)
  • Factual grounding through Google Search
  • Token efficiency matching Muse Spark’s profile

The key principle for CTOs making this call: model selection should follow workload composition, not brand affinity. A team with 70% of their AI usage in visual document parsing and 30% in code generation probably wants Muse Spark for the former and Claude for the latter. Running a single model for everything because it simplifies billing isn’t a good enough reason to accept a material performance gap in either direction.
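
A minimal sketch of what workload-based routing looks like in practice, assuming an upstream classifier already labels incoming tasks. The model ids and the routing table are illustrative choices that mirror the benchmark picture above, not vendor recommendations.

```python
# Route by workload, not by brand. Task labels are assumed to come from
# an upstream classifier; model ids are illustrative placeholders.
ROUTES = {
    "visual_parsing": "muse-spark",      # CharXiv / MMMU-Pro strength
    "health_qa": "muse-spark",           # HealthBench lead, plus guardrails
    "coding": "claude-sonnet-4.6",       # SWE-bench Verified leader
    "agentic_workflow": "gpt-5.4",       # top GDPval-AA score
}
DEFAULT_MODEL = "gpt-5.4"

def pick_model(task_type: str) -> str:
    # Unlabeled or novel tasks fall back to a general-purpose default.
    return ROUTES.get(task_type, DEFAULT_MODEL)
```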

Strategic implications for different stakeholders

For ML engineers and developers

The practical question right now is whether you’re on the API waitlist. Muse Spark is in private API preview for select partners, and broader developer access isn’t confirmed on a timeline yet. That matters for planning: you can evaluate the model’s benchmark profile today, but you can’t build production systems against it unless you’re in the preview cohort.

For teams that do get access, the architecture is worth understanding before you deploy. Contemplating mode’s multi-agent design means per-query costs won’t scale linearly the way they do with a simpler inference call. Building Contemplating mode into a high-frequency pipeline without understanding the token and latency characteristics first is a straightforward way to blow past cost budgets.
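
As a sketch of that cost control, here's one way to gate the expensive mode behind a crude difficulty heuristic and a daily budget. The mode names mirror the product; the hint list, cap, and counter are assumptions for illustration, not Meta guidance.

```python
# Gate Contemplating mode behind a difficulty heuristic and a daily cap
# so multi-agent queries don't dominate spend. All thresholds here are
# illustrative assumptions.
HARD_QUERY_HINTS = ("prove", "derive", "step by step", "diagnose", "why does")
CONTEMPLATING_DAILY_CAP = 10_000

_contemplating_used = 0  # reset by a daily scheduler in a real system

def choose_mode(query: str) -> str:
    global _contemplating_used
    looks_hard = any(hint in query.lower() for hint in HARD_QUERY_HINTS)
    if looks_hard and _contemplating_used < CONTEMPLATING_DAILY_CAP:
        _contemplating_used += 1
        return "contemplating"
    return "instant"
```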

For CTOs and CIOs

The most significant strategic signal from this launch isn’t Muse Spark’s specific benchmark scores. It’s the closed-source pivot. Meta is now building a proprietary frontier model alongside Llama, not instead of it. That gives Meta two distinct competitive levers: an open-source community play through Llama and a proprietary capability play through Muse Spark. Watching how the two coexist over the next 12–18 months will tell you a lot about where Meta thinks the commercial value actually is.

For CTO-level vendor strategy decisions, the practical implication is straightforward: Muse Spark is worth a pilot on visual and health workloads, but not worth treating as a primary strategic dependency until API access is broadly available, pricing is disclosed, and there’s at least 6-12 months of production usage data from early adopters.

For VCs and investors

Meta’s $14.3 billion Scale AI investment, combined with the Superintelligence Labs structure and Alexandr Wang’s leadership, signals a serious long-term capital commitment to personal AI at social scale. The model’s consumer deployment, rolling out across WhatsApp, Instagram, Facebook, and glasses, gives Meta an inference volume that no other frontier lab can match. That volume creates a data flywheel that other closed-source model providers don’t have access to. The strategic moat here isn’t the model itself. It’s the distribution.

For investors evaluating AI infrastructure plays, this matters because Meta is essentially running a 24/7 real-world evaluation of Muse Spark at consumer scale. The feedback signal from billions of interactions on social surfaces will compound over time in ways that benchmark suites can’t capture.

For policy makers and regulators

The health positioning and the multimodal surveillance surface are the two things worth watching most carefully here. A model that leads HealthBench Hard and rolls out across WhatsApp and Meta glasses is, in practice, a health advisory system at population scale. The benchmark performance doesn’t resolve questions about misinformation risk, appropriate medical advice boundaries, or liability when the model gets something wrong in a health context.

The multimodal perception capability combined with glasses deployment creates a different kind of regulatory surface, one that involves real-time visual data processing in the physical world. These aren’t hypothetical concerns. They’re the precise scenarios that existing AI safety frameworks were designed for, and Muse Spark’s deployment timeline moves faster than most regulatory processes can currently track.

How to access Muse Spark today

The simplest answer: use the Meta AI app or meta.ai. Muse Spark powers both right now. You can access Instant mode for quick queries and Contemplating mode for harder questions that benefit from the multi-agent reasoning pipeline.

For API access, the model is in private preview. Meta has indicated that broader enterprise and developer API access will come, but no specific timeline or pricing has been announced. If your organization has an existing Meta partnership or is part of Meta’s developer ecosystem, it’s worth checking whether you qualify for preview access. For everyone else, the path is to watch Meta’s developer blog and the Meta AI technical blog for access announcements.

The model will roll out to WhatsApp, Instagram, Facebook, Messenger, and Ray-Ban Meta glasses in the coming weeks. For most consumer-facing applications, that’s where exposure will initially come from rather than direct API integration.


Frequently asked questions

What is Meta Muse Spark?

Muse Spark is Meta’s first model from Meta Superintelligence Labs, announced on April 7, 2026. It’s a natively multimodal, closed-source frontier model designed to be small, fast, and capable at reasoning tasks, particularly those involving images, charts, health information, and scientific content. It powers Meta AI on the Meta AI app and meta.ai, with rollout to WhatsApp, Instagram, Facebook, Messenger, and Ray-Ban glasses coming in the following weeks.

How does Muse Spark compare to GPT-5.4, Claude, and Gemini on benchmarks?

On Artificial Analysis’s AI Intelligence Index, Muse Spark scores 52 versus GPT-5.4 and Claude Opus 4.6 at around 57–58 and Gemini 3.1 Pro at 54–55. Muse Spark leads on visual reasoning (CharXiv: 86.4 vs GPT-5.4’s 82.8) and HealthBench Hard (42.8, best in class). It trails on coding (TerminalBench Hard) and complex multi-step agentic tasks (GDPval-AA: 1427 vs GPT-5.4’s 1676). Token efficiency is a standout: 58 million output tokens on the Intelligence Index versus Claude’s 157 million.

Is Muse Spark open-source?

No. Unlike Meta’s Llama models, Muse Spark is closed-source and proprietary. Meta has stated it hopes to open-source future versions, but there’s no confirmed timeline. This is a significant departure from Meta’s previous AI strategy and has direct implications for organizations that relied on Llama’s open-source nature for fine-tuning, self-hosting, or data governance reasons.

How can I access Muse Spark right now?

Consumer access is available now through the Meta AI app and meta.ai. API access is in private preview for select Meta partners, with broader developer access not yet announced. The model will also roll out across WhatsApp, Instagram, Facebook, Messenger, and Ray-Ban Meta glasses in the coming weeks. No pricing for API access has been disclosed.

Is Muse Spark a good choice for coding?

Not as a primary coding model. Multiple independent evaluations confirm that Muse Spark trails Claude Sonnet 4.6 and GPT-5.4 on coding benchmarks including TerminalBench Hard. For software engineering copilots, automated code review, or complex software agent workflows, Claude (which leads SWE-bench Verified at 80.8%) or GPT-5.4 are the stronger current choices. Muse Spark may close this gap over time, but as of April 2026 the coding weakness is clear and consistent across sources.

How is Muse Spark different from Llama?

Several things fundamentally distinguish them. Muse Spark is closed-source; Llama is open-source. Muse Spark is natively multimodal from the ground up; Llama’s vision capabilities have been added incrementally. Muse Spark uses a multi-agent Contemplating mode for hard reasoning tasks; standard Llama deployments don’t have this architecture. And Muse Spark comes from an entirely new organizational unit, Meta Superintelligence Labs, while Llama continues under the existing Meta AI research line.

What is Contemplating mode?

Contemplating mode is Muse Spark’s test-time scaling approach. Rather than running a single large inference pass, it spins up multiple parallel subagents that collaborate to solve hard problems, spending more compute at inference time without making the base model larger. Meta describes this as “thought compression.” The Instant mode is a direct, fast response for simpler queries; Contemplating mode activates the multi-agent pipeline for complex reasoning tasks. Developers should account for higher per-query costs in Contemplating mode compared to Instant mode.

How good is Muse Spark at health questions?

It performs better than competitors on HealthBench Hard (scoring 42.8), which measures health question-answering accuracy. But benchmark performance and clinical safety are different things. Deploying Muse Spark in applications that inform real medical decisions requires regulatory compliance, clinical validation, and guardrails well beyond what any benchmark measures. Policy observers have already flagged concerns about health AI at social scale without adequate safety infrastructure.

Is there enterprise or developer API access?

Meta has confirmed that API access is available in private preview for select partners, with broader access expected in the future. No pricing, SLAs, or specific enterprise contract terms have been disclosed. Organizations planning integrations should monitor Meta’s developer channels for access announcements and factor in the current access limitations when building 2026 AI roadmaps.

What are Muse Spark’s main limitations for enterprises?

Four primary limitations matter for enterprise decision-making: (1) coding performance trails Claude and GPT-5.4, making it unsuitable as a primary development tool; (2) agentic task execution on GDPval-AA is behind the top two competitors; (3) closed-source design eliminates fine-tuning, self-hosting, and some data governance options; (4) API access is still in private preview with no disclosed pricing or SLAs. For regulated industries, the health deployment at consumer scale also raises compliance and liability questions that enterprises will need to address before adopting.

The bottom line

Muse Spark is a genuinely capable model in a specific and well-defined set of domains. The vision reasoning story is real: CharXiv at 86.4, MMMU-Pro near the top of the pack, HealthBench Hard leading the field. The token efficiency picture is also real and practically significant for anyone running reasoning tasks at scale. This isn’t hype padding. Independent benchmarkers at Artificial Analysis and LushBinary measured it, and the numbers hold up.

The coding and agentic weaknesses are equally real, and equally well-documented. If your primary use case involves writing or reviewing software, or running complex multi-step workflows through an AI agent, Muse Spark isn’t the right tool today. That may change: Meta’s investment trajectory and the “thought compression” scaling philosophy suggest a serious long-term R&D commitment. For now, though, the gap is the reality.

The closed-source pivot is probably the most strategically significant aspect of this launch, and it’s gotten less attention than the benchmark numbers. Meta is building a proprietary frontier model for the first time. Whether that ends up being a long-term strategic direction or a temporary posture for the flagship line will shape the competitive dynamics of the model market over the next 2-3 years. Watch for: broader API availability and pricing transparency (likely Q3 2026), Llama’s path forward now that Muse Spark holds the flagship position, and whether any of the health regulatory scrutiny around large-scale AI deployments on social platforms gains legislative traction in the EU or US in 2026.

For your own organization: if you work with visual data, scientific documents, or health content at scale, put Muse Spark in your evaluation queue now and request API preview access. If your stack is primarily about code and software agents, focus your attention elsewhere for the time being. And if you’re a policymaker or regulator, the combination of health positioning and imminent deployment across billions of WhatsApp and Instagram users probably warrants a closer look than a typical model launch would require.

For ongoing frontier model coverage, benchmarks, and weekly AI intelligence, follow NeuralWired, and share this piece with someone who needs the unvarnished picture.


Sources & further reading

Disclaimer: This article is based on publicly available benchmark data, independent evaluations, and media coverage as of April 9, 2026. Benchmark scores for competitor models are approximate ranges drawn from independent third-party sources. All figures should be treated as indicative rather than definitive, as evaluation methodologies and model versions vary. NeuralWired has no commercial relationship with Meta, Anthropic, OpenAI, or Google. Nothing in this article constitutes investment, legal, or clinical advice.
