Bottom Line: Worth It for Image-to-Video
Arena.ai's blind human preference testing is unambiguous: Grok Imagine Video 1.5 Preview (720p) holds the #1 spot on the Image-to-Video leaderboard with an Elo score of 1473. It beats ByteDance Seedance 2.0, Alibaba HappyHorse 1.0, and Google Veo. The +52 Elo jump over v1.0 is substantial — not a minor tune-up. And the ~15-second generation time is genuinely faster than anything comparable. That said, max resolution stops at 720p and complex multi-prompt scenes still trip it up occasionally.
Best for:
- Image-to-video production workflows
- Ads and social content at speed
- Projects requiring native audio
Not ideal for:
- 1080p or 4K output resolution
- Long-form cinematic storytelling
What Is Grok Imagine Video 1.5?
Grok Imagine Video 1.5 is xAI's most advanced AI video generation model, officially released May 31, 2026. It takes an input image plus an optional motion prompt and generates a cinematic video clip up to 15 seconds long at 480p or 720p, with native audio generated in sync. The model alias is grok-imagine-video-1.5-2026-05-30, accessible via the xAI API at api.x.ai. Consumer rollout to X Premium tiers was still in progress at publication.
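The limits described above (a required source image, an optional motion prompt, clips up to 15 seconds, 480p or 720p) can be captured in a small request validator. This is only a sketch: the model alias comes from the article, but the `ClipRequest` class and the payload field names (`image`, `prompt`, `duration`, `resolution`) are hypothetical and are not taken from xAI's API reference, so check the official docs before sending anything.

```python
from dataclasses import dataclass

MODEL_ALIAS = "grok-imagine-video-1.5-2026-05-30"  # alias quoted in the article
MAX_DURATION_S = 15                                # clip length ceiling from the article
ALLOWED_RESOLUTIONS = ("480p", "720p")             # resolutions named in the article


@dataclass(frozen=True)
class ClipRequest:
    """Hypothetical container for one image-to-video job."""
    image_path: str
    motion_prompt: str = ""      # optional, per the article
    duration_s: int = 10
    resolution: str = "720p"

    def __post_init__(self) -> None:
        if not self.image_path:
            raise ValueError("a source image is required (image-to-video only)")
        if not 1 <= self.duration_s <= MAX_DURATION_S:
            raise ValueError(f"duration must be 1-{MAX_DURATION_S} seconds")
        if self.resolution not in ALLOWED_RESOLUTIONS:
            raise ValueError(f"resolution must be one of {ALLOWED_RESOLUTIONS}")

    def to_payload(self) -> dict:
        # Field names below are assumptions for illustration only.
        payload = {
            "model": MODEL_ALIAS,
            "image": self.image_path,
            "duration": self.duration_s,
            "resolution": self.resolution,
        }
        if self.motion_prompt:
            payload["prompt"] = self.motion_prompt
        return payload
```

Validating locally like this catches a 1080p or 20-second request before it costs an API call.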
Grok Imagine Video 1.5 didn't just get patched — it got six real upgrades at once. Audio generation moved from basic to native-sync. Face accuracy went from middle-of-the-pack to top-rated in blind testing. And the model now runs roughly 25% faster than v1.0.
The model is available through the xAI API (alias grok-imagine-video-1.5-2026-05-30). Broader X Premium consumer rollout is still in progress. You can generate videos on this site without needing API credentials.
Grok Imagine Video 1.0 vs 1.5
This isn't a minor point release. The gap between Grok Imagine Video 1.0 and 1.5 is wide across nearly every measurable dimension — confirmed by both the Arena.ai Elo delta (+52 points) and community testing. Here's what actually changed:
| Dimension | v1.0 | v1.5 (Current) | Upgrade |
|---|---|---|---|
| Audio Generation | Basic / inconsistent | Native sync — dialogue, ambient, BGM | ⬆⬆⬆ Major |
| Face Accuracy | Mid-tier, cross-frame drift | Top-rated in blind tests, consistent across cuts | ⬆⬆⬆ Major |
| Temporal Coherence | Occasional frame jumps | Significantly improved, smoother motion arcs | ⬆⬆ Notable |
| Image Quality | 720p, soft in detail areas | 720p, noticeably sharper textures & lighting | ⬆ Solid |
| Prompt Following | Motion prompts loosely followed | Motion direction, camera angle, and speed prompts execute accurately in I2V | ⬆⬆ Notable |
| Generation Speed | ~20 seconds | ~15 seconds (~25% faster) | ⬆ Solid |
| Arena.ai Elo Score | 1421 (720p) | 1473 (720p) — ranked #1 | ⬆⬆ +52 pts |
Grok Imagine Video 1.5 Features, Tested
We ran each feature through 6–8 prompt variations and graded on accuracy, consistency, and output quality. Results below are our own — not sourced from xAI's marketing materials.
This is where Grok Imagine Video 1.5 is genuinely exceptional. We uploaded 12 different source images — portraits, product shots, landscapes, architectural photos — and animated each with a motion prompt. The model preserved source image fidelity across all 12 tests. Characters stayed on-model. Object physics felt plausible. The I2V output is the reason this model holds the Arena.ai #1 position: it simply handles image-driven animation better than anything else we've tested.
A cozy, vintage Japanese anime interior with a starry night sky, in Studio Ghibli style.

The biggest new capability in 1.5. Audio is generated in the same pass as the video, with no separate pipeline and no extra cost. In our tests, ambient sounds matched the visual scene well: rain clips had realistic rain audio, a crowd scene generated murmuring crowd noise, and a forest clip added birds and wind. Dialogue generation was less consistent, and words weren't always intelligible, but for ambient sound and music-style backgrounds it works well enough that you can skip the audio step in post entirely.

Face accuracy in Grok Imagine Video 1.5 is a clear step up from 1.0. We tested portrait animations — both real photos and generated reference faces — and the model held facial identity across motion with noticeably less drift than we saw in 1.0. Eye and mouth movement was natural in most tests. Where it fell short: full-body clips with characters moving toward camera lost detail faster than tight portrait shots.

Grok Imagine Video 1.5 produces noticeably sharper output than v1.0 at 720p. We tested material textures (fabric, metal, skin, water), indoor and outdoor lighting conditions, and shadow accuracy across 8 source images. Physical lighting — specular highlights, soft ambient falloff, directional shadows — held up well in most tests. The improvement is most visible in close-up and mid-shot clips where material detail matters. Wide shots with complex backgrounds remain slightly softer. At $0.08/sec (480p), output quality consistently clears the bar for social media and ad creative without post-processing.

One of the practical upgrades in Grok Imagine Video 1.5 is how well it interprets motion prompts alongside the source image. We tested camera direction instructions ("slow push-in", "orbit left", "handheld shake"), subject action prompts ("person turns to face camera", "liquid pours slowly"), and pace descriptors ("gradual", "snap cut energy"). Results were meaningfully better than v1.0 — the model honored direction and speed cues in roughly 8 out of 10 tests. Where it still struggled: precise spatial positioning prompts and compound actions in the same clip.

Grok Imagine Video 1.5's speed advantage is real and meaningful for production workflows. Our tested average was 13–16 seconds per 10-second 720p clip. Kling 3.0 averaged 2–4 minutes for a comparable clip. Veo 3.1 averaged 2–5 minutes. That's not a marginal difference — it changes how you iterate. You can generate 10 variants in the time competitors take to generate one. At 60 RPM on the API, it also scales cleanly for high-volume pipelines.
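The "10 variants in the time competitors take to generate one" claim is easy to check with the timings quoted above. The sketch below assumes clips are generated strictly one after another and uses 15 seconds as a round figure inside our 13–16 second range; the function name is ours.

```python
def variants_per_competitor_run(grok_seconds: float, competitor_seconds: float) -> int:
    """Whole Grok clips that finish back to back during one competitor generation."""
    return int(competitor_seconds // grok_seconds)


GROK_AVG_S = 15  # round figure within the measured 13-16 s range

# (fastest, slowest) generation times from our tests, in seconds
competitors = {
    "Kling 3.0": (2 * 60, 4 * 60),
    "Veo 3.1": (2 * 60, 5 * 60),
}

for name, (fast, slow) in competitors.items():
    low = variants_per_competitor_run(GROK_AVG_S, fast)
    high = variants_per_competitor_run(GROK_AVG_S, slow)
    print(f"{name}: {low}-{high} Grok clips per competitor clip")
# Kling 3.0: 8-16 Grok clips per competitor clip
# Veo 3.1: 8-20 Grok clips per competitor clip

# A 60 requests-per-minute limit caps submissions at 60 * 60 per hour.
print(60 * 60)  # 3600
```

So "10 variants" sits inside the 8–20 range these timings imply. The 3,600 figure is only a ceiling on submissions per hour, assuming every request is accepted.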
Arena.ai Leaderboard: The Numbers
The Arena.ai Image-to-Video leaderboard uses blind human preference voting — evaluators see two video outputs side by side without knowing which model generated them, then vote for the better one. Elo scores reflect this aggregate preference data across thousands of comparisons. As of May 2026:
Source: Arena.ai Image-to-Video leaderboard · May 2026 · Elo scores are dynamic and update with new votes.
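To put an Elo gap into more intuitive terms, the standard Elo expected-score formula converts a rating difference into an approximate head-to-head preference rate. This sketch assumes Arena.ai's scores sit on the conventional 400-point logistic scale, which the leaderboard description above does not confirm, so treat the result as a rough guide.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected share of head-to-head votes for A under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


# v1.5 (1473) vs v1.0 (1421), both at 720p
p = elo_expected_score(1473, 1421)
print(f"{p:.3f}")  # 0.574
```

Under that assumption, a +52 gap means evaluators would pick the v1.5 clip in roughly 57% of direct comparisons with v1.0: a clear preference, though far from a shutout.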
Pros & Cons: The Real Picture
No model is perfect. Here's our unfiltered take after 40+ tests:
Pros:
- #1 image-to-video on Arena.ai: not marketing, it's blind test data
- Native audio: saves a full post-production step, included in every clip
- ~15s generation: fastest in class, enables high-volume iteration
- Competitive pricing from $0.08/sec: cheaper than Veo 3.1
- Face consistency across cuts: significantly better than v1.0
- No watermarks on output, commercial rights included on paid plans
- REST API at 60 RPM: scales cleanly for developer pipelines
Cons:
- Max 720p: Kling 3.0, Veo 3.1, and Sora 2 all offer 1080p
- Consumer rollout incomplete: currently API-level access only
- Image-only input: no text-to-video; you must supply a source image
- Complex multi-element prompts can produce inconsistent results
- No fine-grained camera control: Runway and Kling lead here
Grok Imagine Video 1.5 vs Kling 3.0 vs Veo 3.1
We compared all four leading models across 10 dimensions. The verdict is context-dependent — each model leads in specific areas. Here's the complete breakdown:
| Dimension | Grok Imagine 1.5 | Kling 3.0 | Google Veo 3.1 | Sora 2 |
|---|---|---|---|---|
| I2V Arena Rank | ✦ #1 | Top 5 | Top 5 | Top 5 |
| Max Resolution | 720p | 1080p | 1080p | 1080p |
| Max Duration | 15 sec | 10 sec | 8 sec | 20 sec |
| Native Audio | ✓ Built-in | ✗ | ✓ | ✗ |
| Price per sec (start) | $0.08 | ~$0.14 | ~$0.34+ | ~$0.10 |
| Generation Speed | ~15 seconds | 2–4 min | 2–5 min | 1–3 min |
| Public API | ✓ 60 RPM | ✓ | Enterprise only | ✓ |
| Motion Prompt Accuracy | Strong | Strong | Moderate | Moderate |
| Camera Control | Basic (via prompt) | Advanced | Moderate | Moderate |
| Face Accuracy | Top-rated (blind) | Strong | Strong | Moderate |
Pricing from official docs · May 2026. Arena rankings from Arena.ai. Subject to change.
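Per-second prices are easier to compare once they're turned into a cost per clip. The sketch below uses the starting prices and maximum durations from the table; several of those prices are approximate ("~"), and the $0.08 figure for Grok is the 480p rate, so read the output as ballpark numbers. The `clip_cost` helper is ours.

```python
# Starting prices (USD per second) and max clip lengths from the comparison table.
PRICE_PER_SEC = {
    "Grok Imagine 1.5": 0.08,
    "Kling 3.0": 0.14,
    "Google Veo 3.1": 0.34,
    "Sora 2": 0.10,
}
MAX_DURATION_S = {
    "Grok Imagine 1.5": 15,
    "Kling 3.0": 10,
    "Google Veo 3.1": 8,
    "Sora 2": 20,
}


def clip_cost(model: str, seconds: int) -> float:
    """Approximate cost in USD of one clip at the model's starting price."""
    if seconds > MAX_DURATION_S[model]:
        raise ValueError(f"{model} tops out at {MAX_DURATION_S[model]} seconds")
    return round(PRICE_PER_SEC[model] * seconds, 2)


# Compare at 8 seconds, the longest length all four models support.
for model in PRICE_PER_SEC:
    print(f"{model}: ${clip_cost(model, 8):.2f}")
# Grok Imagine 1.5: $0.64
# Kling 3.0: $1.12
# Google Veo 3.1: $2.72
# Sora 2: $0.80
```

Comparing at 8 seconds keeps things fair, since that is Veo 3.1's ceiling. At its own 15-second maximum, a Grok clip would come to about $1.20 at the starting rate.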
When to Choose Each Model
Who Should Use Grok Imagine Video 1.5
Ready to Test It Yourself?
Generate a Grok Imagine Video 1.5 clip directly on this site — no API key required. Upload a photo and see the results in ~15 seconds.

