Benchmarks
Numbers that back the claims
SocialVideo Bench results, inference speed, and cost metrics, all sourced from the official technical report.
Single H100 GPU
First frame latency
Continuous generation
Generation cost
Output quality in practice
Benchmark scores tell only part of the story; this is what state-of-the-art output looks like in motion. A real MaineCoon generation with synchronized audio (unmute to verify lip sync).
SocialVideo Bench — Overall Score
Catnip's benchmark for social-interaction video, covering 7 scenarios and 9 evaluation metrics. On overall score, MaineCoon surpasses all 7 compared models.
7 Scenarios
- Dense speech
- Two-person interaction
- Musical performance
- Emotional acting
- Dance
- Creative challenges
- Social memes
9 Metrics
- Visual quality
- Motion quality
- Audio quality
- Audio-visual alignment
- Overall quality
- Temporal consistency
- Character consistency
- Lip sync accuracy
- Emotional expressiveness
Speed Comparison
| Model | FPS | Notes |
|---|---|---|
| MaineCoon (22B) | 47.5 | Single H100 |
| MaineCoon (22B) | 30+ | RTX Pro 6000 |
| Streaming audio-visual peers | 6–7 | Typical range |
| 1.3B streaming video | 19.1 | MaineCoon is about 2.5× faster with roughly 17× the parameters |
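
The ratios in the last row follow directly from the figures in the table. As a quick check, here is a minimal Python sketch of that arithmetic; the variable names are illustrative and the inputs are only the numbers listed above.

```python
# Figures copied from the speed comparison table.
mainecoon_fps = 47.5         # MaineCoon (22B), single H100
small_model_fps = 19.1       # 1.3B streaming video model
mainecoon_params_b = 22.0    # parameters, in billions
small_model_params_b = 1.3   # parameters, in billions

# How much faster MaineCoon runs, and how much larger it is.
speedup = mainecoon_fps / small_model_fps
param_ratio = mainecoon_params_b / small_model_params_b

print(f"Speedup: {speedup:.2f}x")              # Speedup: 2.49x
print(f"Parameter ratio: {param_ratio:.1f}x")  # Parameter ratio: 16.9x
```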
What is SocialVideo Bench?
A benchmark created by Catnip specifically for social-interaction video generation. It evaluates models across 7 social scenarios and 9 quality metrics including visual quality, motion, audio, alignment, and consistency.
How was the 47.5 FPS measured?
It was measured on a single NVIDIA H100 GPU during streaming inference. The RTX Pro 6000 achieves 30+ FPS, which is sufficient for real-time playback at standard frame rates.
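To make the real-time claim concrete, generation keeps up with playback when its throughput is at least the playback frame rate. The sketch below converts FPS into a per-frame time budget and runs that comparison. The 24 and 30 FPS playback targets are assumptions (common conventions, not figures from the report), and "30+" is treated as its lower bound of 30.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


def is_real_time(generation_fps: float, playback_fps: float) -> bool:
    """Generation keeps up when it produces frames at least as fast as playback."""
    return generation_fps >= playback_fps


# Reported throughput; the RTX Pro 6000 value is the lower bound of "30+".
reported = {"H100": 47.5, "RTX Pro 6000": 30.0}

# Assumed playback targets (not stated in the source).
playback_targets = [24.0, 30.0]

for gpu, fps in reported.items():
    for target in playback_targets:
        status = "keeps up" if is_real_time(fps, target) else "falls behind"
        print(f"{gpu}: {frame_budget_ms(fps):.1f} ms/frame, {status} at {target:.0f} FPS")
```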
Can I reproduce these benchmarks?
The technical report on arXiv contains methodology details. Model weights and code are on Hugging Face and GitHub.
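For a rough local sanity check of first-frame latency and sustained FPS, a generic timing harness along the following lines may help. This is a sketch under stated assumptions, not the report's methodology: `measure_stream` and `fake_stream` are hypothetical names, and `fake_stream` is a stand-in that sleeps instead of calling the real model.

```python
import time
from typing import Iterator, Tuple


def measure_stream(frame_stream: Iterator[object], measured_frames: int = 50) -> Tuple[float, float]:
    """Return (first_frame_latency_seconds, sustained_fps) for a frame iterator."""
    start = time.perf_counter()
    next(frame_stream)  # time until the first frame arrives
    first_frame_latency = time.perf_counter() - start

    # Time the following frames to estimate sustained throughput.
    steady_start = time.perf_counter()
    for _ in range(measured_frames):
        next(frame_stream)
    elapsed = time.perf_counter() - steady_start
    return first_frame_latency, measured_frames / elapsed


def fake_stream(delay_s: float = 0.001) -> Iterator[bytes]:
    """Hypothetical stand-in for a model's streaming output; replace with the real generator."""
    while True:
        time.sleep(delay_s)  # simulates per-frame generation time
        yield b""


latency, fps = measure_stream(fake_stream(), measured_frames=20)
print(f"First frame latency: {latency * 1000:.1f} ms, sustained: {fps:.1f} FPS")
```

When timing a real GPU model, asynchronous execution usually means the device has to be synchronized before each clock read, otherwise the measured times can look better than the actual ones.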
Experience MaineCoon live
Input a prompt and watch real-time streaming audio-visual generation on the official platform.