Benchmarks

Numbers that back the claims

SocialVideo Bench results, inference speed, and cost metrics, all sourced from the official technical report.


Metric	Value
Throughput, single H100 GPU	47.5 FPS
First frame latency	< 3 s
Continuous generation	10+ min
Generation cost	< $0.001/s
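As a rough illustration of what the generation cost figure implies, the sketch below turns the per-second ceiling into an upper bound for a whole clip. It assumes the "/s" refers to seconds of generated video and that the rate stays flat over the full duration; neither is stated on this page, and `cost_upper_bound` is a hypothetical helper, not part of any MaineCoon tooling.

```python
from decimal import Decimal

# "<$0.001/s" from the page, treated here as a ceiling (assumption: per second
# of generated video, constant over the whole clip).
COST_CEILING_PER_SECOND = Decimal("0.001")


def cost_upper_bound(duration_seconds: int) -> Decimal:
    """Return the maximum cost in USD for a clip of the given length."""
    if duration_seconds < 0:
        raise ValueError("duration must be non-negative")
    return COST_CEILING_PER_SECOND * duration_seconds


# A 10-minute continuous generation is 600 seconds.
print(cost_upper_bound(600))  # 0.600
```

Under those assumptions, a 10-minute clip would cost less than $0.60.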

Output quality in practice

Benchmarks measure scores; this sample shows what state-of-the-art output looks like in motion. It is a real MaineCoon generation with synchronized audio (unmute to verify lip sync).

MaineCoon

SocialVideo Bench — Overall Score

Catnip's benchmark for social-interaction video, covering 7 scenarios and 9 evaluation metrics. MaineCoon surpasses all 7 compared models.

Model	Overall score
MaineCoon	0.934
SoulX-FlashTalk	0.895
Other baselines (×5)	< 0.89

7 Scenarios

Dense speech
Two-person interaction
Musical performance
Emotional acting
Dance
Creative challenges
Social memes

9 Metrics

Visual quality
Motion quality
Audio quality
Audio-visual alignment
Overall quality
Temporal consistency
Character consistency
Lip sync accuracy
Emotional expressiveness

Speed Comparison

Model	FPS	Notes
MaineCoon (22B)	47.5	Single H100
MaineCoon (22B)	30+	RTX Pro 6000
Streaming AV peers	6–7	Typical range
1.3B streaming video model	19.1	MaineCoon is 2×+ faster despite having about 17× the parameters
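The "2×+ faster" and "17×" figures in the last row can be checked directly from the table. The short sketch below redoes that arithmetic; it uses only the numbers quoted above, and `ratio` is a hypothetical helper introduced for illustration.

```python
# Figures copied from the speed comparison table.
MAINECOON_FPS = 47.5         # 22B model, single H100
MAINECOON_PARAMS_B = 22.0    # parameters, in billions
SMALL_MODEL_FPS = 19.1       # 1.3B streaming video model
SMALL_MODEL_PARAMS_B = 1.3   # parameters, in billions


def ratio(a: float, b: float) -> float:
    """Return a / b, rejecting a non-positive denominator."""
    if b <= 0:
        raise ValueError("denominator must be positive")
    return a / b


speedup = ratio(MAINECOON_FPS, SMALL_MODEL_FPS)
param_ratio = ratio(MAINECOON_PARAMS_B, SMALL_MODEL_PARAMS_B)

print(f"speedup: {speedup:.2f}x")          # speedup: 2.49x
print(f"param ratio: {param_ratio:.1f}x")  # param ratio: 16.9x
```

The table rounds these to "2×+" and "17×".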

What is SocialVideo Bench?

SocialVideo Bench is a benchmark created by Catnip specifically for social-interaction video generation. It evaluates models across 7 social scenarios and 9 quality metrics, including visual quality, motion, audio, alignment, and consistency.

How was the 47.5 FPS measured?

It was measured on a single NVIDIA H100 GPU during streaming inference. An RTX Pro 6000 reaches 30+ FPS, which is sufficient for real-time playback at standard frame rates.
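To make the real-time claim concrete, the sketch below compares the quoted generation rates against a few playback rates. The 24, 25, and 30 fps values are assumptions, since the page only says "standard frame rates", and 30.0 stands in for "30+" as a lower bound. The calculation also assumes steady throughput and ignores first frame latency.

```python
# Assumed playback rates; the page only says "standard frame rates".
PLAYBACK_RATES = (24.0, 25.0, 30.0)

# Generation rates quoted on the page. 30.0 stands in for "30+" (a lower bound).
GENERATION_RATES = {
    "H100": 47.5,
    "RTX Pro 6000 (lower bound)": 30.0,
}


def realtime_factor(generation_fps: float, playback_fps: float) -> float:
    """Frames generated per frame played; 1.0 or more means generation keeps up."""
    if generation_fps <= 0 or playback_fps <= 0:
        raise ValueError("frame rates must be positive")
    return generation_fps / playback_fps


def frame_time_ms(generation_fps: float) -> float:
    """Average wall-clock time spent producing one frame, in milliseconds."""
    if generation_fps <= 0:
        raise ValueError("frame rate must be positive")
    return 1000.0 / generation_fps


for gpu, gen_fps in GENERATION_RATES.items():
    print(f"{gpu}: {frame_time_ms(gen_fps):.1f} ms per frame")
    for playback_fps in PLAYBACK_RATES:
        factor = realtime_factor(gen_fps, playback_fps)
        print(f"  {playback_fps:.0f} fps playback: {factor:.2f}x real time")
```

A factor of 1.0 or higher means frames are produced at least as fast as they are shown. At 47.5 FPS the H100 clears all three assumed rates with margin, while the 30 FPS lower bound for the RTX Pro 6000 just meets 30 fps playback.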

Can I reproduce these benchmarks?

The technical report on arXiv contains methodology details. Model weights and code are on Hugging Face and GitHub.

Experience MaineCoon live

Input a prompt and watch real-time streaming audio-visual generation on the official platform.
