Open-source LLM evaluation platform

Find the Best LLM Response for Your Data

Cross-evaluate models using LLM-as-a-Judge methodology. No bias, no guesswork, just consensus.

Example: given the prompt "Compare the trade-offs between microservices and monolithic architecture for a startup," GPT-5.4, Claude Sonnet 4.6, and Gemini 3.1 Pro each respond, then judge one another (LLM-as-a-Judge). Ranking: #1 Gemini 3.1 Pro, #2 Claude Sonnet 4.6, #3 GPT-5.4. Consensus winner: Gemini 3.1 Pro with a score of 8.7/10.

How It Works

Three steps to finding the best response

Submit, evaluate, review. It's that simple.

01 — Submit Your Query

Enter your prompt with optional context. Select which models to compare from 50+ LLMs via OpenRouter — including models from OpenAI, Anthropic, Google, Meta, and Mistral.
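
For developers who want to see the shape of this step, here is a minimal sketch in Python of fanning one prompt out to several models through OpenRouter's OpenAI-compatible chat completions endpoint. The requests-based client, the environment variable name, and the model IDs are illustrative assumptions, not LM Compass internals:

import os
import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
API_KEY = os.environ["OPENROUTER_API_KEY"]  # assumes your key is in this env var

# Placeholder model IDs; use the exact identifiers from the OpenRouter catalog.
MODELS = ["openai/gpt-5.4", "anthropic/claude-sonnet-4.6", "google/gemini-3.1-pro"]

def query_model(model: str, prompt: str) -> str:
    """Send one prompt to one model via OpenRouter and return its reply text."""
    resp = requests.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

prompt = "Compare the trade-offs between microservices and monolithic architecture for a startup."
responses = {model: query_model(model, prompt) for model in MODELS}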

02 — Models Judge Each Other

Each model evaluates the others using your choice of research-based evaluation methods and a custom rubric. Consensus emerges from cross-evaluation rather than a single opinion, which eliminates single-model bias and leaves the choice of method to you.
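
Conceptually, the cross-evaluation step looks like the sketch below, which reuses query_model from the previous snippet. The rubric text, judge prompt wording, and bare-number score parsing are simplifying assumptions; LM Compass's actual evaluation methods are configurable:

rubric = "Accuracy, depth, and practical relevance for a startup audience."  # illustrative rubric

def judge(judge_model: str, rubric: str, prompt: str, answer: str) -> float:
    """Ask one model to grade another model's answer on a 0-10 scale."""
    judge_prompt = (
        f"Rubric:\n{rubric}\n\nTask:\n{prompt}\n\nCandidate answer:\n{answer}\n\n"
        "Score the answer from 0 to 10 against the rubric. Reply with the number only."
    )
    # A real system would parse the reply defensively; float() keeps the sketch short.
    return float(query_model(judge_model, judge_prompt).strip())

# Every model scores every response except its own, so no model grades itself.
scores = {
    author: [judge(j, rubric, prompt, answer) for j in MODELS if j != author]
    for author, answer in responses.items()
}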

03 — Review & Export Results

Compare ranked responses and review the score and reasoning from each judge. Human feedback (e.g., RL4F) is one of the evaluation methods you can use; when judges tie, no method's verdict is reinforced automatically, so you stay in control. Export your findings for further analysis.
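
A sketch of how a consensus ranking and export could fall out of those scores (the JSON schema here is illustrative, not LM Compass's actual export format):

import json

# Consensus score = mean of the judges' scores; the highest mean wins.
consensus = {model: sum(s) / len(s) for model, s in scores.items()}
ranking = sorted(consensus, key=consensus.get, reverse=True)

with open("evaluation.json", "w") as f:
    json.dump({"prompt": prompt, "consensus": consensus, "ranking": ranking}, f, indent=2)

print(f"Consensus winner: {ranking[0]} ({consensus[ranking[0]]:.1f}/10)")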

Features

Why LM Compass?

Built for researchers who need the best response for their data, not marketing benchmarks.

Multi-Model Comparison

Query 50+ LLMs simultaneously via OpenRouter. Compare responses side-by-side.

LLM-as-a-Judge

Automated cross-evaluation where models assess each other's responses.

Custom Rubrics

Define your own evaluation criteria for any use case or domain.

Consensus Rankings

Score-based grading where the winner is determined by consensus across judges.

Human Feedback

Use human feedback or RL4F as one of your evaluation methods. When results are tied, no method is reinforced automatically; the choice stays with you.

Batch Experiments

Upload datasets for large-scale evaluations across models.
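
As a sketch, a batch run amounts to applying the single-prompt pipeline above to every record of an uploaded dataset. The JSONL layout and field names here are illustrative assumptions, not LM Compass's actual upload schema:

import json

# experiments.jsonl: one record per line, e.g. {"prompt": "...", "context": "..."}
with open("experiments.jsonl") as f:
    for line in f:
        record = json.loads(line)
        full_prompt = record["prompt"]
        if "context" in record:  # optional context is prepended to the prompt
            full_prompt = record["context"] + "\n\n" + full_prompt
        responses = {m: query_model(m, full_prompt) for m in MODELS}
        # ...then judge, rank, and export per prompt exactly as in the steps above.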

Built For

Who It's For

Researchers & Developers

Compare LLMs and SLMs with custom rubrics and automated evaluation to find the best response for your data.

AI Teams & Organizations

Determine best model responses at scale with batch experiments and exports.

Academic Community

Study model evaluation techniques with a research-backed, open-source platform.

Ready to find the best response for your data?

Support for OpenAI, Anthropic, Google, Meta, Mistral, and more via OpenRouter.

Get Started with LM Compass

Sign up to begin evaluating.

LM Compass

A peer-review evaluation platform for LLMs and SLMs.

Rankings and evaluations are experimental and should not be considered definitive.

© 2026 LM Compass · CS 4ZP6