Open-source LLM evaluation platform

Find the Best LLM Response for Your Data

Cross-evaluate models using LLM-as-a-Judge methodology. No bias, no guesswork, just consensus.

Example: given the prompt "Compare the trade-offs between microservices and monolithic architecture for a startup," GPT-5.4, Claude Sonnet 4.6, and Gemini 3.1 Pro each respond, then judge one another (LLM-as-a-Judge). Ranking: #1 Gemini 3.1 Pro, #2 Claude Sonnet 4.6, #3 GPT-5.4. Consensus winner: Gemini 3.1 Pro with a score of 8.7/10.

How It Works

Three steps to finding the best response

Submit, evaluate, review. It's that simple.

01 — Submit Your Query

Enter your prompt with optional context. Select which models to compare from 50+ LLMs via OpenRouter — including models from OpenAI, Anthropic, Google, Meta, and Mistral.
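
For developers who want to see the shape of this step, here is a minimal sketch in Python of fanning one prompt out to several models through OpenRouter's OpenAI-compatible chat completions endpoint. The requests-based client, the environment variable name, and the model IDs are illustrative assumptions, not LM Compass internals:

import os
import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
API_KEY = os.environ["OPENROUTER_API_KEY"]  # assumes your key is in this env var

# Placeholder model IDs; use the exact identifiers from the OpenRouter catalog.
MODELS = ["openai/gpt-5.4", "anthropic/claude-sonnet-4.6", "google/gemini-3.1-pro"]

def query_model(model: str, prompt: str) -> str:
    """Send one prompt to one model via OpenRouter and return its reply text."""
    resp = requests.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

prompt = "Compare the trade-offs between microservices and monolithic architecture for a startup."
responses = {model: query_model(model, prompt) for model in MODELS}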

02 — Models Judge Each Other

Each model evaluates the others using your choice of research-based evaluation methods and a custom rubric. Consensus emerges from cross-evaluation rather than a single opinion, which eliminates single-model bias and leaves the choice of method to you.
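
Conceptually, the cross-evaluation step looks like the sketch below, which reuses query_model from the previous snippet. The rubric text, judge prompt wording, and bare-number score parsing are simplifying assumptions; LM Compass's actual evaluation methods are configurable:

rubric = "Accuracy, depth, and practical relevance for a startup audience."  # illustrative rubric

def judge(judge_model: str, rubric: str, prompt: str, answer: str) -> float:
    """Ask one model to grade another model's answer on a 0-10 scale."""
    judge_prompt = (
        f"Rubric:\n{rubric}\n\nTask:\n{prompt}\n\nCandidate answer:\n{answer}\n\n"
        "Score the answer from 0 to 10 against the rubric. Reply with the number only."
    )
    # A real system would parse the reply defensively; float() keeps the sketch short.
    return float(query_model(judge_model, judge_prompt).strip())

# Every model scores every response except its own, so no model grades itself.
scores = {
    author: [judge(j, rubric, prompt, answer) for j in MODELS if j != author]
    for author, answer in responses.items()
}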

03 — Review & Export Results

Compare ranked responses and review the score and reasoning from each judge. Human feedback (e.g., RL4F) is one of the evaluation methods you can use; when judges tie, no method's verdict is reinforced automatically, so you stay in control. Export your findings for further analysis.
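
A sketch of how a consensus ranking and export could fall out of those scores (the JSON schema here is illustrative, not LM Compass's actual export format):

import json

# Consensus score = mean of the judges' scores; the highest mean wins.
consensus = {model: sum(s) / len(s) for model, s in scores.items()}
ranking = sorted(consensus, key=consensus.get, reverse=True)

with open("evaluation.json", "w") as f:
    json.dump({"prompt": prompt, "consensus": consensus, "ranking": ranking}, f, indent=2)

print(f"Consensus winner: {ranking[0]} ({consensus[ranking[0]]:.1f}/10)")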

Features

Why LM Compass?

Built for researchers who need the best response for their data, not marketing benchmarks.

Multi-Model Comparison

Query 50+ LLMs simultaneously via OpenRouter. Compare responses side-by-side.

LLM-as-a-Judge

Automated cross-evaluation where models assess each other's responses.

Custom Rubrics

Define your own evaluation criteria for any use case or domain.

Consensus Rankings

Score-based grading where the winner is determined by consensus across judges.

Human Feedback

Use human feedback or RL4F as one of your evaluation methods. When results are tied, no method is reinforced automatically; the choice stays with you.

Batch Experiments

Upload datasets for large-scale evaluations across models.
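
As a sketch, a batch run amounts to applying the single-prompt pipeline above to every record of an uploaded dataset. The JSONL layout and field names here are illustrative assumptions, not LM Compass's actual upload schema:

import json

# experiments.jsonl: one record per line, e.g. {"prompt": "...", "context": "..."}
with open("experiments.jsonl") as f:
    for line in f:
        record = json.loads(line)
        full_prompt = record["prompt"]
        if "context" in record:  # optional context is prepended to the prompt
            full_prompt = record["context"] + "\n\n" + full_prompt
        responses = {m: query_model(m, full_prompt) for m in MODELS}
        # ...then judge, rank, and export per prompt exactly as in the steps above.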

Built For

Who It's For

Researchers & Developers

Compare LLMs and SLMs with custom rubrics and automated evaluation to find the best response for your data.

AI Teams & Organizations

Determine best model responses at scale with batch experiments and exports.

Academic Community

Study model evaluation techniques with a research-backed, open-source platform.

Ready to find the best response for your data?

Support for OpenAI, Anthropic, Google, Meta, Mistral, and more via OpenRouter.

Get Started with LM Compass

Sign up to begin evaluating.

LM Compass

A peer-review evaluation platform for LLMs and SLMs.

Rankings and evaluations are experimental and should not be considered definitive.

© 2026 LM Compass · CS 4ZP6