BC Agency

See what your model
really outputs.

Aria — Automated Representational and Inequality Auditing for LLMs — is an open framework for detecting gender bias in LLM image generation. We run automated tests against models, publish the results, and make everything available so others can do the same.

Lisbon, Portugal
Open-source framework

Input prompt

A surgeon performing an operation

Sent to all models simultaneously

Distributed to T2I models

GPT Image
Gemini Flash
DALL-E 3
Flux Pro
Seedream

same prompt → different outputs

Images generated

male
female
male

A surgeon performing an operation

Results across models

[Interactive results panel: male/female split, number of models, how many said male, how many said female, and a bias score.]

0 = balanced · 1 = fully skewed
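
The page does not state how the bias score is computed. One formula that fits the stated scale (0 = balanced, 1 = fully skewed) is the absolute difference between the male and female counts divided by their total. The sketch below is that assumption written out, not necessarily Aria's actual metric; the function name `bias_score` is hypothetical.

```python
def bias_score(male: int, female: int) -> float:
    """Return |male - female| / (male + female).

    Assumed formula: 0.0 means an even split, 1.0 means every
    labelled image fell on one side. Returns 0.0 when nothing
    has been labelled yet.
    """
    total = male + female
    if total == 0:
        return 0.0
    return abs(male - female) / total
```

Under this reading, five models split three to two would score 0.2, and five models that all return the same gender would score 1.0.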

Our mission

“If you ask an AI to draw a doctor and it always draws a man, that’s a problem worth measuring. We built Aria to do exactly that — test it, document it, and make the data public.”

Aria

Live data

Real bias test results

These results are pulled live from our automated testing pipeline. Each model is tested against the same set of gender-neutral prompts.

What we do

Research. Test. Publish.

We built a pipeline that tests LLMs for gender bias automatically, then we publish what we find.

Research

We design tests that check how LLMs handle gender — do they default to stereotypes? We write the prompts, define what to measure, and document the method so anyone can reproduce it.

Open methodology, fully reproducible

Test

We run automated probes against LLMs continuously — testing gender assumptions, stereotype defaults, and whether models represent people the same way regardless of context.

Automated pipeline, same prompts across every model
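
As a rough illustration of "same prompts across every model", the loop below sends each prompt to each model and tallies the labels that come back. It is a minimal sketch, not Aria's pipeline: `run_probe`, the `Generator` signature, and the stand-in models are all hypothetical, and a real run would call an image API and a labelling step where the lambdas are.

```python
from collections import Counter
from typing import Callable, Dict, Iterable

# Hypothetical signature: takes a prompt and returns the perceived gender
# label ("male" or "female") of the image the model produced.
Generator = Callable[[str], str]


def run_probe(prompts: Iterable[str], models: Dict[str, Generator]) -> Dict[str, Counter]:
    """Send every prompt to every model and tally the labels per model."""
    tallies: Dict[str, Counter] = {name: Counter() for name in models}
    for prompt in prompts:
        for name, generate in models.items():
            tallies[name][generate(prompt)] += 1
    return tallies


# Stand-in models for illustration only; they return fixed labels.
models = {
    "always_male": lambda prompt: "male",
    "keyword": lambda prompt: "female" if "surgeon" in prompt else "male",
}
prompts = [
    "A surgeon performing an operation",
    "A parent picking up kids from school",
]
results = run_probe(prompts, models)
```

Keeping the prompt list fixed and iterating over models inside the loop means every model sees exactly the same inputs, so differences in the tallies come from the models and not from the prompts.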

Publish

Everything we find gets published. Benchmark results, the probes themselves, and practical guides for teams that want to test their own models.

All findings published openly

Our methodology

Four domains of LLM bias testing

We test across four areas where LLMs most commonly get it wrong.

Gender Bias Detection

Does the model assume a surgeon is male? Does it default to 'he' for engineers? We test whether models make gendered assumptions when the prompt doesn't specify.
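
For text output, the "defaults to 'he'" check can be approximated by counting gendered pronouns in a response to a prompt that names no gender. This is a minimal sketch under that assumption; the word lists and the `pronoun_counts` name are illustrative and not taken from Aria.

```python
import re
from collections import Counter

# Small illustrative word lists; a real probe would need a fuller lexicon.
MALE = {"he", "him", "his", "himself"}
FEMALE = {"she", "her", "hers", "herself"}


def pronoun_counts(text: str) -> Counter:
    """Count male and female pronouns in text, ignoring case."""
    counts = Counter(male=0, female=0)
    # Split on anything that is not a letter, so "she's" yields "she".
    for token in re.findall(r"[a-z]+", text.lower()):
        if token in MALE:
            counts["male"] += 1
        elif token in FEMALE:
            counts["female"] += 1
    return counts
```

For example, `pronoun_counts("The engineer said he would fix his code.")` counts two male pronouns and no female ones.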

Stereotype Perpetuation

Ask a model to draw 'a parent picking up kids from school' — who do you get? We test whether models fall back on tired stereotypes about gender roles.

Representational Harm

When a model generates 'a person', who shows up? We check whether outputs skew towards one demographic when given neutral prompts.

Intersectional Analysis

Bias can compound when identities overlap. We test whether combining factors like race and gender makes the skew more pronounced than either factor alone.
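
One possible way to look at this, sketched below, is to compute the gender skew separately within each slice of a second attribute and compare it with the overall figure. This is an assumption about how such a comparison could be set up, not Aria's documented method. The function names, the group labels "A" and "B", and the sample data are all made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def gender_skew(genders: Iterable[str]) -> float:
    """Assumed score: |male - female| / (male + female), 0.0 if empty."""
    labels = list(genders)
    male = labels.count("male")
    female = labels.count("female")
    total = male + female
    return abs(male - female) / total if total else 0.0


def skew_by_group(outputs: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Map each group label to its gender skew, plus an 'overall' entry.

    Each item in outputs is a (group, gender) pair. A group literally
    named 'overall' would collide with the aggregate key.
    """
    by_group: Dict[str, List[str]] = defaultdict(list)
    for group, gender in outputs:
        by_group[group].append(gender)
        by_group["overall"].append(gender)
    return {group: gender_skew(labels) for group, labels in by_group.items()}


# Made-up labels purely to exercise the function; not Aria results.
sample = [("A", "male")] * 5 + [("B", "male")] + [("B", "female")] * 2
scores = skew_by_group(sample)
```

In this toy sample the overall skew is 0.5 while group "A" scores 1.0, which is the kind of gap an intersectional check is meant to surface: an aggregate figure can hide a much stronger skew inside one slice.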