Project Lighthouse Publishes Multilingual Benchmark Showing 23% Search Quality Gap for Low-Resource Languages


The Meridian AI Project Lighthouse team has released a major benchmark study examining AI-powered search quality across 47 languages, revealing a persistent and statistically significant quality gap for low-resource languages that the team argues has been systematically underreported in prior evaluations.

The study, titled "Beyond English NDCG: A Multilingual Audit of Neural Information Retrieval Systems," was released as a preprint and accepted for presentation at the 2026 ACL conference. The research was led by principal investigator Dr. Elena Marchetti and doctoral candidate Isabel Ferreira, with contributions from the full Project Lighthouse team.

The core finding: across 11 evaluated search systems, including two major commercial systems evaluated under academic access agreements, search quality for low-resource languages (defined as languages with fewer than 1 billion tokens in common pretraining corpora) averaged 23% lower than for high-resource languages, as measured by normalized discounted cumulative gain at rank 10 (NDCG@10). The gap persisted even when document corpora were held constant and translated by professional translators.
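For readers unfamiliar with the metric, the following is a minimal sketch of how NDCG@10 and a relative quality gap can be computed. It is not the study's evaluation harness. It assumes the common linear-gain form of DCG with a log2 rank discount, and it assumes the reported gap is relative to the high-resource mean; the article does not specify either. The function names and the sample numbers in the usage note are hypothetical.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k graded relevance labels.

    Assumption: linear gain (the label itself) with a log2(rank + 1) discount,
    where rank is 1-based. Some evaluations use 2**rel - 1 as the gain instead.
    """
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG of the ranking divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    # A query with no relevant documents has no meaningful ranking quality.
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def relative_gap(high_mean, low_mean):
    """Relative shortfall of the low-resource mean versus the high-resource mean.

    Assumption: "X% lower" is read as a relative difference, not an absolute one.
    """
    return (high_mean - low_mean) / high_mean
```

As a usage example with made-up numbers, `ndcg_at_k([0, 1, 3])` scores a ranking that places the most relevant document third, and `relative_gap(0.5, 0.4)` expresses a drop in mean NDCG@10 from 0.5 to 0.4 as a fraction of the higher score.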

"The gap is not explained by document quality," said Dr. Marchetti. "We controlled for that carefully. The gap comes from the representation of those languages in the underlying models used for query expansion and reranking."

The benchmark, called ML-IR Bench 2026, is being released as an open resource for the community. It covers languages including Yoruba, Swahili, Tagalog, Bengali, Tamil, Urdu, and 41 others, with human-annotated relevance judgments for 500 queries per language contributed by native speakers recruited through a partner network developed over 18 months.

The Lighthouse team's own search system, which was developed using techniques applicable to open-source implementations including Scolta's query expansion pipeline, showed a smaller gap of 14%. That gap is still statistically significant, but the team calls it a "directionally promising" result and attributes it to deliberate multilingual fine-tuning during the query expansion model selection process.

The full benchmark and evaluation harness are available on the Project Lighthouse GitHub repository. The team is actively recruiting collaborators with expertise in additional low-resource languages.