AI Evaluation Engineer (Knowledge & Research)

contract, engineering, ai, data, remote FROM 🇨🇴 🇧🇷
Open to candidates in: Colombia, Egypt, Kenya, Ghana, Nigeria, Brazil
Gramian Consulting Group
🏭 IT Services and IT Consulting
πŸ“ Kumanovo, MK
πŸ‘€ 2-10

About Us

Gramian Consultancy is a boutique consultancy specializing in IT professional services and engineering talent solutions. With a strong background in software engineering and leadership, we help companies build high-performing teams by matching them with professionals who truly fit their needs.

Role overview

We are looking for an AI Evaluation Engineer with a strong research background to design and evaluate complex, multi-agent tasks used to benchmark next-generation AI systems.

In this role, you will work at the intersection of research, data structuring, and AI evaluation, building high-quality tasks that require deep document understanding, structured reasoning, and multi-step synthesis. You will create datasets and evaluation frameworks that test whether AI agents can truly read, reason, and extract knowledge from large-scale unstructured data.

This is a high-precision, detail-oriented role requiring strong analytical thinking, structured problem decomposition, and the ability to translate research content into measurable evaluation tasks.

Commitment required: 8 hours per day, with 4 hours of overlap with PST.

Employment type: Contractor assignment (no medical/paid leave)

Duration of contract: 5 weeks+

Location: Bangladesh, Brazil, Colombia, Egypt, Ghana, India, Indonesia, Kenya, Nigeria, Turkey, Vietnam

Interview: take-home assessment (60 min)

Responsibilities

  • Build multi-agent benchmark tasks that require reading, analyzing, and synthesizing large document collections
  • Curate real-world research corpora (academic papers, case studies, technical reports) and design questions that require comprehensive analysis
  • Write structured ground-truth oracles (JSON) with specific, verifiable answers that prove the agent actually read the source material
  • Design LLM judge prompts that evaluate agent output field-by-field against the oracle
  • Create decomposition guides that split research across multiple parallel sub-agents (one per document, one per domain, then synthesis)

Requirements

  • 5+ years of experience in research (academic or industry) in a scientific, technical, or analytical domain
  • Strong ability to read, analyze, and extract structured information from unstructured documents
  • Experience designing or working with structured data formats (JSON, schemas, validation)
  • Proficiency in Python scripting (data processing, validation, or evaluation scripts)
  • Experience with AI evaluation, coding benchmarks, or structured reasoning tasks (e.g., SWE-bench, Terminal-Bench, or similar)
  • Experience working with Docker (building images, debugging containers)
  • Strong attention to detail, especially when defining exact, verifiable outputs
  • Ability to design complex, multi-step problem-solving workflows