Posted 28 Apr 26

AI Evaluation Engineer (Software Engineering / Code)

Tags: contract, engineering, software, AI, data, remote from 🇨🇴 🇧🇷

Open to candidates in: Colombia, Egypt, Kenya, Ghana, Nigeria, Brazil

Gramian Consulting Group

🏭 IT Services and IT Consulting

📍 Kumanovo, MK

👤 2-10

About Us

Gramian Consultancy is a boutique consultancy specializing in IT professional services and engineering talent solutions. With a strong background in software engineering and leadership, we help companies build high-performing teams by matching them with professionals who truly fit their needs.

Role overview

We are looking for an AI Evaluation Engineer specialized in software engineering to design benchmark tasks based on real-world coding workflows.

You will create scenarios where AI systems must analyze large codebases, apply precise changes (bug fixes, refactors, migrations), and produce correct, testable outputs.

Commitment required: 8 hours per day, including 4 hours of overlap with PST.

Employment type: Contractor assignment (no medical/paid leave)

Duration of contract: 4 weeks+

Location: Bangladesh, Brazil, Colombia, Egypt, Ghana, India, Indonesia, Kenya, Nigeria, Turkey, Vietnam

Interview: Take-home assessment

Responsibilities

Design and build multi-agent benchmark tasks based on real-world code changes (bug fixes, migrations, refactors)
Work with the Harbor evaluation framework to run and validate tasks in containerized environments
Write clear, precise task instructions (file paths, function signatures, expected behavior, constraints)
Develop Python-based verification scripts to validate correctness of code changes
Define task decomposition strategies across multiple specialized agents
Analyze and navigate large open-source codebases to extract realistic task scenarios
Run, debug, and refine tasks in Docker environments to ensure reproducibility
Improve task quality, clarity, and difficulty based on evaluation results

Requirements

5+ years of experience in software development (Python and JavaScript)
Strong experience working with large codebases (e.g., Django, Flask, FastAPI, Node.js or similar)
Familiarity with Git workflows (pull requests, diffs, commits, cherry-picking)
Experience writing tests or validation scripts (pytest, unittest, or similar)
Ability to write clear, precise technical specifications
Familiarity with AI coding benchmarks or evaluation frameworks (e.g., SWE-bench or similar)
Hands-on experience with Docker (Dockerfiles, image builds, debugging)

Nice to Have

Experience contributing to or maintaining open-source projects
Experience with code migrations or large-scale refactoring
Familiarity with CI/CD pipelines and automated testing workflows
Exposure to LLM-based coding tools or evaluation frameworks
