AI Evaluation Engineer - Mathematics & Algorithms
About Us
Gramian Consultancy is a boutique consultancy specializing in IT professional services and engineering talent solutions. With a strong background in software engineering and leadership, we help companies build high-performing teams by matching them with professionals who truly fit their needs.
Role Overview
We are looking for a highly analytical and computationally strong professional with a solid research background in mathematics or quantitative fields.
In this role, you will design advanced benchmark tasks for multi-agent AI systems, focusing on complex mathematical reasoning, algorithmic problem-solving, and verifiable computational outputs. You will contribute by crafting challenging problems, building validation systems, and structuring tasks that require decomposition into coordinated sub-solutions.
Commitment required: 8 hours per day, including 4 hours of overlap with PST.
Employment type: Contractor assignment (no medical/paid leave)
Duration of contract: 4 weeks+
Location: Bangladesh, Brazil, Colombia, Egypt, Ghana, India, Indonesia, Kenya, Nigeria, Turkey, Vietnam
Interview: Take-home assessment (60 min) + short interview
Responsibilities
- Design and build multi-agent benchmark tasks requiring multi-step mathematical reasoning and algorithmic problem-solving
- Create complex, decomposable problems across domains such as:
  - Competition mathematics
  - Numerical analysis
  - Combinatorial optimization
  - Statistical inference
- Develop verification scripts to validate:
  - Numerical outputs (with tolerance thresholds)
  - Proof correctness and logical steps
  - Algorithmic outputs and constraints
- Write clear, structured problem statements with precise notation and defined outputs
- Design task decomposition strategies for parallel or multi-agent execution
- Implement computational solutions and validation pipelines using Python
- Work with containerized environments (Docker) for reproducibility and evaluation
Requirements
- 5+ years in mathematics, quantitative research, or computational science (competition math, university-level mathematics, or quantitative research background)
- Python programming: NumPy, SciPy, or symbolic computation (SymPy)
- Experience writing mathematical proofs or formal derivations
- Ability to create problems with precise, verifiable answers, not subjective or open-ended ones
- Experience with AI coding benchmarks (SWE-bench, Terminal-bench)
- Comfortable with Docker: writing Dockerfiles, building images, and debugging container issues
- Understanding of numerical methods: floating-point tolerance, convergence criteria, error bounds
Nice to Have
- Experience creating competition math problems (AMC, AIME, Putnam, IMO)
- Background in theoretical computer science or advanced mathematics research
- Exposure to automated theorem proving or formal verification
- Familiarity with AI reasoning benchmarks (GSM8K, MATH, AIME, GPQA, ARC-AGI)
- Experience in large-scale numerical or scientific computing