About the role
We are hiring an engineer who moves fluently between applied research and production. You will own the end-to-end quality loop for our conversational systems — agent design, evaluation methodology, prompt and context engineering, dataset curation, and quantitative analysis of model behavior on multi-turn tasks. You will set the bar for what our product can do, and defend that bar with data.
What you'll do
- Design and ship agents that handle multi-turn dialogue in noisy, real-world conditions.
- Build rigorous evaluation suites — structured benchmarks, regression sets, and principled failure-mode analysis — that compound quality over time.
- Own the full quality loop, from dataset curation to prompt design to post-hoc error analysis.
- Make principled tradeoffs between latency, cost, and quality across model providers.
- Drive roadmap and architectural decisions based on rigorous experimentation, not intuition.
You're a good fit if you
- Hold an MS or PhD in Computer Science, Machine Learning, or a closely related field (strongly preferred).
- Have deep, hands-on experience with modern language models in production — not demos.
- Are an expert in Python and comfortable operating in a fast-moving, high-ownership codebase.
- Can design an evaluation, defend its methodology, and articulate its limitations.
- Have a track record of shipped, published, or open-sourced work that demonstrates depth of craft.
Strong candidates also have
- Background in reinforcement learning, fine-tuning, or post-training.
- Prior work on agentic systems or structured extraction from unstructured inputs.
- Peer-reviewed publications or substantive open-source contributions.