Theory Interview Questions

Prepare for AI engineer interviews with theory questions on LLMs, RAG, AI agents, evaluation, monitoring, cost optimization, and safety.

Introduction

These theory interview questions were collected as part of our ongoing research into the AI engineer role. Rather than trying to compile every possible question about large language models (LLMs), retrieval-augmented generation (RAG), agents, evaluation, or safety, we focused on questions that candidates are actually asked in AI engineer interviews. The list draws on real market data: candidate reports shared on Reddit, X, and personal blogs, where people describe their interview experiences.

Format

This part of the interview is usually 45 to 60 minutes and tends to be conversational. The interviewer asks conceptual questions to understand how well you grasp core AI and ML topics. There is typically no coding exercise or whiteboard task. Instead, the format is a back-and-forth discussion where the interviewer checks whether you can explain ideas clearly, reason through trade-offs, and connect concepts to practical system behavior.

Theory questions do not always appear as a fully separate round. In many interview processes, they are woven into other stages such as system design interviews, project deep dives, or broader AI and ML technical screens. Some companies do run a dedicated LLM theory or AI deep-dive round, but this is less common.

More often, these questions appear as follow-ups. You mention a concept such as RAG, agents, evaluation, or fine-tuning, and the interviewer uses that as an opening to test how well you actually understand it. In that sense, theory questions are often less about recall and more about depth. The interviewer is trying to see whether you can go beyond familiar terms and explain how the underlying ideas work in practice.

How to Prepare

Focus more on practical system thinking than on abstract theory. In most AI engineer interviews, interviewers care less about whether you can recite Transformer internals from memory and more about whether you understand how to build, evaluate, and operate AI systems in practice. The questions that come up most often tend to center on RAG systems, agents, evaluation, and production concerns.

A good preparation strategy is to concentrate on the areas that show up repeatedly:

  1. RAG systems: Be able to explain a complete pipeline end to end, from ingestion and chunking through indexing, retrieval, ranking, prompt construction, response generation, and attribution. You should also be ready to discuss failure modes, debugging, and scaling (a minimal pipeline sketch follows this list).

  2. Agents: Understand the full architecture, not just the idea of “tool use.” That includes planning, tool selection, execution flow, memory, retries, and termination conditions. One especially important question is when not to use an agent, since interviewers often want to see whether you can distinguish between a useful agentic workflow and unnecessary complexity.

  3. Testing and evaluation: Be prepared to explain how you evaluate quality in practice. That includes building golden datasets, choosing useful metrics, reviewing outputs, and designing evaluations for systems like chatbots, RAG pipelines, and agents.
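
To make the first point concrete, here is a minimal sketch of a RAG pipeline in Python. The `embed()` and `llm()` functions are hypothetical placeholders for your embedding model and LLM client, and the fixed-size chunking is deliberately naive; a production system would add overlap-aware chunking, a real vector index, reranking, and attribution.

```python
# Minimal RAG pipeline sketch. embed() and llm() are hypothetical
# placeholders for your embedding model and LLM client.
import numpy as np

def embed(text: str) -> list[float]:
    """Placeholder: call your embedding model here."""
    raise NotImplementedError

def llm(prompt: str) -> str:
    """Placeholder: call your chat model here."""
    raise NotImplementedError

def chunk(text: str, size: int = 500) -> list[str]:
    # Naive fixed-size chunking; real systems add overlap and split on
    # semantic boundaries such as headings or paragraphs.
    return [text[i:i + size] for i in range(0, len(text), size)]

def build_index(docs: list[str]) -> tuple[list[str], np.ndarray]:
    # Ingestion + indexing: chunk every document, embed every chunk.
    chunks = [c for doc in docs for c in chunk(doc)]
    return chunks, np.array([embed(c) for c in chunks])

def retrieve(query: str, chunks: list[str], vectors: np.ndarray, k: int = 3) -> list[str]:
    # Retrieval + ranking: score chunks by cosine similarity to the query.
    q = np.array(embed(query))
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

def answer(query: str, chunks: list[str], vectors: np.ndarray) -> str:
    # Prompt construction + generation, with a nudge toward attribution.
    context = "\n\n".join(retrieve(query, chunks, vectors))
    prompt = ("Answer using only the context below, and cite the passage you used.\n\n"
              f"Context:\n{context}\n\nQuestion: {query}")
    return llm(prompt)
```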

1. Working with Large Language Models (LLMs)

This section covers the practical fundamentals of working with large language models in real applications. It focuses on how LLMs generate text, how their outputs can be shaped through inference-time controls, and what constraints appear when you work with long prompts, memory, and context. These questions are meant to test whether someone understands the operational basics of using LLMs.

  1. How do large language models work at a high level?
  2. What parameters can you use to control LLM output, and how do they affect behavior?
  3. How do LLMs handle context, and what practical limits does the context window introduce?
  4. How do you manage memory and context effectively in LLM applications?
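
The question about output controls (question 2 above) is easiest to answer with a concrete request in front of you. Here is a sketch using the OpenAI Python client; the model name is an assumption, and parameter names vary across providers, but the underlying knobs are broadly the same.

```python
# Sketch of inference-time controls, shown with the OpenAI Python client.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name; substitute your own
    messages=[{"role": "user", "content": "Summarize RAG in two sentences."}],
    temperature=0.2,        # lower = more deterministic, higher = more varied
    top_p=0.9,              # nucleus sampling: keep only the top 90% probability mass
    max_tokens=150,         # hard cap on output length (and therefore cost)
    frequency_penalty=0.5,  # discourage verbatim repetition
    stop=["\n\n"],          # stop sequences end generation early
)
print(response.choices[0].message.content)
```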

2. Retrieval-Augmented Generation (RAG)

This section focuses on systems that connect LLMs to external knowledge sources so their answers can be grounded in real documents, databases, and other data. It covers the full RAG pipeline, including retrieval strategy, document processing, attribution, scaling, and debugging. These questions assess whether someone can reason about RAG as a system, not just define it at a high level.

  1. What is retrieval-augmented generation (RAG), and how does the full pipeline work?
  2. What retrieval strategies can you use in RAG systems, and when would you use each?
  3. How would you design a pipeline to process and retrieve information from very large PDF reports?
  4. How would you prevent hallucinations when the retrieved context does not contain the answer?
  5. What are the most common failure points in RAG systems, and how do you debug them?
  6. How do you implement citations and source attribution in a RAG system?
  7. What is semantic caching, and when is it useful?
  8. How would you scale a RAG system to tens of millions of documents?
  9. What are the main design trade-offs in a RAG system?
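
As one example from this list, semantic caching (question 7) can be sketched in a few lines: embed each incoming query and reuse a stored answer when a new query lands close enough in embedding space. The `embed()` function and the 0.95 threshold below are assumptions for illustration.

```python
# Minimal semantic cache sketch: reuse a previous answer when a new query
# is close enough in embedding space. embed() is a hypothetical stand-in.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: call your embedding model here."""
    raise NotImplementedError

class SemanticCache:
    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold  # cosine similarity required for a hit
        self.entries: list[tuple[np.ndarray, str]] = []

    def get(self, query: str) -> str | None:
        q = embed(query)
        q = q / np.linalg.norm(q)
        for vec, answer in self.entries:
            if float(vec @ q) >= self.threshold:
                return answer  # close enough: skip the LLM call entirely
        return None

    def put(self, query: str, answer: str) -> None:
        q = embed(query)
        self.entries.append((q / np.linalg.norm(q), answer))
```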

3. Agents and Tool-Using Systems

This section covers a more advanced class of LLM applications: systems that do more than generate text and can instead choose tools, take actions, and operate across multiple steps. It includes questions about what makes a system agentic, how tools are selected and executed, how to prevent failure modes such as loops or unsafe behavior, and how to design agent workflows for realistic use cases. These questions are useful for understanding whether a candidate can move from simple prompting to building action-oriented systems.

  1. What makes an AI system agentic?
  2. What components does an agent need beyond the language model itself?
  3. How should an agent decide when and how to use tools?
  4. When is an agent the wrong solution?
  5. How would you explain an agentic system to non-technical stakeholders?
  6. How do you control agent execution, including loop detection, termination, retries, and idempotency?
  7. How do you sandbox tool execution safely in agent systems?
  8. What are the biggest security risks in tool-using agents?
  9. How would you design an agent that analyzes customer support tickets, drafts replies, and escalates complex cases?
  10. How would you design an agent that reviews code and suggests improvements?
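
Several of these questions, especially question 6, come down to one control structure: a bounded loop around the model. Below is a minimal sketch, with a hypothetical `llm_decide()` helper standing in for the model call that picks the next action.

```python
# Minimal agent loop sketch: tool selection, a hard step budget, and explicit
# termination (question 6). llm_decide() is a hypothetical helper.

def llm_decide(goal: str, history: list[str]) -> dict:
    """Placeholder: return {'action': tool_name, 'input': ...}
    or {'action': 'finish', 'answer': ...} from a model call."""
    raise NotImplementedError

TOOLS = {
    # Stub tool for illustration; real tools would be sandboxed.
    "search_docs": lambda query: f"top passages for {query!r}",
}

def run_agent(goal: str, max_steps: int = 5) -> str:
    history: list[str] = []
    for _ in range(max_steps):  # hard budget prevents infinite loops
        decision = llm_decide(goal, history)
        if decision["action"] == "finish":
            return decision["answer"]  # explicit termination condition
        tool = TOOLS.get(decision["action"])
        if tool is None:
            # Feed the error back so the model can retry with a valid tool.
            history.append(f"error: unknown tool {decision['action']!r}")
            continue
        observation = tool(decision["input"])
        history.append(f"{decision['action']}({decision['input']!r}) -> {observation}")
    return "Stopped: step budget exhausted before the agent finished."
```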

4. Testing and Evaluation

This section focuses on how to measure whether an LLM system is actually working well. Because model outputs are probabilistic and task-dependent, evaluation is more complex than in traditional software systems. The questions in this section cover consistency, accuracy, hallucination detection, benchmark design, golden datasets, and end-to-end evaluation for chatbots, RAG pipelines, and agents.

  1. How do you make LLM outputs more consistent and accurate?
  2. How do you evaluate conversational AI systems such as chatbots?
  3. What metrics matter when evaluating LLM systems?
  4. How do you build a high-quality evaluation or golden dataset?
  5. What causes hallucinations in LLM systems, and how do you detect and mitigate them?
  6. How would you reduce factual errors in a summarization system?
  7. How do you debug a RAG chatbot that gives confident but incorrect answers?
  8. How do you evaluate a RAG pipeline end to end?
  9. How do you evaluate agent performance, including tool selection quality, action progress, and context adherence?
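
A minimal harness for the golden-dataset questions above might look like the sketch below. The dataset entry, `system()`, and `judge()` are hypothetical stand-ins; in practice the judge might be exact match, string containment, or an LLM-as-judge call.

```python
# Minimal evaluation harness sketch: run a system over a golden dataset
# and report a simple pass rate. system() and judge() are placeholders.

GOLDEN_SET = [
    {"question": "What is the refund window?", "expected": "30 days"},
    # ... curated question/answer pairs reviewed by domain experts
]

def system(question: str) -> str:
    """Placeholder: call your chatbot / RAG pipeline here."""
    raise NotImplementedError

def judge(expected: str, actual: str) -> bool:
    """Placeholder judge: containment here; could be exact match or LLM-as-judge."""
    return expected.lower() in actual.lower()

def evaluate() -> float:
    passed = 0
    for case in GOLDEN_SET:
        actual = system(case["question"])
        if judge(case["expected"], actual):
            passed += 1
        else:
            print(f"FAIL: {case['question']!r} -> {actual!r}")  # keep failures for review
    accuracy = passed / len(GOLDEN_SET)
    print(f"accuracy: {accuracy:.1%} on {len(GOLDEN_SET)} cases")
    return accuracy
```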

5. Monitoring and Production Observability

This section looks at what happens after deployment. Once an LLM system is live, the work shifts from building to observing, measuring, and maintaining quality over time. These questions cover operational and business metrics, online monitoring, rollout strategy, hallucination tracking, and production visibility into agent behavior. They are intended to assess whether someone understands how AI systems behave in real environments, where performance can drift and failures are often subtle.

  1. What operational and business metrics matter for AI systems in production?
  2. How do you evaluate and monitor a model in production, not just offline?
  3. How would you test a new model before rolling it out fully?
  4. How do you estimate and monitor hallucination rate in production?
  5. How do you monitor and observe autonomous agent behavior in production?
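
Much of production observability reduces to logging a consistent record per request. The sketch below shows one possible shape; the field names and destination are assumptions rather than a standard schema.

```python
# Sketch of per-request production logging: a consistent record that
# downstream dashboards and alerts can aggregate.
import json
import time
import uuid

def log_request(model: str, prompt_tokens: int, completion_tokens: int,
                latency_ms: float, error: str | None = None) -> None:
    record = {
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model": model,
        "prompt_tokens": prompt_tokens,      # feeds cost dashboards
        "completion_tokens": completion_tokens,
        "latency_ms": latency_ms,            # feeds latency percentiles
        "error": error,                      # non-null errors feed alerting
    }
    print(json.dumps(record))  # stand-in for a real metrics pipeline

# Usage: time the model call, then log it.
start = time.perf_counter()
# ... make the model call here ...
log_request("some-model", prompt_tokens=1200, completion_tokens=250,
            latency_ms=(time.perf_counter() - start) * 1000)
```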

6. Cost and Latency Optimization

This section covers the engineering trade-offs involved in making LLM systems fast enough and affordable enough to use in production. It includes questions about latency bottlenecks, token costs, model routing, benchmarking, and cost-quality trade-offs at scale. The goal is to understand whether someone can reason not only about model quality, but also about system efficiency, budget constraints, and user experience under real traffic.

  1. How do you reduce latency in GenAI applications?
  2. What is time to first token, and why does it matter for user experience?
  3. How would you benchmark a multi-step LLM pipeline to identify latency bottlenecks?
  4. What are the main levers for reducing token usage and overall LLM cost?
  5. How do you think about cost-versus-quality trade-offs, and when is a smaller model good enough?
  6. What is model tiering, and when should you route requests to a smaller model versus a larger one?
  7. How would you optimize cost for an application serving one million queries per day?
  8. How would you estimate the budget for an enterprise-scale RAG pipeline, such as one built on 300,000 legal contracts?
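
For budget questions like 7 and 8, interviewers usually want back-of-the-envelope arithmetic rather than exact prices. A sketch, with all rates and token counts as stated assumptions:

```python
# Back-of-the-envelope cost sketch for 1M queries/day (question 7).
# Prices and token counts are assumptions; plug in your provider's rates.

QUERIES_PER_DAY = 1_000_000
INPUT_TOKENS = 1_500           # prompt + retrieved context per query (assumed)
OUTPUT_TOKENS = 300            # generated answer per query (assumed)
PRICE_IN = 0.15 / 1_000_000    # $ per input token (assumed rate)
PRICE_OUT = 0.60 / 1_000_000   # $ per output token (assumed rate)

daily = QUERIES_PER_DAY * (INPUT_TOKENS * PRICE_IN + OUTPUT_TOKENS * PRICE_OUT)
print(f"~${daily:,.0f}/day, ~${daily * 30:,.0f}/month")
# Levers: shorter prompts, semantic caching, routing easy queries to a
# smaller model tier, and capping max output tokens.
```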

7. Safety, Security, and Guardrails

This section focuses on the safeguards needed to make LLM systems safe to deploy. It covers technical and product-level risks such as prompt injection, jailbreaks, unsafe code execution, harmful content, privacy issues, and exposure of sensitive data in prompts or logs. These questions are meant to evaluate whether someone can think beyond functionality and account for how AI systems can be misused, exploited, or cause harm if they are not designed with proper controls.

  1. When should you implement LLM guardrails, and what forms can they take?
  2. How do you handle data privacy and personally identifiable information in prompts, logs, and outputs?
  3. How do you defend against prompt injection and jailbreak attempts?
  4. How would you build a system that detects policy-violating or offensive content?
  5. How would you prevent unsafe code generation and execution in an application that runs model-generated code?
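
As a small example of one guardrail layer (question 2), here is a regex-based PII redaction sketch that could run before prompts are sent to a model or written to logs. Pattern matching like this is a baseline only; production systems typically layer dedicated PII detection on top.

```python
# Minimal PII-redaction sketch: scrub obvious identifiers before a prompt
# is sent or logged. Regexes are illustrative, not exhaustive.
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\b(?:\+?\d{1,2}[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Contact Jane at jane.doe@example.com or 555-123-4567."))
# -> "Contact Jane at [EMAIL] or [PHONE]."
```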

Common Mistakes

  1. Being able to describe how to build a system, but not how to evaluate, monitor, or improve it after deployment.

  2. Knowing what a concept is, but not being able to explain the trade-offs behind using it.

    Interviewers usually care less about whether you can define RAG or agents and more about whether you can explain when they are the right choice, when they are not, and what problems they introduce.

  3. Ignoring cost, latency, and failure modes.

    Many candidates answer as if they are describing a prototype or demo. Interviewers are usually looking for production thinking: how the system behaves under real constraints, how it fails, what it costs to run, and how you would make it more reliable over time.