Abstract
ReaLLM is a research prototype that explores interface‑level transparency as a soft safety mechanism for conversational AI. While most interpretability research focuses on explaining what models are doing internally, ReaLLM exposes the social layer of large language models by revealing the hidden system prompts and constraints that shape every response. The project asks: can making these hidden instructions visible reduce over‑reliance on AI and improve users’ mental models of how LLMs work? Developed for the AI Safety, Ethics, and Society course, ReaLLM demonstrates how selective disclosure of system prompts can recalibrate user trust without modifying model internals.
Key Features
- Live Transparency Panel. A side panel reveals the persona, active constraints, tone, and uncertainty signals guiding each response, so users can see what’s influencing the conversation in real time.
- System Prompt Viewer. One click shows the complete hidden instruction set that defines the AI’s behavior — displayed like source code to expose the “rules of the game.”
- Dual‑Model Architecture. A primary language model answers user questions while an interpreter model analyzes and explains the underlying constraints shaping that answer.
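The dual-model loop above can be sketched in a few lines. This is a minimal illustration, not ReaLLM's actual implementation: the system prompt, the `Panel` fields, and the string-parsing "interpreter" are all hypothetical stand-ins for real LLM calls, chosen only to show how each user turn yields both an answer and a transparency-panel payload.

```python
from dataclasses import dataclass

# Hypothetical hidden system prompt; its format and parsing are illustrative.
SYSTEM_PROMPT = (
    "Persona: friendly tutor. "
    "Tone: concise. "
    "Constraints: flag uncertainty; refuse medical advice."
)

@dataclass
class Panel:
    """What the transparency panel would display alongside each response."""
    persona: str
    tone: str
    constraints: list

def primary_model(system_prompt: str, user_msg: str) -> str:
    # Stand-in for the primary LLM call that answers the user.
    return f"[response to {user_msg!r}, shaped by the hidden prompt]"

def interpreter_model(system_prompt: str) -> Panel:
    # Stand-in for the interpreter model: extracts persona, tone, and
    # constraints from the hidden prompt so the UI can surface them.
    fields = {}
    for part in system_prompt.split(". "):
        key, _, value = part.partition(": ")
        if value:
            fields[key.lower()] = value.rstrip(".")
    return Panel(
        persona=fields.get("persona", ""),
        tone=fields.get("tone", ""),
        constraints=[c.strip() for c in fields.get("constraints", "").split(";")],
    )

def respond(user_msg: str) -> tuple:
    """One conversational turn: the answer plus the transparency payload."""
    return primary_model(SYSTEM_PROMPT, user_msg), interpreter_model(SYSTEM_PROMPT)
```

In a real deployment both functions would call separate model endpoints; the key design point is that the interpreter reads the same hidden instructions the primary model receives, so the panel reflects the constraints actually in force.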