ELSA, Speak

(2025)

Humanising AI: The making of ELSA avatars

Learning a new language is scary. Speaking it out loud? Even scarier. That’s where our avatars come in.

This case study shows how we turned AI into a human-feeling coach that learners trust. I led the design and cross-functional collaboration behind it: framing the problem, aligning stakeholders, setting the art direction, and building a design system that scales across 2D and 3D tutors. Along the way I hired Rive and Unity designers, and partnered closely with speech/ML, engineering, and external Unity and Rive creators to ship avatars that feel alive yet run smoothly on low-end devices.

Outcome: Learners spoke sooner (time-to-first-speak down 12%), stayed longer (session length up ~8%), and more converted from install to trial (+6%). Not bad for a few blinking faces.

(001)

The team

Project Lead

Me

UX Designers

Sailee Borkar

Manager

Kautuk Trivedi

Art Direction,
Web design,
Interaction design

My role

I led the design and cross-functional collaboration behind the avatars: framing the problem, aligning stakeholders, setting the art direction, and building a design system that scales across 2D and 3D tutors. Along the way I hired Rive and Unity designers, and partnered closely with speech/ML, engineering, and external Unity and Rive creators to ship avatars that feel alive yet run smoothly on low-end devices.

TL;DR

Learners loved ELSA’s personalised lessons, but the experience still felt… solo. We built coach avatars to add warmth, guidance, and timely feedback.

We treated the avatar as core product, not a sticker: it shows up in onboarding, practice, feedback, and celebrations.

  1. Design goals: warm, competent, bias‑aware, low‑latency, and tiny on device.

  2. Under the hood: a viseme system for lip‑sync, plus state machines for idle, prompt, feedback, and celebration animations.

  3. Shipped as a phased rollout: a small, fair set of male/female characters → A/B learnings → expansion.

The problem we kept hearing

Our learners are serious about English. They’re practising for promotions, visas, job interviews, or immigration. They’re motivated. But they also told us something we couldn’t ignore:

“The lessons are great, but it still feels like I’m practising alone.”

Some even admitted they used Duolingo for fun, or ChatGPT voice mode when commuting.
Translation: we were missing the warmth of someone on the other side.

So we asked ourselves: What if the AI itself became a companion?

(003)

Mascot: fun but too cute

We started with what we had – our planet mascot. With some motion magic in Rive, it blinked, nodded, even looked bored if you took too long.
Users loved it. But then came the complaints:

“I’m 35, preparing for my IELTS. Do I really want to talk to a cartoon planet?”

Fair point. The mascot had charm, but not enough gravitas.

Humans enter the chat

Next up: human avatars. Six to start - three male, three female - each with distinct personalities and accents. We used a viseme system (a fancy word for mouth shapes) to sync lip movement with speech. Suddenly, the tutor wasn’t just talking at you - it was talking with you.
The response? Electric. Learners picked favourites, debated accents, and - fun fact - female avatars were chosen far more often. Culture mattered too: Japan still loved the planet, while Indonesia leaned human.
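To make the viseme idea concrete, here is a minimal sketch of how a phoneme stream might collapse into a reduced set of mouth shapes. The phoneme labels and viseme names are illustrative assumptions, not ELSA's actual asset names:

```python
# Hypothetical phoneme-to-viseme table. Many phonemes share one mouth
# shape (all bilabials look the same on the lips), which is how the
# set stays small while still reading as believable speech.
PHONEME_TO_VISEME = {
    "p": "MBP", "b": "MBP", "m": "MBP",   # closed-lips shape
    "f": "FV", "v": "FV",                 # teeth-on-lip shape
    "aa": "AH", "ae": "AH",               # open vowels
    "ow": "OH", "uw": "OH",               # rounded vowels
}

def to_viseme_track(phonemes, default="REST"):
    """Collapse a phoneme sequence into the viseme frames an animation
    runtime would play, merging consecutive repeats of the same shape."""
    track = []
    for p in phonemes:
        v = PHONEME_TO_VISEME.get(p, default)
        if not track or track[-1] != v:
            track.append(v)
    return track
```

In practice the track would also carry timing from the speech engine, so each shape holds for the duration of its phonemes.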

3D makes it real

Still, learners said: “Feels cartoonish. Can I talk to an actual human?”
Enter 3D. With Unity specialists, we built lightweight characters that ran smoothly on low-end phones. They blinked, nodded, and celebrated your wins in real time.
The result? A clear lift in conversions. More people spoke sooner. More stayed engaged. And suddenly, our “study buddy” felt less like an app and more like a real coach.

Under the hood

Designing avatars was as much about constraints as creativity. A few highlights:

  • Choosing the face. We had to strike a balance: too cartoonish felt childish, too realistic risked uncanny valley. We used cross‑regional facial research and averaged proportions to arrive at a “world‑friendly” design - approachable and trustworthy in multiple cultures.

  • Bias and inclusion. Every face was audited for stereotypes. We launched with balanced gender representation and diverse skin tones, then used A/B tests and regional feedback to refine.

  • Viseme system. Our mouth‑shape set was carefully reduced to the smallest number that still looked believable in motion. Each viseme was redrawn to match the proportions of our style - wide enough for clarity, subtle enough not to look exaggerated.

  • State machines. Each avatar runs on a real‑time state machine: idle (blinks, micro‑gaze shifts), prompting, listening, speaking (lip‑sync driven), feedback (gentle corrections), celebration (short loops). Multiple states can blend - speaking while nodding, listening while blinking.

  • Performance budgets. Avatars had to feel alive without slowing devices. Rive assets were kept vector‑light; Unity models were custom‑optimised to stay under strict memory and package size caps. Average footprint per avatar stayed under 5 MB, even with full viseme sets.

  • Low latency. The system reacts in real time: if a learner interrupts the tutor, the avatar stops speaking; if the learner repeats a phrase, the avatar switches seamlessly to listening.
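The state-machine and low-latency points above can be sketched as a small transition table. The event names ("mic_open", "user_interrupt") are assumptions for illustration; the shipped version lives in Rive state machines and Unity animators, with blinks and nods blended as layers rather than modelled as transitions:

```python
# Minimal avatar state machine sketch. States mirror the ones named in
# the case study: idle, prompting, listening, speaking, feedback,
# celebration.
TRANSITIONS = {
    ("idle", "prompt"): "prompting",
    ("prompting", "mic_open"): "listening",
    ("listening", "speech_done"): "speaking",     # tutor responds
    ("speaking", "user_interrupt"): "listening",  # stop talking at once
    ("speaking", "done"): "feedback",
    ("feedback", "success"): "celebration",
    ("celebration", "done"): "idle",
}

class AvatarStateMachine:
    def __init__(self):
        self.state = "idle"

    def handle(self, event):
        # Unknown events leave the state unchanged; micro-motion like
        # blinking is layered on top, not a state of its own.
        self.state = TRANSITIONS.get((self.state, event), self.state)
        return self.state
```

The ("speaking", "user_interrupt") row is the low-latency behaviour in miniature: the moment the learner cuts in, the avatar drops back to listening.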

Where the avatar shows up (by design)

  • Onboarding: introduces ELSA as your coach and sets expectations for speaking.

  • Lesson prompts: gives concise instructions before you speak.

  • Pronunciation practice: lip‑syncs with the speech model; shifts to a listening state as the mic opens.

  • Feedback moments: shows how to fix a sound; celebrates clear improvements.

  • Micro‑celebrations: short, delightful loops without confetti overload.

Shipping strategy

  • Start small. Ship a minimal but fair roster; instrument everything.

  • Learn fast. Run A/Bs on avatar presence, placement, and intensity (e.g., when to celebrate, how often to nudge).

  • Expand. Add new characters and refine the viseme set based on real learner outcomes.

(004)

What we measure

  • Task completion & time‑to‑first‑speak: does the avatar lower the barrier to speaking?

  • Error recovery: do learners fix mispronunciations faster when shown mouth cues?

  • Session length & return rate: does presence improve engagement without adding friction?

  • Performance: memory footprint, load time, and frame stability on low‑end devices.

Note: as part of the broader redesign work at ELSA, onboarding completion has risen from ~64% to ~87%, and overall engagement has improved significantly. Avatars are designed to continue that momentum by reducing speaking anxiety and clarifying feedback.
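As a concrete illustration of the first metric, here is how "time-to-first-speak" might be derived from session event logs. The event names and fields are assumptions, not ELSA's actual analytics schema:

```python
from datetime import datetime

def time_to_first_speak(events):
    """Seconds between session start and the learner's first speech
    attempt, or None if the learner never spoke. Events are dicts with
    an ISO-8601 "ts" timestamp and a "type" (hypothetical schema)."""
    start = first_speak = None
    for e in events:
        t = datetime.fromisoformat(e["ts"])
        if e["type"] == "session_start" and start is None:
            start = t
        elif e["type"] == "speech_attempt" and start is not None:
            first_speak = t
            break  # only the first attempt matters
    if start and first_speak:
        return (first_speak - start).total_seconds()
    return None
```

Aggregating this per cohort is what lets an A/B test say whether the avatar lowers the barrier to speaking.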

Next steps

  • More characters based on preference data.

  • Contextual coaching: adapt the avatar’s prompts to the learner’s confidence and goals.

  • Richer listening states: subtle micro‑expressions during silence to keep learners engaged.

  • Exploration of 3D/2.5D only if it remains performant on low‑end devices.

Pooja Sinha
Product Designer
