Lesson 3: Speech Recognition and Synthesis (1 hour)
Learning Objectives
- Understand how speech recognition works
- Understand how speech synthesis (text-to-speech) works
- Recognize applications of speech technologies
- Use speech recognition and synthesis tools hands-on
Materials Needed
- Internet-connected devices
- Microphone access
- Speech recognition and synthesis demos
- Student notebooks
- Examples of speech technology applications
Time Breakdown
- Review NLP (5 min)
- Introduction to speech technologies (15 min)
- How speech recognition works (15 min)
- How speech synthesis works (10 min)
- Hands-on: Speech tools (12 min)
- Wrap-up (3 min)
Activities
1. Review NLP (5 min)
- What is NLP?
- How does AI understand text?
- Bridge: "Today we'll see how AI understands and generates speech"
2. Introduction to Speech Technologies (15 min)
Two Main Areas:
- Speech Recognition (ASR): Speech → Text
- Converting spoken words into written text
- Also called: Automatic Speech Recognition, Speech-to-Text
- Speech Synthesis (TTS): Text → Speech
- Converting written text into spoken words
- Also called: Text-to-Speech
Real-World Applications:
Speech Recognition:
- Voice Assistants: Siri, Alexa, Google Assistant
- Transcription: Meeting notes, interviews, captions
- Dictation: Speaking instead of typing
- Voice Commands: "Hey Siri", "OK Google"
- Accessibility: Voice control for people with disabilities
- Customer Service: Phone systems understanding spoken requests
Speech Synthesis:
- Voice Assistants: Responding with speech
- Audiobooks: Text-to-speech for books
- Accessibility: Screen readers for people with visual impairments
- Navigation: GPS giving directions
- Entertainment: Virtual characters, games
- Announcements: Public address systems
Why Speech Technologies Matter:
- More natural human-computer interaction
- Hands-free operation
- Accessibility for people with disabilities
- Faster input: most people speak faster than they type
- Enables new applications
3. How Speech Recognition Works (15 min)
The Challenge:
- Speech is complex: accents, speed, noise, context
- The same sounds can form different words ("two" vs. "too" vs. "to")
- Continuous speech: no pauses between words, so word boundaries must be inferred
The Process:
- Audio Input: Capture sound waves from microphone
- Preprocessing: Remove noise, normalize volume
- Feature Extraction: Convert sound to features (spectrograms)
- Acoustic Model: Recognizes sounds (phonemes)
- Language Model: Predicts likely words/phrases
- Decoding: Combines acoustic and language models
- Output: Text transcription
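The decoding step above can be sketched in a few lines of Python. All of the scores below are invented for illustration; a real ASR system scores phoneme sequences with a neural acoustic model, not a hand-written table.

```python
# Toy decoder: combine acoustic-model scores (how well each candidate
# word matches the audio) with language-model scores (how likely each
# word is in context, e.g. after "I ate") to pick the best word.
acoustic_scores = {"too": 0.48, "two": 0.47, "to": 0.45}
language_scores = {"too": 0.30, "two": 0.05, "to": 0.10}

def decode(acoustic, language):
    """Pick the word with the best combined score."""
    combined = {w: acoustic[w] * language.get(w, 0.01) for w in acoustic}
    return max(combined, key=combined.get)

print(decode(acoustic_scores, language_scores))  # context resolves the homophones
```

The acoustic model alone can barely tell the three candidates apart; multiplying in the language model is what breaks the tie.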
Key Concepts:
Phonemes:
- Smallest units of sound in language
- Example: "cat" = /k/ /æ/ /t/
- Different from letters (spelling)
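The letters-vs-phonemes distinction can be shown with a small lookup table. The phoneme symbols below are simplified ARPAbet-style labels chosen for this example, matching the lesson's /k/ /æ/ /t/ breakdown of "cat":

```python
# Phonemes are sounds, not letters: the number of letters in a word
# often differs from the number of phonemes.
phonemes = {
    "cat":  ["K", "AE", "T"],       # 3 letters, 3 phonemes
    "ship": ["SH", "IH", "P"],      # 4 letters, 3 phonemes ("sh" is one sound)
    "box":  ["B", "AA", "K", "S"],  # 3 letters, 4 phonemes ("x" is two sounds)
}

for word, sounds in phonemes.items():
    print(f"{word}: {len(word)} letters, {len(sounds)} phonemes")
```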
Acoustic Model:
- Recognizes sounds in speech
- Trained on many voice samples
- Handles: Different speakers, accents, speeds
Language Model:
- Predicts likely words/phrases
- Uses context: "I went to the..." → likely "store" not "stork"
- Helps resolve ambiguity
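The "store vs. stork" prediction can be demonstrated with a tiny bigram model. The corpus here is a made-up handful of sentences; real language models are trained on billions of words, but the idea is the same:

```python
from collections import Counter

# Count which word follows which in a toy corpus.
corpus = ("i went to the store . i went to the park . "
          "i went to the store . the stork flew away .").split()
bigrams = Counter(zip(corpus, corpus[1:]))

def predict_next(word):
    """Return the most frequent word that follows `word` in the corpus."""
    candidates = {nxt: n for (prev, nxt), n in bigrams.items() if prev == word}
    return max(candidates, key=candidates.get)

print(predict_next("the"))  # "store" beats "stork" because it is more common here
```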
Neural Networks:
- Modern ASR uses deep learning
- Can learn complex patterns
- Better accuracy than older methods
Challenges:
- Background noise
- Different accents and dialects
- Speaking speed
- Homophones (words that sound the same)
- Context-dependent pronunciation
4. How Speech Synthesis Works (10 min)
The Process:
- Text Input: Written words
- Text Processing: Normalize text (numbers, abbreviations)
- Phonetic Analysis: Convert to phonemes
- Prosody: Add stress, intonation, rhythm
- Waveform Generation: Create sound waves
- Output: Speech audio
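The text-processing step above can be sketched as a toy normalizer. The two abbreviations and the digit-by-digit rule are deliberately minimal; real TTS systems handle dates, currency, acronyms, and much more:

```python
import re

# Expand abbreviations and digits into speakable words, the first
# stage of a TTS pipeline.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

def normalize(text):
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # Speak numbers digit by digit (a real system reads "42" as "forty-two")
    text = re.sub(r"\d", lambda m: DIGITS[int(m.group())] + " ", text)
    return " ".join(text.split())

print(normalize("Dr. Lee lives at 42 Main St."))
```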
Key Concepts:
Concatenative TTS:
- Uses pre-recorded speech segments
- Combines segments to form words
- Older method, can sound robotic
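Concatenative synthesis can be mimicked by joining stand-in "clips". Here the clips are just placeholder strings; a real system stores recorded audio segments and smooths the joins, which is exactly where the robotic sound comes from:

```python
# Toy concatenative TTS: look up a pre-recorded clip per phoneme
# and glue the clips together in order.
clips = {"K": "[k-sound]", "AE": "[ae-sound]", "T": "[t-sound]"}

def synthesize(phoneme_list):
    return "".join(clips[p] for p in phoneme_list)

print(synthesize(["K", "AE", "T"]))  # "cat" assembled from three recorded pieces
```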
Neural TTS:
- Uses neural networks to generate speech
- More natural-sounding
- Can learn different voices, emotions
Prosody:
- Stress, intonation, rhythm of speech
- Makes speech natural and expressive
- Example: Question vs. statement intonation
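The question-vs-statement contrast can be sketched as a pitch contour. The pitch values below are invented to show the shape; real TTS predicts a smooth fundamental-frequency (F0) curve for every phoneme:

```python
# Toy prosody: rising pitch for a question, falling pitch for a statement.
def pitch_contour(sentence, n_points=5):
    """Return a list of pitch values (Hz) across the sentence."""
    base = 120
    step = 15 if sentence.strip().endswith("?") else -10
    return [base + step * i for i in range(n_points)]

print(pitch_contour("You finished the homework?"))  # rises toward the end
print(pitch_contour("You finished the homework."))  # falls toward the end
```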
Challenges:
- Natural-sounding prosody
- Different voices and emotions
- Handling unusual words, names
- Speed and clarity
5. Hands-On: Speech Tools (12 min)
Activity 1: Speech Recognition (6 min)
- Use speech-to-text tool (Google Docs voice typing, or online tool)
- Try speaking:
- Clear, simple sentences
- Longer paragraphs
- With background noise
- Different accents (if available)
- Numbers, names, technical terms
- Observe: Accuracy, what works, what doesn't
Activity 2: Speech Synthesis (6 min)
- Use text-to-speech tool
- Try different texts:
- Simple sentences
- Questions vs. statements
- Numbers and dates
- Different languages (if available)
- Different voices
- Observe: Naturalness, accuracy, limitations
Reflection Questions:
- What worked well? What didn't?
- How accurate was speech recognition?
- How natural did speech synthesis sound?
- What are the limitations?
6. Wrap-Up (3 min)
- Speech recognition: Speech → Text
- Speech synthesis: Text → Speech
- Both rely on neural networks for higher accuracy and more natural output
- Many applications in daily life
- Preview: Next lesson - Robotics and autonomous systems
Differentiation Strategies
- Younger students: Focus on fun demos, hands-on exploration, simpler explanations
- Older students: Explore how neural networks are used, research specific techniques, analyze limitations
- Struggling learners: Use guided exploration, simpler tools, more support
- Advanced learners: Research specific ASR/TTS models, explore voice cloning, analyze ethical concerns
Assessment
- Participation in hands-on activities
- Quality of observations
- Understanding of speech technology concepts
- Reflection journal entry