Lesson 3: Speech Recognition and Synthesis (1 hour)
Learning Objectives
- Understand how speech recognition works
- Understand how speech synthesis (text-to-speech) works
- Recognize applications of speech technologies
- Use speech recognition and synthesis tools hands-on
Materials Needed
- Internet-connected devices
- Microphone access
- Speech recognition and synthesis demos
- Student notebooks
- Examples of speech technology applications
Time Breakdown
- Review NLP (5 min)
- Introduction to speech technologies (15 min)
- How speech recognition works (15 min)
- How speech synthesis works (10 min)
- Hands-on: Speech tools (12 min)
- Wrap-up (3 min)
Activities
1. Review NLP (5 min)
- What is NLP?
- How does AI understand text?
- Bridge: "Today we'll see how AI understands and generates speech"
2. Introduction to Speech Technologies (15 min)
Two Main Areas:
- Speech Recognition (ASR): Speech → Text
- Converting spoken words into written text
- Also called: Automatic Speech Recognition, Speech-to-Text
- Speech Synthesis (TTS): Text → Speech
- Converting written text into spoken words
- Also called: Text-to-Speech
Real-World Applications:
Speech Recognition:
- Voice Assistants: Siri, Alexa, Google Assistant
- Transcription: Meeting notes, interviews, captions
- Dictation: Speaking instead of typing
- Voice Commands: "Hey Siri", "OK Google"
- Accessibility: Voice control for people with disabilities
- Customer Service: Phone systems understanding spoken requests
Speech Synthesis:
- Voice Assistants: Responding with speech
- Audiobooks: Text-to-speech for books
- Accessibility: Screen readers for people with visual impairments
- Navigation: GPS giving directions
- Entertainment: Virtual characters, games
- Announcements: Public address systems
Why Speech Technologies Matter:
- More natural human-computer interaction
- Hands-free operation
- Accessibility for people with disabilities
- Faster input: most people speak faster than they type
- Enables new applications
3. How Speech Recognition Works (15 min)
The Challenge:
- Speech is complex: accents, speed, noise, context
- The same sounds can form different words ("two" vs. "too" vs. "to")
- Continuous speech: no pauses between words, so word boundaries must be inferred
The Process:
- Audio Input: Capture sound waves from microphone
- Preprocessing: Remove noise, normalize volume
- Feature Extraction: Convert sound to features (spectrograms)
- Acoustic Model: Recognizes sounds (phonemes)
- Language Model: Predicts likely words/phrases
- Decoding: Combines acoustic and language models
- Output: Text transcription
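The decoding step above can be sketched in a few lines of Python. All of the scores below are invented for illustration; a real ASR system scores phoneme sequences with a neural acoustic model, not a hand-written table.

```python
# Toy decoder: combine acoustic-model scores (how well each candidate
# word matches the audio) with language-model scores (how likely each
# word is in context, e.g. after "I ate") to pick the best word.
acoustic_scores = {"too": 0.48, "two": 0.47, "to": 0.45}
language_scores = {"too": 0.30, "two": 0.05, "to": 0.10}

def decode(acoustic, language):
    """Pick the word with the best combined score."""
    combined = {w: acoustic[w] * language.get(w, 0.01) for w in acoustic}
    return max(combined, key=combined.get)

print(decode(acoustic_scores, language_scores))  # context resolves the homophones
```

The acoustic model alone can barely tell the three candidates apart; multiplying in the language model is what breaks the tie.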
Key Concepts:
Phonemes:
- Smallest units of sound in language
- Example: "cat" = /k/ /æ/ /t/
- Different from letters (spelling)
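The letters-vs-phonemes distinction can be shown with a small lookup table. The phoneme symbols below are simplified ARPAbet-style labels chosen for this example, matching the lesson's /k/ /æ/ /t/ breakdown of "cat":

```python
# Phonemes are sounds, not letters: the number of letters in a word
# often differs from the number of phonemes.
phonemes = {
    "cat":  ["K", "AE", "T"],       # 3 letters, 3 phonemes
    "ship": ["SH", "IH", "P"],      # 4 letters, 3 phonemes ("sh" is one sound)
    "box":  ["B", "AA", "K", "S"],  # 3 letters, 4 phonemes ("x" is two sounds)
}

for word, sounds in phonemes.items():
    print(f"{word}: {len(word)} letters, {len(sounds)} phonemes")
```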
Acoustic Model:
- Recognizes sounds in speech
- Trained on many voice samples
- Handles: Different speakers, accents, speeds
Language Model:
- Predicts likely words/phrases
- Uses context: "I went to the..." → likely "store" not "stork"
- Helps resolve ambiguity
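The "store vs. stork" prediction can be demonstrated with a tiny bigram model. The corpus here is a made-up handful of sentences; real language models are trained on billions of words, but the idea is the same:

```python
from collections import Counter

# Count which word follows which in a toy corpus.
corpus = ("i went to the store . i went to the park . "
          "i went to the store . the stork flew away .").split()
bigrams = Counter(zip(corpus, corpus[1:]))

def predict_next(word):
    """Return the most frequent word that follows `word` in the corpus."""
    candidates = {nxt: n for (prev, nxt), n in bigrams.items() if prev == word}
    return max(candidates, key=candidates.get)

print(predict_next("the"))  # "store" beats "stork" because it is more common here
```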
Neural Networks:
- Modern ASR uses deep learning
- Can learn complex patterns
- Better accuracy than older methods
Challenges:
- Background noise
- Different accents and dialects
- Speaking speed
- Homophones (words that sound the same)
- Context-dependent pronunciation
4. How Speech Synthesis Works (10 min)
The Process:
- Text Input: Written words
- Text Processing: Normalize text (numbers, abbreviations)
- Phonetic Analysis: Convert to phonemes
- Prosody: Add stress, intonation, rhythm
- Waveform Generation: Create sound waves
- Output: Speech audio
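The text-processing step above can be sketched as a toy normalizer. The two abbreviations and the digit-by-digit rule are deliberately minimal; real TTS systems handle dates, currency, acronyms, and much more:

```python
import re

# Expand abbreviations and digits into speakable words, the first
# stage of a TTS pipeline.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

def normalize(text):
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # Speak numbers digit by digit (a real system reads "42" as "forty-two")
    text = re.sub(r"\d", lambda m: DIGITS[int(m.group())] + " ", text)
    return " ".join(text.split())

print(normalize("Dr. Lee lives at 42 Main St."))
```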
Key Concepts:
Concatenative TTS:
- Uses pre-recorded speech segments
- Combines segments to form words
- Older method, can sound robotic
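Concatenative synthesis can be mimicked by joining stand-in "clips". Here the clips are just placeholder strings; a real system stores recorded audio segments and smooths the joins, which is exactly where the robotic sound comes from:

```python
# Toy concatenative TTS: look up a pre-recorded clip per phoneme
# and glue the clips together in order.
clips = {"K": "[k-sound]", "AE": "[ae-sound]", "T": "[t-sound]"}

def synthesize(phoneme_list):
    return "".join(clips[p] for p in phoneme_list)

print(synthesize(["K", "AE", "T"]))  # "cat" assembled from three recorded pieces
```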
Neural TTS:
- Uses neural networks to generate speech
- More natural-sounding
- Can learn different voices, emotions
Prosody:
- Stress, intonation, rhythm of speech
- Makes speech natural and expressive
- Example: Question vs. statement intonation
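The question-vs-statement contrast can be sketched as a pitch contour. The pitch values below are invented to show the shape; real TTS predicts a smooth fundamental-frequency (F0) curve for every phoneme:

```python
# Toy prosody: rising pitch for a question, falling pitch for a statement.
def pitch_contour(sentence, n_points=5):
    """Return a list of pitch values (Hz) across the sentence."""
    base = 120
    step = 15 if sentence.strip().endswith("?") else -10
    return [base + step * i for i in range(n_points)]

print(pitch_contour("You finished the homework?"))  # rises toward the end
print(pitch_contour("You finished the homework."))  # falls toward the end
```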
Challenges:
- Natural-sounding prosody
- Different voices and emotions
- Handling unusual words, names
- Speed and clarity
5. Hands-On: Speech Tools (12 min)
Activity 1: Speech Recognition (6 min)
- Use speech-to-text tool (Google Docs voice typing, or online tool)
- Try speaking:
- Clear, simple sentences
- Longer paragraphs
- With background noise
- Different accents (if available)
- Numbers, names, technical terms
- Observe: Accuracy, what works, what doesn't
Activity 2: Speech Synthesis (6 min)
- Use text-to-speech tool
- Try different texts:
- Simple sentences
- Questions vs. statements
- Numbers and dates
- Different languages (if available)
- Different voices
- Observe: Naturalness, accuracy, limitations
Reflection Questions:
- What worked well? What didn't?
- How accurate was speech recognition?
- How natural did speech synthesis sound?
- What are the limitations?
6. Wrap-Up (3 min)
- Speech recognition: Speech → Text
- Speech synthesis: Text → Speech
- Both rely on neural networks for higher accuracy and more natural output
- Many applications in daily life
- Preview: Next lesson - Robotics and autonomous systems
Differentiation Strategies
- Younger students: Focus on fun demos, hands-on exploration, simpler explanations
- Older students: Explore how neural networks are used, research specific techniques, analyze limitations
- Struggling learners: Use guided exploration, simpler tools, more support
- Advanced learners: Research specific ASR/TTS models, explore voice cloning, analyze ethical concerns
Assessment
- Participation in hands-on activities
- Quality of observations
- Understanding of speech technology concepts
- Reflection journal entry