Unit 4: AI Applications

Lesson 3: Speech Recognition and Synthesis (1 hour)

Learning Objectives

  • Understand how speech recognition works
  • Understand how speech synthesis (text-to-speech) works
  • Recognize applications of speech technologies
  • Use speech recognition and synthesis tools hands-on

Materials Needed

  • Internet-connected devices
  • Microphone access
  • Speech recognition and synthesis demos
  • Student notebooks
  • Examples of speech technology applications

Time Breakdown

  • Review NLP (5 min)
  • Introduction to speech technologies (15 min)
  • How speech recognition works (15 min)
  • How speech synthesis works (10 min)
  • Hands-on: Speech tools (12 min)
  • Wrap-up (3 min)

Activities

1. Review NLP (5 min)

  • What is NLP?
  • How does AI understand text?
  • Bridge: "Today we'll see how AI understands and generates speech"

2. Introduction to Speech Technologies (15 min)

Two Main Areas:

  1. Speech Recognition (ASR): Speech → Text

    • Converting spoken words into written text
    • Also called: Automatic Speech Recognition, Speech-to-Text
  2. Speech Synthesis (TTS): Text → Speech

    • Converting written text into spoken words
    • Also called: Text-to-Speech

Real-World Applications:

Speech Recognition:

  • Voice Assistants: Siri, Alexa, Google Assistant
  • Transcription: Meeting notes, interviews, captions
  • Dictation: Speaking instead of typing
  • Voice Commands: "Hey Siri", "OK Google"
  • Accessibility: Voice control for people with disabilities
  • Customer Service: Phone systems understanding spoken requests

Speech Synthesis:

  • Voice Assistants: Responding with speech
  • Audiobooks: Text-to-speech for books
  • Accessibility: Screen readers for people with visual impairments
  • Navigation: GPS giving directions
  • Entertainment: Virtual characters, games
  • Announcements: Public address systems

Why Speech Technologies Matter:

  • More natural human-computer interaction
  • Hands-free operation
  • Accessibility for people with disabilities
  • Faster input (speaking is often faster than typing)
  • Enables new applications

3. How Speech Recognition Works (15 min)

The Challenge:

  • Speech is complex: accents, speed, noise, context
  • Same sound can be different words ("two" vs. "too")
  • Continuous speech (often no clear pauses between words)

The Process:

  1. Audio Input: Capture sound waves from microphone
  2. Preprocessing: Remove noise, normalize volume
  3. Feature Extraction: Convert audio into numerical features (e.g., spectrograms)
  4. Acoustic Model: Maps audio features to sounds (phonemes)
  5. Language Model: Predicts likely words/phrases
  6. Decoding: Combines acoustic and language models
  7. Output: Text transcription
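
For older students, step 3 (feature extraction) can be made concrete with a short NumPy sketch. This is a minimal magnitude spectrogram; the 440 Hz tone is a stand-in for real microphone audio, and the frame/hop sizes are illustrative choices, not standards:

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Slice the signal into overlapping frames and take the magnitude
    of each frame's FFT -- the time-frequency features an acoustic
    model consumes."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    window = np.hanning(frame_len)  # taper frame edges to reduce leakage
    return np.array([np.abs(np.fft.rfft(f * window)) for f in frames])

# Toy input: one second of a 440 Hz tone at 8 kHz stands in for speech.
sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
spec = spectrogram(tone)
print(spec.shape)  # (number of frames, frequency bins)
```

Each row of `spec` describes which frequencies are present in one short slice of time; real ASR systems feed features like these into the acoustic model.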

Key Concepts:

Phonemes:

  • Smallest units of sound in language
  • Example: "cat" = /k/ /æ/ /t/
  • Different from letters (spelling)
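
The letters-vs-phonemes distinction can be shown with a tiny lookup table (a hypothetical three-word dictionary; real systems use large pronunciation lexicons such as CMUdict):

```python
# Toy pronunciation dictionary (hypothetical entries; real ASR systems
# map tens of thousands of words to phoneme sequences).
PHONEMES = {
    "cat":    ["k", "ae", "t"],
    "city":   ["s", "ih", "t", "iy"],  # 'c' is /s/ here, not /k/
    "knight": ["n", "ay", "t"],        # 6 letters, only 3 phonemes
}

for word, phones in PHONEMES.items():
    print(f"{word}: {len(word)} letters -> {len(phones)} phonemes")
```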

Acoustic Model:

  • Recognizes sounds in speech
  • Trained on many voice samples
  • Handles: Different speakers, accents, speeds

Language Model:

  • Predicts likely words/phrases
  • Uses context: "I went to the..." → likely "store" not "stork"
  • Helps resolve ambiguity
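
The "store vs. stork" example above can be sketched as a toy bigram language model (the three-sentence corpus is made up for illustration; real language models are trained on vast amounts of text):

```python
from collections import Counter

# Tiny made-up corpus for illustration.
corpus = ("i went to the store . she went to the store . "
          "he went to the park .").split()

bigrams = Counter(zip(corpus, corpus[1:]))  # counts of adjacent word pairs
context_counts = Counter(corpus)

def prob(word, prev):
    """P(word | prev) estimated from bigram counts."""
    return bigrams[(prev, word)] / context_counts[prev]

print(prob("store", "the"))  # high: "store" follows "the" in the corpus
print(prob("stork", "the"))  # zero: "stork" never follows "the"
```

When the acoustic model is unsure between two similar-sounding words, these probabilities tip the decision toward the word that fits the context.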

Neural Networks:

  • Modern ASR uses deep learning
  • Can learn complex patterns
  • Better accuracy than older methods

Challenges:

  • Background noise
  • Different accents and dialects
  • Speaking speed
  • Homophones (words that sound the same)
  • Context-dependent pronunciation

4. How Speech Synthesis Works (10 min)

The Process:

  1. Text Input: Written words
  2. Text Processing: Normalize text (numbers, abbreviations)
  3. Phonetic Analysis: Convert to phonemes
  4. Prosody: Add stress, intonation, rhythm
  5. Waveform Generation: Create sound waves
  6. Output: Speech audio
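
Step 2 (text processing) can be sketched in a few lines. The abbreviation table and digit-by-digit expansion here are simplified, hypothetical rules; production TTS front ends handle dates, currencies, and units with far larger rule sets:

```python
import re

# Hypothetical normalization rules for illustration only.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "etc.": "et cetera"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

def normalize(text):
    """Expand abbreviations and spell out digits so every token is a
    pronounceable word."""
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # Spell out each digit ("42" -> "four two").
    text = re.sub(r"\d", lambda m: DIGITS[int(m.group())] + " ", text)
    return " ".join(text.split())  # collapse doubled spaces

print(normalize("Dr. Smith lives at 42 Elm St."))
# -> Doctor Smith lives at four two Elm Street
```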

Key Concepts:

Concatenative TTS:

  • Uses pre-recorded speech segments
  • Combines segments to form words
  • Older method, can sound robotic
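
The concatenative idea can be demonstrated with a toy "unit inventory." Short synthesized tones stand in for the pre-recorded speech segments a real system would store, and the per-phoneme frequencies are arbitrary:

```python
import numpy as np

SR = 16000  # sample rate in Hz

def unit(freq, dur=0.1):
    """Stand-in for a pre-recorded speech segment: a short tone."""
    t = np.arange(int(SR * dur)) / SR
    return np.sin(2 * np.pi * freq * t)

# Toy unit inventory: one stored waveform per phoneme (hypothetical
# frequencies; real systems store actual recorded speech).
UNITS = {"k": unit(300), "ae": unit(500), "t": unit(250)}

def synthesize(phonemes):
    """Concatenative synthesis: look up each unit, join them end to end."""
    return np.concatenate([UNITS[p] for p in phonemes])

audio = synthesize(["k", "ae", "t"])   # "cat"
print(len(audio) / SR)                 # duration in seconds
```

The abrupt joins between units are exactly why concatenative TTS can sound robotic; real systems smooth the boundaries, and neural TTS avoids them entirely by generating the waveform directly.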

Neural TTS:

  • Uses neural networks to generate speech
  • More natural-sounding
  • Can learn different voices, emotions

Prosody:

  • Stress, intonation, rhythm of speech
  • Makes speech natural and expressive
  • Example: Question vs. statement intonation
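
The question-vs-statement example can be visualized as a pitch (F0) contour. The numbers below are illustrative, not measured values; the point is the shape, rising for a question and falling for a statement:

```python
import numpy as np

def pitch_contour(n_points=50, kind="statement"):
    """Toy fundamental-frequency (F0) contour in Hz. Questions typically
    end with rising pitch; statements stay flat or fall."""
    base = np.full(n_points, 120.0)          # flat 120 Hz baseline
    half = n_points // 2
    if kind == "question":
        base[half:] += np.linspace(0, 60, n_points - half)  # rise at the end
    else:
        base[half:] -= np.linspace(0, 20, n_points - half)  # gentle fall
    return base

statement = pitch_contour(kind="statement")
question = pitch_contour(kind="question")
print(statement[-1], question[-1])  # statement ends lower, question higher
```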

Challenges:

  • Natural-sounding prosody
  • Different voices and emotions
  • Handling unusual words, names
  • Speed and clarity

5. Hands-On: Speech Tools (12 min)

Activity 1: Speech Recognition (6 min)

  • Use speech-to-text tool (Google Docs voice typing, or online tool)
  • Try speaking:
    • Clear, simple sentences
    • Longer paragraphs
    • With background noise
    • Different accents (if available)
    • Numbers, names, technical terms
  • Observe: Accuracy, what works, what doesn't

Activity 2: Speech Synthesis (6 min)

  • Use text-to-speech tool
  • Try different texts:
    • Simple sentences
    • Questions vs. statements
    • Numbers and dates
    • Different languages (if available)
    • Different voices
  • Observe: Naturalness, accuracy, limitations

Reflection Questions:

  • What worked well? What didn't?
  • How accurate was speech recognition?
  • How natural did speech synthesis sound?
  • What are the limitations?

6. Wrap-Up (3 min)

  • Speech recognition: Speech → Text
  • Speech synthesis: Text → Speech
  • Uses neural networks for better accuracy
  • Many applications in daily life
  • Preview: Next lesson - Robotics and autonomous systems

Differentiation Strategies

  • Younger students: Focus on fun demos, hands-on exploration, simpler explanations
  • Older students: Explore how neural networks are used, research specific techniques, analyze limitations
  • Struggling learners: Use guided exploration, simpler tools, more support
  • Advanced learners: Research specific ASR/TTS models, explore voice cloning, analyze ethical concerns

Assessment

  • Participation in hands-on activities
  • Quality of observations
  • Understanding of speech technology concepts
  • Reflection journal entry