Part 1: Fetch one transcript

We start with one podcast episode because the shape of the data matters before we design the pipeline. YouTube gives us transcript snippets with timestamps in seconds. The agent will need readable transcript text with timestamps it can cite, so the first job is to turn snippets into subtitle lines.

Install the transcript library and Jupyter in the flow/ project if you have not already done so in the overview:

uv add youtube-transcript-api
uv add --dev jupyter

Start Jupyter:

uv run jupyter notebook

Create notebook.ipynb. In the first cell, fetch a transcript for one video:

from youtube_transcript_api import YouTubeTranscriptApi

def fetch_transcript(video_id):
    # youtube-transcript-api 1.x interface: create an instance, then fetch
    ytt_api = YouTubeTranscriptApi()
    transcript = ytt_api.fetch(video_id)
    return transcript

video_id = 'D2rw52SOFfM'
transcript = fetch_transcript(video_id)

The example episode is "Reinventing a Career in Tech". When you look at the first transcript item, you will see fields like:

  • start, the timestamp in seconds.
  • duration, the duration of that segment.
  • text, the transcript text.

That shape is useful for machines, but it is not the form we want to store and retrieve. YouTube's own transcript display is closer to what we need:

0:00 Hi everyone, welcome to our event. This
0:03 event is brought to you by Data Dogs
0:04 Club which is a community of people who
0:06 love data.

The timestamp is useful later because the research agent can cite where a claim came from. Keep it in the indexed text.

Subtitle formatting

Create a helper that converts seconds to m:ss or h:mm:ss. This keeps short and long episodes readable:

def format_timestamp(seconds: float) -> str:
    total_seconds = int(seconds)
    hours, remainder = divmod(total_seconds, 3600)
    minutes, secs = divmod(remainder, 60)

    if hours == 0:
        return f"{minutes}:{secs:02}"
    return f"{hours}:{minutes:02}:{secs:02}"

Now turn every transcript snippet into one subtitle line. Replace embedded newlines so Elasticsearch receives clean text:

def make_subtitles(transcript) -> str:
    lines = []

    for entry in transcript:
        ts = format_timestamp(entry.start)
        text = entry.text.replace('\n', ' ')
        lines.append(ts + ' ' + text)

    return '\n'.join(lines)
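Both helpers are pure functions, so you can sanity-check them without any network access. The cell below repeats the two definitions so it runs on its own, and uses types.SimpleNamespace objects as stand-ins for the library's snippet objects (same .start and .text attributes; the values are made up for illustration):

```python
from types import SimpleNamespace

def format_timestamp(seconds: float) -> str:
    total_seconds = int(seconds)
    hours, remainder = divmod(total_seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    if hours == 0:
        return f"{minutes}:{secs:02}"
    return f"{hours}:{minutes:02}:{secs:02}"

def make_subtitles(transcript) -> str:
    lines = []
    for entry in transcript:
        ts = format_timestamp(entry.start)
        text = entry.text.replace('\n', ' ')
        lines.append(ts + ' ' + text)
    return '\n'.join(lines)

# Stand-in snippets mimicking the attributes of the real snippet objects
fake_transcript = [
    SimpleNamespace(start=0.0, text='Hi everyone, welcome to our event. This'),
    SimpleNamespace(start=3.0, text='event is brought to you by Data Dogs'),
    SimpleNamespace(start=754.0, text='Club which is a\ncommunity of people'),
]

print(make_subtitles(fake_transcript))
# 0:00 Hi everyone, welcome to our event. This
# 0:03 event is brought to you by Data Dogs
# 12:34 Club which is a community of people
```

Note how the embedded newline in the last snippet becomes a space, and how 754 seconds renders as 12:34 rather than an hour-form timestamp.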

Use the helper and look at the first few hundred characters:

subtitles = make_subtitles(transcript)
print(subtitles[:500])

We'll use this data format for the rest of the workshop:

  • one text document per video,
  • one timestamped line per transcript snippet,
  • title and video ID stored separately.
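Put together, one episode becomes a single dictionary. The field names below are the ones we will use from here on; the concrete values are placeholders taken from the example episode:

```python
doc = {
    "video_id": "D2rw52SOFfM",
    "title": "Reinventing a Career in Tech",
    "subtitles": (
        "0:00 Hi everyone, welcome to our event. This\n"
        "0:03 event is brought to you by Data Dogs"
    ),
}

print(doc["video_id"], "-", doc["title"])
```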

Cached transcript fallback

Direct YouTube transcript fetching can fail on cloud IP addresses. This is common in Codespaces and can also happen on a local machine after several videos. To keep things reproducible, the workshop repo includes preprocessed transcript text files under temporal.io/data/.

Install requests:

uv add requests

Add a cached fetch function to the notebook:

import requests

def fetch_transcript_cached(video_id):
    url_prefix = 'https://raw.githubusercontent.com/alexeygrigorev/workshops/refs/heads/main/temporal.io/data'
    url = f'{url_prefix}/{video_id}.txt'

    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on a missing file instead of parsing a 404 page

    lines = response.text.split('\n')

    video_title = lines[0]
    subtitles = '\n'.join(lines[2:]).strip()

    return {
        "video_id": video_id,
        "title": video_title,
        "subtitles": subtitles
    }

The cached files have the title on the first line, an empty line on the second line, and then subtitle lines. That is why the function joins lines[2:].
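You can verify that parsing logic locally with a small stand-in string in the same layout (title, blank line, then subtitle lines; the content is a made-up excerpt):

```python
raw_text = (
    "Reinventing a Career in Tech\n"
    "\n"
    "0:00 Hi everyone, welcome to our event. This\n"
    "0:03 event is brought to you by Data Dogs\n"
)

lines = raw_text.split('\n')
video_title = lines[0]
subtitles = '\n'.join(lines[2:]).strip()

print(video_title)  # Reinventing a Career in Tech
print(subtitles)    # the two timestamped lines, without the trailing newline
```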

Use the fallback when direct YouTube access is blocked:

cached_doc = fetch_transcript_cached('D2rw52SOFfM')
print(cached_doc['title'])
print(cached_doc['subtitles'][:500])

The rest of the ingestion code can work with either direct YouTube data or this cached form. In the direct path we create the title ourselves from the podcast metadata. In the cached path the text file already gives it to us.
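One way to tie the two paths together is a small wrapper that tries the direct fetch first and falls back to the cache on failure. This is a sketch, not part of the workshop repo; the fetch functions are passed in as parameters so the wrapper itself can be exercised with stand-ins:

```python
def fetch_with_fallback(video_id, fetch_direct, fetch_cached):
    """Try the direct fetcher; on any failure, fall back to the cached one."""
    try:
        return fetch_direct(video_id)
    except Exception:
        return fetch_cached(video_id)

# Usage with stand-in fetchers (the direct one simulates a blocked IP):
def blocked(video_id):
    raise RuntimeError("IP blocked")

def cached(video_id):
    return {"video_id": video_id,
            "title": "Reinventing a Career in Tech",
            "subtitles": "0:00 Hi everyone, welcome to our event. This"}

doc = fetch_with_fallback('D2rw52SOFfM', blocked, cached)
print(doc["title"])  # Reinventing a Career in Tech
```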

Next, in Part 2: Run Elasticsearch, we put one transcript into Elasticsearch and make sure search works.
