Part 3: Discover podcast videos

One transcript proves the search path works, so now we need all DataTalks.Club podcast episodes. The website is built from a GitHub repository, and the podcast metadata lives in _data/events.yaml.

Install the discovery dependencies:

uv add requests pyyaml tqdm

Read the metadata from a specific commit so the workshop stays reproducible even if the live website changes later:

import requests
import yaml

commit_id = '187b7d056a36d5af6ac33e4c8096c52d13a078a7'
events_url = f'https://raw.githubusercontent.com/DataTalksClub/datatalksclub.github.io/{commit_id}/_data/events.yaml'

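response = requests.get(events_url, timeout=30)
response.raise_for_status()  # fail loudly on a wrong commit ID or a network error

# CSafeLoader is faster but needs LibYAML; fall back to the pure-Python loader
loader = getattr(yaml, 'CSafeLoader', yaml.SafeLoader)
events_data = yaml.load(response.content, Loader=loader)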

podcasts = [
    d for d in events_data
    if (d.get('type') == 'podcast') and (d.get('youtube'))
]

print(f"Found {len(podcasts)} podcasts")

Extract the video ID from each YouTube URL, skipping the videos that are known to cause ingestion problems:

videos = []

for podcast in podcasts:
    _, video_id = podcast['youtube'].split('watch?v=')

    if video_id in ['FRi0SUtxdMw', 's8kyzy8V5b8']:
        continue

    videos.append({
        'title': podcast['title'],
        'video_id': video_id
    })

print(f"Will process {len(videos)} videos")
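The split on watch?v= assumes every URL has exactly that shape. A URL with extra query parameters (for example &t=42s) would leave them attached to the ID, and a youtu.be short link would raise a ValueError. If the metadata ever contains such URLs, a more tolerant parser could look like the sketch below. The function name extract_video_id is hypothetical and not part of the workshop code:

```python
from urllib.parse import parse_qs, urlparse


def extract_video_id(url):
    """Return the YouTube video ID from a URL, or None if none is found."""
    parsed = urlparse(url)

    # Short links: https://youtu.be/<id>
    if parsed.hostname == 'youtu.be':
        return parsed.path.lstrip('/') or None

    # Standard links: https://www.youtube.com/watch?v=<id>&t=...
    values = parse_qs(parsed.query).get('v')
    return values[0] if values else None
```

In the loop above, video_id = extract_video_id(podcast['youtube']) could stand in for the split, with a check for None before appending.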

Now each item has the structure the pipeline needs:

videos[0]

The expected fields are title and video_id. Everything after this point can ignore the original YAML structure.

Process all videos

The first version of the ingestion loop is plain Python.

It checks whether a video is already indexed, fetches subtitles when needed, and writes the document to Elasticsearch:

from tqdm.auto import tqdm

for video in tqdm(videos):
    video_id = video['video_id']
    video_title = video['title']

    if es.exists(index='podcasts', id=video_id):
        print(f'already processed {video_id}')
        continue

    transcript = fetch_transcript(video_id)
    subtitles = make_subtitles(transcript)

Then, still inside the loop, index the document:

    doc = {
        "video_id": video_id,
        "title": video_title,
        "subtitles": subtitles
    }

    es.index(index="podcasts", id=video_id, document=doc)

Run this loop and you hit the real problem. YouTube can block your IP address after a small number of transcript requests, sometimes in the middle of the podcast archive. That is exactly the failure mode the rest of the workshop needs to handle.

Proxy support

The workaround for blocked cloud or local IP addresses is a proxy. Use a residential proxy when possible. Residential IPs look like ordinary home connections, while cloud provider IPs are more likely to be blocked.

Put proxy credentials in .env:

PROXY_BASE_URL=...
PROXY_USER=...
PROXY_PASSWORD=...

Make sure .env is ignored:

echo '.env' >> .gitignore

Load the variables with dirdotenv:

echo 'alias dirdotenv="uvx dirdotenv"' >> ~/.bashrc
echo 'eval "$(dirdotenv hook bash)"' >> ~/.bashrc
source ~/.bashrc

You can also use python-dotenv inside Python:

uv add python-dotenv

Then load the file from the notebook or script:

from dotenv import load_dotenv

load_dotenv()
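Whichever loader you use, it can help to confirm that the three variables actually arrived before building the proxy configuration. This is a small sketch. The helper name missing_proxy_vars is hypothetical, and the variable names are the ones from the .env file above:

```python
import os

# Names taken from the .env file above
REQUIRED_VARS = ['PROXY_BASE_URL', 'PROXY_USER', 'PROXY_PASSWORD']


def missing_proxy_vars(environ=os.environ):
    """Return the required proxy variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]
```

Calling missing_proxy_vars() after load_dotenv() returns an empty list when everything is set. A non-empty result names the variables to fix, which is easier to read than a KeyError further down.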

Configure the YouTube transcript client with a proxy:

import os
from youtube_transcript_api.proxies import GenericProxyConfig

proxy_user = os.environ['PROXY_USER']
proxy_password = os.environ['PROXY_PASSWORD']
proxy_base_url = os.environ['PROXY_BASE_URL']

proxy_url = f'http://{proxy_user}:{proxy_password}@{proxy_base_url}'

proxy = GenericProxyConfig(
    http_url=proxy_url,
    https_url=proxy_url,
)
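One detail to watch: if the proxy user name or password contains characters such as @, : or /, pasting them directly into the URL can make it parse incorrectly. A possible guard is to percent-encode the credentials first. The sketch below assumes PROXY_BASE_URL is a host:port value without a scheme, as the f-string above implies. The helper name build_proxy_url is hypothetical:

```python
from urllib.parse import quote


def build_proxy_url(user, password, base_url):
    """Build an HTTP proxy URL with percent-encoded credentials."""
    # safe='' also encodes '/', ':' and '@', which would otherwise break the URL
    user_enc = quote(user, safe='')
    password_enc = quote(password, safe='')
    return f'http://{user_enc}:{password_enc}@{base_url}'
```

With this helper, proxy_url = build_proxy_url(proxy_user, proxy_password, proxy_base_url) could be used in place of the plain f-string. Credentials without special characters come out unchanged.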

Use the proxy when fetching transcripts:

from youtube_transcript_api import YouTubeTranscriptApi

def fetch_transcript(video_id):
    # Route transcript requests through the proxy configured above
    ytt_api = YouTubeTranscriptApi(proxy_config=proxy)
    return ytt_api.fetch(video_id)

Proxies reduce IP blocking, but they introduce more network failures. A request now travels from your machine to the proxy service, then to a residential proxy host, then to YouTube, and back along the same path. Expect SSL errors, timeouts, and blocked proxy IPs.

Because of those failures, we can't rely on the notebook loop alone. We need retries, observable progress, and a workflow that can continue after failures. That is where Temporal comes in, starting with Part 4: Temporal motivation.
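To make the retry requirement concrete, here is a minimal plain-Python sketch of retrying with exponential backoff. It is not the workshop's solution and not how Temporal implements retries. The helper name with_retries and its parameters are hypothetical:

```python
import time


def with_retries(func, *args, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(*args), retrying on any exception with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return func(*args)
        except Exception as error:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Double the wait after every failed attempt
            delay = base_delay * 2 ** (attempt - 1)
            print(f'attempt {attempt} failed ({error!r}); retrying in {delay}s')
            sleep(delay)
```

In the ingestion loop this could be used as transcript = with_retries(fetch_transcript, video_id). A helper like this only covers the retry part: if the notebook kernel dies, the loop still has to be restarted by hand, and nothing shows which videos failed or why.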
