Strategy · by Michael Wybraniec

Distributed Systems

A distributed architecture for managing metadata, media storage, feed generation, notifications, and analytics, focusing on both design decisions and implementation insights.

Building a Scalable Distributed System for Media Storage and Processing

I want to express my gratitude to my professional colleagues who have inspired me over the past few years: David Daupeyroux and YASH MAHENDRA JOSHI.

In today's digital landscape, serving millions of users efficiently demands systems that scale horizontally, handle failures, and process vast amounts of data quickly.

This guide walks through a distributed architecture for managing metadata, media storage, feed generation, notifications, and analytics, focusing on both design decisions and implementation insights.


Schema Overview

User
 └──> DNS
      └──> Load Balancer
            ├──> API Gateway 1
            ├──> API Gateway 2
            └──> CDN (for Static Content)
                  ├──> Image/Thumbnail Storage
                  └──> Video Storage

API Gateway
 ├── Authentication, Authorization
 ├── Caching, Transformation
 ├── Rate Limiting, Reverse Proxy
 └── Monitoring, Logging, Serverless Functions

Load Balancer sends:
- Control to Metadata Server
- Data to Block Server

Metadata Server
 ├──> Notification Service
 │     └──> Notification Queue
 ├──> Directory-based Partitioning
 ├──> Shard Manager
 │     └──> Feed Generation Service
 │           └──> Feed Generation Queue
 ├──> Search Results Aggregators
 ├──> Cache (Redis/Memcached)
 └──> Metadata Databases (Partitioned)

Block Server
 └──> Distributed File Storage
       ├──> Image/Thumbnail Storage
       └──> Video Storage
             └──> Video Processing Service
                   └──> Video Processing Queue
                         └──> Workers

Coordination and Support Systems
 ├── Coordination Service (Zookeeper)
 ├── Distributed Logging
 └── Distributed Tracing

Data Warehouse
 ├── Data Processing Systems (Hadoop/MapReduce, Spark)
 │     ├── Distributed Scheduler
 │     └── Workers
 └── Output (Metrics, Reports, etc.)
      └── Reports Viewing & Data Analysis
           └── Database

User Request Handling

Every request travels DNS ➔ Load Balancer ➔ API Gateway(s). Running several gateways behind the load balancer provides:

  • Fault Tolerance
  • Load Distribution
  • Scalability

Key Technical Details

  • Authentication, Authorization
  • Caching, Transformation
  • Rate Limiting, Reverse Proxy
  • Static content served via CDN
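# Assumes a cache zone named cdn_cache is declared in the http block, e.g.:
# proxy_cache_path /var/cache/nginx keys_zone=cdn_cache:10m;
location /static/ {
    proxy_pass http://cdn.example.com/static/;
    proxy_cache cdn_cache;
}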
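
Rate limiting is listed among the gateway's responsibilities but not shown. One common way to do it is a token bucket per client; the following is a minimal sketch of that technique, not the gateway's actual implementation. The class name, the `is_allowed` helper, and the limits (bursts of 5, refilled at 2 requests per second) are all assumptions made for illustration.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock              # injectable so tests can control time
        self.tokens = float(capacity)   # start with a full bucket
        self.updated = clock()

    def allow(self):
        """Return True and spend one token if the request may proceed."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Hypothetical per-client limits, keyed by an API key or client IP.
buckets = {}


def is_allowed(client_id):
    bucket = buckets.get(client_id)
    if bucket is None:
        bucket = buckets[client_id] = TokenBucket(rate=2, capacity=5)
    return bucket.allow()
```

With several gateway instances, the bucket state would need to live in a shared store rather than in process memory; this sketch keeps it local for brevity.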

🔹 Metadata and Data Flow

  • Control Data ➔ Metadata Server
  • Binary Data ➔ Block Server

Key Technical Details

  • Metadata caching via Redis/Memcached
  • Directory-based partitioning for load balancing
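import hashlib

def partition_directory(file_id):
    # Built-in hash() is salted per process for strings; a digest gives every server the same bucket.
    return f"{int(hashlib.sha256(str(file_id).encode()).hexdigest(), 16) % 1000:03d}"  # 1000 buckets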
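
Metadata caching is usually done with a cache-aside read: check the cache, fall back to the database on a miss, then populate the cache with a TTL. The sketch below illustrates that pattern under stated assumptions. `TTLCache` is an in-memory stand-in so the example runs without a server, and `get_metadata`, `load_from_db`, the `meta:` key prefix, and the 300-second TTL are hypothetical. With a real Redis client the `get` and `set` calls would map to its get and set-with-expiry operations.

```python
import json
import time


class TTLCache:
    """In-memory stand-in for Redis/Memcached with per-key expiry."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]   # expired entries behave like a miss
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, self._clock() + ttl_seconds)


def get_metadata(file_id, cache, load_from_db, ttl_seconds=300):
    """Cache-aside read: try the cache, fall back to the database, then fill the cache."""
    key = f"meta:{file_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    record = load_from_db(file_id)   # hypothetical database accessor
    if record is not None:
        cache.set(key, json.dumps(record), ttl_seconds)
    return record
```

The TTL bounds how stale a cached record can get; writes that must be visible immediately would also need to delete or overwrite the cached key.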

🔹 Notification and Feed Generation

When users upload media:

  • Notification Service triggers Notifications
  • Feed Generation builds personalized feeds

🔥 Key Technical Details

  • Decoupling via queues for scalability
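import boto3
import json

# Placeholder values; in the upload handler these come from the request context.
uploader_id = 'user-123'
follower_ids = ['user-456', 'user-789']

sqs = boto3.client('sqs')
sqs.send_message(
    QueueUrl='https://sqs.amazonaws.com/queue-url',  # placeholder queue URL
    MessageBody=json.dumps({
        'event_type': 'NEW_UPLOAD',
        'user_id': uploader_id,
        # SQS limits message size, so very large follower lists are better looked up by the consumer.
        'followers': follower_ids
    })
)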
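
The producer above is only half of the decoupling; a worker on the other side of the queue consumes each event and fans it out to follower feeds. The following sketch shows that consumer side under stated assumptions: the standard-library `queue.Queue` stands in for the real queue, `feeds` is an in-memory stand-in for a feed store, and `handle_event`, `drain`, and `FEED_LIMIT` are hypothetical names. The message shape matches the `NEW_UPLOAD` payload sent above.

```python
import json
import queue
from collections import defaultdict, deque

FEED_LIMIT = 100  # hypothetical cap on stored feed entries per user

# follower_id -> newest-first feed entries (in-memory stand-in for a feed store)
feeds = defaultdict(lambda: deque(maxlen=FEED_LIMIT))


def handle_event(body):
    """Fan a NEW_UPLOAD event out to each follower's feed; returns followers updated."""
    event = json.loads(body)
    if event.get("event_type") != "NEW_UPLOAD":
        return 0   # ignore event types this worker does not handle
    followers = event.get("followers", [])
    for follower_id in followers:
        feeds[follower_id].appendleft({"uploader": event["user_id"]})
    return len(followers)


def drain(q):
    """Process every message currently in the queue; returns how many were handled."""
    handled = 0
    while True:
        try:
            body = q.get_nowait()
        except queue.Empty:
            return handled
        handle_event(body)
        q.task_done()
        handled += 1
```

Because the upload request returns as soon as the message is enqueued, a slow or failing feed worker delays feed updates but does not block uploads.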

🔹 Video Processing Pipeline

  • Offload transcoding, thumbnail generation
  • Workers consume jobs from queues

🔥 Key Technical Details

  • Isolate heavy CPU/GPU workloads
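import subprocess
from pathlib import Path

def process_video(video_path):
    output_path = str(Path(video_path).with_suffix('.mp4'))  # works for any input extension
    # check=True raises CalledProcessError when ffmpeg fails
    subprocess.run(["ffmpeg", "-i", video_path, "-vcodec", "h264", "-acodec", "aac", output_path], check=True)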
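
The pipeline also mentions thumbnail generation, and most platforms transcode to more than one resolution. As a rough sketch of how a worker might plan that work, the helpers below only build ffmpeg argument lists and leave execution to `subprocess.run`. The rendition ladder, output naming, and function names are assumptions for illustration, and `libx264` assumes an ffmpeg build that includes that encoder.

```python
from pathlib import PurePosixPath

# Hypothetical rendition ladder: label -> target height in pixels
RENDITIONS = {"720p": 720, "480p": 480}


def transcode_command(src, dst, height):
    """ffmpeg arguments that scale to `height`, keeping aspect ratio with an even width."""
    return ["ffmpeg", "-y", "-i", src,
            "-vf", f"scale=-2:{height}",
            "-c:v", "libx264", "-c:a", "aac", dst]


def thumbnail_command(src, dst, at_seconds=1):
    """ffmpeg arguments that grab a single frame as an image."""
    return ["ffmpeg", "-y", "-ss", str(at_seconds), "-i", src, "-frames:v", "1", dst]


def plan_jobs(src, out_dir):
    """Return the commands a worker would run for one uploaded video."""
    stem = PurePosixPath(src).stem
    out = PurePosixPath(out_dir)
    jobs = [
        transcode_command(src, str(out / f"{stem}_{label}.mp4"), height)
        for label, height in RENDITIONS.items()
    ]
    jobs.append(thumbnail_command(src, str(out / f"{stem}.jpg")))
    return jobs
```

Each returned list can be enqueued as its own job, so a failed rendition can be retried without redoing the others.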

🔹 Coordination and Distributed Management

  • Zookeeper for Service Discovery
  • Distributed Logging and Tracing for Observability
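from kazoo.client import KazooClient

# Connect to a local ZooKeeper node (example address)
zk = KazooClient(hosts='127.0.0.1:2181')
zk.start()

# Ephemeral nodes vanish when the session ends, so dead workers drop out of discovery.
# makepath=True creates the /services parent if it does not exist yet.
zk.create("/services/worker1", b"127.0.0.1:8000", ephemeral=True, makepath=True)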
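
For the logging and tracing bullet, the core idea is that every log line produced while handling a request carries the same trace ID, so logs from different services can be joined. The sketch below shows one way to do that with the standard library only; it is trace-correlated logging, not a full tracing system. `trace_id_var`, `TraceIdFilter`, `start_trace`, and the `media` logger name are hypothetical.

```python
import contextvars
import logging
import uuid

# Holds the trace ID for the request currently being handled
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record."""

    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True


def start_trace(incoming_id=None):
    """Reuse the caller's trace ID if one was passed along, otherwise mint a new one."""
    trace_id = incoming_id or uuid.uuid4().hex
    trace_id_var.set(trace_id)
    return trace_id


logger = logging.getLogger("media")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s trace=%(trace_id)s %(message)s"))
handler.addFilter(TraceIdFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# The gateway mints an ID; downstream services would call start_trace(incoming_id)
# with the ID forwarded to them (for example in a request header).
start_trace()
logger.info("upload received")
```

Using a context variable rather than a global keeps IDs separate when one process handles many requests concurrently.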

🔹 Data Analytics and Reporting

  • Event Streams collected for business intelligence
  • Data Pipelines (Hadoop/Spark)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("VideoAnalytics").getOrCreate()
# Reads one JSON record per line; each record is expected to carry a video_id field
data = spark.read.json("s3://bucket/videos/metrics.json")
# Number of metric events recorded per video
aggregated = data.groupBy("video_id").count()
aggregated.write.parquet("s3://bucket/videos/aggregated/")
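
When developing a job like the one above, it can help to check expectations on a small sample without a cluster. The function below is a plain-Python sketch of the same per-video count over newline-delimited JSON. `count_events_per_video` is a hypothetical helper, and unlike the Spark job it drops malformed lines and records without a `video_id` instead of grouping them under null.

```python
import json
from collections import Counter


def count_events_per_video(lines):
    """Count records per video_id from an iterable of JSON lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue   # skip malformed lines
        video_id = record.get("video_id") if isinstance(record, dict) else None
        if video_id is not None:
            counts[video_id] += 1
    return counts
```

For example, passing an open file handle for a small local sample returns a `Counter` whose totals can be compared against the Parquet output for the same rows.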

📊 Why This Design Works

  • 🛡️ Fault Tolerance: Load Balancers, Queues, CDN
  • 📈 Scalability: Independent Metadata/Data scaling
  • ⚡ Low Latency: Caching, Async jobs
  • 📊 Big Data Analytics: Hadoop/Spark
  • 🔧 Maintainability: Service isolation, Distributed Tracing

🔥 Final Thoughts

This architecture offers scalability, fault tolerance, low latency, and observability, all of which are critical for media platforms like YouTube, Instagram, or TikTok.

Adopting queues, cache layers, distributed processing, and separation of concerns is non-negotiable for long-term success.


Author: Michael Wybraniec

Freelance, MCP Servers, Full-Stack Dev, Architecture