Strategy · by Michael Wybraniec

Distributed Systems


Building a Scalable Distributed System for Media Storage and Processing

I want to express my gratitude to my professional colleagues who have inspired me over the past few years: David Daupeyroux and YASH MAHENDRA JOSHI.

In today’s digital landscape, serving millions of users efficiently demands systems that scale horizontally, handle failures, and process vast amounts of data quickly.

This guide walks through a distributed architecture for managing metadata, media storage, feed generation, notifications, and analytics, focusing on both design decisions and implementation insights.


Schema Overview

User
 └──> DNS
      └──> Load Balancer
            ├──> API Gateway 1
            ├──> API Gateway 2
            └──> CDN (for Static Content)
                  ├──> Image/Thumbnail Storage
                  └──> Video Storage

API Gateway
 ├── Authentication, Authorization
 ├── Caching, Transformation
 ├── Rate Limiting, Reverse Proxy
 └── Monitoring, Logging, Serverless Functions

Load Balancer sends:
- Control to Metadata Server
- Data to Block Server

Metadata Server
 ├──> Notification Service
 │     └──> Notification Queue
 ├──> Directory-based Partitioning
 ├──> Shard Manager
 │     └──> Feed Generation Service
 │           └──> Feed Generation Queue
 ├──> Search Results Aggregators
 ├──> Cache (Redis/Memcached)
 └──> Metadata Databases (Partitioned)

Block Server
 └──> Distributed File Storage
        ├──> Image/Thumbnail Storage
       └──> Video Storage
             └──> Video Processing Service
                   └──> Video Processing Queue
                         └──> Workers

Coordination and Support Systems
 ├── Coordination Service (Zookeeper)
 ├── Distributed Logging
 └── Distributed Tracing

Data Warehouse
 ├── Data Processing Systems (Hadoop/MapReduce, Spark)
 │     ├── Distributed Scheduler
 │     └── Workers
 └── Output (Metrics, Reports, etc.)
      └── Reports Viewing & Data Analysis
           └── Database

User Request Handling

Users interact via DNS ➔ Load Balancer ➔ API Gateway(s) for:

  • Fault Tolerance
  • Load Distribution
  • Scalability

Key Technical Details

  • Authentication, Authorization
  • Caching, Transformation
  • Rate Limiting, Reverse Proxy
  • Static content served via CDN
location /static/ {
    proxy_pass http://cdn.example.com/static/;
    proxy_cache cdn_cache;
}

🔹 Metadata and Data Flow

  • Control Data ➔ Metadata Server
  • Binary Data ➔ Block Server

Key Technical Details

  • Metadata caching via Redis/Memcached
  • Directory-based partitioning for load balancing
import hashlib

def partition_directory(file_id):
    # Use a stable hash: Python's built-in hash() is salted per process,
    # so it would map the same file to different buckets across restarts.
    digest = hashlib.md5(str(file_id).encode()).hexdigest()
    return f"{int(digest, 16) % 1000:03d}"  # 1000 buckets

🔹 Notification and Feed Generation

When users upload media:

  • Notification Service triggers Notifications
  • Feed Generation builds personalized feeds

🔥 Key Technical Details

  • Decoupling via queues for scalability
import boto3
import json

sqs = boto3.client('sqs')
sqs.send_message(
    QueueUrl='https://sqs.amazonaws.com/queue-url',
    MessageBody=json.dumps({
        'event_type': 'NEW_UPLOAD',
        'user_id': uploader_id,      # ID of the uploading user
        'followers': follower_ids    # IDs to fan the event out to
    })
)

🔹 Video Processing Pipeline

  • Offload transcoding, thumbnail generation
  • Workers consume jobs from queues

🔥 Key Technical Details

  • Isolate heavy CPU/GPU workloads
import subprocess

def process_video(video_path):
    # Transcode to H.264 video / AAC audio in an MP4 container
    output_path = video_path.rsplit('.', 1)[0] + '.mp4'
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-vcodec", "h264", "-acodec", "aac", output_path],
        check=True,  # raise if ffmpeg fails
    )
    return output_path

🔹 Coordination and Distributed Management

  • Zookeeper for Service Discovery
  • Distributed Logging and Tracing for Observability
from kazoo.client import KazooClient

zk = KazooClient(hosts='127.0.0.1:2181')
zk.start()

# The parent path must exist before registering an ephemeral node;
# the node disappears automatically if this worker's session dies.
zk.ensure_path("/services")
zk.create("/services/worker1", b"127.0.0.1:8000", ephemeral=True)

🔹 Data Analytics and Reporting

  • Event Streams collected for business intelligence
  • Data Pipelines (Hadoop/Spark)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("VideoAnalytics").getOrCreate()
data = spark.read.json("s3://bucket/videos/metrics.json")
aggregated = data.groupBy("video_id").count()
aggregated.write.parquet("s3://bucket/videos/aggregated/")

📊 Why This Design Works

  • πŸ›‘οΈ Fault Tolerance: Load Balancers, Queues, CDN
  • πŸ“ˆ Scalability: Independent Metadata/Data scaling
  • ⚑ Low Latency: Caching, Async jobs
  • πŸ“Š Big Data Analytics: Hadoop/Spark
  • πŸ”§ Maintainability: Service isolation, Distributed Tracing

🔥 Final Thoughts

This architecture offers scalability, fault tolerance, low latency, and observability, all critical for media platforms like YouTube, Instagram, or TikTok.

Adopting queues, cache layers, distributed processing, and separation of concerns is non-negotiable for long-term success.


Author: Michael Wybraniec

Full-Stack Development and Architecture