Distributed Systems
Building a Scalable Distributed System for Media Storage and Processing
I want to express my gratitude to my professional colleagues who have inspired me over the past few years: David Daupeyroux and YASH MAHENDRA JOSHI.
In todayโs digital landscape, serving millions of users efficiently demands systems that scale horizontally, handle failures, and process vast amounts of data quickly.
This guide walks through a distributed architecture for managing metadata, media storage, feed generation, notifications, and analytics, focusing on both design decisions and implementation insights.
Schema Overview
User
โโโ> DNS
โโโ> Load Balancer
โโโ> API Gateway 1
โโโ> API Gateway 2
โโโ> CDN (for Static Content)
โโโ> Image/Thumbnail Storage
โโโ> Video Storage
API Gateway
โโโ Authentication, Authorization
โโโ Caching, Transformation
โโโ Rate Limiting, Reverse Proxy
โโโ Monitoring, Logging, Serverless Functions
Load Balancer sends:
- Control to Metadata Server
- Data to Block Server
Metadata Server
โโโ> Notification Service
โ โโโ> Notification Queue
โโโ> Directory-based Partitioning
โโโ> Shard Manager
โ โโโ> Feed Generation Service
โ โโโ> Feed Generation Queue
โโโ> Search Results Aggregators
โโโ> Cache (Redis/Memcached)
โโโ> Metadata Databases (Partitioned)
Block Server
โโโ> Distributed File Storage
โโโ> Image/Thumbnail Storage
โโโ> Video Storage
โโโ> Video Processing Service
โโโ> Video Processing Queue
โโโ> Workers
Coordination and Support Systems
โโโ Coordination Service (Zookeeper)
โโโ Distributed Logging
โโโ Distributed Tracing
Data Warehouse
โโโ Data Processing Systems (Hadoop/MapReduce, Spark)
โ โโโ Distributed Scheduler
โ โโโ Workers
โโโ Output (Metrics, Reports, etc.)
โโโ Reports Viewing & Data Analysis
โโโ Database
User Request Handling
Users interact via DNS โ Load Balancer โ API Gateway(s) for:
- Fault Tolerance
- Load Distribution
- Scalability
Key Technical Details
- Authentication, Authorization
- Caching, Transformation
- Rate Limiting, Reverse Proxy
- Static content served via CDN
location /static/ {
proxy_pass http://cdn.example.com/static/;
proxy_cache cdn_cache;
}
๐น Metadata and Data Flow
- Control Data โ Metadata Server
- Binary Data โ Block Server
Key Technical Details
- Metadata caching via Redis/Memcached
- Directory-based partitioning for load balancing
def partition_directory(file_id):
return f"{hash(file_id) % 1000:03d}" # 1000 buckets
๐น Notification and Feed Generation
When users upload media:
- Notification Service triggers Notifications
- Feed Generation builds personalized feeds
๐ฅ Key Technical Details
- Decoupling via queues for scalability
import boto3
import json
sqs = boto3.client('sqs')
sqs.send_message(
QueueUrl='https://sqs.amazonaws.com/queue-url',
MessageBody=json.dumps({
'event_type': 'NEW_UPLOAD',
'user_id': uploader_id,
'followers': follower_ids
})
)
๐น Video Processing Pipeline
- Offload transcoding, thumbnail generation
- Workers consume jobs from queues
๐ฅ Key Technical Details
- Isolate heavy CPU/GPU workloads
def process_video(video_path):
output_path = video_path.replace('.mov', '.mp4')
subprocess.run(["ffmpeg", "-i", video_path, "-vcodec", "h264", "-acodec", "aac", output_path])
๐น Coordination and Distributed Management
- Zookeeper for Service Discovery
- Distributed Logging and Tracing for Observability
from kazoo.client import KazooClient
zk = KazooClient(hosts='127.0.0.1:2181')
zk.start()
zk.create("/services/worker1", b"127.0.0.1:8000", ephemeral=True)
๐น Data Analytics and Reporting
- Event Streams collected for business intelligence
- Data Pipelines (Hadoop/Spark)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("VideoAnalytics").getOrCreate()
data = spark.read.json("s3://bucket/videos/metrics.json")
aggregated = data.groupBy("video_id").count()
aggregated.write.parquet("s3://bucket/videos/aggregated/")
๐ Why This Design Works
- ๐ก๏ธ Fault Tolerance: Load Balancers, Queues, CDN
- ๐ Scalability: Independent Metadata/Data scaling
- โก Low Latency: Caching, Async jobs
- ๐ Big Data Analytics: Hadoop/Spark
- ๐ง Maintainability: Service isolation, Distributed Tracing
๐ฅ Final Thoughts
This architecture offers scalability, fault tolerance, low latency, and observability โ critical for media platforms like YouTube, Instagram, or TikTok.
Adopting queues, cache layers, distributed processing, and separation of concerns is non-negotiable for long-term success.
Author: Michael Wybraniec