Distributed Systems
Building a Scalable Distributed System for Media Storage and Processing
I want to express my gratitude to my professional colleagues who have inspired me over the past few years: David Daupeyroux and Yash Mahendra Joshi.
In today's digital landscape, serving millions of users efficiently demands systems that scale horizontally, handle failures, and process vast amounts of data quickly.
This guide walks through a distributed architecture for managing metadata, media storage, feed generation, notifications, and analytics, focusing on both design decisions and implementation insights.
Schema Overview
User
├──> DNS
├──> Load Balancer
├──> API Gateway 1
├──> API Gateway 2
├──> CDN (for Static Content)
├──> Image/Thumbnail Storage
└──> Video Storage
API Gateway
├── Authentication, Authorization
├── Caching, Transformation
├── Rate Limiting, Reverse Proxy
└── Monitoring, Logging, Serverless Functions
The Load Balancer routes:
- Control data to the Metadata Server
- Binary data to the Block Server
Metadata Server
├──> Notification Service
│    └──> Notification Queue
├──> Directory-based Partitioning
├──> Shard Manager
│    ├──> Feed Generation Service
│    └──> Feed Generation Queue
├──> Search Results Aggregators
├──> Cache (Redis/Memcached)
└──> Metadata Databases (Partitioned)
Block Server
├──> Distributed File Storage
├──> Image/Thumbnail Storage
├──> Video Storage
├──> Video Processing Service
├──> Video Processing Queue
└──> Workers
Coordination and Support Systems
├── Coordination Service (Zookeeper)
├── Distributed Logging
└── Distributed Tracing
Data Warehouse
├── Data Processing Systems (Hadoop/MapReduce, Spark)
│    ├── Distributed Scheduler
│    └── Workers
├── Output (Metrics, Reports, etc.)
├── Reports Viewing & Data Analysis
└── Database
User Request Handling
Users interact via DNS → Load Balancer → API Gateway(s) for:
- Fault Tolerance
- Load Distribution
- Scalability
Key Technical Details
- Authentication, Authorization
- Caching, Transformation
- Rate Limiting, Reverse Proxy (rate-limiter sketch below)
- Static content served via CDN
location /static/ {
proxy_pass http://cdn.example.com/static/;
proxy_cache cdn_cache;
}
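The rate limiting listed above usually starts as a token bucket per client at the gateway. The sketch below is a minimal in-memory version, not the platform's actual implementation; in this architecture the counters would live in the shared cache (Redis/Memcached) so every gateway instance sees the same limits, and the class name and numbers are illustrative.
import time

class TokenBucket:
    # Illustrative per-client bucket; a production gateway keeps this state in Redis so limits are shared
    def __init__(self, rate_per_sec, capacity):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets = {}

def is_allowed(client_id):
    # 10 requests/second steady state, bursts up to 20
    return buckets.setdefault(client_id, TokenBucket(10, 20)).allow()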
🔹 Metadata and Data Flow
- Control Data → Metadata Server
- Binary Data → Block Server
Key Technical Details
- Metadata caching via Redis/Memcached (cache-aside sketch below)
- Directory-based partitioning for load balancing
import hashlib

def partition_directory(file_id):
    return f"{int(hashlib.md5(file_id.encode()).hexdigest(), 16) % 1000:03d}"  # stable hash -> 1000 buckets
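For the metadata caching noted above, a cache-aside read path is the usual pattern: check Redis, fall back to the partitioned metadata database, then populate the cache. A minimal sketch using redis-py; the key prefix, TTL, and load_from_db callback are assumptions, not part of the original design.
import json
import redis

r = redis.Redis(host='localhost', port=6379)

def get_metadata(file_id, load_from_db):
    # Cache-aside: serve from Redis when possible, otherwise hit the shard and cache the result
    key = f"meta:{file_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    record = load_from_db(file_id)         # query the shard selected by partition_directory(file_id)
    r.setex(key, 300, json.dumps(record))  # 5-minute TTL bounds staleness
    return record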
🔹 Notification and Feed Generation
When users upload media:
- The Notification Service alerts followers of the new upload
- The Feed Generation Service builds personalized feeds
🔥 Key Technical Details
- Decoupling via queues for scalability
import boto3
import json
sqs = boto3.client('sqs')
sqs.send_message(
QueueUrl='https://sqs.amazonaws.com/queue-url',
MessageBody=json.dumps({
'event_type': 'NEW_UPLOAD',
'user_id': uploader_id,
'followers': follower_ids
})
)
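On the consuming side, a feed-generation worker would long-poll the same queue and fan each upload event out to follower feeds. A sketch under the same assumptions as above: the queue URL is a placeholder, and append_to_feed stands in for whatever store holds per-user feeds.
import boto3
import json

sqs = boto3.client('sqs')
QUEUE_URL = 'https://sqs.amazonaws.com/queue-url'  # same placeholder as above

def poll_feed_events(append_to_feed):
    # Long-poll, fan NEW_UPLOAD events out to follower feeds, then delete the processed message
    response = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
    for message in response.get('Messages', []):
        event = json.loads(message['Body'])
        if event['event_type'] == 'NEW_UPLOAD':
            for follower_id in event['followers']:
                append_to_feed(follower_id, event['user_id'])
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message['ReceiptHandle'])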
🔹 Video Processing Pipeline
- Offload transcoding, thumbnail generation
- Workers consume jobs from queues
🔥 Key Technical Details
- Isolate heavy CPU/GPU workloads
import subprocess

def process_video(video_path):
    # Transcode to MP4 (H.264 video, AAC audio); check=True surfaces ffmpeg failures
    output_path = video_path.replace('.mov', '.mp4')
    subprocess.run(["ffmpeg", "-i", video_path, "-vcodec", "h264", "-acodec", "aac", output_path], check=True)
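The workers themselves are typically just a polling loop around the transcode step above: pull a job from the video processing queue, run it, delete the message. A sketch assuming the jobs arrive over SQS like the notification events; the queue URL and job fields are illustrative.
import boto3
import json

sqs = boto3.client('sqs')
VIDEO_QUEUE_URL = 'https://sqs.amazonaws.com/video-queue-url'  # illustrative

def run_worker():
    # One job at a time so a long transcode does not block other messages held by this worker
    while True:
        response = sqs.receive_message(QueueUrl=VIDEO_QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
        for message in response.get('Messages', []):
            job = json.loads(message['Body'])
            process_video(job['video_path'])
            sqs.delete_message(QueueUrl=VIDEO_QUEUE_URL, ReceiptHandle=message['ReceiptHandle'])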
🔹 Coordination and Distributed Management
- Zookeeper for Service Discovery
- Distributed Logging and Tracing for Observability
from kazoo.client import KazooClient
zk = KazooClient(hosts='127.0.0.1:2181')
zk.start()
# Ephemeral node disappears if the worker's session drops; makepath creates /services if missing
zk.create("/services/worker1", b"127.0.0.1:8000", ephemeral=True, makepath=True)
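Discovery is the mirror image of registration: clients list the children of the same path and read each worker's address. A sketch with kazoo, reusing the /services layout from the snippet above.
from kazoo.client import KazooClient

zk = KazooClient(hosts='127.0.0.1:2181')
zk.start()

def discover_workers():
    # Each child of /services is an ephemeral node whose value is that worker's host:port
    workers = []
    for name in zk.get_children("/services"):
        data, _stat = zk.get(f"/services/{name}")
        workers.append(data.decode())
    return workers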
🔹 Data Analytics and Reporting
- Event Streams collected for business intelligence
- Data Pipelines (Hadoop/Spark)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("VideoAnalytics").getOrCreate()
# Read raw per-video metric events, count them per video_id, and persist the result for reporting
data = spark.read.json("s3://bucket/videos/metrics.json")
aggregated = data.groupBy("video_id").count()
aggregated.write.parquet("s3://bucket/videos/aggregated/")
🚀 Why This Design Works
- 🛡️ Fault Tolerance: Load Balancers, Queues, CDN
- 📈 Scalability: Independent Metadata/Data scaling
- ⚡ Low Latency: Caching, Async jobs
- 📊 Big Data Analytics: Hadoop/Spark
- 🔧 Maintainability: Service isolation, Distributed Tracing
🔥 Final Thoughts
This architecture offers scalability, fault tolerance, low latency, and observability: qualities critical for media platforms like YouTube, Instagram, or TikTok.
Adopting queues, cache layers, distributed processing, and separation of concerns is non-negotiable for long-term success.
Author: Michael Wybraniec