How to build an enterprise MDM from scratch: Part 1
Part 1: Infrastructure — The Foundation
When people discuss MDM, they talk about policies. They talk about zero-trust. They talk about compliance dashboards and automated remediation. Nobody talks about the database that stores all of it. This is the first post in a series about building an enterprise MDM from scratch rather than configuring a SaaS product: deploying, operating, and scaling an MDM platform on your own infrastructure. This post covers infrastructure. It's not glamorous, but get it wrong and nothing else matters.
The Decision That Shapes Everything Else
Before writing a single line of Terraform, you need to answer one question: Where does this thing run?
You have three realistic options:
Managed SaaS - Let the vendor handle infrastructure. You get a dashboard, and they handle uptime. This is what most companies do, and honestly, it's the right call for most companies. If you don't have specific requirements around data residency, logging integration, or deep customization, stop reading and go buy a SaaS solution. Seriously. The operational burden of self-hosting isn't worth it unless you need the control.
On-prem - Run it in your own data center. Unless you already have mature data center operations, this path adds complexity without a clear benefit. You're managing hardware, power, cooling, and physical security on top of everything else.
Self-hosted in the cloud - Deploy the platform on AWS, GCP, or Azure. You own the infrastructure, but you're not racking servers. You get control over your data and security posture without the physical infrastructure burden.
If your organization is considering self-hosted MDM, the third option (self-hosted in the cloud) is the sweet spot. The rest of this post assumes you're going that route.
Containers: The Compute Decision
Most modern MDM platforms ship as Docker containers. That gives you several deployment options:
Option A: Containers on VMs
Run Docker on EC2 instances (or equivalent). You manage the underlying hosts — patching, capacity, Docker daemon health.
When this makes sense:
Large deployments where cost optimization matters (1,000+ devices)
Teams with existing EC2 automation and expertise
Environments where you need SSH access to debug container issues
When to avoid:
Small teams without dedicated platform engineers
Organizations prioritizing operational simplicity over cost
Option B: Kubernetes (EKS, GKE, AKS)
Full container orchestration. Powerful, but complex.
When this makes sense:
You already run Kubernetes for other workloads
You need advanced deployment patterns (canary, blue-green)
You have platform engineers who live and breathe k8s
When to avoid:
Single-service workloads (MDM is typically one application)
Teams without Kubernetes expertise
When you're trying to reduce operational complexity, not add it
Option C: Serverless Containers (Fargate, Cloud Run)
You define the container, and the cloud provider runs it. No instances to manage.
When this makes sense:
Small to medium deployments (under 1,000 devices)
Teams prioritizing operational simplicity
When you want to eliminate patching and capacity management
When to avoid:
Cost-sensitive large deployments (20-30% premium over VMs)
Workloads requiring GPU or specialized hardware
When you need deep OS-level debugging access
The Recommendation
For most organizations deploying MDM for a few hundred devices, serverless containers hit the right balance. The cost premium is minimal at small scale, and the operational simplicity is significant. You're not waking up at 2 AM because a Docker daemon crashed. If you're scaling to thousands of devices or have a mature platform team, VM-backed containers become more attractive for cost reasons.
Typical sizing: 0.5-1 vCPU and 2-4 GB RAM per container. MDM platforms tend to be memory-hungry (caching device state) but CPU-light. Start with 2 containers for redundancy, configure auto-scaling based on CPU, and let it grow as needed.
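To make "auto-scaling based on CPU" concrete, here is a minimal sketch of the proportional rule that target-tracking autoscalers approximate. The function name, the 60% target, and the 2 to 6 container bounds are illustrative assumptions, not values from any particular platform or cloud provider.

```python
import math


def desired_container_count(current: int, avg_cpu_pct: float,
                            target_cpu_pct: float = 60.0,
                            min_count: int = 2, max_count: int = 6) -> int:
    """Scale the container count so average CPU moves toward the target.

    target_cpu_pct, min_count, and max_count are hypothetical defaults.
    min_count=2 mirrors the "start with 2 containers for redundancy" advice.
    """
    if current < 1:
        raise ValueError("current must be at least 1")
    # Proportional rule: more load per container means proportionally more containers.
    raw = math.ceil(current * avg_cpu_pct / target_cpu_pct)
    # Never drop below the redundancy floor or exceed the budget ceiling.
    return max(min_count, min(max_count, raw))
```

With these assumed defaults, two containers averaging 90% CPU would scale out to three, while two containers at 10% CPU stay at two because of the redundancy floor.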
The Database Question
MDM platforms store significant state: device inventory, policy results, query data, and user permissions. This needs a database, and the choice matters.
MySQL vs PostgreSQL
Check your platform's requirements. Some support both, some require one or the other. If you have flexibility:
PostgreSQL tends to have better query performance for complex analytics
MySQL has broader compatibility and simpler replication
Neither is wrong. Use what your team knows, or what your platform requires.
Standard Managed Database vs Enhanced Options
Cloud providers offer multiple tiers of managed databases. Using AWS as an example:
Standard RDS:
You pick instance size and storage
Multi-AZ adds a standby for failover
Failover takes 1-2 minutes
You manage read replicas manually
Aurora (or equivalent):
Storage auto-scales
Faster failover (under 30 seconds typical)
Automatic read replicas
~20% cost premium
The Recommendation
For production MDM, lean toward the enhanced option (Aurora, Cloud SQL with HA, Azure Database with zone redundancy). The cost premium is worth it for:
Faster failover: During database failover, your MDM platform can't write. Devices checking in get errors. Shorter failover means less disruption.
Storage simplicity: Auto-scaling storage means you don't wake up to "disk full" alerts. MDM databases grow unpredictably based on query volume and device count.
Operational peace of mind: When something goes wrong at 3 AM, you want the database that recovers fastest with the least intervention.
Sizing: Start with burstable instances (t-class or equivalent). MDM database load is typically bursty: heavy during policy pushes, light otherwise. Burstable instances handle this pattern cost-effectively. Monitor CPU credit consumption; if you're consistently exhausting credits, upgrade to a fixed-performance instance.
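One way to turn "consistently exhausting credits" into a check is to look at a window of credit-balance samples (on AWS, the CloudWatch metric is CPUCreditBalance). This is a sketch only: the function name, the low watermark of 10 credits, and the 20% tolerance are assumptions you would tune for your own instance class.

```python
from typing import Sequence


def should_move_to_fixed_performance(credit_balances: Sequence[float],
                                     low_watermark: float = 10.0,
                                     max_low_fraction: float = 0.2) -> bool:
    """Return True when too many samples sit at or below the low watermark.

    low_watermark and max_low_fraction are hypothetical thresholds.
    """
    if not credit_balances:
        return False  # no data, no decision
    low_samples = sum(1 for balance in credit_balances if balance <= low_watermark)
    return low_samples / len(credit_balances) > max_low_fraction
```

A single dip during a policy push should not trigger an upgrade; a balance that sits near zero for a large share of the window is the signal worth acting on.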
Watch out for: Query result storage. If you run scheduled queries across hundreds of devices and store results, that data accumulates. Plan for 50-100 GB within the first year for a medium deployment.
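A back-of-the-envelope estimate shows how quickly stored results add up. Every input below (queries per day, rows per result, bytes per row) is an assumed value for illustration; substitute measurements from your own deployment.

```python
def yearly_query_storage_gb(devices: int, queries_per_day: int,
                            rows_per_result: int, bytes_per_row: int,
                            retention_days: int = 365) -> float:
    """Estimate stored query-result volume in GB (1 GB = 1e9 bytes).

    Ignores indexes, row overhead, and compression, so treat it as a floor.
    """
    total_bytes = (devices * queries_per_day * rows_per_result
                   * bytes_per_row * retention_days)
    return total_bytes / 1e9


# Assumed workload: 500 devices, one scheduled query per hour,
# 50 rows per result, about 300 bytes per row, kept for a year.
estimate = yearly_query_storage_gb(500, 24, 50, 300)
print(f"{estimate:.1f} GB")
```

Under these assumptions the estimate lands at roughly 66 GB, inside the 50-100 GB planning range. Halving retention or query frequency halves the figure, which is why archiving old results is the cheapest lever.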
Caching is More Important Than You Think
Here's something that surprises people: live queries typically require a caching layer. Live queries let you push ad-hoc queries to devices and collect results in real-time. "Show me all devices running vulnerable software." This is incredibly valuable during incident response. The technical requirement is pub/sub messaging. You publish a query, devices subscribe and respond, and results aggregate. Redis (or a compatible cache) handles this.
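The publish, respond, aggregate flow can be sketched without a real cache. The broker below is an in-process stand-in for Redis pub/sub, not a Redis client, and the channel names, message format, and simulated device responses are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class InMemoryBroker:
    """In-process stand-in for a pub/sub cache; not a real Redis client."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, channel: str, handler: Callable[[str], None]) -> None:
        self._subscribers[channel].append(handler)

    def publish(self, channel: str, message: str) -> int:
        handlers = list(self._subscribers[channel])
        for handler in handlers:
            handler(message)
        return len(handlers)  # number of subscribers that received the message


def run_live_query(broker: InMemoryBroker, device_ids: List[str], query: str) -> Dict[str, str]:
    results: Dict[str, str] = {}

    def collect(message: str) -> None:
        # Hypothetical wire format: "<device_id>|<answer>"
        device_id, answer = message.split("|", 1)
        results[device_id] = answer

    broker.subscribe("live:results", collect)

    for device_id in device_ids:
        # Each simulated device listens for queries and publishes its answer.
        def respond(q: str, device_id: str = device_id) -> None:
            broker.publish("live:results", f"{device_id}|result for {q}")

        broker.subscribe("live:queries", respond)

    broker.publish("live:queries", query)
    return results
```

In a real deployment the broker is the shared cache, which is what lets any application container publish a query and any other container collect the results.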
Options
In-memory (SQLite mode): Some platforms support running without an external cache for small deployments. Fine for testing, risky for production. You lose live query capability, and you're putting all state in a single container.
Managed Redis: ElastiCache, MemoryDB, Cloud Memorystore. Fully managed, automatic failover.
Self-managed Redis: Run Redis on EC2/VMs. More control, more operational burden.
The Recommendation
Use managed Redis. The cost is minimal (a small cluster runs $50-100/month), and the operational simplicity is worth it. You don't want to debug Redis replication issues when you should be investigating a security incident.
Sizing: Start small. A single node or small cluster handles thousands of devices. Redis is fast. Over-provisioning here wastes money.
Edge case to plan for: Connection exhaustion during rapid container restarts. If your application containers restart frequently (bad deployment, health check issues), each new container opens Redis connections before old ones close. Configure connection pooling in your application, and set reasonable connection limits in Redis.
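The idea behind connection pooling is a hard cap per container, so a restart storm cannot open an unbounded number of connections. The sketch below is a generic bounded pool built on the standard library; the class name and the `factory` parameter are hypothetical, and real Redis client libraries usually ship their own pool that you would configure instead.

```python
import queue
from contextlib import contextmanager


class BoundedPool:
    """Hands out at most max_connections connections, reusing them when returned."""

    def __init__(self, factory, max_connections: int, timeout: float = 1.0):
        self._factory = factory  # hypothetical callable that opens one connection
        self._timeout = timeout
        self._slots = queue.LifoQueue(maxsize=max_connections)
        for _ in range(max_connections):
            self._slots.put(None)  # None marks a free slot with no connection yet
        self.created = 0  # informational counter; not synchronized across threads

    @contextmanager
    def connection(self):
        # Blocks up to `timeout` seconds, then raises queue.Empty if the pool is exhausted.
        conn = self._slots.get(timeout=self._timeout)
        if conn is None:
            conn = self._factory()  # open lazily, only when a slot is first used
            self.created += 1
        try:
            yield conn
        finally:
            self._slots.put(conn)  # return the connection for reuse
```

Callers that hit the cap fail fast with `queue.Empty` rather than piling more connections onto the cache, which is the behavior you want while old containers are still draining.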
Networking: Boring But Critical
Networking decisions have security implications. Get them right upfront.
Dedicated VPC vs Shared VPC
Dedicated VPC for MDM:
Clear security boundary
Simple security group rules
Easy to explain to auditors
Slightly more overhead (separate NAT gateway, etc.)
Shared VPC with other services:
Cost savings (shared NAT gateway)
More complex security groups
Larger blast radius if something goes wrong
The Recommendation
Use a dedicated VPC. MDM is a security tool. It should be isolated. When auditors ask, "What can access the device management database?", you want a simple answer: "containers in this VPC, nothing else." The cost difference is minimal ($30-50/month for a separate NAT gateway), and the security clarity is significant.
Subnet Design
Keep it simple:
Public subnets: Load balancer, NAT gateway. These need internet access.
Private subnets: Everything else. Containers, database, cache. No direct internet access.
The load balancer is the only resource with a public IP. Everything else routes outbound through NAT (needed for reaching Apple's APNs, software download CDNs, etc.) but has no inbound internet path.
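As a planning aid, the public/private split can be sketched with Python's standard `ipaddress` module. The VPC CIDR, the /24 subnet size, and the two-availability-zone layout are assumptions for illustration; your IaC tool would create the actual subnets.

```python
import ipaddress
from typing import Dict, List


def plan_subnets(vpc_cidr: str, az_count: int = 2, new_prefix: int = 24) -> Dict[str, List[str]]:
    """Carve one public and one private subnet per availability zone out of a VPC CIDR."""
    vpc = ipaddress.ip_network(vpc_cidr)
    blocks = vpc.subnets(new_prefix=new_prefix)  # generator of consecutive blocks
    plan: Dict[str, List[str]] = {"public": [], "private": []}
    for tier in ("public", "private"):
        for _ in range(az_count):
            plan[tier].append(str(next(blocks)))
    return plan


# Hypothetical VPC range; public subnets hold the load balancer and NAT gateway,
# private subnets hold containers, database, and cache.
print(plan_subnets("10.20.0.0/16"))
```

For the assumed 10.20.0.0/16 range this yields 10.20.0.0/24 and 10.20.1.0/24 as public subnets and 10.20.2.0/24 and 10.20.3.0/24 as private ones, leaving most of the range free for later growth.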
Don't Forget the Jump Host
You will, at some point, need direct database access. Maybe to debug a query. Maybe to recover from a disaster. Maybe to run a migration. Plan for this from day one. A small bastion host in the VPC costs almost nothing and saves hours of scrambling during incidents. Configure it with minimal access (SSH only, from specific IPs), and have a runbook ready for when you need it.
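"SSH only, from specific IPs" is easy to state and easy to drift away from. Below is a small sketch of a check you could run against a list of ingress rules in CI. The `IngressRule` type, the function name, and the limit of 8 addresses per rule are hypothetical and not tied to any cloud provider's API.

```python
import ipaddress
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class IngressRule:
    port: int
    cidr: str


def bastion_rule_violations(rules: Iterable[IngressRule], max_hosts: int = 8) -> List[str]:
    """Return human-readable problems; an empty list means the rules look minimal."""
    problems: List[str] = []
    for rule in rules:
        network = ipaddress.ip_network(rule.cidr)
        if rule.port != 22:
            problems.append(f"port {rule.port} is not SSH")
        if network.num_addresses > max_hosts:  # max_hosts is an assumed policy limit
            problems.append(f"{rule.cidr} is too broad")
    return problems
```

A rule such as port 22 from 0.0.0.0/0 is flagged as too broad, while port 22 from a single /32 address passes.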
Cost Expectations
Infrastructure cost for self-hosted MDM scales with device count, but not linearly. The base cost is relatively fixed; you need a database, a cache, containers, and load balancing regardless of whether you have 100 or 500 devices.
Rough budgeting for a ~500 device deployment:
| Component | Monthly Range |
|---|---|
| Compute (containers) | $75–150 |
| Database (managed, HA) | $100–175 |
| Cache (managed Redis) | $50–100 |
| Load balancing | $20–35 |
| Networking (NAT, transfer) | $40–80 |
| Monitoring, logs, misc | $15–30 |
| Total | $300–570/month |
For context, commercial MDM solutions charge $3-8 per device per month. At 300 devices, that's $900-2,400/month; at the ~500 devices budgeted above, it's $1,500-4,000/month. Self-hosted infrastructure is cheaper at scale, but factor in engineering time for setup and maintenance.
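The comparison is simple arithmetic, sketched here so you can plug in your own numbers. The per-device price range and infrastructure range come from the figures above; the engineering hours and hourly rate in the second call are purely hypothetical placeholders.

```python
from typing import Dict, Tuple


def monthly_costs(devices: int,
                  saas_per_device: Tuple[float, float] = (3.0, 8.0),
                  self_hosted_infra: Tuple[float, float] = (300.0, 570.0),
                  engineering_hours: float = 0.0,
                  hourly_rate: float = 0.0) -> Dict[str, Tuple[float, float]]:
    """Return (low, high) monthly cost ranges for SaaS and self-hosted MDM."""
    saas = (devices * saas_per_device[0], devices * saas_per_device[1])
    labor = engineering_hours * hourly_rate  # ongoing maintenance time, if you count it
    hosted = (self_hosted_infra[0] + labor, self_hosted_infra[1] + labor)
    return {"saas": saas, "self_hosted": hosted}


print(monthly_costs(300))
# Hypothetical: 10 maintenance hours a month at $100/hour.
print(monthly_costs(500, engineering_hours=10, hourly_rate=100))
```

Adding even a modest amount of engineering time narrows the gap considerably, which is why the infrastructure bill alone is a misleading basis for the decision.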
Hidden costs to watch:
NAT gateway data transfer: If your MDM downloads software packages for deployment, that data goes through NAT. Large packages add up.
Database storage growth: Auto-scaling storage is convenient until the bill arrives. Monitor growth and archive old data.
Cross-region replication: If you enable DR (you should), data transfer between regions has costs.
Common Mistakes to Avoid
No staging environment: Deploying configuration changes directly to production is risky. MDM configuration is code — treat it like code. Have somewhere to test changes before they hit real devices.
Skipping the jump host: You won't need it until you desperately need it. Add it from day one.
Forgetting DR: What happens if your primary region has an outage? At a minimum, enable cross-region database backups. Better: full cross-region replication.
Over-engineering for day one: Start small. Burstable instances, minimal node counts, basic monitoring. Scale up when you have data showing you need to, not based on a hypothetical future load.
Under-engineering security: Private subnets for data stores, encryption everywhere, minimal security group rules. These aren't optional.
Infrastructure as Code
Whatever decisions you make, implement them in Terraform, Pulumi, CloudFormation, or your IaC tool of choice. Every resource should be in version control.
This matters for three reasons:
Reproducibility: When you need to rebuild (and you will eventually), it's a terraform apply, not a week of clicking.
Auditability: "What's deployed?" has an answer: look at the repo.
Change control: Infrastructure changes go through PR review, just like application code.
Key Takeaways
For engineers: Serverless containers + managed database + managed cache is a solid, low-maintenance stack. Start with IaC from day one. Over-engineer security, under-engineer scale.
For IT leaders: Budget $300-600/month for infrastructure supporting a few hundred devices. The real cost is engineering time for setup and ongoing maintenance.
For security: Dedicated VPC, private subnets for data stores, encryption everywhere, minimal attack surface. When auditors ask questions, simple architecture means simple answers.
What's Next
Infrastructure is the foundation, but it doesn't do anything by itself. In Part 2, I’ll cover how to manage configuration as code: GitOps workflows for policies, profiles, and queries, including what happens when declarative configuration deletes something you forgot to define.