How to build an enterprise MDM from scratch: Part 1

Part 1: Infrastructure — The Foundation

When people discuss MDM, they talk about policies. They talk about zero-trust. They talk about compliance dashboards and automated remediation. Nobody talks about the database that stores all of it. This is the first post in a series about building an enterprise MDM from scratch rather than configuring a SaaS product: actually deploying, operating, and scaling an MDM platform on your own infrastructure. This post covers infrastructure. It's not glamorous, but get it wrong and nothing else matters.

The Decision That Shapes Everything Else

Before writing a single line of Terraform, you need to answer one question: Where does this thing run?

You have three realistic options:

  1. Managed SaaS - Let the vendor handle infrastructure. You get a dashboard, and they handle uptime. This is what most companies do, and honestly, it's the right call for most companies. If you don't have specific requirements around data residency, logging integration, or deep customization, stop reading and go buy a SaaS solution. Seriously. The operational burden of self-hosting isn't worth it unless you need the control.

  2. On-prem - Run it in your own data center. Unless you already have mature data center operations, this path adds complexity without a clear benefit. You're managing hardware, power, cooling, and physical security on top of everything else.

  3. Self-hosted in the cloud - Deploy the platform on AWS, GCP, or Azure. You own the infrastructure, but you're not racking servers. You get control over your data and security posture without the physical infrastructure burden.

If your organization is considering self-hosted MDM, Option 3 is the sweet spot. The rest of this post assumes you're going that route.

Containers: The Compute Decision

Most modern MDM platforms ship as Docker containers. That gives you several deployment options:

Option A: Containers on VMs

Run Docker on EC2 instances (or equivalent). You manage the underlying hosts — patching, capacity, Docker daemon health.

This makes sense when:

  • Large deployments where cost optimization matters (1,000+ devices)

  • Teams with existing EC2 automation and expertise

  • Environments where you need SSH access to debug container issues

When to avoid:

  • Small teams without dedicated platform engineers

  • Organizations prioritizing operational simplicity over cost

Option B: Kubernetes (EKS, GKE, AKS)

Full container orchestration. Powerful, but complex.

When this makes sense:

  • You already run Kubernetes for other workloads

  • You need advanced deployment patterns (canary, blue-green)

  • You have platform engineers who live and breathe k8s

When to avoid:

  • Single-service workloads (MDM is typically one application)

  • Teams without Kubernetes expertise

  • When you're trying to reduce operational complexity, not add it

Option C: Serverless Containers (Fargate, Cloud Run)

You define the container, and the cloud provider runs it. No instances to manage.

When this makes sense:

  • Small to medium deployments (under 1,000 devices)

  • Teams prioritizing operational simplicity

  • When you want to eliminate patching and capacity management

When to avoid:

  • Cost-sensitive large deployments (20-30% premium over VMs)

  • Workloads requiring GPU or specialized hardware

  • When you need deep OS-level debugging access

The Recommendation

For most organizations deploying MDM for a few hundred devices, serverless containers hit the right balance. The cost premium is minimal at a small scale, and the operational simplicity is significant. You're not waking up at 2 AM because a Docker daemon crashed. If you're scaling to thousands of devices or have a mature platform team, VM-backed containers become more attractive for cost reasons.

Typical sizing: 0.5-1 vCPU and 2-4 GB RAM per container. MDM platforms tend to be memory-hungry (caching device state) but CPU-light. Start with 2 containers for redundancy, configure auto-scaling based on CPU, and let it grow as needed.
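As a sketch, the sizing above might look like this for ECS on Fargate. All names are illustrative, and the cluster, networking, and `var.mdm_image` are assumed to be defined elsewhere:

```hcl
# Illustrative Fargate sizing for an MDM server container.
resource "aws_ecs_task_definition" "mdm" {
  family                   = "mdm-server"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 512  # 0.5 vCPU — MDM tends to be CPU-light
  memory                   = 2048 # 2 GB — but memory-hungry (cached device state)
  container_definitions = jsonencode([{
    name         = "mdm"
    image        = var.mdm_image
    portMappings = [{ containerPort = 8080 }]
  }])
}

resource "aws_ecs_service" "mdm" {
  name            = "mdm-server"
  cluster         = aws_ecs_cluster.mdm.id
  task_definition = aws_ecs_task_definition.mdm.arn
  launch_type     = "FARGATE"
  desired_count   = 2 # two containers for redundancy
}

# Scale out on CPU, per the recommendation above.
resource "aws_appautoscaling_target" "mdm" {
  service_namespace  = "ecs"
  resource_id        = "service/${aws_ecs_cluster.mdm.name}/${aws_ecs_service.mdm.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 2
  max_capacity       = 6
}

resource "aws_appautoscaling_policy" "mdm_cpu" {
  name               = "mdm-cpu-target"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.mdm.service_namespace
  resource_id        = aws_appautoscaling_target.mdm.resource_id
  scalable_dimension = aws_appautoscaling_target.mdm.scalable_dimension
  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    target_value = 60
  }
}
```

Target tracking on average CPU keeps the policy simple: the service grows during policy pushes and shrinks back afterward without step-scaling tuning.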

The Database Question

MDM platforms store significant state: device inventory, policy results, query data, and user permissions. This needs a database, and the choice matters.

MySQL vs PostgreSQL

Check your platform's requirements. Some support both, some require one or the other. If you have flexibility:

  • PostgreSQL tends to have better query performance for complex analytics

  • MySQL has broader compatibility and simpler replication

Neither is wrong. Use what your team knows, or what your platform requires.

Standard Managed Database vs Enhanced Options

Cloud providers offer multiple tiers of managed databases. Using AWS as an example:

Standard RDS:

  • You pick instance size and storage

  • Multi-AZ adds a standby for failover

  • Failover takes 1-2 minutes

  • You manage read replicas manually

Aurora (or equivalent):

  • Storage auto-scales

  • Faster failover (under 30 seconds typical)

  • Automatic read replicas

  • ~20% cost premium

The Recommendation

For production MDM, lean toward the enhanced option (Aurora, Cloud SQL with HA, Azure Database with zone redundancy). The cost premium is worth it for:

  1. Faster failover: During database failover, your MDM platform can't write. Devices checking in get errors. Shorter failover == less disruption.

  2. Storage simplicity: Auto-scaling storage means you don't wake up to "disk full" alerts. MDM databases grow unpredictably based on query volume and device count.

  3. Operational peace of mind: When something goes wrong at 3 AM, you want the database that recovers fastest with least intervention.

Sizing: Start with burstable instances (t-class or equivalent). MDM database load is typically bursty and heavy during policy pushes, light otherwise. Burstable instances handle this pattern cost-effectively. Monitor CPU credit consumption; if you're consistently exhausting credits, upgrade to a fixed-performance instance.

Watch out for: Query result storage. If you run scheduled queries across hundreds of devices and store results, that data accumulates. Plan for 50-100 GB within the first year for a medium deployment.
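Put together, an Aurora MySQL cluster with a burstable writer and a warm reader is a reasonable starting point. A hedged sketch, with names and the subnet group assumed to exist elsewhere:

```hcl
# Sketch: Aurora MySQL with burstable instances (identifiers illustrative).
resource "aws_rds_cluster" "mdm" {
  cluster_identifier      = "mdm"
  engine                  = "aurora-mysql"
  master_username         = var.db_user
  master_password         = var.db_password
  storage_encrypted       = true
  backup_retention_period = 14
  db_subnet_group_name    = aws_db_subnet_group.mdm.name
  # Aurora storage auto-scales; there is no allocated_storage to outgrow.
}

resource "aws_rds_cluster_instance" "mdm" {
  count              = 2 # writer + reader, so failover has a warm target
  identifier         = "mdm-${count.index}"
  cluster_identifier = aws_rds_cluster.mdm.id
  engine             = aws_rds_cluster.mdm.engine
  instance_class     = "db.t4g.medium" # burstable; watch CPU credit consumption
}
```

Running two instances means failover promotes the existing reader rather than provisioning a replacement, which is what keeps failover under 30 seconds.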

Caching is More Important Than You Think

Here's something that surprises people: live queries typically require a caching layer. Live queries let you push ad-hoc queries to devices and collect results in real time: "Show me all devices running vulnerable software." This is incredibly valuable during incident response. The underlying technical requirement is pub/sub messaging: you publish a query, devices subscribe and respond, and the results are aggregated. Redis (or a compatible cache) handles this.

Options

  • In-memory (SQLite mode): Some platforms support running without an external cache for small deployments. Fine for testing, risky for production. You lose live query capability, and you're putting all state in a single container.

  • Managed Redis: ElastiCache, MemoryDB, Cloud Memorystore. Fully managed, automatic failover.

  • Self-managed Redis: Run Redis on EC2/VMs. More control, more operational burden.

The Recommendation

  • Use managed Redis: The cost is minimal (a small cluster runs $50-100/month), and the operational simplicity is worth it. You don't want to debug Redis replication issues when you should be investigating a security incident.

  • Sizing: Start small. A single node or small cluster handles thousands of devices. Redis is fast. Over-provisioning here is wasting money.

  • Edge case to plan for: Connection exhaustion during rapid container restarts. If your application containers restart frequently (bad deployment, health check issues), each new container opens Redis connections before old ones close. Configure connection pooling in your application, and set reasonable connection limits in Redis.
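A small managed replication group covers all of this. A sketch, assuming a subnet group defined elsewhere and illustrative names throughout:

```hcl
# Sketch: small managed Redis replication group with automatic failover.
resource "aws_elasticache_replication_group" "mdm" {
  replication_group_id       = "mdm-cache"
  description                = "Live query pub/sub and cache for MDM"
  engine                     = "redis"
  node_type                  = "cache.t4g.micro" # start small; Redis is fast
  num_cache_clusters         = 2                 # primary + one replica
  automatic_failover_enabled = true
  at_rest_encryption_enabled = true
  transit_encryption_enabled = true
  subnet_group_name          = aws_elasticache_subnet_group.mdm.name
}
```

Connection limits and pooling live on the application side (in the MDM platform's Redis client configuration), so pair this with sane pool settings in the containers.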

Networking: Boring But Critical

Networking decisions have security implications. Get them right upfront.

Dedicated VPC vs Shared VPC

Dedicated VPC for MDM:

  • Clear security boundary

  • Simple security group rules

  • Easy to explain to auditors

  • Slightly more overhead (separate NAT gateway, etc.)

Shared VPC with other services:

  • Cost savings (shared NAT gateway)

  • More complex security groups

  • Larger blast radius if something goes wrong

The Recommendation

Dedicated VPC: MDM is a security tool. It should be isolated. When auditors ask, "What can access the device management database?", you want a simple answer: "containers in this VPC, nothing else." The cost difference is minimal ($30-50/month for a separate NAT gateway), and the security clarity is significant.

Subnet Design

Keep it simple:

  • Public subnets: Load balancer, NAT gateway. These need internet access.

  • Private subnets: Everything else. Containers, database, cache. No direct internet access.

The load balancer is the only resource with a public IP. Everything else routes outbound through NAT (needed for reaching Apple's APNs, software download CDNs, etc.) but has no inbound internet path.
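The public/private split above is a few resources in Terraform. A sketch with illustrative CIDRs, assuming an `aws_availability_zones` data source and route tables defined elsewhere:

```hcl
# Sketch: dedicated VPC with public subnets (LB, NAT) and private subnets
# (containers, database, cache).
resource "aws_vpc" "mdm" {
  cidr_block           = "10.20.0.0/16"
  enable_dns_support   = true
  enable_dns_hostnames = true
}

resource "aws_subnet" "public" {
  count                   = 2
  vpc_id                  = aws_vpc.mdm.id
  cidr_block              = cidrsubnet(aws_vpc.mdm.cidr_block, 8, count.index)
  availability_zone       = data.aws_availability_zones.available.names[count.index]
  map_public_ip_on_launch = true # load balancer and NAT gateway only
}

resource "aws_subnet" "private" {
  count             = 2
  vpc_id            = aws_vpc.mdm.id
  cidr_block        = cidrsubnet(aws_vpc.mdm.cidr_block, 8, count.index + 10)
  availability_zone = data.aws_availability_zones.available.names[count.index]
  # No public IPs; outbound traffic (APNs, CDNs) routes via the NAT gateway.
}
```

Two of each subnet, spread across availability zones, is the minimum for the multi-AZ database and load balancer to do their jobs.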

Don't Forget the Jump Host

You will, at some point, need direct database access. Maybe to debug a query. Maybe to recover from a disaster. Maybe to run a migration. Plan for this from day one. A small bastion host in the VPC costs almost nothing and saves hours of scrambling during incidents. Configure it with minimal access (SSH only, from specific IPs), and have a runbook ready for when you need it.
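The bastion itself is a few lines. A sketch, with the AMI data source and `var.admin_cidrs` assumed to be defined elsewhere:

```hcl
# Sketch: minimal bastion, SSH restricted to known admin IPs (illustrative).
resource "aws_security_group" "bastion" {
  name   = "mdm-bastion"
  vpc_id = aws_vpc.mdm.id

  ingress {
    description = "SSH from admin IPs only"
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = var.admin_cidrs # e.g. office or VPN egress ranges
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_instance" "bastion" {
  ami                    = data.aws_ami.al2023.id
  instance_type          = "t4g.nano" # costs almost nothing sitting idle
  subnet_id              = aws_subnet.public[0].id
  vpc_security_group_ids = [aws_security_group.bastion.id]
}
```

Grant the database security group ingress from this security group only, so "who can reach the database" stays a one-line answer.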

Cost Expectations

Infrastructure cost for self-hosted MDM scales with device count, but not linearly. The base cost is relatively fixed; you need a database, a cache, containers, and load balancing regardless of whether you have 100 or 500 devices.

Rough budgeting for a ~500 device deployment:

Component                    Monthly Range
Compute (containers)         $75–150
Database (managed, HA)       $100–175
Cache (managed Redis)        $50–100
Load balancing               $20–35
Networking (NAT, transfer)   $40–80
Monitoring, logs, misc       $15–30
Total                        $300–570/month

For context, commercial MDM solutions charge $3-8 per device per month. At 500 devices, that's $1,500-4,000/month. Self-hosted infrastructure is cheaper at scale, but factor in engineering time for setup and maintenance.

Hidden costs to watch:

  • NAT gateway data transfer: If your MDM downloads software packages for deployment, that data goes through NAT. Large packages add up.

  • Database storage growth: Auto-scaling storage is convenient until the bill arrives. Monitor growth and archive old data.

  • Cross-region replication: If you enable DR (you should), data transfer between regions has costs.

Common Mistakes to Avoid

  • No staging environment: Deploying configuration changes directly to production is risky. MDM configuration is code — treat it like code. Have somewhere to test changes before they hit real devices.

  • Skipping the jump host: You won't need it until you desperately need it. Add it from day one.

  • Forgetting DR: What happens if your primary region has an outage? At a minimum, enable cross-region database backups. Better: full cross-region replication.

  • Over-engineering for day one: Start small. Burstable instances, minimal node counts, basic monitoring. Scale up when you have data showing you need to, not based on a hypothetical future load.

  • Under-engineering security: Private subnets for data stores, encryption everywhere, minimal security group rules. These aren't optional.

Infrastructure as Code

Whatever decisions you make, implement them in Terraform, Pulumi, CloudFormation, or your IaC tool of choice. Every resource should be in version control.

This matters for three reasons:

  1. Reproducibility: When you need to rebuild (and you will eventually), it's a terraform apply, not a week of clicking.

  2. Auditability: "What's deployed?" has an answer: look at the repo.

  3. Change control: Infrastructure changes go through PR review, just like application code.
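One piece that's easy to forget: state. A versioned remote backend with locking is what makes "look at the repo" a true answer and keeps two engineers from applying over each other. A sketch with illustrative bucket and table names:

```hcl
# Sketch: versioned remote state with locking (names illustrative).
terraform {
  backend "s3" {
    bucket         = "example-mdm-terraform-state"
    key            = "mdm/infrastructure.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks" # state locking for concurrent runs
  }
}
```

Enable versioning on the bucket so a corrupted or mistakenly overwritten state file is a rollback, not a rebuild.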

Key Takeaways

  • For engineers: Serverless containers + managed database + managed cache is a solid, low-maintenance stack. Start with IaC from day one. Over-engineer security, under-engineer scale.

  • For IT leaders: Budget $300-600/month for infrastructure supporting a few hundred devices. The real cost is engineering time for setup and ongoing maintenance.

  • For security: Dedicated VPC, private subnets for data stores, encryption everywhere, minimal attack surface. When auditors ask questions, simple architecture means simple answers.

What's Next

Infrastructure is the foundation, but it doesn't do anything by itself. In Part 2, I'll cover managing configuration as code: GitOps workflows for policies, profiles, and queries, including what happens when declarative configuration deletes something you forgot to define.
