How to build an enterprise MDM from scratch: Part 5
Logging, Observability & Lessons Learned for Operating MDM at Scale
The MDM is now built. The infrastructure is running, GitOps is deployed, identity is connected, policies are enforced, and devices are enrolled. Now you have to operate it, which means getting the visibility you need to run MDM effectively: what to log, what to monitor, and how to debug problems. It also means absorbing the lessons that only come from operating the system: the failures, the surprises, and what we'd do differently.
Logging
MDM generates a lot of data: device inventory, policy results, query responses, enrollment events, and admin actions. The question isn't whether to log; it's what to keep, where to send it, and how long to retain it.
Audit Logs: Every action taken by admins must be captured. Policy changes, profile deployments, software pushes, device commands like wipe or lock, user permission changes, and configuration modifications. This is your forensic trail because when something goes wrong or when security asks "who did what, when?" the audit logs provide the answer. Plan for at least a year of retention. Compliance requirements may dictate longer.
Device Activity Logs: What devices are doing tells you whether operations are healthy. Check-in timestamps, policy evaluation results, software installation status, errors, and enrollment events. Why did that device fail its policy check? When did it last connect? What happened during enrollment? These questions come up constantly, and detailed logs are the only way to answer them efficiently. Retain 30 to 90 days of detailed logs. Aggregate metrics can live longer.
Query Results: If you're running scheduled queries across your fleet for software inventory, user accounts, running processes, network connections, and security tool status, the results are valuable security telemetry. What's actually running on devices? What changed since yesterday? Retention depends on each use case. Security investigations benefit from longer retention. Compliance audits may require specific timeframes.
Where to Send Logs: Logs sitting inside your MDM platform are useful. Logs integrated with a broader observability stack are powerful. Most platforms store logs internally, which is fine for basic visibility but limited for correlation. For most organizations, the recommendation is straightforward: stream critical logs to wherever your security team already looks. If security lives in Splunk, send MDM audit logs to Splunk. If you're building a data lake, include MDM telemetry. Don't create another silo.

For detailed device activity and query results, a data lake approach often makes sense: the volume is high, queries are ad-hoc, and the cost of SIEM ingestion for everything is prohibitive. Stream to S3, BigQuery, or equivalent for cheaper long-term storage that is queryable with SQL and flexible for custom dashboards. Implementation options include webhooks to a streaming service, native integrations that some platforms offer with major log destinations, or agent-based collection for query results and device telemetry. Use native integrations when available; it's less custom code to maintain.
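As a concrete sketch of the webhook option: a tiny relay that accepts audit events from the MDM as JSON and writes them to date-partitioned, newline-delimited JSON in S3, ready for Athena or any SQL engine. The endpoint path, bucket name, and payload shape are assumptions; your platform's webhook format will differ, and a native integration or a managed stream like Kinesis Firehose replaces all of this if you have one.

```python
# Minimal webhook relay: accepts MDM audit events as JSON and writes them to
# S3 as newline-delimited JSON, partitioned by date. The /mdm/audit path,
# bucket name, and event fields are placeholders for illustration.
import json
import uuid
from datetime import datetime, timezone

import boto3
from flask import Flask, request

app = Flask(__name__)
s3 = boto3.client("s3")
BUCKET = "example-mdm-logs"  # assumption: your log bucket


@app.route("/mdm/audit", methods=["POST"])
def relay_audit_event():
    events = request.get_json(force=True)
    if isinstance(events, dict):  # some platforms send one event per call
        events = [events]
    now = datetime.now(timezone.utc)
    key = f"audit/dt={now:%Y-%m-%d}/{now:%H%M%S}-{uuid.uuid4().hex}.ndjson"
    body = "\n".join(json.dumps(e) for e in events)
    s3.put_object(Bucket=BUCKET, Key=key, Body=body.encode("utf-8"))
    return {"stored": len(events), "key": key}, 200


if __name__ == "__main__":
    app.run(port=8080)
```

The date partitioning is the part worth keeping regardless of the transport: it keeps ad-hoc queries cheap and makes retention enforcement a matter of deleting old prefixes.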
Log Pitfalls:
Logging everything but querying nothing: Logs only have value if someone looks at them. Define what questions you're trying to answer, and then ensure you're collecting the right data.
No retention policy: Storage isn't free, so define retention periods upfront and automate deletion.
Sensitive data in logs: Query results can contain usernames, file paths, and potentially sensitive data, so understand what you're logging and mask or exclude sensitive fields (see the masking sketch after this list).
Log format inconsistency: If logs from different sources have different formats, correlation becomes painful. Normalize early in the pipeline.
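For the sensitive-data pitfall above, a minimal masking pass might look like the following. The field names are examples, and hashing (rather than dropping) is just one option: it keeps rows correlatable without storing the raw value, though a keyed HMAC would be needed if plain hashes still count as identifying in your environment.

```python
# Sketch: mask or drop sensitive fields from query results before they leave
# the pipeline. The field names below are examples; adjust to what your
# queries actually return.
import hashlib

SENSITIVE_FIELDS = {"username", "email"}              # hash so rows stay correlatable
DROPPED_FIELDS = {"home_directory", "command_line"}   # drop entirely


def scrub(row: dict) -> dict:
    clean = {}
    for key, value in row.items():
        if key in DROPPED_FIELDS:
            continue
        if key in SENSITIVE_FIELDS and value is not None:
            clean[key] = hashlib.sha256(str(value).encode()).hexdigest()[:16]
        else:
            clean[key] = value
    return clean


if __name__ == "__main__":
    print(scrub({"username": "jdoe", "hostname": "laptop-042",
                 "home_directory": "/Users/jdoe", "uid": 501}))
```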
Observability
Logs tell you what happened. Observability tells you when something is wrong, ideally before users notice.
Platform Health: Monitor the infrastructure you built, such as response times, database performance, container health, certificate expiration, and queue depths. If the MDM platform itself is unhealthy, everything else breaks.
Device Fleet Health: This tells you whether the fleet is healthy as a whole. Device check-in rates, enrollment success and failure rates, policy compliance percentages, pending command queues, and profile installation success rates. A drop in any of these metrics deserves investigation.
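A rough sketch of how these fleet metrics can be computed from device records, assuming each record exposes a last-seen timestamp and policy pass counts; substitute whatever your platform's API or database actually returns.

```python
# Sketch: compute fleet-level health metrics from device records. The record
# shape (last_seen, policies_passed, policies_total) is an assumption.
from datetime import datetime, timedelta, timezone

CHECKIN_WINDOW = timedelta(hours=24)


def fleet_health(devices: list[dict]) -> dict:
    now = datetime.now(timezone.utc)
    total = len(devices)
    checked_in = sum(1 for d in devices if now - d["last_seen"] <= CHECKIN_WINDOW)
    compliant = sum(1 for d in devices if d["policies_passed"] == d["policies_total"])
    return {
        "fleet_size": total,
        "checkin_rate_pct": round(100 * checked_in / total, 1) if total else 0.0,
        "compliance_pct": round(100 * compliant / total, 1) if total else 0.0,
    }


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    sample = [
        {"last_seen": now - timedelta(hours=2), "policies_passed": 10, "policies_total": 10},
        {"last_seen": now - timedelta(days=3), "policies_passed": 9, "policies_total": 10},
    ]
    print(fleet_health(sample))
```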
Anomalies: Some signals warrant immediate attention. A sudden drop in check-ins might indicate a network issue or an agent crash. A spike in policy failures could mean a bad configuration push. Unusual enrollment patterns might signal unauthorized devices. Mass unenrollment could be an attack or a misconfiguration. Watch for these patterns.
Alerting Strategy: Not everything needs to page someone. Think in tiers.
For immediate response: platform unreachable, database down, no devices checking in, certificate expired, mass unenrollment detected. These are "wake someone up" events.
For business hours: check-in rates below threshold, policy compliance dropping, enrollment failures elevated, and integration errors. These need attention, but can wait until morning.
For weekly review: slow trends in compliance, devices not seen in seven or more days, software deployment success rates, and capacity metrics. These inform longer-term decisions.
The most important rule: don't alert on everything. Alert fatigue is real; if everything alerts, nothing alerts. Be ruthless about what deserves a notification.
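One way to keep the tiers honest is to encode them next to the rules themselves, so every new alert has to declare which tier it belongs to. A minimal sketch, with placeholder thresholds that are examples rather than recommendations:

```python
# Sketch of tiered alert routing: each rule names a condition and a tier, and
# only the "page" tier wakes someone up. Thresholds are illustrative.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    tier: str  # "page" | "business_hours" | "weekly_review"
    breached: Callable[[dict], bool]


RULES = [
    Rule("platform unreachable", "page", lambda m: not m["platform_up"]),
    Rule("no devices checking in", "page", lambda m: m["checkins_last_hour"] == 0),
    Rule("check-in rate below threshold", "business_hours", lambda m: m["checkin_rate_pct"] < 90),
    Rule("compliance dropping", "business_hours", lambda m: m["compliance_pct"] < 95),
    Rule("devices unseen 7+ days", "weekly_review", lambda m: m["stale_devices"] > 0),
]


def evaluate(metrics: dict) -> dict[str, list[str]]:
    routed: dict[str, list[str]] = {"page": [], "business_hours": [], "weekly_review": []}
    for rule in RULES:
        if rule.breached(metrics):
            routed[rule.tier].append(rule.name)
    return routed


if __name__ == "__main__":
    print(evaluate({"platform_up": True, "checkins_last_hour": 412,
                    "checkin_rate_pct": 87.5, "compliance_pct": 96.2,
                    "stale_devices": 14}))
```

Routing the "page" list to on-call and the rest to a channel or a weekly report then becomes a deliberate decision per rule, not a default.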
Dashboards: Build dashboards for different audiences.
An operations dashboard with platform health, check-in rates, error rates, and queue depths. This is for the team running the platform day to day.
A security dashboard with policy compliance rates, encryption status, security tool deployment, and failed authentication attempts. This is for security reviews and audit preparation.
An executive dashboard with fleet size over time, a single compliance percentage, incidents this month, and migration progress. This is for leadership updates; keep it simple.
Observability Pitfalls
Watch for vanity metrics. "Number of policies deployed" isn't useful. "Percentage of devices compliant with critical policies" is.
Establish baselines before setting alert thresholds: you can't detect anomalies without knowing what normal looks like.
And don't just monitor the current state; monitor the rate of change. A metric that's fine today but trending worse will become a problem.
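A minimal trend check along those lines: compare today's value to a short rolling baseline and flag the drop even while the absolute number still looks acceptable. The seven-day window and five-point threshold are arbitrary examples; most metrics backends can express the same idea natively.

```python
# Sketch: flag a metric that is still "fine" today but trending worse, using a
# simple rolling mean of the previous days as the baseline.
from statistics import mean


def trend_alert(history: list[float], warn_drop_pct: float = 5.0) -> str | None:
    """history is oldest-to-newest daily values, e.g. compliance percentages."""
    if len(history) < 8:
        return None                      # not enough data to form a baseline
    baseline = mean(history[-8:-1])      # previous 7 days
    today = history[-1]
    drop = baseline - today
    if drop >= warn_drop_pct:
        return f"down {drop:.1f} pts vs 7-day baseline ({baseline:.1f} -> {today:.1f})"
    return None


if __name__ == "__main__":
    # 97% compliance all week, 91% today: still a "green" number, but trending down.
    print(trend_alert([97.0, 97.1, 96.9, 97.2, 97.0, 96.8, 97.1, 91.0]))
```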
Lessons Learned
Every system teaches lessons. MDM teaches them through outages, support tickets, and late-night debugging.
Declarative Systems Delete Aggressively
We covered this in Part 2, but it bears repeating because it's the most common source of incidents. When your configuration is declarative, the repo is the truth, and anything not in the repo gets removed: settings configured before GitOps was set up, manual changes made in the console, integrations added through the UI. The fix is discipline: everything in the repo, nothing in the console. But the lesson is usually learned the hard way.
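A cheap guardrail is to compute, before every apply, exactly what the declarative sync would delete and force a human to acknowledge it. A sketch of the idea, using policy names as stand-ins for whatever objects your platform manages; how you fetch the live state depends on your platform's API.

```python
# Sketch: pre-apply deletion check. Anything that exists live but is not in
# the repo will be removed by a declarative apply, so surface it first.

def deletions(repo_policies: set[str], live_policies: set[str]) -> set[str]:
    """Objects present on the platform but absent from the repo."""
    return live_policies - repo_policies


if __name__ == "__main__":
    repo = {"disk-encryption", "screen-lock", "os-min-version"}
    live = {"disk-encryption", "screen-lock", "os-min-version", "vpn-profile"}
    doomed = deletions(repo, live)
    if doomed:
        print(f"APPLY WOULD DELETE: {sorted(doomed)}")
        print("Add these to the repo or confirm the deletion explicitly.")
```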
Migration Is the Riskiest Phase
Moving from a legacy MDM to a new one has more failure modes than any other operation. Devices get stuck between MDMs. Removal scripts fail silently. Enrollment doesn't trigger. Users miss communications and panic. Applications break without the old MDM's configurations. What helps: verify each step before proceeding to the next, build automation that waits for confirmation, communicate early and often, have a rollback plan even if you don't use it, and budget more time than you think you need.
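The "automation that waits for confirmation" point can be as simple as a polling gate between steps: don't remove the old management profile until the device has actually appeared in the new MDM. A sketch, with the two platform checks left as placeholders for whatever your old and new MDMs expose.

```python
# Sketch: gate each migration step on verified state rather than assumptions.
import time


def wait_for(check, timeout_s: int = 1800, interval_s: int = 30) -> bool:
    """Poll check() until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval_s)
    return False


def migrate_device(serial: str, enrolled_in_new_mdm, remove_old_profile) -> bool:
    # Step 1: confirm the device actually enrolled in the new MDM.
    if not wait_for(lambda: enrolled_in_new_mdm(serial)):
        print(f"{serial}: never appeared in new MDM, leaving old MDM in place")
        return False
    # Step 2: only then remove the legacy management profile.
    remove_old_profile(serial)
    print(f"{serial}: migrated")
    return True


if __name__ == "__main__":
    migrate_device("C02EXAMPLE", enrolled_in_new_mdm=lambda s: True,
                   remove_old_profile=lambda s: print(f"{s}: old profile removed"))
```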
The Agent Is the Single Point of Failure
MDM only works if devices can reach the platform, which makes the agent on each device critical path. When agents crash, get uninstalled, or can't reach the server, policies don't update, software doesn't deploy, and you lose visibility. Monitor agent health as a first-class metric. Alert on check-in gaps. Have a reinstallation path for broken agents. Understand what can cause failures: OS updates, full disks, and network policy changes.
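Check-in gaps are easy to turn into an actionable list rather than a single number. A sketch that buckets silent devices by how long they've been gone, with example thresholds; the oldest bucket is effectively your reinstallation queue.

```python
# Sketch: bucket devices by check-in gap. Thresholds are illustrative.
from datetime import datetime, timedelta, timezone


def checkin_gaps(devices: list[dict], now: datetime | None = None) -> dict[str, list[str]]:
    now = now or datetime.now(timezone.utc)
    buckets: dict[str, list[str]] = {"silent_24h": [], "silent_7d": [], "silent_30d": []}
    for d in devices:
        gap = now - d["last_seen"]
        if gap > timedelta(days=30):
            buckets["silent_30d"].append(d["serial"])   # likely needs agent reinstall
        elif gap > timedelta(days=7):
            buckets["silent_7d"].append(d["serial"])
        elif gap > timedelta(hours=24):
            buckets["silent_24h"].append(d["serial"])
    return buckets


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(checkin_gaps([
        {"serial": "A1", "last_seen": now - timedelta(hours=3)},
        {"serial": "B2", "last_seen": now - timedelta(days=9)},
        {"serial": "C3", "last_seen": now - timedelta(days=45)},
    ], now))
```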
OS Updates Break Things
Every major OS update is a risk: MDM agent compatibility, profile format changes, API deprecations, new permission requirements, and changed system behaviors. Test OS betas on dedicated devices. Don't force OS updates immediately after release. Monitor forums and release notes. Build a delay between OS release and enforcement.
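The release-to-enforcement delay can be encoded directly: a version only becomes the enforced minimum once it has been public for a grace period. A sketch with an illustrative 30-day window and hand-maintained release dates.

```python
# Sketch: enforce a minimum OS version only after a grace period from its
# release date. The dates and 30-day window are illustrative.
from datetime import date, timedelta

RELEASES = {                 # version -> release date, maintained by you
    "14.6": date(2024, 7, 29),
    "14.7": date(2024, 9, 16),
}
GRACE = timedelta(days=30)


def _version_key(v: str) -> tuple[int, ...]:
    return tuple(int(part) for part in v.split("."))


def enforced_minimum(today: date | None = None) -> str | None:
    today = today or date.today()
    eligible = [v for v, released in RELEASES.items() if today - released >= GRACE]
    return max(eligible, key=_version_key, default=None)


if __name__ == "__main__":
    # 14.7 is only 15 days old on this date, so 14.6 stays the enforced minimum.
    print(enforced_minimum(date(2024, 10, 1)))
```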
Users Will Surprise You
No matter how clear your policies, someone will find a workaround. Someone will uninstall the agent. Someone will ignore every notification. Someone will have a legitimate exception you didn't anticipate. Design for human behavior, not ideal behavior. Build exception processes before you need them. Make the right thing the easy thing. And listen to feedback: persistent complaints indicate design problems, not difficult users.
Documentation Matters More Than You Think
When something breaks at 2 AM, you need answers fast. How do you access the database? Where are the credentials? What's the rollback procedure? Who has access to what? Write runbooks for common operations. Document incident response procedures. Maintain architecture documentation. Define on-call rotation and escalation paths. Write all of this before you need it. You won't have time during an incident.
Start Simple, Add Complexity Later
It's tempting to build the perfect system from day one. Elaborate team structures, comprehensive policy sets, complex automation, and integrations with everything. This creates fragility. Each component is another thing that can break, another thing to maintain. Start with a minimal viable configuration. Add complexity only when needed. Justify each addition. Remove what isn't providing value. A simple system you understand beats a complex system you don't.
Security Tools Need Security
MDM has privileged access to every managed device. It can push software, run scripts, and wipe devices. If compromised, the impact is severe. MFA on admin accounts should be mandatory, no exceptions. Minimal permissions: not everyone needs admin. Audit logging enabled and reviewed. API tokens rotated regularly. Network access restricted where possible. GitOps for change control. The platform that manages security needs the highest security standards.
Communication Is an Underrated Skill
Technical implementation is half the job. The other half is organizational. Explaining why policies exist. Preparing users for changes. Managing stakeholder expectations. Navigating exceptions and edge cases. Building trust with users. Security that users understand and accept is more effective than security they resent and circumvent.
Perfect Is the Enemy of Deployed
You will never have 100% policy compliance, zero support tickets, every edge case handled, complete documentation, or flawless automation. What matters is a good enough security baseline, improving over time, learning from incidents, and responding to feedback. Ship, monitor, iterate. Don't wait for perfect.
Wrapping Up the Series
Over five posts, we've covered building enterprise MDM from scratch.
Infrastructure: the cloud architecture that runs your platform.
GitOps: configuration as code and the declarative trap.
Identity and policy: who gets access and what gets enforced.
Profiles, teams, migration, and software: the operational reality.
And finally, logging, observability, and lessons learned: running it at scale.
This isn't a weekend project. The commercial solutions work for most organizations. If you've read this entire series, you're probably not in most organizations.
Build thoughtfully. Monitor aggressively. Learn continuously. That's the series. Thanks for reading.