How Internal Developer Platforms Give Engineering Teams Full Observability Without Manually Configuring Grafana
Why per-environment observability breaks down with manual setup, and how an internal developer platform fixes it.
You have three environments: dev, staging, production. Each one has a Grafana instance someone set up months ago. Staging has different dashboard versions than production because whoever set it up used a different Helm chart version. The dev Prometheus stopped scraping correctly after a node replacement, but nobody noticed until a bug in dev could not be reproduced in staging. The production Grafana still has the default admin password because rotating it means logging into a cluster nobody touches unless something breaks.
This is not unusual. It is the default state for teams that treat observability as something you configure after provisioning an environment, rather than part of provisioning itself.
TL;DR
The Prometheus + Loki + Grafana stack covers infrastructure metrics and application logs automatically. Custom application metrics still require code-level instrumentation.
Multi-environment observability breaks down not at setup, but over time — through version drift, storage misconfiguration, and scrape failures that produce silence instead of errors.
An internal developer platform that provisions observability at environment creation removes per-environment setup work and the drift that follows.
Infrastructure metrics (CPU, memory, pod restarts) are not the same as application metrics. A pod with normal CPU usage can be returning 500s on 30% of requests. Infrastructure metrics will not surface that.
For BYOC distribution and single-tenant enterprise deployments, centralized observability creates data residency problems. Per-environment, in-cluster observability fits better.
The ops.json instrumentation pattern reduces the friction of surfacing custom application metrics in Grafana without writing ServiceMonitor CRDs.
What Does “Built-In Observability” Actually Mean in an IDP
The phrase gets used loosely. It can mean anything from a link to your existing Datadog account to a fully provisioned monitoring stack running inside the same Kubernetes cluster as your application. Those are not equivalent.
The standard open-source stack has three components.
Prometheus is a time-series metrics scraper. It pulls data from instrumented endpoints and from Kubernetes system components at a configured interval. Two categories of data worth separating: infrastructure metrics (CPU utilization, memory pressure, pod restart counts, deployment replicas) collected automatically via kube-state-metrics and node-exporter, and application metrics (request rate, error rate, queue depth, cache hit ratio) which require explicit instrumentation in the application code. Most observability content blurs this line. The distinction matters when something goes wrong.
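Once kube-state-metrics and node-exporter are scraping, the infrastructure side reduces to PromQL queries over standard metric names. A few illustrative examples (the metric names below are the ones those exporters emit by default):

```promql
# Pod restarts over the last hour, per namespace (kube-state-metrics)
sum(increase(kube_pod_container_status_restarts_total[1h])) by (namespace)

# Fraction of node memory still available (node-exporter)
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

# Deployments with fewer available replicas than desired
kube_deployment_spec_replicas - kube_deployment_status_replicas_available
```

No application changes are needed for any of these, which is exactly why they are also the queries that say nothing about whether the application itself is behaving.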
Loki is a log aggregator. It runs a DaemonSet-based agent called Promtail on each Kubernetes node, collecting container stdout and stderr. Unlike Elasticsearch, Loki indexes only labels — not log content. Log lines are grouped into streams and tagged with labels like pod name, namespace, and container. That keeps storage costs lower and makes logs available for querying within seconds of ingestion.
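Because only the labels are indexed, a LogQL query starts by selecting a stream via labels and then filters the unindexed line content. Two illustrative queries (the label values here are placeholders, though `namespace`, `pod`, and `container` are labels Promtail attaches by default):

```logql
# Select streams by indexed labels, then grep the raw log lines
{namespace="production", container="api"} |= "timeout"

# Count error lines per pod over 5-minute windows
sum by (pod) (count_over_time({namespace="production"} |= "level=error" [5m]))
```

The first pattern is fast even over large volumes precisely because the label match narrows the streams before any content filtering happens.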
Grafana connects to both Prometheus and Loki as data sources and is where the actual debugging happens. You see a CPU spike in Prometheus, jump to the same time window in Loki, read the log lines from the pod that caused it. Without that correlation, you are looking at two separate interfaces with no direct link.
One thing this stack does not cover: distributed tracing. Following a request through five microservices requires Tempo or Jaeger plus OpenTelemetry instrumentation — a separate layer entirely. Teams that set up Prometheus and Loki and consider themselves done will eventually hit a slow request they cannot explain, because they have no way to trace it through the service graph.
Why Multi-Environment Observability Breaks Down Without a Platform
The initial setup takes a day or two. The maintenance never stops.
Environment drift
Prometheus Operator 0.62 on production, 0.58 on staging, because someone updated one chart and not the other. Different scrape configurations. Different alert rule formats. Dashboards exported from production as JSON fail on import to staging because label names shifted between versions. You find this out during an incident when you need staging to tell you something and it cannot.
The less visible failure mode is this: a developer ships a new feature and moves on. No metrics added, no log events instrumented, no alerts defined. The feature runs in production with no visibility into how it actually behaves under load. The gap does not surface until something breaks, at which point the investigation starts from scratch with no baseline to work from.
The node replacement problem
Prometheus stores time-series data on the local disk of whichever node it lands on. In an EKS managed node group, nodes get replaced during version updates or when the ASG replaces an unhealthy instance. If there is no PersistentVolumeClaim backed by an EBS volume that survives pod restarts, the metrics history is gone. Production setups require specifying a StorageClass like gp3 with a persistent volume claim to keep data across restarts. Most Helm-based tutorials skip this step. Teams learn it the first time a cluster update wipes two weeks of metrics.
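With the kube-prometheus-stack Helm chart, the fix is a values override along these lines. This is a sketch, not a recommendation: the `gp3` StorageClass name, the 50Gi size, and the 15-day retention are placeholders to adapt to your workload:

```yaml
# values.yaml for kube-prometheus-stack: back Prometheus with a PVC
# so metrics history survives pod rescheduling and node replacement
prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
```

Without the `storageSpec` block, Prometheus defaults to an emptyDir, which is exactly the configuration that loses data when the node goes away.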
The scrape configuration failure mode
When Prometheus fails to scrape a target, it does not throw an error. It just has no data. An engineer adds a ServiceMonitor, deploys a service, sees nothing in Grafana, and spends 45 minutes checking RBAC permissions, port declarations, pod annotations, and label selector mismatches — all of which produce the same symptom. Silence. It is one of the more tedious debugging loops in Kubernetes.
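One check that shortens this loop: Prometheus exports a synthetic `up` series for every target it has discovered, so you can distinguish a target that is failing from one that was never discovered at all. The job name below is illustrative:

```promql
# 1 if the last scrape succeeded, 0 if it failed.
# If the series is absent entirely, discovery never happened --
# which usually points at the ServiceMonitor or its label selectors.
up{job="api"}

# Count currently failing targets, per job
count by (job) (up == 0)
```

An absent `up` series versus an `up` of 0 splits the 45-minute search space roughly in half: absent means discovery (ServiceMonitor, selectors, RBAC), zero means the scrape itself (port, path, pod health).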
Per-environment provisioning cost
Every new environment is another Prometheus, Loki, and Grafana to install, configure, and connect. Dev, staging, production, a canary, and a few customer-dedicated deployments add up to a non-trivial maintenance surface.
LocalOps handles this by provisioning Prometheus, Loki, and Grafana as Kubernetes companion deployments inside every environment, running in the same cluster as the application. It does not eliminate the underlying tools or their operational characteristics, but it removes the per-environment setup and the drift that accumulates when configuration is manual.
How IDPs Handle Grafana and Prometheus Monitoring Out of the Box
“Out of the box” can mean a lot of things. The architecture is worth being specific about.
In a provisioned environment using EKS on AWS, an internal developer platform gives you Prometheus, Loki, and Grafana running as Kubernetes workloads inside the same VPC and EKS cluster as your application services. Prometheus scrapes internal endpoints. Loki receives logs from Promtail on each node. Grafana comes up with both already registered as data sources.
The co-location has practical consequences beyond tidiness. Prometheus scrape calls stay internal to the cluster. Loki log ingestion happens over the cluster network. For BYOC and single-tenant deployments, this matters: log and metric data does not need to cross the VPC boundary. That is often a hard requirement in enterprise procurement, not a preference.
LocalOps follows this model — Prometheus, Loki, and Grafana are provisioned as companion Kubernetes deployments inside every environment, with data sources pre-registered, running in the same cluster as the application.
“Pre-configured” means Grafana has Prometheus and Loki registered as data sources before anyone logs in. No separate step where someone types the Prometheus service name into the UI, tests the connection, and troubleshoots an unhelpful error because the URL had a typo. That step is where a significant portion of manual setups fail.
Worth being clear about one thing: infrastructure metrics and logs are available immediately. Custom application metrics still require instrumentation at the code level. The platform handles the plumbing; it does not write your /metrics endpoint for you.
Infrastructure Metrics vs. Application Metrics
A Grafana dashboard showing CPU, memory, and pod restart counts does not mean your application is working correctly. A pod running at 40% CPU with zero restarts can still return database timeout errors on every third request. Infrastructure metrics will not show that. Prometheus will not show it unless someone instrumented the application.
Infrastructure metrics are collected automatically once Prometheus is running: node CPU, memory, pod phase, deployment replicas, persistent volume capacity. They tell you whether the cluster is healthy. They do not tell you whether the application is doing what users expect.
Application metrics require the application to expose a /metrics endpoint using a Prometheus client library. Go, Java, Python, Ruby, and Rust have official libraries; Node.js and others are covered by community-supported ones. The application defines counters, gauges, and histograms and exposes them at the endpoint. Prometheus scrapes it on a configured interval.
The metrics that reflect actual user experience sit in this second category. Request rate per endpoint. p95 and p99 latency. Background job processing time. External API failure rate. Queue depth. These are the signals that tell you whether the system is working from a user’s perspective — not just from the cluster’s.
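To make the instrumentation step concrete, here is a minimal sketch of a /metrics endpoint using only the Python standard library, so the text exposition format Prometheus scrapes is visible. A real service would use an official client library (prometheus_client in Python) rather than hand-rolling the format, and the endpoint names and counts below are illustrative:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading
import urllib.request

# Hypothetical in-process counters; a real app increments these per request.
REQUESTS_TOTAL = {"/checkout": 42, "/login": 7}

def render_metrics():
    # Prometheus text exposition format: HELP/TYPE headers, then samples.
    lines = [
        "# HELP http_requests_total Total HTTP requests by endpoint.",
        "# TYPE http_requests_total counter",
    ]
    for path, count in sorted(REQUESTS_TOTAL.items()):
        lines.append(f'http_requests_total{{endpoint="{path}"}} {count}')
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep the demo quiet

if __name__ == "__main__":
    # Serve on an ephemeral port, scrape ourselves once, shut down.
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_port}/metrics"
    print(urllib.request.urlopen(url).read().decode())
    server.shutdown()
```

Everything after this point — getting Prometheus to actually scrape that endpoint — is the plumbing the rest of this section is about.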
Getting application metrics into Grafana manually involves: instrumenting the application with a client library, exposing /metrics, creating a ServiceMonitor CRD, matching label selectors to the correct Prometheus instance, verifying RBAC for cross-namespace access, and confirming scrape status in the Prometheus targets UI. When any of these steps is wrong, the metrics do not appear. There is no error that tells you which step failed.
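Done by hand, the ServiceMonitor step looks roughly like this. The `release` label and the port name are the usual culprits in the silent-failure loop above; all names here are placeholders:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-metrics
  namespace: production
  labels:
    release: kube-prometheus-stack  # must match Prometheus's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: api                      # must match the Service's labels, not the Pod's
  endpoints:
    - port: http-metrics            # must match a named port on the Service
      path: /metrics
      interval: 15s
```

Each commented line is an independent way for the scrape to silently not happen.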
Here’s how LocalOps handles this with a declaration in ops.json:
```json
{
  "metrics": {
    "endpoint": "/metrics",
    "interval": 15
  }
}
```
The platform registers the endpoint with Prometheus. The custom metrics appear in Grafana. The ServiceMonitor, RBAC binding, and Prometheus configuration are handled without a separate debugging loop.
If you want to walk through this on your own codebase, our engineers can show you how it works in a live environment.
How Do You Give Developers Log and Metric Access Without Exposing Cloud Infrastructure?
The obvious answers create problems. Giving developers kubectl access to production means a large blast radius. Granting broad IAM read access to the AWS account raises audit and compliance concerns. Neither is a clean answer.
Grafana works as the access layer because it separates observability from infrastructure control. Developers query logs in LogQL and metrics in PromQL through the Grafana UI. They get visibility into system behavior without needing direct cluster or cloud console access.
Through Grafana, a developer can see application logs from their services, infrastructure metrics for the cluster, deployment timestamps, and resource utilization. With proper access controls configured, they cannot touch the Kubernetes control plane, access cloud account credentials, or see data from unrelated services or environments.
The separation matters for a specific reason: observability tells you what happened. It does not give you the ability to change anything. Developers can investigate. Operational control stays restricted.
This carries over to customer-dedicated environments. A vendor’s engineering team can access Grafana for a specific customer environment, review logs and metrics, and debug a support issue — without needing IAM access to the customer’s AWS account. The observability layer has the data. The underlying infrastructure stays isolated.
Per-Customer Observability: Why Centralized Monitoring Breaks Down for BYOC Deployments
When a B2B SaaS company starts supporting enterprise customers who need their own cloud infrastructure — for data residency, compliance, or isolation — the observability model gets complicated fast.
The common reaction is Prometheus federation: a central Prometheus instance in the vendor’s account scrapes from per-customer Prometheus instances. Mattermost documented this pattern across multiple Kubernetes clusters and multiple AWS VPCs. The implementation involved a central monitoring cluster, cross-VPC networking, private load balancers, Route 53 private hosted zones, and a Lambda function to handle dynamic cluster registration. It works. It is also a substantial ongoing infrastructure commitment.
And it still does not fully solve the log problem. In regulated industries like financial services or healthcare, application logs frequently cannot leave the customer’s cloud account. Metrics are sometimes acceptable to aggregate centrally. Logs often are not.
A per-environment, in-cluster model fits this constraint better. Each customer environment runs its own Prometheus, Loki, and Grafana within its VPC. Logs and metrics stay within the account boundary. Vendor engineers access that environment’s Grafana when debugging.
The tradeoff is real: there is no centralized view across all customer environments. Aggregating insight across customers requires additional work. For teams where data residency is a hard requirement, that tradeoff is usually unavoidable.
LocalOps provisions the observability stack inside every environment it creates, including BYOC deployments. Each environment runs its own Prometheus, Loki, and Grafana within the customer’s VPC. The provisioning is consistent across environments because it comes from the same template — which reduces setup variation without centralizing data.
What Internal Developer Platform Architecture Should Include for Observability
In platform engineering, an internal developer platform is only as useful as the capabilities it provisions consistently. Observability is one of the first gaps that shows up when that consistency is missing. Not all IDPs handle it the same way, and the differences show up in how much manual work is required and how the system holds up across many environments.
A per-environment Prometheus instance prevents metric label conflicts between environments and supports the isolation that BYOC deployments require. Loki running inside the same cluster keeps log ingestion internal to the cluster network and makes logs available quickly without routing data outside the VPC.
Grafana should come pre-configured with Prometheus and Loki as data sources — not because this is difficult to do manually, but because it is reliably skipped or done incorrectly in manual setups. Persistent storage for Prometheus needs a PersistentVolumeClaim backed by something durable like EBS, with appropriate retention and capacity planning. Without it, node replacement erases your metrics history.
There should be a way to declare application metrics endpoints without writing ServiceMonitor resources. This lowers the bar for teams that want custom instrumentation but do not want to debug Kubernetes resource configurations to get there. Access to observability should flow through Grafana with role-based controls, not through direct cluster access.
Consistency across environments is worth treating as a requirement, not a nice-to-have. Dashboards and queries built on staging should work on production without modification. That only holds if labeling, naming, and configuration are consistent across environments from the start.
One constraint worth stating plainly: in BYOC and single-tenant architectures, the observability system should not route customer logs or metrics through the vendor’s cloud account unless the customer has explicitly agreed to that. In regulated industries, it is a procurement blocker.
LocalOps provisions the observability stack inside the target cloud account for each environment. Each environment runs its own Prometheus, Loki, and Grafana within the customer’s VPC.
DIY Observability vs. a Cloud-Native IDP That Provisions It for You
One question that comes up often when teams think about how to build an internal developer platform is where observability fits: provisioned upfront, or bolted on later. Building the Prometheus + Loki + Grafana stack yourself is not technically hard. Helm charts exist, documentation is reasonable, and most engineers can get a working setup in a day.
The problem is not the first installation. It is the second, third, and eighth.
Every new environment — staging, canary, customer-dedicated deployment — repeats the same process. Different engineers make slightly different choices. Chart versions differ. Storage configurations vary. Six months later, Prometheus versions differ across environments, scrape intervals are inconsistent, and dashboards built on staging do not work on production because label names drifted.
DIY also means owning the operational surface. Prometheus runs out of memory on a high-cardinality workload: your alert, your fix. Loki fills its disk because someone added verbose logging to a worker: your PVC resize, your pod restart, your lost logs. These are ongoing responsibilities, not one-time setup tasks.
A cloud-native IDP that provisions observability as part of environment creation reduces that maintenance overhead. The stack comes from a consistent template. Configuration drift is reduced because the manual step that causes drift is removed — not because the tools behave differently.
The tradeoff is real. You give up some configuration flexibility in exchange for not owning the full operational burden. For teams whose job is building product rather than maintaining monitoring infrastructure, that is often a reasonable trade — though it is worth understanding what you are giving up before making it.
FAQs
1. What observability tools should be built into an internal developer platform?
At minimum: Prometheus for metrics, Loki for log aggregation, and Grafana as the visualization and correlation layer. These three cover infrastructure metrics automatically and application metrics when services expose a /metrics endpoint. The best internal developer platforms include all three as part of environment creation, not as a post-setup task. Distributed tracing requires a separate tool like Tempo or Jaeger plus OpenTelemetry, and is worth planning for before you need it.
2. How do internal developer platforms provide Grafana and Prometheus monitoring out of the box?
By provisioning Prometheus, Loki, and Grafana as Kubernetes workloads during environment creation, running inside the same cluster as the application, with Grafana already configured to connect to both. No manual data source setup, no scrape configuration to write. The stack is functional before any application code is deployed.
3. How do you get logs from multiple Kubernetes environments without configuring Grafana manually each time?
Use an IDP that provisions a per-environment Loki and Grafana stack as part of environment creation. Each environment gets an identical stack with Loki already registered as a Grafana data source. Developers access logs through that environment’s Grafana. No separate login, no data source configuration, no inconsistencies between environments.
4. What is the difference between an internal developer portal and an internal developer platform?
A portal like Backstage is a UI layer: software catalog, documentation links, embedded dashboards. Teams evaluating a Backstage internal developer platform setup often find that Backstage handles the portal layer well but still requires a separate solution for provisioning, environment management, and observability. A platform handles all of that. A portal surfaces information about your stack. A platform creates and maintains it.
5. Does an open source internal developer platform work for production observability?
Yes. Prometheus, Loki, and Grafana are open source and run on standard Kubernetes. The tools are production-ready. The question is whether your team has capacity to provision, configure, and maintain them consistently across every environment you run. The tooling cost is zero. Ensuring they are provisioned, configured, and maintained consistently across environments is where the effort accumulates.
Conclusion
Most teams spend a day or two getting Prometheus and Grafana running on a new cluster. That feels like a one-time investment. It is not.
Every environment added is another Grafana instance to update, another Prometheus scrape configuration to maintain, another persistent volume to size correctly, and another set of dashboards to keep in sync with production. That cost scales with the number of environments, not the size of the team. A four-person team running eight environments carries a monitoring surface that grows with every new deployment.
According to the 2024 DORA report, elite engineering teams recover from failed deployments significantly faster than low performers. Observability is one contributing factor in that gap. Teams that can quickly see what broke, where, and when spend less time navigating fragmented monitoring setups and more time fixing the issue.
An internal developer platform does not replace Prometheus, Loki, or Grafana. It reduces the repeated provisioning work and limits the configuration drift that makes these tools harder to operate across multiple environments.
If you are running a single environment with dedicated infrastructure support, a manual setup can be sufficient. As the number of environments grows across stages, regions, or customer accounts, the provisioning overhead compounds, and the maintenance cost tends to surface during incidents when time matters most.
If you’re thinking through how to standardize observability across multiple environments, or how to reduce the maintenance overhead that comes with scaling clusters, the LocalOps team can help you work through it:
Book a Demo - Walk through how environments, deployments, and AWS infrastructure are handled in practice for your setup.
Get started for free - Connect an AWS account and stand up an environment to see how it fits into your existing workflow.
Explore the Docs - A detailed breakdown of how LocalOps works end-to-end, including architecture, environment setup, security defaults, and where engineering decisions still sit.