The global digital transformation and e-commerce boom continue to drive up expectations for IT quality, especially regarding performance and user experience. Yet the current reality is sobering:
77% of consumers have stopped using certain digital services or uninstalled apps due to performance issues.
This underscores the growing need for strategies and solutions to achieve effective Application Performance Management (APM). Operations teams, developers, and management all require end-to-end visibility into their IT architectures and reliable tools to maintain business continuity with minimal resource expenditure. Given today’s complexity, traditional threshold-based monitoring is no longer sufficient.
In complex environments such as online retail, cloud ecosystems, microservices, and diverse applications, Observability is now essential for maintaining performance, particularly for business-critical processes.
IT Budgets Shrink, Yet Digitalization Continues
Despite continued investments in industrial and business-process digitalization, IT budgets are tightening. In 2023, the average IT budget accounted for just 3.6% of total company revenue, down from 4.2% in 2022.
This share varies by industry: finance and electrical engineering sectors typically spend above the average, while construction and mechanical engineering fall below.
Security remains the top priority for about 80% of companies, due to increasingly sophisticated cyberattacks. Yet 59% aim to invest more in IT operations to better meet evolving customer demands and boost competitiveness, particularly through improved e-commerce performance, stability, flexibility, and scalability for enhanced customer experience.
To achieve this, many decision-makers are renewing their IT infrastructure. MACH architectures (Microservices, API-first, Cloud-native, Headless) are becoming the standard over all-in-one solutions. Their scalability and adaptability make them ideal for dynamic cloud environments and fast customer journeys.
Observability as a Prerequisite for Stable IT
When infrastructures are consolidated or expanded, effective monitoring tools become essential. Without them, IT stability and user experience are nearly impossible to maintain.
Modern architectures such as clouds, microservices, serverless apps, and hybrid or on-prem environments generate large volumes of anomalies. On top of this, the rise of AI adds a new layer of complexity.
New Challenge: AI and Shadow AI
According to Forrester, 2024 marks the start of an era of “Intentional AI”, moving beyond the hype toward concrete strategic use. 67% of companies plan to integrate Generative AI into their overall AI strategies.
Yet another trend is emerging: Shadow AI—the unauthorized use of AI tools by employees. Around 60% of employees globally are expected to use AI tools at work without approval.
A Salesforce study reveals:
- 52% of German employees have used unauthorized GenAI tools
- 34% brought their own, officially banned AI tools
This “Bring Your Own AI” (BYOAI) trend is poised to explode. Shadow AI introduces not just security risks but new challenges in performance monitoring. While usage can be limited through internal policies, strategic AI deployments (via MLOps) require robust performance management and monitoring frameworks to ensure business continuity.
Traditional Monitoring Captures Only 1% of Anomalies
Operating containers, microservices, clouds, and AI systems dramatically increases complexity, and data volumes can reach terabytes.
This overwhelms traditional monitoring, which typically detects only about 1% of anomalies—the “known knowns”.
The real issue lies in the “unknown unknowns”: unexpected, unfamiliar anomalies that defy standard analysis.
Traditional monitoring tools might detect that “something is wrong” but often can’t explain what or why. Troubleshooting these unknowns is slow and costly, often leading to hours or even days of poor performance and critical user experience issues.
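To make that limitation concrete, here is a minimal sketch of the kind of static rule traditional monitoring relies on; the metric names and limits are illustrative assumptions, not taken from any specific tool. A rule like this only fires on conditions someone anticipated in advance, which is exactly why the unknown unknowns slip through.

```python
# Minimal sketch of a static, threshold-based check (the "known knowns").
# Metric names and limits are illustrative assumptions. The rule only fires
# on conditions someone anticipated; an unfamiliar failure mode that stays
# below every limit goes unnoticed.

THRESHOLDS = {
    "cpu_usage_percent": 90.0,
    "response_time_ms": 500.0,
}

def check_thresholds(sample: dict) -> list[str]:
    """Return an alert message for every metric that crosses its fixed limit."""
    alerts = []
    for metric, limit in THRESHOLDS.items():
        value = sample.get(metric)
        if value is not None and value > limit:
            alerts.append(f"{metric}={value} exceeds limit {limit}")
    return alerts

# A degraded checkout flow that stays just below every limit raises nothing.
print(check_thresholds({"cpu_usage_percent": 55.0, "response_time_ms": 480.0}))  # -> []
```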
In digital business, especially e-commerce, such delays are unacceptable. As Greg Linden, creator of Amazon’s recommendation engine, once said:
“100 milliseconds of latency costs Amazon 1% in revenue.”
Hence, businesses must implement Observability with real-time monitoring, automated root cause analysis, and a low mean time to resolution (MTTR), even for complex incidents.
Optimal APM with Observability
Observability takes performance monitoring to the next level by integrating and analyzing logs, metrics, and traces:
- Logs: Show events, errors, user and device info
- Metrics: Show system behavior (e.g., CPU usage, transaction volumes)
- Traces: Show how long requests take and where bottlenecks occur
Combined, these signals provide deep insight and transparency. For example, metrics might show slow response times, while logs reveal the cause, such as unusually complex transactions being processed at that moment.
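As a rough illustration of how these three signals are emitted from application code, the following sketch uses the OpenTelemetry Python API; the article does not prescribe a specific toolkit, and the service names, attribute keys, and omitted exporter setup are assumptions.

```python
# Minimal sketch of instrumenting one request path with logs, metrics, and
# traces via the OpenTelemetry Python API (an assumption; no toolkit is named
# in the article). Provider/exporter setup is omitted, so this runs as a
# no-op until an SDK backend is configured.
import logging
import time

from opentelemetry import trace, metrics

logger = logging.getLogger("shop.checkout")
tracer = trace.get_tracer("shop.checkout")             # traces: where time is spent
meter = metrics.get_meter("shop.checkout")             # metrics: system behavior
order_counter = meter.create_counter("orders_processed")
latency_ms = meter.create_histogram("checkout_latency_ms")

def process_order(order_id: str) -> None:
    start = time.perf_counter()
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)        # trace context for this request
        logger.info("processing order %s", order_id)    # log: the event and its details
        # ... business logic would run here ...
        order_counter.add(1)                            # metric: transaction volume
    latency_ms.record((time.perf_counter() - start) * 1000)
```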
To implement Observability, entire systems must be instrumented to collect data. Due to the data volumes, AI and ML are key for real-time analysis and insight generation.
AI-powered systems (see the sketch after this list):
- Understand historical system states
- Compare them to real-time data
- Identify anomalies
- Assign known issues to auto-remediation
- Escalate unknowns to specialists
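A minimal sketch of this detect-and-route loop is shown below, using a simple rolling baseline (mean and standard deviation) as a stand-in for the ML models a real Observability platform would apply; the runbook mapping and thresholds are illustrative assumptions.

```python
# Minimal sketch of the detect-and-route loop: learn a baseline from history,
# compare real-time values against it, auto-remediate known issues, and
# escalate unknowns. The known-issue runbook and z-score limit are assumptions.
from collections import deque
from statistics import mean, stdev

KNOWN_ISSUES = {"queue_backlog": "auto-scale workers"}   # known issue -> remediation

class AnomalyRouter:
    def __init__(self, window: int = 100, z_limit: float = 3.0):
        self.history = deque(maxlen=window)              # historical system state
        self.z_limit = z_limit

    def observe(self, value: float, label: str = "") -> str:
        decision = "ok"
        if len(self.history) >= 10:
            mu, sigma = mean(self.history), stdev(self.history)
            if sigma > 0 and abs(value - mu) / sigma > self.z_limit:  # compare to baseline
                if label in KNOWN_ISSUES:
                    decision = f"auto-remediate: {KNOWN_ISSUES[label]}"
                else:
                    decision = "escalate to specialist"               # unknown unknown
        self.history.append(value)
        return decision
```

In production the baseline would be far richer (seasonality, multivariate signals), but the routing decision stays the same: remediate known issues automatically and hand the rest to specialists.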
Observability dramatically improves the detection, understanding, and resolution of unexpected anomalies (the “unknown unknowns”), thus ensuring seamless operations and a top-tier user experience.
Business Benefits of Observability
Beyond performance, Observability offers significant business value:
- Customer behavior data can be linked to technical data to measure IT’s direct contribution to business outcomes
- Predictive analytics can forecast system loads, enabling proactive scaling (see the sketch below)
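As a sketch of the predictive-scaling idea, the following example fits a simple linear trend to recent load samples and derives an instance count; the sample data, capacity figure, and forecast horizon are illustrative assumptions rather than a recommended model.

```python
# Minimal sketch of predictive scaling: extrapolate recent load with a
# least-squares trend and provision capacity before the forecast is reached.
# All numbers are illustrative assumptions.
import numpy as np

def forecast_load(samples: list[float], steps_ahead: int = 12) -> float:
    """Extrapolate requests/minute with a linear trend fitted to recent samples."""
    x = np.arange(len(samples))
    slope, intercept = np.polyfit(x, samples, deg=1)
    return slope * (len(samples) + steps_ahead) + intercept

recent = [1200, 1260, 1335, 1410, 1490, 1570]   # requests per minute
capacity_per_instance = 800

predicted = forecast_load(recent)
instances_needed = int(np.ceil(predicted / capacity_per_instance))
print(f"forecast {predicted:.0f} req/min -> provision {instances_needed} instances")
```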
Observability also frees up DevOps teams by streamlining issue resolution and shortening release cycles. Logs, metrics, and traces can be integrated with CI/CD pipelines to test the performance impact of changes early, before issues reach production.
This gives developers more time for innovation and value-creation.
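One possible shape of such a pipeline gate is sketched below: the build fails if the release candidate’s p95 latency regresses beyond a budget compared with a stored baseline. The file names, the chosen metric, and the 10% budget are illustrative assumptions.

```python
# Minimal sketch of a CI/CD performance gate: compare the candidate build's
# p95 latency against a stored baseline and fail the build on regression.
# File names and the 10% budget are illustrative assumptions.
import json
import sys

REGRESSION_BUDGET = 0.10   # allow at most a 10% slowdown of p95 latency

def check_performance(baseline_file: str, candidate_file: str) -> int:
    with open(baseline_file) as f:
        baseline = json.load(f)["p95_latency_ms"]
    with open(candidate_file) as f:
        candidate = json.load(f)["p95_latency_ms"]
    regression = (candidate - baseline) / baseline
    if regression > REGRESSION_BUDGET:
        print(f"FAIL: p95 latency up {regression:.0%} ({baseline} -> {candidate} ms)")
        return 1
    print(f"OK: p95 latency change {regression:+.0%}")
    return 0

if __name__ == "__main__":
    sys.exit(check_performance("baseline_metrics.json", "candidate_metrics.json"))
```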
Observability and Generative AI: The Road Ahead
Generative AI brings added complexity to APM. It demands handling broader data diversity and volume. Full observability into AI systems is still evolving.
As Cory Minton, Head of Observability Strategy at Splunk, puts it:
“We need to figure out how to extract metrics, logs, and traces from AI. Does it behave as expected? If not, how do we build and fine-tune MLOps pipelines to keep functionality intact?”
Modern observability systems must rise to this challenge – with or without integrated GenAI.
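As a hedged sketch of what “metrics, logs, and traces from AI” could mean in practice, the example below wraps a model invocation and records latency and rough token counts; call_model is a hypothetical stand-in for whichever model client is used, and no specific vendor API is implied.

```python
# Minimal sketch of extracting telemetry from a generative-AI call: wrap the
# invocation, measure latency, approximate token counts, and log the outcome.
# `call_model` is a hypothetical model client, not a specific vendor API.
import logging
import time

logger = logging.getLogger("genai")

def observed_completion(call_model, prompt: str) -> str:
    """Invoke the model and emit basic metrics/logs for the request."""
    start = time.perf_counter()
    try:
        response = call_model(prompt)                    # hypothetical model client
        latency = (time.perf_counter() - start) * 1000
        # Whitespace splitting is only a rough proxy for real token counts.
        logger.info(
            "genai request ok: latency_ms=%.0f prompt_tokens=%d completion_tokens=%d",
            latency, len(prompt.split()), len(response.split()),
        )
        return response
    except Exception:
        logger.exception("genai request failed after %.0f ms",
                         (time.perf_counter() - start) * 1000)
        raise
```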
Strong Return on Investment
We are entering a new era of performance engineering. While not all anomaly resolution can be automated, Observability is ideal for performance management:
- It identifies and classifies unknown issues
- Routes them directly to the right specialist
- Frees up expert resources from manual data triage
Companies are increasingly investing in Observability to achieve business goals and boost profitability. Surveys show:
On average, every €1 invested in Observability returns €2 – it pays off.
References
1 – https://survey.zohopublic.eu/zs/HETsVe
2 – https://www.forrester.com/blogs/prognosen-2024-generative-ki-de/
3 – https://www.salesforce.com/news/stories/ai-at-work-research/
4 – https://www.splunk.com/de_de/form/it-predictions.html
5 – https://www.splunk.com/de_de/form/it-predictions.html