Recently, I was asked to evaluate an organization’s Splunk deployment. This request flummoxed me, because while I’ve always been a fan of the tool’s capabilities, I’ve never actually designed an implementation or administered it. I love the empowerment of people building their own dashboards and alerts, but this only works when there’s a dedicated Splunk-Whisperer carefully overseeing the deployment and socializing the idea of using it as a self-service, cross-functional tool. As I started my assessment, I entered what can only be called a “dark night of the IT soul,” because my findings led me to question the viability of most enterprise monitoring systems.
The original implementer had recently moved on to greener pastures and (typically) left only skeletal documentation. As I dug in, I discovered a painfully confusing distributed deployment built with little to no understanding of “best practices” for the product. With no data normalization and almost nonexistent data input management, the previous admin had created the equivalent of a Splunk Wild West, allowing most data to flow in with little oversight or control. The obscenely large number of sourcetypes and sources horrified Splunk support, who told me my only option was to rebuild, a scenario that filled me with nerd-angst.
In the past, I’ve written about the importance of using machine data for infrastructure visibility. It’s critical not only for security, but also for performance monitoring and troubleshooting. Log correlation and analysis are key components of any healthy infrastructure, and without them, you’re like a mariner lost at sea. So imagine my horror when confronted by a heaping pile of garbage data thrown into a very expensive application like Splunk.
Most organizations struggle with a monitoring strategy because it simply isn’t sexy. It’s hard to get business leadership excited about dashboards, pie charts and graphs without contextualizing them in a report. “Yeah baby, let me show you those LOOOOW latency times in our web app.” It’s a hard sell, especially when you see the TCO for on-premises log correlation and monitoring tools. Why not focus instead on improving a product that could bring in more customer dollars, or on a new service to make your users happier? Most shops are so focused on product delivery and firefighting that they simply don’t have cycles left for thinking about proactive service management. So you end up with infrastructure train wrecks, with little to no useful monitoring.
While a part of me still believes in using the best tools to gain intelligence and visibility from an infrastructure, I’m tired of struggling. I’m beginning to think I’d be happy with anything, even a Perl script, that works consistently with a low level of effort (LOE). I need that data now and no longer have the luxury of waiting until there’s a budget for qualified staff and the right application. Lately, I’m finding it pretty hard to resist the siren song of SaaS log management tools that promise onboarding and insight into machine data within minutes, not hours. Just picture it: no more agents or on-premises systems to manage, just immediate visibility into data. Most other infrastructure components have moved to the cloud; maybe it’s inevitable for log management and monitoring too. Will I miss the flexibility and power of tools like Splunk and ELK? Probably, but I no longer have the luxury of nostalgia.