Why Log Managers Work

June 26, 2014

By Yossi Shteingart, Director of Operations

During the past few weeks we’ve been attempting to improve our monitoring in order to offer better customer service.

The questions I kept in mind during my decision-making process were:
1. How do we get precise alerting to reduce the time it takes to isolate a problem?
2. How can we spot anomalies that predict an issue, and address that issue before it becomes a problem?

After brainstorming with my team, these were the possible solutions we came up with:
1. Go to our R&D team with our monitoring requirements and ask them to build monitoring capabilities into the product.
2. Add an SNMP agent to the application to monitor it, gather statistics and send an SNMP trap in case of a problem.
3. Use an APM product (e.g. AppDynamics, Compuware, New Relic).
4. Implement a log manager solution.

The “We are so gifted, let’s do it ourselves” Approach

The first idea was thrown out the window pretty fast due to three major reasons:
• First, the amount of time it would take to add these capabilities.
• Second, the additional time and resources needed to complete these changes and maintain them afterwards.
• Third and last, the fact that we are a messaging company, not a monitoring company.

If I have learned one thing during my time as an IT admin and operations manager, it is this: stay focused on your main business. Don’t try to do something that doesn’t directly apply to what you excel at.

SNMP Framework

I consulted with several solution providers with expertise in this field.
SNMP sounded like a good idea since, like it or not, it is the de facto monitoring protocol. But again, it would require heavy involvement from the R&D team, both to implement it and to maintain it.

APM

The idea here is to add an agent that monitors everything the application does and collects performance metrics. The videos and slides on the internet were compelling. But after I delved a bit deeper into the solution and read about what exactly it involves, I came to the conclusion that this road would also require an R&D effort to “help” the 3rd party agent understand the application’s “language” and complexity.

Log Management

When a problem arises and there is no specific alert, where do we look for answers? The correct answer is: logs!

Where is the largest amount of application-related data stored? Right again: logs.

One of the jokes I came across in my research came from one of the largest video content providers in North America, Netflix: “We are a log generating company that also happens to stream movies,” said Danny Yuan, System Architect. To paraphrase, we are a “log company that also happens to deliver messages.”

TeleMessage generates a huge amount of logs which, with today’s technology, can become a valuable asset. Our engineers were thrilled to learn that they can have all the logs in a searchable, graphed form, and that we can easily define behavior profiles that send alerts when a specific scenario occurs.
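As a rough illustration of what such a behavior profile might look like, here is a minimal Python sketch. Everything in it is an assumption made for the example: the log line format, the `AlertRule` class, the sample messages and the thresholds are hypothetical and do not describe our actual logs or any product's API. The rule fires when a pattern shows up at least a given number of times within a sliding time window.

```python
import re
from collections import deque
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# Assumed (hypothetical) log line format: "2014-06-26 10:00:05 ERROR some message"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+) (.*)$")


@dataclass
class AlertRule:
    """Hypothetical rule: fire when `pattern` matches `threshold` times within `window`."""
    name: str
    pattern: str          # regular expression searched for in each message
    threshold: int        # number of matches needed to fire
    window: timedelta     # length of the sliding time window
    hits: deque = field(default_factory=deque)

    def feed(self, timestamp: datetime, message: str) -> bool:
        """Record one log message; return True if the rule fires on it."""
        if not re.search(self.pattern, message):
            return False
        self.hits.append(timestamp)
        # Drop matches that have fallen out of the sliding window.
        while self.hits and timestamp - self.hits[0] > self.window:
            self.hits.popleft()
        return len(self.hits) >= self.threshold


def scan(lines, rule):
    """Return the timestamps at which the rule fired for the given log lines."""
    fired = []
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines that do not follow the assumed format
        stamp, _level, message = match.groups()
        ts = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        if rule.feed(ts, message):
            fired.append(ts)
    return fired


# Made-up sample data, for illustration only.
sample = [
    "2014-06-26 10:00:00 INFO message 1 delivered",
    "2014-06-26 10:00:05 ERROR delivery timeout for message 2",
    "2014-06-26 10:00:20 ERROR delivery timeout for message 3",
    "2014-06-26 10:00:40 ERROR delivery timeout for message 4",
    "2014-06-26 10:05:00 ERROR delivery timeout for message 5",
]

rule = AlertRule(
    name="delivery-timeouts",
    pattern=r"delivery timeout",
    threshold=3,
    window=timedelta(seconds=60),
)

for ts in scan(sample, rule):
    print(f"ALERT {rule.name}: threshold reached at {ts}")
```

With the sample lines above, the rule fires once, at 10:00:40, when the third timeout lands inside the same minute. The isolated timeout at 10:05:00 does not trigger anything. A real log manager would add indexing, search and graphing on top of this kind of rule, but the principle of turning raw log lines into a precise alert is the same.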

So, what are the products that can do this, and which one do we choose? Well, as with anything else, there are open source alternatives and commercial alternatives.

Splunk was one obvious contender, but with the price they demand for their license, we would have to be either gold miners or oil drillers…

Other contenders were pretty expensive too. At this point we are evaluating a promising commercial product and, in parallel, looking into open source products such as Logstash, Elasticsearch, Kibana and Graylog2.

To conclude, analyzing logs appears to be the best approach for us. Our journey begins as we move forward with implementing a log management solution over the next few weeks and months.

In my next post I will share some of the insights and results, and reveal what we eventually chose.

Below is a simple table comparing the different monitoring approaches:

| Approach | Cost | R&D Effort | Solution Maintenance Overhead | Implementation Time |
|---|---|---|---|---|
| SNMP Agent | High | High | High | 6-9 months |
| Application APM | High | Medium | Medium | 3-6 months |
| Log Management | Free – High | Low – Medium | Low – Medium | 3 weeks – 3 months |

To be continued…
