Failure to Communicate.

Over many years in the Network Operations Center (NOC), facing the various problems that presented themselves to the operations team, we’d often look at the tools that failed to ferret out issues and instantly recall a line from the classic Paul Newman film “Cool Hand Luke”: “What we’ve got here is failure to communicate.” In the film, Strother Martin plays the infamous Captain of Road Prison 36, who delivers the line to the prisoners to justify his beating of a disobedient Luke (Paul Newman).

The Captain was not able to understand Luke, and Luke clearly was not able to get through to the Captain. It’s not that either side wasn’t communicating; it’s just that the Captain didn’t know how to listen and understand what Luke was trying to express.

This pretty much summarizes issues in the NOC.

Your devices, like Luke, are trying to get a message through.  The Captain is your Manager of Managers, and the Bosses (or other NOC tools) are there to ‘help’ the Captain do his job better.

The problem in the NOC breaks down, believe it or not, at the very beginning.

In our last article, we mentioned how data normalization and translation are the keys to understanding and processing each message a device sends or that we collect from a device.  These ephemeral and non-ephemeral data objects either represent a point in time or can be compared over time to understand the patterns, direction, and velocity of individual data points.
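As a rough illustration of what "direction and velocity" of a data point can mean, here is a minimal Python sketch. It is not taken from Assure1; the function names and the (timestamp, value) sample layout are assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical sample layout: (timestamp in seconds, metric value).
Sample = Tuple[float, float]


def velocity(samples: List[Sample]) -> float:
    """Average rate of change per second between the first and last sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to compute a velocity")
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 == t0:
        raise ValueError("samples must span a non-zero time interval")
    return (v1 - v0) / (t1 - t0)


def direction(samples: List[Sample]) -> str:
    """Label the trend as rising, falling, or flat based on the velocity."""
    rate = velocity(samples)
    if rate > 0:
        return "rising"
    if rate < 0:
        return "falling"
    return "flat"


# Example: CPU utilization sampled once a minute.
cpu = [(0.0, 40.0), (60.0, 55.0), (120.0, 70.0)]
print(direction(cpu), velocity(cpu))
```

A single sample is the point-in-time view; the series as a whole is what gives you a trend to reason about.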

But once you have processed those points, you then need to comprehend the larger problems at hand.  How does an ephemeral singularity impact a service or customer?  That’s where Federos’ Assure1® truly shines.  And that’s where we, at Federos, jump out of our seats and get excited.

Every data object ingested into Assure1, each of these points of communication, is processed by multiple data science engines that rely on algorithms, which in turn rely on policies or rules.  This analysis is critical to the NOC, and the key is the timeliness of the data.

With Assure1, we started by collecting the best algorithms data science has to offer.  This is an evolving science, so we continue to examine new problems and find the latest tools math and science offer.  We then collected years of historical production event and metric data to help us solve and optimize “situational awareness” by developing out-of-box policies, matched with the right algorithms, to compress symptoms behind actionable problems.  Using our in-house Lean Six Sigma approaches, we continue to iterate with our clients as we develop new policies and pattern matchers, and we test the results against what actually transpired for that customer.  This allows us to uniquely offer out-of-box policies and rules that give our customers day-one value.

Our expansive and ever-growing anomaly-based library of off-the-shelf policies provides our customers with immediate, tangible day-one value from machine learning.  But how does a customer enhance that value?  This is where the Assure1 Event Analytics Policy Workflow begins.  Customers running our Event Analytics Engine can easily examine their data patterns to find signal volume and variety within their variables and expose potential new machine learning policies.  With Assure1, it’s remarkably easy to do, and once a policy has been created by point-and-click, you can bench-test it against historical data to review and tweak its effectiveness before deploying it into production.
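To make the idea of bench-testing a policy against historical data concrete, here is a minimal Python sketch. It does not use the Assure1 interface; the policy is modeled as a plain function, and the event fields, the threshold, and the `bench_test` helper are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical event shape: a dict of numeric fields.
Event = Dict[str, int]
# Historical record: the event plus whether it turned out to be a real incident.
Record = Tuple[Event, bool]


def bench_test(policy: Callable[[Event], bool], history: List[Record]) -> Dict[str, float]:
    """Replay history through a policy and report precision and recall."""
    true_pos = false_pos = false_neg = 0
    for event, was_incident in history:
        flagged = policy(event)
        if flagged and was_incident:
            true_pos += 1
        elif flagged:
            false_pos += 1
        elif was_incident:
            false_neg += 1
    flagged_total = true_pos + false_pos
    incident_total = true_pos + false_neg
    return {
        "precision": true_pos / flagged_total if flagged_total else 0.0,
        "recall": true_pos / incident_total if incident_total else 0.0,
    }


def burst_policy(event: Event) -> bool:
    """Illustrative policy: flag any event that repeats 10 or more times."""
    return event["count"] >= 10


history = [
    ({"count": 12}, True),
    ({"count": 15}, False),
    ({"count": 3}, True),
    ({"count": 1}, False),
]
print(bench_test(burst_policy, history))
```

Low precision suggests the policy is too noisy, and low recall suggests it misses real incidents; either is a cue to tweak the policy before it goes to production.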

While the Event Analytics Unsupervised Machine Learning Detectors can capture about 60% – 70% of the issues you would want to capture, that still leaves plenty of room for improvement.  That’s where our Topological Root Cause Analysis engine steps in.

Historically, Topological Root Cause Analysis has been an efficient one-trick pony.  Associating events on a graph and then suppressing the noise down to a root cause only works when there is an accurate, discoverable topology that matches the correct layers of your network.  Plus, you need the ability to correctly map “root causal” events to their corresponding “suppressible symptomatic” events.
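The graph-based suppression described above can be sketched in a few lines of Python. This is a simplified illustration, not the Assure1 engine: it assumes a tree-shaped topology in which each device has at most one upstream parent, and the device names and the `classify_alarms` helper are hypothetical.

```python
from typing import Dict, Optional, Set


def classify_alarms(upstream: Dict[str, Optional[str]], alarmed: Set[str]) -> Dict[str, str]:
    """Label each alarmed device as a root cause or a suppressible symptom.

    A device is a symptom if any device upstream of it is also alarmed;
    otherwise it is treated as a root cause.
    """
    labels: Dict[str, str] = {}
    for device in alarmed:
        label = "root_cause"
        seen = {device}  # guards against accidental loops in the topology data
        parent = upstream.get(device)
        while parent is not None and parent not in seen:
            if parent in alarmed:
                label = "symptom"
                break
            seen.add(parent)
            parent = upstream.get(parent)
        labels[device] = label
    return labels


# Hypothetical topology: device -> upstream parent (None for the top of the tree).
topology = {
    "core": None,
    "agg1": "core",
    "agg2": "core",
    "access1": "agg1",
    "access2": "agg1",
}
labels = classify_alarms(topology, {"agg1", "access1", "access2"})
print(labels)
```

The sketch also shows the dependency the paragraph describes: if the `topology` mapping is wrong or incomplete, the labels are wrong too.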

From a delivery perspective, Topological RCA works great on Layer 2/3 networks.  Other OSI layers are possible, but are highly dependent on discovery via inventory or other methods.  Hence, we suggest that Topological RCA offers about a 20% – 30% solution to overall noise reduction.  But for Assure1 customers, the true magic happens when Topological RCA is combined with Event Analytics Unsupervised Machine Learning.  Customers have long been frustrated with leased network paths, which represent blind spots on many carrier networks.  With Assure1’s Event Analytics, you now have the ability to correlate topological problems on leased networks to their child symptoms.  That means that if a fiber cut results in hundreds of alarms, our Unsupervised Machine Learning can correlate the alarms and find the two end points of the cut circuit, as well as perform real-time suppression of the symptomatic alarms that led the engine to identify the cut in the first place.
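As a very rough stand-in for that kind of correlation, the Python sketch below groups alarms that arrive close together in time and lists the devices involved in each burst. It is not the Assure1 Unsupervised Machine Learning algorithm; the time-gap heuristic, the `max_gap` value, and the device names are assumptions chosen for illustration.

```python
from typing import List, Set, Tuple

# Hypothetical alarm shape: (timestamp in seconds, reporting device).
Alarm = Tuple[float, str]


def group_bursts(alarms: List[Alarm], max_gap: float = 5.0) -> List[List[Alarm]]:
    """Group alarms into bursts where consecutive alarms are at most max_gap apart."""
    bursts: List[List[Alarm]] = []
    for alarm in sorted(alarms):
        if bursts and alarm[0] - bursts[-1][-1][0] <= max_gap:
            bursts[-1].append(alarm)  # continues the current burst
        else:
            bursts.append([alarm])    # starts a new burst
    return bursts


def devices_in(burst: List[Alarm]) -> Set[str]:
    """Distinct devices that reported alarms within a burst."""
    return {device for _, device in burst}


alarms = [
    (100.0, "site-a"),
    (101.5, "site-b"),
    (102.0, "site-a"),
    (300.0, "site-c"),
]
bursts = group_bursts(alarms)
print([sorted(devices_in(b)) for b in bursts])
```

In this toy data, the first burst involves exactly two devices, which is the kind of pattern that would point to the two ends of a single cut circuit.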

So, between the discoverable topology and the invisible topology of leased paths, the two engines together provide the most robust topological noise reduction ever.

With Assure1’s Topological and Unsupervised engines, we can detect 90% of the situations your operations team will ever face.  Your team can continue to evolve the policies to sharpen the skills of the engines as new technologies or situations are exposed.  The event suppression (or compression behind the root causes) allows the NOC to respond more efficiently to emergent issues, capture new situations, and deploy new policies at will.

The last-mile problems, for Assure1 customers, are now more easily solved.  We recently released the Supervised Event Correlator to handle them.  The result: anything that is not correlated by Unsupervised Machine Learning or Topological Root Cause Analysis can be turned into new point-and-click Supervised policies, completing the Assure1 RCA trilogy, aptly called RCA3.