Understanding the challenges of managing data at scale

The challenges of operating ever more complex environments are well known. The most common ones we hear when speaking with our customers and partners center on the vast amount of data now being produced and its quality.

These aren’t new problems when it comes to network availability and performance monitoring. When I started working in this area in the late 1990s, Network Operations Centers (NOCs) were already drowning in the amount of data being produced. Back then, in the early days of systems and network management solutions, data would simply be discarded to avoid overloading the network management system.

Data quality was, and continues to be, a constant challenge. In the past, a massive set of rules had to be developed and maintained to make sense of the data stream. Whenever either the data format or the estate being managed changed, the rules needed to be updated. This was, and in some cases still is, a huge maintenance and support overhead.

Fast forward 20 years and I am pleased to report that although some of the data challenges remain, we are now significantly better equipped to deal with them.

Embracing Artificial Intelligence as more than a concept

There’s a lot of hype around the ‘latest’ technologies being introduced into service assurance – especially around Artificial Intelligence (AI) and Machine Learning (ML). However, these aren’t entirely new: the concept of AI was introduced in the 1940s and 1950s, when academics and scientists, including Alan Turing, started to consider the idea of an artificial brain.

AI remained a relatively academic field until the 1980s, when the world was introduced to ‘Expert Systems’ (I remember working at a large financial institution in London in the mid-1980s where there was a dedicated ‘expert systems’ team led by several senior doctoral scientists. I thought they were all a bit mad at the time!). Expert Systems worked on a series of ‘if-then’ rules to try to get the computer to perform reasoning. The good news is that since then, a lot of work has been done on making these technologies more ‘consumable’ by non-experts.
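To make that concrete, here is a minimal sketch in Python of the sort of ‘if-then’ rule evaluation an Expert System performed. The rules, facts, and names are invented purely for illustration:

```python
# A minimal sketch of 'if-then' expert-system reasoning.
# The rules and the sample facts are hypothetical, purely for illustration.

facts = {"link_down": True, "high_packet_loss": True, "power_failure": False}

# Each rule: if all conditions hold, conclude the consequent.
rules = [
    ({"power_failure"}, "site_outage"),
    ({"link_down", "high_packet_loss"}, "degraded_circuit"),
]

def infer(facts, rules):
    conclusions = []
    for conditions, consequent in rules:
        if all(facts.get(c, False) for c in conditions):
            conclusions.append(consequent)
    return conclusions

print(infer(facts, rules))  # ['degraded_circuit']
```

The maintenance overhead described earlier follows directly from this design: every new data format or device type means another hand-written rule.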

So that’s a brief history of AI, but where does Machine Learning fit in and how do we use it?

The intelligence of Supervised and Unsupervised Machine Learning

As mentioned, Expert Systems were programmed by humans to follow step-by-step instructions; the system had to have these instructions to perform the actions needed. Machine Learning aims to remove the need for explicit instructions: the computer ‘learns’ by adjusting the models it uses based on the data it is exposed to. In general, the more data it processes, the more accurate the outcomes will be.

Just to confuse matters, there are several classifications and types of Machine Learning algorithms. The most frequently used are termed ‘Supervised’ and ‘Unsupervised’. In Supervised ML (SML), we know both the input and output variables and use the algorithm to learn the mapping between them. For this to happen, a ‘training dataset’ is used to ‘train’ the algorithm so that it can then predict the output variable for a new input variable.
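As a minimal illustration of that workflow (the training data here is made up, and scikit-learn is simply one convenient library, not a statement about any particular product’s internals):

```python
# A minimal supervised ML sketch using scikit-learn.
# The feature values and labels are invented for illustration.
from sklearn.tree import DecisionTreeClassifier

# Training dataset: known inputs (latency ms, packet loss %) and known outputs.
X_train = [[20, 0.1], [25, 0.0], [250, 5.0], [300, 8.0]]
y_train = ["normal", "normal", "fault", "fault"]

model = DecisionTreeClassifier().fit(X_train, y_train)

# Predict the output variable for a new, unseen input variable.
print(model.predict([[280, 6.5]]))  # ['fault']
```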

Supervised ML is ideal if you have good quality data to train the algorithms. Unfortunately, this isn’t always available.

In that case, Unsupervised Machine Learning (UML) can be the best option. With UML, we have only the input variable and no corresponding output variable. UML models the underlying structure or distribution of the data to ‘learn’ more about it. It is called ‘Unsupervised’ because, unlike Supervised ML, there are no correct answers and no teacher.
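A correspondingly minimal unsupervised sketch, again with invented numbers: the algorithm receives only inputs and discovers the groupings itself:

```python
# A minimal unsupervised ML sketch: clustering with no labels.
# Input-only data (CPU %, memory %); values are invented for illustration.
from sklearn.cluster import KMeans

X = [[10, 15], [12, 18], [11, 16], [85, 90], [88, 92], [90, 95]]

# No output variable and no 'teacher': KMeans infers the structure itself.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # e.g. [0 0 0 1 1 1] - two behavior groups discovered
```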

Unsupervised Machine Learning is, therefore, ideal for situations where you have a lot of data that is not well understood or structured, such as a very dynamic and complex network environment. Hence our use of multiple Unsupervised Machine Learning policies within our Assure1 solution to identify patterns in the real-time data stream. We supply these policies out of the box, but we also appreciate that some of our customers may want to update them, clone them, or build new ones, so everything in our solution is open and editable.
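Purely to illustrate the flavor of such a policy (this is not Assure1’s implementation), a rolling statistical check over a metric stream might look like:

```python
# Illustrative only - not Assure1's actual policy logic.
# Flag points that deviate strongly from recent rolling behavior.
from collections import deque
from statistics import mean, stdev

window = deque(maxlen=30)  # recent history of the metric

def is_anomalous(value, threshold=3.0):
    """Return True if value is > threshold std devs from the rolling mean."""
    anomalous = False
    if len(window) >= 10:
        mu, sigma = mean(window), stdev(window)
        anomalous = sigma > 0 and abs(value - mu) > threshold * sigma
    window.append(value)
    return anomalous

stream = [5, 6, 5, 7, 6, 5, 6, 5, 6, 7, 6, 5, 95]  # invented samples
print([v for v in stream if is_anomalous(v)])       # [95]
```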

AIOps technology enables actionable intelligence, advanced root cause analysis, and increased automation capabilities

Identifying patterns is obviously useful, but the end goal is to produce an outcome that is actionable. In Assure1, we fully leverage Artificial Intelligence for IT Operations (AIOps) technologies, using the results of our Machine Learning to drive our advanced root cause analysis and automation functions. The identified action can be picked up either by a human or by automation, and this behavior is fully configurable.
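As a hedged sketch of that hand-off (the function names, fields, and thresholds below are hypothetical, not Assure1’s actual API), the routing decision can be as simple as:

```python
# Hypothetical sketch of routing an ML finding to automation or a human.
# Names, fields, and thresholds are invented for illustration.

def dispatch(finding):
    """Send a root-cause finding to automation or to an operator queue."""
    if finding["confidence"] >= 0.9 and finding["runbook"] is not None:
        run_automation(finding["runbook"], finding["resource"])
    else:
        assign_to_operator(finding)

def run_automation(runbook, resource):
    print(f"Executing runbook '{runbook}' on {resource}")

def assign_to_operator(finding):
    print(f"Queued for human review: {finding['root_cause']}")

dispatch({"root_cause": "fiber degradation", "resource": "edge-router-7",
          "confidence": 0.95, "runbook": "reroute-traffic"})
```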

Using the machine learning policies within Assure1, we have delivered significant business and operational benefits. We have identified critical issues and outages that had previously been missed. We have significantly reduced the number of events operators see by grouping related events together. In many cases, we have seen potential savings of thousands of hours that would otherwise have been wasted investigating events manually.
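To give a feel for how grouping reduces event volume (a simplified sketch, not the product’s correlation engine), related events sharing a resource within a time window collapse into one actionable group:

```python
# Simplified sketch of event grouping - not the product's correlation engine.
# Raw events on the same resource within a time window collapse into one group.
from itertools import groupby

events = [  # (timestamp_seconds, resource, message) - invented samples
    (100, "core-switch-1", "link flap"),
    (102, "core-switch-1", "link flap"),
    (105, "core-switch-1", "link down"),
    (400, "edge-router-7", "high latency"),
]

WINDOW = 60  # seconds

def group_key(event):
    ts, resource, _ = event
    return (resource, ts // WINDOW)  # same resource, same time bucket

for key, grp in groupby(sorted(events, key=group_key), key=group_key):
    messages = [e[2] for e in grp]
    print(key, "->", len(messages), "raw events:", messages)
```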

As I am sure you appreciate, doing justice to this subject in a short blog article is not possible, and there is so much more to discuss. If you would like to know more about Assure1 and how we use machine learning to help save our customers from drowning in a sea of data, please see https://www.federos.com/products/assure1/analytics/ or contact us at info@federos.com.

You may also be interested in this White Paper https://www.federos.com/resources/white-papers/machine-learning-anomaly-detection-at-scale/ which was produced in partnership with Appledore Research and discusses the use of machine learning to detect anomalies, presenting an example of one of the more common use cases we come across: the detection of a partially crushed optical fiber.

Modern Service Operations Centers (SOCs) and Network Operations Centers (NOCs) process billions of events daily. In most cases, service-impacting events that violate service level agreements and cause widespread network outages occur because it is too complex for even the best experts to identify, isolate, and accurately determine the root cause.

Learn how Machine Learning (ML) can be used to tackle the challenges within telecommunication networks, and explore how it has the potential to improve customer satisfaction by reducing mean time to repair (MTTR) and to lower operational costs by offloading outlier detection to machines rather than humans.

