By Uday Kiran Pulleti, Senior Director – AI, Cyble
The internet as we know it is a dynamic, constantly growing entity. Eric Schmidt, while CEO of Google, famously estimated that "the internet" held about 5 million terabytes of data and that Google had indexed barely 0.004% of it, or roughly 200 terabytes. Even that massive chunk of data represents only a fraction of the Surface Web.
At Cyble, we pride ourselves on having complete visibility into the Surface Web, Deep Web, and Dark Web – and of these three, the Surface Web comprises less than 5% of the internet. It goes without saying that analyzing this massive data set manually is impossible, with petabytes of new data added every minute.
This is where Artificial Intelligence (AI) comes in. AI has been used as an umbrella term for many types of data-processing algorithms. Without being pedantic, any system that analyzes data and generates inferences without explicitly programmed rules can reasonably be called an AI system.
Along with the algorithms, hardware resources are an integral part of any modern-day AI system. AI systems should be designed to be distributed, scalable, and resilient with respect to storage and computing. According to multiple survey reports, the adoption of AI for Cybersecurity has been on the rise over the past decade, with over 70% of organizations adopting and prioritizing AI (1, 2).
AI is a natural fit for Cybersecurity
AI lends itself well to the domain of Cyber Threat Intelligence (CTI). Several important problems in CTI gathering and analysis are well suited to being solved with AI systems, as described below:
- Scale: With over a thousand data breaches per year (3), 4000 ransomware attacks per day (4), and 4 million files stolen per day (5), manual CTI analysis is impossible. Distributed computing and data storage combined with AI is adept at solving the scale problem.
- Correlation and pattern recognition: AI algorithms can ingest terabytes of data and detect patterns distributed across time and space that no analyst could correlate manually.
- Repetition: Threat analysts excel at identifying particular types of entities, which forms the basis for their in-depth analyses. But the work becomes repetitive when the search for similar entities must be conducted manually across thousands of documents, and new entity types must also be identified. Natural language processing models can be trained to recognize a unique or custom entity from a few thousand, or in some cases as few as tens of, labeled examples. New entities can thus be trained on the fly with minimal effort.
- Errors and false positives: Traditional data-processing algorithms were rule-based, relying heavily on programmed logic such as keyword matching, which produces many erroneous alerts and false positives. AI algorithms rely on context and semantics, and they learn patterns that cannot be explicitly programmed as rules, resulting in higher accuracy and fewer false positives across varied data.
- Continuous learning: AI systems can be designed to learn from continuous feedback. From models learning to filter out false positives to models detecting new threats, models become more accurate as more data is fed to them, adapting dynamically to previously unseen data. While this is a great advantage, model developers must stay keenly aware of data bias, model drift, and data poisoning, and design resilient systems accordingly.
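To make the contrast with rule-based matching concrete, here is a toy, stdlib-only sketch; the sample texts, labels, and the `TinyNaiveBayes` class are invented for illustration, not a production model. A hard-coded keyword rule flags any mention of "ransomware", while a tiny context-aware classifier weighs the surrounding words:

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def keyword_rule(text):
    # Rule-based baseline: any mention of "ransomware" is flagged as a threat.
    return "ransomware" in text.lower()

class TinyNaiveBayes:
    """Context-aware alternative: Laplace-smoothed naive Bayes over word counts."""
    def __init__(self):
        self.word_counts = {"threat": Counter(), "benign": Counter()}
        self.doc_counts = Counter()

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def predict(self, text):
        vocab = len(set().union(*self.word_counts.values()))
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            total = sum(counts.values())
            # log prior + Laplace-smoothed log likelihood of each token
            score = math.log(self.doc_counts[label] / total_docs)
            for word in tokenize(text):
                score += math.log((counts[word] + 1) / (total + vocab))
            scores[label] = score
        return max(scores, key=scores.get)

clf = TinyNaiveBayes()
for text in [
    "new ransomware strain encrypts files and demands bitcoin payment",
    "actor leaks stolen credentials from breach on forum",
    "ransomware gang publishes victim data dump",
]:
    clf.train(text, "threat")
for text in [
    "join our webinar on ransomware defence best practices",
    "quarterly security report and training schedule published",
    "sign up for the ransomware awareness training webinar",
]:
    clf.train(text, "benign")

doc = "register for our ransomware response webinar next week"
print(keyword_rule(doc))   # True: the keyword rule raises a false positive
print(clf.predict(doc))    # benign: context outweighs the single keyword
```

Production systems use far richer models and training sets; the point is only that even a small amount of learned context avoids the false positive a bare keyword rule produces.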
Challenges with deploying AI-powered solutions and mitigation approaches
Even though AI can handle the sheer scale and complexity of these datasets, several challenges must be overcome to utilize AI optimally and extract maximum value. Below are some of the most important challenges and our approaches to overcoming them:
- Noisy data: Effective CTI requires analyzing data from multiple sources such as data breaches, ransomware data dumps, and social media, all of which are very noisy. Data-breach and ransomware data are completely unformatted and often lack context. Carefully designed data pipelines with multiple preprocessing blocks are the key to overcoming this challenge.
- Unique data formats: Until recently, one of the biggest challenges in computer vision and NLP was the lack of large labeled datasets for training deep networks. For many use cases, that problem was solved by pre-trained models and transfer learning. Unfortunately, most data obtained from data breaches and dark web forums does not lend itself well to pre-trained models. It may be necessary to train AI models from scratch, or to selectively use pre-trained models after carefully orchestrated preprocessing. Depending on the business value of the use case, unique data formats can also be accommodated, albeit with more effort.
- Data labeling: Data labeling has been one of the biggest challenges for the supervised-learning component of AI since the beginning, and the same challenge persists in AI for cybersecurity. Well-designed use of transfer learning with few-shot and multi-shot models can establish baseline results and utility. An iterative approach can then be followed, progressively expending more labeling resources as the model's value becomes increasingly evident.
- Cost: Training and deploying AI models incur significant compute and storage costs. It is very important to understand the business value of a problem and assess whether AI is required for it at all. There are many instances where a simple regular expression can solve an NLP problem that would otherwise call for a sequence-to-sequence model. For model selection, it is important to step away from an academic or research mindset, where even a fractional percentage-point gain in performance can be significant and open new research paths but has little practical value in real-world deployments. A "horses for courses" approach generally keeps costs in check.
- Data poisoning: With the ubiquity of AI deployments by CTI researchers, attackers have also grown more sophisticated. Against continuous-learning systems, attackers deliberately craft inputs to manipulate the model into making false predictions. It is important to maintain system observability and include human-in-the-loop designs wherever appropriate.
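The carefully designed data pipelines mentioned under the noisy-data challenge can be as simple as a composed chain of small cleaning stages. A minimal, stdlib-only sketch; the stage functions and their ordering are illustrative, not an actual production pipeline:

```python
import re

def drop_control_chars(text):
    # Raw breach dumps often contain NUL bytes and other control characters.
    return "".join(ch for ch in text if ch.isprintable() or ch in "\n\t ")

def strip_html(text):
    # Scraped forum pages carry markup remnants; replace tags with spaces.
    return re.sub(r"<[^>]+>", " ", text)

def normalize_whitespace(text):
    # Collapse runs of spaces, tabs, and newlines into single spaces.
    return re.sub(r"\s+", " ", text).strip()

# Each stage is independent, so stages can be added, removed, or reordered
# per data source without touching the others.
PIPELINE = [drop_control_chars, strip_html, normalize_whitespace]

def preprocess(raw, stages=PIPELINE):
    for stage in stages:
        raw = stage(raw)
    return raw

raw = "<div>leaked\x00  creds:\n admin@example.com </div>"
print(preprocess(raw))  # leaked creds: admin@example.com
```

Keeping stages as plain functions in a list makes the pipeline easy to test stage by stage and to vary per source, which matters when each feed (breach dump, forum scrape, paste site) is noisy in its own way.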
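As an illustration of the regular-expression point under the cost challenge: extracting IPv4 indicators of compromise from raw text needs no model at all. The pattern below is a common simplified IPv4 regex used for illustration, not an exhaustive validator:

```python
import re

# Matches dotted-quad IPv4 addresses with octets 0-255; word boundaries
# keep it from matching inside longer digit runs like "999.1.1.1".
IPV4 = re.compile(
    r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
)

def extract_ips(text):
    return IPV4.findall(text)

report = "C2 beacon at 192.168.10.5 and 10.0.0.1, not 999.1.1.1"
print(extract_ips(report))  # ['192.168.10.5', '10.0.0.1']
```

A single compiled regex like this costs microseconds per document and is fully predictable, whereas training and serving a learned extractor for the same task would add compute cost with no accuracy benefit.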
We are still in the early days of realizing the full potential of AI for CTI. In spite of the impressive results produced by current AI systems, they will keep evolving to deliver more and more business value. Alongside this, a symbiotic business model that pairs sophisticated AI systems with oversight from talented cybersecurity professionals has proven extremely effective for Cyble and its suite of offerings.
About the Author
Uday Kiran Pulleti is the Senior Director – AI at Cyble. He is a core AI technologist with more than 15 years of expertise in conceiving new ideas, developing algorithms, and enabling the productization of complex systems. Uday has led several products in the domains of legal tech, smart home, smart city, and smart industry. Previously, as Director of AI at Cognition IP, Uday led the development of NLP products that improve patent lawyers' efficiency in patent search and drafting. As Principal Research Scientist at Honeywell Global Labs, he led various AI initiatives, including a cloud-based object detection and classification platform, video and audio security for smart home and smart city applications, multi-sensor UAV-based inspection applications, and an indoor location-tracking platform, among others.