A post in an occasional series about the ins and outs of data science, by senior AI researcher Natan Katz.
Artificial intelligence has become common in our world. Everyone knows someone who is involved in AI. The massive progress often reported by academic institutes and by research teams at organizations such as Microsoft or Google raises the question: “How well will AI perform in commercial products?” In this post, I will describe the gap between strong results in a research team and the challenges that phishing poses.
AI - Domain of Excellence
There are several domains in which AI has achieved massive progress on classification problems:
- Sentiment Analysis
- Recommendation systems
(You may be thinking that I forgot about games. I did not. AI tasks in games are not classification problems. Look out for a future post about the similarities between phishing and games.)
In the following sections, I will characterize AI tasks, the criteria that make a domain successful, and the aspects of phishing that differ from these domains.
Common Characteristics of Domains
I will begin by describing the AI problem from the BI perspective. We have:
- Observed data – images, audio files, set of tabular columns or textual content.
- Target outputs – The values that the AI machine is expected to output:
AI has to identify whether an image is a dog, a cat, or an oak tree. In a different use case, it has to decide whether a user will prefer to buy a vacation in LA, London, or Frankfurt, or to decide if a text is about a payment overdue, an audit, or a churn request.
Data science's objective is to train the machine to decide as accurately as a human does.
Observed Data: Why Phishing Is Not Vision or Speech
Vision and speech analytics enjoy a wide set of commonly used technical tools for pre-processing: FFT, MFCC and STFT in speech, and Hamming distance and convolutions in vision.
A researcher in these domains knows exactly which tool to use to pre-process the data for AI. In addition, the data in these domains come with known metrics that tell the researcher whether a step is correct.
Text analytics has fewer well-studied computational tools. However, with advances such as attention and transformers, current embedding techniques serve text nearly as well as FFT and convolution methods serve speech and vision. These techniques allow for the development of accurate language tools that can be used for classification tasks, such as sentiment analysis.
In the case of recommendation systems, tabular data has fewer sophisticated tools (and fewer academic studies), but each column can be processed independently.
Now, what about phishing?
The raw data in phishing is a combination of textual content and tabular data. It’s pretty cool: we are both a recommendation problem and a sentiment problem.
However, phishing is not a recommendation system, because in those systems the user cooperates with the system. A Sprint subscriber wants Sprint to offer goods that interest them. A Booking.com user wants to get a unique offer.
Our culprits, by contrast, prefer to give up this uniqueness. In phishing, the user competes with the system rather than cooperating with it.
What about text? We do use text tools; however, phishing is not a standard text problem.
Why Phishing is Different
The target is the set of values that our AI machine needs to output. To train these machines, we need someone to label the data accurately. The target introduces several differences between phishing and other domains:
In our successful domains, labeling is trivial: the CRM of the recommendation system records whether a customer bought the offer, and it is obvious whether an image shows a dog, a mouse or a football.
In phishing, we don’t have these luxurious conditions. People who do phishing are clever. They don’t write suspicious emails. They don’t write emails that can be identified by anomaly detection engines. They write emails as we do: naive, common and without any unusual words or phrases.
A single email is therefore extremely difficult to classify without additional context. Emails that are nearly identical may have different classifications, so the labels may be noisy. Noisy labels can lower the model's performance and make the user work harder, either analyzing emails that were wrongly flagged or missing real phishing.
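To make the label-noise point concrete, here is a minimal sketch of how one might look for near-identical emails that carry different labels. The emails, the labels, the 0.9 similarity threshold and the helper name `conflicting_pairs` are all made up for illustration; character-level similarity from Python's standard `difflib` is only one possible choice, not a description of a production pipeline.

```python
from difflib import SequenceMatcher
from itertools import combinations


def conflicting_pairs(emails, labels, threshold=0.9):
    """Return index pairs of near-identical emails whose labels disagree."""
    pairs = []
    for i, j in combinations(range(len(emails)), 2):
        # ratio() is 1.0 for identical strings and approaches 0.0 for unrelated ones
        similarity = SequenceMatcher(None, emails[i], emails[j]).ratio()
        if similarity >= threshold and labels[i] != labels[j]:
            pairs.append((i, j))
    return pairs


# Illustrative (invented) data: two almost identical emails, labeled differently.
emails = [
    "Please review the attached invoice and confirm payment by Friday.",
    "Please review the attached invoice and confirm payment by Monday.",
    "Team lunch is moved to noon tomorrow.",
]
labels = ["clean", "phishing", "clean"]

conflicts = conflicting_pairs(emails, labels)
print(conflicts)
```

Pairs flagged this way are candidates for a second look by a human labeler, since the text alone does not explain why the labels differ.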
Consider the following anecdote: One of the leaders of Yahoo! once said, “we have English language models, but we need to handle Twitter and Facebook postings.” At Avanan, we need to speak not only English and other languages but also phishing.
In each of the successful domains we mentioned, there are plenty of balanced data sets. Finding a phishing data set that is both large and made up of 30-50% phishing is extremely difficult.
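A small sketch of why imbalance matters. It assumes a hypothetical mailbox of 1,000 emails of which 10 are phishing (illustrative numbers, not measured data) and a trivial model that labels everything clean. The helper names are invented for this example, and inverse-frequency class weights are shown as one common heuristic for compensating, not as the only remedy.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of truly positive items that were predicted positive."""
    hits = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    return hits / sum(t == positive for t in y_true)


def inverse_frequency_weights(y_true):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y_true)
    return {c: len(y_true) / (len(counts) * n) for c, n in counts.items()}


# Hypothetical data: 1 = phishing, 0 = clean.
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000  # a "model" that calls every email clean

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
weights = inverse_frequency_weights(y_true)
print(acc, rec, weights)
```

The always-clean model looks excellent on accuracy while catching no phishing at all, which is why accuracy alone says little on data this skewed.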
Consider a user. The user hovers over a list of emails and notices that some of the clean emails are identified as marketing, while others are identified as spam. Will the user react to both cases in the same way? Will the user find similarities between a phishing email that was classified as spam and the same email when it’s classified as clean? The answer is no for both questions.
Why does it matter?
Two reasons. The first is a business one. We want our customers to have to follow up only on the emails that require it; otherwise, the work becomes too cumbersome. If they see that we classify phishing as “clean” or “marketing,” they may ask questions or show dissatisfaction.
The second reason is data science. There are methods to train orderable categories, but they are less effective than those for non-orderable ones. The most common classification loss is cross-entropy, which penalizes a mistake without any consideration of its nature: it looks only at the probability assigned to the true category. Consider the following example. Let’s say we study a system that identifies business processes: payment, audit and technical issues. We get a vector of probabilities (0.3, 0.2, 0.5) and a vector of probabilities (0.3, 0.7, 0), both for real category one. The loss will have the same value, because both vectors give the true category a probability of 0.3. If we replace payment with phishing, and the other categories with spam and clean, it’s pretty clear that the mistakes are different: the first vector leans toward “clean,” the second toward “spam.”
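The example can be checked numerically. The sketch below assumes the true category is the first one (the one given probability 0.3 in both vectors) and uses the standard single-label cross-entropy, the negative log of the probability assigned to the true class. The cost matrix is purely hypothetical, chosen only to show how a cost-aware measure would separate the two predictions.

```python
import math

CLASSES = ("phishing", "spam", "clean")


def cross_entropy(probs, true_index):
    """Single-label cross-entropy: depends only on the true class probability."""
    return -math.log(probs[true_index])


def expected_cost(probs, true_index, cost):
    """Average cost of the prediction, where cost[true][predicted] is given."""
    return sum(p * cost[true_index][k] for k, p in enumerate(probs))


pred_a = (0.3, 0.2, 0.5)  # leans toward "clean"
pred_b = (0.3, 0.7, 0.0)  # leans toward "spam"
true_index = 0            # assumption: the email is really phishing

# Hypothetical costs: calling phishing "clean" is far worse than calling it "spam".
cost = [
    [0.0, 1.0, 5.0],  # true phishing
    [1.0, 0.0, 1.0],  # true spam
    [5.0, 1.0, 0.0],  # true clean
]

loss_a = cross_entropy(pred_a, true_index)
loss_b = cross_entropy(pred_b, true_index)
cost_a = expected_cost(pred_a, true_index, cost)
cost_b = expected_cost(pred_b, true_index, cost)
print(loss_a, loss_b, cost_a, cost_b)
```

Both predictions get the same cross-entropy, while the expected cost is higher for the one that leans toward “clean.” That difference is exactly what plain cross-entropy cannot see.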
This post focused on the differences between AI for phishing and AI for the domains where it commonly works well. This will allow us to understand why some of the steps are not trivial; in fact, they are extremely appealing for AI.