We all use a multitude of applications in our everyday lives: to transfer money, book car services, order a pizza, and so much more. We regularly provide these applications with sensitive information about ourselves, often without giving it a second thought: our home address, phone number, email address, or credit card information. In many cases, we get even more sensitive data back from a service provider, such as our account balance or medical records.
We do so because we trust the service providers, assuming that personally identifiable information (PII) and other sensitive data are handled with care and that only we can access them. We also do it because it’s easy and convenient. Yet this sensitive, personal information is an attractive target for attackers, who go after the APIs that enable the data exchange.
One such attack was discovered in a T-Mobile API: a user logging in to a T-Mobile application used a specific endpoint in which one of the request parameters is a phone number (MSISDN). The response to this request contained private information about the user to whom that phone number belongs. However, by calling this endpoint repeatedly with many different phone numbers, the attacker(s) were able to obtain personal and sensitive data of many of the application’s users (who were unaware that their data had leaked), and used this information for other malicious purposes (e.g., taking over social media accounts).
This raises the question of how organizations can protect themselves from such attacks and, more importantly, how they can ensure that the sensitive data they store is not abused.
First things first, let’s look at some alternative approaches
Going back to T-Mobile’s vulnerability described above, one approach would be to find all values that match a phone number pattern (using a regular expression, for instance) and treat those with greater care. However, while this would indeed be effective in finding personal phone numbers, it may also find public phone numbers, which are not sensitive information. For example, your bank branch’s phone number is not sensitive data.
At the same time, there could be many other types of sensitive information that don’t follow a ‘predefined’ value pattern (as email addresses or phone numbers do). Moreover, such an approach would require comprehensive and ongoing maintenance to make sure all sensitive data is known and marked. In most cases, this approach doesn’t scale to cover the variations and changes that occur in API data.
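To make the pattern-matching approach and its false-positive problem concrete, here is a minimal sketch in Python. The regular expression is a simplified, US-style pattern chosen for illustration, and the field names and phone numbers are hypothetical; none of them come from a real API.

```python
import re

# Simplified, illustrative pattern for US-style phone numbers (an assumption,
# not a complete validator for real-world numbering plans).
PHONE_RE = re.compile(r"(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")

def find_phone_fields(payload: dict) -> list:
    """Return the names of fields whose string value looks like a phone number."""
    return [
        name
        for name, value in payload.items()
        if isinstance(value, str) and PHONE_RE.fullmatch(value.strip())
    ]

# Hypothetical API response: one personal number, one public branch number.
response = {
    "customer_phone": "+1 415-555-0134",
    "branch_phone": "(212) 555-0199",
    "fuel_type": "diesel",
}

flagged = find_phone_fields(response)
print(flagged)
```

Both phone fields are flagged, although only the customer's number is sensitive: the pattern alone cannot tell a personal number from a public one.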
A second approach might take a data-driven path based on how sensitive data should statistically behave. Roughly speaking, we can say that sensitive data should exhibit many different values, with similar occurrence counts for most of them.
Consider the following example: in an automotive API we are likely to find two types of data, Fuel Type and Vehicle ID, the latter being considered sensitive. Figure 2 below shows data histograms (unnormalized and pruned for brevity) of both data fields, Fuel Type on the left and Vehicle ID on the right. Looking only at the data distribution, we can see that the non-sensitive data has many occurrences of a limited set of values, while the sensitive data has few occurrences of many different values.
However, as with the first approach of finding phone number patterns, this approach wouldn’t cover all sensitive data either, and it might falsely identify data as sensitive. For example, time-related fields can have a distribution similar to that of Vehicle ID, yet they are not sensitive.
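One way to put a number on "many different values with similar counts" is the Shannon entropy of a field's values, normalized by the maximum entropy possible for that many observations. This is only a sketch of the idea: the choice of score and the sample values below are assumptions made for illustration, not measurements from a real API.

```python
import math
from collections import Counter

def uniqueness_score(values):
    """Entropy of the observed values divided by log2(number of observations).

    The score is 1.0 when every observation is distinct and approaches 0.0
    when a single value dominates.
    """
    total = len(values)
    if total < 2:
        return 0.0
    counts = Counter(values)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(total)

# Hypothetical observations for three fields of an automotive API.
fuel_types = ["petrol"] * 6 + ["diesel"] * 3 + ["electric"]
vehicle_ids = [f"VIN{i:04d}" for i in range(10)]
timestamps = [f"2024-01-01T10:00:{i:02d}Z" for i in range(10)]

fuel_score = uniqueness_score(fuel_types)
vehicle_score = uniqueness_score(vehicle_ids)
time_score = uniqueness_score(timestamps)
print(fuel_score, vehicle_score, time_score)
```

The fuel field scores low and the vehicle ID scores high, as hoped, but the timestamp field scores just as high as the vehicle ID, which is exactly the false positive described above.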
NLP to the rescue
A different approach to this problem can be derived by viewing the API data, essentially a mechanism for machine-to-machine communication, as a dialogue. This makes it possible to model the various information elements that appear in the data as words in the ‘API language’. Building on this concept, we can apply algorithms and methods from the Natural Language Processing (NLP) domain, which has advanced greatly in recent years, to protecting APIs.
One example is the term frequency-inverse document frequency, or tf-idf, weighting scheme, one of the most fundamental ranking schemes used by search engines. It is used to evaluate how important a word is to a document in a collection of documents. A word’s importance increases proportionally to the number of times it appears in the document, but is offset by how frequently the word appears across the entire collection.
For example, the word ‘you’ can appear many times in a document, but also in many others, hence its importance is low. On the other hand, we wouldn’t expect the word ‘coffee’ to appear in as many documents, so it is likely to be important in a document in which it appears frequently.
In its basic form, the tf-idf formula is a product of two terms:
tf(t, d) = the number of times term t appears in document d
idf(t) = log(total number of documents / number of documents in which term t appears)
Different variations of the tf-idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document’s relevance to an input query.
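The basic formula above can be sketched in a few lines of Python. The three tiny "documents" are made up to mirror the ‘you’ versus ‘coffee’ example; this is the plain product of the two terms, not any particular search engine's variant.

```python
import math
from collections import Counter

def tf_idf(term, doc, corpus):
    """Basic tf-idf: raw term count in doc times log(N / document frequency)."""
    tf = Counter(doc)[term]
    docs_with_term = sum(1 for d in corpus if term in d)
    if docs_with_term == 0:
        return 0.0  # term never appears in the corpus
    idf = math.log(len(corpus) / docs_with_term)
    return tf * idf

# Illustrative corpus: each document is a list of words.
corpus = [
    "you like coffee and you drink coffee daily".split(),
    "you like tea".split(),
    "you walk daily".split(),
]

you_score = tf_idf("you", corpus[0], corpus)        # appears in every document
coffee_score = tf_idf("coffee", corpus[0], corpus)  # appears in one document only
print(you_score, coffee_score)
```

Both words appear twice in the first document, but ‘you’ appears in every document, so its idf is log(3/3) = 0 and its score vanishes, while ‘coffee’ keeps a positive score of 2 × log(3).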
Similarly, by viewing the API data as words and modeling the API dialogue, we can use tf-idf ranking in the same way it is used to determine which words are important for a query: to determine which fields in the data are important (that is, sensitive) for a user. This way, it is possible to automatically find the fields in the data whose values are sensitive, without any labels or prior knowledge of the API.
By applying such scoring, we can automatically find the needle in the haystack: we can distinguish the sensitive fields among the thousands of different fields in the API data we monitor (and the many more different values each field may take). In turn, this ability means that sensitive data can be kept safe, enabling organizations to push forward with publishing new APIs that unlock greater operational efficiencies, new business models and better customer experience, keeping pace with today’s digital transformation and the API economy.
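The text does not spell out how API data is mapped onto documents and terms, so the following is only one possible reading, with every modeling choice an assumption: each user's traffic is treated as a "document", each (field, value) pair as a "term", and a field's score is the mean tf-idf of its terms. The users, field names and values are hypothetical.

```python
import math
from collections import Counter, defaultdict

def field_scores(traffic):
    """Map {user: [(field, value), ...]} to {field: mean tf-idf of its terms}.

    Assumed modeling: one document per user, one term per (field, value) pair.
    """
    n_users = len(traffic)
    # Document frequency: in how many users' traffic does each term appear?
    df = Counter()
    for calls in traffic.values():
        df.update(set(calls))
    per_field = defaultdict(list)
    for calls in traffic.values():
        for term, count in Counter(calls).items():
            field, _value = term
            per_field[field].append(count * math.log(n_users / df[term]))
    return {field: sum(s) / len(s) for field, s in per_field.items()}

def make_calls(vin, fuel, day):
    """Three hypothetical API calls by one user, each with a fresh timestamp."""
    calls = []
    for hour in range(3):
        calls += [
            ("vehicle_id", vin),
            ("fuel_type", fuel),
            ("timestamp", f"2024-01-{day:02d}T0{hour}:00Z"),
        ]
    return calls

traffic = {
    "alice": make_calls("VIN-A", "diesel", 1),
    "bob": make_calls("VIN-B", "diesel", 2),
    "carol": make_calls("VIN-C", "petrol", 3),
    "dave": make_calls("VIN-D", "petrol", 4),
}

scores = field_scores(traffic)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Under these assumptions the vehicle ID ranks first: its value recurs within one user's traffic (high tf) and appears for no one else (high idf). Fuel type is shared across users, and each timestamp occurs only once, so both score lower, which is the separation the purely statistical approach could not achieve.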