A good API specification can provide valuable insight into how the API works. It describes which data objects exist, whether they are required, how they are formatted, what authentication they need, and so on. Using these details to create policies for rate limiting, input validation, and the like can block ‘hostile’ traffic (e.g., injections) with rule-based protection, while letting legitimate traffic pass through. But it’s the legitimate traffic that can cause damage.
Consider a food ordering app’s API. For an /order endpoint with a POST method, the schema defines two required fields in the request - userID and userName - noted here as fields A and B. When placing an order, the user must also choose a payment method - either “credit card” or “lunch-credit” - noted as fields C and D.
However, according to the API specification, neither field (C nor D) is required - yet exactly one of them must be included in the request for the order to be accepted.
From a data perspective, the actual request structure is more complex than the schema suggests: either fields A, B, C occur, or fields A, B, D occur - yet this “mutual-exclusion” pattern is not expressed in the schema, and thus cannot be enforced. At the same time, a request that contains all four fields should be deemed suspicious and trigger an alert, or even be blocked from reaching the server.
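As a sketch, this unexpressed rule takes only a few lines of validation logic. The field names below (userID, userName, creditCard, lunchCredit) are assumptions standing in for fields A-D:

```python
def validate_order(payload: dict) -> bool:
    """Enforce the pattern: A,B,C or A,B,D - but never both C and D."""
    required = {"userID", "userName"}        # fields A and B: always required
    payment = {"creditCard", "lunchCredit"}  # fields C and D: mutually exclusive
    if not required <= payload.keys():
        return False
    # Exactly one payment field must be present; a request carrying
    # both (or neither) violates the business logic and is suspicious.
    return len(payment & payload.keys()) == 1
```

A request with both payment fields fails the check, even though a schema marking C and D as merely optional would let it through.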
This pattern illustrates business logic that occurs in practice but isn’t explicitly stated in the API specification, which is why it cannot be enforced. Protection is hampered by the lack of critical context. To guarantee maximum coverage and protection, the underlying business logic must be inferred from patterns in the API data itself.
Another example can be found in a bank’s API data. For a /transaction endpoint with a POST method, the request has a single mandatory field - User Id - which is expressed in the schema file as well. When sending the request, the user chooses the Transfer Type in a corresponding field, which can take the value “Domestic” or “International”.
If the transfer type is International, an additional field - Account Number - must follow the International Bank Account Number (IBAN) format, and other fields such as Bank Routing Scheme appear. If, however, the transfer is Domestic, the account number takes the form used by the local bank, and different fields appear, such as Bank Number.
These dependencies make the fields optional and conditional, while at least one of them remains mandatory. The result is a complex hierarchical relationship between data objects, encompassing both fields and values. This business process doesn’t appear in the API specification, making it impossible to enforce by relying on the schema alone.
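A hedged sketch of these conditional rules, with simplified field names and a deliberately naive IBAN check (real IBAN validation involves country-specific lengths and a checksum):

```python
def validate_transaction(payload: dict) -> bool:
    """Conditional rules the flat schema cannot express (names are illustrative)."""
    if "userId" not in payload:  # the only field the schema marks as required
        return False
    account = payload.get("accountNumber", "")
    if payload.get("transferType") == "International":
        # International: IBAN-style account number (starts with a country
        # code) plus a Bank Routing Scheme field.
        return account[:2].isalpha() and "bankRoutingScheme" in payload
    if payload.get("transferType") == "Domestic":
        # Domestic: local numeric account number plus a Bank Number field.
        return account.isdigit() and "bankNumber" in payload
    return False
```

The schema alone would accept any request containing User Id; the sketch captures the hierarchy of fields and values that actually governs validity.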
The main challenge with this problem is the multidimensionality of the data. With multiple fields and values, it can be difficult to take all of the information into account. At the same time, narrowing it down to small subgroups such as pairs or triplets of fields (or fields and values) is insufficient since it does not capture the full scope of the pattern.
One way to address this problem algorithmically is to represent the API data as a graph: nodes are fields (or fields and values), and an edge connects node u to node v whenever both fields appear together in the same packet. After some data processing and normalization, an undirected graph is formed.
Following the POST /transaction example and focusing on a sub-part of the fields-values graph: the fields are represented as nodes, while some fields - like Transfer Type - are split into separate nodes based on their values. Field nodes are connected by edges based on their joint occurrence in API transactions. Lastly, applying a clique-finding algorithm to the entire graph amounts to searching for all clusters within the API data: the groups of fields (or fields and values) that occur together.
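The construction and the clique search can be sketched end to end. The packets below are invented stand-ins for POST /transaction traffic, and the clique finder is a basic Bron-Kerbosch enumeration; a production system would likely use an optimized library implementation:

```python
from itertools import combinations

# Hypothetical POST /transaction packets, reduced to their field sets.
# The Transfer Type field is split into field=value nodes.
packets = [
    {"userId", "transferType=Domestic", "accountNumber", "bankNumber"},
    {"userId", "transferType=Domestic", "accountNumber", "bankNumber"},
    {"userId", "transferType=International", "accountNumber", "bankRoutingScheme"},
    {"userId", "transferType=International", "accountNumber", "bankRoutingScheme"},
]

# Build the undirected co-occurrence graph: an edge connects two nodes
# whenever they appear together in the same packet.
adj = {}
for packet in packets:
    for u, v in combinations(sorted(packet), 2):
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

def bron_kerbosch(r, p, x, cliques):
    """Enumerate maximal cliques (basic Bron-Kerbosch, no pivoting)."""
    if not p and not x:
        cliques.append(r)
        return
    for v in list(p):
        bron_kerbosch(r | {v}, p & adj[v], x & adj[v], cliques)
        p.remove(v)
        x.add(v)

cliques = []
bron_kerbosch(set(), set(adj), set(), cliques)
```

On this toy traffic the search recovers exactly the two business-logic clusters - the Domestic field group and the International field group - without ever consulting a schema.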
These clusters, or substructures, reflect the business logic of the application as depicted in the API data itself - not in the API specification. By automatically inferring these patterns, their existence can be enforced upon incoming transactions, and packets that violate them can be flagged as anomalous.
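In sketch form, enforcement then reduces to checking an incoming packet’s field set against the mined clusters. The clusters below are assumed, echoing the /transaction example:

```python
# Clusters assumed to have been mined from traffic (names are illustrative).
learned_clusters = [
    {"userId", "transferType=Domestic", "accountNumber", "bankNumber"},
    {"userId", "transferType=International", "accountNumber", "bankRoutingScheme"},
]

def is_anomalous(packet_fields: set) -> bool:
    # Flag any packet whose field combination matches no learned cluster,
    # e.g. one that mixes Domestic and International fields.
    return packet_fields not in learned_clusters
```

A packet carrying both bankNumber and bankRoutingScheme matches neither cluster and would be flagged, even though every individual field is valid per the schema.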
Even the best API schema isn’t up to the challenge of security enforcement. Maintaining accurate schemas is certainly good practice for scanning and fixing vulnerabilities before deploying to production. But given the speed and scope of development, it’s not uncommon (to put it mildly) for API specifications to be inaccurate, out-of-date, or insufficient.
Going forward, the effectiveness of an API specification for enforcement is only expected to deteriorate. GraphQL is gaining momentum as a query language for APIs, and it introduces greater ambiguity than today’s REST APIs. GraphQL queries are inherently more conditional, creating relationships and sequences - much like those in the examples above - that do not necessarily appear in the specification.
The best way forward is to let the data tell the story. By monitoring API data, it becomes possible to model the full functionality of the API, uncover the business logic, and use that knowledge to automatically develop runtime protections. Analyzing the data and learning what constitutes legitimate usage eliminates the reliance on documentation accuracy for enforcement.