Generative AI + Quant Trading = Love. (And Profits, Of Course)
It's a match made in quant heaven.
Ever since the release of ChatGPT, we’ve been looking for a way to fit components of generative AI into our models and strategies, whether that was using it to do math or having it read and translate an SEC filing. All to no avail (well, to some avail).
However, it looks like we’ve finally got something.
To bring you up to speed, we’ve been using the volatility surface to identify tradeable situations. If we see a sudden kink in the surface where, say, premiums on AAPL options get a lot richer, we do some manual work to find out why, then take a trade if the risk/reward is right. So far, it hasn’t been a bad approach:
While we’re seeing initial positive results, there are a few things we need to take care of:
Prove that we aren’t getting lucky.
Squeeze in AI to reduce/eliminate the manual workload.
If we can prove that our success hasn’t been due to luck and we can use a new technology to make the trades near-effortless — let’s just say this might be a good summer.
If you’re interested in whipping up our surface on your end, check this out.
So, without further ado, let’s get into it.
Proving We Aren’t Lucky
Our strategy blends the quantitative with the discretionary, so we have to be creative when building out a backtest. Replicating the vol surface of a given day is just a matter of transforming historical data, so we can reasonably identify the stocks that saw a boost in volatility expectations. However, we also need a way of identifying the information responsible for those changes in the surface.
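To make the screening half concrete, here’s a minimal sketch of how that scan could look, assuming we already have a table of historical at-the-money implied vols per ticker (the column names and the 25% jump threshold are placeholders of ours, not part of the actual strategy):

import pandas as pd

# Hypothetical table: one row per (date, ticker) with a 30-day ATM implied vol.
iv = pd.read_csv("atm_iv_history.csv", parse_dates=["date"]).sort_values(["ticker", "date"])

# Day-over-day change in implied vol for each ticker.
iv["iv_change"] = iv.groupby("ticker")["atm_iv"].pct_change()

# Flag names whose vol expectations jumped sharply on the most recent date.
latest = iv[iv["date"] == iv["date"].max()]
flagged = latest[latest["iv_change"] > 0.25]
print(flagged[["ticker", "atm_iv", "iv_change"]])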
Considering that we don’t have access to expensive resources like Bloomberg news feeds (yet), we have to make do with what’s available: slow, messy public feeds. Now, we have plenty of respect for our data provider, but here’s what a request for the most recent news looks like:
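For anyone without that feed, here’s a rough sketch of the kind of call involved; the endpoint, parameters, and field names below are placeholders rather than our provider’s actual API:

import requests

# Placeholder endpoint and key; our actual provider's URL and parameters differ.
API_KEY = "YOUR_KEY"
url = "https://example-newsfeed.com/v1/news"

resp = requests.get(url, params={"tickers": "NYCB", "limit": 50, "apikey": API_KEY})
resp.raise_for_status()

# The raw feed comes back as a flat list of headlines from mixed sources,
# with no indication of which ones actually moved the stock.
for item in resp.json():
    print(item["source"], "-", item["title"])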
The API pulls from decent news sources like Benzinga and MarketWatch, but it also pulls from places like The Motley Fool, which isn’t exactly the premier venue for market-moving headlines.
Anyway, the point is that the feed is unstructured and has a lot of fluff. In order to backtest this, we need to manually root through the headlines of the day, pick out the ones most responsible for the move and then run our strategy based on those.
We need the help of GPT.
If we can train GPT to know which headlines are the most useful, we can then pass in the unstructured list and tell it to return only the headlines that best match that prior guidance.
To begin, we gritted our teeth and built a manual dataset of headlines that we deemed responsible for the respective stock’s price movement. We filtered these down to the ones we would actually base trades off of:
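Conceptually, that dataset is nothing more than a small labeled table; the rows and columns below are illustrative stand-ins rather than our actual records:

import pandas as pd

# Illustrative rows only: the ticker, the headline we would trade on, and the direction of the move.
curated = pd.DataFrame([
    {"date": "2024-02-29", "ticker": "DFS", "headline": "Discover Financial CEO to step down", "move": "down"},
    {"date": "2024-03-01", "ticker": "NYCB", "headline": "Regional-bank stocks dragged down by NYCB", "move": "down"},
])
curated.to_csv("curated_headlines.csv", index=False)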
For context, our favorite strategy is to sell options in the opposite direction of the news when there’s an increase in vol expectations. So, if DFS vol is richer because of a CEO exit and the stock price is lower, we sell calls struck outside of the 1-day implied move.
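To put numbers on that 1-day implied move, a common back-of-envelope conversion is spot times annualized IV times the square root of one trading day over 252. A quick sketch with made-up DFS numbers, purely for illustration:

import math

spot = 120.00  # hypothetical DFS price after the CEO-exit headline
iv = 0.45      # hypothetical annualized implied vol after the repricing

# Back-of-envelope 1-day implied move: spot * IV * sqrt(1 trading day / 252).
implied_move = spot * iv * math.sqrt(1 / 252)
print(f"1-day implied move: ${implied_move:.2f}")

# The stock is down on the news, so we look at calls struck above the implied move.
min_call_strike = spot + implied_move
print(f"Sell calls at strikes above ${min_call_strike:.2f}")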
Now that we have an optimal set of headlines, we need to train GPT to be able to fish out other headlines just like them. There are generally two ways of fitting GPT to an internal task:
Fine-Tuning
With fine-tuning, we would be diving into the nitty-gritty of updating the model’s weights on additional, task-specific training data so that it becomes more attuned to the job at hand. This is essentially what private companies do for internal uses, since you can’t just prompt the public GPT to answer questions based on your company’s internal data (e.g., “How much revenue has the Anderson account generated, and how often have its payments arrived late?”). The drawback is that fine-tuning is highly prone to overfitting, where the model performs well on the training data but poorly on new, unseen data.
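For reference, OpenAI’s fine-tuning flow expects labeled examples uploaded as a JSONL file in its chat format, followed by a training job. A minimal sketch; the file name and base model here are placeholders:

from openai import OpenAI

client = OpenAI()

# Each line of the JSONL file is one labeled example, roughly:
# {"messages": [{"role": "user", "content": "<the day's raw headline list>"},
#               {"role": "assistant", "content": "<the headlines we would actually trade on>"}]}
training_file = client.files.create(
    file=open("headline_examples.jsonl", "rb"),
    purpose="fine-tune",
)

# Kick off a fine-tuning job on top of a base model.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id)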
Prompt Engineering
Prompt engineering, on the other hand, doesn’t change the model at all; instead, it carefully crafts the input so that the model produces exactly the output you want. This mainly means specifying instructions, providing examples of the desired output, and framing questions in a certain way. This approach gets the most out of the extremely large training dataset GPT is built on, since we don’t change anything but the instructions.
For our uses, we’ll go with prompt engineering.
Before continuing, let’s step back and take a big-picture look at where we’re going:
Now, here’s our prompt-to-be:
Input: "Provided is a series of headlines that best explain the real-time performance of a stock. You will first receive the list of the useful headlines, then you will receive a different set of headlines that are not as structured. I want you to parse through those second headlines and only return the headlines most likely to be responsible for the current stock performance. These results should be similar to the first list.
Useful, Structured List: [{headline_list}]
Unstructured List: [{unfiltered_headlines_list}]"
The useful list will be the headlines we pre-defined, and the unstructured list will be the full list of headlines from that day: some useful, most not.
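Wired up, it’s just a templated string sent to the chat completions endpoint. Here’s a minimal sketch using OpenAI’s Python SDK; the model choice and the placeholder headline strings are ours, not prescriptive:

from openai import OpenAI

client = OpenAI()

# Placeholders: in practice these come from our curated dataset and the day's raw feed.
headline_list = "Discover Financial CEO to step down; Regional-bank stocks dragged down by NYCB"
unfiltered_headlines_list = "3 Stocks to Hold Forever; Regional-bank stocks dragged down by NYCB; ..."

prompt = (
    "Provided is a series of headlines that best explain the real-time performance of a stock. "
    "You will first receive the list of the useful headlines, then you will receive a different "
    "set of headlines that are not as structured. I want you to parse through those second "
    "headlines and only return the headlines most likely to be responsible for the current "
    "stock performance. These results should be similar to the first list.\n"
    f"Useful, Structured List: [{headline_list}]\n"
    f"Unstructured List: [{unfiltered_headlines_list}]"
)

response = client.chat.completions.create(
    model="gpt-4",  # model choice is ours; any chat-capable model works for this sketch
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)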
Here’s a sample output from the most recent trading day, 3/1/2024:
Output: “Based on the useful, structured list provided, here are the headlines from the unstructured list that are most likely to explain the current stock performance:
1. Regional-bank stocks dragged down by NYCB — led by banks with exposure to New York City real estate
2. Hewlett Packard Enterprise Posts Mixed Q1 Results, Joins Fisker, Plug Power And Other Big Stocks Moving Lower In Friday's Pre-Market Session”
It returned just 2 of the 44 headlines (4.5%), and after manually verifying, these were indeed among the most useful of the set provided by the API. As things currently stand, we check Bloomberg and Reuters ourselves for relevant headlines, so while that manual coverage would yield a larger sample size, this is good enough for testing purposes.
So, with that being said, let’s get into a backtest: