A physicist's view on Market Microstructure — Building a Rust program to match a trade with its corresponding order book event.

Gil-Arnaud Coche
26 min read · Jan 1, 2024


Code available on GitHub

All the work presented here has been achieved with the help of a specialized GPT, which I made public at the link below. It is tuned to produce Rust code.
https://chat.openai.com/g/g-6jn21r4yc-algo-trader

When the physicist realizes that his Rust superpowers will be handy (generated by DALL·E)

Binance has revolutionized a lot of things when it comes to market data availability.

When I was a student, you simply could not get your hands on order book data. Or you knew someone (generally who knew someone)… When I was a young Quant, I had to wait to be assigned to the right project to receive what I perceived then as the most incredible data that ever existed. And I had to sweat to extract it… from the logs of the Smart Order Router!

Nowadays, you can freely connect to the Binance API to get very decent microstructure data. It is not perfect, as you don’t get the full information — you get Market By Price (MBP) data and not Market By Order (MBO) data, check this post for more information — but it is already a really good place to start and you can have quite a bit of fun with it.

In today’s post, I will try and guide you through one of the most challenging, yet all the more interesting, problems of algorithmic trading (actual trading set aside, of course!).

We will write a naive algorithm that matches order book event updates with trade updates.

The what with what? Yes, I know: if you are not familiar with these terms, some clarifications are in order. These clarifications will be the object of the first part, where we’ll refresh our memory on order books and trades.

In the second part, we’ll talk about the Data Sources. In our case, we will be sourcing our data from Binance using some scripts available in the GitHub repo.

In the third part, we’ll investigate a small data sample. In that investigation, we will, by hand*, match trade updates with order book events. The task may seem hard at first, and it will have some caveats, but with a few simple rules we can build something to get us started.

Fourth, we’ll get our hands dirty with an actual algorithm. The algorithm will achieve about 50% reconstruction. It is a small ratio, but I will be able to explain most of the reasons behind such reconstruction performance. And despite this small ratio, the investigation is still very insightful and a useful learning experience.

And you’ll see that achieving this result already requires quite a lot of work.

Let's go!

*Of course I wrote a program to help…

Order Book and Trades

In order to buy or sell a financial instrument, a trader can enter an agreement with a person over the phone; this is an over-the-counter, or OTC, trade. They can also make an entry in a dedicated registry, called an order book, providing the price and quantity they are attempting to trade.

An order book can be sketched as below.

Sketch of an Order Book.

Each rectangle represents a price at which the asset can be agreed to be exchanged. The number inside the rectangle is the quantity of said asset that is waiting to be sold (Ask side — in red) or purchased (Bid side — in green). Behind each of these quantities, there may be multiple traders waiting for an execution.

When a trader submits an order, it may either be a passive order or an aggressive order.

Passive Order

Say that the trader places a buy order for 0.24 asset at 9.80. Their order will be placed on price level 9.80 of the bid side of the order book. The order book now has a configuration where the volume at price level 9.80 is 1.5 + 0.24 = 1.74.

The same Order Book after a passive order for 0.24 asset at a price of 9.80.

Such an order is called a passive order because the trader is in a waiting queue to buy the asset at the price of 9.80. They have to passively wait for another trader to be willing to trade with them.

Earlier, I mentioned that ‘there may be multiple traders waiting for an execution’, and this is now justified. When traders decide to wait for an execution (buy or sell) by sending a passive order, their desired liquidity for the trade is added to the already available liquidity. And it is only executed once all prior liquidity in the queue has been consumed.

When sending a passive order, no trade occurs. There is only an order book event with an increase of liquidity, as sketched below.
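To make this concrete, here is a minimal sketch of mine — a toy, not exchange code — of a passive order landing on the book. To keep the toy exact, prices and quantities are in integer hundredths.

use std::collections::BTreeMap;

fn main() {
    // Bid side of the sketched book: price level (in hundredths) -> resting quantity.
    let mut bids: BTreeMap<u64, u64> = BTreeMap::new();
    bids.insert(980, 150); // 1.50 asset resting at 9.80

    // A passive buy order for 0.24 asset at 9.80 only adds liquidity to the level.
    *bids.entry(980).or_insert(0) += 24;

    // The level now shows 1.74 asset, and no trade message was generated.
    assert_eq!(bids[&980], 174);
}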

Aggressive Order (and Trades!)

Say that the trader now places a buy order for 0.35 asset at 10.20. Since the price corresponds to a rectangle on the ask side, the order is said to be aggressive. The Order Book now looks like this.

The same Order Book after an aggressive buy order for 0.35 asset at a price of 10.20.

What happened?

Well, the trader was willing to pay up to 10.20 for 0.35 asset. As there was 0.25 asset available at 10.10, a trade occurred at 10.10 for 0.25 asset. Since 0.10 asset remained to be executed, and since 0.67 asset was available at 10.20 — the limit of the aggressive trader — there was another trade at 10.20 for 0.10 asset.

Therefore at least two trades have to be reported:

  • a trade for 0.25 asset at 10.10, and
  • a trade for 0.10 asset at 10.20.

I used the words at least because there will actually be as many trades as there are passive participants providing the liquidity, e.g. if 3 people were waiting at ask price level 10.10 for an execution, then three trades would be recorded. And the same goes for the 0.10 asset consumed on price level 10.20: if more than one participant was waiting, then there will be as many trades as there were executed providers at that price level. The toy sketch below illustrates the level-by-level walk (leaving the per-participant splitting aside).
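This is my own illustration, not exchange code, using the same integer-hundredths convention as the earlier sketch: an aggressive buy walks the ask side and emits one trade per consumed level.

use std::collections::BTreeMap;

/// Walk the ask side with an aggressive buy; returns the (price, quantity) trades.
fn aggressive_buy(asks: &mut BTreeMap<u64, u64>, limit: u64, mut qty: u64) -> Vec<(u64, u64)> {
    let mut trades = Vec::new();
    // Ask levels up to the trader's limit, from best (lowest) price upward.
    let levels: Vec<u64> = asks.range(..=limit).map(|(p, _)| *p).collect();
    for price in levels {
        if qty == 0 { break; }
        let available = asks[&price];
        let filled = available.min(qty);
        trades.push((price, filled));
        qty -= filled;
        if filled == available { asks.remove(&price); } // level fully consumed
        else { *asks.get_mut(&price).unwrap() -= filled; }
    }
    trades
}

fn main() {
    // 0.25 asset at 10.10 and 0.67 asset at 10.20, as in the sketch above.
    let mut asks = BTreeMap::from([(1010, 25), (1020, 67)]);
    // Aggressive buy for 0.35 asset with a 10.20 limit.
    let trades = aggressive_buy(&mut asks, 1020, 35);
    assert_eq!(trades, vec![(1010, 25), (1020, 10)]);
    assert_eq!(asks[&1020], 57); // 0.57 asset left at 10.20
}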

Bonus — An aggressive trade is always more costly — We ought to take a bit of time here. In a previous post on optimal execution, I went very technical on how to execute according to industry standards without explaining key concepts. I’ll right that wrong here.

First of all, an aggressive trader is said to cross the spread. The phrase is quite self-explanatory: in their impatience, the trader decides not to wait for an aggressive seller and climbs one or a few levels up from the best bid to meet a patient seller at the best ask or above.

Second of all, for a broker, being able to avoid crossing the spread can make a bit of a difference on the final PnL. Say for example that we take today’s spread on the order book BNB-BTC, where BNB is the exchanged asset and BTC is the currency of transactions.

% date && \
curl -s 'https://api.binance.com/api/v3/depth?symbol=BNBBTC&limit=1' | \
jq -r '"\(.bids[0][0]) \(.asks[0][0])"' | \
awk '{print 20000*($2 - $1)/($1 + $2)}'

Sun Dec 24 06:55:05 JST 2023
1.6146

As the bash snippet above shows, it is at 1.6146 bips just before 7 a.m. JST. For those who don’t know, 1 bip is equal to 0.01%. Why does this number matter? If you trade $1,000,000 aggressively, you would lose about $161.46. Not much, you might say. But let’s scale things up a bit.

Traded Volumes on the Nasdaq.

Above is a summary of the traded volumes on the Nasdaq. On Nasdaq’s order books, European equity traded volume was between $50,000,000,000 and $70,000,000,000 every month. And that’s for Europe only!

So, using that number to get a sense of the order of magnitude: 1.6146 bips on a year of that volume — twelve months at $60,000,000,000 — is about $116,251,200!

Even if we scaled that number down to the traded volume of a big broker’s execution desk, this could amount to millions every year.

In a nutshell, with a back-of-an-envelope cross multiplication, we see that the stakes quickly become high enough to favor passive trading over aggressive trading.
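If you want to redo the cross multiplication yourself, here is a tiny helper of mine; the numbers are the ones discussed above.

/// Cost of crossing the spread for a given traded volume and spread in bips.
fn crossing_cost(volume_usd: f64, spread_bips: f64) -> f64 {
    volume_usd * spread_bips * 1e-4
}

fn main() {
    // $1,000,000 at 1.6146 bips -> about $161.46.
    println!("{:.2}", crossing_cost(1_000_000.0, 1.6146));
    // Twelve months of volume at $60,000,000,000 -> about $116,251,200.
    println!("{:.0}", crossing_cost(60_000_000_000.0 * 12.0, 1.6146));
}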

However, there is another issue that arises from trading passively: you are not guaranteed to be fully executed… But that is for another day…

Two streams of data feeds for complete information

From a conceptual perspective, two continuous auction phenomena have to be distinguished, hence the two different data streams.

(i) — First, the order book is constantly changing as passive traders add or cancel passive orders.

(ii) — Second, aggressive orders trigger one or more trades between two parties.

Now, the difficult part in the context of an MBP feed is that the order book changes are tracked as an aggregated view. And if, for some reason, one wants to reconstruct the continuous auction in its entirety, some logic has to reverse engineer the two streams of data. Which is what we are after here!

But first, let's talk about the data sources.

The Data Sources

There will be three data sources.

  • One HTTP endpoint from which the current order book is retrieved. We will call the output of such a request a Snapshot, simply because it is de facto a snapshot of the current order book’s state.
  • Two websocket endpoints: one to collect order book event updates, and another one to collect trade updates.

Order book snapshots

Try this. It’s better than a thousand words.

% curl 'https://api.binance.com/api/v3/depth?symbol=BNBBTC&limit=10'   

The output of that command is something like this,

{
  "lastUpdateId": 3221409269,
  "bids": [
    ["0.00614900", "32.92600000"],
    ["0.00614800", "22.23300000"],
    ["0.00614700", "13.73200000"],
    ["0.00614600", "7.06500000"],
    ["0.00614500", "27.11700000"],
    ...
  ],
  "asks": [
    ["0.00615000", "5.24200000"],
    ["0.00615100", "28.26300000"],
    ["0.00615200", "13.49900000"],
    ["0.00615300", "14.58300000"],
    ["0.00615400", "12.66600000"],
    ...
  ]
}

which we can plot using a cumulative sum on the volumes, and with a limit parameter in the curl high enough to cover sufficient depth.

What you are looking at is the order book state when the query was made.

You can have a look at the current state of the order book too if you wish. After cloning the GitHub repo, from the root, run these bash commands.

mkdir tmp
curl 'https://api.binance.com/api/v3/depth?symbol=BNBBTC&limit=5000' > tmp/snapshot.json
python scripts/plot-orderbook.py --input-path tmp/snapshot.json --output-path tmp/

Binance Requirements

This is an important point: the snapshot JSON response's lastUpdateId field provides the index with which to synchronize the snapshot with order book event updates. Indeed, as you will see below, there are two fields, U and u, which are respectively the first update ID and the final update ID in a message — a single message from the order book data stream can contain the outcome of multiple events.

These IDs are used for two things:

  • ensuring the correct sequencing of received messages, as some could get lost on the way;
  • properly synchronizing the snapshot by making sure that
    U <= lastUpdateId + 1 && u >= lastUpdateId + 1 (see the sketch after this list).
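Here is a minimal sketch of that synchronization rule, with hypothetical names of my own — the repo's actual checks appear later in orderbook.rs.

/// The first processed depth event must straddle lastUpdateId + 1.
fn is_first_event_in_sync(last_update_id: u64, first_id: u64, final_id: u64) -> bool {
    first_id <= last_update_id + 1 && final_id >= last_update_id + 1
}

fn main() {
    // With a snapshot at lastUpdateId = 3221409269, an event spanning
    // [3221409268, 3221409270] contains update 3221409270 and synchronizes.
    assert!(is_first_event_in_sync(3221409269, 3221409268, 3221409270));
    // An event starting after lastUpdateId + 1 means updates were missed.
    assert!(!is_first_event_in_sync(3221409269, 3221409275, 3221409280));
}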

Order book event updates

You can try connecting to the order book event update stream with websocat and the command below.

websocat wss://stream.binance.com:9443/ws/bnbbtc@depth@100ms

If you enter this into a terminal, you should get an output along the lines of:

% websocat wss://stream.binance.com:9443/ws/bnbbtc@depth@100ms
{"e":"depthUpdate","E":1703308358257,"s":"BNBBTC","U":3221277455,"u":3221277456,"b":[],"a":[["0.00614700","17.63500000"],["0.00615000","12.96600000"]]}
{"e":"depthUpdate","E":1703308358657,"s":"BNBBTC","U":3221277457,"u":3221277458,"b":[["0.00614400","13.66600000"]],"a":[]}
{"e":"depthUpdate","E":1703308358757,"s":"BNBBTC","U":3221277459,"u":3221277460,"b":[["0.00612900","13.83700000"]],"a":[["0.00615700","126.01600000"]]}
{"e":"depthUpdate","E":1703308358957,"s":"BNBBTC","U":3221277461,"u":3221277461,"b":[],"a":[["0.00614600","7.16900000"]]}
{"e":"depthUpdate","E":1703308359057,"s":"BNBBTC","U":3221277462,"u":3221277464,"b":[["0.00614400","18.22200000"],["0.00614200","8.26400000"]],"a":[]}
{"e":"depthUpdate","E":1703308359157,"s":"BNBBTC","U":3221277465,"u":3221277467,"b":[["0.00614300","4.81900000"],["0.00614200","16.35600000"]],"a":[]}
{"e":"depthUpdate","E":1703308359257,"s":"BNBBTC","U":3221277468,"u":3221277468,"b":[["0.00614400","16.95600000"]],"a":[]}
^C

These updates do not carry obvious abnormalities. As the Binance requirements presented earlier dictate, the value of field "U" on one line is exactly the value of field "u" on the previous line incremented by 1. The sketch below checks exactly that.
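A small sketch of that continuity check on the (U, u) pairs captured above — my own illustration, not the repo's code.

/// Each message's "U" must equal the previous message's "u" plus one.
fn is_contiguous(prev_final_id: u64, next_first_id: u64) -> bool {
    next_first_id == prev_final_id + 1
}

fn main() {
    // (U, u) pairs from the first three captured lines.
    let events = [
        (3221277455u64, 3221277456u64),
        (3221277457, 3221277458),
        (3221277459, 3221277460),
    ];
    for pair in events.windows(2) {
        assert!(is_contiguous(pair[0].1, pair[1].0)); // no gap, no duplicate
    }
}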

Other fields of interest will be "E", "b" and "a": they are respectively a timestamp, the updated levels on the bid side of the book and the updated levels on the ask side of the book.

Trade event updates

The websocket endpoint is very similar to the order book event update one. Only the very end of the URL changes, to a rather expected "trade".

websocat wss://stream.binance.com:9443/ws/bnbbtc@trade

It outputs the following messages.

{"e":"trade","E":1703328396500,"s":"BNBBTC","t":230994585,"p":"0.00614300","q":"0.09000000","b":1779378259,"a":1779387173,"T":1703328396499,"m":true,"M":true}
{"e":"trade","E":1703328408630,"s":"BNBBTC","t":230994586,"p":"0.00614300","q":"0.49000000","b":1779378259,"a":1779387210,"T":1703328408607,"m":true,"M":true}
{"e":"trade","E":1703328408630,"s":"BNBBTC","t":230994587,"p":"0.00614300","q":"2.00000000","b":1779381485,"a":1779387210,"T":1703328408607,"m":true,"M":true}
{"e":"trade","E":1703328408630,"s":"BNBBTC","t":230994588,"p":"0.00614300","q":"0.69800000","b":1779385186,"a":1779387210,"T":1703328408607,"m":true,"M":true}
{"e":"trade","E":1703328408632,"s":"BNBBTC","t":230994589,"p":"0.00614300","q":"1.30200000","b":1779385186,"a":1779387211,"T":1703328408608,"m":true,"M":true}
{"e":"trade","E":1703328408632,"s":"BNBBTC","t":230994590,"p":"0.00614300","q":"1.10100000","b":1779385188,"a":1779387211,"T":1703328408608,"m":true,"M":true}
{"e":"trade","E":1703328408633,"s":"BNBBTC","t":230994591,"p":"0.00614300","q":"0.89900000","b":1779385188,"a":1779387212,"T":1703328408608,"m":true,"M":true}
{"e":"trade","E":1703328408633,"s":"BNBBTC","t":230994592,"p":"0.00614300","q":"1.00000000","b":1779386156,"a":1779387212,"T":1703328408608,"m":true,"M":true}
{"e":"trade","E":1703328408633,"s":"BNBBTC","t":230994593,"p":"0.00614300","q":"2.00000000","b":1779387169,"a":1779387212,"T":1703328408608,"m":true,"M":true}
{"e":"trade","E":1703328408653,"s":"BNBBTC","t":230994594,"p":"0.00614200","q":"0.83500000","b":1779377545,"a":1779387218,"T":1703328408632,"m":true,"M":true}
{"e":"trade","E":1703328408702,"s":"BNBBTC","t":230994595,"p":"0.00614300","q":"0.65300000","b":1779387236,"a":1779387219,"T":1703328408689,"m":false,"M":true}
{"e":"trade","E":1703328408735,"s":"BNBBTC","t":230994596,"p":"0.00614300","q":"0.63000000","b":1779387246,"a":1779387219,"T":1703328408718,"m":false,"M":true}
{"e":"trade","E":1703328408735,"s":"BNBBTC","t":230994597,"p":"0.00614300","q":"0.02300000","b":1779387246,"a":1779387221,"T":1703328408718,"m":false,"M":true}
^C

The fields of interest will mainly be "E", "p" and "q": the timestamp of the trade event, the price and the quantity of the trade.

Matching any of the trades above is tantamount to looking at the variations in the order book levels and trying to associate an order book event with each trade whenever there is a reasonable match in the timestamp, quantity and price numbers.
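In code, that naive criterion boils down to a predicate like the sketch below — the types are simplified here, and the real implementation further down uses rust_decimal rather than floats.

/// Naive matching criterion: same price, same absolute volume variation,
/// and the order book event within 100 ms of the trade.
fn is_plausible_match(
    trade_price: f64, trade_qty: f64, trade_time_ms: u64,
    level_price: f64, level_volume_drop: f64, event_time_ms: u64,
) -> bool {
    trade_price == level_price
        && trade_qty == level_volume_drop
        && event_time_ms.abs_diff(trade_time_ms) <= 100
}

fn main() {
    // A 0.25 drop at price 10.10, 40 ms after the trade: plausible match.
    assert!(is_plausible_match(10.10, 0.25, 1_000, 10.10, 0.25, 1_040));
}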

Data collection

For the data collection, nothing really fancy. To avoid having to start and stop multiple processes at the same time, I leveraged some asynchronous coding using Python's asyncio module.

Here is the bash command to start a capture.

python scripts/capture-depth-trades.py --symbol BTCUSDT --save-folder tmp

In the script, the order book snapshot is requested 1 second after the start of the websockets’ collections.

async def fetch_and_save_snapshot(symbol, folder):
    await asyncio.sleep(1)  # Delay for 1 second
    #
    # Rest of the code in the full snippet above
    #

This is to ensure the Binance requirements presented above are respected when reconstructing the order book.

Empirical Investigation on a small sample

Before throwing myself into the implementation of an algorithm, I’ll investigate a small sample to see how I could reconcile the streams, were I to do it with pen and paper.

Well… Almost as if I were to do it with pen and paper. Because I did set up a small pygame interface. Because really, looking at lots of gibberish lines can be cumbersome.

The data

The samples are located at the path data/capture/btcusdt/1702798595534677/.

% ls data/capture/btcusdt/1702798595534677/                       
ask_trade.txt bid_trade.txt depth.txt snapshot.txt

The bid_trade.txt and the ask_trade.txt files contain respectively 29 and 16 trades, for a total of 45 trades. That’s not a lot! And that should not surprise you, as the capture ran from GMT Sunday, December 17, 2023 7:36:35.858 AM to GMT Sunday, December 17, 2023 7:36:44.862 AM, so exactly 9.004 seconds.

The snapshot has the event ID 41239690701, implying that the first depth update to be considered is

{"e":"depthUpdate","E":1702798596858,"s":"BTCUSDT","U":41239690697,"u":41239690709, ...}

which in turn implies that all trades with a timestamp earlier than 1702798596858 will be discarded, because of the system implementation. This means that the trade with the tid (trade id) 3324705797 will be discarded, as it happened a bit too early at 1702798596164 — 694 milliseconds too soon.

PyGame script

I am not going to present the script; it would be way too long and it would add even more length to this already very long post! If you wish to check the code (it is a little messy…), please look at the file scripts/message-viewer.py.

But here is the final interface (and you can also easily launch it with the command

python scripts/message-viewer.py

too!).

User interface of the pygame tool

As you can see on the figure above, the interface is rather simple:

  • it shows the timestamp of the current order book event update — i.e. the timestamp of the order book;
  • the trades appear with their timestamps under the corresponding side;
  • volume variations between order book events are shown between brackets to the right of the corresponding side;
  • and finally, a cursor is there to travel across time — forward to the right and backward to the left.

The trades shown are those within a range of -100 ms to +100 ms around the order book event timestamp.

Why this number? Because the websocket was opened with a request for 100 ms updates. This means that order book updates are batched in 100 ms intervals. An order book event's corresponding trade, if it has one, will necessarily fall within that interval… unless there are serious delays on the network, which I do not consider for now.

Matched trades

On the figure above, you should also see, in green, a double-sided arrow that signals a matched trade. It is what we are looking for.

Why is it matched? Because the volumes are identical. And this is what we hope for: negative order book volume variations that are equal in absolute value to the trade quantity.

Edge cases

Here we are going to look at edge cases that will break the logic.

Multiple trade reports for the same order book event — A first issue occurs when multiple trades, adds or cancels occur within 100 ms. Usually these trades have the same timestamp and correspond to bigger trades split into smaller ones to avoid missing out on liquidity. This breaks the naive matching for two reasons:

(i) — Because the order book event update will carry newly added volume that does not correspond to the trade volume.

(ii) — Because the best bid might have changed during the trade.

Iceberg orders — This issue arises when some liquidity is not shown on the order book. Binance authorizes those orders and they can be quite deceptive.

Example 1 — At timestamp 1702798603963, for example, there is a stack of trades that consumes the first level entirely, the second level and the third level, and at the second level there is a mismatch of quantities because of a hidden iceberg order!

Stack of trade that consumes multiple levels and an iceberg order.

Example 2 — At timestamp 1702798602905, there is a stack of 3 trades that consume a total of 0.28061 asset. However, the level has decreased by 0.13334 asset. This mismatch indicates that there possibly was a cancellation. If you check the trade IDs in the raw data, you’ll see that the sequence is accurate.

Therefore a cancellation is the only sound conclusion one can draw. Of course, there could also be add events that we are not aware of, but we shall keep the dominant interpretation.

Rust implementation of the trade-book event matcher

The code is organized across four files.

% ls src 
messages.rs orderbook.rs shougoutaku.rs trade_matcher.rs

messages.rs contains the serialization/deserialization logic.

orderbook.rs contains the logic for maintaining the flow of information from the order book event update websocket as well as the matching logic — yes… that’s two responsibilities, and I would probably separate them if I were to do it again.

shougoutaku.rs is the main file with the main loop. It also contains the construction logic of the trade candidates.

trade_matcher.rs maintains the flow of trade updates from the websocket and initiates the matching of trades with order book event updates when there are any. It also maintains a clean record of the overall results.

The general idea is to loop through the depth file and look at currently loaded trades to see if a match is possible.

Serialization/Deserialization of messages and snapshots

With regard to serialization, I used Rust’s serde, which is very convenient. However, one should keep in mind that an external library is used to parse the JSON. There is a possibility that custom JSON parsing would be beneficial on such a critical part. Nonetheless, for our object of concern, serde will do the job just as well.

// src/messages.rs

use rust_decimal::Decimal;
use serde::{Deserialize, Serialize};
use serde_json::Value;

/// Custom deserialization function for Decimal
/// Converts a string to a Decimal, returning an error if the string is not a valid representation.
fn deserialize_decimal<'de, D>(deserializer: D) -> Result<Decimal, D::Error>
where
    D: serde::Deserializer<'de>,
{
    let s: String = Deserialize::deserialize(deserializer)?;
    s.parse::<Decimal>().map_err(serde::de::Error::custom)
}

/// Custom serialization function for Decimal
/// Converts a Decimal to a string representation.
fn serialize_decimal<S>(value: &Decimal, serializer: S) -> Result<S::Ok, S::Error>
where
    S: serde::Serializer,
{
    let s = value.to_string();
    serializer.serialize_str(&s)
}

/// Custom deserialization function for trade_id
/// Supports deserializing a trade_id from either a string or a number in JSON.
fn deserialize_trade_id<'de, D>(deserializer: D) -> Result<String, D::Error>
where
    D: serde::Deserializer<'de>,
{
    let value = Value::deserialize(deserializer)?;
    match value {
        Value::String(s) => Ok(s),
        Value::Number(num) => Ok(num.to_string()),
        other => Err(serde::de::Error::custom(format!(
            "Invalid type for trade_id: expected string or number, found {:?}",
            other
        ))),
    }
}

/// Represents the JSON message format for depth updates via WebSockets.
#[derive(Debug, Serialize, Deserialize, Default)]
pub struct DepthUpdate {
    #[serde(rename = "e")]
    pub event_type: String,
    #[serde(rename = "E")]
    pub event_time: u64,
    #[serde(rename = "s")]
    pub symbol: String,
    #[serde(rename = "U")]
    pub first_update_id_in_event: u64,
    #[serde(rename = "u")]
    pub final_update_id_in_event: u64,
    /// List of bid values to update; each bid is represented as a pair of Decimals [price, quantity].
    #[serde(rename = "b")]
    pub bids_to_update: Vec<(Decimal, Decimal)>,
    /// List of ask values to update; each ask is represented as a pair of Decimals [price, quantity].
    #[serde(rename = "a")]
    pub asks_to_update: Vec<(Decimal, Decimal)>,
}

/// Represents the JSON message format for an order book snapshot update via HTTP.
#[derive(Debug, Deserialize, Serialize, Default)]
pub struct SnapShotUpdate {
    #[serde(rename = "lastUpdateId")]
    pub last_update_id: u64,
    /// List of bids; each bid is represented as a tuple of Decimals (price, quantity).
    pub bids: Vec<(Decimal, Decimal)>,
    /// List of asks; each ask is represented as a tuple of Decimals (price, quantity).
    pub asks: Vec<(Decimal, Decimal)>,
}

/// Represents the JSON message format for trade updates via WebSockets.
#[derive(Debug, Serialize, Deserialize, Default)]
pub struct TradeUpdate {
    #[serde(rename = "e")]
    pub event_type: String,
    #[serde(rename = "E")]
    pub event_time: u64,
    #[serde(rename = "s")]
    pub symbol: String,
    #[serde(rename = "t", deserialize_with = "deserialize_trade_id")]
    pub trade_id: String,
    #[serde(rename = "p", deserialize_with = "deserialize_decimal", serialize_with = "serialize_decimal")]
    pub price: Decimal,
    #[serde(rename = "q", deserialize_with = "deserialize_decimal", serialize_with = "serialize_decimal")]
    pub quantity: Decimal,
    #[serde(rename = "b")]
    pub buyer_order_id: u64,
    #[serde(rename = "a")]
    pub seller_order_id: u64,
    #[serde(rename = "T")]
    pub trade_time: u64,
    #[serde(rename = "m")]
    pub is_market_maker: bool,
}

The main point here is that I had to write specialized serialization/deserialization methods for the Decimal type. It is very similar to Python's decimal.Decimal object and avoids float rounding errors.
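As a usage sketch — assuming the DepthUpdate struct above and serde_json are in scope — one captured depth line deserializes like this:

fn main() {
    // One line from the bnbbtc@depth@100ms capture shown earlier.
    let line = r#"{"e":"depthUpdate","E":1703308358957,"s":"BNBBTC","U":3221277461,"u":3221277461,"b":[],"a":[["0.00614600","7.16900000"]]}"#;
    let update: DepthUpdate = serde_json::from_str(line).expect("valid depth message");
    assert_eq!(update.final_update_id_in_event, 3221277461);
    assert_eq!(update.asks_to_update.len(), 1);
}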

Order Book

Snapshot synchronization — The order book is a Rust struct, as shown below. Much of the code is elided for now, as I don’t need to present it just yet.

However, if you take the time to look at it, you will notice the use of a binary tree map (BTreeMap) to store both sides’ levels, as well as two update methods.

Initially I went for a classic hash map — Python reflexes — but my specialized GPT advised that structure instead.

// src/orderbook.rs

// Order Book struct
pub struct OrderBook {
    last_update_id: u64,
    first_update_id_in_event: u64,
    final_update_id_in_event: u64,
    bids: BTreeMap<Decimal, Decimal>,
    asks: BTreeMap<Decimal, Decimal>,
    // ...
    // Some variables explained later
    // ...
}

impl OrderBook {
    pub fn new() -> Self {
        // Nothing critical
    }

    // Getter methods for best bid and ask updated flags
    pub fn is_best_bid_updated(&self) -> bool {
        // Simple getter
    }

    pub fn is_best_ask_updated(&self) -> bool {
        // Simple getter
    }

    pub fn update(&mut self, update: DepthUpdate) {
        // Skip if the orderbook has not been loaded
        if self.last_update_id == 0 { return; }
        // Skip if final_update_id_in_event <= last_update_id
        if update.final_update_id_in_event <= self.last_update_id { return; }
        if self.first_update_id_in_event == 0 && update.first_update_id_in_event > self.last_update_id + 1 {
            warn!("Snapshot is a little too old - first_update_id_in_event = {:?} last_update_id = {:?}", update.first_update_id_in_event, self.last_update_id);
        }
        if self.first_update_id_in_event != 0 && update.first_update_id_in_event != self.final_update_id_in_event + 1 {
            warn!("Update sequence from websocket is mechakucha - update.first_update_id_in_event = {:?} self.final_update_id_in_event = {:?}", update.first_update_id_in_event, self.final_update_id_in_event);
        }
        // ...
        // Some logic explained later
        // ...
    }

    pub fn update_with_snapshot(&mut self, update: SnapShotUpdate) {
        self.last_update_id = update.last_update_id;
        // Nothing critical
    }

    pub fn match_and_process_trade(&mut self, trade: &TradeUpdate, trade_type: TradeType) -> u64 {
        // ...
        // Some logic explained later
        // ...
    }

    pub fn print_orderbook(&self, n: usize, title: &str) {
        // Nothing critical
    }
}

In the update method, the order book digests a message from the endpoint bnbbtc@depth@100ms. With the statement

if self.last_update_id == 0 { return; }

it skips any line as long as the snapshot has not been loaded yet — self.last_update_id is still at its initial value of 0.

To ensure that the sequencing requirement is enforced, we check that the event IDs are in the right order, as required by Binance.

if self.first_update_id_in_event != 0 && update.first_update_id_in_event != self.final_update_id_in_event + 1 {
    warn!("Update sequence from websocket is mechakucha - update.first_update_id_in_event = {:?} self.final_update_id_in_event = {:?}", update.first_update_id_in_event, self.final_update_id_in_event);
}

For this last filter, I chose to simply spit some warnings into the logs rather than halt the execution, because I do not plan on having redundancies in my small program, nor is it critical to have the most accurate description of events. Depending on my usage, I might add a counter for these errors and exit the program if there are too many. But at this stage, I prefer not to worry too much about that.
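If I ever do, such an error budget could be a handful of lines. A hypothetical sketch, not code from the repo:

/// Hypothetical sequencing-error budget; not part of the repo.
struct SequenceGuard {
    errors: u32,
    max_errors: u32,
}

impl SequenceGuard {
    fn record_gap(&mut self) {
        self.errors += 1;
        if self.errors > self.max_errors {
            // In a real system: flush the book, re-request a snapshot, or exit.
            panic!("too many sequencing errors: {}", self.errors);
        }
    }
}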

You can skip this on a first read — Initially, I developed this struct for real-time purposes with frequent snapshot requests to bootstrap information. Therefore a couple of other filters were implemented.

  • For later snapshots that would be requested after the first one, in a real-time continuous setup, all order book updates that are somehow older than the snapshot have to be dropped, as they would mess up the information by applying outdated messages.
    Therefore I use this if statement to guard against that.
if update.final_update_id_in_event <= self.last_update_id { return; }
  • Similarly, if for some reason the snapshot message arrives a little too late, this condition at least shoots a warning.
if self.first_update_id_in_event == 0 && update.first_update_id_in_event > self.last_update_id + 1 {
    warn!("Snapshot is a little too old - first_update_id_in_event = {:?} last_update_id = {:?}", update.first_update_id_in_event, self.last_update_id);
}

Computing the volume variations — If you recall, to match a trade we have to detect and compute the volume variations on the order book levels, as shown below.

To do so, whenever the OrderBook struct receives a depth update, we immediately compute the volume variations. And we have to do so for all the levels that could potentially be the best price, i.e. whenever the volume at the current best price reaches zero, we compute the next level's volume variation — level_delta in the snippet below — and we iterate over as many levels as necessary.

pub fn update(&mut self, update: DepthUpdate) {
    // ...
    // Checks presented earlier
    // ...

    // Store the current best bid and ask prices
    let current_best_bid = self.bids.keys().next_back().cloned();
    let current_best_ask = self.asks.keys().next().cloned();
    // Reset flags
    self.best_bid_updated = false;
    self.best_ask_updated = false;
    // Reset LevelDeltas before each update
    self.best_bid_deltas.clear();
    self.best_ask_deltas.clear();
    // Update bids
    let mut add_next_bid_level_delta: bool = false;
    for (price_level, quantity) in update.bids_to_update {
        // Check if the best bid price is updated
        if Some(&price_level) == current_best_bid.as_ref() || add_next_bid_level_delta {
            if let Some(current_volume) = self.bids.get(&price_level) {
                self.best_bid_updated = true;
                let volume_delta = *current_volume - quantity;
                if quantity.is_zero() {
                    add_next_bid_level_delta = true;
                } else {
                    add_next_bid_level_delta = false;
                }
                let level_delta = LevelDelta::new(price_level, volume_delta, update.event_time);
                self.best_bid_deltas.push(level_delta);
            }
        }
        if quantity.is_zero() { self.bids.remove(&price_level); }
        else { self.bids.insert(price_level, quantity); }
    }

    // Update asks
    // ...
    // Same as bids
    // ...
}

Constructing the trade candidates

The level volume variations are now computed and available, so we turn to the trades. If you recall, trades can arrive as a stacked sequence, usually sharing the same timestamp. And, if you keep recalling, the selected level variation may not match all the trades of that timestamp and price level — like in example 2 above, for price level 41926.85, where only one trade could actually be matched. Sometimes that is because of iceberg orders, some other times because the level variation includes an additional remove or add event.

In any case, it is wise to create, from the stacked sequence, a set of potential candidates using some — ideally all — of the possible subsequences of trades.

For example, say we have four trades of ids tid1,tid2,tid3,tid4. In such a case we would want to add the trades

  • tid1 with volume V1
  • tid2 with volume V2
  • tid3 with volume V3
  • tid4 with volume V4
  • tid1-tid2 with volume V1+V2
  • tid1-tid3 with volume V1+V3
  • tid1-tid4 with volume V1+V4
  • tid2-tid3 with volume V2+V3
  • tid2-tid4 with volume V2+V4
  • tid3-tid4 with volume V3+V4
  • tid1-tid2-tid3 with volume V1+V2+V3
  • tid1-tid2-tid4 with volume V1+V2+V4
  • tid1-tid3-tid4 with volume V1+V3+V4
  • tid2-tid3-tid4 with volume V2+V3+V4
  • tid1-tid2-tid3-tid4 with volume V1+V2+V3+V4

Now… This is the first version of my trade matcher and, to be frank, I did not want to add too much complexity, so I stayed simple and only added these trades.

  • tid1 with volume V1
  • tid1-tid2 with volume V1+V2
  • tid1-tid2-tid3 with volume V1+V2+V3
  • tid1-tid2-tid3-tid4 with volume V1+V2+V3+V4

It provides better results but it is not perfect. To illustrate its imperfection, check timestamp 1702798604062 in the message-viewer.py interface: you will see that the trade stack below the bid side matches the volume variation if you remove the top trade of the stack…

But anyway, I wanted to stay simple, so here it is in Rust. The method is located in shougoutaku.rs.

fn next_trades(
    trade_reader: &mut Reader,
    prev_trade_info: &mut PreviousTradeInfo,
    trade_matcher: &mut TradeMatcher,
    trade_type: &str,
) -> Result<(), Box<dyn std::error::Error>> {
    if let Some(Ok(line)) = trade_reader.next() {
        let mut trade_update: TradeUpdate = serde_json::from_str(&line)?;
        // Compare current and previous values
        if trade_update.event_time == prev_trade_info.prev_event_time && trade_update.price == prev_trade_info.prev_price {
            trade_update.quantity += prev_trade_info.prev_quantity;
            trade_update.trade_id += "-";
            trade_update.trade_id += &prev_trade_info.prev_trade_id;
        }
        // Update the variables with current values for next iteration
        prev_trade_info.prev_event_time = trade_update.event_time;
        prev_trade_info.prev_price = trade_update.price.clone();
        prev_trade_info.prev_quantity = trade_update.quantity.clone();
        prev_trade_info.prev_trade_id = trade_update.trade_id.clone();
        trade_matcher.add_trade(trade_update);
        return Ok(());
    }
    Err(Box::new(TradeReadError::new("No more trades to read")))
}

Matching trade candidates with volume variations

Finally, we arrive at the most crucial and desired aspect of the algorithm: the matching logic!

It is actually two-fold.
(i) — First, a method match_trades in the TradeMatcher struct initiates the matching process. When it gets a result, it keeps a record of it.

(ii) — Second, a method match_and_process_trade in the OrderBook struct compares a subset of known trades with the most recent order book update. When it finds a match, it returns the timestamp of the corresponding order book event update. If the trade is outdated — over 100 ms old — then 1 is returned.

pub fn match_trades(&mut self, orderbook: &mut OrderBook) -> Vec<u64> {
    let mut indices_to_remove = Vec::new();
    let mut event_times = Vec::<u64>::new();
    let mut trades_to_insert = Vec::new();
    for (index, trade) in self.trade_queue.iter().enumerate() {
        let te = match self.trade_type {
            TradeType::Bid => orderbook.match_and_process_trade(trade, TradeType::Bid),
            TradeType::Ask => orderbook.match_and_process_trade(trade, TradeType::Ask),
        };

        if te == 1 {
            info!("{:?} - Dropped Trade ID: {}", self.trade_type, trade.trade_id);
            indices_to_remove.push(index);
            trades_to_insert.push((trade.trade_id.clone(), trade.event_time, te));
        } else if te > 1 {
            info!("{:?} - Matched Trade ID: {}, Event Time: {}", self.trade_type, trade.trade_id, te);
            indices_to_remove.push(index);
            event_times.push(te);
            trades_to_insert.push((trade.trade_id.clone(), trade.event_time, te));
        }
    }
    // Remove items in reverse order
    for index in indices_to_remove.into_iter().rev() {
        self.trade_queue.remove(index);
    }
    // Now that we are no longer borrowing `self.trade_queue`, insert the trades
    for (trade_id, trade_event_time, event_time) in trades_to_insert {
        self.insert_trade_ids(&trade_id, trade_event_time, event_time);
    }
    event_times
}

pub fn match_and_process_trade(&mut self, trade: &TradeUpdate, trade_type: TradeType) -> u64 {
    let level_deltas = match trade_type {
        TradeType::Bid => &mut self.best_bid_deltas,
        TradeType::Ask => &mut self.best_ask_deltas,
    };
    for level_delta in level_deltas.iter_mut() {
        let trade_event_time = trade.event_time;
        if level_delta.event_time > (trade_event_time + 100) {
            return 1;
        }
        if level_delta.volume < Decimal::new(0, 0) { continue; }
        if trade.price == level_delta.price && trade.quantity == level_delta.volume {
            // Here, you'd perform the matching logic and return the trade_id if matched
            level_delta.volume -= trade.quantity;
            return level_delta.event_time;
        }
    }
    0
}

Main loop

The main loop is actually quite straightforward. Lines in the depth file are read one by one and trade candidates are processed. As trades are matched or dropped, further trades are loaded. The loop is implemented as follows.

loop {
    let depth_line = depth_reader.next();

    // Check if all files have reached EOF
    if depth_line.is_none() {
        break;
    }

    if next_ask_trade { // When trades are matched or dropped.
        next_ask_trade = false;
        loop {
            match next_trades(&mut ask_trade_reader, &mut prev_ask_info, &mut ask_matcher, "Ask") {
                Ok(()) => { /* Normal processing */ },
                Err(e) => {
                    error!("Error reading trade: {}", e);
                    break; // Break out of the loop
                }
            }
            if ask_matcher.number_of_timestamps() == 5 {
                break;
            }
        }
    }

    // ...
    // Same for bid trades
    // ...

    if let Some(Ok(line)) = depth_line {
        let depth_update: DepthUpdate = serde_json::from_str(&line)?;
        debug!("{:?}", depth_update);
        orderbook.update(depth_update);
        if orderbook.is_best_ask_updated() {
            let event_times = ask_matcher.match_trades(&mut orderbook);
            if event_times.len() != 0 {
                next_ask_trade = true; // Matched or dropped trades.
            }
        }

        // ...
        // Same for bid trades
        // ...
    }
}

Purging at the end of execution

When reaching the end of the depth file, there may be trade updates left with no possible depth updates to match against, because of how the capture ends. These trades are tagged as purged with a value of 2 — as a reminder, for a matched trade the tag is the timestamp of the matched order book event, while for a dropped trade the tag is 1.
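As an aside, these magic tag values could be made explicit with an enum. A hypothetical refactoring sketch, not the repo's code:

/// The u64 tags used by the matcher, spelled out.
enum MatchOutcome {
    /// Matched: carries the timestamp of the corresponding order book event.
    Matched(u64),
    /// Dropped: no match found within the 100 ms window (tag value 1).
    Dropped,
    /// Purged: trade left over at the end of the depth file (tag value 2).
    Purged,
}

fn tag(outcome: &MatchOutcome) -> u64 {
    match outcome {
        MatchOutcome::Matched(event_time) => *event_time,
        MatchOutcome::Dropped => 1,
        MatchOutcome::Purged => 2,
    }
}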

Results

The algorithm is tested on two data sets: the very small sample that we investigated earlier, and a bigger one spanning from GMT Tuesday, December 19, 2023 4:23:08.999 PM to GMT Tuesday, December 19, 2023 5:07:30.434 PM.

Small sample — The command is rather long but quite self-explanatory. After providing cargo with the necessary arguments to compile and run, you need to add an extra -- after which the program's arguments may be provided.

RUST_LOG=info cargo run --bin shougoutaku -- --snapshot data/capture/btcusdt/1702798595534677/snapshot.txt --depth data/capture/btcusdt/1702798595534677/depth.txt --ask_trade data/capture/btcusdt/1702798595534677/ask_trade.txt --bid_trade data/capture/btcusdt/1702798595534677/bid_trade.txt

The output is something like

[2023-12-28T08:04:02Z INFO  shougoutaku::trade_matcher] Ask - Matching output
Trade ID Trade Time Event Time
3324705798 1702798597052 1702798597058
3324705799 1702798597349 1702798597358
3324705800 1702798597387 1702798597458
3324705801 1702798598334 1702798598359
3324705803 1702798598719 1702798598759
3324705804 1702798598927 1702798598959
3324705806 1702798600587 1 -> additional add
3324705808 1702798601388 1702798601460
3324705809 1702798601549 1702798601560
3324705811 1702798601752 1702798601761
3324705814 1702798602154 1702798602161
3324705815 1702798602319 1702798602361
3324705816 1702798602905 1 -> additional cancel
3324705817 1702798602905 1 -> additional cancel
3324705818 1702798602905 1 -> additional cancel
3324705820 1702798603265 1702798603361
[2023-12-28T08:04:02Z INFO shougoutaku::trade_matcher] Bid - Matching output
Trade ID Trade Time Event Time
3324705797 1702798596164 1 -> snapshot delay
3324705802 1702798598588 1702798598659
3324705805 1702798599555 1702798599559
3324705807 1702798600829 1702798600860
3324705810 1702798601655 1702798601660
3324705812 1702798601955 1702798601961
3324705813 1702798602134 1702798602161
3324705819 1702798603083 1702798603161
3324705821 1702798603963 1702798604062
3324705822 1702798603963 1702798604062
3324705823 1702798603963 1702798604062
3324705824 1702798603963 1702798604062
3324705825 1702798603963 1702798604062
3324705826 1702798603963 1702798604062
3324705827 1702798603963 1702798604062
3324705828 1702798603963 1 -> iceberg order
3324705829 1702798603963 1 -> iceberg order
3324705830 1702798603963 1 -> iceberg order
3324705831 1702798603963 1 -> iceberg order
3324705832 1702798603963 1 -> iceberg order
3324705833 1702798603963 1 -> iceberg order
3324705834 1702798603963 1 -> iceberg order
3324705835 1702798603963 1 -> iceberg order
3324705836 1702798603963 1 -> iceberg order
3324705837 1702798603963 1702798604062
3324705838 1702798604087 1 -> iceberg order
3324705839 1702798604087 1 -> logic limitation
3324705840 1702798604087 1 -> logic limitation
3324705841 1702798604087 1 -> logic limitation

If you wish to check the explanation of a particular trade, copy its timestamp (ctrl+c) and search for it (ctrl+f) to be pointed to the right paragraph below.

As you can see, 17 trades out of a total of 45 have been dropped. The exclusion reason is shown at the end of the line with the format
-> <reason for being dropped>

Let’s have a look in more detail using the message-viewer.py script.

Ask trade of timestamp 1702798600587 — As shown in the image sequence below, there actually are multiple order book events that could potentially be matched with the trade. None has a volume variation that exactly matches the trade quantity.

First match possibility for ask trade of timestamp 1702798600587
Second match possibility for ask trade of timestamp 1702798600587
Third match possibility for ask trade of timestamp 1702798600587

The second event seems to be the most likely one, and its absolute volume variation is smaller than expected; therefore there probably was at least one additional add event on that level.

Ask trades of timestamp 1702798602905 — These trades were dealt with in example 2 above, in the paragraph on edge cases.

Bid trade of timestamp 1702798596164 — In this particular case, the trade could not be matched because its timestamp is earlier than the snapshot event. And if you recall, I coded the order book to wait for a snapshot before it starts processing order book event updates.

Bid trades of timestamp 1702798603963 — These trades were dealt with in example 1 above, in the paragraph on edge cases.

Bid trades of timestamp 1702798604087 — These trades are interesting as they show a clear shortcoming of the implemented logic. If you recall, I mentioned that, for simplicity's sake, I reduced the set of trade candidates to a very simple subsequence of the trade stack — a pyramid-like construct. I also said that it could produce matching weaknesses. And here is a perfect example.

Basically, if you sum only the last three quantities you find

>>> 0.00681000+0.00681000+0.02730000
0.04092

which is a perfect match with the level volume variation.

The logic should thus build other trade candidate combinations. A brute-force approach would be to generate all the possible ones, while a more heuristic one would require the quant developer to add combinations as they occur. Here, they would have to include pyramidal sequences starting from the last trade of the stack.
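For stacks this small, the brute-force enumeration is cheap. Here is a hypothetical sketch of mine using a bitmask over the stack; the 0.05 top-of-stack quantity is a placeholder, the other three quantities are the ones summed above, all expressed in units of 1e-8 asset to keep the toy exact.

/// Return the index sets of every non-empty subset of `quantities`
/// whose sum equals the observed level variation. Fine for small stacks.
fn matching_subsets(quantities: &[u64], target: u64) -> Vec<Vec<usize>> {
    let mut matches = Vec::new();
    for mask in 1u32..(1 << quantities.len()) {
        let sum: u64 = (0..quantities.len())
            .filter(|i| mask & (1 << i) != 0)
            .map(|i| quantities[i])
            .sum();
        if sum == target {
            matches.push((0..quantities.len()).filter(|i| mask & (1 << *i) != 0).collect());
        }
    }
    matches
}

fn main() {
    let qty = [5_000_000u64, 681_000, 681_000, 2_730_000]; // first one is a placeholder
    let target = 4_092_000; // the observed 0.04092 level variation
    assert_eq!(matching_subsets(&qty, target), vec![vec![1, 2, 3]]);
}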

Big sample — For the big sample, the command is the following.

RUST_LOG=info cargo run --release --bin shougoutaku -- --snapshot data/capture/btcusdt/1702970588517503/snapshot.txt --depth data/capture/btcusdt/1702970588517503/depth.txt --ask_trade data/capture/btcusdt/1702970588517503/ask_trade.txt --bid_trade data/capture/btcusdt/1702970588517503/bid_trade.txt > tmp/debug.log 2>&1

You’ll notice that this time I have used the --release flag to have an optimized compilation.

(i) — 9,311 ask trades were dropped out of 19,416 ask trades.
(ii) — 14,548 bid trades were dropped out of a total of 23,016 bid trades.
(iii) — 23,859 trades on both sides were dropped out of a total of 42,432 trades.

Clearly the program has shortcomings as about 56% of trades were either dropped or purged.

Now, that is if you see the glass as half empty. If you see it as half full — and although it may sound very obvious to say — almost 50% of trades were matched with an order book event update that is within 100 ms and shows the exact same volume variation as the trade quantity!

I think this is a very encouraging benchmark for such a naive system — which, if you ever wanted to put it into production, would still present a few deployment hurdles you would entrust to your DevOps team — or, if you are as curious as I am, to yourself!

I believe this first version to be an acceptable step towards production, so I will stop this already very long post here.

Wrapping Up

So… In a nutshell, with this study and naive algorithm:

A — We can match ~1/2 of the trades with order book events.

B — We gained a better understanding of the data provided by Binance through the websocket streams.

C — The road ahead would consist in investigating further the large data capture to improve the quality of the event matching algorithm.

D — I chose the pair BTC-USDT, which is very active, or liquid in the jargon. Possibly, and I have not checked, the algorithm could behave a lot better on less liquid pairs where event collisions are less likely.

E — People discovering microstructure today are very lucky; they have access to resources I would never have dreamt of when I first got acquainted with this field of finance. Ladies, gents, if you are reading this, I envy you. :)

Get your hands dirty — If you are really motivated and want to give some of your personal education time to the wonders of Microstructure, you could investigate for yourself the shortcomings on the large data capture.
If you have any questions, you can ping me in the comments below or on GitHub. I’ll happily assist you.


Gil-Arnaud Coche

What am I? I am not sure... An engineer? A physicist? A mathematician? What I know is that I terribly enjoy modeling stuff.