Understanding the microstructure of financial markets requires processing a vast amount of data on individual trades, and sometimes on multiple levels of quotes as well. This requires computing resources that are not easily available to financial academics and regulators. Fortunately, data-intensive scientific research has developed a series of tools and techniques for working with large amounts of data. In this work, we demonstrate that these techniques are effective for market data analysis by computing an early warning indicator called Volume-synchronized Probability of Informed trading (VPIN) on a massive set of futures trading records. The test data contains five and a half years' worth of trading data for about 100 of the most liquid futures contracts, includes about 3 billion trades, and takes 140 GB as text files. By (1) using a more efficient file format for storing the trading records, (2) using more effective data structures and algorithms, and (3) parallelizing the computations, we are able to explore 16,000 different parameter combinations for computing VPIN in less than 20 hours on a 32-core IBM DataPlex machine. On average, computing VPIN for one futures contract over 5.5 years takes around 1.5 seconds on one core, which demonstrates that a modest computer is sufficient to monitor a vast number of trading activities in real time, an ability that could be valuable to regulators.
By examining a large number of parameter combinations, we are also able to identify the parameter settings that improve the prediction accuracy from 80% to 93%.