In today’s world, data truly makes the world go round. It is fundamental to virtually everything we do, and it assumes even greater power and importance when it is shared. Think about how much more quickly diseases could be cured, how much waste could be reduced, or how much more efficiently ecosystems could run if data could be freely exchanged. Of course, such sharing isn’t possible today, because we’re limited to using our own data, which, for good reason, is highly protected.
What is artificial data?
Artificial data, simply put, is data generated by an AI algorithm that has been trained on a real data set. The goal is to reproduce the statistical properties and patterns of the existing data set by modelling its probability distribution and sampling from it. The algorithm essentially creates new data that has the same characteristics as the original data and leads to the same answers. Crucially, though, it’s impossible for any of the original data to be reconstructed from either the algorithm or the artificial data it has created. As a result, the artificial data set has the same predictive power as the original data, but none of the privacy concerns that restrict the use of most original data sets.
Here is an example
Imagine, as a simple exercise, that you are interested in creating artificial data about athletes, specifically their height and speed. We can represent the relationship between these two variables as a simple linear function. If you take this function and want to create artificial data, it’s easy enough to have a machine randomly create a set of points that conform to the equation. This is our artificial set: the same equation, but different values.
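As a rough sketch of this simple case, the Python snippet below draws random heights and computes the matching speeds from a linear equation. The slope, intercept, and height range are made-up values chosen only for illustration, and the function name is hypothetical; nothing here comes from real athlete data.

```python
import random

# Assumed, illustrative coefficients: speed = SLOPE * height + INTERCEPT
SLOPE = 2.5       # hypothetical change in speed (m/s) per metre of height
INTERCEPT = 3.0   # hypothetical baseline speed (m/s)


def make_artificial_athletes(n, seed=0, noise=0.0):
    """Create n (height, speed) points that follow the linear equation.

    With noise=0.0 every point lies exactly on the line; a positive
    noise value adds Gaussian scatter around it.
    """
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        height = rng.uniform(1.5, 2.1)  # assumed plausible range in metres
        speed = SLOPE * height + INTERCEPT + rng.gauss(0.0, noise)
        points.append((height, speed))
    return points


artificial = make_artificial_athletes(5)
for height, speed in artificial:
    print(f"height={height:.2f} m  speed={speed:.2f} m/s")
```

Each run with a different seed gives different values, but every point still satisfies the same equation.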
Now imagine you are interested in height, speed, blood pressure, blood oxygen, and so on. The data is much more complicated, representing it requires more complex non-linear equations, and we need the power of AI to help us determine the pattern. Using the same thinking as in our simple example, one can now use the trained AI to create data points that approximate this new, more complex "pattern" we have learned, and thus create our artificial data set.
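The fit-then-sample idea can be sketched in a few lines of Python. To keep it short, the sketch swaps the trained AI for a much simpler model, a multivariate Gaussian (a mean vector plus a covariance matrix), which captures only linear relationships between variables. The "real" rows are themselves generated by a made-up stand-in function, and all names and numbers are hypothetical. It illustrates the mechanics only and makes no claim about privacy protection.

```python
import math
import random


def make_stand_in_real_data(n, seed=1):
    """Stand-in for a real data set: [height, speed, blood pressure] rows.
    The relationships and numbers are invented for illustration."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        height = rng.gauss(1.80, 0.08)
        speed = 2.5 * height + 3.0 + rng.gauss(0.0, 0.3)
        pressure = 120.0 - 2.0 * speed + rng.gauss(0.0, 5.0)
        rows.append([height, speed, pressure])
    return rows


def fit_gaussian(rows):
    """'Training': estimate the mean vector and covariance matrix."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows) / (n - 1)
            for j in range(d)] for i in range(d)]
    return mean, cov


def cholesky(a):
    """Lower-triangular L with L * L^T == a (a must be positive definite)."""
    d = len(a)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def sample_gaussian(mean, cov, n, seed=0):
    """'Generation': draw n new rows from the fitted distribution."""
    rng = random.Random(seed)
    low = cholesky(cov)
    d = len(mean)
    rows = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        rows.append([mean[i] + sum(low[i][k] * z[k] for k in range(i + 1))
                     for i in range(d)])
    return rows


real_rows = make_stand_in_real_data(500)
mean, cov = fit_gaussian(real_rows)
artificial_rows = sample_gaussian(mean, cov, 500)
```

None of the artificial rows is copied from the stand-in data, yet their averages and pairwise relationships should be close to those of the original rows. A real system would replace the Gaussian with a model able to learn non-linear patterns.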
While the pandemic has illustrated potential health research-oriented use cases for artificial data, we see potential for the technology across a range of other industries. For instance, in financial services, where restrictions around data usage and customer privacy are particularly limiting, companies are starting to use artificial data to help them identify and eliminate bias in how they treat customers without contravening data privacy regulations. Retailers are beginning to recognize how they could create new revenue streams by selling artificial copies of their customers’ purchasing behavior that companies such as consumer goods manufacturers would find extremely valuable—all while keeping their customers’ personal details safely locked up.
The value for business: security, speed and scale
While the use of artificial data today is still nascent, it is poised for massive growth in the coming years because it offers companies security, speed and scale when working with data and AI.
Artificial data’s most obvious benefit is in eliminating the risk of exposing critical data and compromising the privacy and security of companies and customers. Techniques such as encryption, anonymization, and advanced privacy-preserving methods focus on protecting the original data and the information in that data that could be traced back to an individual. But so long as the original data is in play, there is always a risk of compromising or exposing it in some way.
This is one of the main points of the COVID-19 example noted earlier and, indeed, is a big selling point for the healthcare industry at large. Imagine if we had pooled all the data we collectively have about everybody who has contracted the disease around the world since the beginning, and shared it with whoever wanted to use it. We likely would have been better off but, legally, there’s no chance of that happening. The NIH’s initiative demonstrates how artificial data can hurdle the privacy barrier.
Another big challenge companies face is getting access to their data quickly so they can start generating value from it. Artificial data eliminates the roadblocks of privacy and security protocols that often make it difficult and time-consuming to get and use data.
Consider the experience of one financial institution. The enterprise had a cache of rich and valuable data that could help decision makers solve a variety of business problems. And yet, the data was so highly protected and controlled that getting access to it was an arduous process even if the data would never leave the company. In one case, it took six months to get even a small amount of data, which the analysis team used very quickly. Another six months followed just to get an update. To get around this access obstacle, the company created artificial data from its original data. Now the team can continuously update and model the data and generate ongoing powerful insights into how to improve business performance.
Furthermore, with artificial data, a company can quickly train ML models on large data sets, which means faster training, testing, and deployment of an AI solution. This addresses a real challenge many companies face: not having enough data to train a model. Access to a large set of artificial data gives ML engineers and data scientists more confidence in the results they’re getting at the different stages of model development. That means getting to market more quickly with new products and services and, ultimately, delivering more value faster.
Scale is a by-product of security and speed. Secure, faster access to data makes it possible to expand the amount of data you can analyze and, by extension, the types and numbers of problems you can solve. This is attractive to big companies, whose current modeling efforts tend to be quite narrow because they are limited to just the data they own. Companies can, of course, purchase third-party data in its original form, but it’s often prohibitively expensive. Artificial data sets from third parties make it much easier and cheaper for companies to supplement their own data with additional data from many other sources, so they can learn more about the problem they are trying to solve and get more accurate answers, without the worry of compromising anyone’s privacy.