The emergence of Proxy Data in Machine Learning

Machine Learning algorithms usually require more data than traditional statistical algorithms. As more and more organisations rely on models for their operational work, it becomes important to make sure that good-quality data is available at the exact time of model execution or scoring. In a complex, changing world, not all data points are available, for a variety of reasons.

Data Science teams have increasingly been pursuing proxy data sources to reconstruct some of the features that predictive models use. This is driven by many reasons, including:

  • people are unwilling to share data points that could hurt their score,
  • the regulatory framework is fragile and stringent,
  • liabilities are unclear if something goes wrong,
  • data is not always available on time.

Regulatory constraints are very difficult to manage. A must-read for Data Scientists is Article 4 of the General Data Protection Regulation (GDPR), which defines key terms in the legislation, including personal data, processing and profiling.

'personal data' means any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person

So, what options are we data scientists left with for finding valuable information? Just imagine you are not allowed to use location data: how would you provide or deny a service based on location?

GDPR is a well-written regulation, but it will have to go through judicial scrutiny to establish how it is actually interpreted in different scenarios. Today, my team is looking into how to reconstruct some of the variables that we use now and that might vanish in future for a variety of reasons.
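One simple way to judge whether a candidate proxy could stand in for a variable that may vanish is to fit it against the original variable while both are still available, then check the fit on held-out history. The snippet below is only a minimal sketch of that idea: the function names, the single-proxy linear fit and the numbers are all hypothetical illustrations, not a description of any particular team's method.

```python
from statistics import mean


def fit_proxy(proxy, target):
    """Least-squares fit of target ~ slope * proxy + intercept."""
    if len(proxy) != len(target) or len(proxy) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x, mean_y = mean(proxy), mean(target)
    sxx = sum((x - mean_x) ** 2 for x in proxy)
    if sxx == 0:
        raise ValueError("proxy has no variation, so nothing can be fitted")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(proxy, target))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(proxy, target, slope, intercept):
    """Share of the target's variance explained by the fitted proxy."""
    mean_y = mean(target)
    ss_tot = sum((y - mean_y) ** 2 for y in target)
    if ss_tot == 0:
        raise ValueError("target has no variation, so R^2 is undefined")
    ss_res = sum((y - (slope * x + intercept)) ** 2
                 for x, y in zip(proxy, target))
    return 1.0 - ss_res / ss_tot


# Hypothetical history in which both the proxy and the original feature
# were observed (made-up numbers, for illustration only).
proxy_history = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
feature_history = [2.1, 3.9, 6.2, 8.1, 9.8, 12.1]

# Fit on the earlier observations, then evaluate on the later ones.
slope, intercept = fit_proxy(proxy_history[:4], feature_history[:4])
holdout_r2 = r_squared(proxy_history[4:], feature_history[4:], slope, intercept)
print(f"slope={slope:.3f} intercept={intercept:.3f} holdout R^2={holdout_r2:.3f}")
```

A low or negative held-out R^2 would suggest the proxy does not track the original variable well enough to replace it. In practice a reconstruction would likely combine several proxies and a richer model.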

A lot of interesting research is being done on finding additional data sources to reconstruct features using established economic and social theories. I will pull an example shared by Juan from Microsoft at the Strata conference, titled "Using Data Science on Internet Search Behavior as a Proxy for Human Behavior".

The work done by the Microsoft scientists showed how search data is useful in creating proxies for important predictive work. As per the brief they presented at Strata, adverse drug events that cause mortality are often discovered only after a drug comes to market, and search data can help detect them. The team hypothesised that Internet users may provide early clues about adverse drug events via their online information-seeking. Mining search logs showed how the search activity of a population can be a proxy for drug safety surveillance. Compared with analyses of other sources such as electronic health records (EHR), logs are inexpensive to collect and mine.
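To make the idea of search activity as a proxy more concrete, here is a minimal sketch. It is not the Microsoft team's actual method. It compares how often users who searched for a drug also searched for a symptom against the same rate among everyone else. The log format, the function name `symptom_rate_ratio`, the drug name "drugx" and every record in the toy log are hypothetical.

```python
def symptom_rate_ratio(logs, drug_terms, symptom_terms):
    """Compare symptom-search rates for drug searchers vs. other users.

    logs is an iterable of (user_id, query) pairs. Matching is a plain
    case-insensitive substring test, which is a deliberate simplification.
    """
    drug_terms = [t.lower() for t in drug_terms]
    symptom_terms = [t.lower() for t in symptom_terms]
    all_users, drug_users, symptom_users = set(), set(), set()
    for user_id, query in logs:
        q = query.lower()
        all_users.add(user_id)
        if any(term in q for term in drug_terms):
            drug_users.add(user_id)
        if any(term in q for term in symptom_terms):
            symptom_users.add(user_id)

    other_users = all_users - drug_users
    if not drug_users or not other_users:
        raise ValueError("need at least one user in each group")

    exposed_rate = len(drug_users & symptom_users) / len(drug_users)
    baseline_rate = len(other_users & symptom_users) / len(other_users)
    # A ratio well above 1 hints that the symptom is searched more often
    # by people who also searched for the drug; it is not proof of harm.
    ratio = exposed_rate / baseline_rate if baseline_rate > 0 else float("inf")
    return exposed_rate, baseline_rate, ratio


# Made-up toy log, for illustration only.
toy_logs = [
    ("u1", "DrugX dosage"),
    ("u1", "feeling dizzy"),
    ("u2", "drugx side effects"),
    ("u3", "weather today"),
    ("u4", "dizzy after lunch"),
    ("u5", "news"),
    ("u6", "football scores"),
]
exposed, baseline, ratio = symptom_rate_ratio(toy_logs, ["drugx"], ["dizzy"])
print(f"exposed={exposed:.2f} baseline={baseline:.2f} ratio={ratio:.2f}")
```

A real analysis would need far more care than this, for example around query ambiguity, the timing of searches relative to each other, and user privacy.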

In the coming years, understanding proxy data and building relationships among data will play an important role. Data Scientists need to be prepared to use data from multiple sources to test business hypotheses.

So, what are the next steps?

Probyto can help you understand the interrelations in your data and build robust machine learning algorithms using proxy data. We also work to create hypotheses and test them with proxy data to help businesses grow.