The Internet Is Broken, But We Can Fix It
The recent scandal involving Facebook and Cambridge Analytica was different from most other breaches, because it wasn’t a breach at all. Cambridge Analytica simply walked through the front door, took the data it wanted and then moved it back to its own servers to be used for a completely different purpose from the one intended.
We derive enormous benefits from having our data shared. As Sheryl Sandberg pointed out, without data, Facebook would have to be a paid service. The same goes for Google and every other ad-supported business on the web. Yet there is a dark side to technology, and data sharing is its greatest vulnerability.
Data sharing goes far beyond social media. We want doctors to share our data so that we can get medical treatment. Banks need information about us so that they can extend credit. Law enforcement needs access to data to keep us safe. Yet each one of these activities exposes us. It’s not just Facebook: the Internet is broken, but new technologies may be able to fix it.
A Brief History Of Data
The use of data, as we know it today, was pioneered by Herman Hollerith in the late 1880s, when he devised a system of punch cards to store and calculate census bureau data. His invention led him to found the Tabulating Machine Company, which eventually formed the basis for what we now know as IBM.
By the 1960s, IBM had successfully made the transition from mechanical to digital computers, and digital versions of Hollerith’s punch cards, called flat file databases, became standard. These were, of course, far larger, faster and more efficient than punch cards, but they worked in essentially the same way and did not fully leverage the capabilities of digital machines.
That began to change when a researcher at IBM named Edgar F. Codd came up with a relational model for the database. The problem he solved was that a flat file required an analyst to be intimately familiar with the database structure in order to glean insights from it. In Codd’s model, however, relationships between data were stored in the database and could be retrieved using a query language.
It was Codd’s innovation that helped power the data economy as we know it today. Data could be stored centrally, but used remotely by whoever was given access to it. This gave it a value independent of the purpose for which it was originally stored, because query languages could be used to establish relationships that weren’t initially obvious or planned for.
Distributed Computing Needs Secure, Distributed Data
At about the same time that Codd was developing the relational database, engineers at the Defense Department’s Advanced Research Projects Agency (ARPA) were developing a new kind of network called ARPANET, which became the precursor to the Internet. That meant a lot more people could access a lot more databases and derive a lot more insights.
At first, ARPANET was limited to a small cadre of researchers at government labs and academic institutions, who would use it to share scientific data, but by the late 80s that tight circle expanded greatly. Tim Berners-Lee created the World Wide Web in 1989 and the Gore Act opened up ARPANET and other networks in 1991 to create what we now know as the Internet.
It also led to the problems with data security we have today. Although information stored in a database can be made secure through encryption, it must be decrypted in order to be analyzed. That’s created what Mark Zuckerberg described in a recent interview as a “values tension” between data portability and security.
To be clear, the Cambridge Analytica scandal was, in part, caused by serious governance issues at Facebook. The company has always pushed the envelope on openness and that’s been a big part of its commercial success. However, beyond governance concerns, the episode highlights a dire need to close what’s become a gaping security hole. Distributed computing requires distributed security.
The news isn’t all bad though. A new class of techniques, called Secure Multi-Party Computation (SMPC), may be able to give us the best of both worlds: the ability to share data for analysis while keeping it secure.
Empowering Secure Collaboration
There are a number of efforts now underway which aim to bring SMPC into the mainstream. One is a new pilot program at Experian that’s shown great promise. “What we’ve been able to do is take SMPC from an experimental technology that could handle just a handful of data providers to a commercial technology that can potentially do far more complex analysis on hundreds of attributes,” Kevin Chen, U.S. Chief Scientist at Experian Datalabs told me.
Basically, the way it works is this: take a group of, say, ten banks, all of which have data on their customers’ assets, income, loan payment history and so on. Obviously, they can all benefit from access to each other’s data, but they don’t want to share it, for both privacy and competitive reasons. What Experian does is break the data into parts that, individually, are worthless, but can be pooled in a separate environment to perform analysis.
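The pooling step described above can be sketched with additive secret sharing, a basic building block behind many SMPC protocols. This is a minimal illustration only, not Experian’s actual system; the three-bank setup and balance figures are hypothetical:

```python
import random

MOD = 2**61 - 1  # a large prime; all shares live in this field

def share(value, n_parties):
    """Split a value into n random shares that sum to it mod MOD."""
    shares = [random.randrange(MOD) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MOD)
    return shares

# Hypothetical loan balances held by three banks (illustrative numbers)
balances = [120_000, 95_000, 310_000]
n = len(balances)

# Each bank splits its value into shares; any single share is just
# a uniformly random number and reveals nothing about the input.
all_shares = [share(b, n) for b in balances]

# Each computing party holds one share from every bank (one "column")
# and sums the shares it holds locally...
partial_sums = [sum(col) % MOD for col in zip(*all_shares)]

# ...and only combining the partial sums reveals the aggregate,
# never any individual bank's figure.
total = sum(partial_sums) % MOD
print(total)  # 525000
```

The key property is that each party only ever sees random-looking shares; the true total emerges only when the partial results are combined.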
Another approach is called Fully Homomorphic Encryption (FHE), which was developed by Craig Gentry at IBM and allows data to be analyzed while it is still encrypted. So, for example, a company like Cambridge Analytica could use Facebook’s encrypted data, analyze it in encrypted form, and feed the encrypted result back to the platform to target campaigns on Facebook, while the data itself is never exposed in the clear.
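The core idea of computing on ciphertexts can be illustrated with a toy Paillier-style scheme. Note the hedge: Paillier is only additively homomorphic (you can add encrypted numbers but not run arbitrary programs on them, as FHE allows), and the tiny primes below are purely illustrative, never secure:

```python
import random
from math import gcd

# Toy Paillier-style additively homomorphic encryption.
p, q = 293, 433          # tiny primes for illustration only
n = p * q
n2 = n * n
g = n + 1
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)  # lcm(p-1, q-1)

def L(u):
    return (u - 1) // n

mu = pow(L(pow(g, lam, n2)), -1, n)  # modular inverse (Python 3.8+)

def encrypt(m):
    while True:
        r = random.randrange(1, n)
        if gcd(r, n) == 1:
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return (L(pow(c, lam, n2)) * mu) % n

c1, c2 = encrypt(17), encrypt(25)
c_sum = (c1 * c2) % n2   # multiplying ciphertexts adds the plaintexts
print(decrypt(c_sum))    # 42
```

Neither ciphertext reveals its plaintext, yet anyone can combine them and hand back an encrypted sum; only the key holder can read the result. FHE extends this trick from addition to arbitrary computation, which is where the heavy performance cost comes from.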
The problem with FHE, which is still largely experimental, is that it’s much slower — about 10,000–100,000 times slower in fact — than conventional methods of analysis. Still, the technology is improving quickly and Shai Halevi, a cryptographer at IBM, told me he believes it will be commercially available for niche projects, such as genomics research, within a year or two. With improvements, it could be ready for widespread adoption in five years or so.
There is also great potential in combining insights from both approaches. For example, you could encrypt the most sensitive data, like names, addresses, social security numbers, etc., using FHE and break the rest up into parts. So a bad actor like Cambridge Analytica might, with enormous effort, be able to gain access to some behavior data, but would still have no way of figuring out who it belonged to.
Building a More Secure Data Economy
As noted above, our economy has become highly dependent on the ability to share data. “There have been massive businesses based on data aggregation, the media industry being just one of them. Others, such as health care, risk management and internet security are just as important, if not more so,” says Eric Haller, Global Head of Experian Datalabs. “If data doesn’t have to be aggregated to extract the value, that’s a real paradigm shift in terms of helping businesses make smarter decisions and protecting consumers.”
It’s not hard to see how these technologies could have a major impact. Doctors will be able to gain greater access to patient histories to help them prescribe the right treatment. Lenders will be better able to extend credit to customers. Entrepreneurs will find it easier to get financing for their businesses, and law enforcement will be able to collaborate more closely with other organizations, such as airlines, to keep us safe.
What the Facebook-Cambridge Analytica scandal makes clear is that we can’t go on as we have been. Distributed computing requires distributed security. Initiatives like GDPR can help, as can other technologies like blockchain, but what’s really needed is to truly secure our data infrastructure. We have roughly 30 years of technology debt to work off and we simply can’t wait any longer.
As Josh Sutton, CEO of Agorai and an expert in cognitive technologies, put it to me: “Data is unique as an asset in that its power to create value depends on how it can be combined with other data. Once we can do that securely, we make data a far more liquid asset that will create far more value for everyone.”
An earlier version of this article first appeared in Inc.com
Image: Anthony Quintano