Google launches new search engine to help scientists find the datasets they need

Google’s goal has always been to organize the world’s information, and its first target was the commercial web. Now, it wants to do the same for the scientific community with a new search engine for datasets.

The service, called Dataset Search, launches today, and it will be a companion of sorts to Google Scholar, the company’s popular search engine for academic studies and reports. Institutions that publish their data online, like universities and governments, will need to include metadata tags in their webpages that describe their data, including who created it, when it was published, how it was collected, and so on. This information will then be indexed by Google’s search engine and combined with information from the Knowledge Graph. (So if dataset X was published by CERN, a little information about the institute will also be included in the search.)
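
The open standard in question is schema.org’s Dataset vocabulary, which publishers typically embed in a page as JSON-LD. As a rough illustration (the field values below are entirely hypothetical), here is the kind of description a publisher might generate:

```python
# A minimal sketch of schema.org "Dataset" metadata, the kind of markup
# Dataset Search crawls. Field values here are purely illustrative.
import json

dataset_metadata = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example ocean temperature records",
    "description": "Monthly sea-surface temperature measurements.",
    "creator": {"@type": "Organization", "name": "Example Oceanographic Institute"},
    "datePublished": "2018-01-15",
    "measurementTechnique": "Buoy-mounted thermistors",
    "license": "https://creativecommons.org/licenses/by/4.0/",
}

# A publisher would embed this in the dataset's landing page inside a
# <script type="application/ld+json"> tag for crawlers to pick up.
print(json.dumps(dataset_metadata, indent=2))
```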

Speaking to The Verge, Natasha Noy, a research scientist at Google AI who helped create Dataset Search, says the aim is to unify the tens of thousands of different repositories for datasets online. “We want to make that data discoverable, but keep it where it is,” says Noy.

At the moment, dataset publication is extremely fragmented. Different scientific domains have their own preferred repositories, as do different governments and local authorities. “Scientists say, ‘I know where I need to go to find my datasets, but that’s not what I always want,’” says Noy. “Once they step out of their unique community, that’s when it gets hard.”

Noy gives the example of a climate scientist she spoke to recently who told her she’d been looking for a specific dataset on ocean temperatures for an upcoming study but couldn’t find it anywhere. She didn’t track it down until she ran into a colleague at a conference who recognized the dataset and told her where it was hosted. Only then could she continue with her work. “And this wasn’t even a particularly boutique depository,” says Noy. “The dataset was well written up in a fairly prominent place, but it was still difficult to find.”


An example search for weather records in Google Dataset Search.
Image: Google

The initial release of Dataset Search will cover the environmental and social sciences, government data, and datasets from news organizations like ProPublica. However, if the service becomes popular, the amount of data it indexes should quickly snowball as institutions and scientists scramble to make their information accessible.

This should be helped by the recent flourishing of open data initiatives around the world. “I do think in the last several years the number of repositories has exploded,” says Noy. She credits the increasing importance of data in scientific literature, which means journals ask authors to publish datasets, as well as “government regulations in the US and Europe and the general rise of the open data movement.”

Having Google involved should help make this project a success, says Jeni Tennison, CEO of the Open Data Institute (ODI). “Dataset search has always been a difficult thing to support, and I’m hopeful that Google stepping in will make it easier,” she says.

To create a decent search engine, you need to know how to build user-friendly systems and understand what people mean when they type in certain phrases, says Tennison. Google obviously knows what it’s doing in both of those departments.

In fact, says Tennison, ideally Google will publish its own dataset on how Dataset Search gets used. Although the metadata tags the company is using to make datasets visible to its search crawlers are an open standard (meaning that any competitor, like Bing or Yandex, can also use them to build a competing service), search engines improve most quickly when a critical mass of users is there to provide data on what they’re doing.

“Simply understanding how people search is important… what kind of terms they use, how they express them,” says Tennison. “If we want to get to grips with how people search for data and make it more accessible, it would be great if Google opened up its own data on this.”

In other words: Google should publish a dataset about dataset search that would be indexed by Dataset Search. What could be more appropriate?

Google and Harvard team up to use deep learning to predict earthquake aftershocks

After a big earthquake hits, the danger isn’t over. Smaller, follow-up quakes that are triggered by the initial shock can rumble around an affected area for months, toppling structures weakened by the parent quake. Scientists can predict the size and timing of these aftershocks to some degree, but nailing the location has always proved challenging. New research from scientists at Harvard and Google suggests AI might be able to help.

In a paper published in the journal Nature this week, researchers show how deep learning can help predict aftershock locations more reliably than existing models. Scientists trained a neural network to look for patterns in a database of more than 131,000 “mainshock-aftershock” events, before testing its predictions on a database of 30,000 similar pairs.
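
As a rough sketch of that setup (not the paper’s actual model, features, or data), think of every grid cell around a mainshock as one training example: the inputs are stress-change values for the cell, and the label records whether an aftershock occurred there. Something in this spirit, with entirely synthetic numbers:

```python
# Illustrative sketch only: a small feed-forward classifier over per-cell
# stress-change features, predicting whether an aftershock occurred in the
# cell. Architecture, feature count, and data are assumptions, not the
# Nature paper's specification.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the training database of mainshock-aftershock events.
X_train = rng.normal(size=(131_000, 6))            # 6 stress-change features per grid cell
y_train = rng.integers(0, 2, size=131_000)         # 1 = aftershock observed in that cell

model = MLPClassifier(hidden_layer_sizes=(50, 50), activation="tanh", max_iter=20)
model.fit(X_train, y_train)

# Held-out events would then be scored the same way.
X_test = rng.normal(size=(30_000, 6))
aftershock_probability = model.predict_proba(X_test)[:, 1]
```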

The deep learning network was significantly more reliable than the most useful existing model, known as “Coulomb failure stress change.” On a scale of accuracy running from 0 to 1 — in which 1 is a perfectly accurate model and 0.5 is as good as flipping a coin — the existing Coulomb model scored 0.583, while the new AI system hit 0.849.
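
That 0-to-1 scale, with 0.5 meaning no better than chance, is how the area under the ROC curve (AUC) behaves; it measures how reliably a model ranks cells that did produce aftershocks above cells that didn’t. A toy example with made-up labels and scores:

```python
# AUC on a handful of made-up predictions: 1.0 means every true aftershock
# cell is ranked above every quiet cell; 0.5 is no better than coin flips.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]                  # did an aftershock occur here?
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]    # model's predicted probabilities

print(roc_auc_score(y_true, y_score))
```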

Brendan Meade, a professor of Earth and planetary sciences at Harvard who helped author the paper, told ScienceDaily that the results were promising. “There are three things you want to know about earthquakes,” said Meade. “When they are going to occur, how big they’re going to be and where they’re going to be. Prior to this work we had empirical laws for when they would occur and how big they were going to be, and now we’re working the third leg, where they might occur.”


Relief efforts continue in Italy after a 6.2-magnitude earthquake. Predicting the location of aftershocks could help direct emergency services to where they’re needed.
Photo by Carl Court / Getty Images

The success of artificial intelligence in this domain is thanks to one of the technology’s core strengths: its ability to uncover previously overlooked patterns in complex datasets. This is especially relevant in seismology, where connections in the data can be extremely difficult to see. Seismic events involve a huge number of variables, from the makeup of the ground in different areas to the interactions between tectonic plates to the ways energy propagates in waves through the Earth, and making sense of it all is incredibly hard.

Read more: AI is helping seismologists detect earthquakes they’d otherwise miss

The researchers say their deep learning model was able to make its predictions by considering a factor known as the “von Mises yield criterion,” a complex calculation used to predict when materials will begin to break under stress. As Meade tells ScienceDaily, this factor is often used in fields like metallurgy, “but has never been popular in earthquake science.” Now, with the findings of this new model, geologists can investigate its relevance.
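
The criterion itself is compact: it collapses a full three-dimensional stress state into a single number that is compared against the material’s yield strength. A direct implementation of the standard formula (the input values here are arbitrary):

```python
# Von Mises equivalent stress from the six independent components of a
# 3D stress tensor; yielding is predicted when it exceeds the material's
# yield strength. Example inputs are arbitrary.
import math

def von_mises_stress(s11, s22, s33, s12, s23, s31):
    return math.sqrt(
        0.5 * ((s11 - s22) ** 2 + (s22 - s33) ** 2 + (s33 - s11) ** 2)
        + 3.0 * (s12 ** 2 + s23 ** 2 + s31 ** 2)
    )

print(von_mises_stress(30.0, 10.0, 5.0, 4.0, 2.0, 1.0))
```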

Despite the success of this research, it’s far from ready to deploy in the real world. For a start, the AI model only accounts for aftershocks caused by permanent changes to the ground, known as static stress. But follow-up quakes can also be triggered by the transient shaking of passing seismic waves, known as dynamic stress. The model is also too slow to work in real time. That matters, as most aftershocks occur on the first day after the mainshock, before roughly halving in frequency on each following day.

As Phoebe DeVries, a Harvard postdoc who helped lead the research, told ScienceDaily: “We’re still a long way from actually being able to forecast [aftershocks] but I think machine learning has huge potential here.”

DeepMind’s AI can detect over 50 eye diseases as accurately as a doctor

Step by step, condition by condition, AI systems are slowly learning to diagnose disease as well as any human doctor, and they could soon be working in a hospital near you. The latest example is from London, where researchers from Google’s DeepMind subsidiary, UCL, and Moorfields Eye Hospital have used deep learning to create software that identifies dozens of common eye diseases from 3D scans and then recommends the patient for treatment.

The work is the result of a multiyear collaboration between the three institutions. And while the software is not ready for clinical use, it could be deployed in hospitals in a matter of years. Those involved in the research described it as “ground-breaking.” Mustafa Suleyman, head of DeepMind Health, said in a press statement that the project was “incredibly exciting” and could, in time, “transform the diagnosis, treatment, and management of patients with sight threatening eye conditions […] around the world.”

The software, described in a paper published in the journal Nature Medicine, is based on established principles of deep learning, which uses algorithms to identify common patterns in data. In this case, the data is 3D scans of patients’ eyes made using a technique known as optical coherence tomography, or OCT. Creating these scans takes around 10 minutes and involves bouncing near-infrared light off of the interior surfaces of the eye. Doing so creates a 3D image of the tissue, which is a common way to assess eye health. OCT scans are a crucial medical tool, as early identification of eye disease often saves the patient’s sight.


An example of an OCT scan, showing the thickness of retinal tissue in a patient’s eye.
Credit: UCL, Moorfields, DeepMind, et al

The software was trained on nearly 15,000 OCT scans from some 7,500 patients. These individuals had all been treated at sites operated by Moorfields, which is the largest eye hospital in Europe and North America. The system was fed their scans alongside diagnoses by human doctors. From this, it learned how to first identify the different anatomical elements of the eye (a process known as segmentation) and then recommend clinical action based on the various signs of diseases that the scans show.
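
In outline, then, it is a two-stage pipeline: one network segments the scan into tissue types, and a second turns that segmentation into a recommendation. The sketch below mirrors only that shape; the functions, thresholds, and urgency labels are hypothetical placeholders, not DeepMind’s networks.

```python
# Hypothetical two-stage pipeline in the shape described above. The real
# system uses deep networks for both stages; these stand-ins just show how
# the pieces fit together.
import numpy as np

REFERRAL_LEVELS = ["observation", "routine", "semi-urgent", "urgent"]   # illustrative labels

def segment_oct(volume: np.ndarray) -> np.ndarray:
    """Stage 1 (placeholder): label each voxel of the OCT scan by tissue type."""
    return (volume > volume.mean()).astype(np.int64)

def recommend_referral(tissue_map: np.ndarray) -> str:
    """Stage 2 (placeholder): map the segmentation to a referral urgency."""
    abnormal_fraction = float(tissue_map.mean())
    index = min(int(abnormal_fraction * len(REFERRAL_LEVELS)), len(REFERRAL_LEVELS) - 1)
    return REFERRAL_LEVELS[index]

scan = np.random.rand(64, 64, 64)        # stand-in for a real OCT volume
print(recommend_referral(segment_oct(scan)))
```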

In a test where the AI’s judgments were compared with diagnoses by a panel of eight doctors, the software made the same recommendation more than 94 percent of the time.

Whose call is it anyway?

Results like this are extremely encouraging, but experts in the medical community are still worried about how AI systems will be integrated into care practices. Luke Oakden-Rayner, a radiologist who’s written extensively on the subject, says advances in AI are fast pushing us toward a tipping point where software is no longer a tool that’s applied and interpreted by a doctor, but something that makes decisions on behalf of humans.

The first systems are just beginning to cross this line. In April, the FDA approved the first AI-powered program that diagnoses disease without human oversight. As one of the program’s creators put it: “It makes the clinical decision on its own.” (Coincidentally, like today’s new algorithm, this software also analyzes eye scans. But it only looks for one disease, diabetic retinopathy, whereas DeepMind’s is sensitive to more than 50 conditions.)

This is the point at which the risk from medical AI becomes much greater. Our inability to explain exactly how AI systems reach certain decisions is well-documented. And, as we’ve seen with self-driving car crashes, when humans take our hands off the wheel, there’s always a chance that a computer will make a fatal error in judgment.

The researchers from DeepMind, UCL, and Moorfields are aware of these issues, and their software contains a number of features designed to mitigate this sort of problem.

First, the software doesn’t rely on a single algorithm making the decision, but a group of them, and each is trained independently so that any freak error will be overruled by the majority. Second, the system doesn’t just spit out a single answer for each diagnosis. Instead, it gives several possible explanations, alongside its confidence in each one. It also shows how it has labeled the parts of the patient’s eye, giving doctors an opportunity to spot faulty analysis.
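
As an illustration of how those two safeguards can fit together (again, not DeepMind’s code), the sketch below averages several independently trained models so that a single freak output is diluted rather than decisive, and reports a confidence per condition rather than one flat answer. The condition labels and numbers are made up.

```python
# Toy ensemble: average per-condition probabilities across independently
# trained models, so the output is a set of confidences and an outlier
# model cannot dominate the result. All values are made up.
import numpy as np

CONDITIONS = ["normal", "condition A", "condition B"]   # placeholder labels

def ensemble_confidences(per_model_probs: np.ndarray) -> dict:
    """per_model_probs has shape (n_models, n_conditions); rows sum to 1."""
    mean_probs = per_model_probs.mean(axis=0)
    return {name: round(float(p), 3) for name, p in zip(CONDITIONS, mean_probs)}

votes = np.array([
    [0.10, 0.80, 0.10],
    [0.15, 0.75, 0.10],
    [0.70, 0.20, 0.10],   # the outlier is averaged away, not decisive
])
print(ensemble_confidences(votes))
```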


An example diagnosis from the system. Most of the boxes show how the AI has labeled parts of the OCT scan, but in the top left you can see its recommendation and various confidence levels.
Image: UCL, Moorfields, DeepMind, et al

But most importantly, the software isn’t a straightforward diagnostic tool. Instead, it’s designed to be used for triage, the process of deciding which patients need care first. So while it does guess what conditions a patient might have, the actual recommendation it makes is how urgently the individual needs to be referred for treatment.

These features sound incidental, but each of them operates like a speed bump, slowing the algorithm down, and giving humans a chance to intervene. The real test, though, will come when this software is deployed and tested in a real clinical environment. When this might happen isn’t known, but DeepMind says it hopes to start the process “soon.”

Gold from the data mine

Along with its clinical possibilities, this research is also interesting as an example of how AI companies benefit from access to valuable datasets. DeepMind, specifically, has been criticized in the past for how it has accessed data from patients treated by the UK’s publicly funded National Health Service (NHS). In 2017, the UK’s data watchdog even ruled that a deal the company struck in 2015 was illegal because it failed to properly notify patients about how their data was being used. (The deal has since been superseded.)

Today’s research would not have been possible without access to this same data. And while the information used in this research was anonymized and patients could opt out, the diagnostic software created from this data belongs solely to DeepMind.

The company says that if the software is approved for use in a clinical setting, it will be provided free of charge to Moorfields’ clinicians for a period of five years. But that doesn’t stop DeepMind from selling the software to other hospitals in the UK or other countries. DeepMind says this sort of deal is standard practice for the industry, and it tells The Verge it “invested significantly” in this research to create the algorithm. It also notes that the data it helped corral is now available for public use and non-commercial medical research.

Despite efforts like this, skepticism about the firm remains. A recent independent panel set up by DeepMind to scrutinize its own business practices suggested that the company needs to be more transparent about its business model and its relationship with Google, which bought the firm in 2014. As DeepMind gets closer to producing commercial products using publicly funded NHS data, this sort of scrutiny will likely become increasingly pointed.

An eye on the future

Regardless of these issues, it’s clear that algorithms like this could be incredibly beneficial. Some 285 million people around the world are estimated to live with a form of sight loss, and eye disease is the biggest cause of this condition.

OCT scans are a great tool for spotting eye disease (5.35 million were performed in the US alone in 2014), but interpreting this data takes time, creating a bottleneck in the diagnostic process. If algorithms can help triage patients by directing doctors to those most in need of care, it could be incredibly beneficial.

As Dr. Pearse Keane, a consultant ophthalmologist at Moorfields who was involved in the research, said in a press statement: “The number of eye scans we’re performing is growing at a pace much faster than human experts are able to interpret them. There is a risk that this may cause delays in the diagnosis and treatment of sight-threatening diseases.

“If we can diagnose and treat eye conditions early, it gives us the best chance of saving people’s sight. With further research it could lead to greater consistency and quality of care for patients with eye problems in the future.”