Monday, 14 February 2022

Research advances technology of AI assistance for anesthesiologists

A new study by researchers at MIT and Massachusetts General Hospital (MGH) suggests the day may be approaching when advanced artificial intelligence systems could assist anesthesiologists in the operating room.

In a special edition of Artificial Intelligence in Medicine, the team of neuroscientists, engineers, and physicians demonstrated a machine learning algorithm for continuously automating the dosing of the anesthetic drug propofol. Using an application of deep reinforcement learning, in which the software’s neural networks simultaneously learned how their dosing choices maintain unconsciousness and how to critique the efficacy of those choices, the algorithm outperformed more traditional software in sophisticated, physiology-based simulations of patients. It also closely matched the performance of real anesthesiologists when showing what it would do to maintain unconsciousness given recorded data from nine real surgeries.

The algorithm’s advances increase the feasibility for computers to maintain patient unconsciousness with no more drug than is needed, thereby freeing up anesthesiologists for all the other responsibilities they have in the operating room, including making sure patients remain immobile, experience no pain, remain physiologically stable, and receive adequate oxygen, say co-lead authors Gabe Schamberg and Marcus Badgeley.

“One can think of our goal as being analogous to an airplane’s autopilot, where the captain is always in the cockpit paying attention,” says Schamberg, a former MIT postdoc who is also the study’s corresponding author. “Anesthesiologists have to simultaneously monitor numerous aspects of a patient’s physiological state, and so it makes sense to automate those aspects of patient care that we understand well.”

Senior author Emery N. Brown, a neuroscientist at The Picower Institute for Learning and Memory and Institute for Medical Engineering and Science at MIT and an anesthesiologist at MGH, says the algorithm’s potential to help optimize drug dosing could improve patient care.

“Algorithms such as this one allow anesthesiologists to maintain more careful, near-continuous vigilance over the patient during general anesthesia,” says Brown, the Edward Hood Taplin Professor of Computational Neuroscience and Health Sciences and Technology at MIT.

Both actor and critic

The research team designed a machine learning approach that would not only learn how to dose propofol to maintain patient unconsciousness, but also how to do so in a way that would optimize the amount of drug administered. They accomplished this by endowing the software with two related neural networks: an “actor” with the responsibility to decide how much drug to dose at every given moment, and a “critic” whose job was to help the actor behave in a manner that maximizes “rewards” specified by the programmer. For instance, the researchers experimented with training the algorithm using three different rewards: one that penalized only overdosing, one that questioned providing any dose, and one that imposed no penalties.
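
To make the actor-critic division concrete, the sketch below shows, in PyTorch, what such a pair of networks and a “dose penalty” style reward could look like. The state features, layer sizes, and penalty weight are illustrative assumptions, not the study’s actual configuration.

    import torch
    import torch.nn as nn

    class Actor(nn.Module):
        """Maps the observed patient state (e.g., a brain-activity index and
        recent drug levels) to a propofol infusion rate."""
        def __init__(self, state_dim=4):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, 64), nn.ReLU(),
                nn.Linear(64, 1), nn.Softplus())  # infusion rate must be non-negative

        def forward(self, state):
            return self.net(state)

    class Critic(nn.Module):
        """Scores a (state, dose) pair: an estimate of expected future reward,
        used to critique the actor's choices during training."""
        def __init__(self, state_dim=4):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim + 1, 64), nn.ReLU(),
                nn.Linear(64, 1))

        def forward(self, state, dose):
            return self.net(torch.cat([state, dose], dim=-1))

    def dose_penalty_reward(unconsciousness_error, dose, dose_weight=0.1):
        """Penalizes deviation from the target unconsciousness level and,
        additionally, every unit of drug given (the 'dose penalty')."""
        return -(unconsciousness_error ** 2) - dose_weight * dose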

In every case, they trained the algorithm with simulations of patients that employed advanced models of both pharmacokinetics, or how quickly administered doses of propofol reach the relevant regions of the brain, and pharmacodynamics, or how the drug actually alters consciousness once it reaches its destination. Patient unconsciousness levels, meanwhile, were reflected in measures of brain waves, as they can be in real operating rooms. By running hundreds of rounds of simulation with a range of values for these conditions, both the actor and the critic could learn how to perform their roles for a variety of kinds of patients.

The most effective reward system turned out to be the “dose penalty” one, in which the critic questioned every dose the actor gave, constantly chiding the actor to keep dosing to the minimum necessary to maintain unconsciousness. Without any dosing penalty the system sometimes dosed too much, and with only an overdose penalty it sometimes gave too little. The “dose penalty” model learned more quickly and produced less error than the other reward models and the traditional standard software, a “proportional-integral-derivative” (PID) controller.
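
For reference, the traditional baseline mentioned above, a proportional-integral-derivative controller, computes the dose from the tracking error alone. Below is a generic textbook PID sketch; the gains and the five-second timestep are placeholder values, not those used in the study.

    class PIDController:
        """Generic PID controller: dose = Kp*e + Ki*integral(e) + Kd*de/dt,
        where e is the gap between the target and measured unconsciousness level."""
        def __init__(self, kp=1.0, ki=0.1, kd=0.05, dt=5.0):
            self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
            self.integral = 0.0
            self.prev_error = 0.0

        def step(self, target, measured):
            error = target - measured
            self.integral += error * self.dt
            derivative = (error - self.prev_error) / self.dt
            self.prev_error = error
            dose = self.kp * error + self.ki * self.integral + self.kd * derivative
            return max(dose, 0.0)  # an infusion rate cannot be negative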

An able advisor

After training and testing the algorithm with simulations, Schamberg and Badgeley put the “dose penalty” version to a more real-world test by feeding it patient consciousness data recorded from real cases in the operating room.  The testing demonstrated both the strengths and limits of the algorithm.

During most tests, the algorithm’s dosing choices closely matched those of the attending anesthesiologists during the period after unconsciousness had been induced and before it needed to be lifted. The algorithm, however, adjusted dosing as frequently as every five seconds, while the anesthesiologists (who all had plenty of other things to do) typically did so only every 20-30 minutes, Badgeley notes.

As the tests showed, the algorithm is not optimized for inducing unconsciousness in the first place, the researchers acknowledge. The software also doesn’t know of its own accord when surgery is over, they add, but it’s a straightforward matter for the anesthesiologist to manage that process.

One of the most important challenges any AI system is likely to continue to face, Schamberg says, is whether the data it is being fed about patient unconsciousness is perfectly accurate. Another active area of research in the Brown lab at MIT and MGH is in improving the interpretation of data sources, such as brain wave signals, to improve the quality of patient monitoring data under anesthesia.

In addition to Schamberg, Badgeley, and Brown, the paper’s other authors are Benyamin Meschede-Krasa and Ohyoon Kwon.

The JPB Foundation and the National Institutes of Health funded the study.



source https://news.mit.edu/2022/research-advances-technology-ai-assistance-anesthesiologists-0214

Friday, 11 February 2022

An International Scientific Challenge for the Diagnosis and Gleason Grading of Prostate Cancer

Posted by Po-Hsuan Cameron Chen, Software Engineer, Google Health and Maggie Demkin, Program Manager, Kaggle

In recent years, machine learning (ML) competitions in health have attracted ML scientists to work together to solve challenging clinical problems. These competitions provide access to relevant data and well-defined problems where experienced data scientists come to compete for solutions and learn new methods. However, a fundamental difficulty in organizing such challenges is obtaining and curating high quality datasets for model development and independent datasets for model evaluation. Importantly, to reduce the risk of bias and to ensure broad applicability, the generalizability of the resulting algorithms should ideally be evaluated on multiple independent evaluation datasets by an independent group of scientists.

One clinical problem that has attracted substantial ML research is prostate cancer, a condition that 1 in 9 men develop in their lifetime. A prostate cancer diagnosis requires pathologists to examine biological tissue samples under a microscope to identify cancer and grade the cancer for signs of aggressive growth patterns in the cells. However, this cancer grading task (called Gleason grading) is difficult and subjective due to the need for visual assessment of cell differentiation and Gleason pattern predominance. Building a large dataset of samples with expert annotations can help with the development of ML systems to aid in prostate cancer grading.

To help accelerate and enable more research in this area, Google Health, Radboud University Medical Center and Karolinska Institutet joined forces to organize a global competition, the Prostate cANcer graDe Assessment (PANDA) Challenge, on the open Kaggle platform. In “Artificial Intelligence for Diagnosis and Gleason Grading of Prostate Cancer: the PANDA challenge”, published in Nature Medicine, we present the results of the challenge. The study design of the PANDA challenge provided the largest public whole-slide image dataset available and was open to participants from April 21st until July 23rd, 2020. The development datasets remain available for further research. In this effort, we compiled and publicly released a European cohort of prostate cancer cases for algorithm development and pioneered a standardized evaluation setup for digital pathology that enabled independent, blinded external validation of the algorithms on data from both the United States and EU.

The global competition attracted participants from 65 countries (the size of the circle for each country illustrates the number of participants).

Design of the PANDA Challenge
The challenge had two phases: a development phase (i.e., the Kaggle competition) and a validation phase. During the competition, 1,290 developers from 65 countries competed in building the best performing Gleason grading algorithm, having full access to a development set for algorithm training. Throughout the competition teams submitted algorithms that were evaluated on a hidden tuning set.

In the validation phase, a selection of top-performing algorithms was independently evaluated on internal and external validation datasets with high quality reference grades from panels of expert prostate pathologists. In addition, a group of general pathologists graded a subset of the same cases to put the difficulty of the task and dataset in context. The algorithms submitted by the teams were then compared to grades from groups of international and US general pathologists on these subsets.

Overview of the PANDA challenge’s phases for development and validation.

Research Velocity During the Challenge
We found that a group of Gleason grading ML algorithms developed during a global competition could achieve pathologist-level performance and generalize well to intercontinental and multinational cohorts. On all external validation sets, these algorithms achieved high agreement with urologic pathologists (prostate specialists) and high sensitivity for detecting tumor in biopsies. The Kaggle platform enabled the tracking of teams’ performance throughout the competition. Impressively, the first team to achieve agreement with the prostate pathologists above 0.90 (quadratically weighted Cohen’s kappa) on the internal validation set did so within the first 10 days of the competition. By the 33rd day, the median performance of all teams exceeded a score of 0.85.
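
Quadratically weighted Cohen’s kappa, the agreement metric quoted above, can be computed with scikit-learn. The grade values below are invented purely to show the call; they are not data from the challenge.

    from sklearn.metrics import cohen_kappa_score

    # Hypothetical grade assignments (e.g., ISUP grade groups 0-5) from an
    # algorithm and from a pathologist reference panel.
    algorithm_grades = [0, 1, 2, 3, 4, 5, 3, 2]
    reference_grades = [0, 1, 2, 3, 5, 5, 3, 1]

    kappa = cohen_kappa_score(algorithm_grades, reference_grades, weights="quadratic")
    print(f"Quadratically weighted kappa: {kappa:.3f}")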

Progression of algorithms’ performances throughout the competition, as shown by the highest score on the tuning and internal validation sets among all participating teams. During the competition teams could submit their algorithm for evaluation on the tuning set, after which they received their score. At the same time, algorithms were evaluated on the internal validation set, without disclosing these results to the participating teams. The development of the top score obtained by any team shows the rapid improvement of the algorithms.

Learning from the Challenge
By moderating the discussion forum on the Kaggle platform, we learned that the teams’ openness in sharing code via Colab notebooks led to rapid improvement across the board, a promising sign for future public challenges, and a clear indication of the power of sharing knowledge on a common platform.

Organizing a public challenge that evaluates algorithm generalization across independent cohorts using high quality reference standard panels presents substantial logistical difficulties. Assembling this size of a dataset across countries and organizations was a massive undertaking. This work benefited from an amazing collaboration between the three organizing institutions which have all contributed respective publications in this space, two in Lancet Oncology and one in JAMA Oncology. Combining these efforts provided a high quality foundation on which this competition could be based. With the publication, Radboud and Karolinska research groups are also open sourcing the PANDA challenge development datasets to facilitate the further improvement of prostate Gleason grading algorithms. We look forward to seeing many more advancements in this field, and more challenges that can catalyze extensive international knowledge sharing and collaborative research.

Acknowledgements
Key contributors to this project at Google include Po-Hsuan Cameron Chen, Kunal Nagpal, Yuannan Cai, David F. Steiner, Maggie Demkin, Sohier Dane, Fraser Tan, Greg S. Corrado, Lily Peng, Craig H. Mermel. Collaborators on this project include Wouter Bulten, Kimmo Kartasalo, Peter Ström, Hans Pinckaers, Hester van Boven, Robert Vink, Christina Hulsbergen-van de Kaa, Jeroen van der Laak, Mahul B. Amin, Andrew J. Evans, Theodorus van der Kwast, Robert Allan, Peter A. Humphrey, Henrik Grönberg, Hemamali Samaratunga, Brett Delahunt, Toyonori Tsuzuki, Tomi Häkkinen, Lars Egevad, Masi Valkonen, Pekka Ruusuvuori, Geert Litjens, Martin Eklund and the PANDA Challenge consortium. We thank Ellery Wulczyn, Annisah Um'rani, Yun Liu, and Dale Webster for their feedback on the manuscript and guidance on the project. We thank our collaborators at NMCSD, particularly Niels Olson, for internal re-use of de-identified data which contributed to the US external validation set. Sincere appreciation also goes to Sami Lachgar, Ashley Zlatinov, and Lauren Winer for their feedback on the blogpost.



source http://ai.googleblog.com/2022/02/an-international-scientific-challenge.html

Thursday, 10 February 2022

Guiding Frozen Language Models with Learned Soft Prompts

Posted by Brian Lester, AI Resident and Noah Constant, Senior Staff Software Engineer, Google Research

Large pre-trained language models, which are continuing to grow in size, achieve state-of-the-art results on many natural language processing (NLP) benchmarks. Since the development of GPT and BERT, standard practice has been to fine-tune models on downstream tasks, which involves adjusting every weight in the network (i.e., model tuning). However, as models become larger, storing and serving a tuned copy of the model for each downstream task becomes impractical.

An appealing alternative is to share across all downstream tasks a single frozen pre-trained language model, in which all weights are fixed. In an exciting development, GPT-3 showed convincingly that a frozen model can be conditioned to perform different tasks through “in-context” learning. With this approach, a user primes the model for a given task through prompt design, i.e., hand-crafting a text prompt with a description or examples of the task at hand. For instance, to condition a model for sentiment analysis, one could attach the prompt, “Is the following movie review positive or negative?” before the input sequence, “This movie was amazing!”

Sharing the same frozen model across tasks greatly simplifies serving and allows for efficient mixed-task inference, but unfortunately, this is at the expense of task performance. Text prompts require manual effort to design, and even well-designed prompts still far underperform compared to model tuning. For instance, the performance of a frozen GPT-3 175B parameter model on the SuperGLUE benchmark is 5 points below a fine-tuned T5 model that uses 800 times fewer parameters.

In “The Power of Scale for Parameter-Efficient Prompt Tuning”, presented at EMNLP 2021, we explore prompt tuning, a more efficient and effective method for conditioning frozen models using tunable soft prompts. Just like engineered text prompts, soft prompts are concatenated to the input text. But rather than selecting from existing vocabulary items, the “tokens” of the soft prompt are learnable vectors. This means a soft prompt can be optimized end-to-end over a training dataset. In addition to removing the need for manual design, this allows the prompt to condense information from datasets containing thousands or millions of examples. By comparison, discrete text prompts are typically limited to under 50 examples due to constraints on model input length. We are also excited to release the code and checkpoints to fully reproduce our experiments.

Prompt tuning retains the strong task performance of model tuning, while keeping the pre-trained model frozen, enabling efficient multitask serving.

Prompt Tuning
To create a soft prompt for a given task, we first initialize the prompt as a fixed-length sequence of vectors (e.g., 20 tokens long). We attach these vectors to the beginning of each embedded input and feed the combined sequence into the model. The model’s prediction is compared to the target to calculate a loss, and the error is back-propagated to calculate gradients, however we only apply these gradient updates to our new learnable vectors — keeping the core model frozen. While soft prompts learned in this way are not immediately interpretable, at an intuitive level, the soft prompt is extracting evidence about how to perform a task from the labeled dataset, performing the same role as a manually written text prompt, but without the need to be constrained to discrete language.
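
The mechanics described above can be sketched in a few lines. The snippet below is a minimal PyTorch-style illustration in which only the soft prompt receives gradients; the prompt length, embedding size, and the interface of the frozen model are assumptions made for illustration, and the released T5X codebase is the reference implementation.

    import torch
    import torch.nn as nn

    class PromptTunedModel(nn.Module):
        def __init__(self, frozen_model, embed_dim=512, prompt_length=20):
            super().__init__()
            self.frozen_model = frozen_model
            for p in self.frozen_model.parameters():
                p.requires_grad = False          # keep every core-model weight frozen
            # The only trainable parameters: the soft prompt vectors.
            self.soft_prompt = nn.Parameter(torch.randn(prompt_length, embed_dim) * 0.5)

        def forward(self, input_embeddings):     # [batch, seq_len, embed_dim]
            batch = input_embeddings.shape[0]
            prompt = self.soft_prompt.unsqueeze(0).expand(batch, -1, -1)
            # Prepend the prompt to the embedded input and run the frozen model
            # (assumed here to accept embeddings directly).
            return self.frozen_model(torch.cat([prompt, input_embeddings], dim=1))

    # Only the prompt is optimized, with the large learning rate noted below.
    # optimizer = torch.optim.SGD([model.soft_prompt], lr=0.3)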

Our codebase, implemented in the new JAX-based T5X framework, makes it easy for anyone to replicate this procedure, and provides practical hyperparameter settings, including a large learning rate (0.3), which we found was important for achieving good results.

Since soft prompts have a small parameter footprint (we train prompts with as few as 512 parameters), one can easily pass the model a different prompt along with each input example. This enables mixed-task inference batches, which can streamline serving by sharing one core model across many tasks.

Left: With model tuning, incoming data are routed to task-specific models. Right: With prompt tuning, examples and prompts from different tasks can flow through a single frozen model in large batches, better utilizing serving resources.

Improvement with Scale
When evaluated on SuperGLUE and using a frozen T5 model, prompt tuning significantly outperforms prompt design using either GPT-3 or T5. Furthermore, as model size increases, prompt tuning catches up to the performance level of model tuning. Intuitively, the larger the pre-trained model, the less of a “push” it needs to perform a specific task, and the more capable it is of being adapted in a parameter-efficient way.

As scale increases, prompt tuning matches model tuning, despite tuning 25,000 times fewer parameters.

The effectiveness of prompt tuning at large model scales is especially important, since serving separate copies of a large model can incur significant computational overhead. In our paper, we demonstrate that larger models can be conditioned successfully even with soft prompts as short as 5 tokens. For T5 XXL, this means tuning just 20 thousand parameters to guide the behavior of an 11 billion parameter model.

Resilience to Domain Shift
Another advantage of prompt tuning is its resilience to domain shift. Since model tuning touches every weight in the network, it has the capacity to easily overfit on the provided fine-tuning data and may not generalize well to variations in the task at inference time. By comparison, our learned soft prompts have a small number of parameters, so the solutions they represent may be more generalizable.

To test generalizability, we train prompt tuning and model tuning solutions on one task, and evaluate zero-shot on a closely related task. For example, when we train on the Quora Question Pairs task (i.e., detecting if two questions are duplicates) and evaluate on MRPC (i.e., detecting if two sentences from news articles are paraphrases), prompt tuning achieves +3.2 points higher accuracy than model tuning.

Train   Eval   Tuning   Accuracy     F1
QQP     MRPC   Model    73.1 ±0.9    81.2 ±2.1
               Prompt   76.3 ±0.1    84.3 ±0.3
MRPC    QQP    Model    74.9 ±1.3    70.9 ±1.2
               Prompt   75.4 ±0.8    69.7 ±0.3
On zero-shot domain transfer between two paraphrase detection tasks, prompt tuning matches or outperforms model tuning, depending on the direction of transfer.

Looking Forward
Prompt-based learning is an exciting new area that is quickly evolving. While several similar methods have been proposed, such as Prefix Tuning, WARP, and P-Tuning, we discuss their pros and cons and demonstrate that prompt tuning is the simplest and the most parameter-efficient method.

In addition to the Prompt Tuning codebase, we’ve also released our LM-adapted T5 checkpoints, which we found to be better-suited for prompt tuning compared to the original T5. This codebase was used for the prompt tuning experiments in FLAN, and the checkpoints were used as a starting point for training the BigScience T0 model. We hope that the research community continues to leverage and extend prompt tuning in future research.

Acknowledgements
This project was a collaboration between Brian Lester, Rami Al-Rfou and Noah Constant. We are grateful to the following people for feedback, discussion and assistance: Waleed Ammar, Lucas Dixon, Slav Petrov, Colin Raffel, Adam Roberts, Sebastian Ruder, Noam Shazeer, Tu Vu and Linting Xue.



source http://ai.googleblog.com/2022/02/guiding-frozen-language-models-with.html

Wednesday, 9 February 2022

Nested Hierarchical Transformer: Towards Accurate, Data-Efficient, and Interpretable Visual Understanding

Posted by Zizhao Zhang, Software Engineer, Google Cloud

In visual understanding, the Vision Transformer (ViT) and its variants have received significant attention recently due to their superior performance on many core visual applications, such as image classification, object detection, and video understanding. The core idea of ViT is to utilize the power of self-attention layers to learn global relationships between small patches of images. However, the number of connections between patches increases quadratically with image size. Such a design has been observed to be data inefficient: although the original ViT can perform better than convolutional networks when pre-trained on hundreds of millions of images, such a data requirement is not always practical, and ViT still underperforms convolutional networks when given less data. Many researchers are exploring more suitable architectural re-designs that can learn visual representations effectively, such as by adding convolutional layers and building hierarchical structures with local self-attention.

The principle of hierarchical structure is one of the core ideas in vision models, where bottom layers learn more local object structures in the high-dimensional pixel space and top layers learn more abstracted, high-level knowledge in a low-dimensional feature space. Existing ViT-based methods focus on designing a variety of modifications inside self-attention layers to achieve such a hierarchy, but while these offer promising performance improvements, they often require substantial architectural re-designs. Moreover, these approaches lack an interpretable design, so it is difficult to explain the inner workings of trained models.

To address these challenges, in “Nested Hierarchical Transformer: Towards Accurate, Data-Efficient and Interpretable Visual Understanding”, we present a rethinking of existing hierarchical structure-driven designs, and provide a novel and orthogonal approach to significantly simplify them. The central idea of this work is to decouple the feature learning and feature abstraction (pooling) components: nested transformer layers encode visual knowledge of image patches separately, and then the processed information is aggregated. This process is repeated in a hierarchical manner, resulting in a pyramid network structure. The resulting architecture achieves competitive results on ImageNet and stronger results on data-efficient benchmarks. We show that such a design can meaningfully improve data efficiency with faster convergence and provide valuable interpretability benefits. Moreover, we introduce GradCAT, a new technique for interpreting the decision process of a trained model at inference time.

Architecture Design
The overall architecture is simple to implement by adding just a few lines of Python code to the source code of the original ViT. The original ViT architecture divides an input image into small patches, projects pixels of each patch to a vector with predefined dimension, and then feeds the sequences of all vectors to the overall ViT architecture containing multiple stacked identical transformer layers. While every layer in ViT processes information of the whole image, with this new method, stacked transformer layers are used to process only a region (i.e., block) of the image containing a few spatially adjacent image patches. This step is independent for each block and is also where feature learning occurs. Finally, a new computational layer called block aggregation then combines all of the spatially adjacent blocks. After each block aggregation, the features corresponding to four spatially adjacent blocks are fed to another module with a stack of transformer layers, which then process those four blocks jointly. This design naturally builds a pyramid hierarchical structure of the network, where bottom layers can focus on local features (such as textures) and upper layers focus on global features (such as object shape) at reduced dimensionality thanks to the block aggregation.
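
As a rough illustration of this flow, the sketch below processes each block independently with a shared transformer stack and then collapses groups of four adjacent blocks; the shapes are illustrative, and simple averaging stands in for the paper’s learned block aggregation layer.

    def process_hierarchy_level(patch_embeddings, transformer_stack):
        """patch_embeddings: a tensor of shape [batch, num_blocks, patches_per_block, dim].
        Every block is processed independently by the same transformer stack."""
        b, nb, p, d = patch_embeddings.shape
        x = patch_embeddings.reshape(b * nb, p, d)   # each block becomes a batch entry
        x = transformer_stack(x)                     # local feature learning per block
        return x.reshape(b, nb, p, d)

    def block_aggregation(block_features):
        """Stand-in aggregation: collapse every 4 spatially adjacent blocks into 1 by
        averaging. (The paper uses a learned aggregation layer; averaging is used here
        only to illustrate the 4x reduction in the number of blocks.)"""
        b, nb, p, d = block_features.shape
        grouped = block_features.reshape(b, nb // 4, 4, p, d)
        return grouped.mean(dim=2)                   # [batch, num_blocks / 4, patches, dim]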

A visualization of the network processing an image: Given an input image, the network first partitions it into blocks, where each block contains 4 image patches. Image patches in every block are linearly projected as vectors and processed by a stack of identical transformer layers. Then the proposed block aggregation layer aggregates information from each block and reduces its spatial size by 4 times. The number of blocks is reduced to 1 at the top of the hierarchy, and classification is conducted on its output.

Interpretability
This architecture has a non-overlapping information processing mechanism, independent at every node. This design resembles a decision tree-like structure, which provides unique interpretability capabilities, because every tree node contains the independent information of an image block that is then passed on to its parent node. We can trace the information flow through the nodes to understand the importance of each feature. In addition, our hierarchical structure retains the spatial structure of images throughout the network, leading to learned spatial feature maps that are effective for interpretation. Below we showcase two kinds of visual interpretability.

First, we present a method to interpret the trained model on test images, called gradient-based class-aware tree-traversal (GradCAT). GradCAT traces the feature importance of each block (a tree node) from top to bottom of the hierarchy structure. The main idea is to find the most valuable traversal from the root node at the top layer to a child node at the bottom layer that contributes the most to the classification outcomes. Since each node processes information from a certain region of the image, such traversal can be easily mapped to the image space for interpretation (as shown by the overlaid dots and lines in the image below).
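
One plausible, simplified reading of this traversal is sketched below: each child node is scored with a Grad-CAM-like gradient-times-activation term and the highest-scoring child is followed. The node interface and the exact scoring function are assumptions on our part; see the paper for the precise definition.

    import torch

    def gradcat_traversal(root, class_score):
        """Greedy top-down traversal from the root node to a leaf, descending at each
        step into the child whose features contribute most to the target class score.
        Each node is assumed to expose .children and a .features tensor that is part
        of the computation graph of class_score."""
        path = [root]
        node = root
        while node.children:
            scores = []
            for child in node.children:
                grad = torch.autograd.grad(class_score, child.features, retain_graph=True)[0]
                scores.append((grad * child.features).sum().item())
            node = node.children[scores.index(max(scores))]
            path.append(node)
        return path  # each node maps back to an image region, one block per level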

The following is an example of the model's top-4 predictions and corresponding interpretability results on the left input image (containing 4 animals). As shown below, GradCAT highlights the decision path along the hierarchical structure as well as the corresponding visual cues in local image regions on the images.

Given the left input image (containing four objects), the figure visualizes the interpretability results of the top-4 prediction classes. The traversal locates the model decision path along the tree and simultaneously locates the corresponding image patch (shown by the dotted line on images) that has the highest impact to the predicted target class.

Moreover, the following figures visualize results on the ImageNet validation set and show how this approach enables some intuitive observations. For instance, the example of the lighter below (upper left panel) is particularly interesting because the ground truth class — lighter/matchstick — actually defines the bottom-right matchstick object, while the most salient visual features (with the highest node values) are actually from the upper-left red light, which conceptually shares visual cues with a lighter. This can also be seen from the overlaid red lines, which indicate the image patches with the highest impact on the prediction. Thus, although the visual cue is a mistake, the output prediction is correct. In addition, the four child nodes of the wooden spoon below have similar feature importance values (see numbers visualized in the nodes; higher indicates more importance), which is because the wooden texture of the table is similar to that of the spoon.

Visualization of the results obtained by the proposed GradCAT. Images are from the ImageNet validation dataset.

Second, different from the original ViT, our hierarchical architecture retains spatial relationships in learned representations. The top layers output low-resolution feature maps of input images, enabling the model to easily perform attention-based interpretation by applying Class Attention Map (CAM) on the learned representations at the top hierarchical level. This enables high-quality weakly supervised object localization with just image-level labels. See the following figure for examples.

Visualization of CAM-based attention results. Warmer colors indicate higher attention. Images are from the ImageNet validation dataset.

Convergence Advantages
With this design, feature learning only happens at local regions independently, and feature abstraction happens inside the aggregation function. This design and its simple implementation are general enough for other types of visual understanding tasks beyond classification. It also greatly improves the model’s convergence speed, significantly reducing the training time needed to reach the desired maximum accuracy.

We validate this advantage in two ways. First, we compare against the standard ViT architecture on ImageNet accuracy for different numbers of total training epochs. The results are shown on the left side of the figure below, demonstrating much faster convergence than the original ViT, e.g., around 20% improvement in accuracy over ViT with 30 total training epochs.

Second, we modify the architecture to conduct unconditional image generation tasks, since training ViT-based models for image generation tasks is challenging due to convergence and speed issues. Creating such a generator is straightforward by transposing the proposed architecture: the input is an embedding vector, the output is a full image in RGB channels, and the block aggregation is replaced by a block de-aggregation component supported by Pixel Shuffling. Surprisingly, we find our generator is easy to train and demonstrates faster convergence speed, as well as better FID score (which measures how similar generated images are to real ones), than the capacity-comparable SAGAN.
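
PixelShuffle is available as a built-in PyTorch layer, so the block de-aggregation step can be illustrated as below; the layer sizes are placeholders rather than the paper’s actual generator configuration.

    import torch.nn as nn

    class BlockDeAggregation(nn.Module):
        """Inverse of block aggregation for the generator: expand the channel dimension,
        then pixel-shuffle the extra channels into a 2x larger spatial grid."""
        def __init__(self, dim):
            super().__init__()
            self.expand = nn.Conv2d(dim, 4 * dim, kernel_size=1)
            self.shuffle = nn.PixelShuffle(upscale_factor=2)

        def forward(self, x):                        # x: [batch, dim, h, w]
            return self.shuffle(self.expand(x))      # -> [batch, dim, 2h, 2w]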

Left: ImageNet accuracy given different number of total training epochs compared with standard ViT architecture. Right: ImageNet 64x64 image generation FID scores (lower is better) with single 1000-epoch training. On both tasks, our method shows better convergence speed.

Conclusion
In this work we demonstrate the simple idea that decoupling feature learning and feature information extraction in this nested hierarchy design leads to better feature interpretability through a new gradient-based class-aware tree-traversal method. Moreover, the architecture improves convergence on not only classification tasks but also image generation tasks. The proposed idea focuses on the aggregation function and is thereby orthogonal to advanced architecture design for self-attention. We hope this new research encourages future architecture designers to explore more interpretable and data-efficient ViT-based models for visual understanding, like the adoption of this work for high-resolution image generation. We have also released the source code for the image classification portion of this work.

Acknowledgements
We gratefully acknowledge the contributions of other co-authors, including Han Zhang, Long Zhao, Ting Chen, Sercan Arik, and Tomas Pfister. We also thank Xiaohua Zhai, Jeremy Kubica, Kihyuk Sohn, and Madeleine Udell for their valuable feedback on the work.



source http://ai.googleblog.com/2022/02/nested-hierarchical-transformer-towards.html

Tuesday, 8 February 2022

Robot See, Robot Do

Posted by Kevin Zakka, Student Researcher and Andy Zeng, Research Scientist, Robotics at Google

People learn to do things by watching others — from mimicking new dance moves, to watching YouTube cooking videos. We’d like robots to do the same, i.e., to learn new skills by watching people do things during training. Today, however, the predominant paradigm for teaching robots is to remote control them using specialized hardware for teleoperation and then train them to imitate pre-recorded demonstrations. This limits both who can provide the demonstrations (programmers & roboticists) and where they can be provided (lab settings). If robots could instead self-learn new tasks by watching humans, this capability could allow them to be deployed in more unstructured settings like the home, and make it dramatically easier for anyone to teach or communicate with them, expert or otherwise. Perhaps one day, they might even be able to use YouTube videos to grow their collection of skills over time.

Our motivation is to have robots watch people do tasks, naturally with their hands, and then use that data as demonstrations for learning. Video by Teh Aik Hui and Nathaniel Lim. License: CC-BY

However, an obvious but often overlooked problem is that a robot is physically different from a human, which means it often completes tasks differently than we do. For example, in the pen manipulation task below, the hand can grab all the pens together and quickly transfer them between containers, whereas the two-fingered gripper must transport one at a time. Prior research assumes that humans and robots can do the same task similarly, which makes manually specifying one-to-one correspondences between human and robot actions easy. But with stark differences in physique, defining such correspondences for seemingly easy tasks can be surprisingly difficult and sometimes impossible.

Physically different end-effectors (i.e., “grippers”, the part that interacts with the environment) induce different control strategies when solving the same task. Left: The hand grabs all pens and quickly transfers them between containers. Right: The two-fingered gripper transports one pen at a time.

In “XIRL: Cross-Embodiment Inverse RL”, presented as an oral paper at CoRL 2021, we explore these challenges further and introduce a self-supervised method for Cross-embodiment Inverse Reinforcement Learning (XIRL). Rather than focusing on how individual human actions should correspond to robot actions, XIRL learns the high-level task objective from videos, and summarizes that knowledge in the form of a reward function that is invariant to embodiment differences, such as shape, actions and end-effector dynamics. The learned rewards can then be used together with reinforcement learning to teach the task to agents with new physical embodiments through trial and error. Our approach is general and scales autonomously with data — the more embodiment diversity presented in the videos, the more invariant and robust the reward functions become. Experiments show that our learned reward functions lead to significantly more sample efficient (roughly 2 to 4 times) reinforcement learning on new embodiments compared to alternative methods. To extend and build on our work, we are releasing an accompanying open-source implementation of our method along with X-MAGICAL, our new simulated benchmark for cross-embodiment imitation.

Cross-Embodiment Inverse Reinforcement Learning (XIRL)
The underlying observation in this work is that in spite of the many differences induced by different embodiments, there still exist visual cues that reflect progression towards a common task objective. For example, in the pen manipulation task above, the presence of pens in the cup but not the mug, or the absence of pens on the table, are key frames that are common to different embodiments and indirectly provide cues for how close to being complete a task is. The key idea behind XIRL is to automatically discover these key moments in videos of different length and cluster them meaningfully to encode task progression. This motivation shares many similarities with unsupervised video alignment research, from which we can leverage a method called Temporal Cycle Consistency (TCC), which aligns videos accurately while learning useful visual representations for fine-grained video understanding without requiring any ground-truth correspondences.

We leverage TCC to train an encoder to temporally align video demonstrations of different experts performing the same task. The TCC loss tries to maximize the number of cycle-consistent frames (or mutual nearest-neighbors) between pairs of sequences using a differentiable formulation of soft nearest-neighbors. Once the encoder is trained, we define our reward function as simply the negative Euclidean distance between the current observation and the goal observation in the learned embedding space. We can subsequently insert the reward into a standard MDP and use an RL algorithm to learn the demonstrated behavior. Surprisingly, we find that this simple reward formulation is effective for cross-embodiment imitation.
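
Because the reward is just a distance in the learned embedding space, it can be written in a few lines. In the sketch below, encoder is the TCC-trained network and goal_embedding is taken as the average embedding of the final frames of the expert demonstrations; that choice of goal is an assumption made for illustration, so consult the paper and released code for the exact recipe.

    import numpy as np

    def xirl_reward(observation_frame, encoder, goal_embedding, scale=1.0):
        """Embodiment-invariant reward: the negative Euclidean distance between the
        current frame's embedding and the goal embedding in the learned TCC space."""
        z = encoder(observation_frame)               # embedding of the current observation
        return -scale * float(np.linalg.norm(z - goal_embedding))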

XIRL self-supervises reward functions from expert demonstrations using temporal cycle consistency (TCC), then uses them for downstream reinforcement learning to learn new skills from third-person demonstrations.

X-MAGICAL Benchmark
To evaluate the performance of XIRL and baseline alternatives (e.g., TCN, LIFS, Goal Classifier) in a consistent environment, we created X-MAGICAL, which is a simulated benchmark for cross-embodiment imitation. X-MAGICAL features a diverse set of agent embodiments, with differences in their shapes and end-effectors, designed to solve tasks in different ways. This leads to differences in execution speeds and state-action trajectories, which poses challenges for current imitation learning techniques, e.g., ones that use time as a heuristic for weak correspondences between two trajectories. The ability to generalize across embodiments is precisely what X-MAGICAL evaluates.

The SweepToTop task we considered for our experiments is a simplified 2D equivalent of a common household robotic sweeping task, where an agent has to push three objects into a goal zone in the environment. We chose this task specifically because its long-horizon nature highlights how different agent embodiments can generate entirely different trajectories (shown below). X-MAGICAL features a Gym API and is designed to be easily extendable to new tasks and embodiments. You can try it out today with pip install x-magical.

Different agent shapes in the SweepToTop task in the X-MAGICAL benchmark need to use different strategies to reposition objects into the target area (pink), i.e., to “clear the debris”. For example, the long-stick can clear them all in one fell swoop, whereas the short-stick needs to do multiple consecutive back-and-forths.
Left: Heatmap of state visitation for each embodiment across all expert demonstrations. Right: Examples of expert trajectories for each embodiment.

Highlights
In our first set of experiments, we checked whether our learned embodiment-invariant reward function can enable successful reinforcement learning, when the expert demonstrations are provided through the agent itself. We find that XIRL significantly outperforms alternative methods especially on the tougher agents (e.g., short-stick and gripper).

Same-embodiment setting: Comparison of XIRL with baseline reward functions, using SAC for RL policy learning. XIRL is roughly 2 to 4 times more sample efficient than some of the baselines on the harder agents (short-stick and gripper).

We also find that our approach shows great potential for learning reward functions that generalize to novel embodiments. For instance, when reward learning is performed on embodiments that are different from the ones on which the policy is trained, we find that it results in significantly more sample efficient agents compared to the same baselines. Below, in the gripper subplot (bottom right) for example, the reward is first learned on demonstration videos from long-stick, medium-stick and short-stick, after which the reward function is used to train the gripper agent.

Cross-embodiment setting: XIRL performs favorably when compared with other baseline reward functions, trained on observation-only demonstrations from different embodiments. Each agent (long-stick, medium-stick, short-stick, gripper) had its reward trained using demonstrations from the other three embodiments.

We also find that we can train on real-world human demonstrations, and use the learned reward to train a Sawyer arm in simulation to push a puck to a designated target zone. In these experiments as well, our method outperforms baseline alternatives. For example, our XIRL variant trained only on the real-world demonstrations (purple in the plots below) reaches 80% of the total performance roughly 85% faster than the RLV baseline (orange).


What Do The Learned Reward Functions Look Like?
To further explore the qualitative nature of our learned rewards in more challenging real-world scenarios, we collect a dataset of the pen transfer task using various household tools.


Below, we show rewards extracted from a successful (top) and unsuccessful (bottom) demonstration. Both demonstrations follow a similar trajectory at the start of the task execution. The successful one nets a high reward for placing the pens consecutively into the mug then into the glass cup, while the unsuccessful one obtains a low reward because it drops the pens outside the glass cup towards the end of the execution (orange circle). These results are promising because they show that our learned encoder can represent fine-grained visual differences relevant to a task.


Conclusion
We highlighted XIRL, our approach to tackling the cross-embodiment imitation problem. XIRL learns an embodiment-invariant reward function that encodes task progress using a temporal cycle-consistency objective. Policies learned using our reward functions are significantly more sample-efficient than baseline alternatives. Furthermore, the reward functions do not require manually paired video frames between the demonstrator and the learner, giving them the ability to scale to an arbitrary number of embodiments or experts with varying skill levels. Overall, we are excited about this direction of work, and hope that our benchmark promotes further research in this area. For more details, please check out our paper and download the code from our GitHub repository.

Acknowledgments
Kevin and Andy summarized research performed together with Pete Florence, Jonathan Tompson, Jeannette Bohg (faculty at Stanford University) and Debidatta Dwibedi. All authors would additionally like to thank Alex Nichol, Nick Hynes, Sean Kirmani, Brent Yi, Jimmy Wu, Karl Schmeckpeper and Minttu Alakuijala for fruitful technical discussions, and Sam Toyer for invaluable help with setting up the simulated benchmark.



source http://ai.googleblog.com/2022/02/robot-see-robot-do.html

Unlocking the Full Potential of Datacenter ML Accelerators with Platform-Aware Neural Architecture Search

Posted by Sheng Li, Staff Software Engineer and Norman P. Jouppi, Google Fellow, Google Research

Continuing advances in the design and implementation of datacenter (DC) accelerators for machine learning (ML), such as TPUs and GPUs, have been critical for powering modern ML models and applications at scale. These improved accelerators exhibit peak performance (e.g., FLOPs) that is orders of magnitude better than traditional computing systems. However, there is a fast-widening gap between the available peak performance offered by state-of-the-art hardware and the actual achieved performance when ML models run on that hardware.

One approach to address this gap is to design hardware-specific ML models that optimize both performance (e.g., throughput and latency) and model quality. Recent applications of neural architecture search (NAS), an emerging paradigm to automate the design of ML model architectures, have employed a platform-aware multi-objective approach that includes a hardware performance objective. While this approach has yielded improved model performance in practice, the details of the underlying hardware architecture are opaque to the model. As a result, there is untapped potential to build full capability hardware-friendly ML model architectures, with hardware-specific optimizations, for powerful DC ML accelerators.

In “Searching for Fast Model Families on Datacenter Accelerators”, published at CVPR 2021, we advanced the state of the art of hardware-aware NAS by automatically adapting model architectures to the hardware on which they will be executed. The approach we propose finds optimized families of models for which additional hardware performance gains cannot be achieved without loss in model quality (called Pareto optimization). To accomplish this, we infuse a deep understanding of hardware architecture into the design of the NAS search space for discovery of both single models and model families. We provide quantitative analysis of the performance gap between hardware and traditional model architectures and demonstrate the advantages of using true hardware performance (i.e., throughput and latency), instead of the performance proxy (FLOPs), as the performance optimization objective. Leveraging this advanced hardware-aware NAS and building upon the EfficientNet architecture, we developed a family of models, called EfficientNetX, that demonstrate the effectiveness of this approach for Pareto-optimized ML models on TPUs and GPUs.

Platform-Aware NAS for DC ML Accelerators
To achieve high performance, ML models need to adapt to modern ML accelerators. Platform-aware NAS integrates knowledge of the hardware accelerator properties into all three pillars of NAS: (i) the search objectives; (ii) the search space; and (iii) the search algorithm (shown below). We focus on the new search space because it contains the building blocks needed to compose the models and is the key link between the ML model architectures and accelerator hardware architectures.

We construct TPU/GPU specialized search spaces with TPU/GPU-friendly operations to infuse hardware awareness into NAS. For example, a key adaptation is maximizing parallelism to ensure different hardware components inside the accelerators work together as efficiently as possible. This includes the matrix multiplication units (MXUs) in TPUs and the TensorCore in GPUs for matrix/tensor computation, as well as the vector processing units (VPUs) in TPUs and CUDA cores in GPUs for vector processing. Maximizing model arithmetic intensity (i.e., optimizing the parallelism between computation and operations on the high bandwidth memory) is also critical to achieve top performance. To tap into the full potential of the hardware, it is crucial for ML models to achieve high parallelism inside and across these hardware components.
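
Arithmetic intensity can be made concrete with a simple roofline-style estimate: an operation is memory-bound until it performs enough FLOPs per byte moved to saturate the compute units. The peak numbers below are placeholders, not the specifications of any particular TPU or GPU.

    def attainable_tflops(arithmetic_intensity, peak_tflops, mem_bandwidth_tbps):
        """Roofline estimate: attainable throughput is capped either by peak compute
        or by memory bandwidth times arithmetic intensity (FLOPs per byte)."""
        return min(peak_tflops, arithmetic_intensity * mem_bandwidth_tbps)

    # Placeholder accelerator: 100 TFLOP/s peak compute, 1 TB/s memory bandwidth.
    for intensity in (10, 50, 100, 200):             # FLOPs per byte
        print(intensity, attainable_tflops(intensity, 100.0, 1.0), "TFLOP/s")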

Overview of platform-aware NAS on TPUs/GPUs, highlighting the search space and search objectives.

Advanced platform-aware NAS has an optimized search space containing a set of complementary techniques to holistically improve parallelism for ML model execution on TPUs and GPUs:

  1. It uses specialized tensor reshaping techniques to maximize the parallelism in the MXUs / TensorCores.
  2. It dynamically selects different activation functions depending on matrix operation types to ensure overlapping of vector and matrix/tensor processing.
  3. It employs hybrid convolutions and a novel fusion strategy to strike a balance between total compute and arithmetic intensity to ensure that computation and memory access happens in parallel and to reduce the contention on VPUs / CUDA cores.
  4. With latency-aware compound scaling (LACS), which uses hardware performance instead of FLOPs as the performance objective to search for model depth, width and resolutions, we ensure parallelism at all levels for the entire model family on the Pareto-front.
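
A common way to fold measured latency, rather than FLOPs, into the search objective is a weighted product of model quality and a latency ratio, in the spirit of earlier platform-aware NAS work such as MnasNet. The sketch below is generic; the exact objective used with LACS may differ.

    def nas_objective(accuracy, measured_latency_ms, target_latency_ms, w=-0.07):
        """Multi-objective NAS reward: trade accuracy against measured device latency.
        A negative exponent w penalizes models that are slower than the target."""
        return accuracy * (measured_latency_ms / target_latency_ms) ** w

    # A model 20% slower than the target needs correspondingly higher accuracy to
    # keep the same objective value.
    print(nas_objective(0.80, 120.0, 100.0))   # slower than target: lower objective
    print(nas_objective(0.80, 100.0, 100.0))   # on target: objective equals accuracy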

EfficientNet-X: Platform-Aware NAS-Optimized Computer Vision Models for TPUs and GPUs
Using this approach to platform-aware NAS, we have designed EfficientNet-X, an optimized computer vision model family for TPUs and GPUs. This family builds upon the EfficientNet architecture, which itself was originally designed by traditional multi-objective NAS without true hardware-awareness as the baseline. The resulting EfficientNet-X model family achieves an average speedup of ~1.5x–2x over EfficientNet on TPUv3 and GPUv100, respectively, with comparable accuracy.

In addition to the improved speeds, EfficientNet-X has shed light on the non-proportionality between FLOPs and true performance. Many think FLOPs are a good ML performance proxy (i.e., FLOPs and performance are proportional), but they are not. While FLOPs are a good performance proxy for simple hardware such as scalar machines, they can exhibit a margin of error of up to 400% on advanced matrix/tensor machines. For example, because of its hardware-friendly model architecture, EfficientNet-X requires ~2x more FLOPs than EfficientNet, but is ~2x faster on TPUs and GPUs.

EfficientNet-X family achieves 1.5x–2x speedup on average over the state-of-the-art EfficientNet family, with comparable accuracy on TPUv3 and GPUv100.

Self-Driving ML Model Performance on New Accelerator Hardware Platforms
Platform-aware NAS exposes the inner workings of the hardware and leverages these properties when designing hardware-optimized ML models. In a sense, the “platform-awareness” of the model is a “gene” that preserves knowledge of how to optimize performance for a hardware family, even on new generations, without the need to redesign the models. For example, TPUv4i delivers up to 3x higher peak performance (FLOPS) than its predecessor TPUv2, but EfficientNet performance only improves by 30% when migrating from TPUv2 to TPUv4i. In comparison, EfficientNet-X retains its platform-aware properties even on new hardware and achieves a 2.6x speedup when migrating from TPUv2 to TPUv4i, utilizing almost all of the 3x peak performance gain expected when upgrading between the two generations.

Hardware peak performance ratio of TPUv2 to TPUv4i and the geometric mean speedup of EfficientNet-X and EfficientNet families, respectively, when migrating from TPUv2 to TPUv4i.

Conclusion and Future Work
We demonstrate how to improve the capabilities of platform-aware NAS for datacenter ML accelerators, especially TPUs and GPUs. Both platform-aware NAS and the EfficientNet-X model family have been deployed in production and materialize up to ~40% efficiency gains and significant quality improvements for various internal computer vision projects across Google. Additionally, because of its deep understanding of accelerator hardware architecture, platform-aware NAS was able to identify critical performance bottlenecks on TPUv2-v4i architectures and has enabled design enhancements to future TPUs with significant potential performance uplift. As next steps, we are working on expanding platform-aware NAS’s capabilities to the ML hardware and model design beyond computer vision.

Acknowledgements
Special thanks to our co-authors: Mingxing Tan, Ruoming Pang, Andrew Li, Liqun Cheng, Quoc Le. We also thank many collaborators including Jeff Dean, David Patterson, Shengqi Zhu, Yun Ni, Gang Wu, Tao Chen, Xin Li, Yuan Qi, Amit Sabne, Shahab Kamali, and many others from the broad Google research and engineering teams who helped on the research and the subsequent broad production deployment of platform-aware NAS.



source http://ai.googleblog.com/2022/02/unlocking-full-potential-of-datacenter.html

Saturday, 5 February 2022

An explorer in the sprawling universe of possible chemical combinations

The direct conversion of methane gas to liquid methanol at the site where it is extracted from the Earth holds enormous potential for addressing a number of significant environmental problems. Developing a catalyst for that conversion has been a critical focus for Associate Professor Heather Kulik and the lab she directs at MIT.

As important as that research is, however, it is just one example of the innumerable possibilities of Kulik’s work. Ultimately, her focus is far broader, the scope of her exploration infinitely more vast.

“All of our research is dedicated toward the same practical goal,” she says. “Namely, we aim to be able to predict and understand using computational tools why catalysts or materials behave the way they do so that we can overcome limitations in present understanding or existing materials.”

Simply put, Kulik wants to apply novel simulation and machine-learning technologies she and her lab have developed to rapidly investigate the sprawling world of possible chemical combinations. In the process, the team is mapping out how chemical structures relate to chemical properties, in order to create new materials tailored to particular applications.

“Once you realize the sheer scale of how many materials we could or should be studying to solve outstanding problems, you realize the only way to make a dent is to do things at a larger and faster scale than has ever been done before,” Kulik says. “Thanks to both machine-learning models and heterogeneous computing that has accelerated first-principles modeling, we are now able to start asking and answering questions that we could never have addressed before.”

Despite Kulik’s many awards and consistent recognition for her research, the New Jersey native was not always destined to be a scientist. Her parents were not particularly interested in math and science and, although she was mathematically precocious and did arithmetic as a toddler and college-level classes in middle school, she pursued other interests into her teens, including creative writing, graphic design, art, and photography.

Majoring in chemical engineering at the Cooper Union, Kulik says she wanted to occupy her mind, do something useful, and “make an okay living.” Chemical engineering was one of the highest-paying professions for undergraduates, she says.

The first thing she remembers hearing about graduate school was from a teaching assistant in her undergraduate physics class, who explained that being in academia meant “not having a real job until you’re at least 30” and working long hours.

“I thought that sounded like a terrible idea!” Kulik says.

Luckily, some of her classroom experiences at the Cooper Union, as well as encouragement from her quantum mechanics professor, Robert Topper, led her toward research.

“While I wanted to be useful, I kept being drawn to these fundamental questions of how knowing where the atoms and electrons were located explained the world around us,” she says. “Ultimately, I obtained my PhD in computational materials science to become a scientist who works with electrons every day for that reason. Since what I do hardly ever feels like a chore, I now have a greater appreciation for the fact that this path allowed me to ‘not have a real job.’”

Kulik credits MIT professor of chemistry and biology Cathy Drennan, whom Kulik collaborated with during graduate school, with “helping me see past the short-term barriers that come up in academia” and “showing me what a career in science could look like.” She also mentions Nicola Marzari, her PhD advisor, then an associate professor in MIT’s Department of Materials Science and Engineering, and her postdoc advisor at Stanford University, Todd Martinez, “who gave me a glimpse of what an independent career might look like.”

Kulik works hard to pass on her ethics and her ideas about work-life balance to students in her lab, and she teaches them to rely on each other, referring to the group as a “tight-knit community all with the same goals.” Twice a month, she holds meetings at which she encourages students to share how they have come up with solutions when working through research problems. “We can each see and learn from different problem-solving strategies others in the group have tried and help each other out along the way.”

She also encourages a light atmosphere. The lab’s web page says its members “embrace very #random (but probably fairly uncool) jokes in our Slack channels. We are computational researchers after all!”

“We like to keep it lighthearted,” Kulik says.

Nonetheless, Kulik and her lab have achieved major breakthroughs, including changing the approach to computational chemistry to make the way multiscale simulations are set up more systematic, while exponentially accelerating the process of materials discovery. Over the years, the lab has developed and honed an open-source code called molSimplify, which researchers can use to build and simulate new compounds. Combined with machine-learning models, the automated method enabled by the software has led to “structure-property maps” that explain why materials behave as they do, in a more comprehensive manner than was ever before possible.

For her efforts, Kulik has won grants from the MIT Energy Initiative, a Burroughs Wellcome Fund Career Award at the Scientific Interface, the American Chemical Society OpenEye Outstanding Junior Faculty Award, an Office of Naval Research Young Investigator Award, a DARPA Young Faculty Award and Director's Fellowship, the AAAS Marion Milligan Mason Award, the Physical Chemistry B Lectureship, and a CAREER award from the National Science Foundation, among others. This year, she was named a Sloan Research Fellow and was granted tenure.

When not hard at work on her next accomplishment, Kulik enjoys listening to music and taking walks around Cambridge and Boston, where she lives in the Beacon Hill neighborhood with her partner, who was a fellow graduate student at MIT.

Each year for the past three to four years, Kulik has spent at least two weeks on a wintertime vacation in a sunny climate.

“I reflect on what I’ve been doing at work as well as what my priorities might be both in life and in work in the upcoming year,” she says. “This helps to inform any decisions I make about how to prioritize my time and efforts each year and helps me to make sure I’ve put everything in perspective.”



source https://news.mit.edu/2022/heather-kulik-chemical-materials-0206
