robocrunch
Aran Komatsuzaki
@arankomatsuzaki
ML PhD @ GaTech
Tweets by Aran Komatsuzaki
Models will be able to produce satisfying outputs for a given reasonable prompt as robustly as humans do. Being able to deal with various prompts as flexibly as humans do is a part of general intelligence and is unavoidable. https://t.co/HnlcPiM3nu
Shared by Aran Komatsuzaki at 5/13/2022
Reducing Activation Recomputation in Large Transformer Models Achieves a Model FLOPS Utilization of 54.2% (baseline: 42.1%) when training a 530B parameter GPT-3 style model on 2240 NVIDIA A100 GPUs. https://t.co/zt6LPIthTj
Shared by Aran Komatsuzaki at 5/12/2022
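For reference, model FLOPs utilization (MFU), the metric quoted above, is the ratio of the FLOP/s a training run actually spends on the model to the hardware's peak FLOP/s. A minimal sketch of that bookkeeping, assuming the common ~6·N FLOPs-per-token rule of thumb for dense transformers; the throughput number below is a made-up placeholder, not the paper's measurement:

```python
def model_flops_utilization(n_params, tokens_per_sec, n_gpus, peak_flops_per_gpu):
    """Rough MFU estimate: achieved model FLOP/s divided by peak hardware FLOP/s.

    Assumes training a dense transformer costs ~6 FLOPs per parameter per token
    (forward + backward), the usual back-of-the-envelope approximation.
    """
    achieved_flops_per_sec = 6.0 * n_params * tokens_per_sec
    peak_flops_per_sec = n_gpus * peak_flops_per_gpu
    return achieved_flops_per_sec / peak_flops_per_sec

# Hypothetical throughput, only to show the shape of the calculation.
print(model_flops_utilization(
    n_params=530e9,             # 530B-parameter model
    tokens_per_sec=120_000,     # placeholder aggregate training throughput
    n_gpus=2240,
    peak_flops_per_gpu=312e12,  # A100 BF16 peak
))  # ~0.55
```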
Unifying Language Learning Paradigms - Achieves SOTA on 50 supervised NLP tasks spanning NLG, NLU, classification, QA, etc. - Outperforms 175B GPT-3 on zero-shot SuperGLUE - Releases the 20B model repo: https://t.co/y21KS5YEMa abs: https://t.co/sMOvFblsbQ
Shared by Aran Komatsuzaki at 5/12/2022
Improving In-Context Few-Shot Learning via Self-Supervised Training Adapts a pretrained GPT-like model w/ SSL to improve its few-shot performance. https://t.co/wENzVDDcro
Shared by Aran Komatsuzaki at 5/5/2022
CoCa: Contrastive Captioners are Image-Text Foundation Models Obtains 86.3% zero-shot top-1 acc. and new SotA 91.0% top-1 acc. on ImageNet with a finetuned encoder by training jointly with contrastive loss and captioning loss. https://t.co/oQ2F5PA9rs
Shared by Aran Komatsuzaki at 5/4/2022
When we try to solve a question, it often helps to read the question several times. That being said, would it improve the performance of zero-shot GPT-3 if we concatenated multiple copies of a given question? Has anyone seen any relevant results?
Shared by Aran Komatsuzaki at 5/4/2022
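A trivial way to test the idea in the tweet above is to tile the question before querying the model; everything here is a toy sketch, and `query_model` is a hypothetical stand-in for whatever completion API one uses:

```python
def repeated_question_prompt(question: str, k: int = 3) -> str:
    """Build a zero-shot prompt containing k copies of the same question."""
    return "\n\n".join([question] * k) + "\n\nAnswer:"

prompt = repeated_question_prompt("What is the capital of France?", k=3)
print(prompt)
# The experiment would then compare accuracy of query_model(prompt) for k=1 vs. k>1,
# where query_model is a hypothetical wrapper around the LM, not a real API.
```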
OPT: Open Pre-trained Transformer Language Models Presents OPT, a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, which performs comparably to GPT-3 while requiring only 1/7th the carbon footprint to develop. https://t.co/exJUr4eNXD
Shared by Aran Komatsuzaki at 5/3/2022
Scalable Training of Language Models using JAX pjit and TPUv4 Explores challenges and design decisions associated with developing a scalable training framework on pjit and TPUv4. https://t.co/sh10fgyOW1
Shared by Aran Komatsuzaki at 4/14/2022
What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization? Presents a large-scale evaluation of the effect of modeling choices, pretraining objectives, and multi-task prompt tuning on zero-shot generalization. abs: https://t.co/WZYNowcK0d repo: https://t.co/cBBdv9JQOm
Shared by Aran Komatsuzaki at 4/12/2022
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback Finds that finetuning LMs with preference modeling and RL from human feedback (RLHF) improves performance on almost all NLP evaluations. https://t.co/XdbSeud8ta
Shared by Aran Komatsuzaki at 4/12/2022
The Carbon Footprint of Machine Learning Training Will Plateau, Then Shrink Google suggests four best practices that can reduce ML training energy by up to 100x and CO2 emissions up to 1000x. https://t.co/IYy9SaWw8M
Shared by Aran Komatsuzaki at 4/12/2022
Video Diffusion Models Achieves SotA on an unconditional video generation benchmark and presents the first results on a large text-conditioned video generation task by extending the standard image diffusion architecture to videos. proj: https://t.co/jQ7L7F8yGO abs: https://t.co/Wd2O6FGAVW
Shared by Aran Komatsuzaki at 4/7/2022
KNN-Diffusion: Image Generation via Large-Scale Retrieval Achieves SotA in human evaluations and outperforms GLIDE by using retriever + diffusion. https://t.co/WSxzbmqjNx
Shared by Aran Komatsuzaki at 4/7/2022
Jump-Start Reinforcement Learning Shows via experiments that JSRL, by employing a guide-policy and an exploration-policy, significantly outperforms existing imitation and RL algorithms, especially in the small-data regime. proj: https://t.co/GXfFUxS3h0 abs: https://t.co/88adVTHa2L
Shared by Aran Komatsuzaki at 4/6/2022
Can language models learn from explanations in context? Finds that explanations can support the in-context learning abilities of Gopher on challenging tasks by annotating some tasks from BIG-bench with explanations of answers. https://t.co/Elzg4qcTAa
Shared by Aran Komatsuzaki at 4/6/2022
Idea 5: I measured the compute savings from these methods, which were significant. Most models used too many epochs and too small a model size, so I concluded it'd save >>10x compute and would be applicable to many other unsupervised models, not just LMs. Obviously, this was the case. (6/N)
Shared by Aran Komatsuzaki at 4/5/2022
I'm so thankful that 15k people are following me 🥰 Now that I have a voice, let me talk about my pretty much overlooked paper on compute-optimal training, released in 2019, which proposed some scaling ideas before all the OAI & Google papers 👇 (1/N) https://t.co/2SRUPxKAJI
Shared by Aran Komatsuzaki at 4/5/2022
MultiMAE: Multi-modal Multi-task Masked Autoencoders proj: https://t.co/NJVkZfDihX abs: https://t.co/vvehghYSER repo: https://t.co/USD3DtGFsk
Shared by Aran Komatsuzaki at 4/5/2022
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language Socratic Models can generate captions for Internet images and are competitive with SotA on zero-shot video-to-text retrieval. proj: https://t.co/CqHxnJaGjz abs: https://t.co/JWGCIqX1Yn
Shared by Aran Komatsuzaki at 4/3/2022
Transformer Language Models without Positional Encodings Still Learn Positional Information Finds that transformer LMs without positional encodings are still competitive with the baseline, probably because the visible context length tells the model the position of the current token. https://t.co/DPDE8Ukk7A
Shared by Aran Komatsuzaki at 4/1/2022
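For intuition about the finding above: a decoder-only model's causal mask already breaks permutation symmetry, so the number of visible tokens carries positional information even when no positional encoding is added. A minimal single-head causal self-attention in NumPy with no positional encoding anywhere (illustrative only, not the paper's model):

```python
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head causal self-attention; position enters only via the mask.

    x: (seq_len, d_model) token embeddings with no positional encoding added.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    future = np.triu(np.ones_like(scores, dtype=bool), 1)  # positions j > i
    scores = np.where(future, -np.inf, scores)              # hide future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(causal_self_attention(x, w_q, w_k, w_v).shape)  # (5, 8)
```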
Imitate and Repurpose: Learning Reusable Robot Movement Skills From Human and Animal Behaviors proj: https://t.co/Ded2SF6hTj abs: https://t.co/XBnKqIwG5T video: https://t.co/WNpWz4boOR By DeepMind
Shared by Aran Komatsuzaki at 4/1/2022
Equivariant Diffusion for Molecule Generation in 3D Significantly outperforms previous 3D molecular generative methods regarding the quality of generated samples and efficiency at training time. https://t.co/GXIxslJf6B
Shared by Aran Komatsuzaki at 4/1/2022
Training Compute-Optimal Large Language Models Trains Chinchilla, which uses the same compute budget as Gopher but with 70B parameters and 4x more data. It significantly outperforms Gopher, e.g. by >7% on MMLU. https://t.co/MZzHQRWFVB
Shared by Aran Komatsuzaki at 3/30/2022
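A back-of-the-envelope version of the takeaway, assuming the usual C ≈ 6·N·D cost approximation and the roughly 20-tokens-per-parameter ratio implied by Chinchilla (70B parameters, ~1.4T tokens). This is a rule-of-thumb sketch, not the paper's fitted scaling law:

```python
def compute_optimal_size(flops_budget, tokens_per_param=20.0):
    """Split a budget C ~= 6 * N * D under the heuristic D ~= tokens_per_param * N."""
    n_params = (flops_budget / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Roughly Gopher's budget: 280B parameters trained on 300B tokens.
n, d = compute_optimal_size(6 * 280e9 * 300e9)
print(f"~{n / 1e9:.0f}B params on ~{d / 1e12:.1f}T tokens")  # ~65B params on ~1.3T tokens
```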
Reinforcement Learning with Action-Free Pre-Training from Videos Significantly improves both final performance and sample-efficiency of vision-based RL in a variety of manipulation and locomotion tasks. https://t.co/u1MMeLwurG
Shared by Aran Komatsuzaki at 3/29/2022
BDDM: Bilateral Denoising Diffusion Models for Fast and High-Quality Speech Synthesis BDDMs produce SotA-quality samples indistinguishable from human speech with only seven sampling steps (28.6x faster than DiffWave). abs: https://t.co/V6CAOYCdgc repo: https://t.co/DAhfARBI9A
Shared by Aran Komatsuzaki at 3/27/2022
A Conversational Paradigm for Program Synthesis Trains up to a 16B-parameter model (named CODEGEN) for multi-turn conversational program synthesis using TPU-v4, which outperforms OpenAI's Codex on the HumanEval benchmark. https://t.co/814byWw7W9
Shared by Aran Komatsuzaki at 3/27/2022
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training Extends MAE to video for efficient video SSL. Finds that an extremely high masking ratio (~90%) still yields favorable performance for VideoMAE. https://t.co/TE1M7GJ8hf
Shared by Aran Komatsuzaki at 3/25/2022
Revisiting Multi-Scale Feature Fusion for Semantic Segmentation ESeg-Lite-L runs at 79 FPS and achieves 80.1% mIoU, largely closing the gap between real-time and high performance segmentation models by leveraging BiFPN to fuse the multi-scale features. https://t.co/ma8B2nKKPq
Shared by Aran Komatsuzaki at 3/25/2022
Pathways: Asynchronous Distributed Dataflow for ML Presents Pathways, a new large-scale orchestration layer for accelerators, explicitly designed to enable exploration of new systems and ML research ideas. https://t.co/ZnJgxO4dTI
Shared by Aran Komatsuzaki at 3/24/2022
R3M: A Universal Visual Representation for Robot Manipulation Studies how visual representations pre-trained on diverse human video data can enable data-efficient learning of downstream robotic manipulation tasks. proj: https://t.co/BfFSlRuiPo abs: https://t.co/wHueFPi2yI
Shared by Aran Komatsuzaki at 3/24/2022
learned_optimization: Meta-learning optimizers and more with JAX https://t.co/fRG0Y5t6Em
Shared by Aran Komatsuzaki at 3/23/2022
Practical tradeoffs between memory, compute, and performance in learned optimizers Identifies and quantifies the design features governing the memory, compute, and performance trade-offs for many learned and hand-designed optimizers. https://t.co/A5CvFULvqn
Shared by Aran Komatsuzaki at 3/23/2022
XTREME-S: Evaluating Cross-lingual Speech Representations Introduces a new benchmark for cross-lingual speech representations, XTREME-S, which covers 102 languages from 10+ language families, 3 different domains and 4 task families. https://t.co/kfGkOvFWdr
Shared by Aran Komatsuzaki at 3/22/2022
Self-Consistency Improves Chain of Thought Reasoning in Language Models Finds that self-consistency yields significant accuracy improvements in a variety of datasets for arithmetic and commonsense reasoning benchmarks. https://t.co/FcI5drFBWK
Shared by Aran Komatsuzaki at 3/22/2022
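The self-consistency recipe above amounts to: sample several chain-of-thought completions at nonzero temperature, parse a final answer from each, and return the majority answer. A minimal sketch, where `sample_cot` and `extract_answer` are hypothetical stand-ins for the model call and the answer parser:

```python
from collections import Counter
import random

def self_consistent_answer(prompt, sample_cot, extract_answer, n_samples=10):
    """Sample n reasoning paths and return the most common final answer."""
    answers = [extract_answer(sample_cot(prompt)) for _ in range(n_samples)]
    best, votes = Counter(answers).most_common(1)[0]
    return best, votes / n_samples  # majority answer and its vote share

# Toy usage with a fake sampler that usually "reasons" its way to 7.
fake_cot = lambda p: f"...thinking... so the answer is {random.choice([7, 7, 7, 5])}"
print(self_consistent_answer("Q: 3 + 4 = ?", fake_cot, lambda r: r.split()[-1]))
```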
Transframer: Arbitrary Frame Prediction with Generative Models Transframer is the SotA on a variety of video generation benchmarks and can generate coherent 30-second videos from a single image without any explicit geometric information. https://t.co/WYlnF5rNHh
Shared by Aran Komatsuzaki at 3/18/2022
Long Document Summarization with Top-down and Bottom-up Inference Can summarize an entire book and achieve competitive performance using 0.27% parameters (464M vs. 175B) and much less training data, compared to a recent GPT-3-based model. https://t.co/UY9YYd5uJM
Shared by Aran Komatsuzaki at 3/16/2022
Efficient Language Modeling with Sparse all-MLP Sparse all-MLP improves LM PPL and obtains up to 2x improvement in training efficiency compared to Transformer-based MoEs as well as dense Transformers and all-MLPs. https://t.co/iPrVYsuskF
Shared by Aran Komatsuzaki at 3/15/2022
The (Un)Surprising Effectiveness of Pre-Trained Vision Models for Control Finds that pre-trained visual reps can be competitive or even better than ground-truth state representations to train control policies. proj: https://t.co/kPtwlhQRF1 abs: https://t.co/2JVWQL83zU
Shared by Aran Komatsuzaki at 3/8/2022
Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer By transferring from 40M parameters, µTransfer outperforms the 6.7B GPT-3, with tuning cost only 7% of total pretraining cost. abs: https://t.co/kYiuGDiUpE repo: https://t.co/TG4eZHErto
Shared by Aran Komatsuzaki at 3/8/2022
Kubric: A scalable dataset generator Presents Kubric, an open-source Python framework that interfaces with PyBullet and Blender to generate photo-realistic scenes, generating TBs of data. abs: https://t.co/GxiMsbE4RR repo: https://t.co/x5kIkJZnjT
Shared by Aran Komatsuzaki at 3/8/2022
Autoregressive Image Generation using Residual Quantization Residual quantization reduces the training and inference cost of high-resolution image generation models. https://t.co/vzCWYT2gKx
Shared by Aran Komatsuzaki at 3/7/2022
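The core loop of residual quantization described above: quantize a vector against a codebook, subtract the chosen code, and repeat on the residual for a fixed depth, so each feature vector becomes a short stack of code indices. A minimal NumPy sketch with a single shared codebook (the actual model learns the codebooks and an autoregressive prior on top):

```python
import numpy as np

def residual_quantize(x, codebook, depth=4):
    """Return `depth` code indices for x and the resulting reconstruction."""
    residual = x.astype(float).copy()
    recon = np.zeros_like(residual)
    codes = []
    for _ in range(depth):
        idx = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))  # nearest code
        codes.append(idx)
        recon += codebook[idx]
        residual -= codebook[idx]
    return codes, recon

rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 16))       # 256 codes of dimension 16
x = rng.normal(size=16)
codes, recon = residual_quantize(x, codebook)
print(codes, float(np.linalg.norm(x - recon)))  # error shrinks as depth grows
```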
Hierarchical Perceiver HiP retains the ability to process arbitrary modalities, but now at higher resolution and without any specialized preprocessing, improving over flat Perceivers in both efficiency and accuracy on ImageNet and other benchmarks. https://t.co/bNaULlju6S
Shared by Aran Komatsuzaki at 2/23/2022
It's Raw! Audio Generation with State-Space Models Achieves SotA perf on autoregressive unconditional waveform generation. proj: https://t.co/r51awtnuYP repo: https://t.co/FZdSZtn3ai abs: https://t.co/gVxUErF05P
Shared by Aran Komatsuzaki at 2/22/2022
Transformer Quality in Linear Time Proposes FLASH, which achieves training speedups of up to 4.8x on C4 for masked language modeling. https://t.co/RL1EcTuFRi
Shared by Aran Komatsuzaki at 2/22/2022
Retrieval-Augmented Reinforcement Learning "On Atari, we show that retrieval-augmented R2D2 learns significantly faster than the baseline R2D2 agent and achieves higher scores." https://t.co/5bpI9dMvKq
Shared by Aran Komatsuzaki at 2/18/2022
Gradients without Backpropagation Presents a method to compute gradients based solely on the directional derivative that one can compute exactly and efficiently via the forward mode, entirely eliminating the need for backpropagation in gradient descent. https://t.co/2eNaSlGgAi
Shared by Aran Komatsuzaki at 2/18/2022
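The estimator above works by sampling a random direction v, computing the directional derivative ∇f(θ)·v with forward-mode AD, and using (∇f(θ)·v)·v as an unbiased gradient estimate. A dependency-light sketch on a toy quadratic; note that a central finite difference stands in here for the exact forward-mode JVP the paper uses:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_gradient_step(f, theta, lr=0.01, eps=1e-6):
    """One descent step using the forward-gradient estimate (grad(f).v) * v."""
    v = rng.normal(size=theta.shape)                                   # random tangent
    dir_deriv = (f(theta + eps * v) - f(theta - eps * v)) / (2 * eps)  # ~ grad(f).v
    return theta - lr * dir_deriv * v                                  # unbiased gradient estimate

f = lambda t: np.sum(t ** 2)   # toy loss with true gradient 2t
theta = np.ones(5)
for _ in range(500):
    theta = forward_gradient_step(f, theta)
print(np.round(theta, 3))      # shrinks toward 0 without any backward pass
```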
General-purpose, long-context autoregressive modeling with Perceiver AR Obtains SotA likelihood on long-sequence benchmarks, including 64×64 ImageNet and PG-19 books. https://t.co/2fp2SHp2Qx
Shared by Aran Komatsuzaki at 2/17/2022
Predictability and Surprise in Large Generative Models Highlights a counterintuitive property of LLMs and discusses the policy implications of this property. https://t.co/9SKyOwsVff
Shared by Aran Komatsuzaki at 2/17/2022
Singularity: Planet-Scale, Preemptible, Elastic Scheduling of AI Workloads Presents Singularity, Microsoft’s globally distributed scheduling service for highly-efficient and reliable execution of deep learning training and inference workloads. https://t.co/LHqAaaGM3L
Shared by Aran Komatsuzaki at 2/17/2022
Transformer Memory as a Differentiable Search Index Shows that information retrieval can be done with a single Transformer, in which all information about the corpus is encoded in the parameters of the model. https://t.co/lrYH9xsUHA
Shared by Aran Komatsuzaki at 2/16/2022
Quantifying Memorization Across Neural Language Models Finds that memorization in LMs is more prevalent than previously believed and will likely get worse as models continue to scale, at least without active mitigations. https://t.co/iLZn5e85SH
Shared by Aran Komatsuzaki at 2/16/2022
On a related note, I'm working on a contrastive generative LM using a method similar to OpenAI's embedding model, by letting it predict both the future sequence and the next token contrastively. Predicting the next token with MLE and a decoder may not be enough to produce a nice representation space.
Shared by Aran Komatsuzaki at 2/15/2022
A Contrastive Framework for Neural Text Generation Proposes a contrastive objective and contrastive search to calibrate the model's representation space and encourage diversity while maintaining coherence, which outperforms SotA generation methods. https://t.co/qf8Lc9YlsQ
Shared by Aran Komatsuzaki at 2/15/2022
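Contrastive search, as proposed above, scores each top-k candidate token by its model probability minus a degeneration penalty: its maximum similarity to the hidden states of the tokens already generated. A minimal scoring step over toy NumPy stand-ins for the LM's probabilities and hidden states:

```python
import numpy as np

def contrastive_search_step(probs, cand_h, ctx_h, alpha=0.6, k=5):
    """Pick the next token id among the top-k candidates.

    probs:  (vocab,) next-token probabilities
    cand_h: (vocab, d) hidden state each candidate token would produce
    ctx_h:  (ctx_len, d) hidden states of the tokens generated so far
    """
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    best, best_score = None, -np.inf
    for v in np.argsort(probs)[-k:]:
        penalty = max(cos(cand_h[v], h) for h in ctx_h)    # degeneration penalty
        score = (1 - alpha) * probs[v] - alpha * penalty   # model confidence vs. diversity
        if score > best_score:
            best, best_score = int(v), score
    return best

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(50))                         # toy 50-token vocabulary
print(contrastive_search_step(probs, rng.normal(size=(50, 8)), rng.normal(size=(4, 8))))
```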
Deduplicating Training Data Mitigates Privacy Risks in Language Models Finds that seqs that are present in training data tend to be super-linearly (extremely) over-represented in generated seqs and proposes a method to mitigate this issue. https://t.co/R4NJjPbCuE
Shared by Aran Komatsuzaki at 2/15/2022
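A minimal sketch of document-level deduplication in the spirit of the paper above: hash overlapping character n-grams of each document and drop documents that overlap too much with what has already been kept (production pipelines use suffix arrays or MinHash; this only shows the idea):

```python
import hashlib

def dedup_documents(docs, n=50, max_overlap=0.5):
    """Keep documents whose n-gram overlap with earlier documents is <= max_overlap."""
    seen, kept = set(), []
    for doc in docs:
        grams = {hashlib.md5(doc[i:i + n].encode()).hexdigest()
                 for i in range(max(1, len(doc) - n + 1))}
        if len(grams & seen) / len(grams) <= max_overlap:
            kept.append(doc)
        seen |= grams
    return kept

docs = ["the quick brown fox jumps over the lazy dog " * 4,
        "the quick brown fox jumps over the lazy dog " * 4,       # near-exact duplicate
        "a completely different note about diffusion models " * 4]
print(len(dedup_documents(docs, n=20)))  # 2
```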
With a 2T token database, Retrieval-Enhanced Transformer (Retro) obtains comparable ppl to GPT-3 on the Pile, despite using 25x fewer params. Retro can generate more factually correct texts than the baseline w/o hallucinations. https://t.co/aQGOBmj1kZ
Shared by Aran Komatsuzaki at 12/8/2021
Gopher considerably outperforms GPT-3, even though the only notable difference in design seems to be that they spent more effort on dataset creation and improvement. They spent 2x more compute, but that doesn't lead to this much improvement. Is there any other trick I'm missing?
Shared by Aran Komatsuzaki at 12/8/2021
Some solid first (co-)authors: Noam Shazeer - Transformer, MoE, T5 Alec Radford - DCGAN, GPT-1/2, CLIP Kaiming He - ResNet, Mask R-CNN, MoCo, MAE Aäron van den Oord - (Parallel) WaveNet, VQ-VAE, PixelRNN
Shared by Aran Komatsuzaki at 12/1/2021
LAFITE: Towards Language-Free Training for Text-to-Image Generation Obtains competitive results in zero-shot text-to-image generation on MS-COCO, yet with only around 1% of the model size and language-free training data size relative to DALL-E. https://t.co/mNtwBzzDR0
Shared by Aran Komatsuzaki at 11/30/2021
PolyViT: Co-training Vision Transformers on Images, Videos and Audio Co-training PolyViT on multiple modalities and tasks leads to a model that is even more parameter-efficient, and learns representations that generalize across multiple domains. https://t.co/OvWXAQWEWI
Shared by Aran Komatsuzaki at 11/29/2021
It's fascinating to me how localized the information human eyes process at a given moment is (via the fovea), compared with current vision models. I hope more papers exploring this will come up.
Shared by Aran Komatsuzaki at 11/26/2021
On a related note, the (phenomenal) EfficientZero paper took 5 months until release, like many other NeurIPS submissions, which was a loss. We should discuss, as a community, ways to make it possible, and to incentivize researchers, to rapidly share papers/code/datasets/models from both academia and industry.
Shared by Aran Komatsuzaki at 11/25/2021
NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion Achieves SotA results on text-to-image generation, text-to-video generation, video prediction, etc. Outperforms DALL-E in text2image. abs: https://t.co/LgrUVjCAEB repo: https://t.co/xlLetCJi1P
Shared by Aran Komatsuzaki at 11/25/2021
Florence: A New Foundation Model for Computer Vision Florence achieves new SotA in the majority of 44 representative CV benchmarks, including classification, retrieval, object detection, VQA, image captioning, video retrieval and action recognition. https://t.co/Afvkc9pp5N
Shared by Aran Komatsuzaki at 11/23/2021
N-grammer: Augmenting Transformers with latent n-grams Substantially outperforms Transformer and Primer on C4 LM by augmenting the model with n-grams constructed from a discrete latent representation of the text sequence. https://t.co/q0lkrS4mJU
Shared by Aran Komatsuzaki at 11/17/2021
LiT: Zero-Shot Transfer with Locked-image Text Tuning LiT-tuning achieves 84.5% zero-shot transfer accuracy on ImageNet, halving the gap between previous best zero-shot transfer results (CLIP, ALIGN) and supervised fine-tuning results.
Shared by Aran Komatsuzaki at 11/16/2021
Amazon SageMaker Model Parallelism: A General and Flexible Framework for Large Model Training PyTorch-based model parallelism library with more flexibility and user-friendly design than DeepSpeed and competitive performance on GPT-3, RoBERTa, BERT. https://t.co/v9zXgxKVpl
Shared by Aran Komatsuzaki at 11/12/2021
The Emergence of Objectness: Learning Zero-Shot Segmentation from Videos The first truly end-to-end zero-shot object segmentation from videos. Outperforms prevalent image-based contrastive learning methods without augmentation engineering. https://t.co/7TU1UwEYyE
Shared by Aran Komatsuzaki at 11/12/2021
Varuna: Scalable, Low-cost Training of Massive Deep Learning Models Demonstrates that Varuna can train a massive (~200B) model on 5x cheaper spot instances while maintaining high training throughput. abs: https://t.co/w7TXgVZmMi repo: https://t.co/0YCMqyq9Uy
Shared by Aran Komatsuzaki at 11/9/2021
Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm Outperforms CLIP while using 7x less data by adding more self-supervision techniques. abs: https://t.co/a3v1TSfZlt repo: https://t.co/aUKzRat0UR
Shared by Aran Komatsuzaki at 11/4/2021
Projected GANs Converge Faster Projected GANs match the previously lowest FIDs up to 40x faster, cutting the wall-clock time from 5 days to less than 3 hours given the same computational resources! https://t.co/af2nloV8zJ
Shared by Aran Komatsuzaki at 11/2/2021
MetaICL: Learning to Learn In Context Presents MetaICL, which tunes a pretrained LM to do in-context learning on many training tasks. MetaICL approaches the perf of models fully finetuned on the target task, and outperforms models with 8x params. https://t.co/MMI7sjuIRK
Shared by Aran Komatsuzaki at 11/1/2021
Sharpness-Aware Minimization Improves Language Model Generalization SAM substantially improves performance on SuperGLUE, GLUE, Web Questions, Natural Questions, Trivia QA, and TyDiQA by encouraging convergence to flatter minima with minimal overhead. https://t.co/iQuFEq2Ne3
Shared by Aran Komatsuzaki at 10/19/2021
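Sharpness-aware minimization, as applied above, is a two-step update: move the weights to (approximately) the worst point within a ρ-ball using the current gradient, then apply the gradient computed at that perturbed point. A NumPy sketch on a toy quadratic; real usage wraps an optimizer and autograd, which this deliberately omits:

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One SAM update: ascend to the rho-ball boundary, then descend from there."""
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)   # approximate worst-case perturbation
    g_sharp = grad_fn(w + eps)                    # gradient at the perturbed weights
    return w - lr * g_sharp

grad_fn = lambda w: 2 * w      # gradient of the toy loss ||w||^2
w = np.ones(4)
for _ in range(100):
    w = sam_step(w, grad_fn)
print(np.round(w, 3))          # shrinks toward 0 (up to the rho-sized perturbation)
```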
Internet-Augmented Dialogue Generation Search-query based access of the internet in conversation provides superior performance compared to existing approaches that either use no augmentation or FAISS-based retrieval. https://t.co/cD5lAm2DLk
Shared by Aran Komatsuzaki at 7/18/2021
Variational Diffusion Models Obtains SotA likelihoods on image density estimation benchmarks, outperforming autoregressive models that have dominated these benchmarks for many years, with often significantly faster optimization. https://t.co/wjHVJuVatj
Shared by Aran Komatsuzaki at 7/1/2021
Charformer: Fast Character Transformers via Gradient-based Subword Tokenization - Learns a subword tokenization end-to-end as part of the model - Outperforms byte-level baselines on GLUE etc. while generally performing on par with subword models https://t.co/bjRtaBmWAp
Shared by Aran Komatsuzaki at 6/24/2021
Distributed Deep Learning In Open Collaborations Using volunteer compute pooled from many small groups, they successfully train SwAV and ALBERT and achieve performance comparable to traditional setups at a fraction of the cost. https://t.co/QFc5TOdRBy
Shared by Aran Komatsuzaki at 6/20/2021
Efficient Self-supervised Vision Transformers for Representation Learning EsViT achieves 81.3% top-1 on the ImageNet linear probe evaluation, outperforming prior art with around an order of magnitude higher throughput. https://t.co/At20ak6pMR
Shared by Aran Komatsuzaki at 6/20/2021
I don't have any insider info about OpenAI, but in my dream they were releasing 10B Guided Diffusion DALL-E in July. So, I started working on replicating this since a few weeks ago. We made progress on data collection, but getting GPUs is the toughest part lol
Shared by Aran Komatsuzaki at 6/14/2021
Hash Layers For Large Sparse Models Modifies FFN to hash to different sets of weights. Either outperforms or is competitive with MoE methods such as Switch Transformers, while requiring no routing parameters or extra terms in the objective function. https://t.co/O2oirI0iK7
Shared by Aran Komatsuzaki at 6/8/2021
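The routing idea in Hash Layers, paraphrased: instead of a learned router, each token goes to one of E expert FFNs chosen by a fixed hash of its token id, so there are no routing parameters and no auxiliary load-balancing loss. A minimal NumPy sketch of that routing step with toy sizes, random expert weights, and a trivial modulo hash:

```python
import numpy as np

def hash_routed_ffn(token_ids, x, experts):
    """Route token i to expert (token_id mod E), a trivial fixed hash, and apply its FFN."""
    out = np.empty_like(x)
    for i, tok in enumerate(token_ids):
        w1, w2 = experts[int(tok) % len(experts)]       # fixed, parameter-free routing
        out[i] = np.maximum(x[i] @ w1, 0.0) @ w2        # that expert's ordinary ReLU FFN
    return out

rng = np.random.default_rng(0)
d_model, d_ff, n_experts = 8, 16, 4
experts = [(rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model)))
           for _ in range(n_experts)]
token_ids = np.array([5, 17, 5, 42])
x = rng.normal(size=(len(token_ids), d_model))
print(hash_routed_ffn(token_ids, x, experts).shape)     # (4, 8)
```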
Not All Images are Worth 16x16 Words: Dynamic Vision Transformers with Adaptive Sequence Length Observes that there exist many “easy” images which can be predicted with 4x4 tokens, while only a small fraction of “hard” ones need a finer representation. https://t.co/r4pZeYQVjb
Shared by Aran Komatsuzaki at 6/1/2021
ByT5: Towards a token-free future with pre-trained byte-to-byte models Shows that byte-level models are competitive with their token-level counterparts and more robust to noise. abs: https://t.co/Nt6mgTIi29 code: https://t.co/cRWQfFDBFv
Shared by Aran Komatsuzaki at 5/30/2021
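ByT5's token-free input side is, at its simplest, raw UTF-8 bytes as the vocabulary. A tiny sketch of byte-level encoding and decoding (ByT5 itself additionally reserves a few special ids, which this toy version ignores):

```python
def byte_encode(text: str) -> list:
    """Map text to its UTF-8 byte values: a fixed 256-symbol 'vocabulary'."""
    return list(text.encode("utf-8"))

def byte_decode(ids: list) -> str:
    return bytes(ids).decode("utf-8", errors="replace")

ids = byte_encode("token-free models are robust to nöise")
print(len(ids), ids[:8], byte_decode(ids))   # note 'ö' costs two byte ids
```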
I'm trying to build this in JAX, which is essentially translating improved DDPM into JAX. I'm also trying to train it on a larger dataset. I've already built VQ/VD/DC-VAE in JAX in the link below, but I've decided to switch to diffusion models: https://t.co/E5sRbJC3O5
Shared by Aran Komatsuzaki at 4/25/2021