robocrunch
Andrej Karpathy
@karpathy
Director of AI at Tesla, leading the Autopilot Vision team. Previously OpenAI, CS231n, PhD @ Stanford. I like to train large deep neural nets 🧠🤖💥
Tweets by Andrej Karpathy
Nice and especially appreciate the release of the accompanying logbooks, detailing the struggles of training transformers at scale https://t.co/GErHySLdCJ
Shared by Andrej Karpathy at 5/4/2022
There are two primary API abstractions through which humans interact with the world: 1) the physical API of a humanoid body, and 2) the digital API of keyboard+mouse (or touch) over UIs. Gradual automation of both is of high economic value and of interest towards AGI. All the best to the Adept team!
Shared by Andrej Karpathy at 4/26/2022
Someone should try to inspect the model output conditioned on those high-loss batches just to make sure it does not have structure. That it doesn't plead or make demands, etc.
Shared by Andrej Karpathy at 4/13/2022
Haha ty @ykilcher for hosting me on his excellent ML News series for a cameo while I was in Zurich, in the role of a random pedestrian who knows too much about deep learning :D
Shared by Andrej Karpathy at 4/12/2022
- "discontinuous improvement" from scaling alone observed on ~25% of BIG-Bench tasks 🥹 - bitwise determinism 🤓 - mysterious (data+model)-dependent loss spikes (signatures of consciousness🤔? jk) - chain-of-thought prompting + post-hoc calculator + few-shot can do quite well 🪄
Shared by Andrej Karpathy at 4/5/2022
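The last bullet above mentions chain-of-thought prompting; for concreteness, a few-shot chain-of-thought prompt has roughly this shape (a made-up toy example, not one from the paper):

# Few-shot chain-of-thought prompt: the worked example includes its reasoning,
# nudging the model to "think out loud" before giving an answer. (Toy illustration.)
prompt = """Q: A box has 4 red pens and 3 blue pens. Sara adds 5 more blue pens. How many pens are in the box?
A: There were 4 + 3 = 7 pens to start. Adding 5 more gives 7 + 5 = 12. The answer is 12.

Q: Tom reads 12 pages a day. How many pages does he read in a week?
A:"""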
New SOTA big language model, surpassing Chinchilla from just ~a week ago. My favorite demo is the joke explanations, which rival/surpass my own ability :). 540B Transformer on 780B tokens, roughly 4.3X the compute of Chinchilla. Data includes multilingual text and code. A few notes: https://t.co/THPTwJb1U6
Shared by Andrej Karpathy at 4/5/2022
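A rough check on the ~4.3X compute figure in the tweet above, using the standard C ≈ 6ND approximation for training compute (the approximation and Chinchilla's 70B-params/1.4T-tokens numbers are my own additions, not from the tweet):

\begin{aligned}
C &\approx 6ND \\
C_{\text{PaLM}} &\approx 6 \cdot 540{\times}10^{9} \cdot 780{\times}10^{9} \approx 2.5{\times}10^{24}\ \text{FLOPs} \\
C_{\text{Chinchilla}} &\approx 6 \cdot 70{\times}10^{9} \cdot 1.4{\times}10^{12} \approx 5.9{\times}10^{23}\ \text{FLOPs} \\
C_{\text{PaLM}} / C_{\text{Chinchilla}} &\approx 4.3
\end{aligned}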
STaR: Self-Taught Reasoner, Bootstrapping Reasoning With Reasoning https://t.co/SbDfRHlAgX Amusing: 1) describes an "offline tracker", or a kind of policy improvement operator, but for NLP 📈 2) the input space itself is used as a kind of hidden state for intermediate computations 🥲
Shared by Andrej Karpathy at 4/2/2022
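A minimal sketch of the bootstrapping loop as I read it (hypothetical helper functions passed in as arguments; a paraphrase of the idea, not the authors' code):

# STaR-style bootstrap: sample rationales with few-shot prompting, keep only the
# ones that land on the correct answer, fine-tune on those, then repeat.
def star_iteration(model, dataset, generate_rationale, finetune):
    kept = []
    for question, answer in dataset:
        rationale, predicted = generate_rationale(model, question)
        if predicted == answer:                       # filter by answer correctness
            kept.append((question, rationale, answer))
    return finetune(model, kept)                      # the "policy improvement" step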
We might, but constructed deliberately with the sole purpose of improving the log probability of a language model, instead of originating from some other original human concern before being repurposed as data for a language model as an afterthought.
Shared by Andrej Karpathy at 3/31/2022
In this paper the Mask RCNN is still bolted on for detection. Philosophically (and I'm guessing the authors might agree and are curious), it would be exciting to see a fully E2E approach win eventually: simply add another Transformer decoder that directly outputs the boxes.
Shared by Andrej Karpathy at 3/31/2022
Loving the philosophy of preserving the simple Transformer as a Universal (Neural) Computer, where the core architecture is not meddled with much. Domain knowledge is “factored out” and enters only through position encodings, sparsity masks, loss functions, data augmentations, etc.
Shared by Andrej Karpathy at 3/31/2022
“Exploring Plain Vision Transformer Backbones for Object Detection” https://t.co/E1POjnFmgZ Excellent read as usual from the FAIR team. Strong object detection results with only minor tweaks on the vanilla (ViT) Transformer backbone.
Shared by Andrej Karpathy at 3/31/2022
Seems likely we’ll have custom (and partially auto-generated) “textbooks” but for teaching language models, not humans, to help them “grok” concepts.
Shared by Andrej Karpathy at 3/30/2022
New (small!) language model Chinchilla (70B) outperforms much larger Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), MT-NLG (530B) https://t.co/yALvVcsTDW Important new LM scaling laws paper from DeepMind. Go smaller, train longer. Many misconfigurations likely continue to lurk.
Shared by Andrej Karpathy at 3/30/2022
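One way to remember "go smaller, train longer": the compute-optimal recipe in the paper works out to roughly ~20 training tokens per parameter (my rough summary of the result, not a statement from the tweet), which matches Chinchilla itself:

D_{\text{opt}} \approx 20\,N \quad\Rightarrow\quad 20 \cdot 70{\times}10^{9}\ \text{params} \approx 1.4{\times}10^{12}\ \text{tokens}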
(rant) "epochs" are a bug-inducing concept in neural net training and should be avoided in favor of number of iterations. Use of epochs silently functionally distorts training (as datasets change/grow) and decreases code portability (when one wishes to train on different dataset)
Shared by Andrej Karpathy at 3/28/2022
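A minimal sketch of the iteration-based bookkeeping the rant above argues for (generic PyTorch-style loop, purely illustrative and not tied to any particular codebase):

# Iteration-based training: fix the number of optimizer steps up front, stream
# batches indefinitely, and key all schedules/evals off the step counter rather
# than off "epochs".
import torch

def infinite_batches(loader):
    while True:                                  # re-iterate (and reshuffle) forever
        for batch in loader:
            yield batch

def train(model, dataset, max_iters=10_000, batch_size=64, eval_every=500):
    loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)
    batches = infinite_batches(loader)           # dataset size no longer leaks into the schedule
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    for step in range(max_iters):
        x, y = next(batches)
        loss = torch.nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step % eval_every == 0:
            print(f"step {step}/{max_iters}: loss {loss.item():.4f}")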
Taking some time off to rest & travel after almost 5 years at Tesla. Esp excited to get focused time to re-sharpen my technical edge and train some neural nets! Though I already miss all the robots and GPU/Dojo clusters, and am looking forward to having them at my fingertips again ❤️😅
Shared by Andrej Karpathy at 3/27/2022
Wanted to try training a neural net on GCP but my requests for GPU node quota keep getting instantly denied with no additional information ;(. I'm assuming other people out there have succeeded, though (?)...
Shared by Andrej Karpathy at 3/23/2022
I don’t think a regular person appreciates how insane it is that computers work. I propose we stare at each other mind-blown for about 1 hour/day, in small groups in circles around a chip on a pedestal, appreciating that we can coerce physics to process information like that.
Shared by Andrej Karpathy at 3/19/2022
FSD Beta 10.11 release notes. Fave item: "Upgraded modeling of lane geometry from dense rasters (“bag of points”) to an autoregressive decoder that directly predicts and connects “vector space” lanes point by point using a transformer neural network."
Shared by Andrej Karpathy at 3/14/2022
"This enables us to predict crossing lanes, allows computationally cheaper and less error-prone post-processing, and paves the way for predicting many other signals and their relationships jointly and end-to-end."
Shared by Andrej Karpathy at 3/14/2022
TLDR a GPT-like Transformer is now predicting the lanes and their connectivity. This "direct to vector space" framework allows predictions to be jointly coherent (due to sequential conditioning) and v easily used by planner (due to sparsity). Excellent work from the team!🪄
Shared by Andrej Karpathy at 3/14/2022
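As a generic illustration of the "point by point" autoregressive decoding pattern described above (this is not Tesla's code or architecture, just the general technique with a hypothetical decoder callable):

# Greedy autoregressive decoding with a transformer decoder: each step emits the
# next discretized point (or connectivity) token, conditioned on image features
# and on everything emitted so far.
import torch

@torch.no_grad()
def decode_sequence(decoder, image_features, start_token, end_token, max_len=256):
    tokens = [start_token]
    for _ in range(max_len):
        logits = decoder(torch.tensor([tokens]), image_features)  # (1, T, vocab)
        next_token = int(logits[0, -1].argmax())
        if next_token == end_token:
            break
        tokens.append(next_token)     # sequential conditioning keeps outputs jointly coherent
    return tokens[1:]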
Is simulation the dark horse of 99% of the training FLOPS in future "foundation models" of computer vision?
Shared by Andrej Karpathy at 2/17/2022
A fun & v feasible project idea for someone out there: bundle up face detection, speech recognition, GPT as the core "intelligence engine", text to speech, and face generative model to create a digital human you can talk to e.g. on webcam/phone (but it's just a "dressed up" GPT).
Shared by Andrej Karpathy at 2/16/2022
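A sketch of how the pieces of that project would chain together (every component function here is a hypothetical placeholder for an off-the-shelf model, not a real library call):

# One conversational turn of the "dressed up GPT" digital human.
def digital_human_turn(audio_in, history, transcribe, chat_reply,
                       synthesize_speech, render_talking_face):
    text_in = transcribe(audio_in)                        # speech recognition
    history.append({"role": "user", "content": text_in})
    text_out = chat_reply(history)                        # GPT as the core "intelligence engine"
    history.append({"role": "assistant", "content": text_out})
    audio_out = synthesize_speech(text_out)               # text to speech
    return render_talking_face(audio_out)                 # generative face model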
So even though I'm technically in vision, papers, people and ideas across all of AI are suddenly extremely relevant. Everyone is working with essentially the same model, so most improvements and ideas can "copy paste" rapidly across all of AI.
Shared by Andrej Karpathy at 12/8/2021
Even within areas (like vision), there used to be some differences in how you do classification, segmentation, detection, generation, but all of these are also being converted to the same framework. E.g. for detection take sequence of patches, output sequence of bounding boxes.
Shared by Andrej Karpathy at 12/8/2021
The ongoing consolidation in AI is incredible. Thread: ➡️ When I started ~a decade ago, vision, speech, natural language, reinforcement learning, etc. were completely separate; you couldn't read papers across areas - the approaches were completely different, often not even ML-based.
Shared by Andrej Karpathy at 12/8/2021
For deep learning friends: I've re-written arxiv-sanity to be smaller/sweeter/more scalable, to help tame the new paper barrage on arxiv: https://t.co/i8ZaNbjWdy
- ✍️ tag papers
- ⬆️ get svm+tfidf paper recommendations
- ✉️✨ new: get them via email!
Run locally or use my instance ht
Shared by Andrej Karpathy at 12/1/2021
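The svm+tfidf recommendation idea mentioned above in a nutshell (illustrative scikit-learn sketch of the technique, not the actual arxiv-sanity source):

# Rank papers by similarity to your tagged ones: tf-idf features over abstracts,
# a linear SVM with the tagged papers as positives, sort the rest by decision score.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def recommend(abstracts, tagged, top_k=10):
    X = TfidfVectorizer(max_features=20_000, stop_words="english").fit_transform(abstracts)
    y = [1 if i in tagged else 0 for i in range(len(abstracts))]
    scores = LinearSVC(C=0.1).fit(X, y).decision_function(X)
    ranked = sorted(range(len(abstracts)), key=lambda i: -scores[i])
    return [i for i in ranked if i not in tagged][:top_k]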
Great paper and thread!
- 😮 that the super simple MSE loss works vs. BEiT-style dVAE (multi-modal) cross-entropy
- <3 the efficiency of the asymmetric encoder/decoder
- 👏 detailed training recipes
- +1 v curious about dataset size scaling
- a bit of a lack of commentary on the test-time protocol
Shared by Andrej Karpathy at 11/13/2021
"Something is terribly wrong with architecture. Nearly everything being built is boring, joyless, and/or ugly, even though there is no reason it has to be." https://t.co/LMyznNmEJm 💯. Whenever I rant about this I am met with blank stares, so this is refreshing to stumble by
Shared by Andrej Karpathy at 11/1/2021
A ref of particular interest (I had missed it earlier) was "Reinforcement Learning as One Big Sequence Modeling Problem" https://t.co/HqPzA6efwQ , which adapts transformer language models but now for RL. Simply fit a transformer to [(s, a, r),...] sequences, use as world model 👌
Shared by Andrej Karpathy at 10/24/2021
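A minimal sketch of that setup, i.e. treating trajectories as plain token sequences (generic illustration of the idea, not the paper's code; the discretizer is a placeholder):

# Flatten (state, action, reward) triples into one token stream and train a
# language-model-style transformer on next-token prediction over it.
def trajectory_to_tokens(trajectory, discretize):
    tokens = []
    for state, action, reward in trajectory:
        tokens += discretize(state) + discretize(action) + discretize(reward)
    return tokens  # model learns p(token_{t+1} | tokens_{<=t}), i.e. acts as a world model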
The first time I was personally shook by this philosophy was when I saw the "Just tell the AI to be nice" meme on my Twitter, which is the same idea - GPT can be seen as a super multi-task policy (trained via supervised learning), and prompt engineering is the goal conditioning.
Shared by Andrej Karpathy at 10/24/2021
wrt consciousness I do suspect it can just emerge in large-enough models trained on hard-enough tasks. The idea that emergence of consciousness is just another "grokking" phenomenon was the inspiration for my earlier short story "Forward Pass" https://t.co/AGhQ8u5Nzp
Shared by Andrej Karpathy at 10/24/2021
Really excellent reading and pointers from @ericjang11, putting into words a new "Just Ask for Generalization" approach/philosophy to AI that the field has been slowly internalizing recently. Few more thoughts in thread ->
Shared by Andrej Karpathy at 10/24/2021
It seems like the story of the very poor throughput could use more fleshing out, with further hyperparameter tuning or optimization on the kernels.
Shared by Andrej Karpathy at 10/7/2021
Errr ok wow, I am shook by the new ConvMixer architecture https://t.co/crUMktQ0ig "the first model that achieves the elusive dual goals of 80%+ ImageNet top-1 accuracy while also fitting into a tweet" 😐
Shared by Andrej Karpathy at 10/6/2021
timm has become the SOTA destination for finding ImageNet SOTAs :), here bumping OG 2015 He et al. ResNet-50 75.3% -> 80.4% by modernizing the training recipe alone. Epochs++, LAMB, cosine LRD, weight_decay++, BCE, label smoothing, stochastic depth, RandAugment, MixUp, CutMix
Shared by Andrej Karpathy at 10/4/2021
Anyways, I haven't come across too much work on compilers for the "teams of people" computer architecture, but it could be interesting.
Shared by Andrej Karpathy at 9/28/2021
Various computational workloads exhibit different amounts of parallelism and are accordingly best scheduled on CPU or GPU. Same is true for human organizations/projects/tasks, but it seems rarely analyzed from that perspective. Compiling a project to run fast on people is hard :)
Shared by Andrej Karpathy at 9/28/2021
Amusing! Object detection cast naively into the language modeling framework + borrowing many of the tips & tricks.
- random object ordering seems fine ✅
- coords, class labels flattened into a single softmax 😂
- sequence augmentation is the most gnarly part, almost as yucky as nms 😬
Shared by Andrej Karpathy at 9/24/2021
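A minimal sketch of the "coords, class labels flattened into a single softmax" construction mentioned above (an illustration of the idea; the bin count and token layout are my assumptions, not the paper's exact scheme):

# Detection as language modeling: quantize each box corner into a coordinate
# token and append a class token, all drawn from one shared vocabulary.
def boxes_to_tokens(boxes, labels, n_bins=1000):
    tokens = []
    for (y1, x1, y2, x2), cls in zip(boxes, labels):   # coords normalized to [0, 1]
        tokens += [int(v * (n_bins - 1)) for v in (y1, x1, y2, x2)]  # 4 coordinate tokens
        tokens.append(n_bins + cls)       # class id offset into the same vocabulary
    return tokens                         # one softmax over n_bins + n_classes (+ special tokens)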
Deep Learning is a form of human-assisted but mostly constraint-driven software development. It works because a particular smooth relaxation of program space allows a surprisingly efficient and effective local search. Something like that, my favorite definition.
Shared by Andrej Karpathy at 9/19/2021
Badly tuned LR decay schedules are an excellent way to silently shoot yourself in the foot. Models can often look like they are converging but it's just LR getting too low too fast. FixedLR (+optional warmup) with 1 manual decay of 10X on plateau is a safe strong baseline.
Shared by Andrej Karpathy at 8/27/2021
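The baseline from the tweet above as a tiny function (generic illustration of the schedule, not code from any particular framework):

# Fixed LR + optional linear warmup + one manual 10X decay once validation loss
# plateaus. (Plug the returned value into optimizer.param_groups each step.)
def get_lr(step, base_lr=3e-4, warmup_steps=1000, plateaued=False):
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps   # linear warmup
    return base_lr * (0.1 if plateaued else 1.0)     # flat LR, single manual 10X drop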
Pomodoro technique https://t.co/yAweFpZvrH simple idea: break up time/work into discrete committed chunks of 25 min; it has some nice benefits wrt psychology and analysis.
Shared by Andrej Karpathy at 8/15/2021
Perceiver IO is good reading/pointers for neural net architectures https://t.co/cVrTTHdzot esp w.r.t. encoding/decoding schemes of various modalities to normalize them to & from Transformer-amenable latent space (a not-too-large set of vectors), where the bulk of compute happens.
Shared by Andrej Karpathy at 8/8/2021
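A rough sketch of the encode/process/decode pattern described above (a simplified PyTorch pseudo-architecture with made-up sizes, not the DeepMind implementation):

# Perceiver-IO-style pattern: cross-attend arbitrary-length inputs into a small
# fixed set of latents, do the bulk of compute there with self-attention, then
# cross-attend out using task-specific output queries.
import torch
import torch.nn as nn

class TinyPerceiverIO(nn.Module):
    def __init__(self, dim=256, n_latents=128, n_heads=8, depth=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, dim))
        self.encode = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.process = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, n_heads, batch_first=True), num_layers=depth)
        self.decode = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, inputs, output_queries):         # inputs: (B, N_in, dim)
        z = self.latents.expand(inputs.shape[0], -1, -1)
        z, _ = self.encode(z, inputs, inputs)           # inputs -> latents (cross-attention)
        z = self.process(z)                             # bulk of compute happens on the latents
        out, _ = self.decode(output_queries, z, z)      # latents -> outputs (cross-attention)
        return out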
"Machine Learning: The High-Interest Credit Card of Technical Debt" (2014) old but fun/good re-read, appropriately anxiety inducing :) https://t.co/RbcReEqnB3
Shared by Andrej Karpathy at 8/7/2021
I've used this very often as well. For me the core benefit is that a page is a short-term memory storage device that allows efficient random access, something that the brain is extremely poor at. i.e. it vastly extends the available register space, allowing for richer compute.
Shared by Andrej Karpathy at 7/26/2021
But a rough auto-scalable “template” for a healthy & efficient labeling workflow is slowly emerging along the lines of a finite state machine with a number of slots for specific roles, points of checks and balances and supporting infrastructure. Kinda. Maybe.
Shared by Andrej Karpathy at 7/8/2021
Even after 4 years I still haven't "solved" labeling workflows. Labeling, QA, Final QA, auto-labeling, error-spotting, diversity massaging, labeling docs + versioning, ppl training, escalations, data cleaning, throughput & quality stats, eval sets + categorization & boosting, ...
Shared by Andrej Karpathy at 7/8/2021
Good post! 👏 my Twitter timeline has filled up with a lot of these renders recently, expect we'll see a lot more art from neural nets.
Shared by Andrej Karpathy at 7/8/2021
Great supplement! Does a better job of fleshing out the strategy of shifting as much compute as possible from inference time (where code is under strict latency requirements and doesn't know the future) to training time.
Shared by Andrej Karpathy at 7/3/2021
(This post is really just cherry-picked sections of a larger, much cleaner and tested Python Bitcoin node I've been slowly building here: https://t.co/djCj2c18kJ)
Shared by Andrej Karpathy at 6/22/2021
Gave a talk at CVPR over the weekend on our recent work at Tesla Autopilot to estimate very accurate depth, velocity, acceleration with neural nets from vision. Necessary ingredients include: 1M car fleet data engine, strong AI team and a Supercomputer https://t.co/osmEEgkgtL
Shared by Andrej Karpathy at 6/21/2021
Love it! Reminds me of when I was trying to write addition manually in raw transformer weights. (Proved to be tricky because of LayerNorms and then I decided I should probably do “real” work). Anyway important to have intuitive sense of architectures and their inductive biases 👌
Shared by Andrej Karpathy at 6/16/2021
Top scoring approaches on any benchmark are often overly complex, overly-specialized models, heavy multi-scale ensembles, fancy training techniques, etc. How do we organize benchmarks/metrics around the "simplest baseline that just works"?
Shared by Andrej Karpathy at 5/28/2021