Thomas Wolf
@Thom_Wolf
Co-founder & CSO at @HuggingFace – Created Transformers & Datasets libraries and the BigScience (@BigScienceW) project
Tweets by Thomas Wolf
When you think you see life here, it's all in your eyes, just like when we anthropomorphise neural networks. All in our head. Don't get me wrong, it's beautiful how humans put meaning in everything that surrounds them. But it's important to stay aware that we build reality in our minds.
Shared by Thomas Wolf at 4/28/2022
On the reproducibility side, it's a question for reproduction projects like LAION or DALLE-mega. While academics can get compute for training through public clusters (e.g. @BigscienceW), the money to buy large-scale licensed datasets will probably be much more difficult to ask for.
Shared by Thomas Wolf at 4/21/2022
So as highlighted by @alexjc, DALL-E was trained (probably in large part) on datasets purchased by OpenAI, stock photo or professional illustration databases I guess. There are a couple of interesting implications to that, let me highlight two of them:
Shared by Thomas Wolf at 4/21/2022
I love this new trend of training such models in the open
Shared by Thomas Wolf at 4/11/2022
Ready to start the AMA session about the BigScience open-source multilingual 176B language model training 🚀 Come join us now at 👉 https://t.co/wmv0oeCEBN
Shared by Thomas Wolf at 3/24/2022
I had a nice chat with Sam recently, talking about the open training of BigScience's 176B-parameter model (which you can follow at @BigScienceLLM) and other projects at @huggingface that I'm excited about (there are many :)
Shared by Thomas Wolf at 3/21/2022
I've set up a simple Twitter bot to digest the training logs of the BigScience large language model and tweet the training progress (who said I didn't code anymore ;) Should I add something to it? Like an ASCII graph of the loss?
Shared by Thomas Wolf at 3/16/2022
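A minimal sketch of what such a log-digest bot could look like, assuming the training logs expose step/loss pairs as plain text lines; the log pattern, the helper names and the omitted Twitter-API posting step are illustrative assumptions, not the actual bot code.

```python
# Hypothetical sketch: parse (step, loss) pairs from training log lines
# and render a small ASCII graph of the loss curve.
import re

LOG_LINE = re.compile(r"step[:=]\s*(\d+).*?loss[:=]\s*([0-9.]+)")  # assumed log format


def parse_losses(lines):
    """Extract (step, loss) pairs from raw log lines."""
    points = []
    for line in lines:
        m = LOG_LINE.search(line)
        if m:
            points.append((int(m.group(1)), float(m.group(2))))
    return points


def ascii_loss_graph(points, width=40, height=8):
    """Render the loss curve as a crude ASCII chart, newest points on the right."""
    if not points:
        return "(no data yet)"
    losses = [loss for _, loss in points][-width:]
    lo, hi = min(losses), max(losses)
    span = (hi - lo) or 1.0
    rows = [[" "] * len(losses) for _ in range(height)]
    for x, loss in enumerate(losses):
        y = int((hi - loss) / span * (height - 1))  # high loss near the top row
        rows[y][x] = "*"
    return "\n".join("".join(r) for r in rows)


if __name__ == "__main__":
    # Fake log lines standing in for the real training logs.
    sample = [f"step={s} loss={4.0 - 0.01 * s:.3f}" for s in range(0, 300, 10)]
    pts = parse_losses(sample)
    print(f"latest: step {pts[-1][0]}, loss {pts[-1][1]:.3f}")
    print(ascii_loss_graph(pts))
    # Posting the digest would then go through the Twitter API
    # (e.g. via tweepy), which is omitted here.
```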
Just received my first physical copy of our book & the feeling is... surreal. One year and a half in the making & I'm amazingly proud of the result. It covers so much ground, from NLP w/o labels up to training billion-param models, multilinguality, pruning, classif, generation..
Shared by Thomas Wolf at 3/6/2022
If you haven't checked it out yet, 🤗 Optimum is our new OSS library, an extension of 🤗 Transformers providing a set of performance optimization tools for maximum efficiency to train and run models on several types of hardware. Check it out at https://t.co/XuP3u6ekDr
Shared by Thomas Wolf at 2/24/2022
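For context, a rough sketch of what using 🤗 Optimum's ONNX Runtime integration can look like; the exact class names and arguments have changed across versions, so treat this as an assumption about the current API rather than a definitive usage example.

```python
# Rough sketch of 🤗 Optimum's ONNX Runtime backend (API details assumed;
# check the docs for the exact signatures of your installed version).
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSequenceClassification

model_id = "distilbert-base-uncased-finetuned-sst-2-english"

# Export the Transformers checkpoint to ONNX and load it with ONNX Runtime.
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The resulting model is a drop-in replacement in a Transformers pipeline.
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classifier("Optimum makes my models run faster on my hardware."))
```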
Many of these topics require significant compute to escape the local minima of overfitting on a small dataset or an environment lacking diversity, and that's a major challenge today, for academic research in particular. More and more, it's time for large scientific collaborations imo.
Shared by Thomas Wolf at 2/23/2022
In 2021, we've seen an explosion of grounded Language Models: from "image/video+text" to more embodied models in "simulation+text". But, when tested on text benchmarks, these grounded models really struggle to improve over pure text LMs, e.g. T5/GPT-3. Why? >>
Shared by Thomas Wolf at 12/2/2021
I see people commenting that this is either an unfortunate reality or better than releasing nothing. I disagree. I think we can find ways to share code/datasets/models. I also tend to think industry papers without any of these are closer to press releases than science reports.
Shared by Thomas Wolf at 11/25/2021
Not singling out this paper, but I'm worried the field of large-scale multimodal training has recently become a Wild Wild West of unshared training code running on large unreleased datasets to produce private models. It's far from ideal for reproducibility and good science practices.
Shared by Thomas Wolf at 11/25/2021
A strange form of tunnel vision I often see in AI consists in starting to think that humans work just like ML models, that we are in the end just RL agents or language models (depending on what one works on). The human experience is much more diverse & complex than any of these.
Shared by Thomas Wolf at 11/24/2021
The public supercomputer used by @BigscienceW is doubling its size🤩 BigScience helped a lot to make this a reality and I’m very excited about this outcome. Public compute clusters are critical to reduce the divide between industry & academic AI research https://t.co/ia8Xmz8bAD
Shared by Thomas Wolf at 11/18/2021
Going over the last proofs of our book with @_lewtun & @lvwerra before sending it off and OMG what a feast! We train and investigate models from 60M to 1.5B params on NER, QA, summarisation.. distill, compress, use them in few-shot.. dive into datasets. Code/exp/checkpoints all included!
Shared by Thomas Wolf at 10/31/2021
Two in a row! Isn’t that crazy? After 🤗Transformers got “Best demo paper award” at EMNLP last year, it’s the turn of 🤗Datasets to get this great award this year Congratulations @qlhoest @avillanovamoral @MarioSasko @srush_nlp and everyone involved ➡️ https://t.co/0v1Jm3fU4R
Shared by Thomas Wolf at 10/30/2021
We have 2 new positions open in the open-source team: vision and PyTorch
Shared by Thomas Wolf at 10/8/2021
The BigScience data sourcing/tools/governance groups are creating a very large and high-quality multilingual text dataset. A gift for future research in NLP/CL/AI. With already 170+ participants, it's open to all and happening *now*, come join, help and learn! 👇
Shared by Thomas Wolf at 10/7/2021
Authors have no say on the animal O'Reilly chooses for the cover of their book. But I'm really happy that they chose a parrot🦜 for the cover of the book on Transformers we are finalising with Lewis and Leandro. It's a Coconut Lorikeet parrot (a very stochastic Coconut Lorikeet😉)
Shared by Thomas Wolf at 9/11/2021
At our first live conference in a long time at #transformersatwork in Amsterdam, @Nils_Reimers is having fun breaking a bunch of assumptions on the generalization, domain adaptation and multilinguality of neural retrieval models (and discussing solutions!)
Shared by Thomas Wolf at 9/10/2021
I'm doing a lot of "slow" science these days. Not following arxiv anymore, mostly reading long-form research works and textbooks in areas outside of my usual fields. It feels very good, I should do it more often
Shared by Thomas Wolf at 9/1/2021
When Lisp was still the language of AI, there was a massive effort called Cyc. Today's research on large pretrained models trying to make everything "in-domain data" often reminds me of Cyc (despite being NN vs symbolic): -static knowledge -full coverage goal -proprietary licences
Shared by Thomas Wolf at 8/20/2021
A few years ago I was mostly interested in models, creating 🤗transformers, adding BERT, GPT, T5… Over time I’ve seen my interests shift to data (sharing, evaluation, processing) leading to 🤗datasets And I see many people around me follow a similar path We are slowly maturing
Shared by Thomas Wolf at 8/19/2021
Lol, humans are even better than few-shot or zero-shot learners, they are « wrong-shot » learners. Kids learn a correct label from an *unsuccessful* attempt to show them the input/label pair 😂 I wish our models could do anything remotely similar
Shared by Thomas Wolf at 7/19/2021
So many experiences defining who I am and what I do today happened to me after my 30th birthday: raising kids, settling in a country whose language I didn't know, changing careers.. Sometimes in our 20-under-20/MIT35 addiction we forget how maturing takes time and is important
Shared by Thomas Wolf at 7/16/2021
I've been thoroughly enjoying reading M. Tomasello's work recently. One especially stimulating aspect for me is to see pattern matching –aka "what DL does"– integrated into a much wider take & body of research on the origins of cognition, learning and the linguistic capabilities of humans
Shared by Thomas Wolf at 7/15/2021
I'm unreasonably excited about this feature @qlhoest just added to 🤗Datasets. Seeing a training on an 80+ GB text corpus –like @oscarnlp– start instantly, without waiting for data download/processing, has something magical to it🧙♂️ On master (release soon) https://t.co/AeUFPs26z4
Shared by Thomas Wolf at 6/24/2021
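The feature in question is dataset streaming; a minimal sketch of how it is used, assuming the OSCAR config name below and a recent enough version of 🤗 Datasets:

```python
# Sketch of 🤗 Datasets streaming: iterate over a huge corpus like OSCAR
# without downloading or preprocessing it up front.
from datasets import load_dataset

# streaming=True returns an IterableDataset that fetches data lazily.
# (The "unshuffled_deduplicated_en" config name is an assumption; pick the one you need.)
oscar = load_dataset(
    "oscar",
    "unshuffled_deduplicated_en",
    split="train",
    streaming=True,
)

# Training can start immediately: examples arrive as you iterate.
for i, example in enumerate(oscar):
    print(example["text"][:80])
    if i >= 2:  # just peek at a few records for the demo
        break
```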
The debates on crawling/creating very large datasets remind me of the Bayesian/frequentist debate. As often, we'd love to avoid having to state & dive deep into our priors & curation choices. But they are there nonetheless. And they trickle down all the way to the model predictions.
Shared by Thomas Wolf at 6/22/2021
And it went great! Teven now has two papers in conference proceedings: the 🤗transformers *best demo paper* at EMNLP 2020, and this *best paper* at NAACL 2021 on the equivalence between prompts and data points 😱 https://t.co/lnlRKxSjW2
Shared by Thomas Wolf at 6/9/2021