Chris Dixon just spent 2 years investing in and hyping NFTs. AI is a legitimate innovation that can do without all of the “Web3” con artists driving any aspect of it.
But what about "The next big thing will start out looking like a scam"?
https://cdixon.org/2010/01/03/the-next-big-thing-will-start-...
godelski 670 days ago [-]
I agree. Hype is a double-edged sword. We've already gotten ourselves cut up more than a few times in ML because of it, and honestly, it seems likely we'll do far more damage. I wish people would tone it down a bit so we could oust the con artists before moving on.
rootusrootus 671 days ago [-]
> just spent 2 years
So ... after he wrote this blog post?
ramraj07 671 days ago [-]
In my opinion it’s legitimate to question the discretion of anyone who put their entire weight behind cryptocrappency, especially NFTs.
avarun 671 days ago [-]
Maybe. What about questioning the discretion of anyone who uses childish terminology like "cryptocrappency" and ad hominems instead of making an actual point?
ramraj07 670 days ago [-]
I won’t totally advocate ignoring what someone says because of who they are, but I’m not going to waste precious time reading their thoughts when I have the choice to read someone else’s.
anotherpaulg 671 days ago [-]
I think there's a new approach for “How do you get the data?” that wasn't available when this article was written in 2015. The new text and image generative models can now be used to synthesize training datasets.
I was working on a typing-autocorrect project and needed a corpus of "text messages". Most of the traditional NLP corpora, like those available through NLTK [0], aren't suitable. But it was easy to script ChatGPT to generate thousands of believable text messages by throwing random topics at it.
Similarly, you can synthesize a training dataset by giving GPT the outputs/labels and asking it to generate a variety of inputs. For sentiment analysis... "Give me 1000 negative movie reviews" and "Now give me 1000 positive movie reviews".
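Roughly what that scripting can look like, as a minimal sketch: the topics, prompt wording, and sample count here are hypothetical, and it assumes the 2023-era openai Python client with an API key in the environment.

    import openai  # pip install openai; reads OPENAI_API_KEY from the environment

    TOPICS = ["running late", "dinner plans", "weekend trip"]  # hypothetical topics

    def synthesize_messages(topic, n=20):
        # Ask the chat model for n believable text messages about one topic.
        resp = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content":
                f"Write {n} short, casual text messages about {topic}, one per line."}],
            temperature=1.0,  # higher temperature for more varied samples
        )
        return [line for line in resp.choices[0].message.content.splitlines() if line.strip()]

    corpus = [msg for topic in TOPICS for msg in synthesize_messages(topic)]

The same loop works for the labeled case below: swap the topic for the label you want ("positive movie review", "negative movie review") and record it alongside each generated input.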
The Alpaca folks used GPT-3 to generate high-quality instruction-following datasets [1] based on a small set of human samples.
Etc.
An interesting question is, if you can get ChatGPT to generate high quality data for you, should you just cut out the middle-model and be using ChatGPT as your classifier?
The answer probably depends a lot on your specific problem domain and constraints, but a non-trivial amount of the time the answer will be that your task could be solved by a wrapper around the ChatGPT API.
[0] https://www.nltk.org/nltk_data/
[1] https://crfm.stanford.edu/2023/03/13/alpaca.html
nfmcclure 671 days ago [-]
You definitely can use LLMs to do your modeling. But sometimes you need very fast, cheap, and smaller models instead. There's also research showing that using an LLM to generate training data for targeted, task-specific models may result in better performance.
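A minimal sketch of that distillation step, assuming you already have LLM-generated reviews and labels from a generation step like the one described above (the two example reviews here are made up):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Stand-ins for the LLM-generated training set.
    synthetic_reviews = ["Loved every minute of it.", "A tedious, joyless slog."]
    labels = ["positive", "negative"]

    # A tiny, fast model trained on the synthetic data: sub-millisecond
    # predictions, no API calls or per-token costs at inference time.
    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
    clf.fit(synthetic_reviews, labels)
    print(clf.predict(["I loved every second of it."]))  # -> ['positive']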
og_kalu 671 days ago [-]
>should you just cut out the middle-model and be using ChatGPT as your classifier?
Oh, you certainly could. See here: GPT-3.5 outperforming elite crowdworkers on MTurk for text annotation: https://arxiv.org/abs/2303.15056
And GPT-4 going toe to toe with experts (and significantly outperforming crowdworkers) on NLP tasks: https://www.artisana.ai/articles/gpt-4-outperforms-elite-cro...
I guess it will take some time before the reality really sinks in, but the days of the artificial SOTA being obviously behind human efforts for NLP have come and gone.
danuker 671 days ago [-]
> should you just cut out the middle-model and be using ChatGPT as your classifier?
And hope OpenAI forever provides the service, and at a reasonable price, latency, and volume?
selcuka 670 days ago [-]
> And hope OpenAI forever provides the service, and at a reasonable price, latency, and volume?
They are enjoying being the market leader for now, but OpenAI will soon face real competition, and LLM services will become a commodity product. That must be partly why they sought Microsoft's backing: to be part of "big tech".
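One hedge against that lock-in risk is to hide the provider behind a thin interface so the model can be swapped later. A rough sketch (the class and function names are made up, and it assumes the 2023-era openai client):

    from typing import Protocol

    class TextCompleter(Protocol):
        # Any provider that can turn a prompt into text.
        def complete(self, prompt: str) -> str: ...

    class OpenAICompleter:
        def complete(self, prompt: str) -> str:
            import openai  # swap this class out when a cheaper provider appears
            resp = openai.ChatCompletion.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user", "content": prompt}],
            )
            return resp.choices[0].message.content

    def classify_sentiment(review: str, llm: TextCompleter) -> str:
        # Application code depends on the interface, not on OpenAI.
        return llm.complete(f"Answer 'positive' or 'negative' only.\nReview: {review}")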
arbuge 671 days ago [-]
Besides that, there's the issue of efficiency.
Better-quality training data might enable you to build a leaner, more efficient model that is far cheaper to implement and run than the expensive model used to generate the data to train it.
See for example: https://twitter.com/SebastienBubeck/status/16713263696268533...
This is a very bad idea for image models. They pick up and amplify imperceptible distortions in images no human reviewer would catch... Not to speak of big ones when the output is straight up erroneous.
This may apply to text too.
Partial or fully synthetic data is OK when finetuning existing LLMs. I personally discovered it's not OK for finetuning ESRGAN. Not sure about diffusion models.
godelski 670 days ago [-]
> Not sure about diffusion models.
Diffusion models are still approximate density estimators, not explicit ones. They lose information because you don't have a unique mapping to the subsequent step; you have to think about the relationship between your image and its preimage.
So while they cover the distribution better than GANs, they still aren't reliable for dataset synthesis. They are better than GANs for it, though (GANs are very mean-focused, which is why we got such high-quality images from them but also see huge diversity issues and amplification of biases).
dragonwriter 670 days ago [-]
> Not sure about diffusion models.
Human-curated synthetic data is commonly used in finetuning (or LoRA training) for SD. I doubt that uncurated synthetic data would be very usable. There might be use cases where curating synthetic data with some kind of vision model would be valuable, but my intuition is that it would be largely hit-or-miss and hard to predict.
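A sketch of what that vision-model curation could look like, using CLIP similarity as the filter; the threshold is a made-up placeholder, and whether this beats human curation is exactly the open question:

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def curate(image_paths, prompt, threshold=25.0):
        # Keep only generated images whose CLIP score against the target
        # prompt clears a (hand-tuned, hypothetical) threshold.
        kept = []
        for path in image_paths:
            inputs = processor(text=[prompt], images=Image.open(path), return_tensors="pt")
            with torch.no_grad():
                score = model(**inputs).logits_per_image.item()
            if score > threshold:
                kept.append(path)
        return kept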
godelski 670 days ago [-]
> The new text and image generative models can now be used to synthesize training datasets.
No. Just no. Dear god, no.
This isn't too different from GPT-4 grading itself (looking at you MIT math problems)!
Current models don't accurately estimate the probability distribution of the data, so they can't be relied on for dataset synthesis. Yes, synthesis can help, but you have to remember that these models typically generate the highest-likelihood data, which is already abundant. Getting non-mean data is the difficult part, and without good density estimation you can't do that reliably. Explicit density estimation networks are rather unpopular and haven't received nearly as much funding or research; I'd highly suggest looking into them, though I'm biased because this is what I work on (explicit density estimation and generative modeling).
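One way to make that point precise (my own sketch, not a result from the thread): if the generator has learned some approximation q_theta of the true data distribution p, a student trained on its samples is fitting q_theta, not p.

    % MLE training on teacher samples x ~ q_theta minimizes cross-entropy,
    % i.e. (up to a constant in phi) the divergence to the *teacher*:
    \hat{\phi} = \arg\min_{\phi} \mathbb{E}_{x \sim q_\theta}\!\left[-\log q_\phi(x)\right]
               = \arg\min_{\phi} \mathrm{KL}\!\left(q_\theta \,\|\, q_\phi\right)
    % So even a perfect student only recovers q_theta, never p; the
    % teacher's gap KL(p || q_theta) is inherited, and it is largest in the
    % low-likelihood tails -- exactly the non-mean data you wanted.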
__loam 670 days ago [-]
Sampling an AI output when the distribution you want is human data is incredibly stupid.
loveparade 670 days ago [-]
I don't think it is. The distribution of an AI model trained on such a huge number of movie reviews is very close to the human distribution.
At least that's true around the mean. If your application needs to handle long-tail cases, an LLM won't easily give you that. But depending on the application, that may not be necessary. So yeah, sometimes this is a bad idea, but for many applications it may be just fine.
alex_lav 670 days ago [-]
It's funny, for my lil startup, "How do you get the data" is now _less_ tech than ever. I pay an hourly wage to a human to generate/transcribe it. This method is both much more cost effective and scalable than tech-enabled alternatives.
atleastoptimal 671 days ago [-]
Is synthesized data high quality, or does it just seem high quality?
mdrzn 669 days ago [-]
“Quantity has a quality all its own.”
carbocation 670 days ago [-]
Appears to be susceptible to model collapse [1], depending on how you do it.
[1] https://arxiv.org/abs/2305.17493v2
> The new text and image generative models can now be used to synthesize training datasets.
Only with heavy curation [0], otherwise your new models will be trained on progressively worse data than earlier models.
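A toy illustration of that degradation (my own sketch, not from the paper): fit a Gaussian to data, sample a new training set from the fit, and repeat with no curation.

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(0.0, 1.0, size=50)  # "human" data: mean 0, std 1

    # Each generation trains only on the previous generation's samples.
    for gen in range(1, 201):
        mu, sigma = data.mean(), data.std()     # "fit" a Gaussian generator
        data = rng.normal(mu, sigma, size=50)   # next generation's training set
        if gen % 40 == 0:
            print(f"gen {gen:3d}: mu={mu:+.3f} sigma={sigma:.3f}")
    # sigma performs a random walk with a downward bias, so over enough
    # generations the variance decays and the tails of the original
    # distribution vanish first -- the collapse the paper describes.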
RC_ITR 671 days ago [-]
This is an interesting post (and an interesting reminder that even Bitcoin maximalists had other things on their minds in 2015).
I would argue that the first step of the maze makes a ton of sense for the voice recognition/image classification/driving use cases of 2015 that had binary outcomes, but nowadays, what would it even mean for an LLM to be right 80% of the time? 8/10 words are predicted correctly? It can speak correctly on 80% of topics?
The reason people are so jazzed about generative AI is that it's not autonomously doing a task - it's helping a human operator by making (sometimes very useful) guesses on their behalf. It's much more of a tool than a solution (even if a lot of people want it to be a solution).
andirk 670 days ago [-]
8 out of 10 is pretty darn good though, right? Then again, a 9-year-old is probably right 80% of the time, and a calculator 99%, so we need to compare it to other similar tech. I don't know what to compare it to, though. Maybe current product-suggestion engines, since that's basically what this "AI" is.
icpmacdo 671 days ago [-]
The linked document from Balaji's startup engineering course is extremely useful: https://spark-public.s3.amazonaws.com/startup/lecture_slides...
Chris is an absolute grifter - I read his book on web3 and was so underwhelmed by his depth of thought.
Better to ignore folks who have no experience building cutting-edge products - he's just an average philosopher turned VC because it pays more.
tough 670 days ago [-]
He's launching a new book on web3 soon, isn't he?
Wondering if you're talking about that new one or a previous one
chrisdbanks 671 days ago [-]
How much has changed since 2015. With NLP and ML it used to be so hard to create high-quality datasets; it was a case of rubbish in, rubbish out. Now LLMs have solved that problem. It seems that if you put a huge amount of data into a big enough model, an emergent ability is discerning the wheat from the chaff. Certainly in the NLP space, the days of crowdsourced datasets seem to be over, replaced with few-shot learning. So much value has been unlocked.
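For instance, a few-shot sentiment classifier is now just a prompt. A minimal sketch (the example reviews and wording are made up, and it assumes the 2023-era openai client):

    import openai

    FEW_SHOT = """Classify each movie review as positive or negative.

    Review: An instant classic; I was smiling the whole way through.
    Sentiment: positive

    Review: Two hours of my life I will never get back.
    Sentiment: negative

    Review: {review}
    Sentiment:"""

    def classify(review):
        resp = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": FEW_SHOT.format(review=review)}],
            temperature=0,  # deterministic labels
        )
        return resp.choices[0].message.content.strip()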
dweekly 671 days ago [-]
There's an interesting dark side to this as well: in 2023, when you think you are crowdsourcing data, you may actually just be tasking it to ChatGPT. A lot of turkers just turn around and use an LLM!
__loam 670 days ago [-]
Which is of course absolutely terrible for the quality of the dataset you're trying to produce
luckyt 671 days ago [-]
I think the author's point broadly still holds -- you can get further with more engineering resources and data, whether you're using 2015 era models or 2023 retrieval-augmented LLMs and fine-tuning. Just that now you can accomplish a lot more quickly with a ChatGPT prompt.
carlossouza 671 days ago [-]
Interesting read. I'd argue the most successful AI-based products are the ones that settle for 80-90% accuracy and “Create a fault-tolerant UX.”
Then, the question becomes: how to create a great fault-tolerant UX?
There are some nice recent cases... GitHub Copilot is one...
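One answer, sketched very roughly below: make every model output a cheap-to-reject proposal, the way Copilot's ghost text costs one keypress to dismiss (the function names here are made up):

    def fault_tolerant_complete(context, suggest, confirm):
        # Pattern: the model proposes, the human disposes. A wrong guess
        # costs one keypress instead of silently corrupting the document.
        suggestion = suggest(context)  # model guess; useful maybe 80-90% of the time
        if suggestion and confirm(suggestion):  # e.g. Tab to accept, Esc to reject
            return suggestion
        return None  # fall back to whatever the user types themselves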
philipwhiuk 670 days ago [-]
It's amazing how "AI" now exclusively means LLMs.
xigency 671 days ago [-]
Not super on point, but wow that boilerplate at the end really goes above and beyond at saying “this is my personal blog, and just, like, my opinion man.”
wanderingstan 671 days ago [-]
I suspect it’s due to his writing about crypto, where the legal/regulatory risks around securities and financial advice can be high.
Source: https://cdixon.org