>For the 16 plaintiffs, the complaint indicates that they used ChatGPT, as well as other internet services like Reddit, and expected that their digital interactions would not be incorporated into an AI model.
I don't expect this lawsuit to lead anywhere. But if it does, I hope it leads to some clear laws regarding data privacy and how far a TOS is binding. The recent ruling regarding web scraping makes the case against OpenAI a lot weaker. [1] Scraping publicly available data is legal. People's consent wasn't needed to use their data; there was an implicit assumption the moment the data was published to the public, like on Reddit or YouTube.
I keep seeing this idea recur in the suit:
>Plaintiff ... is concerned that Defendants have taken her skills and expertise, as reflected in [their] online contributions, and incorporated it into Products that could someday result in [their] professional obsolescence ...
Anyone is able to file a suit. I wish people would stop assuming that a news report automatically means it has merit.
> But if it does, I hope it leads to some clear laws regarding data privacy and how TOS is binding.
One of the "I wonder where this will go" things with the Reddit and Twitter exoduses to ActivityPub-based systems is that it is trivial for something to federate with them and slurp data without any TOS interposed.
The TOSes for these systems are typically based on what can be pushed to them - not on what can be read (possibly multiple federations downstream).
koheripbal 670 days ago [-]
Article titles which specify the lawsuit claim amount are a good indicator of poor journalism.
You can usually disregard such articles, as you can expect biased/incomplete reporting.
Lawsuit claim amounts have zero bearing on reality. They must be specified in any complaint, but lawyers just always specify massive amounts without justification.
Any reporting on this amount indicates ignorance of the system or intentional dishonesty.
judge2020 670 days ago [-]
Also note that the damages typically can't be adjusted up, only down.
TechBro8615 670 days ago [-]
Regardless of access rights to the data, I've yet to read a compelling argument why LLMs are even derivative works. You can't identify your Reddit comment in a ChatGPT conversation. How is it any different than a human learning English by reading Reddit? That human wouldn't be violating copyright every time they said a phrase that was repeated by hundreds of Redditors.
My favorite LLM analogy so far is the "lossy jpeg of the web." Within that metaphor, I don't see how anyone can claim copyright on the basis of a pixel they contributed that doesn't even show up in the lossy jpeg. They can't point to it.
knaik94 670 days ago [-]
I've been thinking of the output as fanfiction/fan art. It shares many of the same complications regarding the ownership of ideas, commercial intent of writing, competition, and copyright. Fanfiction is generally a protected form of expression, but requires the work to be "transformative". Unlike with parodies and criticisms, fanfiction can be much harder to distinguish from original work. From that perspective, a large amount of the output of LLMs is so generic that it's not possible to attribute it to one person. It's like trying to find the original author of "Once upon a time".
I came across this with the Eleanor lawsuits - https://www.caranddriver.com/news/a42233053/shelby-estate-wi... - and while I believe that in that instance Eleanor falls on the "this shouldn't have been copyrightable" side (took a bit to get there), the question is "what protects the representation of Darth Vader?"
In general it tends to be ignored and tacitly encouraged... but it isn't protected.
visarga 670 days ago [-]
It's more like a mirror-house of human thought. It can create countless arrangements and even execute tasks.
JustBreath 670 days ago [-]
> Plaintiff ... is concerned that Defendants have taken her skills and expertise, as reflected in [their] online contributions, and incorporated it into Products that could someday result in [their] professional obsolescence ...
It's been a bit surreal seeing modern-day Luddites come out of the woodwork, coming up with basically any ethical/legal argument they can as a thinly veiled way of saying "I don't want to be automated!"
Not commenting on whether or not they are right per se, but it's weird seeing history repeat itself.
pksebben 670 days ago [-]
I don't think it's a matter of right or wrong - these are people who are behaving completely rationally given their context.
(I should caveat that I think if they get what they want, we all lose in a big way. Not that I think this is going anywhere)
We're coming up on the outer bounds of our systems of incentives. Capitalism, as a system, is designed to solve for scarcity, both in terms of resources and in terms of skill and effort. Unfortunately, one of its core mechanisms is that it's all-or-nothing. You MUST find a scarcity to solve or you divorce yourself from the flow of capital (and starve / become homeless as a result).
Thus, artificial scarcity. It's easy to spot in places like manufacturing (planned obsolescence), IP (drug / software / etc. patents), and so forth. I think this is just the rest of humanity both catching on and being caught up with. Two years ago, everyone thought they had a moat by virtue of being human. That's no longer a given.
One hopes that we'll collectively notice the rot in the foundation before the house falls over (and, critically, figure out how to act on it. We have a real problem with collective action these days that may well put us all in the ground).
oblio 670 days ago [-]
As far as I remember, Luddites were smart and not against all technology; they were just protecting their jobs. And they were ultimately right.
Why? Except for the longshoremen in the US getting compensation and an early retirement due to the introduction of containers, I know of exactly 0 (ZERO!) mass professional reconversions after a technological revolution.
Look at deindustrialization in the US, UK, Western Europe.
When this happens, the affected people are basically thrown in the trash heap for the rest of their lives.
Frequently their kids and grandkids, too.
scrollaway 670 days ago [-]
Stables became gas stations. Nintendo used to be a toymaker.
Businesses change and adapt. Workers too — but people often don’t like change, so many choose to stay behind. Should we cater to them?
I used to do a lot of work which is now mostly automated. Things like sysadmin work: spinning up instances and configuring them manually, maintaining them. I reconverted and learned Terraform, AWS, etc. when they became popular.
Should I have gotten help from the government to instead stick to old style sysadmin work?
mrtranscendence 670 days ago [-]
> Should I have gotten help from the government to instead stick to old style sysadmin work?
I don't think anyone beyond a few marginal voices is calling for a ban on job automation. What they seem to prefer is that, if they are to be automated out of a job, they should be compensated for their copyrighted works having been used in the process.
Regardless, at the very least people who are being automated should get some government support. Not everyone can easily retrain.
wizzwizz4 670 days ago [-]
Suppose you're a weaver. It's hard, fiddly work, and you have to get your timing and your tension just right to make quality material. Now, there are mechanised looms that can do the job faster (though the quality's not great: they could still do with some improvement, in your opinion). From this efficiency gain, who should reap the profits?
Suppose you're a farmer. You've been working on your tractors for decades, and have even showed the nice folk at John Deere how you do it. Now they've built your improvements into the mass-produced models, and they say you can't work on your tractors any more. Who should reap the profits?
Suppose you're a writer. You've spent a long time reading and writing, producing essays and articles and books and poems and plays, honing your craft. You've got quite a few choice phrases and figures of speech in your back pocket, for when you want to give a particular impression. Now, there is a great big statistical model that can vomit your coinages (mixed in with others') all over the page, about any topic, in mere minutes. Who should reap the profits?
Suppose you're a visual artist. You enjoy spending your time making depictions of fantasy scenes: you have a vivid imagination, and so you can make a living illustrating book covers and the like. You put your portfolio online, because why not? It doesn't hurt you, it makes others happy, and maybe it gets you an extra gig or two, now and then. Except now, there's a great big latent diffusion model. Plug in “Trending on Artstation by Greg Rutkowski”, and it will spit out detailed fantasy scenes, photorealistic people, the works. Nothing particularly novel, but there was so much creativity and diversity in your artwork that few have the eye to notice the machine's subtle unoriginality. Who should reap the profits?
oblio 670 days ago [-]
I've answered this before. The container revolution split some of the resulting profits with those whose livelihoods were destroyed, the longshoremen.
"You build a dam that destroys 10000 homes, who should reap the profits?"
wizzwizz4 670 days ago [-]
It's a good answer, but it raises further questions:
• Should we be destroying people's homes to build dams without their consent?
• In general, are people being compensated when these things happen to them? i.e., while it might be nice, does this actually happen?
The Luddites (the real ones, not the mythological bastardisation of them) continue to be sympathetic characters.
oblio 669 days ago [-]
> • In general, are people being compensated when these things happen to them? i.e., while it might be nice, does this actually happen?
The famous: "it depends" :-)
AI most likely falls under: "they should be", IMHO.
JustBreath 670 days ago [-]
I don't think we should cater to Luddites, but (and it's a big but) if we automate enough jobs out of existence it's essentially undeniable that we will need systemic changes to avoid becoming a completely dystopian society.
TechBro8615 670 days ago [-]
But as the corollary to that, I know of zero successfully stopped technological revolutions. You can't put the genie back in the bottle, and there is no way to stop progress, aside from a one-world authoritarian government that forcibly stops as much of it as they can. But even that would only be marginally effective. Progress would eventually resume.
oblio 670 days ago [-]
Yes, you do know of revolutions that were stopped, and it worked for centuries.
Tokugawa Japan, Qing China, many other places, including in Europe, for centuries.
But that's too extreme.
My point is that we're reaching a point where people need to be compensated. We can't just destroy their lives, collect all the money in 2 bank accounts and call it a day.
JustBreath 670 days ago [-]
Bingo.
That's the real flaw in Luddite thinking -- you can destroy the machines, but not the progress they represent.
wrren 670 days ago [-]
In this case I think it's a little different. People are saying that they don't want to have their own productive or creative output used to undermine their own standard of living. That's not the same as simply not wanting to have your job automated away by someone else's business innovation.
DirkH 670 days ago [-]
To make ChatGPT analogous to coal-mining automation, it would have to automate what it does without learning from sources online.
To make coal-mining automation analogous to ChatGPT, the machinery would have had to use something the coal miner did to learn how to automate their work. I'm imagining a camera watching all the coal miner's work, after which the machine can immediately do it, but better.
I agree it is a tad different, but just as someone's coal mining is out in the open for anyone in the tunnel to see, anything you write unprotected online is public and fair game, I think.
tedivm 670 days ago [-]
The lawsuit is far more nuanced than you're letting on. There are several aspects that come into play:
* Was it published publicly? This is basically defined in the courts as "if you make an unauthenticated web request, does the data return?" This is where scraping comes in: if you make the data available without authentication, you can't enforce your TOS, because you can't validate that people actually accepted the TOS to begin with.
* Is the data able to be copyrighted? This is where things get interesting: facts cannot be copyrighted, which is why a lot of scrapers are able to reuse data (things like weather, sports scores, even "for hire" notices can be considered factual).
* If it would typically be considered covered by copyright, does fair use come into play?
* Are there any other laws that come into play? For example, GDPR, CCPA, or other privacy laws can still add restrictions to how data is collected and used (this is complicated by the various jurisdictions as well).
* Was the work done with the data transformative enough to allow it to bypass copyright protections? This goes back to when Google was scanning books. Because they were making a search engine, not a library, their search tool was considered transformative enough to allow them to continue.
It's not enough to say "because it's on the internet, it's fair game for everyone to use". This is a really complicated area where things are evolving rapidly, and there's a lot of intersecting law (and case law) that comes into play.
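The "unauthenticated request" test in the first bullet can be sketched roughly in code. This is only an illustration of the idea, not a legal test; the function name and user-agent string are made up, and it uses just Python's standard library:

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

def is_publicly_accessible(url: str) -> bool:
    """Rough proxy for the courts' question: does the data come back
    with no cookies, no credentials, and no session?"""
    req = Request(url, headers={"User-Agent": "example-bot/0.1"})
    try:
        with urlopen(req, timeout=10) as resp:
            return resp.status == 200
    except (HTTPError, URLError):
        # 401/403 (or no answer at all) means the host is gating the content
        return False
```

A page that returns 200 to this bare request would count as "published publicly" under that framing, while anything behind a login wall would not.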
knaik94 670 days ago [-]
I agree that there is additional nuance, but so far public data scraping has very clearly been ruled legal. It's possible that at the time of scraping, copyrighted data was incorporated into the training set because it hadn't been taken down by the host platform yet. But in my opinion, the suit's core claim, that private data was used intentionally, is not true. The GPT-4 browsing plugin is equivalent to web scraping.
Another complication is that OpenAI is not exposing any static data. A response is generated only after prompting. I'd argue that LLMs are closer to calculators than to databases in function. The amount of new information that can be added is also limited; it is not a continuous learning/training architecture.
I do hope this leads to more clear laws regarding data privacy, but I can't imagine the allegations of "intercepting communications", violating CFAA, or violating unfair competition law will hold.
tedivm 670 days ago [-]
My point is that you have to treat the method for collecting the data and the usage of the data as separate legal questions. Scraping is legal. What you do with the data that you scrape, though, is a whole other question.
To put it another way, it's legal for me to go to the library and borrow a DVD or a book of poems. That doesn't give me the right to publish the poems again under my own name. Whether I find the poems from scraping, borrowing the book from a library, or even just reading them off of a wall, I don't get ownership rights to that data.
The same logic applies to a lot of other laws around data. If you collect data on individuals there are a bunch of laws that come up around it, and many of them don't really concern themselves with how you got the data so much as how you use it. The fact that it was scraped doesn't grant any special legal rights.
knaik94 670 days ago [-]
What you describe misrepresents how LLMs/neural networks and the math work; your analogy does not apply. There's no static data in the networks. The output of LLMs is much closer to parodies and fanfiction. In that case, you very clearly own the copyright to the new work you make.
tedivm 670 days ago [-]
That's weird, since my comment literally said nothing about LLMs. I was simply pointing out that making scraping legal doesn't invalidate any of the other data laws that were out there, and gave one example.
You keep making the claim that because it was scraped people can do whatever they want, as scraping is legal. That is the only thing I'm arguing against, because that is a gross misinterpretation of how the case that made scraping legal was decided. LLMs aren't relevant to that point (which is exactly what I keep saying- the method of collection doesn't magically change the legality of it).
That being said, you're still wrong. The US Copyright Office has said that the output of LLMs is the output of an algorithm, not a creative work. Therefore you can't "own the copyright to the new work you make", because the work itself can't be copyrighted at all. No one can own the output of an LLM.
Also, since it seems you want to be wrong on every level: it is absolutely possible for a neural network to repeat data from its training set. This is a well-known problem in the field.
knaik94 670 days ago [-]
I see your perspective better now. The LinkedIn case was specifically regarding the CFAA and is relevant to the original suit against OpenAI and web scraping, but I now see you weren't discussing that. The copyright limit you mention relates to completely automated generations; it's not as clear-cut when a human uses the tool. The UK assigns the copyright to the user/custodian of the AI. Neural network models can repeat data, but it requires a certain frequency in the training set, and the output is still probabilistic. The complication comes from the fact that there is no "copying" when training a model. Fundamentally, I think we disagree on how data use laws apply in this situation. I appreciate you discussing this with me; it did help clear up some misunderstandings I had.
Even if they were exposing static data, how would that be different than a search engine? Google has been scraping the web for two decades, indexing even explicitly copyrighted content, and then making money by selling ads next to snippets from that content. If you're going to make the case that an LLM is violating copyright, then surely you must also assert that Google is too, because it's the same concept, but Google is actually surfacing exact text from the copyrighted material.
wizzwizz4 670 days ago [-]
By putting something on a public-facing website, it's generally agreed that (absent a robots.txt to the contrary), you intend it to appear in web search results, and you're granting a public limited semi-transferable revocable license to request, download and view your site to your visitors.
That doesn't mean you grant a license to produce derivative works other than search indexes. Legally, it's different. (Germany codifies these as separate "moral rights": Urheberpersönlichkeitsrecht.)
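That "absent a robots.txt to the contrary" convention is machine-readable, and site owners can express exactly this split between indexing and other uses. A sketch using Python's standard `urllib.robotparser`; the crawler name `ExampleAIBot` is made up for illustration:

```python
from urllib import robotparser

# A site owner who welcomes search indexing but opts out of a
# hypothetical AI-training crawler might publish this robots.txt:
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# The AI crawler is blocked from every path; everyone else may fetch.
rp.can_fetch("ExampleAIBot", "https://example.com/blog/post")  # False
rp.can_fetch("SearchBot", "https://example.com/blog/post")     # True
```

Of course, robots.txt expresses the license you describe only by convention; nothing but good behavior makes a crawler honor it.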
kelipso 670 days ago [-]
These things are just not going to go anywhere, a big reason being that AI is part of the technological race. If AI research gets constrained in the US, progress will happen in China. Since that can't be allowed to happen, this won't go anywhere.
TechBro8615 670 days ago [-]
I tend to agree with you, but I also recognize I could be unrealistically optimistic. This is the legal system we're talking about. I wouldn't expect every court case to be decided fairly, nor would I expect any new laws and regulations to necessarily be sensible. Frankly my biggest worry at this point is that regulatory capture from the first mover AI companies will stop me from purchasing more than one GPU.
I'm not too worried about copyright issues because regardless of whatever happens with upcoming case law and legislation, any regulation against the input data will be totally unenforceable. It's nearly impossible to detect whether or not an LLM was trained on some corpus of data (although maybe there is some "trap street" equivalent that could retroactively catch an LLM trained on data it wasn't allowed to read). And even if the weights of a model are found to be in violation of some copyright, it's still not enforceable to forbid them, because they're just a bag of numbers that can be torrented and used to surreptitiously power all sorts of black boxes. That's why I'm much more worried about legislative restrictions on hardware purchases.
supermatt 670 days ago [-]
> I hope it leads to some clear laws regarding data privacy and how TOS is binding
I hope it leads to more people realizing that a TOS doesn't override their individual rights, and that the legal system works to support them.
judge2020 670 days ago [-]
One individual right is the right to sign away other rights in exchange for products and services.
webnrrd2k 670 days ago [-]
There are limits to that -- to signing away rights. In the US, you can't sign yourself into slavery. You can't sell the right to have someone kill you.
There's sort of an exception for military service, but even soldiers have access to military courts.
tedivm 670 days ago [-]
Can you point to where that "right" is codified in law?
judge2020 670 days ago [-]
Common law of contracts dictates that you can commit to performing certain services in exchange for the counter-party performing certain services. For example, you provide both money, viewing data, and permission to run DRM and proprietary code on your property (e.g. set-top boxes or smart TVs) to Netflix in exchange for obtaining access to their library of TV shows and movies.
It's codified in the fact that saying you'll do something means you're socially obligated to do it, and legally obligated if you receive something in return.
tedivm 670 days ago [-]
You still haven't said where it's legal that all rights can be signed away. I know for a fact that you can't waive tenant rights when signing a lease, for example. We also don't allow people to sign over so many rights that they're considered slaves, as slavery has been made illegal. I also can't sign away my right not to be sexually harassed: if a company makes me sign something saying they can sexually harass me, they will still end up losing in court. The US has also limited the ability of NDAs to cover discussions about labor practices, so there's another right we can't sign away.
It seems to me there are a ton of counterexamples to this "right" you speak of. So many that it doesn't seem like it really exists.
mycall 670 days ago [-]
It is open knowledge that ~0% of people read any TOS. While ignorance is no defense for breaking laws or rules, that ~0% is compelling evidence in and of itself that the process is completely broken.
littlestymaar 670 days ago [-]
> People didn't need consent to having their data be used, there was an implicit assumption the moment the data was published to the public, like on reddit or youtube.
The same argument could be used to defend ubiquitous face recognition in the street, though (“when you go out in the street, there's an implicit assumption that your presence in a public place is public”), but I'd really like it if we could not have that…
There's a case to be made that corporations gathering data and training artificial intelligence don't need to have the same rights as people: when I go out in the street or publish something on Reddit, I'm implicitly allowing other people to read my comments, but not corporations to monetize them. (GDPR and the like already make this kind of distinction for personal information, by the way, so we could totally extend it to any kind of online activity.)
nameless_prole 670 days ago [-]
It becomes harder and harder to pretend that this level of data scraping and disregard for consumer privacy is acceptable when things like GDPR exist.
Just because I posted something on reddit because I thought it was funny, doesn't implicitly give permission to anybody to take that post and profit from it. You're doing a disservice to consumers by acting like it's their fault for being exploited.
knaik94 670 days ago [-]
The fundamental issue in that situation isn't about profit, it's about the definition of what is considered publicly accessible and what consent that implies.
I disagree with you on whether it should count as being exploited. I don't see fanfiction writers or professional impersonators as inherently exploitative. I understand that some people would disagree because there is a difference in scale. But using technology to mimic and, in some sense, replace human effort is exactly what makes it useful.
I believe this will shift how and why people value organic media. The standard of what makes content "good" will rise in the long term. When Stable Diffusion first came out, I compared the generated art to elevator music. I feel the same way about the output of LLMs. I might feel differently in a few years if models keep improving at their current rate, but that's not likely.
I agree that people should have more control over how their data is used, and I'd love to see this suit lead to stricter laws.
chasing 670 days ago [-]
I mean, it ingested all of the content from my blog. Without my permission. It's not a major part of their corpus of data, but still -- I wasn't asked and I don't really care to donate work to large corporations like that.
So the technology is cool, but I'm firmly of the stance that they cut corners and trampled people's rights to get a product out the door. I wouldn't be entirely unhappy if this iteration of these products were sued into the ground and forced to start over on this stuff The Right Way.
blueridge 670 days ago [-]
One thing I've been thinking about: it's only a matter of time before your friends load an AI assistant on their phone, and it devours every text message you have ever sent to that person, every photo you've shared together, every record of an in-person meeting. This makes me really uncomfortable.
4ggr0 670 days ago [-]
That's what has bothered me for years now in the context of contacts on smartphones. Maybe I'm making a mistake in thinking about it this way, but: if I refuse to share my contacts with, say, Instagram, while all of my friends share their contact lists, which include me, does it really matter whether I decline to share?
Another part that bothers me is that I have lots of different personas online. On most sites I use different usernames, and I wonder if there will someday be an AI that can match all the different online profiles to a single person, even if different usernames are being used.
Not on the phone yet, but on a Mac it could include iMessages.
taneq 669 days ago [-]
I wanna do that locally with an LLM, fine tune it on my entire sent email history and have it generate auto-responses to most of my emails. :D
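The data prep for that could look something like the following: pair each received message with the reply you actually sent, and emit prompt/completion records for supervised fine-tuning. The field names are just one common fine-tuning convention, not any particular vendor's API, and the sample emails are invented:

```python
import json

def to_finetune_record(incoming: str, my_reply: str) -> str:
    """One supervised example: the mail I received -> the reply I wrote."""
    return json.dumps({
        "prompt": f"Write a reply to this email:\n\n{incoming}\n\nReply:",
        "completion": " " + my_reply,
    })

# Each (received, sent) pair mined from your mail archive becomes
# one line of the JSONL training file.
record = to_finetune_record(
    "Can we move tomorrow's meeting to 3pm?",
    "3pm works for me, see you then.",
)
```

With a local model you'd then fine-tune on that JSONL file, so the autoresponder never leaves your machine.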
zapataband1 670 days ago [-]
* cuts to AT&T in the background hastily dumping texts into ChatGPT *
ChatGTP 670 days ago [-]
Every email you send to a Gmail-backed account is this.
munchler 670 days ago [-]
Anyone who reads your blog is "ingesting content" from it. That is presumably the purpose of your blog in the first place. Whether that content is used to train a human mind or an artificial one is probably not up to you as the author.
anktor 670 days ago [-]
Comments like this show up every single time a thread about LLMs or OpenAI comes up.
And they add nothing. I'm sorry, but saying "whether that content is used to train a human mind or an artificial one is probably not up to you" may be worse than saying nothing at all.
First, because there is real doubt about whether it's up to the authors of the content (IP laws, fair use, intent of use, and many things I'm unaware of), yet the comment gives no laws as an example or frame of reference.
And second, because it compares a human mind, which we know exists, to an artificial one, which implies:
1. An LLM is an artificial mind, or close to one, whatever that is (again, not defined).
2. If they were to exist, they would be both equivalent and treated the same as a human one.
The number of jumps in a couple of sentences, added to the uncertainty of how copyright would/will work, multiplied by the number of times I/we read this type of comment, is getting tiresome. And it's adding noise, hurting the signal-to-noise ratio.
munchler 669 days ago [-]
I think you’ve missed the point. Copyright laws prevent others from copying your work without permission. (Hence the name.) Copyright laws say nothing about who can read your work.
If you want to prevent a web spider from scraping your blog, use a captcha or robots.txt. Copyright law doesn’t apply to this scenario.
93po 670 days ago [-]
I disagree, and though the GP maybe didn't have this sentiment, my personal view is that intellectual property is a bunch of crap and just because there are laws around it in our capitalist society doesn't mean that the laws are moral/just/ethical/good. IP is constantly ingested and transformed which is exactly what LLMs are doing. The fact that ChatGPT can't even accurately reproduce data from its training (it gets basic facts/dates/quotes wrong all the time) really reinforces that it's not infringing on anyone's IP.
If you're tired of responding to these comments then stop. It's the internet, everyone is at different places in exploring topics and having discussions. Don't poo-poo on someone else's journey and instead move on with your day. There is no required reading (other than TFA) on hacker news.
scarface_74 670 days ago [-]
You don’t get to make information publicly available but then treat it as not publicly available. If you want your blog to be restricted, put it behind a login.
chasing 670 days ago [-]
Yes I do. I own the work I create, even if it's publicly available. I do get to decide what happens with it.
crazygringo 670 days ago [-]
> I do get to decide what happens with it.
No. Both legally and practically, you absolutely do not.
The only thing copyright law gives you is an exclusive right to sell it for a limited period of time, as a whole in its original form or similar -- and to transfer that right.
Regardless of your desires, anyone can reuse it under the conditions of fair use. They can copy parts of it for parody purposes. If they're not selling anything or taking away from your sales*, they can reproduce it verbatim for private purposes. And even if they are selling something, they can summarize it, quote from it, rephrase it, and so forth.
And you don't actually get to decide any of that.
* Edit: added "or..."
chasing 670 days ago [-]
So you’re saying I’m right except in some narrowly carved-out situations. And I agree with you.
crazygringo 670 days ago [-]
Nope. You said:
> I wasn't asked and I don't really care to donate work to large corporations like that... I do get to decide what happens with it.
And I said:
> No. Both legally and practically, you absolutely do not.
You think you get to decide whether large corporations can train on your work. I'm saying the law suggests you very much don't get to decide that.
chasing 670 days ago [-]
Read the comments you're replying to. I didn't comment on the legality of ChatGPT training on my content; I said I didn't like it. Regardless, the act of posting content publicly does not mean I give up my copyright claim. Yes, there are fair use situations. Training ChatGPT might be one of them, but I'm not seeing a lot of concrete information one way or the other, and I am seeing arguments that ChatGPT could be considered a derivative work, which would place OpenAI in violation of my copyright.
Send some links if you see some definitive case law sorting this stuff out.
meithecatte 670 days ago [-]
You are claiming that piracy is legal.
6bb32646d83d 670 days ago [-]
Anyone can read your blog and then post their own blog post using knowledge they learned while reading yours. ChatGPT "learned" from your blog the same way.
mrtranscendence 670 days ago [-]
Since the way GPT "learns" is not materially similar to how a human learns, I don't see why this talking point is particularly relevant. Nothing stops the courts from distinguishing between an AI and a human with regard to what may be permissible.
ldhough 670 days ago [-]
I agree, it seems like all the arguments that the use of data by AI should have no more restrictions than the use of data by humans hinge on the implicit (or sometimes explicit) assumption that human learning and machine learning are identical. While there are parallels, there also seem to be significant differences not only in how the learning is done, but also in outcomes for the person whose data is being used. And since a major purpose of IP, copyright, etc. is at least ostensibly to protect the creators of information from negative outcomes, I don't think the outcomes can be ignored when comparing human learning to ML.
myself248 670 days ago [-]
Anthropomorphizing that it "learned" is disingenuous and I expect better from the HN crowd.
If ChatGPT regurgitates verbatim or nearly verbatim, something it slurped up from OP's blog, is that not plagiarism? Where do you draw the line? Where would a reasonable person draw the line?
judge2020 670 days ago [-]
A human is both capable of reciting things from memory in an infringing manner, and of learning from experiences to create something new. Maybe we should tape people's mouths shut if they dare to violate copyright by reciting a copyrighted book word for word, or put them in a straitjacket if they recreate a copyrighted painting from memory.
cmrdporcupine 669 days ago [-]
Actually I fear that people that say this are doing worse than anthropomorphizing.
Often rather than claiming human aspects to the machine, they are going further, and claiming machine aspects to the human.
Using mechanistic analogies for explaining the human body or mind isn't new, but as machines become better and better at imitating humans, those analogies become more seductive.
That's my rant; the danger with 'AI' isn't so much that humans are enslaved by machines, but that we enslave each other -- or dehumanize each other -- with machines.
tucnak 670 days ago [-]
Like with everything in law, "intent" is paramount. Obviously it's not the trainer's, nor the end-user's goal to reproduce training set data verbatim; quite contrary, overfitting as such is undesirable.
mrtranscendence 670 days ago [-]
Intent only goes so far. If I continually but unintentionally reproduce copyrighted works verbatim, I could still face consequences, particularly if I did not show due diligence in preventing it from happening in the first place.
scarface_74 670 days ago [-]
But ChatGPT doesn’t spit out verbatim from the blog.
tumult 670 days ago [-]
Computers aren't people. Software isn't humans.
Workaccount2 670 days ago [-]
There is a difference between learning from your work and copying your work.
You are entitled to control its distribution and use. You are not entitled to control its influence and effects.
tumult 670 days ago [-]
I think you've made up an irrelevant argument. The work has been incorporated into a commercial product, intentionally, under the control of someone else. Software isn't humans that pay taxes, appear in court, have rights, etc.
Workaccount2 670 days ago [-]
No, the work has not been. The impression that the work leaves on a neural network has been though.
AIs are not massive repositories of harvested data. The models are relatively small (<20GB).
tumult 670 days ago [-]
A resized, smaller, or encoded version of an image is still subject to copyright. Calling an encoding an 'impression' is deceitful.
> A US court ruled this week that Google's creation and display of thumbnail images does not infringe copyright. It also said that Google was not responsible for the copyright violations of other sites which it frames and links to.
tumult 669 days ago [-]
Part of this ruling is about how the images are used -- Fair use -- not just that they were stored in a particular way. If Google was using the smaller versions of the images (thumbnails) in other ways, it could have been infringing.
> The Court said that Google did claim fair use, and that whether or not use was fair depended on four factors: the purpose and character of the use, including whether such use is of a commercial nature or is for non-profit educational purposes; the nature of the copyrighted work; the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and the effect of the use upon the potential market for or value of the copyrighted work.
shagie 669 days ago [-]
My take on copyrights and AI models...
Taking copyrighted material and using it to train a model is not a copyright infringement - it is sufficiently transformative and has a different use than the original images.
Note that AI models can be used for different things. A model trained to identify objects in an image has never had uproar about the output of "squirrel" showing up in the output text.
The model also, as a purely mathematical transformation of the original source material, does not get a copyright. If it needs to be protected, trade secrets are the tools to use to protect it. A model is no more copyright-worthy than taking an image and applying `gray = .299 red + .587 green + .114 blue` to it.
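For anyone unfamiliar with that formula: it's the standard BT.601 luma weighting for converting an RGB pixel to grayscale, and the "purely mechanical transformation" point can be sketched in a few lines of Python (illustrative only, not code from the comment):

```python
# The "gray = .299 R + .587 G + .114 B" transformation: BT.601
# luma weights mapping an RGB pixel to a single grayscale value.
def to_gray(red: float, green: float, blue: float) -> float:
    return 0.299 * red + 0.587 * green + 0.114 * blue

print(round(to_gray(255, 255, 255)))  # 255: white stays white
print(round(to_gray(255, 0, 0)))      # 76: pure red becomes a dark gray
```

The weights sum to 1, so the output stays in the same 0..255 range as the inputs.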
The output of a model is ineligible for copyright protection (in the US - and most other places).
The output of a model may fall into being a derivative work of the original content used to train the model.
It is up to the human, with agency in asking the model to generate certain output to be responsible for verifying that it does not infringe upon other works if it is published.
Note that the responsibility of the human publishing the work is not anything new with an AI model. It is the same responsibility if they were to copy something from Stack Overflow or commission a random person on Fiverr... its just that those we've overlooked for a long time - but it is similarly quite possible for the material on those sources to be copyrighted by and/or licensed to some other entity and the human doing the copying into the final product is responsible for any copyright infringements.
Saying "I copied this from Stack Overflow" or "I found this on the web" as a defense is just as good as "Copilot generated this for me" or "Stable diffusion generated this when I asked for a mouse wearing red pants" and represents a similar dereliction on part of the person publishing this content.
Workaccount2 669 days ago [-]
It's none of those things; these models train on petabytes of data. They store relationships of objects to each other, not the objects themselves.
chasing 670 days ago [-]
Actually, people have been successfully sued for plagiarizing other works because they had internalized it and accidentally regurgitated it. So. The fact that content runs through a human brain doesn’t necessarily cleanse it from copyright concerns.
Workaccount2 670 days ago [-]
There is no "actually" because you are still addressing distribution. It wouldn't be hard to have another AI that analyzes outputs for copyright infringement and culls them as necessary.
Would that satisfy you?
blfr 670 days ago [-]
To some extent. Others can ingest your work, quote it, talk about it, criticize it, summarize, etc.
spaceman_2020 670 days ago [-]
If I read your blog and used its data along with my own knowledge to create a course, would that be plagiarism or copyright violation?
Levitz 670 days ago [-]
>You don’t get to make information publicly available but not publicly available.
But we do? Open sourcing something with caveats is common. This code is public BUT not for commercial use. This code is public BUT you must display attribution etc.
Sure, blog posts are unlicensed (as far as I know), but the idea of something publicly available being held to restrictions is nothing new.
H8crilA 670 days ago [-]
Do you allow commercial employees to read the code and incorporate knowledge obtained from the code into their brains?
DirkH 670 days ago [-]
This is a fantastic point. I can legally go pick up any strictly copyrighted book at a store and read parts of it for free, which I will then have learnt and have in my brain to share with anyone else. If I happen to have a superintelligent brain, I can potentially gain a lot more and make a lot more inferences from this one outing, and consequently add a lot of value to the others I share my info with.
But telling me it is illegal to share what I learnt because the original source is copyrighted... doesn't sit right with me.
H8crilA 670 days ago [-]
Copyright just doesn't protect such cases. There's a funny exaggeration that is very illustrative: copyright protects the bugs in the code. I.e. the specific way in which code was written. Reading it and getting inspired was never meant to break copyright.
What protects particular solutions is patents. For example if someone were to obtain a patent for computing GCD of large integers the usual fast way, well then everyone else would have to use a different solution.
This analogy to someone reading a book, perhaps peppered with lots of legalese to the point of being hardly recognizable, will definitely be used in courts at some point. And I can't see how it wouldn't stand as a valid argument.
chasing 670 days ago [-]
If you go read a book, memorize it, write it down later in a substantively similar form, and share it freely or sell it — yes, you might get into copyright trouble. It has happened before and it is at best a tricky gray area.
If you pick up a book and learn a fact, then yeah, you’re allowed to share that fact.
It’s weird that this topic keeps devolving into a form of “so what, it’s illegal for me to learn things?” Because: no, it’s not. And: You and a piece of software are treated differently under the law. You have a different set of rights than ChatGPT.
DirkH 669 days ago [-]
Everything about ChatGPT seems to be gray areas and "mights", which is probably why we are where we are.
H8crilA 670 days ago [-]
> You have a different set of rights than ChatGPT.
Gods, no. Where did you get that from?
Levitz 669 days ago [-]
Are you a human being? A citizen of some country? If so you definitely have a different set of rights than ChatGPT.
Those might not be a problem regarding this specific case, but the case can easily be made that it ought to be.
DirkH 669 days ago [-]
I don't think ChatGPT has any rights yet... And a person using it has the exact same rights as someone not using it.
H8crilA 669 days ago [-]
?
I don't understand your point. Do you think it makes any difference whether I use my laptop, or a pen, or ChatGPT to violate copyright?
cool_dude85 670 days ago [-]
Show me where ChatGPT's brain is and your comparison will become relevant.
H8crilA 670 days ago [-]
I mean in the floating point / quantized numbers and the connections that make up the model? I'm not sure I follow; the analogy to the human brain has always been obvious, it's even in the name (artificial neural network)...
mrtranscendence 670 days ago [-]
The analogy is just that: an analogy, and a very imperfect, misleading one. The working of the brain may have motivated early research, but GPT (as instantiated in hardware) does not operate or learn in a way similar to a human brain.
Levitz 670 days ago [-]
Yes, it's completely unfeasible to make a license to control that.
On the other hand, it's completely feasible to make a license that stops someone from training their model with some piece of info, is it not?
jacquesm 670 days ago [-]
Why is it that people keep on flogging dead horses?
sigmoid10 670 days ago [-]
That's not how copyright works.
cmrdporcupine 670 days ago [-]
Another day, another person on HN showing us how they don't understand the difference between Public Domain and Open Source or Copyleft etc.
And regardless -- the problem is that expectations of how content can be consumed are now fundamentally violated by the automation of content ingestion. People put stuff up on the Internet with the expectation of its consumption by human minds, which have inherent limitations on the speed and scale at which they can learn from and reproduce things, and those humans are also legally liable, socially/ethically obligated, etc.
Now we have machines which skirt the limits of legality, and are able to do so on massive scale and without responsibility to society as a whole.
Different game now.
scarface_74 670 days ago [-]
> People put stuff up on the Internet with the expectation of its consumption by human minds
Then people obviously aren’t aware that bots have been indexing web pages and showing summarized information without going to the web page for three decades.
cmrdporcupine 670 days ago [-]
I think it's a bit intellectually dishonest to claim an equivalence between content indexing for search engines and machine learning for LLMs. They might share an underlying harvesting technique, but their uses -- indexing for information accessibility vs automatic content production are qualitatively different.
Further, almost every site has had an e.g. robots.txt which has permitted content harvesting only for certain accepted purposes for a couple decades now. So clearly people already had a sense of how they wanted their content harvested and for what purposes.
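For illustration (this is a sketch, not something from the thread): a site that wants to permit search indexing but refuse AI training crawlers can express exactly that per-purpose policy in robots.txt, and it can be checked with Python's standard `urllib.robotparser`. "GPTBot" here is just an example crawler user-agent token:

```python
from urllib import robotparser

# Hypothetical robots.txt: block an AI training crawler ("GPTBot"
# used as an example token) while allowing everything else.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# The AI crawler is refused; a generic crawler is permitted.
print(rp.can_fetch("GPTBot", "https://example.com/blog/post"))     # False
print(rp.can_fetch("Googlebot", "https://example.com/blog/post"))  # True
```

Of course, robots.txt is a voluntary convention: it expresses the site owner's intended purposes, but nothing technically forces a scraper to honor it.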
scarface_74 669 days ago [-]
How is it not content production when I search for something on Google and get a box that shows similar questions and summarizes the answer?
So you’re okay with Google making money off of your content. But not OpenAI?
taneq 669 days ago [-]
Your blog which you posted online for anyone to download and read?
Don't get me wrong, this is a grey area where copyright laws and general consensus haven't caught up with new technology. But if you voluntarily stick something up online with the intent that anyone can read it, it seems a bit mean to then say "wait, no, you can't do that" if someone finds a way to materially profit off it.
TechBro8615 670 days ago [-]
You sent your content to them in response to their HTTP requests. That sure looks like affirmative consent to me.
chasing 670 days ago [-]
You’re right! Just like Disney+ did when I watched Star Wars the other day. I’m excited to know Disney has consented to me posting Star Wars in its entirety free online.
TechBro8615 670 days ago [-]
Can you make ChatGPT produce the content of your blog post "in its entirety?" You can share the URL to a ChatGPT conversation, so it should be easy to prove the copyright violation by replying to me with two links: one to your blog post, and one to the ChatGPT conversation containing an unauthorized copy of it.
archontes 669 days ago [-]
If you put your content on a billboard, what expectation should you have that you can control who reads it?
andsoitis 670 days ago [-]
That’s the economics of (non-symbolic) AI. To work, it needs humans to create stuff for free.
Putting it more bluntly, it is somewhere between a parasite and a slave driver.
ben_w 670 days ago [-]
It doesn't require humans to work for free. That's been a common default MO since everyone looked at Google making a search index and thought to themselves "if they're doing it, surely so can we", but there are data sets made by paying people.
mrtranscendence 670 days ago [-]
There are such datasets, and AI companies absolutely pay to have data curated. But I suspect it would be just unimaginably expensive to create a dataset from scratch with enough tokens to feed a model with hundreds of billions of parameters, all the while paying every participant fairly.
ben_w 670 days ago [-]
"fair" is somewhat undefined, as the fair-looking number for being paid for effort can be very different to the fair-looking number for being paid for the resale value of the end product on an open market.
I wonder what would an LLM trained on Google code and internal documents look like?
zug_zug 670 days ago [-]
Hard to understand how this is a crime, or how they came up with 3 billion dollars of damage.
Seems like if it's legal for a person to do it should be legal for software to do for the most part.
data-ottawa 670 days ago [-]
I can personally memorize and recite copyrighted works all I want, but when ChatGPT does it then it’s in a commercial context and they’re liable to be sued for infringement.
If you ask ChatGPT the rules for D&D, the private sourcebooks are all in there.
pessimizer 670 days ago [-]
> I can personally memorize and recite copyrighted works all I want,
Whoever told you that is lying to you. You are not legally allowed to personally memorize and recite copyrighted works all you want, any more than you're allowed to personally memorize, write down copyrighted works, and distribute them as much as you want.
All piracy is a process of computer-assisted remembering and reciting.
DirkH 670 days ago [-]
Last I checked I can legally enter any bookstore with copyrighted books, pick up a book, and read it. And then tell anyone what I read.
I can't go write and commercialize what I learnt directly, but I'm not breaking the law by quickly seeing how some book I didn't buy ends so I can talk about it at a party. Then everyone knows how it ends, which might affect whether they want to buy said book and upset the author. But, tough shit: what I did was legal. I can even use the ending as one of dozens of inspirations for my own book, where the end result is transformative enough that the sources are unrecognizable. And if I had learnt about the endings of a dozen books without buying them, I didn't break any laws, even though I am now commercializing something inspired by them all.
anktor 670 days ago [-]
Maybe it would be useful to define what "tell anyone what I read" means. Because if you mean one-to-some in a room, then most likely yes. If you use any type of broadcasting, then most definitely no. Try reading aloud a script from a recent movie on Twitch/YouTube/radio/TV and see whether it gets DMCA'd. Same for books and songs, I guess? Not sure.
DirkH 669 days ago [-]
I just mean I can socialize and talk about it without the police telling me that's illegal because I didn't pay to be allowed to talk about said plot points
mrtranscendence 670 days ago [-]
Well, you could memorize and recite copyrighted works all you want, as long as you're doing it in an empty room without anyone listening.
criddell 670 days ago [-]
Would you say reading a book to my kids before bed is illegal?
mrtranscendence 670 days ago [-]
Sorry, I was being a little flip. There's more to it than that, of course. Is the performance sufficiently transformative, is it educational, is it non-profit, etc.
brookst 670 days ago [-]
Not how copyright works.
Being non-commercial is not an automatic fair use exception. Being commercial does not preclude fair use. And rule concepts are not copyrightable, only the specific expression. Rules may have other IP protection, including patents.
pessimizer 670 days ago [-]
> Rules may have other IP protection, including patents.
That's not even true in the US anymore. You'd have to convert those rules into some sort of device, or argue that the game is a business method.
solardev 670 days ago [-]
Isn't that because a performance is different from the creation of a permanent copy? If you published an article and included a significant chunk of the copyrighted work, you'd be liable too unless it fell under fair use. Doesn't matter if you did it or ChatGpt. Commercial use would be one consideration, but not the only one, for both of you.
The rules of games cannot be copyrighted either. The artistic elements can be trademarked, but if ChatGpt merely explains the rules to you in different ways, that isn't infringement either.
hughesjj 670 days ago [-]
> and recite copyrighted works all I want
...wait, isn't that false? legitimately asking.
or is it because it was done by a corporation that makes it illegal?
im thinking of how restaurants dont sing happy birthday and fair use restrictions etc
bena 670 days ago [-]
Like most things, it depends.
If I recite them to myself, in my home, it's fine. If I do it at a gathering at my house where we're playing D&D, fine. If I do it as a performance, in front of a crowd, or as a recording, now I'm no longer fine. Context matters in copyright cases. Not to mention, to claim fair use, you do have to admit you violated copyright. Fair use is just an allowed violation.
As to Happy Birthday, that's actually OK for them to do now. The person/group that held the copyright to Happy Birthday was found to have not actually held it in the first place. Happy Birthday is actually an older song called "Good Morning to All". Swap "Good Morning" with "Happy Birthday" and "children" with "dear [PERSON]" and you have the lyrics. This was not deemed a substantive change. And since the copyright on "Good Morning to All" has lapsed, Happy Birthday is in the public domain.
data-ottawa 669 days ago [-]
Yes, I was overly broad and there are restrictions on saying/copying memorized material.
prng2021 670 days ago [-]
I don’t get your point. Whether you use copyrighted material in commercial context or not always matters. That’s one of the most important aspects of different open source licenses.
rpdillon 670 days ago [-]
This is not true for copyright law (the 4-factor test[0]) or for OSI licenses (they almost universally place no restrictions on commercial use). The only exception that comes to mind right now is the Creative Commons NC, which is generally recognized as being unsuitable for software[1].
And CC-NC isn't considered an open source license by the FSF or OSI anyway. And IMO the NC clause is pretty much impossible to define for non-trivial use, as even Creative Commons has basically acknowledged. Not sure non-derivatives is a lot better, especially given remixing was one of the original drivers behind CC, but it's at least less controversial.
prng2021 670 days ago [-]
Thanks you’re right. I was thinking about the license changes Elastic made to stop cloud providers from redistributing their products as a managed service.
ghaff 670 days ago [-]
No OSI-approved open source license prohibits the commercial use of software. In fact, the Open Source Definition expressly forbids discriminating on the basis of how the software will be used.
brookst 670 days ago [-]
A license does not redefine copyright law.
I can give you a rock that I own, which I hope we all agree is not copyrightable, and ask you to sign a license that you will keep it indoors. If you put it in your yard, you are breaking the license and potentially liable. This has nothing to do with copyright.
goatlover 670 days ago [-]
Has this been decided by the courts?
codekansas 670 days ago [-]
> including personal information obtained without consent
Obtained from (check notes) public internet forums
> For the 16 plaintiffs, the complaint indicates that they used ChatGPT, as well as other internet services like Reddit, and expected that their digital interactions would not be incorporated into an AI model.
You've got to be incredibly naive if you think public Reddit data isn't used to train ML models, not least by Reddit themselves
pessimizer 670 days ago [-]
Or maybe when you started posting on reddit, LLMs hadn't been invented yet. This is true for 99.9% of the people who post on Reddit.
codekansas 670 days ago [-]
People have been training ML models on data scraped from Reddit since at least 2015 [1], back when there were fewer than a million users.
LLMs were invented at least five years ago (BERT), though you could make the case for a few years earlier. My guess is that the majority of Reddit users are new since then, not 0.1%.
pessimizer 670 days ago [-]
Your guess is that the majority of Reddit users have joined since 2018? 1) I do not think that is correct, 2) the mere existence of LLMs isn't public awareness about how LLMs are trained, and 3) you know exactly what I'm saying and that 99.9% might be slight hyperbole.
jefftk 670 days ago [-]
1: Reddit has ~1.6B monthly active users, compared to 0.3B in 2018. [1] So 2x user growth seems more likely to me than not.
2: You're the one who went with "invented" ;)
3: I know you're exaggerating, but I think you think you're exaggerating much less than you actually are.
> Your guess is that the majority of Reddit users have joined since 2018?
It's not really important to the debate around unlicensed use of copyrighted works to train AI models, but it wouldn't surprise me at all if the majority of Reddit users have joined since 2018. It's tough to get reliable active user counts, but they seem to have risen substantially over the past five years.
It also wouldn't surprise me if the majority of Reddit users were indeed from prior to 2018, but at the very least > 2018 would be a very substantial minority.
samstave 670 days ago [-]
My account(s) are 17 years old on reddit.
jefftk 670 days ago [-]
Yes? Mine is nearly that old. But we are very clearly the minority!
lionkor 670 days ago [-]
Like operating motor vehicles, carrying guns in some US states, suing people and companies, submitting content to Wikipedia, writing children's books, and writing and voting on laws?
Surely, there is some pretty large subset of things where "if it's legal for a person to do it should be legal for software" does not hold up?
So how about the default is "not allowed"
memefrog 670 days ago [-]
Hard to understand how someone can read the word 'sued' and think it has anything to do with criminal law.
safety1st 670 days ago [-]
Scraping is a bit of a legal gray area though. If you were to go scrape 300 billion words from the Internet, you probably would be committing a crime somewhere. Especially if you then reproduced some of those words verbatim for paying customers as ChatGPT does...
I am sure OpenAI thought all this through, so I can only assume they said "fuck it let's pull an Uber and do this anyway." We are in for lots of interesting legal headlines
pessimizer 670 days ago [-]
> Seems like if it's legal for a person to do it should be legal for software to do for the most part.
If you're going to make a claim this strong, you should expand on it. Should software be able to have custody of children? Should it be able to kill in self-defense? Should it be able to make 14th amendment claims? Exactly what part of the case (other than the damage claim) is hard to understand?
amelius 670 days ago [-]
> Seems like if it's legal for a person to do it should be legal for software to do for the most part.
It's legal for me to look out of the window and watch my neighbor go to the supermarket.
It's _not_ legal for me to build an automated surveillance system that tracks everybody on the street 24×7 and stores everything into a large database.
hbn 670 days ago [-]
I'd say this is more like if someone automated taking pictures of every flyer and missing pet poster people put up on a lightpole and saved it to a database.
There's more deliberate action when you post something on a public online forum than just existing in a place outside of your house. Especially considering you've always had the option to use reddit anonymously anyway.
samstave 670 days ago [-]
>use reddit anon....
Read, yes - post no.
And - you can no longer create an account that is not tied to an email...
hbn 669 days ago [-]
OpenAI didn't have access to every poster's email when they crawled reddit. If you're making posts or have an account name that are easily tied back to your personal identity, that's on you. But you could make an account with any random username you wanted, that keeps you anonymous as far as OpenAI is concerned.
samstave 669 days ago [-]
My point was only that 17 years ago - and for more than a decade after - reddit required no email address as a requisite to create an account... so it was truly anon... then they tied all (new) accounts to emails, which makes it a trivial click for surveillance to ID your reddit account...
codetrotter 670 days ago [-]
FTA:
> The lawsuit is seeking class-action certification and damages of $3 billion – though that figure is presumably a placeholder. Any actual damages would be determined if the plaintiffs prevail, based on the findings of the court.
paddw 670 days ago [-]
They're fishing.
samstave 670 days ago [-]
Likely hoping for whatever settlement they can squeeze out of OpenAI as the first such suit against them...
They picked 3B hoping to get several million...
ben_w 670 days ago [-]
If it genuinely makes them redundant and unemployable, a few million each seems "fair" in certain ways.
But that is a moral point, not a legal one; IANAL and can't say anything valuable about the legal merits.
Ideally AI makes us all redundant and the money stops mattering anything like as much, similar to how owning land stopped mattering anything like as much when the industrial revolution happened.
Regardless, I think this is a policy question rather than a legal question, even if this fight happens to be in a court.
anigbrowl 670 days ago [-]
Fishing expedition. Will probably get thrown out because no particular injury can be enunciated. OpenAI scraped HN as well, and I don't consider my HN posts private because anyone can come here and read them, including artificial intelligences.
shubhamgrg04 670 days ago [-]
If we dissect this case, it seems to revolve around two central questions: what constitutes 'public' data and to what extent can AI models leverage such data without infringing upon individual privacy. This lawsuit may well set a significant precedent in defining the boundaries of AI ethics and data privacy.
hospitalJail 670 days ago [-]
When this happened to Stable Diffusion, it was easy for me to consider it a necessary evil to progress humanity.
When this happens to closedAI, it just seems like a profit grab.
Not that it changes the legality of it. Just optics.
Wonder if that matters in court.
cheschire 670 days ago [-]
First they came for the graphics artists, but I did not speak out because I was not a graphics artist.
Then they came for the writers, but I did not speak out because I was not a writer.
Then they came for me, and there was no one left to speak for me... well, except ChatGPT.
gl-prod 670 days ago [-]
As a language model I cannot speak for you. But I can help you express your thoughts and views. I can generate words and sentences in many ways.
111111IIIIIII 670 days ago [-]
I did not speak out because copyright is farcical nonsense that fetishizes the profit motive at the expense of humanity.
pessimizer 670 days ago [-]
That's why copyright violation should be brutally cracked down on when the copyrights of Microsoft are violated, and lawsuits against Microsoft for intentional and widespread copyright violation should be laughed off. Because capitalism is bad.
edit: corporate LLMs have pulled the "one death is a tragedy, ten thousand deaths are a statistic" ploy off fully. If you want people to question whether you're even violating copyright, make sure you violate all of them at the same time. They'll just decide that you're an act of god and not covered under earthly laws.
sebzim4500 670 days ago [-]
>edit: corporate LLMs have pulled the "one death is a tragedy, ten thousand deaths are a statistic" ploy off fully. If you want people to question whether you're even violating copyright, make sure you violate all of them at the same time. They'll just decide that you're an act of god and not covered under earthly laws.
I don't think this is relevant. If OpenAI had trained a model on just one copyright holder's content, it would likely not be different legally, even if the model would perform much worse.
111111IIIIIII 670 days ago [-]
Primacy of capital is literally the culprit of the inequality you're complaining about, and the reason you cannot win short of reorganizing society.
hospitalJail 670 days ago [-]
It seems people prefer power distributed by capital, rather than military might or factionalism/leaders/politics.
Not that all capital is distributed by merit, plenty of people used military might or factionalism/leaders/politics to obtain disproportionate amount of capital.
But if you are against the last 2 happening, I don't see what you expect a reorganization of society to accomplish since you are going to get a power structure of factionalism/leaders/politics taking priority. (Sorry bud, no an-com utopia ever existed, they all had factionalism/leaders/politics, thus defeating the entire purpose of removing class.)
I think most of us think we can capture/retain power easier with money, than having to climb up inter-party politics.
111111IIIIIII 670 days ago [-]
Capital primacy is maintained by the capitalist state, i.e. the monopoly on violence. This is literally military might.
I don't necessarily disagree with your later points. I do, however, disagree with giving up.
hospitalJail 670 days ago [-]
>Capital primacy is maintained by the capitalist state, i.e. the monopoly on violence. This is literally military might.
At least it's equitable (based on value of output); ofc there are legacy issues as well.
Some demagogue can sway the masses and take it all if not for capital. That demagogue could be Trump or Stalin.
Know the consequences of what you are advocating for.
111111IIIIIII 670 days ago [-]
> At least its equitable (based on value of output), ofc there are legacy issues as well.
It's not. By definition, it's based on control of capital. That's why it's called capitalism. In other words, those aren't "legacy" issues; they are literally the system as designed.
hospitalJail 669 days ago [-]
That is too simple; people can earn their own capital as well.
Since utopia is impossible, it's a choice between:
>Capitalism, where people can typically pull off the American dream in their lifetime.
or
>Let politics determine how many material things you get
The latter seems especially scary if you are familiar with history.
111111IIIIIII 668 days ago [-]
Your logic is just profoundly shortsighted.
Capitalism follows a very simple algorithm. In a capitalist economy, capital always accumulates, with all exceptions being precisely that: exceptions. Are you defending the exceptions or the rules?
Realize there was a very long and quite recent time when capitalism was impossible. By your logic, we should reinstate the divine right of kings.
numbsafari 670 days ago [-]
It’s okay, I’m sure everything is going to be fine when Microsoft and ChatGPT hot mic your next doctor appointment.
Assuming Google, Amazon, etc haven't already been doing exactly that.
dmix 670 days ago [-]
That says it's using GPT-4, but it's not clear whether it has anything to do with feeding back into ChatGPT.
> Nuance has strict data agreements with its customers, so patient data is fully encrypted and runs in HIPAA-compliant environments
Additionally, Epic seems to already be storing these clinical notes in databases, and Nuance, which Microsoft owns, has already technically been a 'hot mic' in these same doctors' offices for some time. The new offering is an AI draft-note generator.
I'm personally skeptical that model output would suddenly be under different rules than other voice-to-text AI model output.
lionkor 670 days ago [-]
Discord rolled out a ChatGPT-based bot that can be used in (and thus can read) all private conversations. Not surprised there are issues with it.
replwoacause 670 days ago [-]
A tangential question...but does anyone know what software is used to generate legal documents that look like the PDF linked in the article? I’ve played with LaTeX templates a bit, but I seriously doubt law firms are futzing around with LaTeX for documents as complex as this. They must have some software that produces this formatting.
So it was likely made in Word and exported to PDF. (One can in any case guess from the "look" of the paragraphs that they're not using anything like Knuth–Plass line-breaking, which rules out things like *TeX and InDesign.)
calny 670 days ago [-]
Yep it's Word exported to pdf. Source: Am attorney, do this all the time. You write it up in Word, save as pdf. Then upload it to the court website, which (in federal court, at least) puts the case number in blue text at the top for the officially-filed version.
The 1-28 pleading numbers on the side are annoying. They're specific to courts in California and a few other jurisdictions, and the rules of court require them. But many other courts don't have them, and they only help to cite specific lines within pages; eg "Complaint 5:4-9" means "Complaint at page 5, at lines 4 to 9". It's occasionally useful for court filings like this, but more useful for court/deposition transcripts of testimony to show precisely where a witness said something.
Related: I tried building an RNN to generate legal pleadings back around 2018/19 and gathered a bunch of docs like this from courts across the country as training data. Processing text with those pleading numbers was a pain, so I built a CNN to classify whether a document had pleading numbers or not, which affected downstream processing. Probably the wrong approach in a bunch of ways, but I was just learning.
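For the narrower sub-problem of detecting pleading-paper line numbers, a plain-text heuristic might get surprisingly far before reaching for a CNN. A hypothetical sketch (the function name and threshold are invented, and it assumes text extraction keeps the 1-28 margin numbers at the start of each line):

```python
import re

def has_pleading_numbers(page_text: str, threshold: int = 20) -> bool:
    """Guess whether a page uses California-style pleading paper by
    counting lines that begin with a bare number in the 1-28 range."""
    hits = 0
    for line in page_text.splitlines():
        m = re.match(r"\s*(\d{1,2})\b", line)
        if m and 1 <= int(m.group(1)) <= 28:
            hits += 1
    return hits >= threshold

# A fake page whose 28 lines each start with a margin number.
sample = "\n".join(f"{n}  Some allegation text." for n in range(1, 29))
print(has_pleading_numbers(sample))  # True
```

Pages that fail a cheap check like this could then be routed to a heavier classifier, reserving the learned model for the ambiguous cases.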
replwoacause 670 days ago [-]
Oh cool, then that settles it. Thank you!
frakt0x90 670 days ago [-]
In my sample size of one, an attorney I talked to said that Microsoft Word was the most important software he and his colleagues used. So my guess is they're just really good with Word.
replwoacause 670 days ago [-]
Thanks! That surprises me but maybe it shouldn’t. I figured it was some purpose-built software for attorneys.
ghaff 670 days ago [-]
Word has pretty good revision tracking and support for footnotes which are probably the main things lawyers use more than most average people do. And remember that lawyers communicate a lot with clients, etc. too so there would be a lot of friction associated with a non-standard tool.
When I worked on an expert witness report for a big law firm we just used Word.
bushbaba 670 days ago [-]
Yep that pdf can be made using ms word
roel_v 670 days ago [-]
Lots of lawyers use WordPerfect. Version 5.1. There's plenty of evidence online if you don't believe me, I wouldn't believe me if this was the first time I heard it.
replwoacause 670 days ago [-]
Interesting, I’ll do some searching on this. Thanks!
670 days ago [-]
intotheabyss 670 days ago [-]
Also tangential, but would it be ironic if portions of the legal documents were written by ChatGPT?
What I noticed is that the privacy setting which should prevent OpenAI from using my data for training purposes was already deleted twice, and I had to set it again. No idea what that means, or whether the data I entered before I noticed the setting was gone is now owned by OpenAI. Anyway, it is obvious that privacy is not a priority for them. Also, it's known that YC companies are informally told they should not worry about privacy while scaling up. OpenAI is not a YC company, but its culture is definitely derived from it.
CuriousSkeptic 670 days ago [-]
As I understood it, that setting is an opt-out cookie, so it must be set in every new browser session.
Seems to be a blatant violation of the GDPR, so I assume they'll be fined for it sooner or later and forced to clean up the training data anyway.
scrollaway 670 days ago [-]
How is that a GDPR violation?
GDPR doesn’t prevent opt outs of this kind of thing.
CuriousSkeptic 670 days ago [-]
In the sense that consent requires active opt-in. The passive “opt-in” by failing to set the cookie doesn’t count as consent.
So if they’re claiming they have the right to process data on the legal basis of consent, and they claim the absence of that cookie constitutes that consent, then they have no legal basis, and are thus in violation of the law.
jwx48 670 days ago [-]
I am not a lawyer, just a sysadmin; but with that said, the linked PDF of the complaint is absolutely fascinating to me. It's worth it (to me) for the list of resources it cites.
seaerkin 670 days ago [-]
Do we think this is related to media platforms seemingly walling themselves off? Requiring accounts to view content, removing API access, and so on. It seems that if they can silo the data off and make it difficult to access at scale, then they become the gatekeepers of the data and can control usage and pricing.
BlueTemplar 670 days ago [-]
No, this always happens with platforms once they feel they have attracted enough users: for instance, it happened with Twitter in 2013; see also what happened to XMPP after Google and Facebook adopted it, or Reddit going closed source in 2017...
janvanlooy 670 days ago [-]
We talk to a lot of companies, and many want to start using generative AI but are afraid of litigation. As long as it is not clear which data a given model was trained on, and whether that data was explicitly and permissively licensed by its owners, you cannot be sure what might happen.
We are actually working on a tool to create billion-scale, free-to-use Creative Commons image datasets and prepare them for training models like Stable Diffusion. There is a blogpost about it here: https://blog.ml6.eu/ai-image-generation-without-copyright-in...
barathr 670 days ago [-]
Rather than there being lawsuit after lawsuit of this sort, we wrote an op-ed this morning that says there should be a simple, compulsory licensing fee that AI companies pay to the public -- something we called the AI Dividend: https://www.politico.com/news/magazine/2023/06/29/ai-pay-ame...
jbarrow 670 days ago [-]
The order of magnitude of suggested pricing is really interesting: $0.001/word is significantly more expensive than, say, OpenAI's pricing of GPT-3.5-turbo ($0.002/1k tokens, ~750 words, so ~$0.000003/word, assuming I got my zeros correct). So this would increase the cost of running GPT-3 by about 300x.
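The zeros do check out; a quick sanity check of the arithmetic, using only the figures quoted above:

```python
# Proposed AI Dividend fee vs. GPT-3.5-turbo inference cost, per word.
fee_per_word = 0.001                 # $0.001/word, as suggested in the op-ed
price_per_1k_tokens = 0.002          # OpenAI GPT-3.5-turbo: $0.002 per 1k tokens
words_per_1k_tokens = 750            # rule of thumb: 1k tokens is roughly 750 words

inference_per_word = price_per_1k_tokens / words_per_1k_tokens
ratio = fee_per_word / inference_per_word

print(f"inference cost per word: ${inference_per_word:.7f}")  # ~$0.0000027
print(f"fee is ~{ratio:.0f}x the inference cost")             # ~375x
```

So "about 300x" is, if anything, slightly conservative; the exact multiple is ~375x under these assumptions.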
In terms of implementation, I wonder about a few things:
Do models trained on more data have to pay more? LLaMA was trained on 1.5T tokens, the original GPT-3 was trained on ~300B tokens. And this is only partially related to model quality, LLaMA 13B and LLaMA 65B were trained on the same data, but the 65B model is better. What's the incentive to ever use the 13B model, if the licensing cost is 100x-1000x the model inference cost?
Who defines a word? Each model uses a different tokenizer. I'm personally amused by the idea of a government-mandated tokenizer.
What about generations that never see human eyes? As an NLP researcher, I've generated millions of tokens for training and automatic evaluation purposes -- are those subject to licensing as well?
barathr 670 days ago [-]
Yeah, the idea is that it's much more expensive than current OpenAI pricing but much less expensive than what even a low-end marketing copywriter would charge per word. Its side effect would be to push such tools towards more valuable uses.
The idea is to keep it simple, so it wouldn't be based upon the specifics of training, just whether or not it used public data. Anything else would require companies to divulge trade secrets and that won't fly. And words are defined here as, well, words -- English words. There'd be a separate fee per pixel/voxel, and then a catchall for non-language/non-image models.
codekansas 670 days ago [-]
1. How would this not make tools like Github Copilot exorbitantly expensive? Why should I have to pay a tax to everyone else in the United States to use something that was disproportionately trained on my own data?
2. Given that the internet is global, is every country supposed to make their own versions of this? Will I have to pay the EU tax to use models that might have been trained on data that Europeans posted online?
barathr 670 days ago [-]
To your first question, it would incentivize training of models on one's own data exclusively -- companies could train something like Copilot on their own code, for instance. To your second question, there's no way to have an international policy like this so yes each jurisdiction would do it independently -- just as they do with thousands of other similar things.
codekansas 670 days ago [-]
I don't think a model trained on a single company's data would be nearly as helpful as a model trained on all publicly licensed code on the internet. But suppose it were...
What if I'm not a massive corporation with millions of lines of code to train on and I want to pay for an AI coding assistant? Doesn't this make it effectively illegal for me to purchase such a product for a reasonable price when big companies will presumably be able to use it without paying the tax?
Another situation - let's say you're a company that contributes heavily to open source, but also accepts external contributions. Could Facebook train a model on the React codebase, for example, without having to pay the AI tax?
Another situation - suppose I start an LLM coding assistant and sell it to my friend. Presumably I don't have to pay the tax as a "low revenue" company. Then I get acquired or get some huge seed round and suddenly my customers have to pay the AI tax. Doesn't this just nuke all my customers?
Anyway, as a software engineer, I personally want people to use my code for whatever they want to use it for, without having to pay me for it. I indicate that by using an MIT license. Why throw that precedent out the window?
barathr 670 days ago [-]
The policy would exempt all except big companies from the fees. So if you set up your own, you don't pay. And the effect of the revenue threshold creating an advantage for small businesses is commonplace in policy across the board -- SMBs don't have many of the same costs and obligations as larger companies.
And this would not prevent you from explicitly licensing your code or writing to let people train on it. But what it would do is say that if someone didn't explicitly license it, then it is covered under the policy.
codekansas 670 days ago [-]
Also regarding international policy - good luck getting Chinese citizens to pay the US AI tax. Effectively you'd be nerfing anyone under US jurisdiction
barathr 670 days ago [-]
Not really -- it's the same as selling any service into the US. Yes, people cheat on, say, sales tax, just like Amazon did in the early years, but eventually, once big enough, companies end up having to adhere to the policy.
Workaccount2 670 days ago [-]
Can't wait for the deluge of AI generated content dumped en masse on the internet purely to harvest "AI Dividends".
barathr 670 days ago [-]
The dividend isn't paid to generated content but for generated content -- so generating content (using say ChatGPT) means you're paying into the AI Dividend fund not receiving money from it.
ravenstine 670 days ago [-]
Does every business in the 21st century need to be some form of low-level scam in order to make headway and grow enough to satisfy VCs or investors?
tsunamifury 670 days ago [-]
Yes, that's where the disruption comes from. In all seriousness, that's the advantage left in an efficient market.
Look at all the sharing-economy players: it boils down to offloading the risk, labor, and debt while keeping the margin.
gumballindie 670 days ago [-]
It does seem like it. Playing by the rules limits growth. Stealing, cheating, lying, and manipulating are the endless money cheat, particularly in societies where most people abide by the rules. Once they hit it big, they find willing politicians to adjust laws in their favor. Rinse and repeat.
elforce002 670 days ago [-]
Only 3? They should go for the whole 10, and settle for 1.
Now that the gates are open, we'll probably be entering the "free money" cycle soon.
locallost 670 days ago [-]
I wonder if we'll see a license for content that forbids its use for training of language models.
zer0c00ler 670 days ago [-]
Maybe as a result OpenAI will have to publish how they trained and what data was exactly used.
m3kw9 670 days ago [-]
Why not sue for 30 billion instead if you are to go full stupid on the price
WA 670 days ago [-]
Interesting, for once this doesn't have anything to do with the GDPR. It's by 16 (US) individuals, filing the complaint in SF.
knaik94 670 days ago [-]
California is the only state that has active data privacy laws. I don't think there are any financial transactions involved, though; it's just public data scraping. I wonder if the company can even be held liable for the output of these LLMs. There's no direct hosting of any static data.
It had to happen eventually. There is so much money to go after. This is a case of lawyers creating their own income stream.
ouraf 670 days ago [-]
If anything major comes out of this, it's probably EVEN MORE prompts and popups asking for permission to use your data. Even with the GDPR, data collection and sales never stopped; it just made things more annoying by transforming every webpage into a granular terms-of-service agreement to continue doing the same.
It isn't even turned off by default. Many sites just give you an "I accept" button, or, even if you want to manage the preferences, the "accept all choices" button is where the "confirm my choices" button should be.
Bigger companies will just append this to their TOS and push it down the customer's throat. That is, if MS doesn't settle out of court and the case gets thrown together with any major opposition to the data mining.
submeta 670 days ago [-]
Well, it was too good to be true. Reminds me of the early days of music sharing and Napster.
Cthulhu_ 670 days ago [-]
Which was never legal in the first place, but it was great because it liberated music and content to the masses. It was the necessary precursor to what is now Spotify and the like, instant access to billions of songs. The music industry didn't like that (Napster & co) because they wanted purchases and to get paid every time the music they owned was played.
bannedbybros 670 days ago [-]
Napster was people sharing files. That's a crime! But now it's corporations so it should be legal.
elforce002 670 days ago [-]
Well, I think the main catalyst here is that corporations need to pay creators, etc... Napster cut the middleman, hehe.
mbgerring 670 days ago [-]
Wow, I really don't get it. If I were to memorize billions of pages worth of people's private messages and medical records, then recited them live on the Internet, would that be a crime?
pessimizer 670 days ago [-]
What exactly is the difference between that and downloading billions of pages worth of people's private messages and medical records, and putting them in a torrent? If there is a difference, I should be able to make a disability discrimination case under the ADA and erase that difference, because I don't have the memory to do that without the aid of a prosthetic (i.e. my laptop.)
Cthulhu_ 670 days ago [-]
Yeah, unless you had permission from the authors to do so.
1. https://www.natlawreview.com/article/hiq-and-linkedin-reach-...
My favorite LLM analogy so far is the "lossy jpeg of the web." Within that metaphor, I don't see how anyone can claim copyright on the basis of a pixel they contributed that doesn't even show up in the lossy jpeg. They can't point to it.
https://theinnisherald.com/the-other-once-upon-a-times-a-his...
https://en.wikipedia.org/wiki/Legal_issues_with_fan_fiction
Fanfiction and fan art also tend to run afoul of the infrequently (but occasionally) litigated part of copyright - copyright of fictional characters.
https://en.wikipedia.org/wiki/Copyright_protection_for_ficti...
I came across this with the Eleanor lawsuits - https://www.caranddriver.com/news/a42233053/shelby-estate-wi... - and while I believe that in that instance Eleanor falls on the "this shouldn't have been copyrightable" side (took a bit to get there), the question is "what protects the representation of Darth Vader?"
In general it tends to be ignored and tacitly encouraged... but it isn't protected.
It's been a bit surreal seeing modern-day Luddites come out of the woodwork, basically coming up with any ethical/legal argument they can that is a thinly veiled way of saying "I don't want to be automated!"
Not commenting on whether or not they are right per se, but it's weird seeing history repeat itself.
(I should caveat that I think if they get what they want, we all lose in a big way. Not that I think this is going anywhere)
We're coming up on the outer bounds of our systems of incentives. Capitalism, as a system, is designed to solve for scarcity, both in terms of resources and in terms of skill and effort. Unfortunately, one of the core mechanisms it operates on is that it's all-or-nothing. You MUST find a scarcity to solve, or you divorce yourself from the flow of capital (and starve / become homeless as a result).
Thus, artificial scarcity. It's easy to spot in places like manufacturing (planned obsolescence), IP (drug/software/etc. patents), and so forth. I think this is just the rest of humanity both catching on and being caught up with. Two years ago, everyone thought they had a moat by virtue of being human. That's no longer a given.
One hopes that we'll collectively notice the rot in the foundation before the house falls over (and, critically, figure out how to act on it. We have a real problem with collective action these days that may well put us all in the ground).
Why? Except for the longshoremen in the US getting compensation and an early retirement due to the introduction of containers, I know of exactly 0 (ZERO!) mass professional reconversions after a technological revolution.
Look at deindustrialization in the US, UK, Western Europe.
When this happens, the affected people are basically thrown in the trash heap for the rest of their lives.
Frequently their kids and grandkids, too.
Businesses change and adapt. Workers too — but people often don’t like change, so many choose to stay behind. Should we cater to them?
I used to do a lot of work which is now mostly automated. Things like sysadmin work, spinning up instances and configuring them manually, maintaining them. I reconverted and learned terraform, aws etc when it became popular.
Should I have gotten help from the government to instead stick to old style sysadmin work?
I don't think anyone beyond a few marginal voices are calling for a ban on job automation. What they seem to prefer is that, if they are to be automated out of a job, they should be compensated for their copyrighted works having been used in the process of doing so.
Regardless, at the very least people who are being automated should get some government support. Not everyone can easily retrain.
Suppose you're a farmer. You've been working on your tractors for decades, and have even showed the nice folk at John Deere how you do it. Now they've built your improvements into the mass-produced models, and they say you can't work on your tractors any more. Who should reap the profits?
Suppose you're a writer. You've spent a long time reading and writing, producing essays and articles and books and poems and plays, honing your craft. You've got quite a few choice phrases and figures of speech in your back pocket, for when you want to give a particular impression. Now, there is a great big statistical model that can vomit your coinages (mixed in with others') all over the page, about any topic, in mere minutes. Who should reap the profits?
Suppose you're a visual artist. You enjoy spending your time making depictions of fantasy scenes: you have a vivid imagination, and, so you can make a living illustrating book covers and the like. You put your portfolio online, because why not? It doesn't hurt you, it makes others happy, and maybe it gets you an extra gig or two, now and then. Except now, there's a great big latent diffusion model. Plug in “Trending on Artstation by Greg Rutkowski”, and it will spit out detailed fantasy scenes, photorealistic people, the works. Nothing particularly novel, but there was so much creativity and diversity in your artwork, that few have the eye to notice the machine's subtle unoriginality. Who should reap the profits?
"You build a dam that destroys 10000 homes, who should reap the profits?"
• Should we be destroying people's homes to build dams without their consent?
• In general, are people being compensated when these things happen to them? i.e., while it might be nice, does this actually happen?
The Luddites (the real ones, not the mythological bastardisation of them) continue to be sympathetic characters.
The famous: "it depends" :-)
AI most likely falls under: "they should be", IMHO.
Tokugawa Japan, Qing China, many other places including in Europe for centuries.
That's too extreme.
My point is that we're reaching a point where people need to be compensated. We can't just destroy their lives, collect all the money in 2 bank accounts and call it a day.
That's the real flaw in Luddite thinking -- you can destroy the machines.
To make coal-mining automation analogous to ChatGPT, the machinery would have had to use something the coal miner did to learn how to automate their work. I'm imagining a camera looking at all the coal miner's work, after which the machine can immediately do it, but better.
I agree it is a tad different, but just as someone's coal mining is in the public domain for anyone in the tunnel to see, anything you write unprotected online is in the public domain and fair game, I think?
* Was it published publicly? This is basically defined in the courts as "if you make an unauthenticated web request, is the data returned?". This is where scraping comes in: if you make the data available without authentication, you can't enforce your TOS, because you can't validate that people actually accepted the TOS to begin with.
* Is the data able to be copyrighted? This is where things are interesting- facts can not be copyrighted, which is why a lot of scrapers are able to reuse data (things like weather, sports scores, even "for hire" notices can be considered factual).
* If it would typically be considered covered by copyright, does fair use come into play?
* Are there any other laws that come into play? For example, GDPR, CCPA, or other privacy laws can still add restrictions to how data is collected and used (this is complicated by the various jurisdictions as well)
* Was the work done with the data transformative enough to allow it to bypass copyright protections? This goes back to when Google was scanning books. Because they were making a search engine, not a library, their search tool was considered transformative enough to allow them to continue.
It's not enough to say "because it's on the internet, it's fair game for everyone to use". This is a really complicated area where things are evolving rapidly, and there's a lot of intersecting law (and case law) that comes into play.
And another complication is that OpenAI is not exposing any static data. A response is generated only after prompting. I'd argue that LLMs are closer to calculators than databases in function. The amount of new information that can be added is also limited; it is not a continuous learning/training architecture.
I do hope this leads to more clear laws regarding data privacy, but I can't imagine the allegations of "intercepting communications", violating CFAA, or violating unfair competition law will hold.
To put it another way, it's legal for me to go to the library and borrow a DVD or a book or poems. That doesn't give me the right to publish the poems again under my own name. Whether I find the poems from scraping, borrowing the book from a library, or even just reading it off of a wall I don't get ownership rights to that data.
The same logic applies to a lot of other laws around data. If you collect data on individuals there are a bunch of laws that come up around it, and many of them don't really concern themselves with how you got the data so much as how you use it. The fact that it was scraped doesn't grant any special legal rights.
You keep making the claim that because it was scraped people can do whatever they want, as scraping is legal. That is the only thing I'm arguing against, because that is a gross misinterpretation of how the case that made scraping legal was decided. LLMs aren't relevant to that point (which is exactly what I keep saying- the method of collection doesn't magically change the legality of it).
That being said, you're still wrong. The US Copyright Office has said that the outputs of LLMs are the outputs of algorithms and are not creative works. Therefore you can't "own the copyright to the new work you make," because the work itself can't be copyrighted at all. No one can own the output of an LLM.
Also, just because it seems you want to be wrong on every level: it is absolutely possible for a neural network to repeat data from its training set. This is an extremely well-known problem in the field.
https://www.bloomberglaw.com/external/document/XDDQ1PNK00000...
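The memorization point is easy to demonstrate even with a toy. A greedy character-level n-gram model (nothing like a real LLM architecturally, but illustrative) will regurgitate its training text verbatim when handed a short prefix; all names below are made up for the sketch:

```python
from collections import defaultdict, Counter

def train_ngram(text, n=8):
    """Count which character follows each n-character context."""
    model = defaultdict(Counter)
    for i in range(len(text) - n):
        model[text[i:i + n]][text[i + n]] += 1
    return model

def complete(model, prefix, length, n=8):
    """Greedily extend a prefix with the most frequent next character."""
    out = prefix
    for _ in range(length):
        ctx = out[-n:]
        if ctx not in model:
            break
        out += model[ctx].most_common(1)[0][0]
    return out

corpus = "the quick brown fox jumps over the lazy dog. "
model = train_ngram(corpus * 3)
completed = complete(model, "the quick", 60)
print(completed)  # reproduces the training sentence verbatim from the prefix
```

Real LLMs are far less likely to emit long verbatim spans, but the published extraction attacks show the failure mode is the same in kind, just rarer.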
That doesn't mean you grant a license to produce derivative works other than search indexes. Legally, it's different. (Germany codifies these as separate "moral rights": Urheberpersönlichkeitsrecht.)
I'm not too worried about copyright issues because regardless of whatever happens with upcoming case law and legislation, any regulation against the input data will be totally unenforceable. It's nearly impossible to detect whether or not an LLM was trained on some corpus of data (although maybe there is some "trap street" equivalent that could retroactively catch an LLM trained on data it wasn't allowed to read). And even if the weights of a model are found to be in violation of some copyright, it's still not enforceable to forbid them, because they're just a bag of numbers that can be torrented and used to surreptitiously power all sorts of black boxes. That's why I'm much more worried about legislative restrictions on hardware purchases.
I hope it leads to more people realizing that a TOS doesn't override their individual rights, and that the legal system works to support them.
There's sort of an exception for military service, but even soldiers have access to military courts.
It's codified in the fact that saying you'll do something means you're socially obligated to do it, and legally obligated if you receive something in return.
It seems to me there are a ton of counterexamples to this "right" you speak of. So many that it doesn't seem like it really exists.
The same argument could be used to defend ubiquitous face recognition in the street, though ("when going out in the street, there's an implicit assumption that your presence in this place is public"), but I'd really like it if we could not have that…
There's a case to be made that corporations gathering data and training artificial intelligence don't need to have the same rights as people: when I go out in the street or publish something on Reddit, I'm implicitly allowing other people to read my comments, but not corporations to monetize them. (The GDPR and the like already make this kind of distinction for personal information, by the way, so we could totally extend it to any kind of online activity.)
Just because I posted something on Reddit because I thought it was funny doesn't mean I implicitly gave anybody permission to take that post and profit from it. You're doing a disservice to consumers by acting like it's their fault for being exploited.
I disagree with you on whether it should count as being exploited. I don't see fanfiction writers or professional impersonators as inherently exploitative. I understand that some people would disagree because there is a difference in scale. But using technology to mimic and, in some sense, replace human effort is exactly what makes it useful.
I believe this will shift how and why people value organic media. The standard of what makes content "good" will rise in the long term. When Stable Diffusion first came out, I compared the generated art to elevator music. I feel the same way about the output of LLMs. I might feel differently in a few years if models keep improving at their current rate, but that seems unlikely.
I agree that people should have more control over how their data is used, and I'd love to see this suit lead to stricter laws.
So the technology is cool, but I'm firmly of the stance that they cut corners and trampled people's rights to get a product out the door. I wouldn't be entirely unhappy if this iteration of these products were sued into the ground and forced to start over on this stuff The Right Way.
Another part which bothers me is that I have lots of different personas online. On most sites I use different usernames, and I wonder if there will someday be an AI that can match all the different online profiles to a single person, even if different usernames are being used, etc.
Not on the phone yet, but on a Mac which could include iMessages.
And it adds nothing. I'm sorry but saying "Whether that content is used to train a human mind or an artificial one is probably not up to you" may be worse than saying nothing at all.
First because it shows enough doubt on whether it's up to the authors of content (IP laws, fair use, intent of the use, and many things I ignore), while giving no laws as an example or frame of reference.
And second because it's comparing a human mind that we know exist, to an artificial one, which implies:
1. An LLM is an artificial mind, or close to one, whatever that is (again, not defined).
2. If they were to exist, they would be both equivalent and treated the same as a human one.
The number of leaps in a couple of sentences, added to the uncertainty of how copyright would/will work, multiplied by the number of times I/we read that type of comment, is getting tiresome. And it's degrading the signal-to-noise ratio.
If you want to prevent a web spider from scraping your blog, use a captcha or robots.txt. Copyright law doesn’t apply to this scenario.
If you're tired of responding to these comments then stop. It's the internet, everyone is at different places in exploring topics and having discussions. Don't poo-poo on someone else's journey and instead move on with your day. There is no required reading (other than TFA) on hacker news.
No. Both legally and practically, you absolutely do not.
The only thing copyright law gives you is an exclusive right to sell it for a limited period of time, as a whole in its original form or similar -- and to transfer that right.
Regardless of your desires, anyone can reuse it under the conditions of fair use. They can copy parts of it for parody purposes. If they're not selling anything or taking away from your sales*, they can reproduce it verbatim for private purposes. And even if they are selling something, they can summarize it, quote from it, rephrase it, and so forth.
And you don't actually get to decide any of that.
* Edit: added "or..."
> I wasn't asked and I don't really care to donate work to large corporations like that... I do get to decide what happens with it.
And I said:
> No. Both legally and practically, you absolutely do not.
You think you get to decide whether large corporations can train on your work. I'm saying the law suggests you very much don't get to decide that.
Send some links if you see some definitive case law sorting this stuff out.
If ChatGPT regurgitates verbatim or nearly verbatim, something it slurped up from OP's blog, is that not plagiarism? Where do you draw the line? Where would a reasonable person draw the line?
Often rather than claiming human aspects to the machine, they are going further, and claiming machine aspects to the human.
Using mechanistic analogies for explaining the human body or mind isn't new, but as machines become better and better at imitating humans, those analogies become more seductive.
That's my rant; the danger with 'AI' isn't so much that humans are enslaved by machines, but that we enslave each other -- or dehumanize each other -- with machines.
You are entitled to control its distribution and use. You are not entitled to control its influence and effects.
AIs are not massive repositories of harvested data. The models are relatively small (<20GB).
https://www.pinsentmasons.com/out-law/news/google-thumbnails...
> A US court ruled this week that Google's creation and display of thumbnail images does not infringe copyright. It also said that Google was not responsible for the copyright violations of other sites which it frames and links to.
> The Court said that Google did claim fair use, and that whether or not use was fair depended on four factors: the purpose and character of the use, including whether such use is of a commercial nature or is for non-profit educational purposes; the nature of the copyrighted work; the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and the effect of the use upon the potential market for or value of the copyrighted work.
Taking copyrighted material and using it to train a model is not a copyright infringement - it is sufficiently transformative and has a different use than the original images.
Note that AI models can be used for different things. A model trained to identify objects in an image has never caused an uproar when "squirrel" shows up in its output text.
The model also, as a purely mathematical transformation of the original source material, does not get a copyright. If it needs to be protected, trade secrets are the tool to use. A model is no more copyright-worthy than taking an image and applying `gray = .299 red + .587 green + .114 blue` to it.
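For concreteness, that formula is the standard ITU-R BT.601 luma transform; a minimal Python sketch of applying it to one pixel:

```python
def to_gray(pixel):
    """Convert an (R, G, B) pixel to a single gray value using
    the ITU-R BT.601 luma weights from the comment above."""
    r, g, b = pixel
    return round(0.299 * r + 0.587 * g + 0.114 * b)

# A pure red pixel maps to a fairly dark gray:
print(to_gray((255, 0, 0)))    # → 76
# White stays white, black stays black:
print(to_gray((255, 255, 255)))  # → 255
```

The point stands either way: it's a deterministic arithmetic transformation of the input, not a new creative work.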
The output of a model is ineligible for copyright protection (in the US - and most other places).
The output of a model may fall into being a derivative work of the original content used to train the model.
It is up to the human, with agency in asking the model to generate certain output to be responsible for verifying that it does not infringe upon other works if it is published.
Note that the responsibility of the human publishing the work is not anything new with an AI model. It is the same responsibility as if they were to copy something from Stack Overflow or commission a random person on Fiverr; it's just that we've overlooked those sources for a long time. It is similarly quite possible for the material on those sources to be copyrighted by and/or licensed to some other entity, and the human copying it into the final product is responsible for any copyright infringement.
Saying "I copied this from Stack Overflow" or "I found this on the web" as a defense is just as good as "Copilot generated this for me" or "Stable diffusion generated this when I asked for a mouse wearing red pants" and represents a similar dereliction on part of the person publishing this content.
Would that satisfy you?
But we do? Open sourcing something with caveats is common. This code is public BUT not for commercial use. This code is public BUT you must display attribution etc.
Sure, blogposts are unlicensed (that I know) but the idea of something publicly available being held to restrictions is nothing new.
But telling me it is illegal to share what I learnt because the original source is copyrighted... doesn't sit right with me.
What protects particular solutions is patents. For example if someone were to obtain a patent for computing GCD of large integers the usual fast way, well then everyone else would have to use a different solution.
This analogy to someone reading a book, perhaps peppered with lots of legalese to the point of being hardly recognizable, will definitely be used in courts at some point. And I can't see how it wouldn't stand as a valid argument.
If you pick up a book and learn a fact, then yeah, you’re allowed to share that fact.
It’s weird that this topic keeps devolving into a form of “so what, it’s illegal for me to learn things?” Because: no, it’s not. And: You and a piece of software are treated differently under the law. You have a different set of rights than ChatGPT.
Gods, no. Where did you get that from?
Those might not be a problem regarding this specific case, but the case can easily be made that they ought to be.
I don't understand your point. Do you think it makes any difference whether I use my laptop, or a pen, or ChatGPT to violate copyright?
On the other hand, it's completely feasible to make a license that stops someone from training their model with some piece of info, is it not?
And regardless -- the problem now is that expectations of how content can be consumed are now fundamentally violated by automation of content ingestion. People put stuff up on the Internet with the expectation of its consumption by human minds, which have inherent limitations on the speed and scale on which they can learn from and reproduce things, and those humans are also legally liable, socially/ethically obligated, etc.
Now we have machines which skirt the limits of legality, and are able to do so on massive scale and without responsibility to society as a whole.
Different game now.
Then people obviously aren’t aware that bots have been indexing web pages and showing summarized information without going to the web page for three decades.
Further, almost every site has had an e.g. robots.txt which has permitted content harvesting only for certain accepted purposes for a couple decades now. So clearly people already had a sense of how they wanted their content harvested and for what purposes.
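For readers who haven't looked at one, a robots.txt expressing exactly that kind of per-purpose permission might look like this (the crawler names here are illustrative, not a real site's policy):

```
# Allow a general search crawler to index everything
User-agent: Googlebot
Allow: /

# Disallow a hypothetical training-data crawler entirely
User-agent: ExampleTrainingBot
Disallow: /
```

Of course, robots.txt is purely advisory: it signals intent, but nothing technically stops a crawler from ignoring it.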
So you’re okay with Google making money off of your content. But not OpenAI?
Don't get me wrong, this is a grey area where copyright laws and general consensus haven't caught up with new technology. But if you voluntarily stick something up online with the intent that anyone can read it, it seems a bit mean to then say "wait no you can't do that" if someone finds a way to materially profit off it.
Putting it more bluntly, it is somewhere between a parasite and a slave driver.
I wonder what would an LLM trained on Google code and internal documents look like?
Seems like if it's legal for a person to do it should be legal for software to do for the most part.
If you ask ChatGPT the rules for D&D, the private sourcebooks are all in there.
Whoever told you that is lying to you. You are not legally allowed to personally memorize and recite copyrighted works all you want, any more than you're allowed to personally memorize, write down copyrighted works, and distribute them as much as you want.
All piracy is a process of computer-assisted remembering and reciting.
I can't go write and commercialize what I learnt directly, but I'm not breaking the law by quickly seeing how some book I didn't buy ends so I can talk about it at a party - and then everyone knows how it ends which might affect whether they want to buy said book and upset the author. But, tough shit, what I did was legal. I can even use the ending as one set of input from dozens of inspirations for my own book where the end result is transformative enough where the sources are unrecognizable. And if I had learnt about the endings from a dozen books without buying those books I didn't break any laws even though I am now commercializing something in being inspired by them all to make something new.
Being non-commercial is not an automatic fair use exception. Being commercial does not preclude fair use. And rule concepts are not copyrightable, only the specific expression. Rules may have other IP protection, including patents.
That's not even true in the US anymore. You'd have to convert those rules into some sort of device, or argue that the game is a business method.
The rules of games cannot be copyrighted either. The artistic elements can be trademarked, but if ChatGpt merely explains the rules to you in different ways, that isn't infringement either.
...wait, isn't that false? legitimately asking.
or is it because it was done by a corporation that makes it illegal?
I'm thinking of how restaurants don't sing Happy Birthday, and fair use restrictions, etc.
If I recite them to myself, in my home, it's fine. If I do it at a gathering at my house where we're playing D&D, fine. If I do it as a performance, in front of a crowd, or as a recording, now I'm no longer fine. Context matters in a copyright cases. Not to mention, to claim fair use, you do have to claim you violated copyright. Fair use is just an allowed violation.
As to Happy Birthday, it's actually okay for them to sing it now. The person/group that held the copyright to Happy Birthday was found never to have actually held it in the first place. Happy Birthday is really an older song called "Good Morning to All". Swap "Good Morning" with "Happy Birthday" and "children" with "dear [PERSON]" and you have the lyrics; this was not deemed a substantive change. And since the copyright on "Good Morning to All" has lapsed, Happy Birthday is in the public domain.
[0]: https://fairuse.stanford.edu/overview/fair-use/four-factors/ [1]: https://creativecommons.org/faq/#can-i-apply-a-creative-comm...
I can give you a rock that I own, which I hope we all agree is not copyrightable, and ask you to sign a license that you will keep it indoors. If you put it in your yard, you are breaking the license and potentially liable. This has nothing to do with copyright.
Obtained from (check notes) public internet forums
> For the 16 plaintiffs, the complaint indicates that they used ChatGPT, as well as other internet services like Reddit, and expected that their digital interactions would not be incorporated into an AI model.
You've got to be incredibly naive if you think public Reddit data isn't used to train ML models, not least by Reddit themselves
[1] https://www.kaggle.com/datasets/ehallmar/reddit-comment-scor...
2: You're the one who went with "invented" ;)
3: I know you're exaggerating, but I think you think you're exaggerating much less than you actually are.
[1] https://www.bankmycell.com/blog/number-of-reddit-users/
It's not really important to the debate around unlicensed use of copyrighted works to train AI models, but it wouldn't surprise me at all if the majority of Reddit users have joined since 2018. It's tough to get reliable active user counts, but they seem to have risen substantially over the past five years.
It also wouldn't surprise me if the majority of Reddit users were indeed from prior to 2018, but at the very least > 2018 would be a very substantial minority.
Surely, there is some pretty large subset of things where "if it's legal for a person to do it should be legal for software" does not hold up?
So how about the default is "not allowed"
I am sure OpenAI thought all this through, so I can only assume they said "fuck it let's pull an Uber and do this anyway." We are in for lots of interesting legal headlines
If you're going to make a claim this strong, you should expand on it. Should software be able to have custody of children? Should it be able to kill in self-defense? Should it be able to make 14th amendment claims? Exactly what part of the case (other than the damage claim) is hard to understand?
It's legal for me to look out of the window and watch my neighbor go to the supermarket.
It's _not_ legal for me to build an automated surveillance system that tracks everybody on the street 24×7 and stores everything into a large database.
There's more deliberate action involved in posting something on a public online forum than in just existing in a place outside of your house. Especially considering you've always had the option to use Reddit anonymously anyway.
Read, yes - post no.
And - you can no longer create an account that is not tied to an email...
> The lawsuit is seeking class-action certification and damages of $3 billion – though that figure is presumably a placeholder. Any actual damages would be determined if the plaintiffs prevail, based on the findings of the court.
They picked 3B hoping to get several million...
But that is a moral point, not a legal one; IANAL and can't say anything valuable about the legal merits.
Ideally AI makes us all redundant and the money stops mattering anything like as much, similar to how owning land stopped mattering anything like as much when the industrial revolution happened.
Regardless, I think this is a policy question rather than a legal question, even if this fight happens to be in a court.
When this happens to closedAI, it just seems like a profit grab.
Not that it changes the legality of it. Just optics.
Wonder if that matters in court.
Then they came for the writers, but I did not speak out because I was not a writer.
Then they came for me, and there was no one left to speak for me... well, except ChatGPT.
edit: corporate LLMs have pulled the "one death is a tragedy, ten thousand deaths are a statistic" ploy off fully. If you want people to question whether you're even violating copyright, make sure you violate all of them at the same time. They'll just decide that you're an act of god and not covered under earthly laws.
I don't think this is relevant. If OpenAI had trained a model on just one copyright holder's content it would likely be no different legally, even if the model would perform much worse.
Not that all capital is distributed by merit, plenty of people used military might or factionalism/leaders/politics to obtain disproportionate amount of capital.
But if you are against the last two happening, I don't see what you expect a reorganization of society to accomplish, since you are going to get a power structure of factionalism/leaders/politics taking priority. (Sorry bud, no an-com utopia has ever existed; they all had factionalism/leaders/politics, thus defeating the entire purpose of removing class.)
I think most of us think we can capture/retain power easier with money, than having to climb up inter-party politics.
I don't necessarily disagree with your later points. I do, however, disagree with giving up.
At least its equitable (based on value of output), ofc there are legacy issues as well.
Some demagogue can sway the masses and take it all if not for capital. That demagogue could be Trump or Stalin.
Know the consequences of what you are advocating for.
It's not. By definition, it's based on control of capital. That's why it's called capitalism. In other words, those aren't "legacy" issues; they are literally the system as designed.
Since utopia is impossible, its a choice between:
>Capitalism, where people can typically pull off the american dream in their lifetime.
or
>Let politics determine how much material things you get
The latter seems especially scary if you are familiar with history
Capitalism follows a very simple algorithm. In a capitalist economy, capital always accumulates, with all exceptions being precisely that: exceptions. Are you defending the exceptions or the rules?
Realize there was a very long and quite recent time when capitalism was impossible. By your logic, we should reinstate the divine right of kings.
https://news.ycombinator.com/item?id=36498294
> Nuance has strict data agreements with its customers, so patient data is fully encrypted and runs in HIPAA-compliant environments
Additionally Epic seems to already be storing these clinical notes in databases and Nuance which Microsoft owns has already technically been a 'hot mic' in these same doctors office for some time. The new offering is an AI-draft note generator.
I'm personally skeptical that model output would suddenly be under different rules than the other voice-to-text AI model output?
The 1-28 pleading numbers on the side are annoying. They're specific to courts in California and a few other jurisdictions, and the rules of court require them. But many other courts don't have them, and they only help to cite specific lines within pages; eg "Complaint 5:4-9" means "Complaint at page 5, at lines 4 to 9". It's occasionally useful for court filings like this, but more useful for court/deposition transcripts of testimony to show precisely where a witness said something.
Related: I tried building an RNN to generate legal pleadings back around 2018/19 and gathered a bunch of docs like this from courts across the country as training data. Processing text with those pleading numbers was a pain, so I built a CNN to classify whether a document had pleading numbers or not, which affected downstream processing. Probably the wrong approach in a bunch of ways, but I was just learning.
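Stripping those margin numbers from already-classified docs is simpler than the classification itself; a rough heuristic sketch (assuming the numbers end up on their own lines after text extraction, which was my experience but won't hold for every PDF extractor):

```python
import re

def strip_pleading_numbers(text):
    """Drop lines that consist solely of a margin number 1-28,
    as printed by California-style court filings."""
    cleaned = []
    for line in text.splitlines():
        # Match a bare number from 1 to 28, with optional whitespace
        if re.fullmatch(r"\s*([1-9]|1[0-9]|2[0-8])\s*", line):
            continue
        cleaned.append(line)
    return "\n".join(cleaned)

sample = "1\nPlaintiff alleges as follows:\n2\nVenue is proper."
print(strip_pleading_numbers(sample))
```

The hard part, and why I needed the classifier, was that plenty of documents legitimately contain short numeric lines, so you only want to run this on docs known to have the margin numbering.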
When I worked on an expert witness report for a big law firm we just used Word.
Seems to be a blatant violation of GDPR. So I assume they’ll be fined for it sooner or later and forced to clean up the training data anyway.
GDPR doesn’t prevent opt outs of this kind of thing.
So if they’re claiming they have the right to process data on the legal basis of consent, and they claim the absence of that cookie constitutes that consent, then they have no legal basis, and are thus in violation of the law.
We are actually working on a tool to create billion-size free-to-use Creative Commons image datasets and prepare them for training models like Stable Diffusion. There is a blogpost about it here: https://blog.ml6.eu/ai-image-generation-without-copyright-in...
In terms of implementation, I wonder about a few things:
Do models trained on more data have to pay more? LLaMA was trained on 1.5T tokens, the original GPT-3 was trained on ~300B tokens. And this is only partially related to model quality, LLaMA 13B and LLaMA 65B were trained on the same data, but the 65B model is better. What's the incentive to ever use the 13B model, if the licensing cost is 100x-1000x the model inference cost?
Who defines a word? Each model uses a different tokenizer. I'm personally amused by the idea of a government-mandated tokenizer.
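Even a toy example shows why the unit matters; here are two crude "tokenizers" over the same sentence giving wildly different counts (real models use subword schemes like BPE, which land somewhere in between):

```python
text = "Licensing fees per token are ambiguous."

# Word-level: split on whitespace
whitespace_tokens = text.split()
# Character-level: every non-space character is a "token"
char_tokens = [c for c in text if c != " "]

print(len(whitespace_tokens))  # → 6
print(len(char_tokens))        # many more "tokens" for identical content
```

Any per-token fee schedule would have to pin down which of these countless possible segmentations is the billable one.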
What about generations that never see human eyes? As an NLP researcher, I've generated millions of tokens for training and automatic evaluation purposes -- are those subject to licensing as well?
The idea is to keep it simple, so it wouldn't be based upon the specifics of training, just whether or not it used public data. Anything else would require companies to divulge trade secrets and that won't fly. And words are defined here as, well, words -- English words. There'd be a separate fee per pixel/voxel, and then a catchall for non-language/non-image models.
2. Given that the internet is global, is every country supposed to make their own versions of this? Will I have to pay the EU tax to use models that might have been trained on data that Europeans posted online?
What if I'm not a massive corporation with millions of lines of code to train on and I want to pay for an AI coding assistant? Doesn't this make it effectively illegal for me to purchase such a product for a reasonable price when big companies will presumably be able to use it without paying the tax?
Another situation - let's say you're a company that contributes heavily to open source, but also accepts external contributions. Could Facebook train a model on the React codebase, for example, without having to pay the AI tax?
Another situation - suppose I start an LLM coding assistant and sell it to my friend. Presumably I don't have to pay the tax as a "low revenue" company. Then I get acquired or get some huge seed round and suddenly my customers have to pay the AI tax. Doesn't this just nuke all my customers?
Anyway, as a software engineer, I personally want people to use my code for whatever they want to use it for, without having to pay me for it. I indicate that by using an MIT license. Why throw that precedent out the window?
And this would not prevent you from explicitly licensing your code or writing to let people to train on it. But what it would do is say that if someone didn't explicitly license it then it is covered under the policy.
Look at all the sharing-economy players: it boils down to offloading the risk, labor, and debt but keeping the margin.
Now that the gates are open, we'll probably be entering the "free money" cycle soon.
https://iapp.org/resources/article/us-state-privacy-legislat...
https://leginfo.legislature.ca.gov/faces/codes_displayText.x...
It isn't even turned off by default. Many sites just give you an "i accept" button or even if you want to manage the preferences, the "accept all choices" button is where the "confirm my choices" should be.
Bigger companies will just append this to their TOS and push it down the customer's throat. That is, if MS doesn't settle out of court and the case doesn't get lumped together with any major opposition to the data mining.
Thankfully AI doesn't work by memorization.