Congress Wants Tech Companies to Pay Up for AI Training Data

stopthatgirl7@kbin.social · 10 months ago

riodoro1@lemmy.world · 10 months ago

So we’ll get a couple of big players who managed to gobble and hoard everything before any regulation was in place and nobody else. Oh the sweet smell of monopoly

burliman@lemmy.world · 10 months ago

Yep. Effectively outlawing AI with this licensing hogwash (which no human who is learning how to write or draw from the same content must pay) will only drive it into the bowels of the rich and powerful. Then you will have your AI dystopia.

wewbull@iusearchlinux.fyi · 10 months ago

The choices here are to respect copyright or destroy it. Having an AI exception is nonsense.

"I’m not illegally downloading the latest blockbuster / best seller / chart-topping album. I’m scraping the internet for training data for my AI. It just so happens I need to filter the data by hand before it can ingest it. I keep looking for suitable data, but haven’t identified any yet."

There’s plenty of non-copyrighted material out there to do research on. It won’t make for useful AI products, but they can start licensing for that.

Grimy@lemmy.world · edit-2 10 months ago

“What would that even look like?” asks Sarah Kreps, who directs the Tech Policy Institute at Cornell University. “Requiring licensing data will be impractical, favor the big firms like OpenAI and Microsoft that have the resources to pay for these licenses, and create enormous costs for startup AI firms that could diversify the marketplace and guard against hegemonic domination and potential antitrust behavior of the big firms.”

As our economy becomes more and more driven by AI, legislation like this will guarantee Microsoft and Google get to own it.

Motavader@lemmy.world · edit-2 10 months ago

Yes, and they’ll use legislation to pull up the ladder behind them. It’s a form of Regulatory Capture, and it will absolutely lock out small players.

There are open source AI training datasets, but the question is whether LLMs can be trained as accurately with them.

Mechanize@feddit.it · 10 months ago

Any foundation model is trained on a subset of Common Crawl.

All the data in there is, arguably, copyrighted by one individual or another. There is no equivalent open - or closed - source dataset to it.

Each single post, page, blog, site, has a copyright holder. In the last year big companies have started to change their TOS to make sure that they are able to use, relicense and generally sell your data hosted in their services as their own for the purpose of AI training, so potentially some small parts of Common Crawl will be licensable in bulk - or directly obtained from the source.

This does still leave out the majority of the data directly or indirectly used today, even if you were willing to pay, because it is infeasible to find and contract with every single rights holder.

On the other side of it, there has been work on using less but more heavily curated data, which could potentially generate good small, domain-specific models. But still they will not be like the ones we currently have, and the open source community will not be able to have access to the same amount and quality of data.

It’s an interesting problem that I’m personally really interested to see where it leads.

wikibot@lemmy.world · 10 months ago

Here’s the summary for the wikipedia article you mentioned in your comment:

Common Crawl is a nonprofit 501(c)(3) organization that crawls the web and freely provides its archives and datasets to the public. Common Crawl's web archive consists of petabytes of data collected since 2008. It completes crawls generally every month. Common Crawl was founded by Gil Elbaz. Advisors to the non-profit include Peter Norvig and Joi Ito. The organization's crawlers respect nofollow and robots.txt policies. Open source code for processing Common Crawl's data set is publicly available. The Common Crawl dataset includes copyrighted work and is distributed from the US under fair use claims. Researchers in other countries have made use of techniques such as shuffling sentences or referencing the common crawl dataset to work around copyright law in other legal jurisdictions. As of March 2023, in the most recent version of the Common Crawl dataset, 46% of documents had English as their primary language (followed by German, Russian, Japanese, French, Spanish and Chinese, all below 6%).

Motavader@lemmy.world · 10 months ago

Thanks for the link to Common Crawl; I didn’t know about that project but it looks interesting.

That’s also an interesting point about heavily curated data sets. Would something like that be able to overcome some of the bias in current models? For example, if you were training a facial recognition model, you could access a curated, open source dataset that has representative samples of all races and genders to try to reduce the racial bias. Anyone training a facial recognition model for any purpose could have a training set that can be peer reviewed for accuracy.

General_Effort@lemmy.world · 10 months ago

Face recognition is probably dead as an open endeavor. The surveillance aspect makes it too controversial. I mean that not only will we not see open source work on this, but any work is behind closed doors.

In general, a major problem is that it is often not clear what reducing bias means. With face recognition, it is clear that we just want it to work for everyone. With genAI it is unclear. EG you type “US president” into an image generator. The historical fact is that all US presidents were male, and all but one were white. What’s the unbiased output?

One answer is that it should reflect who is eligible for the US presidency. But in the future, one would expect far more people to be of “mixed race”. So would that perhaps be biased against “interracial marriage”? In either case, one could accuse the makers of covering up historical injustice. I think in practice, people want image generators that just give them what they want with minimum fuss; wants which are probably biased by social expectations.

In any case, such curated datasets are used to fine-tune models trained on uncurated data. I don’t think it is known what such a dataset should look like exactly, to yield an unbiased model (however defined).

General_Effort@lemmy.world · 10 months ago

These open datasets are used to fine-tune LLMs for specific tasks. But first, LLMs have to learn the basics by being trained on vast amounts of text. At present, there is no chance to do that with open source.

If fair use is cut down, you can forget about it. It would arguably be unconstitutional, though.

That’s not even considering the dystopian wishes to expand copyright even further. Some people demand that the model owner should also own the output. Well, some of these open datasets are made with LLMs like ChatGPT.

wewbull@iusearchlinux.fyi · 10 months ago

If fair use is cut down…

It’s not a case of cutting down fair use. It’s a case of enforcing current fair use limits.

General_Effort@lemmy.world · 10 months ago

Can you give an example of something that is outside fair use?

Just in case there is confusion here: Obviously there is no past precedent on exactly the new circumstances, but that does not put new technologies outside the law. EG the freedom of speech and the press apply to the internet, even though there is no printing press involved.

be_excellent_to_each_other@kbin.social · 10 months ago

Well fuck all those artists and writers who made the original works then I guess. Licensing is impractical.

Dran@lemmy.world · 10 months ago

They’re going to get fucked either way, may as well live in the world where smaller AI companies have a chance. It’s already bad enough that openai got to slurp reddit and twitter for free and nobody else can.

burliman@lemmy.world · 10 months ago

They won’t be fucked. They can use the AI tools as well to make novel content, and augment their production quality and quantity.

MysticKetchup@lemmy.world · 10 months ago

augment their production quality

Lmao

TORFdot0@lemmy.world · 10 months ago

And what about the authors whose works were ingested without compensation? What should we do for them? I don’t think that these commercial AI models should get to infringe on their copyrights for nothing. If I pay for a ChatGPT subscription and ask it to tell me about the war in the Middle East and it basically regurgitates and plagiarizes information it learned from a journalist, then ChatGPT has essentially stolen the copyrighted work from that journalist and the revenue that my click would have earned them.

I don’t see a problem with using publicly posted copyrighted data for non-commercial use for training local language models, but I don’t think it’s fair to allow copyright infringement for commercial use.

General_Effort@lemmy.world · 10 months ago

You’re repeating some talking points which are simply misinformation. An author who makes something “for hire”, like an employed journalist, does not own the copyright. Do you believe that construction workers benefit when rents go up?

Copyrights are called intellectual property, because they work a lot like physical property. Employees create them and employers own them. They are bought and sold. A disproportionate share of property belongs to rich people, which is how they are rich.

This is about funneling more wealth to property owners. The idea that this would benefit anyone else is simply the good old trickle-down. It will not happen.

Grimy@lemmy.world · edit-2 10 months ago

I think it’s better to be pragmatic than to give everything to the big corporations.

OpenAI isn’t going to take its tool offline, so the loss of revenue isn’t going away. Payments won’t end up in the pockets of any individual journalist. The money the few journalistic sites will receive will be used to pay for the subscription fee to the next big model while cutting their staff, since it will net them more money.

If this goes through, Google and Microsoft will spend the next few years buying data or the companies that have it. The walls will be raised and we will be fucked, legislation will only help them.

And there is simply not enough public domain data to build a competitive product. Better to tax and redistribute through UBI while keeping the field competitive and avoiding monopolies imo.

soulfirethewolf@lemdro.id · 10 months ago

I think it would be better to enforce open, readable training sets that anyone can browse through to submit legal requests

kingthrillgore@lemmy.ml · edit-2 10 months ago

The irony that this is coming back to being a copyright extension issue in the year of our lord and savior, Steamboat Willie, is not lost on me.

Eggyhead@kbin.social · 10 months ago

Regulating data collection on publications: congressional action is a go!

Regulating data collection on consumers: everybody look the other way!

aelwero@lemmy.world · 10 months ago

Are they going to pay for anything that ever inspired them? Every time you publish an article, you owe a dollar to every English teacher you ever had? Fill out your taxes and you owe your math teachers?

It’s fucking goofy…

DrMcRobot@lemmy.world · 10 months ago

That’s a pretty dumb comparison. Are you suggesting that people who create stuff used to train AI are obligated to provide that education for free? People who create books/educational aids for teachers to use in classrooms still demand to be paid for that. Teachers are paid for delivering that education. The kids don’t pay the teachers, as a society we tax people because education benefits us all, but the teachers are still paid (not enough!)

Dran@lemmy.world · 10 months ago

I think he’s suggesting that it’s pretty dystopian to let creators decide that their content is free to view but only if you’re a human willing to let companies spy on you while watching it.

It’s either free or it’s not.

SkyNTP@lemmy.ml · edit-2 10 months ago

Not a great comparison. AI training is not problematic because of consumption, it’s a problem because it is then used to circumvent copyright law.

If you do want to argue with technicalities, you also have to contend with the fact that a large part of the concerned usage is not really free. Much of this online data is funded by advertisement. Scraping it constitutes circumvention.

Dran@lemmy.world · 10 months ago

There are no contracts signed agreeing to that exchange by either me or the scrapers. Legally, it’s free.

Shurimal@kbin.social · 10 months ago

Human brain (any brain, really) is a natural neural network which is trained throughout its life the same way an artificial neural network is. Nothing is original, every creator is “stealing” from every other creator whose work they have studied to become better creators. No creator ever in history has created anything in pure, absolute vacuum. Every creation is a remix and amalgamation of previously created works.

And intellectual property is a spook, anyway. No-one can own an idea.

Barbarian@sh.itjust.works · edit-2 10 months ago

I hate the term intellectual property. It’s a word used to describe vastly different concepts with vastly different legal backgrounds and problems.

Copyright is theoretically a good thing, giving an artist or writer the time to profit from their work before the work becomes public domain, incentivizing the work. The current international agreements around it are absolutely bonkers thanks to Disney. The fact that the copyright persists after death, let alone for a century, is complete madness. The artist obviously can’t profit from their work after they’re dead. It’s an absolute shameless cash grab that destroys culture.

Patents are also theoretically a good thing, allowing companies to release specifications of machines that allow for 10 years of exclusive use. Without patents, companies would hide their designs as trade secrets. It guarantees that after a decade, the designs will be publicly available for anyone to see. They need to be much more heavily restricted in what you can put patents on though. Patenting a specific machine design is fine, patenting molecules or math breaks the entire system. Software patents are blatantly absurd and broken.

EDIT: Should also mention that 3D printers are a patent success story of the system working as intended. Patented in 1986, the inventor made good money making expensive machines with his own company. In 1996, the patent expired and we had an explosion of competing machines, getting ever cheaper and more effective. Everybody won. The inventor made bank for his decade of exclusivity, and then everyone benefited from the design being public domain, free for everyone to use.

Trade secrets, the protection of specific recipes, client lists and strategies, can be abused to protect companies against disclosing information that may be very pertinent to their customers and governments. The Coca-Cola recipe or lists of clients as a trade secret is fine imho, but they can also abuse trade secret law to hide systems that lie about your car’s emissions.

Trademarks help protect consumers against knockoff brands that pretend to be what they’re not. This is the least abused type of “IP”. This doesn’t mean there aren’t bad actors out there registering tons of different trademarks to squat on those designs & names, hoping to force a new company to pay up to use the name. Trademark squatting could theoretically be solved by annulling the trademark if the company isn’t actively using it. Trademarks are currently much too easy to maintain.

All of this to say, lumping all of these different laws into “IP” is not useful at all when talking about the goals of the different legislations, what they’re trying to do, and how they fail.

TimeSquirrel@kbin.social · edit-2 10 months ago

Human brain (any brain, really) is a natural neural network

But the big difference between us and the AI is that we have motivation and drive. We don’t exist for a split second for the sole purpose of fulfilling a prompt. We can take what we’ve learned and create new things with it. The AI just spits out what it already knows. Not what is possible to do with what it knows. It cannot invent.

treefrog@lemm.ee · edit-2 10 months ago

Property is a spook generally.

But I can’t blame journalists for wanting to eat.

Which is what this is really about. Food and paying the bills. Not intellectual property.

txmyx@feddit.de · 10 months ago

The human brain isn’t a product that is being sold. Also in most cases, education is not free (school, university, …)

Shurimal@kbin.social · 10 months ago

Education is free in most of the world. And people sell their brains all the time. It’s called “a job”.

burliman@lemmy.world · 10 months ago

You’re getting downvoted, but it will be the next thing. Don’t you dare thank the people or books that inspired you when you give that Peabody acceptance speech.

blazeknave@lemmy.world · 10 months ago

Don’t hate me… Did Hawley just grow up in the wrong place?

He always ends up on the reasonable end of some of this shit, except he has to do it under the guise of his firebrand bigoted bullshit.

Someone tell me what opinions I’m supposed to form about this guy

doylio@lemmy.ca · edit-2 10 months ago

I think most of the crazy lawmakers are not actually crazy. You probably need to be quite intelligent to make it through all the hoops to get elected to Congress. It’s an act that they know gets them attention on social media, but on issues that aren’t partisan they can actually act like adults.

snooggums@kbin.social · 10 months ago

Unfortunately the GOP has decreed since at least the 90s that everything is partisan.

GiddyGap@lemm.ee · edit-2 10 months ago

Hawley happens to be reasonable on some issues that have bipartisan support. He’s a true asshat on social issues.

vsh@lemm.ee · 10 months ago

Good decision imo. AI is getting ridiculously out of hand. Law can’t even keep up with whatever shenanigans they generate in their labs.

yamanii@lemmy.world · 10 months ago

Would finally make the snakes that gather the training data accountable, since AI companies use them as scapegoats.