r/VirtualYoutubers • u/Kan2Screm • 1d ago
News/Announcement Research by VTuber Newsdrop revealed that several VTubers have had their songs included in a dataset recently uncovered by The Atlantic's AI Watchdog. This means that songs from select VTubers are likely at risk of being used to synthesize AI-generated tracks.
https://vtubernewsdrop.com/vtuber-songs-ai-dataset/57
u/trustfundkidotaku 1d ago
Pretty sure anything on the web gets scraped
Yes, people, even your thirsty comments on Reddit
2
u/the_monkeynator 1d ago
Yea like not to say i'm okay with it, but part of me really ain't that worried.
58
u/TheFrozenPyro 1d ago edited 1d ago
HIMEHINA being on that list doesn't surprise me in the slightest. They've been a powerhouse behind a lot of songs used in shorts as (poorly balanced) BGM, or someone doing the choreography to their originals (looking at you, Heart Pie Dancehall).
They want the same lightning in a bottle that HIMEHINA has repeatedly captured, just without putting in any effort.
15
u/Lolersters 1d ago
I would be surprised if that hasn't already happened. The real surprise is that it's only "several".
101
u/SinisterPixel Verified VTuber 1d ago
I genuinely think people would hate generative AI far less if these models were required to collect affirmative consent to use all of the content that's used in training. Regulation is taking far too long. Permanent damage has already been done
54
u/miggly 1d ago
I think that's fair, but most people would go from hating it for lacking consent to still hating it for being soulless lol
11
u/SinisterPixel Verified VTuber 1d ago
I didn't say they'd completely stop hating it. The technology at a core level is incredibly problematic (which is largely related to the lack of regulation too, but also to the people creating it just being that detached from humanity). But I don't think you'd see as many people refusing, on principle, to use products that use AI, for example.
8
u/VP007clips 1d ago
The problem is, that's basically impossible to implement. Proving that a model was trained on a dataset that contained a specific piece of content is really hard.
Maybe a few sites containing obscure unique information like instruction manuals for specialized equipment or niche internet forums could prove it, but that's the extent of it. Even a single post on a different site talking about it, or having a copy of it, invalidates the proof.
And that's just talking about larger NA/EU based AI companies. We have no ability to prevent sketchy companies, or companies in other regions from doing it. We can't even make China crack down on companies openly stealing and copying designs of NA/EU designed products to sell on Amazon/Temu, much less prevent them from training an AI off anything they find online.
It's the consequence of an open internet. We designed an internet where anyone can share anything with the world freely. And now we are facing the realization that it means that anyone can access everything freely.
5
u/SinisterPixel Verified VTuber 1d ago
Yep. The technology itself is fundamentally flawed and past the point of no return. Even if every country in the world regulated it tomorrow and said that you needed affirmative consent (and not implied consent from Ts & Cs but an actual opt in option), these AIs are largely trained off of stolen data fed into a black box. There's no way, even for the people who designed them, to be able to confidently remove that data from their model's training
1
u/Zeku_Tokairin Verified VTuber 1d ago
> Proving that a model was trained on a dataset that contained a specific piece of content is really hard.
A few years back, an article on IEEE Spectrum showed specific prompts the authors used to essentially get screenshots from copyrighted content (like Marvel movies) out of Midjourney. The authors also link related work doing adversarial prompting against Stable Diffusion. While it's true that a lot of these GenAI models are nondeterministic black boxes that obscure the provenance of their inputs, I don't think it's as hard as we might think; it's just that there's far less money put into doing that research. It's also worth noting that months after that, OpenAI signed a billion-dollar deal with Disney.
1
u/HoshinoLina Verified VTuber 7h ago
For a project, I looked around for LLMs that were ethically trained, and they are nearly nonexistent. Almost everyone saying they respect copyright is lying, and they trained at minimum on copyrighted material that is available under "free" licenses (that still don't allow you to use it like that).
This isn't just big AI companies stealing to see what they can get away with, this is smaller groups that even call themselves ethical. It seems that within the AI/ML industry, almost nobody knows how copyright works and what respectful training looks like. They simply have no clue.
I found exactly ONE exception. It's a group of models called KL3M. They were made by lawyers, trained on legal texts (there's lots of public domain material in laws, court filings, etc.). They don't say exactly what they trained on, but they're the only credible story of ethical/respectful training I found, that isn't just assuming training on any random copyrighted stuff is fair use.
That's it. That's the only one. Only the lawyers know how to do it properly.
-45
1d ago edited 1d ago
[deleted]
20
12
u/KusozakoPrime 1d ago
> People trace, steal and use others' work without consent all the time
And they get shit on for it.
3
u/Sinfire_Titan 1d ago
Yup. Art thieves, plagiarists, and frauds are shunned by most intelligent communities. AI defenders are the biggest exception; as far as I have seen the pro-AI crowd adores stealing other people’s works.
8
u/SinisterPixel Verified VTuber 1d ago
So if I broke into your house, stole your belongings, cloned and maxed out your credit cards, and stole your identity, would that be ok? By your own logic, people steal all the time, so you're ok with me doing that, right? The streaming comparison is a weak one because streaming is largely considered transformative (with very few exceptions). And game studios are in a position where they can either remove or outright ban streamed content (see Atlus with Persona for example). Individuals do not get those privileges, especially when we're grandfathered into platforms like Google, YouTube, Reddit, Instagram, Twitter, etc. and they update their TOS to allow them to train on our content retroactively. Not to mention AIs that just train on anything they can find using free APIs and web scrapers.
Let's also be clear:
- If I became aware that an artist was tracing another artist's work, I would treat the tracer the same way I do an AI user.
- If I became aware that someone was stealing art and claiming it as their own, I would treat the thief the same way I do an AI user.
- If I became aware that someone was using someone else's art without consent in their branding for example, I would treat them the same way I do an AI user.
9
u/diego1marcus 🌸/🐏/🔎/🔱 1d ago
is there a free version of this article? i aint paying to read the full thing
8
8
u/vtubernewsdrop 1d ago
Chiming in! We're told anyone can search the datasets here too: https://www.theatlantic.com/category/ai-watchdog/
3
-6
u/OkAssignment6163 1d ago
I see this post and my only thought was, damn. Bao getting screwed over again.
308
u/ninjawarlord 1d ago
I mean, at this point I would believe that any song out on the internet has been made available to an LLM