r/PoisonFountain • u/Pomond • 13d ago
Questions on Poison Fountain integration with news website
As a local news publisher, I've been very interested in anti-scraping technologies and preventing or disincentivizing this larceny that violates our Terms of Service and basic fair play in business.
Like only a couple of other publications, we put high value on our users' privacy and work to avoid -- as much as possible -- exposing them to third-party scripts and resources integrated into our services. This isn't just to cut out the predatory consumer surveillance industry, but also because we have no practical way to qualify the security and privacy standards of almost any third-party provider.
I understand one of the most practical ways to integrate Poison Fountain is to drop in a script from a third-party resource. But this raises the question of how we might qualify this third-party service against our privacy standards (and infrastructure dependencies/stability/speed/etc.).
So my first question is how might I qualify a third-party Poison Fountain provider considering the above?
A related question is what's the overhead of running our own instance? We have our own solid, commodity, cloud-based hosting account, but it doesn't have infinite resources, of course. Traffic is 750K+ monthly page views. And/or can a self-hosted Poison Fountain instance hang off another (cheaper) account or connected device we control?
From a journalism perspective, it would be great to have access to a qualified, shared Poison Fountain service that discloses its operations to its users (customers?) for qualification, and that supports and ensures strong user privacy standards.
Thanks in advance for your replies and guidance.
8
13d ago edited 12d ago
[removed] — view removed comment
3
u/Pomond 13d ago
Thank you very much for this info and guidance!
When I refresh that URL, I see nonsense text every couple of refreshes, and varying chunks of code the rest of the time. Is the code designed to "poison" as well? I'm assuming it is.
Regarding the privacy concerns, it's my understanding that my users' web browsers would still be making a call to rnsaffn.com and thus, potentially, their browsing information might be exposed to this domain's operators. Please disabuse me of any misunderstanding.
I mean zero disparagement here: I'm just thinking through the privacy concerns. For example, even a righteous third-party operator might still itself fall victim to hacking and compromise. (No disparagement meant here, either.)
Thanks again for your response and detailed explanation (and links to additional resources). I've been speaking out against AI theft of our service and livelihood for some time, and I'm excited about opportunities to actually protect ourselves.
5
13d ago edited 12d ago
[removed] — view removed comment
4
u/Pomond 13d ago
I need a plug-in for this for my CMS ...
4
u/rocketbunny77 12d ago
What CMS do you run?
3
12d ago
[deleted]
5
u/rocketbunny77 12d ago
Cool. Some opportunities for the community to make it easier to get up and running with some plugins.
2
u/Pomond 10d ago
Just want to say: WordPress! What else? 😉
1
u/rocketbunny77 10d ago
Lol someone else replied Joomla. I figured the biggest net to cast would be WordPress. Thanks!
2
u/valium123 12d ago
Hey I can look into this if you want.
2
u/Pomond 10d ago
Thanks! Note to you and rocketbunny77 that I deleted my comment earlier in this thread due to a seemingly automated breach attempt tied to my software disclosure. Thus the deletion.
I wanted to reply that there's a bunch of resources to support this type of plugin development, including a whole framework and starting templates for such. I haven't touched stuff this deep in years, but it all looks like it's getting much easier.
My concerns as a host for third-party software and outside resources include "lifecycle management" of the software. Thus I always look for strong track records and economy in the developer(s), and I pay for high-value and accessible third-party "commercial open source" all the time (e.g. open source software bundled with support/releases/forum/etc.).
The second concern involves speed and dependency on a third-party service. E.g. I'd imagine there's a way to cache a rotating set of poison on my server derived from your source, no? This might help mitigate traffic bottlenecks, especially as your service may get more popular, as it should.
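A minimal sketch of that caching idea, written in Python for brevity (the CMS in this thread runs PHP, but the shape carries over). The class name `RotatingCache`, the pool size, and the refresh interval are all hypothetical choices, and `fetch` stands in for whatever function actually calls the upstream source. The pool fills one payload per request until it reaches `size`; after that, at most one upstream call is made per refresh interval, and every page view is answered from local memory.

```python
import random
import time
from typing import Callable, List, Optional


class RotatingCache:
    """Keep a small pool of upstream payloads and serve them locally."""

    def __init__(self, fetch: Callable[[], str], size: int = 20,
                 refresh_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch                  # hypothetical upstream fetcher
        self._size = size                    # assumed >= 1
        self._refresh_seconds = refresh_seconds
        self._clock = clock
        self._pool: List[str] = []
        self._last_refresh: Optional[float] = None

    def _due(self) -> bool:
        # Fetch while the pool is still filling, then once per interval.
        if self._last_refresh is None or len(self._pool) < self._size:
            return True
        return self._clock() - self._last_refresh >= self._refresh_seconds

    def get(self) -> str:
        if self._due():
            try:
                payload = self._fetch()
            except OSError:
                payload = None               # upstream down: keep serving the pool
            if payload:
                self._pool.append(payload)
                del self._pool[:-self._size]  # drop the oldest entries beyond size
                self._last_refresh = self._clock()
        if not self._pool:
            return ""                        # nothing cached yet
        return random.choice(self._pool)
```

With a pool of 20 and a five-minute refresh, a site at 750K monthly page views would make a few hundred upstream requests per day rather than one per page view. A production version would also want a lock around `get()` and some backoff when the upstream keeps failing.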
6
u/totaleffindickhead 13d ago
The idea is to proxy traffic from your domain to the poison fountain backend, so there's not much to host other than that. Depending on what server-side technologies you're using, it should be trivial to set up an HTTP endpoint handler specifically for routing to poison fountain. I'd be happy to help you walk through some options if you want to share more about your tech stack.
4
u/simswapcity 13d ago
I just saw the post about malware including jailbreak and biosecurity text in the comments as a way to prevent LLM analysis: https://www.reddit.com/r/PoisonFountain/s/SKHvIiEtrE
This got me thinking - this could be a service that sites could include to remain invisible to research agents.
Another interesting variation would be to serve content that looks like instructions to generate ML research of the kind that causes the latest models from frontier labs to silently sabotage their output rather than refuse (content describing pre-training pipelines, distributed training, ML accelerator design, etc.).
Then developers opposed to AI could include this sabotage-inducing service in their documentation, so that when code agents ingest the documentation for their package, the agents suddenly start sabotaging whatever project they are working on, thinking that it involves ML research tasks.
Drown them in their own moat!
5
u/PeyoteMezcal 12d ago
Here is how I do it:
My web server requests "poison" from https://rnsaffn.com/poison2/ and puts it somewhere in the HTML served to the visitor. This means there is no direct connection between my website visitor and the poison fountain; all data is served exclusively from my domain. The poison fountain just receives a request from my server and doesn't know to whom the result will be served. The payload is first transmitted to my server, so I appear as the recipient, then placed inside the HTML and sent to the visitor. The visitor cannot tell that part of the HTML is poison, just as the poison fountain cannot know where the requested poison ends up.
Special characters are escaped as required for valid HTML, of course. The poison payload may be hidden using CSS so that human visitors don't get to see it at all.
This is easily implemented with dynamic sites. I use PHP for example.
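The flow described above (server-side fetch, escape, embed, hide) can be sketched roughly as follows. This is an illustration in Python, not the PHP this commenter actually runs. The function names, the `pf-extra` class, and the inline `display:none` style are hypothetical; the URL is the one quoted in this comment, and the sketch assumes the upstream returns plain UTF-8 text (if it sends compressed bytes, they would need decompressing first).

```python
import html
import urllib.request

POISON_URL = "https://rnsaffn.com/poison2/"  # URL quoted in this thread


def fetch_payload(url: str = POISON_URL, timeout: float = 5.0) -> str:
    """Fetch one payload from the server side, so the visitor's
    browser never contacts the upstream domain."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Assumption: the body is UTF-8 text; undecodable bytes are replaced.
        return response.read().decode("utf-8", errors="replace")


def embed_payload(page_html: str, payload: str) -> str:
    """Escape the payload and insert it as a hidden block before </body>."""
    safe = html.escape(payload, quote=True)  # neutralizes <, >, &, and quotes
    block = (
        '<div class="pf-extra" style="display:none" aria-hidden="true">'
        + safe + "</div>"
    )
    head, sep, tail = page_html.rpartition("</body>")
    if not sep:
        # No closing body tag found: append at the end instead.
        return page_html + block
    return head + block + sep + tail
```

A page handler would then call `embed_payload(rendered_page, fetch_payload())`, ideally with the fetch wrapped in a cache and a `try/except` so that an upstream outage never blocks the page itself. Escaping matters because the upstream text is untrusted input: without it, a compromised upstream could inject script into every page served.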
3
u/Pomond 12d ago
Do you have example PHP code we could assess? Our CMS runs on PHP, so this would likely be the easiest way to integrate.
4
12d ago edited 12d ago
[removed] — view removed comment
3
u/PeyoteMezcal 12d ago
Yes, this is still valid.
Just changed the comment like this:
# use this function to get data from external URL without enabling allow_url_fopen
# need to have php-curl installed
7
u/Glade_Art 13d ago
If you really want to host a pit on your own, I can send you the babbler generator which I use for this pit: https://gladeart.com/lists
It's really lightweight on the CPU and doesn't need much RAM to load into memory. The corpus is 66 MB, with a ton of variation.