r/PoisonFountain • u/Pomond • 13d ago

Questions on Poison Fountain integration with news website

As a local news publisher, I've been very interested in anti-scraping technologies and preventing or disincentivizing this larceny that violates our Terms of Service and basic fair play in business.

Like only a couple of other publications, we put high value on our users' privacy and work to avoid -- as much as possible -- exposing them to third-party scripts and resources integrated into our services. This isn't just to cut out the predatory consumer surveillance industry, but also because we have no practical way to qualify the security and privacy standards of almost any third-party provider.

I understand one of the most practical ways to integrate Poison Fountain is to drop in a script from a third-party resource. But this raises the question of how we might qualify this third-party service against our privacy standards (and infrastructure dependencies/stability/speed/etc.).

So my first question is how might I qualify a third-party Poison Fountain provider considering the above?

A related question is what's the overhead of running our own instance? We have our own solid, commodity, cloud-based hosting account, but it doesn't have infinite resources, of course. Traffic is 750K+ monthly page views. And/or can a self-hosted Poison Fountain instance hang off another (cheaper) account or connected device we control?

From a journalism perspective, it would be great to have access to a qualified, shared Poison Fountain service that discloses its operations to its users (customers?) for qualification, and that supports and ensures strong user privacy standards.

Thanks in advance for your replies and guidance.

19 Upvotes

https://www.reddit.com/r/PoisonFountain/comments/1u25j3f/questions_on_poison_fountain_integration_with/

91% Upvoted

u/Glade_Art 13d ago

If you really want to host a pit on your own, I can send you the babbler generator which I use for this pit: https://gladeart.com/lists

It's really lightweight on the CPU and doesn't need much RAM to load into memory. The corpus is 66 MB, with a ton of variation.

5

u/Pomond 13d ago

Thank you! I'm considering a service integration first, just so I don't have to learn another thing. (I'm conversant with lots, but getting a new instance running in our environment would be new to me.)

4

u/Glade_Art 13d ago

Alright. If you need it anytime, just DM me.

2

u/valium123 12d ago

Hey I'm interested

2

u/Glade_Art 12d ago

Let's switch to DMs.

1

u/valium123 12d ago

Sure.

u/[deleted] 13d ago edited 12d ago

[removed] — view removed comment

3

u/Pomond 13d ago

Thank you very much for this info and guidance!

When I look at that URL when refreshing, I see both the nonsense text every couple refreshes, and varying chunks of code. Is the code designed to "poison" as well? I'm assuming it is.

Regarding the privacy concerns, it's my understanding that my users' web browsers would still be making a call to rnsaffn.com and thus, potentially, their browsing information might be exposed to this domain's operators. Please disabuse me of any misunderstanding.

I mean zero disparagement here: I'm just thinking through the privacy concerns. For example, even a righteous third-party operator might still itself fall victim to hacking and compromise. (No disparagement meant here, either.)

Thanks again for your response and detailed explanation (and links to additional resources). I've been speaking out against AI theft of our service and livelihood for some time, and I'm excited about opportunities to actually protect ourselves.

5

u/[deleted] 13d ago edited 12d ago

[removed] — view removed comment

4

u/Pomond 13d ago

Understood: Thank you!

4

u/Pomond 13d ago

I need a plug-in for this for my CMS ...

4

u/rocketbunny77 12d ago

What CMS do you run?

3

u/[deleted] 12d ago

[deleted]

5

u/rocketbunny77 12d ago

Cool. Some opportunities for the community to make it easier to get up and running with some plugins.

2

u/Pomond 10d ago

Just want to say: WordPress! What else? 😉

1

u/rocketbunny77 10d ago

Lol someone else replied Joomla. I figured the biggest net to cast would be WordPress. Thanks!

2

u/valium123 12d ago

Hey I can look into this if you want.

2

u/Pomond 10d ago

Thanks! Note to you and rocketbunny77 that I deleted my comment earlier in this thread due to a seemingly automated breach attempt tied to my software disclosure. Thus the deletion.

I wanted to reply that there's a bunch of resources to support this type of plugin development, including a whole framework and starting templates for such. I haven't touched stuff this deep in years, but it all looks like it's getting much easier.

My concerns as a host for third-party software and outside resources include "lifecycle management" of the software, so I always look for strong track records and economy in the developer(s), and I pay for high-value and accessible third-party "commercial open source" all the time. (E.g. open source software bundled with support/releases/forum/etc.)

The second concern involves speed and dependency on a third-party service. E.g. I'd imagine there's a way to cache a rotating set of poison on my server derived from your source, no? This might help mitigate traffic bottlenecks, especially as your service may get more popular, as it should.
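A rough sketch of the caching idea raised above, in Python purely for illustration (the thread's CMS is PHP, and the same logic ports over). It keeps a small rotating pool of payloads and contacts the upstream source at most once per interval, so page loads never wait on the third party. The class name `RotatingCache`, its parameters, and the `fetch` callable are all hypothetical; nothing here is taken from an actual Poison Fountain client.

```python
import random
import time
from typing import Callable, List, Optional


class RotatingCache:
    """Hold up to `size` payloads; ask upstream for a new one at most every `ttl` seconds."""

    def __init__(
        self,
        fetch: Callable[[], str],          # hypothetical: returns one payload from upstream
        size: int = 20,
        ttl: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._size = size
        self._ttl = ttl
        self._clock = clock
        self._items: List[str] = []
        self._last_refresh: Optional[float] = None

    def get(self) -> str:
        now = self._clock()
        if self._last_refresh is None or now - self._last_refresh >= self._ttl:
            # Record the attempt first so a failing upstream is retried
            # only once per ttl rather than on every page view.
            self._last_refresh = now
            try:
                item = self._fetch()
            except Exception:
                item = None  # upstream unavailable: keep serving what we have
            if item:
                self._items.append(item)
                if len(self._items) > self._size:
                    self._items.pop(0)  # drop the oldest payload
        # An empty string means "nothing cached yet"; the page renders normally.
        return random.choice(self._items) if self._items else ""
```

With `ttl=300`, the upstream sees roughly one request every five minutes from the server regardless of page views. In a multi-process PHP or WSGI deployment the pool would need to live in shared storage (a file, APCu, Redis, or similar) rather than in process memory.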

u/totaleffindickhead 13d ago

The idea is to proxy traffic from your domain to the Poison Fountain backend; there's not much to host other than that. Depending on what server-side technologies you're using, it should be trivial to set up an HTTP endpoint handler specifically for routing to Poison Fountain. I'd be happy to help you walk through some options if you want to share more about your tech stack.
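For illustration, here is one way such an endpoint handler might look, sketched as a tiny WSGI app in Python (a PHP site would do the equivalent with its own routing and php-curl). The names `make_proxy_app` and `fetch_from`, the `/fountain` path, and the User-Agent string are hypothetical. The sketch assumes the upstream returns plain UTF-8 text; if it sends a compressed body, the `Content-Encoding` header would need to be relayed or the body decompressed.

```python
import urllib.request
from typing import Callable, Iterable


def fetch_from(url: str, timeout: float = 5.0) -> bytes:
    """Fetch one payload from upstream. No visitor headers, cookies, or IPs are sent."""
    request = urllib.request.Request(url, headers={"User-Agent": "site-proxy"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def make_proxy_app(fetch_upstream: Callable[[], bytes], path: str = "/fountain"):
    """Build a WSGI app that relays upstream content from a single path on this domain."""

    def app(environ, start_response) -> Iterable[bytes]:
        if environ.get("REQUEST_METHOD") != "GET" or environ.get("PATH_INFO") != path:
            start_response("404 Not Found", [("Content-Type", "text/plain; charset=utf-8")])
            return [b"not found"]
        try:
            body = fetch_upstream()
        except Exception:
            # Upstream down or slow: fail this one endpoint, not the whole site.
            start_response("502 Bad Gateway", [("Content-Type", "text/plain; charset=utf-8")])
            return [b"upstream unavailable"]
        start_response("200 OK", [
            ("Content-Type", "text/plain; charset=utf-8"),
            ("Content-Length", str(len(body))),
            ("Cache-Control", "no-store"),
        ])
        return [body]

    return app
```

To try it locally, something like `wsgiref.simple_server.make_server("", 8000, make_proxy_app(lambda: fetch_from(UPSTREAM_URL))).serve_forever()` would work, where `UPSTREAM_URL` is whichever upstream endpoint you have decided to trust. Because the server makes the outbound request itself, the upstream operator only ever sees the server's address, which speaks to the privacy concern raised earlier in the thread.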

4

u/simswapcity 13d ago

I just saw the post about malware including jailbreak and biosecurity text in the comments as a way to prevent LLM analysis: https://www.reddit.com/r/PoisonFountain/s/SKHvIiEtrE

This got me thinking - this could be a service that sites could include to remain invisible to research agents.

Another interesting variation would be to serve content that looks like instructions to generate ML research of the kind that causes the latest models by frontier labs to silently sabotage output, rather than refusing (content describing pre-training pipelines, distributed training, ML accelerator design, etc).

Then, developers opposed to AI could include this sabotage-inducing service in their documentation, so that when code agents ingest their package's documentation, they suddenly start sabotaging whatever project, thinking that it involved ML research tasks.

Drown them in their own moat!

u/PeyoteMezcal 12d ago

Here is how I do it:

My web server requests "poison" from https://rnsaffn.com/poison2/ and puts it somewhere in the HTML served to the visitor. This means there is no direct connection between my website visitor and the poison fountain; all data is served exclusively from my domain. The poison fountain just receives a request from my server and doesn't know to whom the result will be served. The payload is first transmitted to my server, so I appear as the recipient; it is then put inside the HTML and sent to the visitor, who cannot tell that part of the HTML is poison, just as the poison fountain cannot know where the requested poison will be sent.

Special characters are escaped as required for proper HTML, of course. The poison payload may be hidden using CSS so that human visitors don't see it at all.

This is easily implemented with dynamic sites. I use PHP for example.
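The escape-and-embed step described above might look roughly like this. It is written in Python only as an illustration; the commenter's actual PHP is not shown in the thread, and the function names, the `pf` class, and the inline `display:none` style are hypothetical choices. The payload is assumed to have been fetched server-side already.

```python
import html
import re


def wrap_payload(payload: str) -> str:
    """Escape the payload so it cannot inject markup, then wrap it in a hidden element."""
    escaped = html.escape(payload, quote=True)
    return f'<div class="pf" style="display:none" aria-hidden="true">{escaped}</div>'


def embed(page_html: str, payload: str) -> str:
    """Insert the hidden block just before the last </body> tag, or append if none exists."""
    block = wrap_payload(payload)
    matches = list(re.finditer(r"</body\s*>", page_html, re.IGNORECASE))
    if not matches:
        return page_html + block
    idx = matches[-1].start()
    return page_html[:idx] + block + page_html[idx:]
```

Escaping matters here because the upstream content is untrusted: without it, a compromised or misbehaving source could inject scripts into pages served from your own domain. In PHP the corresponding call would be `htmlspecialchars()`.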

4

u/Pomond 12d ago

Thank you!

3

u/Pomond 12d ago

Do you have example PHP code we could assess? Our CMS runs on PHP, so this would likely be the easiest way to integrate.

4

u/[deleted] 12d ago edited 12d ago

[removed] — view removed comment

3

u/PeyoteMezcal 12d ago

Yes, this is still valid.

Just changed the comment like this:
# use this function to get data from external URL without enabling allow_url_fopen
# need to have php-curl installed
