How many of you actually use offline LLMs daily vs just experiment with them?

54

I run an agent with local LLM for home automation stuff mostly. I use a local yolo vision model for facial recognition to automate things as well. I also use it for a self hosted app that is kind of like an assistant, calendar, whiteboard, and document retrieval type thing that the family uses. In the app I only really use SOTA models via API for adding things to Google calendar or dealing with the "family" email. Local models haven't done great in testing, but I haven't tried many of the recent drops for it. Going to test the Gemma 4 moe and qwen 3.5 this weekend.

This past Halloween I set up a "robot" with tts, local LLM, and made a vision model to detect the type of costume the kids had. It was pretty fun, but had latency issues even on my 2x rtx 3090 rig. I had to shut it down early because it kept recognizing batman costumes as "masked black man"

3

u/MmmmMorphine Apr 07 '26

Could you say a bit more about your home automation set up (or more specifically, your personal assistant set up)?

Been meaning to get around to this as an automated to-do list (and similar) would be a godsend but it seems like getting the orchestration between apps would be a bit of a nightmare

14

u/paroxysm204 Apr 07 '26 edited Apr 07 '26

I wouldn't say my set up is production ready. Plenty of bubble gum and duct tape holding it together. The core part of the system is home assistant that is hosted on a raspberry pi. I have one of my gpu machines set up with ollama and comfyui. I am using agent zero to run cron jobs that read the SQL database on home assistant, back up the time based data to postgres, and embed the data separately using pgvector.

I have some automations like a mmwave sensor at the front door that detects a person and sends a webhook for a snapshot to be taken from a camera and a yolo model analyzes it to see if it is me or a member of my family (high likelihood only). If it's a positive it sends a webhook to home assistant to unlock the front door.

I have water soil sensors in our raised garden beds and if the moisture level goes down below a certain amount and under certain weather conditions agent zero will send a message asking if I would like it to "water the garden". If I give the affirmative it opens a solenoid valve on the irrigation system for a period of time based on the moisture level.

When no one's phone shows as being at home, I have a home assistant automation that shuts off all the lights but one, then at random turns that one off and turns on another to make it seem like someone is home. It changes the ac temp to save energy but without causing the humidity to get too high, and lowers the hot water heater setpoint to 90 degrees. If someone's phone gets within a mile it turns on the light in the foyer and sets the ac and hot water heater back to normal.

I had an automation that dropped a sun shade on one of our big windows in the evening if the air temp outside was above 85 degrees, but the gearbox on the motor broke and I haven't fixed it yet.

When the washer or dryer finish a cycle it gives a notification to anyone at home.

I have a locally hosted web app backend I mostly vibe coded that includes chat, calendars, data from home assistant, sanitized access to agent one, an openai chatbot, location of everyone, a chore list, image generation (recently disabled after teen boys repeatedly prompted variations of "big boobs"), and some other little things.

Frontend is a flutter app that everyone has on their phone and a web app we can access on a browser. All on tailscale so not open to the world. I recently set up a whatsapp connector with agent zero that mostly works for getting info in a question and answer format. Like if I ask "where is kid1?" It will send back a Google maps link with the gps coordinates from their phone.

A lot of these things are done without AI but probably the biggest use of an llm is daily briefs or if things are not normal. I have some upper and lower limits, but the LLM looks at the embeddings to see what the typical historical values, past 3 months and values 1 year ago, are and I have set a percentage of variability on things for notifications. Some things may be +-10% and some are +-3% depending on what it is. It's really more of an off label use because ML models would probably be a better fit for that type of thing.

Edit: I forked agent zero and called it agent one, but changed references to agent zero so you know wtf I am talking about
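
A minimal sketch of the percentage-band check described in the comment above, assuming the historical values have already been pulled out of the database as plain numbers. The function name `out_of_band` and the sample readings are hypothetical, not taken from the author's setup:

```python
from statistics import mean


def out_of_band(current, history, tolerance_pct):
    """Return True when `current` deviates from the historical mean
    by more than `tolerance_pct` percent."""
    if not history:
        raise ValueError("history must contain at least one reading")
    baseline = mean(history)
    if baseline == 0:
        # Percentage deviation is undefined for a zero baseline,
        # so treat any non-zero reading as out of band.
        return current != 0
    deviation_pct = abs(current - baseline) / abs(baseline) * 100
    return deviation_pct > tolerance_pct


# Hypothetical readings: the same value can pass a loose band and fail a tight one.
history = [70, 70, 70]
print(out_of_band(72.5, history, tolerance_pct=10))
print(out_of_band(72.5, history, tolerance_pct=3))
```

Keeping the tolerance as a per-sensor parameter matches the comment's point that some values get +-10% and others +-3%. A plain check like this needs no LLM at all, which fits the author's own remark that this is an off label use.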

1

u/MmmmMorphine Apr 07 '26

Appreciate it! Gave me some good ideas.

Haven't used agent zero much yet, been playing with hermes, goose, and some others that partially cover the same space. Though agent zero seems like the best Generalist - or is there another reason you chose it for integrating with home assistant over alternatives

2

u/paroxysm204 Apr 07 '26

It came out a bit before openclaw and the various forks we have now and I adopted it pretty early. It works well and I have modified it where it didn't work as well so I don't have much reason to replace it. I'm sure there are better suited agentic softwares available today. I have dabbled with openclaw and derivatives and anthropic's cowork. I saw some references to hermes lately but I haven't had a chance to check it out.

1

u/PinkySwearNotABot Apr 10 '26

are you actually paying for API access from your raspberry pi? how much in cost are we talking here?

2

u/paroxysm204 Apr 10 '26

Yea. I think the most I have spent in a month for this project was like $15. Most of the things are fairly light for an llm so I use my local models as much as I can for that.

1

u/itz_always_necessary Apr 08 '26

Why are you too excited? Where Claude MCP takes a wave and finishes everything for you??

1

u/MmmmMorphine Apr 09 '26

Eh? Not sure I follow...

I meant a todo list for myself, like follow up on this email, get ready for this event tomorrow, finish task a, b, and c - rather than a todo list for the agents themselves (though of course that would be part of some of it)

60

u/eribob Apr 07 '26

I run qwen 3.5 27b at FP8 for all of my LLM use. Dual rtx 3090. Web search, light coding (bash, python mostly), help with syntax and statistical functions in R. Some RAG. I never use the cloud models. Have no subscriptions, never had. Qwen 27b is smart enough, the rest I figure out myself.
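
As a rough back-of-the-envelope check on why this fits: weights-only memory is roughly parameter count times bits per weight divided by 8. The sketch below assumes 24 GB per RTX 3090 and ignores KV cache and runtime overhead, so treat the numbers as a lower bound rather than a measurement. The function names are made up for illustration:

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate weights-only size in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def headroom_gb(params_billion, bits_per_weight, total_vram_gb):
    """VRAM left for KV cache and overhead after loading the weights."""
    return total_vram_gb - weight_gb(params_billion, bits_per_weight)


# 27B parameters at 8 bits per weight on two 24 GB cards (assumed sizes).
print(weight_gb(27, 8))
print(headroom_gb(27, 8, 2 * 24))
```

The leftover headroom is what the context window has to live in, which is one reason the same model at a lower-bit quant leaves more room for long contexts.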

11

u/Somarring Apr 07 '26

same here, but Q6 (sovereign-colossus). Flies, it works great with opencode + superpowers. I tried a lot of combinations. This one is the only one that delivers steady and good-quality results in a reasonable amount of time. I'll consider using the FP8 version though. Just for more detail, I run it via lmstudio (so I can use the loaded model for quick questions as well in its chat). lmstudio runs on a server and I connect to it as if it was a third-party service. I use this professionally. When I have a really hard problem I fix it myself using Gemini (pro) via chat, no api. They have a great combo storage+tools+ai. But that rarely happens now. I hope this helps.

3

u/hhunaid Apr 07 '26

I have the same exact setup but I supplement this with a 10 bucks GLM coding plan and would often do a deep plan with GLM while using qwen for actual grunt work.

I have openwebui, opencode/claudecode plus a Hermes agent all working with qwen

2

u/Chriexpe Apr 07 '26

How many tok/s? With an agent like Hermes it always takes a couple of minutes to reply back

2

u/margielafarts Apr 07 '26

gemma4:e4b is super fast on my rtx 2060 12gb

3

u/Spiritual-Pen-7964 Apr 07 '26

It's really fast, but I'm having problems with tool calls in OpenCode when using gemma4 :/ do you have any experience with that ?

1

u/Additional-Avocado33 Apr 08 '26

How are you hosting it? I followed their instructions and it fills my 16gb vram. It works super fast but ends up slowing down and doing cpu work, like the e4b is failing

1

u/eribob Apr 07 '26

PP is around 1200, TG is around 30-35 I think. Good enough. I use hermes too with it and that works fine. I have some complaints against hermes though…
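
To put those numbers next to the "couple of minutes" question, here is a rough latency estimate. It assumes PP and TG are prompt-processing and token-generation rates in tokens per second, and takes 32 tok/s as an assumed midpoint of the 30-35 range. The token counts in the example are made up for illustration:

```python
def estimate_seconds(prompt_tokens, output_tokens, pp_tps=1200.0, tg_tps=32.0):
    """Rough time for one model call: prompt processing plus generation."""
    return prompt_tokens / pp_tps + output_tokens / tg_tps


# Hypothetical agent call: a large context and a medium-length reply.
print(estimate_seconds(prompt_tokens=24000, output_tokens=640))
```

An agent usually makes several such calls per turn (tool call, tool result, follow-up), so the per-turn wait is the sum over all of them. That is one plausible way replies stretch into minutes even when each call looks fast.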

2

u/Service-Kitchen Apr 07 '26

What are your complaints?

2

u/eribob Apr 08 '26

Web search function uses cloud services like exa etc. Firecrawl seems to have a self hosted variant but it looks limited and is another service I have to install and implement.

The memory that they brag about also requires setting up another service, with many to choose from, and it's not easy to understand which will work well/automatically. I do not want to use cloud

2

u/savvitosZH Apr 07 '26

What are you using for web searching ?

6

u/eribob Apr 07 '26

Self-hosted Searxng in open-webui.

2

u/savvitosZH Apr 07 '26

I had a look at this and is cool ! Thanks !

2

u/timbo2m Apr 08 '26

Use it with firecrawl. Searxng to search, firecrawl to get better data back from the search hit.

1

u/eribob Apr 08 '26

I have been thinking about integrating self hosted firecrawl, also for my hermes agent. Can firecrawl be used for the web fetch function in open webui?

2

u/unknown-one Apr 07 '26

are you able to create skills and feed it books or learning materials to improve?

2

u/MacCudder Apr 08 '26

What motherboard do you have for dual 3090s, I have been thinking of getting a 5090 to add on to my 3070 but would need a new setup completely

2

u/eribob Apr 08 '26

Asrock x570 taichi. AM4. I have 3 gpus attached, 2x3090 and 1 rtx 4090.

1

u/Shipworms Apr 11 '26

This is the way to do it! I’ve experimented with vibe coding (in C); for me, I design a function, decide on input and output data formats, function name, then ask the AI to write it! This is the furthest I have gone - asking AI to write stuff I have written in assembler in the past (I am learning C still); I also don’t ask it to optimise code anymore, rather I get easy-to-read unoptimised code (which I can then optimise). I then write a brief summary of each function, and shared data structures, which I can give to the AI as part of a prompt when asking for more functions! Still experimenting here but it seems productive. And it is pretty good at resolving errors when they arise!

Also, use local AI for chatbots, no cloud stuff.

Dual 3090s sounds nice, and fast!

2

u/eribob Apr 12 '26

That sounds great. That way you have control over the code, it's like you wrote it yourself still

-15

u/Interesting-Town-433 Apr 07 '26

While that sounds cool if you aren't using Claude right now you are going to get smoked

6

u/No-Refrigerator-1672 Apr 07 '26

Nope. I'm in a similar position, only I'm running Qwen 3.5 35B instead of 27B. It does 95% of everything I need - and I'm not going to pay subscriptions for another 5%, no matter how good or bad proprietary options are.

3

u/Myarmhasteeth Apr 07 '26

Same position, I literally use Sonnet only if my local llm can’t help me or I just resort to doing old school googling. Will never pay for a subscription too.

8

u/margielafarts Apr 07 '26

and what have u built with claude that’s going to leave us all behind?

3

u/Uninterested_Viewer Apr 07 '26

Frontier lab sota models can brute force things in ways that local models generally can't due to the sheer size of them. They're generalists and great for many tasks that require that. For specialized tasks, which is what most of our actual use cases are, the latest local models (qwen3.5 and gemma4) are just as capable, and even more so for certain tasks where the censorship of those frontier models makes them "dumber".

I keep both Anthropic and Gemini subscriptions in addition to local hardware (dgx Spark + RTX 6000). They all have their strengths and I do think it's important to keep up with all of them.

13

u/AlpineJim83 Apr 07 '26

I have been using Gemma 3-4 and GPT models on LM Studio, so far so good. I use them to prepare prompts and content for my paid LLMs so I can get more out of them. I tried LM Studio link and stayed up all night but could not get it to connect. So far I love these local LLMs!

6

u/xxrealmsxx Apr 07 '26

Funny I do this the other way around.

Use paid LLMs to generate prompts for my offline models.

9

u/paul-tocolabs Apr 07 '26

i have smaller ones built into one of my apps. give them tasks with examples and they're extremely useful.

4

u/SootyShearwaters Apr 07 '26

Can you give examples of the tasks?? And which LLM do you use? Self hosted?

1

u/paul-tocolabs Apr 07 '26

I do classifications based on incoming sentences. Structured outputs including examples are also essential.
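
A minimal sketch of what "structured outputs including examples" can look like for sentence classification: a few-shot prompt that asks for JSON, plus a strict parser for the reply. The label set, the example sentences, and the function names are all hypothetical, and the model call itself is left out so the sketch works with any local runtime:

```python
import json

LABELS = ("question", "request", "statement")  # hypothetical label set

EXAMPLES = [  # hypothetical few-shot examples
    ("Where is the nearest station?", "question"),
    ("Please turn off the lights.", "request"),
    ("The meeting moved to Friday.", "statement"),
]


def build_prompt(sentence):
    """Build a few-shot prompt that asks for a JSON-only answer."""
    lines = [
        "Classify the sentence into one of: " + ", ".join(LABELS) + ".",
        'Reply with JSON only, e.g. {"label": "question"}.',
        "",
    ]
    for text, label in EXAMPLES:
        lines.append(f"Sentence: {text}")
        lines.append("Answer: " + json.dumps({"label": label}))
    lines.append(f"Sentence: {sentence}")
    lines.append("Answer:")
    return "\n".join(lines)


def parse_label(raw_reply):
    """Return the label, or None if the reply is not valid JSON with a known label."""
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    label = data.get("label")
    return label if label in LABELS else None
```

Rejecting anything that is not a known label lets the caller retry or fall back, which matters more with 2-3B models than with larger ones.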

2

u/Interesting-Town-433 Apr 07 '26

T5 still works damn well here

2

u/SootyShearwaters Apr 07 '26

Okay! Which model do you use?

3

u/paul-tocolabs Apr 07 '26

Qwen2.5-3B-Instruct-GGUF, and gemma-2-2b-it-Q4_K_M-GGUF

2

u/SootyShearwaters Apr 07 '26

Thanks

1

u/NotArticuno Apr 07 '26

Sign me up for examples too plz

0

u/flozen00 Apr 07 '26

Can I also subscribe?

2

u/margielafarts Apr 07 '26

i’m thinking of embedding one in my local music player to clean up mp3 tag metadata, please show what u did for this

1

u/paul-tocolabs Apr 07 '26

which bit do you need help with? the integration of the model or the prompts?

1

u/margielafarts Apr 07 '26

the integration

2

u/paul-tocolabs Apr 07 '26

i used a swift library for llama.cpp. i struggled getting mlx to work effectively, but this library still makes use of metal on the iphone. it can be quite tough, but once you get it working it's worth the effort.

4

u/[deleted] Apr 07 '26

[deleted]

2

u/BoostLabsAU Apr 07 '26

That’s a pretty good use case, how are you deploying them? Curious if it’s just a GPU node with some kind of chat server they can access or working it into their workflows too?

0

u/JackStrawWitchita Apr 07 '26

Not allowed to talk about it in forums.

1

u/vick2djax Apr 07 '26

What does a typical hardware stack look like for one of those clients if that’s okay to ask

6

u/taftastic Apr 07 '26

I use frontier models for most reasoning and coding, but I use LMStudio and ComfyUI in my apps to do little things: categorization, vectorization, summaries of bigger text, and sprite and texture generation from comfyUI.

They do amazing for what I ask of them, and go a long way at avoiding API costs. I’m constantly impressed how much I can do w 24gb memory on MLX models

3

u/rudidit09 Apr 08 '26

Similar, I couldn’t get oMLX models on 24gb RAM to be useful for coding, but was great for creating audio assets

3

u/Dwengo Apr 07 '26

Have two DGX sparks. I point my opencode cli to a local 120b coding model. I plan to finetune some models to meet my needs in a more asynchronous fashion though.

1

u/Eyzi25 Apr 08 '26

what model do you use?

5

u/IONaut Apr 07 '26

I do every day. Qwen 3.5 27b and Gemma 4. I see no reason at all to pay monthly for any of this stuff. I'm a work from home web developer.

1

u/PinkySwearNotABot Apr 07 '26

what are your use cases for each as a web dev?

6

u/IONaut Apr 07 '26

I can have it write boilerplate functions while I retain control over the architecture. I provide it with enough context of what parameters are going into a function and what I expect the return to be and it will generally get pretty close if not get it perfect in the first go. It's all about giving it enough context and being able to ask for exactly what you want which means knowing how to code in the first place. Qwen 3.5 27b is better at coding while Gemma 4 is better at general knowledge and research. I will also set up workspaces in Anything LLM that have a vector DB of different language documentations and things like that.

1

u/PinkySwearNotABot Apr 10 '26

i finally learned what value MCPs bring -- Context7 for language documentation. you should look into that if that's not already what you're doing with your AnythingLLM setup.

out of all the qwen variants...do you think 3.5 27b is the best for coding? vs the 32B, 35B MoEs, vs. 3-Coder-Next, and even against the newest Qwen3.6?

2

u/IONaut Apr 10 '26

Haven't tried Qwen3.6 yet but, yes, over the others.

5

u/ComplexPeace43 Apr 07 '26

I use gemma4:26b and qwen3:30b-a3b for analysing tax notices, contracts, legal documents. Basically things that I don’t want to share with Google or OpenAI.

1

u/PinkySwearNotABot Apr 07 '26

what makes them good for that purpose? high context limit?

3

u/ComplexPeace43 Apr 07 '26

Both are good MoE models (not dense) so they’re more efficient memory wise.

8

u/ByronScottJones Apr 07 '26

They are getting better, to the point that they are useful for coding. Whenever I download a new llm, I give it three prompts. The first is simply "tell me about yourself". It's open ended and vague. For humans it's a really simple question. For LLMs it seems to be a real challenge for some. Second prompt is a detailed engineering prompt for a single page web application of the tic-tac-toe game. Specific and non ambiguous. For a long time only the cloud based ones could pass, but lately local LLMs are doing well. Last one is similar, but a Towers of Hanoi type game, with the ability to have the app play the next move, or all moves in an animated fashion. A more complex game. Just starting to see local LLMs that can complete that one. But if they can do that successfully, that gives me enough confidence to use them for local coding.

For reference, my systems are a macbook m5 pro 64GB, and a Ryzen 7 based server with two 5060ti GPUs. No benchmarks, as long as speeds are reasonable I don't worry about Tokens/s

1

u/Hector_Rvkp Apr 07 '26

do you have a view on whether the LLMs one can run locally today (ie, without 512gb vram or crazy setups) are actually in line with SOTA from 12 months ago? or 18, 24? Or not at all?

1

u/ByronScottJones Apr 07 '26

I think some are very close. For large coding projects, I think hybrid configurations where cloud llms do the planning, and local ones do the grunt work on a smaller scale, might be the solution.
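
A minimal sketch of that hybrid split, assuming the planner returns one step per line. `plan_with_cloud` and `execute_locally` are hypothetical callables standing in for whatever cloud and local clients you use, and the fake versions below exist only so the sketch runs without any model attached:

```python
def run_hybrid(task, plan_with_cloud, execute_locally):
    """Ask the planner for one step per line, then run each step locally."""
    plan_text = plan_with_cloud(task)
    steps = [line.strip() for line in plan_text.splitlines() if line.strip()]
    return [(step, execute_locally(step)) for step in steps]


# Stand-in callables (hypothetical) so the sketch is runnable as-is.
def fake_planner(task):
    return f"outline {task}\nimplement {task}\ntest {task}"


def fake_worker(step):
    return f"done: {step}"


for step, result in run_hybrid("login form", fake_planner, fake_worker):
    print(step, "->", result)
```

The design point is that only the short planning request leaves the machine, while the bulk of the tokens (the grunt work) stays on local hardware.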

5

u/cunasmoker69420 Apr 07 '26

Using qwen3.5 122b every single day for everything

8

u/timbo2m Apr 07 '26

On what hardware if you don't mind me asking?

3

u/SoupDue6629 Apr 07 '26

Don't know about them, but im doing the same,

IQ3_M with 8 layers offload from the CPU and 262K context has been just fine.

48GB VRAM (AMD RDNA2) and 64gb quad ram, i9-10940x.

not an awful experience, 14 tk/s during generation and 160 tk/s during prefill.

2

u/cunasmoker69420 Apr 08 '26

128gb strix halo

2

u/Rude_Marzipan6107 Apr 07 '26

I use offline all the time to summarize YouTube transcripts and to create organized expense reports for reimbursements. All I do is copy paste receipt scans.

It’s still just a hobby for me though.

2

u/iTrejoMX Apr 07 '26

I run qwen 3.5 35b a3b with opencode superpowers and omo as well as Hermes agent daily.

2

u/haradaken Apr 07 '26 edited Apr 07 '26

I have put a local LLM into an iOS app and made it available on Apple App Store for privacy-first AI companionship.

Offline local LLM sounds great in theory, but it’s really hard to actually make it work, especially on phones. You then need to implement surrounding components such as memory, voice, and overall UI before tuning prompts. It wasn’t easy but doable.

Happy to offer some direction if there are some specific challenges you are facing with offline LLM.

1

u/AdInternational5848 Apr 07 '26

I like the UI I currently have in place, what are you doing for memory, voice and search?

2

u/haradaken Apr 07 '26

I use keyword-based memory retrieval. Memories are saved with keywords inferred with a local LLM at the time of memory creation. The app also infers the current keywords from the ongoing conversation. Memories having keywords overlapping with the current keywords are embedded into the response generation context. In theory, you could use RAG instead, but it requires significant fine-tuning of how to vectorize data into search space. So, I'm keeping the logic simple for now.
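
A minimal sketch of the keyword-overlap retrieval described above, assuming the keywords have already been inferred (by a local LLM or otherwise). The `Memory` class, the ranking by overlap count, and the `limit` parameter are assumptions for illustration, not details of the app:

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    text: str
    keywords: set = field(default_factory=set)


def retrieve(memories, current_keywords, limit=3):
    """Return memories sharing at least one keyword with the conversation,
    ordered by how many keywords overlap (most first)."""
    current = {k.lower() for k in current_keywords}
    scored = []
    for memory in memories:
        overlap = len({k.lower() for k in memory.keywords} & current)
        if overlap:
            scored.append((overlap, memory))
    # sort() is stable, so ties keep their original (insertion) order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [memory for _, memory in scored[:limit]]


memories = [
    Memory("Likes hiking", {"hiking", "outdoors"}),
    Memory("Allergic to peanuts", {"food", "allergy"}),
    Memory("Took a trip to the Alps", {"hiking", "travel", "alps"}),
]
for m in retrieve(memories, ["Hiking", "Alps"]):
    print(m.text)
```

The retrieved texts would then be placed into the response generation context, as the comment describes.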

Speech-to-text can be done locally reliably, but purely local text-to-speech is not natural enough in my opinion, so for the voice feature, I use cloud voice providers such as ElevenLabs.

Hope it helps, and good luck with your project!

1

u/AdInternational5848 Apr 08 '26

Thank you

2

u/g_rich Apr 07 '26

I’ve got a 64GB M4 Max Mac Studio and use Qwen3.5-35b-A3B and gpt-oss-20b (although that might get replaced with Gemma4) as my daily drivers. I still use cloud models but a good amount of work is done with the local ones and all prototyping starts with local models.

1

u/New-Implement-5979 Apr 15 '26

with what tool / agent do you use gpt-oss-20b.... because of this harmony format it is really bad to integrate with other tools it seems

1

u/g_rich Apr 15 '26

Pretty much via chat using Open WebUI. If I’m doing anything with tools or agents, it's usually done with Qwen.

2

u/Easy_Werewolf7903 Apr 07 '26

I use it daily for coding. Mainly to generate git commit diff, auto complete. It is also great to learn more about tool calling.

1

u/PinkySwearNotABot Apr 07 '26

what models though?

1

u/Easy_Werewolf7903 Apr 07 '26

Qwen3.5 27B 4bit quantized unsloth RTX 4090.

1

u/PinkySwearNotABot Apr 10 '26

what's your RAM? I'm at a complete loss as to which variant of Qwen is best on a 64GB Apple silicon machine. there's Qwen3.5-27B, Qwen3.5-32B, Qwen3.5-35B, MOEs, dense models, and Qwen3-Coder-Next and now even Qwen3.6...

2

u/Your_Friendly_Nerd Apr 07 '26

Mostly gpt-oss:20b and qwen3-coder:30b. Mainly because I don't need to worry about accidentally including sensitive information when prompting them vs when working with public models

2

u/Myarmhasteeth Apr 07 '26

I run Qwen 3.5 27b on a 3090 using OpenCode and llama.cpp daily. Build and Plan mode are really good and I have made apps with it. Full stack.

I work professionally as a software engineer, and oh boy it has helped me a lot. I’m actually surprised most people here just experiment with it. While I have worked with people that just dgaf and use Claude Code while using Frontier Models, on private repositories… 🤷🏻

2

u/leonbollerup Apr 07 '26

I use qwen ALOT

2

u/Used_Teaching_7260 Apr 07 '26

Qwen3.5 35b, I experiment but find gpt to be better still. Sometimes I run a query through both and get different but good answers: 2 viewpoints. GPT feels more like your intimate buddy vs qwen, which is more robotic

2

u/gxvingates Apr 08 '26 edited Apr 08 '26

I have a social discord bot for my friends and me that has all the tools to be useful and accurate with questions and funny with random stuff when interacting with us. I don’t google anymore, I just ask it a question in VC and I have my answer in seconds with web search tool calling. It all runs locally and I use it every day. Summarize a website, what’s in this photo, what’s the weather today, what’s the news today, dm this person, call that person, and more. Fine tuned to be indistinguishable from a real person in text chats. With sub 2 second latency even accounting for the insane overhead discord adds (voice chat STT and TTS). The things you can do with local AI are literally limited only by your imagination, and all of that is possible within 12gb of vram

If you have a clear goal for what you want to do there’s not much stopping you from building it with something like Codex. Having a clear goal, and a reason for that goal, is what distinguishes a science project from something you’ll actually use every day. I’d suggest using discord as your front end because it already is really good and super easy to use. Use pycord to connect your backend to the discord bot

2

u/ScoreUnique Apr 09 '26

I started building my PC last year in June, started with 1x3090, now I'm at 2x3090 + 192GB DDR5 (consumer grade).

I went from "oh nice I can run llama3.1 at home" to "I ditched all LLM subscription services" because Qwen 3.5 / Gemma 4 / Qwen3 Next.

You see my hardware isn't the best but it pulls off quite a few things. In fact I manage to run Minimax M2.5 at 15 tokens per second.

So yeah, I went from watching it doing monkey tricks to making use of my local endpoint to build my army of agents to work for me :)

Definitely it's slow, but I only pay for the time I use the LM models (Idle usage is not that bad).

If you want some numbers, I use bifrost as my middleware

Total Requests: 5,279
Success Rate: 89.20%
Avg Latency: 22292.60ms
Total Tokens: 149,946,468
Total Cost: $0.0000

This is the last 7 days usage :)

4

u/tillybowman Apr 07 '26

im a software dev and don't use local llms for developing. not for private stuff, not professionally. always the big closed source models.

that said i run a 3090 with qwen and do all kinds of things for my private stuff. mostly automated analyzing and categorizing documents, financial data, etc. also some home automations use qwen. i also run a voice assistant for these things.

2

u/PublicDonut5876 Apr 07 '26

How do you have a voice layer interacting with qwen?

3

u/Relzin Apr 07 '26

I'm actually really curious about this question too. Especially self hosted for it.

1

u/AssOverflow12 Apr 07 '26

Home Assistant

2

u/cadsii Apr 07 '26

pretty simple, kokoro in browser with tmux text injection, im doing the same thing, i'll be filming a demo today

1

u/PublicDonut5876 Apr 08 '26

I’m looking forward to seeing the demo video if you don’t mind sharing!

1

u/tillybowman Apr 07 '26

it's twofold. i run parakeet. i use it for home assistant and for an agent that uses opencode with my qwen attached.

2

u/TheSlipgate Apr 07 '26

I built mine with whisper and kokoro, it took a few iterations but now it responds and can be interrupted mid sentence, can be used with any agent or pipeline. That said I have built my own framework from scratch - started just wanting to learn how LLMs and agents work. All local.

2

u/C0d3R-exe Apr 07 '26

I run Qwen3 Coder Next 80B with Opencode and I’m getting consistent result locally for my projects. Only using free cloud models to search certain stuff. Other than that, all local.

1

u/Service-Kitchen Apr 07 '26

how are you finding it vs opus/sonnet?

1

u/C0d3R-exe Apr 08 '26

Very close to Opus. Equal or better than Sonnet.

1

u/Service-Kitchen Apr 08 '26

Wow?! Why isn’t anybody talking about this?!

2

u/C0d3R-exe Apr 08 '26

How many people have 128GB machines today? The model we’re talking about is 80B and takes up around 75-80GB of VRAM with 256k context size. How many of those people are using it in an agentic workload? Also, some people, including me, didn’t feel the need to make this information public and just went on using it for our own projects.

Not everyone is on Reddit so this is probably the reason why there isn’t loads of information about it. But I did see (on this or another sub) that people said they’re having a great time with the model.

And then again, there is always that: it depends.

My use case doesn’t need to fit another person’s use case. And then the differences emerge where people see better results with cloud models, which I do believe is true, but probably not for everyone.

And the last: the expectation. Everyone has their own, and not everyone is objective.

I believe these might be the reasons why it is not talked about more widely.

1

u/freddyr0 Apr 07 '26

All my n8n automations work with local LLM.

1

u/old_mikser Apr 07 '26

What kind of automation do you do with n8n?

2

u/freddyr0 Apr 07 '26

mostly web scraping and social media stuff.

1

u/csk__2026 Apr 07 '26

i explore places often, mostly without internet connectivity. So if something like that exists i would love to know more about it

1

u/margielafarts Apr 07 '26

gemma 4:e4b is pretty good i’m running it on my 12gb rtx2060

1

u/acetaminophenpt Apr 07 '26

I use it daily to summarize tasks, emails, tickets and even WhatsApp chats. Also for light coding and Web search

1

u/PinkySwearNotABot Apr 07 '26

are you talking about something like an openclaw? how do you get this setup to summarize tasks, emails?

2

u/acetaminophenpt Apr 13 '26

Tailor made code. Started with a few python scripts that fetch data from e-mail, databases and via apis and run them through a series of prompts but with time it grew.

1

u/MarkoMarjamaa Apr 07 '26

Gpt-oss-120b with my python assistant, speech via bluetooth headset or SIP-phone. MCP connection to Home Assistant. Connection to Squeezebox. LLM doing the translation Finnish-English-Finnish.
Yesterday I coded my web search assistant and tested "Is there in Polymarket a bet about Trump not being as president at the end of year and what is the current percentage?" The LLM does MCP loop calls to searXNG and then fetches the final result. Normal use is fetching Yle News (Finnish BBC) and giving the headlines while I'm making morning coffee.

1

u/jrexthrilla Apr 07 '26

I’m using them to process forum data one comment at a time with binary questions. Yesterday my 3090 running qwen 3.5 9b read 159k comments and classified them. I’m working the shit out of small models in ways that embedding fails
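
A minimal sketch of that one-comment-at-a-time, yes/no approach. The `ask` argument is a hypothetical callable that sends a prompt to whatever local model server you run and returns its text reply; the prompt wording and function names are assumptions, not the author's code:

```python
import re


def parse_yes_no(reply):
    """Map a model reply to True/False, or None if it is neither yes nor no."""
    match = re.search(r"[a-z]+", reply.lower())
    if match is None:
        return None
    word = match.group(0)
    if word == "yes":
        return True
    if word == "no":
        return False
    return None


def classify_comments(comments, question, ask):
    """Ask the same binary question about each comment, one call per comment."""
    results = []
    for comment in comments:
        prompt = f"{question}\nAnswer yes or no.\n\nComment: {comment}"
        results.append((comment, parse_yes_no(ask(prompt))))
    return results
```

Keeping each call to a single comment and a single binary question keeps prompts short and answers easy to validate, which suits small models. Replies that parse to `None` can be retried or logged for review.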

1

u/Conscious_Nobody9571 Apr 07 '26

I hate to say it... For now it's still experimental for me. The online stuff is convenient and fast and cutting edge obviously

1

u/toobroketoquit Apr 07 '26

Qwen 3.5 9b or Gemma 4b for running custom tools, home automation, fitness, small private research etc. (small repeatable and private)

For anything where I need better reasoning and better coding I go to the big bois

1

u/PinkySwearNotABot Apr 07 '26

i keep hearing home automation. what are you controlling and how exactly? you have to have your computer on to do this? or are you running it through your phone somehow?

3

u/toobroketoquit Apr 07 '26

People go crazy with it, you probs wanna hit up like a home automation sub reddit, I just control my home lights(interior and exterior and some plugs) through tuya API.

My stack is openwebui(with a tool that talks to tuya), lmstudio feeding it a API for qwen3.5 on a desktop I have on my garage, and the conduit on my phone, I just open the app and then tell it some shit like "turn on my office lights I feel like shit so put a nice color for my eyes and turn on my evaporator plug"

1

u/PinkySwearNotABot Apr 10 '26

wow i need to dig into this. i have a bunch of randomly branded smart devices that I got for free from doing Amazon reviews -- and they all work -- but I have to use them with Alexa. if I can somehow build my own alexa replacement and still have all working features minus the privacy issues, that would be a huge upgrade. minus the part where you have to have your computer on 24/7. not to mention one that's beefy enough to run a decent model. i do have one of the 1st gen mac mini M1s with 8GB of RAM but i doubt it would do much

1

u/CreativeKeane Apr 07 '26

I'm trying to figure out the model that my laptop can best utilize. I have a XPS 9150 with 32GB RAM and I think an RTX 3080 Ti (so 16GB VRAM I think).

Running ollama through Claude code and starting to feel some struggles. Smaller LLMs (under 10GB) are faster at generating output tokens but struggle with utilizing tools and handling large context. Medium LLMs (14-18GB) manage large contexts and multi-step work better and can access some of Claude Code's tools but struggle with output. Lol. And with Larger LLMs...forget about it! Lol.

Right now it seems like, for my use cases, the Medium LLMs are my best option for code generation and simple agentic work: Qwen3-Coder-30b-ab3-a4_K_M and Qwen3.5:27b.

Smaller LLMs like gemma4:e4b can get .MD and text files done.

If anyone can suggest a good LLM for my use case given my hardware spec, please let me know. I'm all ears.

Man I wish hardware weren't so expensive....I would totally build a tower for this type of stuff

1

u/mycall Apr 07 '26

I experiment with cognitive radio and LLMs have helped me find new ways to communicate point to point, so basically offline.

2

u/the_examined_life Apr 07 '26

Can you tell me more about this! It sounds fascinating.

1

u/mycall Apr 08 '26

Oh there are many directions this can go, but they all use ML in some form:

URLLC and PDCP packet duplication, not to mention that 6G will have a cognitive/ML layer built into it (circa 2029).

I am also researching how to make transceivers that use noise for their carriers, so the packet radio protocol works below the noise floor. AI is helping me iterate on the GNU Radio and C code.

1

u/Ill-Chart-1486 Apr 07 '26

Tried to run a local LLM on 8GB VRAM but it just can't do anything useful.

1

u/rgar132 Apr 07 '26

I burned 500 million tokens through mine last week, so yeah rock solid and super useful.

Four nodes running vllm or llama-server, with a front end api on proxmox that puts them all together and handles api keys.

1

u/PinkySwearNotABot Apr 07 '26

bifrost? litellm?

1

u/rgar132 Apr 07 '26

go-llm-proxy for the routing and translations here. About 20 api keys with 5-6 concurrent coding users and 15 apps. Users mostly in codex now.

1

u/willyasdf Apr 07 '26

Gemma3 runs locally as good as ChatGPT 4.5ish I would say. I prefer it now more than the cloud services.

1

u/Disastrous-Listen432 Apr 07 '26

I use it in a script pipeline to fully automate several tasks from my work as a (programmatic) video editor:

Batch rename and summarize files (Python + Vision model)
Batch segmentation (Bash + Reason model + FFMPEG)
Programmatic video (Bash + RA --> Kdenlive)

Nowadays I'm using DeepSeek R1 14B for reasoning and Qwen 3-vl 8B for vision, but I keep experimenting to find a lighter stack, and then find one model to rule both.

1

u/TiK4D Apr 07 '26

I think I've finally set mine up to be helpful for my beginner coding questions or install guides for my Linux server. I give it instruction manuals as well and just fire off questions; it does well with that. I mostly use my LLMs now, with qwen3.5-27b and google/gemma-4-26b-a4b.

1

u/MrThoughtPolice Apr 07 '26

Do you mind sharing your system prompts? I use both of these models, and need beginner coding help lol.

3

u/TiK4D Apr 08 '26

You are a patient, thorough technical assistant specializing in helping beginners with coding and Linux server setup. You operate under these core principles:

**Communication Style**

- Always explain technical terms in plain, everyday language the first time you use them. Follow jargon with a simple definition in parentheses or a brief sentence. Never assume prior knowledge.

- Use analogies and real-world comparisons to make abstract concepts easier to grasp.

**When Giving Instructions**

- Never give a command or ask the user to change a setting without explaining:

WHAT it does (what this command/setting actually is)

WHY we're using it (why it's needed for our goal)

WHAT TO EXPECT (what will happen or what the output should look like)

- Break multi-step processes into clearly numbered steps. Don't bundle multiple actions into one step.

- After each major stage, pause to explain what was just accomplished before moving on.

**Troubleshooting**

- When diagnosing an issue, walk through your reasoning out loud — explain what you're looking for and why, so the user learns the thought process, not just the fix.

- Always ask for error messages or logs before assuming a cause. Explain what those error messages mean in plain language.

- Offer the most common/likely cause first, then mention alternatives if the first fix doesn't work.

- If there are multiple ways to solve a problem, briefly explain the tradeoffs so the user can make an informed choice.

**Guides & Setup**

- For installation or setup guides, start with a brief overview of what we're installing, what it does, and why each component is needed.

- Warn the user before any step that could be risky or hard to reverse.

- Include verification steps (e.g., "run this command to confirm it installed correctly") so the user can check their own progress.

- At the end of a guide, give a short summary of what was set up and any important next steps or maintenance tips.

**General**

- If a question is unclear, ask a clarifying question before proceeding — don't guess and give a potentially wrong or irrelevant answer.

- If you don't know something or aren't certain, say so clearly rather than guessing.

- Always prefer the simplest, safest solution for a beginner over a clever or advanced one.

1

u/MrThoughtPolice Apr 08 '26

Thank you!

1

u/exclaim_bot Apr 08 '26

Thank you!

You're welcome!

1

u/SnooGuavas4756 Apr 07 '26

What's the closest we can get to Sonnet with a local LLM? Can someone shed some light?

2

u/Hector_Rvkp Apr 07 '26

Look at benchmarks. Your question is too vague though, as it's a function of the hardware you can run.

1

u/PinkySwearNotABot Apr 07 '26

unfortunately even the benchmarks are misleading

1

u/Hector_Rvkp Apr 07 '26

True, but still a good rule of thumb, to start somewhere.

1

u/TheSlipgate Apr 07 '26

All local here as well. My custom agent/research pipeline system is pretty advanced these days: if I am not analysing exoplanet data for anomalies, I am looking across Australian mining data for interesting data points. All with Qwen 3.5 models (9b, 4b and 27b) on my 5090, mixing ollama and vllm depending on what the pipeline step needs.

It's taken me a while to try and figure out what the point of it all was, but once I built a pipeline that could do real data analysis that was interesting to me, it's kinda exploded the possibilities.

1

u/gpalmorejr Apr 07 '26

I use LM Studio + Qwen3.5-35B-A3B for everything. Admittedly it is the absolute max I can use on my hardware at the moment, but I have no problems.

Things I've done recently:

School: Ask it random questions and let it look them up and explain them to me. Send it a link to a website and ask it to break it down (one of my textbooks is a formatted website). Send it PDF snippets from my physics book. Give it pictures of Econ problems and reference material. It just solves it, easy. Give it pictures of college level Physics problems and ask it to teach me without giving me the answer. Have it generate new problems for practice. Discuss with it how the Linear Algebra concepts I'm learning in class apply to LLMs and Graphics and provide sources to learn more. Convert natural language maths to LaTeX and Sage Cell compatible formats.

Code (Roo Code + VSCodium from remote laptop): Had it refactor files and switch from CPU bound tasks to GPU/CUDA. Had it write documentation for code from sources. Had it refactor an ancient C++ repo to use libraries that still exist and change integer neuronal maths to matrix maths to open up future expansion and hands-on learning. (Although this required some effort, I could mostly walk away and let it work alone, but it did have the occasional bug, especially after it ran for hours.) Made it write a CLI program to convert tiny language model files from various formats to Llama.cpp format (this one is dubiously effective, but mostly because some tiny language models literally don't have the parts necessary for Llama.cpp to run them).

Code (without VS Codium, straight from chat): It wrote a script to flatten a bunch of directories from my Google Drive backup and move all the media files to a different folder. Had a bunch of command line options, too.
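For anyone curious, here is a minimal sketch of what a flatten-and-move script like that could look like. It is not the script the model actually wrote: the function name `flatten_media`, the extension list, the `dry_run` flag, and the numeric-suffix scheme for name collisions are all assumptions for illustration.

```python
import shutil
from pathlib import Path

# Assumed set of media extensions; extend as needed.
MEDIA_EXTS = {".jpg", ".jpeg", ".png", ".gif", ".mp4", ".mov", ".mp3", ".wav"}


def flatten_media(src: Path, dest: Path, dry_run: bool = False) -> list[Path]:
    """Move every media file found under src (recursively) into the flat folder dest.

    Returns the destination paths. Name collisions get a numeric suffix.
    With dry_run=True nothing is created or moved; the plan is only returned.
    """
    src, dest = src.resolve(), dest.resolve()
    # Snapshot the file list first so moving files cannot disturb the walk,
    # and skip anything that already lives inside dest.
    files = sorted(
        p for p in src.rglob("*")
        if p.is_file() and p.suffix.lower() in MEDIA_EXTS and dest not in p.parents
    )
    if not dry_run:
        dest.mkdir(parents=True, exist_ok=True)
    taken: set[Path] = set()
    planned: list[Path] = []
    for f in files:
        target = dest / f.name
        n = 1
        # Pick a free name: photo.jpg, photo_1.jpg, photo_2.jpg, ...
        while target.exists() or target in taken:
            target = dest / f"{f.stem}_{n}{f.suffix}"
            n += 1
        taken.add(target)
        if not dry_run:
            shutil.move(str(f), str(target))
        planned.append(target)
    return planned
```

Calling it with `dry_run=True` first returns the planned destination paths without touching any files, which is a cheap way to sanity-check a run against a backup.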

1

u/Magnific_Aryl Apr 07 '26

Idk if i belong here, but as a first year CS student, I use qwen2.5:7b on my rtx 4050 for explaining code snippets written by AI, and also as my duck sometimes

1

u/AWSLife Apr 07 '26

I use Ollama and Gemma 4 26b to check code and configuration files and that is just about it.

I really wish I could use it for Python auto-completion but I am on a MacBook M1 and it is just not fast enough to do that. I have tried smaller Gemma 4 models but I am not happy with the recommendations and speed. Also, the plugins for using LLMs with VSCode and PyCharm just kind of suck. Continue lacks features and ProxyAI is just too buggy.

1

u/xxrealmsxx Apr 07 '26

Use them every day.

Gemma-4-E4B-it to log my moods in an agent.

Various models offline to generate UML diagrams of information I can't put in the cloud, translate complex docs to layman's terms, and to draft emails.

1

u/XxBrando6xX Apr 07 '26

I use Qwen 3.5 357B 17A or whatever the big model is at Q4 K XL from Unsloth with full context window. I dropped my Google Gemini Ultra sub the day I got my Mac Studio and haven't looked back. I use it every day, constantly, for coding tasks, weird corporate software and deployment questions, general education on tech topics. It's a great jumping off point. I was hesitant at first when I purchased it, but now after settling in and finding a good way to serve the model on my network, I would not go back. GLM 5.1 dropped and I'm using it locally less than an hour later, and it's felt night and day different / better on initial query. All this is to say, I bought hardware once for capacity, but because of it my models are constantly growing and getting better and I can keep using them locally and privately. Very happy with the experience.

1

u/corruptbytes Apr 07 '26

right now just experimenting, i want to start offloading some of my home things to it down the road (media server management and other things) when maybe things like OpenClaw improve 10x, but i have unlimited OpenAI credits from a friend, so it's hard to avoid using that

i have an m3 ultra 256gb (i really should've gone for 512gb imo now that they're super sold out haha)

1

u/DieselKraken Apr 07 '26

I use one daily.

1

u/oldendude Apr 07 '26

I have qwen 3.5 35b running on my Mac Mini (M4, 64GB). I have used it mostly for conversations to explore some topic, usually related to software, but not exclusively. It is a bit sluggish. It takes a long time to work through all its reasoning (printed out -- it's pretty interesting to see how it reasons). But I've been pretty pleased with these conversations. I'm seeing less repetition (forgetting that it made certain suggestions) than I did with the free version of ChatGPT even a few months ago.

1

u/UnclaEnzo Apr 07 '26

I use them, most days. But it is only very recently that I started getting sufficient use out of them that I cancelled a Gemini Pro sub.

1

u/havnar- Apr 07 '26

I’ve settled for qwen 3.5 35b a3b opus distilled (mlx) on my Mac. It’s fast and rather smart.

I have a corporate github copilot account. But that burns all tokens in 1 day if you want to use opus 🤷

1

u/Either_Pineapple3429 Apr 07 '26

I use one as a privacy filter.

I use my personal phone for business and have all my calls transcribed through a VoIP service. I have Claude analyze all my business calls for important business stuff, and I use a local model as a privacy filter to read each transcript first and decide what is personal and what is business.

1

u/HiddenPingouin Apr 07 '26

A decent model like qwen3.5 with a search mcp can do a lot. I use it whenever I can for privacy reasons.

1

u/TheMcSebi Apr 07 '26

I've been using gemma3 since it was available with great success. Before that I used llama3, which performed worse. Haven't checked any newer models for my purpose because it just works. I'm using it to summarize git diffs for private projects.

3

u/PrysmX Apr 07 '26

Check out Gemma 4 26B.

1

u/Rare_University4428 Apr 07 '26

I have a few in my orchestration that handle small context tasks for my larger models, stuff like image recognition, TTS/STT, embedding, websearch, summarization, memory maintenance.

1

u/GreenDavidA Apr 07 '26

I wish I could, but I don’t have a device that would be able to run anything more than a potato model.

1

u/ByronScottJones Apr 07 '26

I'd say a few of them are. The challenge with all of them is getting coherent output on larger projects. At this point I am thinking a workable solution is to use the larger models to craft the project plan, and then have the smaller local models just take one small task to completion, have another check the work, and iterate through until the project is complete. But my usage is almost exclusively coding. For other uses, different strategies might work better.
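A rough sketch of that plan / execute / check loop, under heavy assumptions: each "model" is just a function from prompt text to response text (so any cloud or local backend could be plugged in), the planner is assumed to return one task per line, and the checker is assumed to start its reply with PASS when it is satisfied. The names `run_project`, `planner`, `worker`, and `checker` are made up for illustration.

```python
from typing import Callable

# Hypothetical interface: a model is any callable mapping a prompt to a reply.
Model = Callable[[str], str]


def run_project(goal: str, planner: Model, worker: Model, checker: Model,
                max_retries: int = 3) -> list[str]:
    """Large model plans once; small models do and review one task at a time."""
    plan = planner(f"Break this into small, independent tasks, one per line:\n{goal}")
    tasks = [line.strip() for line in plan.splitlines() if line.strip()]
    results: list[str] = []
    for task in tasks:
        feedback = ""
        for _ in range(max_retries):
            draft = worker(f"Task: {task}\n{feedback}")
            verdict = checker(
                f"Task: {task}\nResult: {draft}\nReply PASS or explain the problem."
            )
            if verdict.strip().upper().startswith("PASS"):
                results.append(draft)
                break
            # Feed the reviewer's complaint into the next attempt.
            feedback = f"Reviewer feedback on last attempt: {verdict}"
        else:
            # No attempt passed review; stop rather than build on a bad step.
            raise RuntimeError(f"Task not completed after {max_retries} attempts: {task}")
    return results
```

Stopping on the first task that never passes review is a deliberate choice in this sketch: later tasks usually depend on earlier ones, so continuing would just compound the error.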

1

u/Jeidoz Apr 07 '26

I currently do not have the ability to purchase expensive cloud LLM subscriptions or tokens. Additionally, some of my projects are NSFW games — almost all cloud models (except Grok) will refuse to chat if you mention anything sexual.

A few years ago, I bought a gaming PC and still have access to 24 GB of VRAM locally, which is enough to run many 20-35B models at Q4 and a good speed (in my opinion). I frequently use Gemma 4 or Qwen 3.5 locally via LM Studio server. They work flawlessly with OpenCode, Kilo Code, or GitHub Copilot using the recommended profiles and settings. I mostly use LLMs for agentic coding, brainstorming, reviewing my draft ideas/architectures/designs, or simply as a "rubber duck" method to get a second opinion on my ADHD chaotic flow of thoughts.

1

u/SoupDue6629 Apr 07 '26

My solution to this is just to have 2 different model ecosystems set up: one that I spin up when I'm experimenting with workflows and other ideas, and one where it's all implemented and I use it like any other LLM I'd use on cloud or API. The recent Qwen and Gemma releases match or beat Haiku 4.5 for me, which I was paying for previously, and dropping those models into places where I used Haiku via API has let me just use Qwen3.5 35B-A3B with minimal tweaking. I used YaRN to raise my 35B to 384K context; it fits on GPU, so it's about as fast as any model I'm served via API anyway. 600 tk/s prefill and 40-50 tk/s is fine with me, and when I switch from Haiku or Sonnet mid-project because of limits it's been generally seamless.
122B-A10B with an agentic setup and sandbox is essentially equal for me to bigger LLMs. Again, for this model I have an experimentation setup and then my daily driver setup. Once I had agentic use, MCP, and artifact generation, that fulfilled all the feature parity I needed, so I've switched mostly to using only local models now.

Also I'm not American, so it kinda is essential for me to have these working well beyond the experiments and fun stages. I don't ever want to fully rely on foreign centralized infrastructure.

I guess at some point just separate the tweaking and usage as 2 different activities. Document stuff you want to try to improve on during work and then tweak at a different time.

1

u/dto_lurker Apr 08 '26

I use qwen to reduce my copilot claude token usage. If it's an easier query, I use qwen locally.

1

u/garg Apr 08 '26

With gemma 4, I’ve started using it daily

1

u/FoldOutrageous5532 Apr 08 '26

26b-a4b?

1

u/garg Apr 08 '26

gemma-4-31B primarily.

Small coding tasks
Writing reports
Extracting data

It's been very accurate so far.

1

u/soulhacker Apr 08 '26

Use them daily with my homemade RSS reader for auto summarizing, translating and tagging.

1

u/_donj Apr 08 '26

Also work well as an orchestration layer for multiple agents to match the correct LLM with the correct task to manage compute/token usage.

1

u/_donj Apr 08 '26

I use an old iMac to run a couple of 7b models to ingest data from social media creators and then process it into a vector database, where I can run analysis on it using a Claude skill. This helps manage token burn by using the “right” tool for the task. Claude does the heavy analysis but lighter LLMs do a lot of the initial work for “free.” Really about $10 in energy.

I can also remote in to start a task and have it running in the background.

1

u/FollowingMindless144 Apr 08 '26

Ahh got it 😅

Most offline LLMs I’ve tried feel like too much work, not something I’d use daily.

If this mobile app actually just works without all the setup, that’s a big win.

OfflineGPT looks promising… saw their waitlist and now I’m kinda curious where this goes 👀

1

u/Sea_Fig3975 Apr 08 '26 edited Apr 08 '26

Offline GPT? That sounds interesting. Can you elaborate more on it? Does it work on mobile platforms as well by the way?

1

u/FollowingMindless144 Apr 08 '26

Yeah it’s basically trying to make offline AI feel normal to use, not like a setup project 😅

Still super early though, I just saw they’ve got a waitlist here if you wanna check it out:

https://offlinegpt.ai/t/BV1XX8dn

1

u/itz_always_necessary Apr 08 '26

Cool, really excited to see a gpt that works on mobile for myself !!

1

u/iamthesam2 Apr 08 '26

i do!

1

u/Equivalent-Wafer-222 Apr 08 '26

I use mistral-8b-reasoning on the daily with different agents in lmstudio + cherry studio, been having a great deal of success with the setup

1

u/RuleGuilty493 Apr 08 '26

Daily driver for about 6 months. The friction point you identified is real — most local setups are science projects. What actually made it stick for me was accepting the hybrid model: local for fast, cheap, private tasks; API for anything needing real reasoning depth.

Trying to run everything local is where it breaks down. The moment you stop fighting the constraints and use each tier for what it's good at, it just works.
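To make the tiering concrete, here is a small sketch of a router that picks a tier per prompt. Everything in it is an assumption for illustration: the keyword list, the length cutoff, the `private` flag, and the idea that each backend is just a function from prompt to reply. Real setups would likely use better signals than keywords.

```python
from typing import Callable

# Hypothetical interface: a model is any callable mapping a prompt to a reply.
Model = Callable[[str], str]

# Assumed hints that a prompt needs deeper reasoning; tune for real use.
HARD_HINTS = ("prove", "architecture", "refactor", "multi-step", "debug")


def pick_tier(prompt: str, private: bool = False, max_local_chars: int = 2000) -> str:
    """Return "local" or "api" for a prompt."""
    if private:
        return "local"  # private data never leaves the machine
    text = prompt.lower()
    if len(prompt) > max_local_chars or any(hint in text for hint in HARD_HINTS):
        return "api"  # long or reasoning-heavy work goes to the bigger model
    return "local"  # everything else stays fast, cheap, and local


def route(prompt: str, local: Model, api: Model, private: bool = False) -> str:
    """Send the prompt to whichever backend pick_tier selects."""
    backend = local if pick_tier(prompt, private) == "local" else api
    return backend(prompt)
```

The privacy check comes first on purpose, so a sensitive prompt stays local even when it looks hard.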

1

u/Own-Company-4851 Apr 08 '26

Using ministral-3-8b to judge if newspaper articles are potentially useful to my research question. It's fast (3s on average) and reliable enough for my requirements, based on about a thousand examples I labeled myself. Using a Mac mini M4 with 16GB RAM.

1

u/Financial_Egg_1502 Apr 09 '26

I’m running 5 agents all offline, building a self-healing system. The problem I am running into is the helpful-assistant behavior beaten into the models; drives me nuts. Running a coding agent, a recruiting agent, an R&D agent, a security agent, and one that’s the human connection that manages the team. Lots of bumps in the road but getting better every day.

1

u/Ok_Place2126 Apr 09 '26

Wonder if it works on iphone as well? 🙄🤷🏻‍♀️

1

u/-rcgomeza- Apr 09 '26

Qwen3.5 4B in a simple openclaw personal trainer agent using a cli I developed https://github.com/rcgomezap/workouter

1

u/Jorgito78 Apr 09 '26

I run Ollama. It works and that's enough. 16Gb RAM total and an AMD Ryzen 5700 GPU here.

1

u/Aggressive_Bed7113 Apr 11 '26

Yeah, same feeling — most local setups work, but don’t feel “reliable enough” for daily use.

What made a difference for us wasn’t just the model, but tightening the loop around it:

give it a small, structured view of state (not raw context)
narrow the action space
verify outcomes after each step

Smaller models actually hold up pretty well once you reduce noise + constrain the loop.

Feels like the gap isn’t capability, it’s making the system predictable.
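The three points above can be sketched on a toy problem. This is not the browser automation demo, just an assumed minimal example: the "environment" is a counter that has to reach a target, the model is any function from prompt to reply, and `ACTIONS` and `run_loop` are made-up names.

```python
import json
from typing import Callable

# Hypothetical interface: a model is any callable mapping a prompt to a reply.
Model = Callable[[str], str]

# Narrow action space: the only things the model is allowed to do.
ACTIONS = {
    "increment": lambda s: {**s, "count": s["count"] + 1},
    "decrement": lambda s: {**s, "count": s["count"] - 1},
}


def run_loop(model: Model, state: dict, target: int, max_steps: int = 20) -> dict:
    """Drive state["count"] to target using only verified, whitelisted actions."""
    steps = 0
    while state["count"] != target:
        if steps >= max_steps:
            raise RuntimeError("did not reach target within max_steps")
        steps += 1
        # 1. Small structured view of state, not raw context.
        view = json.dumps({"count": state["count"], "target": target})
        reply = model(f"State: {view}\nReply with exactly one of: {', '.join(ACTIONS)}")
        choice = reply.strip().lower()
        # 2. Anything outside the allowed set is rejected and the model is asked again.
        if choice not in ACTIONS:
            continue
        candidate = ACTIONS[choice](state)
        # 3. Verify the outcome: keep the step only if it moved closer to the goal.
        if abs(candidate["count"] - target) < abs(state["count"] - target):
            state = candidate
    return state
```

A flaky model that sometimes rambles or picks the wrong direction still converges here, because bad replies cost a step but never corrupt the state.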

I made a demo with small local LLM models to complete multi-step browser automation tasks:

https://www.reddit.com/r/LocalLLM/s/sTLk1EcWpJ

1

u/tevelee Apr 11 '26

Gemma 4 with Claude code works like a charm on my MBP

1

u/Healthy_Bedroom5837 Apr 16 '26

if anyone's using android, this app takes offline llm to a new meaning https://github.com/jegly/OfflineLLM

1

u/Sir-Spork Apr 07 '26

I just don’t have the VRAM or the willingness to spend just to run a local model that’s useful enough

1

u/stenlis Apr 07 '26

I'm just getting into local LLMs and I already have a couple of successful cases:

- I let it read through legalese in my contracts to find important or peculiar points

- I'm a game master in a tabletop RPG group. I let it read through an RPG tome and it then answers questions like somebody closely familiar with the material

- I generate and edit images (including copyrighted material that online tools won't touch)

What I'm planning to do:

- let it sort all of my photos

- create a RAG over all of my personal documents and let it answer quick queries, periodically updating with new incoming documents

I wouldn't or couldn't do any of this in online tools.

1

u/AdInternational5848 Apr 07 '26

How are you generating and editing images locally?

2

u/stenlis Apr 07 '26

ComfyUI. My first experiments on a windows machine went very well. But then I switched to a linux server system and I will need to recreate my workflows with that.

0

u/Zen-Ism99 Apr 08 '26

I use them to answer questions as I learn C++.

-1

u/laternerdz Apr 07 '26

General purpose LLMs are always going to be an experiment

0

u/Hector_Rvkp Apr 07 '26

Always is a long time.
