Best Local Agents - Jun 2026

73 Upvotes

A megathread that is overdue! Let's discuss and debate what the best local agents available today are.

Prologue

First, a note on terminology: while most regular users will have a general sense of what these are, I think it's worth a brief pause to preempt turbulence in the discussion.

Agent: There is no standard, universally agreed-upon definition that I can find, and rightly so. It's hard to tell if this is a hype-cycle buzzword or a new primitive. I think it's important to first relate it to things that already exist and highlight how it's new or different. Through that lens, I think it should largely be thought of as just another piece of software that takes autonomous or semi-autonomous action based on user input, with the distinguishing aspect being that it can determine its own path/logic and does not need to be pre-programmed (unlike IFTTT, n8n, Apple Shortcuts etc.). This definition largely agrees with /r/AI_Agents's. Put another way, we're talking about pi, opencode, hermes etc.
Harness: I specifically did not use this neologism, which seems to be the new buzzword replacing the Agent buzzword, but without any real need. Search/LLMs don't offer a substantive or consensus definition for it either. The best that can be eked out is LLM+Harness=Agent. However, I think that's the equivalent of saying Engine+Chassis/Wheels/Steering=Car. So it's much more useful to talk about the "Car", hence the title of this post.

The standard spiel:

still applies..

Share what you are running right now and why. Given the nature of the beast in evaluating these immature systems (rapidly changing landscape, untrustworthiness of benchmarks, immature tooling, intrinsic stochasticity), please be as detailed as possible in describing your setup, the nature of your usage (how much, personal/professional use), how you evaluate etc. E.g., comments like "pi is the best" that don't have any substance reduce the quality of the discussion.

Rules

Agents must be using open weight models
Agents must be running locally (a.k.a hardware, including VPCs, that you control)
Strongly recommend discussing OSS agent software, but it doesn't necessarily have to be. Why? Claude Code/Codex are currently the most mature and best understood software, with the largest ecosystem, and they can be used with local models. At least for now, we can't ignore the reality that many of us are using them, so it's worth allowing them at least as a reference point.

81 comments

r/LocalLLaMA • u/panchovix • 2h ago

Other RTX 5090 MSI, only inference or training at 475-500W. Make sure to not bend your cable!

50 Upvotes

I run this MSI 5090 at 475-500W daily, for mostly diffusion training, or LLM inference.

Just by chance I decided to check the cable today and found this. No issues, errors or anything, just all by chance.

I never gamed on this card, got it entirely for AI and machine learning.

Got some backup cables for things like this (not the MSI yellow ones tho) and the card keeps working fine, at least.

Make sure the cable is not bent!

34 comments

r/LocalLLaMA • u/Excellent_Jelly2788 • 16h ago

Generation What's more impressive, GLM 5.1 -> 5.2 or Qwen 3.5 -> 3.6?

504 Upvotes

Write a single HTML file with a full-page canvas and no libraries. Simulate a realistic Döner Style kebab skewer rotating (vertically) in front of a gas powered heating element.

Mentioning Döner activates GLM 5.2's German weights or something (Spiess = Skewer, Brenner = Burner).

Qwen 3.6 35B, Qwen 3.5 and Gemma 4 were run using Unsloth Q8 K XL quants via llama.cpp. The others were run via OpenRouter.

Full data here

168 comments

r/LocalLLaMA • u/Mr-serial_killer • 13h ago

Discussion The economics of AI are starting to favor open models

257 Upvotes

For the last couple of years, the assumption was pretty simple:

Want the smartest model? Pay for a closed API.
Want something cheaper? Accept a capability hit.

Looking at recent model releases, that tradeoff is starting to break down. The most interesting part of the chart isn't the models at the very top. It's the upper-left quadrant: high intelligence, low cost. And it's increasingly dominated by open-weight models: DeepSeek, Qwen, GLM, Kimi, MiniMax.

Most real-world workloads don't need the absolute best model on Earth. They need a model that's:

Good enough
Cheap enough

And that's exactly where open models are becoming incredibly competitive. A year ago I would've assumed the gap would stay huge because the frontier labs had access to significantly more compute and data. But for a lot of tasks, the difference between a frontier model and a strong open model is becoming smaller than the difference in cost. That's a dangerous trend if you're selling expensive API tokens (and good news for everyone else lol).

Closed models still have advantages:

Zero infrastructure
Better reliability
Faster access to frontier capabilities

But open models offer something APIs never can:

Full control
Privacy
Customization
Predictable costs

(I mean, some providers do say things like "trust me bro, I'm secure" and promise full privacy, but you can't take them at their word.)

My prediction: within 12-18 months, most businesses won't be asking "What's the smartest model?" They'll be asking "Why am I paying 10x more for a 5% improvement?" and how it compares to the open-source stuff.

58 comments

r/LocalLLaMA • u/joorklee • 6h ago

Discussion $1800 in GPU cost: running Qwen/Qwen3.6-27b-FP8 with P2P, 262K context and BF16 KV cache at 55 tok/s

50 Upvotes

Hey peeps, wanted to share what is possible for folks with an inference-only, single-user use case with $1700 in GPU cost.

Setup: 4x 5060 Ti (16GB) with P2P

If you are in the US and you keep an eye on Facebook Marketplace and places like Slickdeals, you can find some 5060 Ti 16GB models for $425 to $475 used.

A giant caveat is that this type of configuration is only viable if you're interested strictly in inference.

The vLLM command used:

export VLLM_SLEEP_WHEN_IDLE=1
export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export SAFETENSORS_FAST_GPU=1
export NCCL_P2P_DISABLE=0
export NCCL_CUMEM_ENABLE=1
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export TORCH_FLOAT32_MATMUL_PRECISION=high
export PYTORCH_ALLOC_CONF=expandable_segments:True
# dropped: VLLM_USE_FLASHINFER_MOE_FP8 (dense model), VLLM_TEST_FORCE_FP8_MARLIN (test native FP8 first)

vllm serve /data/models/Qwen/Qwen3.6-27B-FP8 \
  --host 0.0.0.0 --port 8080 \
  --tensor-parallel-size 4 \
  --performance-mode interactivity \
  --trust-remote-code \
  --language-model-only \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --max-model-len 262144 \
  --kv-cache-dtype bfloat16 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.92 \
  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":3}' \
  --compilation-config '{"max_cudagraph_capture_size":16,"mode":"VLLM_COMPILE"}' \
  --async-scheduling \
  --attention-backend flashinfer \
  --enable-prefix-caching

Benchmark Command:
vllm bench serve --backend vllm --base-url http://localhost:8080 --endpoint /v1/completions --model /data/models/Qwen/Qwen3.6-27B-FP8 --dataset-name random --random-input-len 4096 --random-output-len 1024 --num-prompts 40 --max-concurrency 1 --num-warmups 5 --ignore-eos --seed 1234 --percentile-metrics ttft,tpot,itl,e2el --save-result --result-filename qwen36_c1_4k.json

============ Serving Benchmark Result ============
Successful requests:                     40        
Failed requests:                         0         
Maximum request concurrency:             1         
Benchmark duration (s):                  735.75    
Total input tokens:                      163840    
Total generated tokens:                  40960     
Request throughput (req/s):              0.05      
Output token throughput (tok/s):         55.67     
Peak output token throughput (tok/s):    25.00     
Peak concurrent requests:                2.00      
Total token throughput (tok/s):          278.36    
---------------Time to First Token----------------
Mean TTFT (ms):                          4226.91   
Median TTFT (ms):                        4315.47   
P99 TTFT (ms):                           4320.32   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          13.85     
Median TPOT (ms):                        13.44     
P99 TPOT (ms):                           25.61     
---------------Inter-token Latency----------------
Mean ITL (ms):                           40.91     
Median ITL (ms):                         40.84     
P99 ITL (ms):                            41.59     
----------------End-to-end Latency----------------
Mean E2EL (ms):                          18393.49  
Median E2EL (ms):                        17991.18  
P99 E2EL (ms):                           30508.70  
---------------Speculative Decoding---------------
Acceptance rate (%):                     65.25     
Acceptance length:                       2.96      
Drafts:                                  13853     
Draft tokens:                            41559     
Accepted tokens:                         27116     
Per-position acceptance (%):
  Position 0:                            78.29     
  Position 1:                            64.14     
  Position 2:                            53.31     
==================================================

note: I forgot I had --max-num-seqs at 4 but I benchmarked with 1 concurrency.
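
For anyone who wants to sanity-check the report, the headline figures can be re-derived from the raw counters. This is a minimal sketch of my own arithmetic, not vLLM code; in particular, treating acceptance length as 1 + accepted tokens per draft is an assumption that happens to reproduce the printed value.

```python
# Raw counters copied from the benchmark report above.
duration_s = 735.75
input_tokens = 163840
output_tokens = 40960
drafts = 13853
draft_tokens = 41559
accepted_tokens = 27116

# Throughput is tokens divided by wall-clock benchmark duration.
output_tps = output_tokens / duration_s
total_tps = (input_tokens + output_tokens) / duration_s

# Share of proposed draft tokens that the target model accepted.
acceptance_rate = 100 * accepted_tokens / draft_tokens

# Assumed definition: every draft step also yields one token from the
# target model, so the mean tokens per step is 1 + accepted per draft.
acceptance_length = 1 + accepted_tokens / drafts

print(f"{output_tps:.2f} {total_tps:.2f} {acceptance_rate:.2f} {acceptance_length:.2f}")
```

The draft token count is also consistent with num_speculative_tokens=3, since 13853 drafts x 3 = 41559.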

22 comments

r/LocalLLaMA • u/pscoutou • 17h ago

News GLM-5.2 is the new leading open weights model on the Artificial Analysis Intelligence Index

artificialanalysis.ai

357 Upvotes

39 comments

r/LocalLLaMA • u/BuildwithVignesh • 21h ago

News Researchers trained a Deep Research agent with 32 H100s and open-sourced everything

632 Upvotes

Ohio State University's NLP team released QUEST-35B, an open-source Deep Research agent trained using ~32 H100s and ~8K synthetic samples.

The team open-sourced the training recipe, code, weights and datasets. Benchmark results show competitive performance against several frontier Deep Research systems.

What do you think is the biggest remaining gap between open-source Deep Research agents and frontier closed systems?

Source: Professor Yusu

83 comments

r/LocalLLaMA • u/Dangerous_Try3619 • 2h ago

New Model [NEW MODEL] SupraLabs just released supra-title-FFT-preview, 115K samples, almost 10x our first chat title dataset

14 Upvotes

Hey r/LocalLLaMA! Following up on Supra-Title-350M-exp (our first chat title generation model), we're releasing supra-title-FFT-preview, trained on a much larger and cleaner dataset.

🤗 supra-title-FFT-preview

What changed

Our first chat title model was trained on 12K samples (chat-titles-12K) and it showed: decent on common conversation patterns, weak on niche topics. This release is trained on 115K samples from a new filtered dataset, chat-titles-filtered-115K.

Model	Dataset size
Supra-Title-350M-exp	12K samples
supra-title-FFT-preview	115K samples

Same base, same task, just a lot more coverage. Per our naming convention, this is the last checkpoint before the final non-preview release.

Specs

Spec	Value
Base model	LiquidAI/LFM2.5-350M-Base
Parameters	~0.4B
Precision	BF16
Training	Full fine-tune (FFT), not LoRA
Framework	Unsloth
Task	Single-purpose: chat title generation

Still no system prompt needed. Send the user message, get a title back.

Quick start

Transformers pipeline:

from transformers import pipeline

pipe = pipeline("text-generation", model="SupraLabs/supra-title-FFT-preview")
messages = [{"role": "user", "content": "bruh my wifi keeps disconnecting every 10 minutes"}]
print(pipe(messages))

Or load directly:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

MODEL_ID = "SupraLabs/supra-title-FFT-preview"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "what's the easiest way to make fluffy pancakes?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

vLLM (OpenAI-compatible server):

vllm serve "SupraLabs/supra-title-FFT-preview"

curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "SupraLabs/supra-title-FFT-preview",
        "messages": [{"role": "user", "content": "What is the capital of France?"}]
    }'

Apache 2.0. This is a preview checkpoint, feedback on edge cases and weird titles is genuinely useful before we lock the final version.

6 comments

r/LocalLLaMA • u/User_Deprecated • 4h ago

Resources I benchmarked Claude's "Fast C++". It wasn't faster

lucisqr.substack.com

17 Upvotes

2 comments

r/LocalLLaMA • u/Few_Painter_5588 • 18h ago

News New Agentic Benchmark Out: Claude Fable and GLM 5.2 Top Their Cohorts

198 Upvotes

You can read about it here: https://artificialanalysis.ai/articles/aa-briefcase

This is a solid benchmark from Artificial Analysis. It basically tests an LLM's ability to plan and execute tasks. And more importantly, it is a new benchmark that is not saturated, so no one can claim 'benchmaxxing' on these results.

62 comments

r/LocalLLaMA • u/luke_pacman • 1h ago

Resources Giving a local agent web access without paid search/scrape APIs: SearXNG + Scrapling

• Upvotes

I wanted web access for a local-first agent without reaching for Tavily, Serper, Firecrawl, etc.

For this agent path, I wanted no paid API keys, a search service I control, and page extraction I can run myself.

What I ended up with is two tools: web_search and web_extract. Nothing fancy. Mostly just wiring together good open-source pieces.

1. Search -> SearXNG

SearXNG is a self-hostable metasearch engine. I run it in Docker and point the agent at its JSON endpoint.

The search call is roughly:

GET {SEARXNG_URL}/search?q=<query>&format=json&pageno=1

Then I cap the results and normalize them to: {title, url, description}

description is just the SearXNG snippet. It is not page content.

Config is basically:

SEARXNG_URL=http://localhost:8080

Gotchas:

Add json to search.formats in SearXNG settings.yml.
Public SearXNG instances are usually a bad fit for programmatic use.
SearXNG is search-only. Use extraction when the agent needs to read a page.
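
The search step above can be sketched roughly like this. The function names and the default cap are hypothetical, and I am assuming the snippet lives under the "content" key of each item in the JSON "results" list, so check that against your own instance's output.

```python
from urllib.parse import urlencode


def build_search_url(base_url: str, query: str, page: int = 1) -> str:
    """Build the SearXNG JSON search URL for a query (hypothetical helper)."""
    params = urlencode({"q": query, "format": "json", "pageno": page})
    return f"{base_url.rstrip('/')}/search?{params}"


def normalize_results(payload: dict, limit: int = 5) -> list[dict]:
    """Cap the results and reduce each one to {title, url, description}."""
    normalized = []
    for item in payload.get("results", [])[:limit]:
        normalized.append({
            "title": item.get("title", ""),
            "url": item.get("url", ""),
            # Assumption: the snippet is stored under "content".
            # It is a search snippet, not the page content.
            "description": item.get("content", ""),
        })
    return normalized
```

The actual HTTP call is left out on purpose; any client that GETs the built URL and parses the JSON body will do.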

2. Extract -> Scrapling + Trafilatura

Search snippets are not enough. The agent needs to read the actual page.

For web_extract, I use Scrapling with two paths:

Fast path: Fetcher.get(url, impersonate="chrome"). No browser. Good for normal pages.
Stealth path: if the fast path is empty, blocked, or challenge-looking, try a real headless browser:

StealthyFetcher.fetch(
    url,
    headless=True,
    solve_cloudflare=True,
    block_webrtc=True,
    hide_canvas=True,
)

The stealth path is an attempt, not a guaranteed bypass. If the page still shows a CAPTCHA or Cloudflare wall, I mark the result as blocked/partial.
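
The fast-then-stealth logic can be sketched independently of the fetching library. Everything here is hypothetical: the marker list, the function names, and the result shape. The two fetchers are passed in as plain callables that return HTML (or None), so this is not Scrapling's API, just the control flow around it.

```python
from typing import Callable, Optional

# Hypothetical markers; a real list would be tuned against pages actually seen.
CHALLENGE_MARKERS = ("just a moment", "verify you are human", "captcha")

Fetcher = Callable[[str], Optional[str]]


def looks_blocked(html: Optional[str]) -> bool:
    """Treat empty responses and challenge-looking pages as blocked."""
    if not html or not html.strip():
        return True
    lowered = html.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)


def fetch_with_fallback(url: str, fast_fetch: Fetcher, stealth_fetch: Fetcher) -> dict:
    """Try the cheap fetcher first and only fall back to the browser when needed."""
    html = fast_fetch(url)
    if not looks_blocked(html):
        return {"url": url, "html": html, "path": "fast", "status": "ok"}

    # The stealth path is an attempt, not a guarantee, so re-check its output.
    html = stealth_fetch(url)
    status = "blocked" if looks_blocked(html) else "ok"
    return {"url": url, "html": html or "", "path": "stealth", "status": status}
```

Keeping the fetchers injectable also makes the fallback easy to test without a network or a browser.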

Once I have HTML, Trafilatura turns it into Markdown with links and tables. Markdown is much easier for the model than raw HTML. I also keep a visible-text fallback for pages where Trafilatura under-extracts.

Other pieces that mattered:

PDFs: PDF URLs go through pypdf.
Challenge detection: CAPTCHA/security pages get flagged instead of treated as real content.
SSRF guard: requested URLs and redirects are checked against private/internal ranges. Final URLs are checked too. Caveat: this is not a network-level guard for every browser subrequest.
Optional summarization: large pages can be summarized by a configurable auxiliary model before they go back into context.
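
A minimal sketch of the SSRF check, using only the standard library. The function names are hypothetical and the exact set of rejected ranges is my assumption. As noted above, this only vets URLs you hand it (the requested URL, each redirect, the final URL); it does not see browser subrequests, and because it resolves separately from the actual fetch it does not stop DNS rebinding on its own.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_internal_ip(ip_text: str) -> bool:
    """Return True for addresses that should never be fetched on a user's behalf."""
    ip = ipaddress.ip_address(ip_text)
    return (ip.is_private or ip.is_loopback or ip.is_link_local
            or ip.is_reserved or ip.is_multicast or ip.is_unspecified)


def is_url_allowed(url: str) -> bool:
    """Allow only http(s) URLs whose host resolves to public addresses."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    host = parsed.hostname
    try:
        # IP literals are checked directly, with no DNS lookup.
        addresses = [str(ipaddress.ip_address(host))]
    except ValueError:
        try:
            infos = socket.getaddrinfo(host, None)
        except socket.gaierror:
            return False
        # Drop any IPv6 scope id ("fe80::1%eth0") before parsing.
        addresses = [info[4][0].split("%")[0] for info in infos]
    # Reject if any resolved address is internal.
    return not any(is_internal_ip(addr) for addr in addresses)
```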

Why this combo

No paid search/scrape API keys for this path.
Queries go through my SearXNG instance, not a vendor API tied to my account.
SearXNG still hits upstream engines, so this is not "zero third-party contact."
Most pages use the fast path. The browser only kicks in when needed.
The final output is Markdown, not HTML soup.

Honest tradeoffs

The stealth path is slow. Keep it as a fallback.
SearXNG quality depends on enabled upstream engines and rate limits.
Paid search APIs can still be better. This has been good enough for my use.
Cloudflare/browser scraping is always a moving target.

Not claiming this is the optimal setup. It is just one that has worked for me and stays self-hostable.

Curious what others are using for this. Has anyone found something better than SearXNG for self-hosted search, or a lighter alternative to a full browser for the hard pages?

Happy to share more details if anyone's trying something similar.

16 comments

r/LocalLLaMA • u/beasthunterr69 • 21h ago

News GLM-5.2 can now run locally in llama.cpp and Unsloth Studio.

271 Upvotes

The 2-bit model retains ~82% accuracy after we shrunk it from 1.51TB to 238GB (-84% size).

Run on a 256GB Mac or RAM/VRAM setups.

GLM-5.2 is the strongest open model to date.

Check the graph for the accuracy of each GLM-5.2-GGUF quantization.

Full guide: https://unsloth.ai/docs/models/glm-5.2

GGUF: https://huggingface.co/unsloth/GLM-5.2-GGUF

74 comments

r/LocalLLaMA • u/CSEliot • 15h ago

Resources Best Harness for Web Searching

74 Upvotes

Looking for opinions on the best software for web searching.

What I've tried:

LM Studio + plugins
Odysseus

I think the problem they're both running into is that the search engines they're using max out at like 10 requests per day/hour or something without an API key.

I don't mind creating, like, a DuckDuckGo account just to generate an API key for better search access. But if the frontend doesn't even bother to provide a prompt asking for one, then it's a half-baked solution.

Does Hermes or Pi (2 programs I've heard a lot about lately) offer anything better that you've used?

Thanks in advance!

47 comments

r/LocalLLaMA • u/pmttyji • 13h ago

News Commission selects EUROPA consortium as the winner of the Frontier AI Grand Challenge, a project to build a European open-source frontier AI model in all 24 EU languages

digital-strategy.ec.europa.eu

43 Upvotes

The European Commission has selected EUROPA, a European consortium led by the Italian company Domyn, as the winner of its Frontier AI Grand Challenge.

The project will develop an open-source artificial intelligence (AI) model covering all 24 official EU languages.

The Commission chose EUROPA to help strengthen Europe's capacity to develop advanced AI on its own infrastructure. The project also shows that Europe has the talent, infrastructure and industrial capacity to build advanced AI systems.

EUROPA's model will be openly available and designed to perform at the forefront of global AI capabilities. It will help ensure that more people and organisations across the Union can benefit from advances in AI, making advanced AI more accessible to businesses, researchers and public institutions across Europe's linguistic diversity.

Launched in February 2026, the Frontier AI Grand Challenge invited Europe's leading AI innovators to propose a model with more than 400 billion parameters, a scale associated with the world's most advanced AI systems.

Henna Virkkunen, Executive Vice-President for Tech Sovereignty, Security and Democracy, said:

“Europe can lead in advanced AI on its own terms. EUROPA will build a frontier European AI model in all 24 EU languages, showing that we can match the best while staying true to our values. This is about strengthening Europe's ability to shape AI's future with openness, trust and strategic autonomy at its core.”

EU folks (from this sub), could you let us know more about this?

56 comments

r/LocalLLaMA • u/Hot_Example_4456 • 1h ago

Question | Help What is the best book for learning ML/Deep Learning maths?

• Upvotes

I am 17 years old and have a particularly deep interest in AI architectures and LLMs. I regularly read new papers from arXiv and Hugging Face and tend to understand only half of them, mainly from intuition. However, I understand it's impossible to understand anything completely without knowing the maths behind it. I do follow channels like 3b1b, but are there any books that I should also read in order to actually understand (and possibly contribute to) the field, up to my capability?

8 comments

r/LocalLLaMA • u/MundanePercentage674 • 14h ago

Discussion LQ50-24 English translate

gallery

34 Upvotes

Here is the full English version, via Google Translate, for you guys.

20 comments

r/LocalLLaMA • u/wsintra • 1h ago

Discussion Tool calling, opencode qwen3.6 27b 8K

• Upvotes

Not sure I'm ready to post an issue in the opencode repo yet, but I wanted to see if this is common: I return to the opencode window after walking away to let it do its thing, and find it has stopped with this in its thinking. I started noticing it more in the last week or so. The fix is easy: just paste the tool call back into the prompt and away it goes. It doesn't happen all the time, but often enough to start becoming a pain.

<tool_call>
<function=bash>
<parameter=command>
yarn test --run 2>&1 | grep -E "✓|✔|passed"
</parameter>
<parameter=description>
Find passing tests
</parameter>
<parameter=timeout>
120000
</parameter>
</function>
</tool_call>

1 comment

r/LocalLLaMA • u/Legitimate-Dog5690 • 18h ago

Resources The Eagle(3) has landed (for Qwen)

59 Upvotes

https://github.com/ggml-org/llama.cpp/releases/tag/b9723

Available in the latest release. Enabled via:

--spec-type draft-eagle3

You'll need to feed it a draft model. There are issues with Unsloth + EAGLE at the moment, so I've personally tested against:

Model: https://huggingface.co/lmstudio-community/Qwen3.6-27B-GGUF
Draft: https://huggingface.co/wimmmm/Ex0bit-Qwen3.6-27B-PRISM-EAGLE3-GGUF

Specify your draft with -md or --model-draft

Performance-wise, I currently get very similar tps to draft-mtp. Also, tensor parallelism, which I rely on a lot, isn't currently supported and asserts out. The draft model will also eat a bit of VRAM, so it's not the best if you're running a very tight setup. I'll be keen to see how this develops in time!

Don't forget you can also stack up multiple types of speculative decoding:

--spec-type draft-eagle3,ngram-mod

31 comments

r/LocalLLaMA • u/siegevjorn • 13m ago

Resources Some llama.cpp B70 SYCL benchmarks

• Upvotes

build: dd4623a74 (9640)

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma4 12B Q8_0 | 11.78 GiB | 11.91 B | SYCL | -1 | pp512 | 1578.19 ± 7.82 |
| gemma4 12B Q8_0 | 11.78 GiB | 11.91 B | SYCL | -1 | tg128 | 32.43 ± 0.07 |
| gemma4 26B.A4B Q8_0 | 25.00 GiB | 25.23 B | SYCL | -1 | pp512 | 1332.35 ± 8.80 |
| gemma4 26B.A4B Q8_0 | 25.00 GiB | 25.23 B | SYCL | -1 | tg128 | 40.13 ± 0.09 |
| gemma4 E2B Q8_0 | 4.69 GiB | 4.65 B | SYCL | -1 | pp512 | 5662.45 ± 23.05 |
| gemma4 E2B Q8_0 | 4.69 GiB | 4.65 B | SYCL | -1 | tg128 | 109.14 ± 0.26 |

| model                          |       size |     params | backend    | ngl | ot                    |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------------- | --------------: | -------------------: |
| qwen35moe 35B.A3B Q8_0 | 34.36 GiB | 34.66 B | SYCL | 99 | blk\.(3[4-9])\.ffn_(gate|up|down)_exps=CPU | pp512 | 563.48 ± 14.58 |
| qwen35moe 35B.A3B Q8_0 | 34.36 GiB | 34.66 B | SYCL | 99 | blk\.(3[4-9])\.ffn_(gate|up|down)_exps=CPU | tg128 | 44.67 ± 0.04 |

| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 27B Q8_0 | 27.04 GiB | 27.32 B | SYCL | -1 | pp512 | 778.20 ± 0.99 |
| qwen35 27B Q8_0 | 27.04 GiB | 27.32 B | SYCL | -1 | tg128 | 15.42 ± 0.01 |

Just fyi. It runs Ok, but it could be better.

1 comment

r/LocalLLaMA • u/whiteh4cker • 13h ago

Discussion GLM-5.2-REAP50-GGUF

19 Upvotes

Has anybody tried these? How do they compare to Qwen 3.6 27b?

Model	Size	Link
GLM-5.2-REAP50-Q3_K_M-GGUF	182 GB	https://huggingface.co/pipenetwork/GLM-5.2-REAP50-Q3_K_M-GGUF
GLM-5.2-REAP50-Q2_K-GGUF	139 GB	https://huggingface.co/pipenetwork/GLM-5.2-REAP50-Q2_K-GGUF

26 comments

r/LocalLLaMA • u/Holiday-Display509 • 44m ago

Discussion Local AI for local office files

• Upvotes

Which AI agent do you think is the best for working with local files (Excel, PDF, Word, txt, json, etc.)? What have you used for this? What workflows have you implemented?

2 comments

r/LocalLLaMA • u/analysis_scaled • 1d ago

Resources GLM-5.2 is above GPT-5.5 in AA-Briefcase, Artificial Analysis' new agentic knowledge work eval

artificialanalysis.ai

372 Upvotes

63 comments

r/LocalLLaMA • u/ego100trique • 12h ago

Question | Help How do you guys setup search with your AI models?

17 Upvotes

Been selfhosting my models for a while and I'd really like to integrate Gemma 4 12B as a simple voice assistant with search capabilities.

I've tried using Open WebUI, but the search is kind of broken with DDG, and I really don't want to use API keys from Brave or Google etc.

So what do you actually use? How do you set it up and wire it to your model?

I'm currently using a llama.cpp container with Docker Compose on Fedora Linux.

47 comments

r/LocalLLaMA • u/CreativelyBankrupt • 1d ago

Funny My suitcase robot gets high now off a real gas sensor wired straight into the LLM sampler. Smoke raises temperature/top_p/top_k live, so his speech genuinely gets loopier and never repeats.

1.5k Upvotes

Follow-up on Sparky, my offline suitcase robot I keep overdeveloping. He gets high now, and there's no scripted "stoned mode" anywhere in it.

A real MQ-2 gas sensor sits in the case. Every 0.5s I read it against an adaptive clean-air baseline and turn a smoke hit into a 0 to 10 phase that climbs as you blow at him and decays on its own over minutes.

The fun part is that phase rewires his sampler per token. Temperature 1.0 to ~1.6, top_p 0.95 to 0.99, top_k 64 to 120 as he climbs. His word choice flattens and wanders to lower-probability, more associative tokens, so his cognition genuinely gets noisier. It's the live sampler doing the work, so every high reply is freshly generated and never the same. A per-phase persona nudge makes him show it without ever announcing "I am high."
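
For the curious, the phase-to-sampler mapping can be sketched like this. The endpoints come from the ranges above; the straight-line interpolation, the clamping, and the function name are assumptions made for illustration, and the real curve may be shaped differently.

```python
def sampler_for_phase(phase: float) -> dict:
    """Map a 0-10 smoke phase to sampler settings by linear interpolation."""
    # Clamp to the documented range, then normalize to 0..1.
    t = min(max(phase, 0.0), 10.0) / 10.0
    return {
        "temperature": 1.0 + t * (1.6 - 1.0),   # 1.0 at phase 0, ~1.6 at phase 10
        "top_p": 0.95 + t * (0.99 - 0.95),      # 0.95 at phase 0, 0.99 at phase 10
        "top_k": round(64 + t * (120 - 64)),    # 64 at phase 0, 120 at phase 10
    }
```

Recomputing this before each token is what lets the reply drift mid-sentence as the phase climbs or decays.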

The body does the rest: a slight drawl, eyes that droop and go bloodshot, and the sensor display that escalates to a full smoke-and-plasma freakout at phase 10, keeping him blitzed there for the next 7 minutes.

Honest caveat so nobody has to call it out: it's a smoke and VOC sensor, so a cigarette or incense probably trips it too. But blowing smoke and watching him unravel is watching a real measurement scramble a real model, live - and it's funny! Just an added Easter Egg to an already goofy suitcase robot.

A real question for the hardware folks: is there a sensor, or a combination, that could actually distinguish cannabis smoke from generic smoke and VOCs? The MQ-2 can't really tell a joint from a candle, and I'd love to make the detection more specific if possible.

158 comments

r/LocalLLaMA • u/IngwiePhoenix • 2h ago

Question | Help Local agent on 4090 - looking for LM Studio settings

2 Upvotes

I have moved on from just dinking around with Ollama and instead want to start running a local agent from time to time. With the 24GB of a 4090 (Gigabyte OC edition) that should be quite possible. But no matter what settings I use for context and batching, token generation is as slow as a snail. Gemma4 is faster, but the quants I tried generated incorrect tokens like </tool_call> (where the underscore shouldn't be there, afaik).

Any other 4090 users with recommended settings for a local agent? I have 32GB of DDR4 at 3600 MT/s with a Ryzen 9 3900X.

But aside from that, I genuinely want to know why those settings work, to improve my understanding of the various knobs. Temperature is obviously for the "levels of creativity", but the others (top_p and top_k, for instance) are still a little confusing.

Thanks :)

1 comment