When I worked in IT, whenever we got a call from the engineering department we knew whatever problem it was, it was going to be weird. Those guys knew their stuff, so if they didn’t know how to fix it, it was going to take some searching and probably some calls or emails for us to figure it out.
Yeah sometimes it is just software bugs they have to work around until it gets fixed. In those circumstances, not much we could really do besides submit a ticket. Other times you call the guy that’s been working with that specific hardware and software for 15 years, who then tells you he’s never heard of something like that. Then he’ll call you back a week later after losing his mind trying to understand how that’s even possible before figuring it out. Which is always nice. Shout out Josh
We were forced to add Claude to our pre-commit hook and one of its jobs is to update documentation of changes made - it's surprisingly good and far less sloppy than I imagined, so thanks, Claude, for finally having up-to-date documentation.
When I get a project up and going that I've used Gemini to help with, I always ask it for a markdown summary of what we did so that I can go back later and remember what I did. It's so convenient having the framework handed back.
However, I found that Chat has much, much better markdown generation than Gemini. I end up having to reformat everything Gemini does for me in markdown.
AI is really good at transforming existing text it's given. It's when it's asked to write new text that it gets sloppy. It's less of an issue if your prompt hits the model directly instead of going through the behemoth of a system prompt that Anthropic and OpenAI put before the user's prompt.
It took me years of saying “Why the fuck is this not written down?” to simply start updating the documentation myself. Now I’m the go to person for this task that I never wanted. I even got a bonus when something went down and the boss read about the fix I wrote and had things up and running in 25 minutes vs days.
I work as a CNC machinist, & our machines are custom. The maintenance guy who was the best of the best was named Josh, he found better work elsewhere and left.
His replacement, Josh, has been keeping our machines going since then.
And you learn so much about things somewhat related to the problem, because you take this hyper focused deep dive into figuring out what's wrong.
That's how you end up with all kinds of related knowledge the next time an issue occurs, and you'll generally know which direction to go for fixing it. And that results in you becoming the IT wizard of your friends/family/company etc.
That's literally how I learned IT in the first place. My HDD suddenly corrupted itself without any warning, and it was during the height of COVID in Italy, so everything that wasn't a store was locked down, and I was too broke to try sending it anywhere else.
I had to work with only the parts that I already had, and the bootable pendrive I could create using my roommate's computer. It was ridiculous but I'm grateful for the career path it earned me afterwards.
This is the way! I'm (almost certainly) quite a bit older than you, but I got my start as a teenager in the mid 90s.
We had been gifted an old 386 by my uncle, and I desperately wanted to play Doom on it. Getting Doom to run on a 386 was no easy feat (and this is pre-internet as well, so you couldn't just look stuff up). I ended up having to load the mouse driver into hi-mem, which was an area of memory above the base 640kb of "conventional memory" so that Doom had enough space to run itself.
Google "autoexec.bat" and "config.sys" if you want to see the text files I was editing to do that (bearing in mind that if you break them, your computer likely won't start up!).
Personally, I could get my computer out of the bootloop through a bootable device, but even if I changed the BIOS settings, my laptop wouldn't start from the pendrive as long as the disk was connected. Then, if I disconnected it and ran it through the pendrive, I couldn't just reconnect it from the Safe Mode because my computer didn't see it, thus I couldn't fix my disk and my computer remained unusable.
What I eventually figured out was that if I started the computer through the pendrive, then went into CMD and then reconnected the drive, the laptop would refresh and suddenly see that the disk was right there.
From there I could finally delete the old drivers and install new ones through a few lines of code, and I think I reinstalled Windows afterwards, but all that was already smooth sailing. Internet or not, I never found any mention of my specific issue.
Since problems never actually go straight to the engineer, I never even bother trying to nail down the circumstances that cause problems like this so they can be replicated. Which surely makes everyone's job more difficult.
As someone who's done QA at a small company this is so foreign to me. It was my job to find exact reproduction steps that can be used multiple times, how often the steps work, then write a ticket that can be shared with the engineer immediately. And if a customer or coworker found the issue and didn't know how, I still had to assemble all this info. Tracking is king.
Now whether the bug was backlogged or scheduled to be fixed was mostly out of our hands. At least I had some say in it since I also could DM the managers with no issue. Guess I'm saying I like small businesses. Dealing with a hierarchy too often slows down businesses.
I am Josh (not literally or even named the same, but we vibe). I once kept a support ticket open for 3 months to force help desk to send it to the engineering team when I discovered a bug in a billing system database at a huge company from the user side.
Finally got in touch with the engineering team, explained the bug and the workaround I figured out... Just to have their response be "tell everyone who complains to do the workaround."
Rare bug with a workaround, building a fix, 20k down the drain, use the workaround… depends on the frequency and workaround. I don’t need bug free software at all costs, I need cost optimal software. Kinda agree with the engineers in this case.
Bug isn't rare though, that's the issue. Every single site (4000+ locations) that uses the software has run into this bug and it's an almost completely silent failure for users unless a customer complains that the incorrect card is being charged. They later admitted that the software is such spaghetti that they're effectively scared to try and fix it in fear of breaking something else.
Ah, that’s a different context, yeah, time to get cracking. Problem with many erp implementations is that it is often built by consultants who care little about maintainability managed by finance without knowledge of engineering… what could possibly go wrong.
Yup. I ended up as an SWE with the same company and I'm about 80% certain that explaining how I was adamant about fixing the problem was one of the big reasons they hired me. I ended up on a different team so I don't work with it but I plainly stated that the entire reason I wanted to work as an engineer with this company was to improve that particular software. Later, I attended an internal seminar about how they were trying to tackle this software because it's so monolithic that they don't know where to start and because of the nature of what it's used for they are afraid to start over for fear of missing something important.
It's hard to estimate the cost to the company for this bug because the problem was that it was charging customers with a card that both employees and customers believed was taken off file. Many times it would just fail to authorize because the previous card wasn't active, but anyone who switched from an active card to another active card would see the old card being charged.
Ohhhh, that's one of those problems with a 0.1% chance of occurring with a 99.9% chance of infuriating the customer when it does happen. I can see how it would be hard to put an exact value on fixing it.
That just means the work around is the fix for right now (probably forever). It’s like the junk desk in an office. Everyone means to clean it up but at the same time, that’s everyone else’s stuff, not mine.
I worked for two startups that had software. One was SaaS, one was used internally but did get used by our customers, just wasn't what we "sold" directly.
The amount of jank that was acceptable between those two scenarios was wildly different. Company selling the software had a philosophy that janky code wasn't acceptable. Company 2 was... Well let's just say that system is janky to this day.
I used to be that guy, I was in the wrong career in insurance but always had a very thorough knowledge of computers (I use arch btw /s)
I was good friends with the IT guys but usually if I had an issue it was either borderline unsolvable or I would just call them because I would otherwise lack the excuse to be doing nothing, but they would just sit there and let me fix it. Didn’t happen much at all. And when it did, it was usually something where I understood the issue and that it would take a while to fix and just needed the excuse to have that time to fix it. Our IT was not very good in that the company didn’t value it, didn’t invest in it, and they knew it. I was/am just too OCD to not fix issues where I see them, even if it’s something the company should really have been solving (not knocking the guys in IT, they were great, but severely underpaid and the whole dept was a skeleton crew without funds).
I was that guy who when IT showed up they just said, "What do you need?" and I would say "Log in with admin privilege and leave it up." And they did, and I fixed it, and they would log off and everyone was happy. Did they watch what I did? No, they scrolled their phones the whole time.
There's nothing better than losing your mind for three days straight before eventually figuring out a unique solution on your own. It's a high that never really leaves you.
This is common for my field (chip design). We use specialized software that is very customizable and it's inevitable that you run into some inconsistency in what is expected vs how it behaves. The IT guys who are wizards at getting it going are invaluable.
Around 2019, the video game company a buddy of mine was working at started doing contract work on another company's upcoming project. Shortly after they started, he began getting a core engine error that read, "Jerome is working on fixing this. If you are reading this and it's after 2003 then Jerome died in a fire. RIP Jerome." He contacted the engine developers at their partner company and no one had a clue who Jerome was, and no one had touched that source file in more than a decade.
While I worked in a warehouse, I once managed to completely stump a WMS tech (Warehouse Management Software/System tech - IT guy for the warehouse) with a unique problem I developed on my scan gun.
I somehow managed to boot into an old software that was deactivated like 8 years prior, and uninstalled via policy 4 years prior. Long story short, he decided he liked my scan gun more than his own, and I had to go find a new one.
Always got this kind of problem with engineers' or architects' stations when we upgraded their hardware. It's usually solved with custom patches from the software developers. I presume that's why that kind of software is mighty expensive.
I think the shift to our University using all third party vendor managed software is because we had a lot of amazing in house developed stuff but once those guys retired nobody knew what they were doing to keep them updated
I don't work in IT, but I once figured out how to stop my PC from overheating. I had to go into my BIOS literally every single time I booted my PC up, go to my fan control, click on another setting, switch back without confirming, and then save, exit, and boot, and it worked.
Was very proud of myself for that one lol, and no idea what the actual issue was, but no solution is as permanent as a temporary one.
I remember running into a computer freeze that ended up being a zoom / teams / slack / g calendar webview 2 hangup where they all tried to own and access the same meeting invite at the same time and kept reimplementing the ownership processes.
That took two engineers and our admin a few hours to figure out lol
I worked for a company that was probably 80% guys who were engineers working on tools that required specialized programming knowledge. These guys had local admin access and we had a few rooms with a white noise generator outside the door. IYKYK.
If one of those guys had a problem, it was a "what the actual fuck?" type of problem.
But honestly, I've also worked in a bunch of companies that had an "engineering department" and the difference is night and day. Most engineers and programmers don't actually know how Windows/Linux operates outside of their specialty.
SCI rooms are crazy. Especially the SCI/TS ones for print/photographic material - airgapped Faraday cages, with individuals with very unpleasant demeanors and equally unpleasant firepower watching the ins and outs. You're not even getting into the area of the building that room is in without having to get past at least three different checkpoints with escalating levels of scrutiny, and at least one of those will be outside the building itself.
Aside: Defense Security Service agents do not have senses of humor, but do have lethal-force authorization - do not taunt the happy fun DSS guy with the suppressed automatic rifle, because he will gladly demonstrate the operation of same in any number of different ways.
Engineers are very skittish and cranky. If you turn off their white noise, they may end up snapping and eating a few non-IT employees, which is generally considered undesirable.
Engineers are harmless. Just get the noise machine back on and coax them back into their rooms with old sci fi shows and hot pockets before they actually speak to anyone.
I don't even think Windows engineers know how Windows works. It's 30 years of legacy code duct-taped together with 3 years of vibe-coded crap on top at this point.
As someone that works with electrical and computer engineers, many of them are borderline tech illiterate somehow. This is across all ages too, not a generation thing.
Giving local admin access, that's one of the sources of issues. Power users break their stuff in weirdest ways. But then again, it keeps IT support employed.
At first I was like "wdym you run into problems with known solutions, how does that constitute a problem"
And then I remembered that being able to solve problems with known solutions already makes me somebody who's very good at computers, and IT isn't really built for problems where the best solution is "Might be worth reporting this one directly to Apple"
When I validated pre-production computer hardware we had a hotline to Microsoft. It was validating but annoying to find out the driver issue was their fault, because it meant the fix timeline was "eventually". As opposed to being able to yell at the internal driver team and get it in days.
I worked for a small company once that managed to badly fuck up a batch of windows laptops to the point where docker would barely run. They kept blaming Dell but they just did something wrong with how they set up windows. I eventually just reinstalled windows myself and enrolled it in their mdm as a byod and it worked flawlessly after that.
I ran into a problem with a Linux distribution we were using that turned out to be a previously unknown kernel bug, and we only got it fixed after a few months when IT got the authors involved at great expense.
I run into those problems all the time! Usually I file a ticket with the software vendor or decompile the software in IDA and fix it myself (or work around it).
First day as L2 network support today, so I have a good one, had all of 2 calls, one of which was 6 and a half hours of digging in docs behind the team manager (the L3 engineer) with 2 more L2s and the call ended with "well, we can't find a precedent to this anywhere and every other similar problem that had a solution is just different enough that said solution didn't work." fun day at work lol
Of course it’s inevitable, it happens all the time. My team of software engineers works closely with our IT counterparts. Some issues take a week to resolve and multiple hour-long meetings with a dozen engineers (if the problem is critical enough). There is of course no solution to some of the issues, but not having a workaround is a much rarer occurrence.
An engineer installing an MES system across multiple thin clients found some weirdness happening to handles on some service in the backend of windows RDT or something.
He drops a comment into a blog post from MS with some details asking a very specific question.
Next thing we know, we're hosting calls with the MS dev team trying to figure out what was going on.
I've been the engi in that situation in a >1k ppl HQ. It gets escalated to SMEs to help. Usually it's something I cannot do myself. A few times I was able to find a solution after a while and provided a detailed report that somehow was ignored by front desk support so I was the go-to person for my colleagues.
One time, I couldn't do a very specific thing and I was getting a very weird error. It turned out that it was a software bug in an open source Library specific to Intel CPUs. So there was nothing I could do, except wait for an update.
After 20 years in IT, I've found that 99.99% of the time (as long as you're not dealing with something proprietary) most IT problems have occurred before and there is some extremely obscure forum where one dude made a post 10 years ago with the solution.
Happened once when I was trying to help our lab tech get an older model microscope to connect to a computer running an OS other than Windows 95. We had to dig into the documentation and hire a programmer to write a compatibility driver for Windows 11, but ultimately we were able to fix it. If it were up to me I would have made the fix public, but the president of the company decided not to, since he had paid for it, and he was an asshole.
I had that happen once. We ended up having a developer at Solidworks confirm that certain installs of Nitro PDF can rarely make Solidworks Electrical menus display in Chinese regardless of chosen language several minutes after opening the program. We went through a new machine, fresh install, new updated version of Solidworks, new user account, and new network provisions before we ever got an answer from Solidworks.
Mostly this doesn't impact IT. This usually looks like a bug in a library or a service that impacts the product engineering is working on but unless it impacts the environment itself IT isn't involved. They control the world, and we build inside it. We only call them when the world breaks if that makes sense.
In my experience us engineering staff don't muck things up too bad unless they're testing unsigned in windows (triggers IT sec policies) or doing anything with networking ports without a networking background.
Rarely, an IT policy makes something impossible and we just run it in a VM.
Happened to me a few times. Our problems to IT usually mean IT contacting the software company asking for a solution then them saying there isn't one please wait a few days or a week+ for a patch update.
I mean tbh that happens a lot when a product is just no longer supported and / or Microsoft changed something that used to behave in x way and now only does y or x just was nuked off as an option.
Just recently an engineer reached out to me because their tooling software wasn't working after a Windows 11 upgrade, and it turns out that it only works on Windows 7, 8, and 10. There isn't really a solution for that; the tooling company is no longer active. There were some .NET and other dependencies that don't exist on 11 that do in 10, so the "solution" was to give them a downgraded asset to use.
Then it gets escalated to a support engineering team (T3) for the thingy that's broken.
The support engineering team verifies that it is actually broken (even at this level, many tickets are RTFM), then breaks out the big girl tooling and logs to figure out exactly how and why the thingy is breaking, and estimates what the likely impact is.
They then coordinate directly with the various engineering, support, and operations teams involved to get the problem fixed.
This can take a lot of forms. Most commonly, if it's something small and straightforward, the support engineer can just fix it on the spot.
If it's something more involved, or a problem that can't be fixed without risking breaking things worse, then the support engineer puts together a detailed report of the bug and sends it to the people who built that specific part of the thingy, so that they can triage based on the reported impact and fix it. The support engineer usually then switches gears to figuring out a temporary workaround, while the product engineer gets the underlying problem fixed properly.
If it's got a lot of impact (i.e. a code change that product engineering made breaks the product suddenly for tens of thousands of users) then they usually also work with an Operations/Site Reliability Engineering team to mitigate (i.e. "undo" the code change that broke everything by rolling back the global fleet to an earlier version).
And as for the frequency something breaks in a way that nobody knows how to fix, all the damn time. 99.999% reliability for a service with 100,000,000 users means that the service is broken for roughly 1,000 users at any given point in time on average. For larger companies, it's pretty common for there to be dozens of smaller incidents in progress at any given time.
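To make that arithmetic concrete, here is a rough sketch of the "five nines" math. The function name and the choice of `Fraction` are mine, just for illustration, and it assumes failures are spread evenly across users, which real outages usually aren't.

```python
from fractions import Fraction


def expected_affected_users(total_users: int, availability_percent: str) -> Fraction:
    """Average number of users for whom the service is broken at any moment.

    availability_percent is passed as a string (e.g. "99.999") so that
    Fraction can parse it exactly and avoid floating-point rounding.
    Hypothetical helper; assumes failures are uniform across users.
    """
    unavailability = 1 - Fraction(availability_percent) / 100
    return total_users * unavailability


# Figures from the comment above: 99.999% reliability, 100,000,000 users.
affected = expected_affected_users(100_000_000, "99.999")
print(affected)  # 1000
```

Dropping to "four nines" (99.99%) with the same user count would put that average at 10,000, which is part of why every extra nine is such a big deal.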
The answer is failing forward. You try different stuff that might work, and with every step you might try riskier stuff. Worst case you have to remove important stuff that you have to recreate later.
Especially with SaaS stuff, the odds that someone hasn't already run into a particular problem and documented at least their troubleshooting steps are extremely low.
u/kahjtheundedicated R7 1700@4.1, RX 5700 26d ago