Another fun problem I dealt with was when I was moving my employer's codebase from Subversion to Mercurial version control ages ago. Everything looked good, except a directory named CVS (after the pharmacy, a customer) was missing. I was banging my head on the table before realizing that the default .hgignore file instructed Mercurial to ignore all contents of .*/CVS (CVS being another old version control system).
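Something along these lines, as an illustration (not necessarily the exact pattern that shipped):

    # .hgignore, regexp syntax
    syntax: regexp
    (^|/)CVS($|/)

Any directory named CVS, anywhere in the tree, silently excluded.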
Forcing the loading of the floating point libraries fixed it, but it took months to track it down.
It turned out to be an optimizing compiler for which HP hadn't set the options properly in their make setup.
_____
[That's like] Me, with zero C/C++ experience, being asked to figure out why the newer version of the Linux kernel is randomly crash-panicking after getting cross-compiled for a custom hardware box.
("He's familiar with the the build-system scripts, so he can see what changed.")
-----
I spent weeks testing slightly different code versions, different compile settings, different kconfig options, knocking out particular drivers, waiting for recompiles and walking back and forth to reboot the machine, and generally puzzling over extremely obscure and shifting error traces... And guess what? The new kernel was fine.
What was not fine were some long-standing hexadecimal arguments to the hypervisor, which had been memory-corrupting a spot in all kernels we'd ever loaded. It just happened to be that the newer compiles shifted bytes around so that something very important was in the blast zone.
Anyway, that's how 3 weeks of frustrating work can turn into a 2-character change.
C/C++ certainly gives you enough rope to shoot your foot off in the most unexpected places. That one took a heck of a long time to solve.
Had to create a stress test to reproduce it in minutes rather than days. Then trace code paths through timers and serial events to find the problematic path. It turned out there were many: a timer interrupt callback could cancel the interrupt, reschedule the timer, change the interval, or cancel and then reschedule, all in the presence of other channel interrupts occurring and overlapping unpredictably. Timers got rescheduled for intervals that had already passed by the time the callback completed. And on and on.
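Roughly the shape of one of those hazards, in sketch form (hypothetical names, nothing from the actual driver): a callback computes a deadline, the callback itself runs longer than the interval, and the naive path arms a timer for a time that has already gone by.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical stand-ins for the real driver's clock and timer hardware. */
    static uint64_t fake_now_us = 1000;
    static uint64_t now_us(void) { return fake_now_us; }
    static void hw_arm_timer(uint64_t deadline_us) {
        printf("armed for t=%llu (now t=%llu)\n",
               (unsigned long long)deadline_us,
               (unsigned long long)fake_now_us);
    }

    /* Naive reschedule: trusts a deadline computed before the callback ran. */
    static void reschedule_naive(uint64_t deadline_us) {
        hw_arm_timer(deadline_us);      /* may already be in the past */
    }

    /* Defensive reschedule: re-check "now" at the last possible moment, so an
       interval that elapsed while the callback ran still fires promptly. */
    static void reschedule_checked(uint64_t deadline_us) {
        uint64_t now = now_us();
        if (deadline_us <= now)
            deadline_us = now;          /* fire immediately instead of never */
        hw_arm_timer(deadline_us);
    }

    int main(void) {
        uint64_t deadline = now_us() + 500;  /* callback wants "now + 500us"   */
        fake_now_us += 800;                  /* ...but the callback took 800us */
        reschedule_naive(deadline);          /* arms a timer in the past       */
        reschedule_checked(deadline);        /* clamps to the current time     */
        return 0;
    }

The checked version is one small instance of the kind of bullet-proofing involved; the real paths also had to survive cancellation and overlapping channel interrupts.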
Took a weekend alone with the code and a set of machines, desk-time getting my head around it all, then coding bullet-proof paths for all calls and callbacks for every related system call.
Once it worked, it worked for days and then months under test. Nothing is so hard that it can resist a methodical approach.
When the bugs near flaws... I'll burn it, you, and myself down
Very old examples that live in my head, rent-free:
Error - Error.
Something is wrong.
---
In an environment I worked in, multichannel audio recordings were archived. The archival recordings all had a perfect 4kHz tone appearing, seemingly out of nowhere. This was happening on every channel, across every room, but only in one building. Nowhere else. Absolutely nothing of the sort showed up on live monitoring. The systems across all sites were the same, and yet this behaviour showed up consistently across all systems at only one location.
The full system was reviewed: processing, recording, signal distribution, audio capture, and the rooms themselves. Maybe a test generator had accidentally been deployed? Nope. Some odd bug in an echo canceller? Also no. Something weird with interference from lighting or power? Slim chance, but also no. Complete mystery.
While looking for acoustic sources, an odd little blip turned up on the RTA at 20kHz. This was traced back to a test tone emitted by the fire safety system (an ultrasonic signal used for continuous monitoring). It's inaudible to most people and gets filtered out before any voice-to-text processing, so no reason for concern. But 20kHz is nowhere near 4kHz, so the search continued.
The dissimilarity of 20kHz and 4kHz holds true, until you consider what happens to a signal that isn't band-limited. The initial capture was taking place at a 48kHz sampling rate. It turned out the archival system was downsampling to 24kHz without applying an anti-aliasing filter. Without filtering, any frequency content above the Nyquist frequency 'folds' back over the reproducible range. So in this case a clean signal with 24kHz of bandwidth, plus a little bit of inaudible ultrasonic background noise, was being folded about 12kHz, turning the 20kHz tone into a very audible 4kHz one. It was essentially a capture-the-flag for signals nerds and a whole lot of fun to trace.
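The arithmetic is the satisfying part: after the 2:1 decimation the Nyquist frequency is 12kHz, and anything above it reflects back down. A tiny sketch of the folding (ordinary aliasing math, not anyone's actual pipeline):

    #include <math.h>
    #include <stdio.h>

    /* Fold a frequency into the representable range [0, fs/2] for a given
       sample rate fs -- i.e. what happens when you downsample with no
       anti-aliasing filter in front. */
    static double folded(double f_hz, double fs_hz) {
        f_hz = fmod(f_hz, fs_hz);
        if (f_hz > fs_hz / 2.0)
            f_hz = fs_hz - f_hz;
        return f_hz;
    }

    int main(void) {
        /* 20kHz monitoring tone, archive resampled to 24kHz:
           Nyquist is 12kHz, so the tone folds to 24000 - 20000 = 4000 Hz. */
        printf("%.0f Hz\n", folded(20000.0, 24000.0));
        return 0;
    }

folded(20000, 24000) comes out to exactly 4000, which is why the archived tone was such a clean 4kHz.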
We got a bug report that an app would crash, and I couldn't reproduce it. So we asked the user, are you using the latest version of Wine from our website? "Yes I am". OK, that's odd, send us some logs then. The crash was some sort of memory corruption during startup of the app. Everything seemed to be running fine, the app was loading files and reading registry entries happily, and then suddenly it would segfault in a random place. No opportunity to debug directly, as everything was binary only and only crashing on this guy's machine.
I spent days working painstakingly through hundreds of millions of lines of API call traces, until eventually I found what seemed to be a difference between his logs and mine. In his logs, some registry reads were failing, and in mine they worked. But why?
It turned out that the guy had been lying to us. He hadn't actually installed the app using the Wine downloads from winehq.org; he'd installed it from the Debian repositories. The packages provided by Debian were badly broken: they had split various tools out into a separate -utils package which wasn't installed by default, because that complied better with Debian standards. But that was an error, because Windows doesn't care about Debian standards and those tools aren't optional there, so many programs assumed those tools were always available.

One of them was regedit.exe, which this app's installer was running with some flags to add default registry entries. On Windows this would never fail, so the installer didn't check the error codes and the install failure was silent. And then the app didn't check the error codes when reading the entries either, because again, that would never fail on Windows. So the reads silently did nothing, the memory the app expected to be initialized wasn't, it tried to use it and corrupted its heap, which then led to a random crash about a million API calls away. The original failure wasn't even in the logs I was looking at.
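The failure mode, boiled down to a sketch (hypothetical key and value names, not the real app's), is just a chain of unchecked calls that "can never fail" on Windows:

    /* Link against advapi32. Deliberately buggy: this is the pattern, not a fix. */
    #include <windows.h>
    #include <stdio.h>

    int main(void) {
        HKEY key;
        char install_dir[MAX_PATH];     /* never zeroed, on purpose */
        DWORD size = sizeof(install_dir);

        /* Return values ignored -- "this never fails on Windows". If the
           installer's regedit.exe run silently failed, neither the key nor
           the value exists, and both calls below fail too. */
        RegOpenKeyExA(HKEY_CURRENT_USER, "Software\\ExampleApp", 0,
                      KEY_READ, &key);
        RegQueryValueExA(key, "InstallDir", NULL, NULL,
                         (BYTE *)install_dir, &size);

        /* If either call failed, install_dir is uninitialized stack garbage,
           and everything built on it downstream is corrupt. */
        printf("installing to %s\n", install_dir);
        return 0;
    }

With a working regedit.exe the value exists and all of this is harmless; without it, the program quietly runs on garbage and falls over somewhere far away from the actual failure.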
At the time we had an explicit policy of not supporting anyone who installed Wine from their distribution packages, exactly because of bugs like this. Instead the project provided its own apt repositories. The distro-centric model Linux used was just broken, because it led to packagers who weren't a part of the upstream communities "fixing" software they didn't understand as they packaged it. The notorious SSH bug was another case of that, but such stories are commonplace. Debian users in particular were hard to deal with because the large set of community-built packages was part of the distro's appeal and moat, even though upstream developers often hated it (lots of obsolete bug reports or distro-created bugs). So they had become defensive, and some had taken to deceiving upstreams when filing bugs because they thought they knew better.
Needless to say, a multi-day memory corruption debugging session that ended with "there is no bug, follow the install instructions on our website and stop lying to us about it" was by far the most annoying bug I ever had to work on.