When the gremlin won't come back: Chasing an intermittent EPROM fault across five programmers

Drawer overview of the cross-validation rig: GQ-4x4, Chicagoland Retro Tech Chip Tester Pro V2, BackBit chip tester, and Atari cartridge interface adapters

Part 2 of 2. Part 1 walked through the cross-validation workflow that established the chips were bit-perfect and my GQ-4x4's read path was the source of the bad verifies. This is the followup—what the bench experiments actually showed, why the original hypothesis didn't survive contact with the rematch, and what the null result taught me about working with intermittent EPROM read faults.

Part 1 ended with a diagnosis still pending. Four chips that the EMP-20 said were perfect had come back from the GQ-4x4 with verify errors—same offset, same wrong byte, same intermittent fingerprint. The bit-flip math pointed at the GQ-4x4's read path, not the chip. I closed the post promising experiments: three sequential reads of the same chip, USB power-delivery variations, a firmware revision check. If the read path was analog-marginal, the experiments should make it visible.

I came back to the bench expecting to see the gremlin. Instead I added two more programmers to the rig, ran sixty-seven fc /b comparisons across forty-five reads, and watched everything pass.

That isn't the conclusion I planned to write. But it's a more interesting one—and the lesson at the end is sharper than the one I would have landed with.

Where Part 1 left off

The short version: I'd burned twelve M27C512 EPROMs split across two programmers—six on Needham's EMP-20 (the parallel-port veteran from the late 1990s), six on my GQ-4x4 (the modern USB unit from MCUmall). The plan was a cross-validation pass: each chip read back on the other programmer. If both programmers were healthy, all twelve chips should pass both ways.

Eight passed clean. Four came back from the GQ-4x4 with verify errors at offset 0x000002: Device=0xA2, Buffer=0xA9. The EMP-20, hours earlier, had read those same four chips and said they were bit-perfect. One of the programmers was lying. Both were confident.

Part 1 walked through the diagnosis: bit-flip direction analysis on the disagreeing bytes showed mixed 1→0 and 0→1 errors in the same byte, which doesn't fit any single chip-level failure mode. An audit-trail extension using DOS fc /b against the original source file confirmed the chips were bit-perfect on a third, independent comparison path. The chips were fine. The GQ-4x4's read path, not its programming path, was the source of the bad reads. The 4-of-12 intermittent failure rate fit the fingerprint of an analog problem: marginal Vcc, marginal read timing, or an aging buffer IC somewhere on the GQ-4x4's PCB.

Part 2 was supposed to identify which one.
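
For readers who want to redo the bit-flip direction check by hand, here is a minimal Python sketch. It assumes the Buffer value (0xA9) is the expected byte and the Device value (0xA2) is what the programmer read back. The function name bit_flips is made up for illustration and isn't part of any programmer's software.

```python
def bit_flips(expected: int, observed: int) -> dict[str, list[int]]:
    """Classify each differing bit of one byte by direction (expected -> observed)."""
    diff = expected ^ observed  # set bits mark the positions that disagree
    flips = {"1->0": [], "0->1": []}
    for bit in range(8):
        if diff & (1 << bit):
            direction = "1->0" if expected & (1 << bit) else "0->1"
            flips[direction].append(bit)
    return flips


# Values from the Part 1 verify error at offset 0x000002
# (assumption: Buffer = expected, Device = observed).
print(bit_flips(expected=0xA9, observed=0xA2))
# {'1->0': [0, 3], '0->1': [1]}
```

Both directions show up inside the one byte, which is the mixed pattern the Part 1 argument leaned on.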

What changed before Part 2—two new programmers on the bench

Between Part 1 and Part 2, two pieces of new gear walked into the workshop. Neither was bought for the followup, but both ended up reshaping it.

The first was a second Needham EMP-20. I'd been operating on a single point of failure for the older EPROM family—one parallel-port programmer in the workshop, no backup if it died mid-build. The new unit is older hardware than my original (manufactured December 1993 versus the original's August 1999), shipped originally as a Rev. A, and was field-upgraded at some point in its life to Rev. H. So the bench now held two Needham EMP-20s of distinctly different vintages—a 1993 Rev. A→H upgrade and a 1999 Rev. K—both reading the same M27C512 family through the same DOS software on the same Toshiba.

Needham EMP-20 Rev. K serial 17845 manufactured 1999 — the original bench programmer
EMP-20 #1, the trusted veteran. Rev. K, serial 17845, manufactured August 1999.
Needham EMP-20 Rev. A upgraded to Rev. H serial 1213 manufactured 1993 — the second bench programmer
EMP-20 #2, originally shipped as Rev. A, manufactured December 1993, field-upgraded to Rev. H somewhere along the way.

The second was an XGECU T48. The T48 is the current-generation USB programmer from XGECU, the successor to their TL866 line, supporting tens of thousands of devices over a USB 2.0 high-speed link. It does the same job as the GQ-4x4 in a different package, with its own software (XGecu Pro) and its own read electronics. For Part 2 it became a third independent voice in the cross-validation rig.

And a realization about an existing piece of gear: the BackBit chip tester, designed by Evie Salomon and originally bought for system-level Atari 2600 cart testing, has a drop-in ZIF socket that will read a bare EPROM directly and return its checksum. No cart PCB required. That meant I could add a fifth independent reading source to the rig without building a cartridge.

So Part 2's plan stopped being three experiments on one programmer. It became a cross-validation matrix across five programmers—EMP-20 #1 (Rev. K), EMP-20 #2 (Rev. A→H), GQ-4x4, XGECU T48, and BackBit—with the GQ-4x4 still under the microscope as the Part 1 suspect.

A fresh test batch—burning M27C512 chips on two EMP-20s

The original twelve M27C512s from Part 1 are in finished diagnostic cartridges, on shelves, doing their job. Part 2 needed a fresh test batch—chips designed as experimental specimens, not production parts that happened to surface a bug.

I pulled five blanks from the same M27C512 sub-lot Part 1 had drawn from: STMicroelectronics ceramic DIPs from workshop stock, factory-sealed new old stock (NOS) of roughly 2009 vintage, Malaysian origin. I marked them with blue dot stickers numbered 81 through 85 so I could track individual chips through every experiment. Four would become test subjects; one stayed in reserve for the inevitable bench gremlin (bent pin, socket contamination, accidentally-erased-while-staring-at-the-UV-lamp).

The burn workflow on EMP-20 #1 (the Rev. K, the trusted veteran) was the same one Part 1 walked through, four keystrokes per chip: 2 → 1 → 3 → N. Blank check, program with Quick Pulse at Vcc=6.25V with inline verify, verify-after-burn at Vcc=5.00V (the operating voltage—the one that actually matters for cells that programmed weakly), then read the device checksum. Source file checksum: 008C2B66. Every chip's device checksum after burn: 008C2B66. Eject, reseat, run 3 → N again as the standard socket-gremlin check. All four chips through, all clean.

Then the audit trail extension I'd documented at the end of Part 1: read each chip back into the EMP-20's buffer (Option 4), save the buffer to disk as EMP1_B81.BIN through EMP1_B84.BIN (Option 9), and binary-compare each readback against the original source file at a DOS prompt:

fc /b EMP1_B81.BIN CRTDIAG.BIN → No differences
fc /b EMP1_B82.BIN CRTDIAG.BIN → No differences
fc /b EMP1_B83.BIN CRTDIAG.BIN → No differences
fc /b EMP1_B84.BIN CRTDIAG.BIN → No differences

Four chips, four matches against the source file. Same audit trail Part 1 established. So far, this is exactly what should happen—and exactly what Part 1's batch did until the GQ-4x4 walked in and disagreed.
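
Away from the DOS machine, the same byte-for-byte check can be approximated in a few lines of Python. This is a sketch, not a replacement for fc /b: the function names binary_diff and compare_files are hypothetical, and the mismatch listing only loosely imitates fc's offset-and-bytes output.

```python
from pathlib import Path


def binary_diff(a: bytes, b: bytes, limit: int = 16) -> list[tuple[int, int, int]]:
    """Return up to `limit` mismatches as (offset, byte_in_a, byte_in_b)."""
    mismatches = []
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            mismatches.append((offset, x, y))
            if len(mismatches) >= limit:
                break
    return mismatches


def compare_files(path_a: str, path_b: str) -> str:
    """Compare two files byte for byte and describe the result as text."""
    a, b = Path(path_a).read_bytes(), Path(path_b).read_bytes()
    if len(a) != len(b):
        return f"size mismatch: {len(a)} vs {len(b)} bytes"
    diffs = binary_diff(a, b)
    if not diffs:
        return "no differences"
    return "\n".join(f"{off:08X}: {x:02X} {y:02X}" for off, x, y in diffs)
```

Calling compare_files("EMP1_B81.BIN", "CRTDIAG.BIN") on a matching pair would return "no differences". Like fc /b, the comparison knows nothing about the programmer that produced the file.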

Here's the new move. With two EMP-20s on the bench, the truth baseline that Part 1 assumed could now be demonstrated. Part 1's diagnostic argument rested on the EMP-20 being a trustworthy reference. That trust was inherited from decades of bench experience and the DOS-era engineering—fair, but not actually proven inside the post. With a second EMP-20 in front of me, it could be.

So I moved each chip to EMP-20 #2—the 1993 Rev. A→H upgrade, a completely different hardware revision from the original Rev. K, manufactured six years earlier, originally shipped as a Rev. A and field-upgraded somewhere along the way to Rev. H. Same DOS software, same M27C512 family module. Read each chip back, save as EMP2_B81.BIN through EMP2_B84.BIN. Then the cross-comparison matrix:

fc /b EMP2_B81.BIN CRTDIAG.BIN  → No differences
fc /b EMP2_B81.BIN EMP1_B81.BIN → No differences
fc /b EMP2_B82.BIN CRTDIAG.BIN  → No differences
fc /b EMP2_B82.BIN EMP1_B82.BIN → No differences
fc /b EMP2_B83.BIN CRTDIAG.BIN  → No differences
fc /b EMP2_B83.BIN EMP1_B83.BIN → No differences
fc /b EMP2_B84.BIN CRTDIAG.BIN  → No differences
fc /b EMP2_B84.BIN EMP1_B84.BIN → No differences

Eight fc /b comparisons. All clean. Two Needham EMP-20s of distinctly different vintages—a 1993 Rev. A→H upgrade and a 1999 Rev. K—agree bit-perfect on every chip in the test batch. The EMP-20 truth baseline isn't an assumption anymore. It's data.

The chips are bit-perfect against the source file across two independent programmers of different revisions. Time to give the GQ-4x4 a chance to misbehave.

The GQ-4x4 self-consistency test

The hypothesis I closed Part 1 with: the GQ-4x4's read path is analog-marginal. Marginal Vcc, marginal read timing, or an aging buffer IC—failure modes that produce intermittent, multi-bit, scattered errors rather than deterministic single-cell faults. If the hypothesis was right, the simplest possible test would be enough to demonstrate it: read the same chip three times in a row, save three separate files, and binary-compare them against each other.

If the GQ-4x4's read electronics are genuinely jittery, three sequential reads should produce three slightly different files—different errors at different offsets each pass. If they don't, the read path is consistent under current conditions and the analog-jitter story takes a hit.

This was the load-bearing test.
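
The comparison set for one chip can be sketched as below: every pair of reads against each other, then the first read against the source. The data here is a made-up stand-in image, not a real readback, and self_consistency is a hypothetical name.

```python
from itertools import combinations


def self_consistency(reads: dict[str, bytes], source: bytes) -> list[tuple[str, str, bool]]:
    """Compare every pair of reads, then the first read against the source."""
    names = sorted(reads)
    results = [(a, b, reads[a] == reads[b]) for a, b in combinations(names, 2)]
    results.append((names[0], "source", reads[names[0]] == source))
    return results


# Stand-in data: three identical 64 KiB reads of an invented image.
image = bytes(range(256)) * 256
reads = {"A": image, "B": image, "C": image}
for a, b, same in self_consistency(reads, image):
    print(f"{a} vs {b}: {'no differences' if same else 'DIFFERENT'}")
```

Three reads give three pairwise comparisons plus one against the source, so four per chip. A jittery read path would show up as at least one DIFFERENT line.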

GQ-4x4 EPROM programmer staged on the workbench with an inline USB digital tester

The GQ-4x4 staged with the inline USB digital tester between the host and the programmer.

I dropped blue 81 into the GQ-4x4's ZIF socket. M27C512 device profile already loaded, CRTDIAG.BIN in the buffer, GQ-4x4 Software Re. 7.38 (the app self-reports as current—the firmware-revision branch of the original Part 2 plan was already closed before the bench session started). Read the chip into a fresh buffer, saved as GQ_B81_A.BIN. Didn't eject the chip. Read again, saved as GQ_B81_B.BIN. Read a third time, saved as GQ_B81_C.BIN. Same chip in the same socket, three reads back-to-back, three independent buffer snapshots.

Then blue 82 and blue 83, same drill.

Nine readback files. Time for the matrix.

fc /b GQ_B81_A.BIN GQ_B81_B.BIN → No differences
fc /b GQ_B81_A.BIN GQ_B81_C.BIN → No differences
fc /b GQ_B81_B.BIN GQ_B81_C.BIN → No differences
fc /b GQ_B81_A.BIN CRTDIAG.BIN  → No differences
fc /b GQ_B82_A.BIN GQ_B82_B.BIN → No differences
fc /b GQ_B82_A.BIN GQ_B82_C.BIN → No differences
fc /b GQ_B82_B.BIN GQ_B82_C.BIN → No differences
fc /b GQ_B82_A.BIN CRTDIAG.BIN  → No differences
fc /b GQ_B83_A.BIN GQ_B83_B.BIN → No differences
fc /b GQ_B83_A.BIN GQ_B83_C.BIN → No differences
fc /b GQ_B83_B.BIN GQ_B83_C.BIN → No differences
fc /b GQ_B83_A.BIN CRTDIAG.BIN  → No differences

Twelve comparisons. Zero disagreements. Nine reads of three chips on the suspect programmer, all bit-perfect against each other and against the source file.

That isn't what Part 1 predicted.

To be precise about what the data does and doesn't say: the result doesn't mean the GQ-4x4 didn't misread chips in Part 1. Those failures were real. Part 1's bit-flip math was correct; the chips were fine; the GQ's verify dialog was producing wrong bytes at specific offsets. That data is in the audit trail and isn't being walked back here.

What the result does mean is that whatever conditions produced the Part 1 misreads aren't present on the rig tonight. Something between Part 1 and Part 2 changed—equipment, environment, configuration, contact, or something else I haven't identified yet—and the change has moved the GQ-4x4 from "intermittently lying" to "consistently honest" without my changing anything on purpose.

The next move is obvious. If the gremlin is gone because something has changed, I should be able to figure out what changed by reverting variables one at a time and watching for the misreads to come back.

Two variables had visibly changed since Part 1's bench session: I was running the GQ-4x4 off a different USB port on the host machine, and I'd seated the chip in the ZIF socket with a noticeably tighter grip than Part 1's setup. Either could plausibly affect signal integrity at the chip/programmer interface. Time to test them.

Variable flip #1—testing the USB port

The right USB port on the host was the most plausible candidate. It's the one I'd been using during Part 1's session, the one that lined up with the failures. The clean reads tonight were happening on a different port—the one closer to the back of the chassis.

The experiment design: move the GQ-4x4's USB cable back to the suspect right port. Re-run the self-consistency test on blue 81 (the chip with the most accumulated data so far). Save the new reads under new filenames, not overwriting the clean-port baseline.

Plus a wrinkle worth testing while the rig was already being reconfigured: the USB digital tester I'd been using to measure Vbus and current draw sat inline between the host and the GQ-4x4. That meant Part 2's rig had extra USB connector hops, an extra pass-through PCB, and a measurable insertion impedance that Part 1's bare cable didn't have. If the tester was filtering or stabilizing the USB power in a way that happened to mask a marginal issue, removing it should expose the fault. So the test split into two phases:

Phase D/E/F—right USB port, no tester inline. Three sequential reads of blue 81 on a bare USB connection, exactly as Part 1's rig was set up. Saved as GQ_B81_D.BIN, GQ_B81_E.BIN, GQ_B81_F.BIN. (I skip "I" in naming conventions to avoid the digit-letter ambiguity in log scans—a small habit that saves a half-minute of squinting every time.)

Phase G/H/J/K—right USB port, tester inline. Four sequential reads with the USB digital tester back in the path. Saved as GQ_B81_G.BIN, GQ_B81_H.BIN, GQ_B81_J.BIN, GQ_B81_K.BIN.

The matrix of outcomes I was prepared to see:

D/E/F (raw right port) | G/H/J/K (right + tester) | Interpretation
Misreads               | Clean                    | Right port is marginal; tester is masking the failure
Misreads               | Misreads                 | Right port is the cause regardless of tester
Clean                  | Misreads                 | Tester itself is producing failures
Clean                  | Clean                    | Neither USB port nor tester is the cause; move on
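
Writing the four outcomes down as a lookup before running the reads is one way to keep the interpretation honest. The sketch below just encodes the table above; OUTCOMES and interpret are illustrative names.

```python
# Keys are (raw-port reads clean?, with-tester reads clean?).
OUTCOMES = {
    (False, True): "Right port is marginal; tester is masking the failure",
    (False, False): "Right port is the cause regardless of tester",
    (True, False): "Tester itself is producing failures",
    (True, True): "Neither USB port nor tester is the cause",
}


def interpret(raw_reads_clean: bool, tester_reads_clean: bool) -> str:
    """Map the two observed results onto the pre-registered interpretation."""
    return OUTCOMES[(raw_reads_clean, tester_reads_clean)]


print(interpret(raw_reads_clean=True, tester_reads_clean=True))
```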

Ten representative fc /b comparisons across the matrix, against the clean-port baseline and against the source file:

fc /b GQ_B81_D.BIN GQ_B81_A.BIN  → No differences  (raw right port vs clean baseline)
fc /b GQ_B81_D.BIN CRTDIAG.BIN   → No differences
fc /b GQ_B81_D.BIN GQ_B81_E.BIN  → No differences  (raw self-consistency)
fc /b GQ_B81_E.BIN GQ_B81_F.BIN  → No differences

fc /b GQ_B81_G.BIN GQ_B81_A.BIN  → No differences  (with-tester vs clean baseline)
fc /b GQ_B81_G.BIN CRTDIAG.BIN   → No differences
fc /b GQ_B81_G.BIN GQ_B81_H.BIN  → No differences  (with-tester self-consistency)
fc /b GQ_B81_H.BIN GQ_B81_J.BIN  → No differences
fc /b GQ_B81_J.BIN GQ_B81_K.BIN  → No differences
fc /b GQ_B81_G.BIN GQ_B81_K.BIN  → No differences

Plus a glance at the inline USB digital tester throughout the with-tester reads: 5.00V steady, current draw within a normal range during the read pulls, no sagging, no transients I could see on the meter.

USB digital tester reading 5.00V and 0.13A during an EPROM read pull on the GQ-4x4
The inline USB tester reading 5.00V steady at 0.13A during a GQ-4x4 read pull (idle was 0.038A). Clean power delivery throughout.

Bottom-right cell of the outcome matrix: clean across both conditions. The right USB port is not the variable. Adding the tester inline is not the variable. Both can be ruled out.

That doesn't mean Part 1's failures weren't real—but it does mean that whatever USB-side issue may have contributed back then isn't reproducible by switching back to the same port tonight. The "port" abstraction may be too coarse; the original fault may have lived somewhere downstream of the port itself (cable connector contact, GQ-4x4 internal trace, host USB hub state) and incidentally got cleared up when the rig was reconfigured between Part 1 and now.

One variable down. Time to check the chips themselves.

Variable flip #2—testing chip vintage

The blue 81–85 chips on the foam were fresh blanks I'd burned an hour earlier, with cell charge as new and crisp as it gets. If chip-state variance is part of the story—if older or differently-handled chips drive their data lines a little differently, and that variance interacts with a marginal GQ-4x4—fresh chips might just not surface the failure.

I had exactly the right reference material for this: a row of green-dotted chips on the chip rail off to the side of the foam.

A bench-organization aside, since the color coding shows up in photos: chips in active rotation on the bench get a small sticker with a dot color and a number. Blue dots are this batch—fresh M27C512 blanks from the workshop's unprogrammed stock, numbered 81 through 85. Green dots are the previous batch—M27C512s burned in earlier sessions, validated then, now sitting on a chip rail as known-good reference material. Greens 41 through 45. Two colors, sequential numbers, no ambiguity when chips are moving between programmers, rails, and foam through a long session.

So the second variable flip: read green 41 through green 44 on the GQ-4x4, three sequential reads each. Same socket, same workflow as the blue chips. Save each readback as GQ_G41_A.BIN, GQ_G41_B.BIN, GQ_G41_C.BIN, and so on through green 44.

Twelve readbacks. Sixteen fc /b comparisons across the matrix—each chip's three reads against each other, plus the first read of each chip against the source file:

fc /b GQ_G41_A.BIN CRTDIAG.BIN  → No differences
fc /b GQ_G41_A.BIN GQ_G41_B.BIN → No differences
fc /b GQ_G41_A.BIN GQ_G41_C.BIN → No differences
fc /b GQ_G41_B.BIN GQ_G41_C.BIN → No differences

fc /b GQ_G42_A.BIN CRTDIAG.BIN  → No differences
fc /b GQ_G42_A.BIN GQ_G42_B.BIN → No differences
(... continues through G43 and G44 ...)

fc /b GQ_G44_B.BIN GQ_G44_C.BIN → No differences

Sixteen comparisons, all clean. Already-burned chips of older vintage read just as cleanly on the GQ-4x4 as the fresh blanks did. Chip age, chip handling history, time-since-burn—none of those are the variable either.

At this point I'd ruled out the USB port and the inline tester, plus chip vintage as a confounder. The second visibly changed variable, ZIF grip, I hadn't deliberately tested. The GQ-4x4 was reading every chip cleanly under every condition I could think to test it under tonight.

That left a question worth being honest about. If none of the variables I could think of was the cause, and the GQ-4x4 wasn't going to misbehave on the rematch, then the original analog-jitter hypothesis was either wrong or had moved past the point where I could provoke it tonight. Either way, the post couldn't be about isolating which specific analog issue was at fault—there wasn't anything left to isolate.

But the bench session wasn't done. The original Part 2 plan included a five-programmer cross-validation matrix, with the GQ-4x4 as just one of five voices. The other four had walked onto the bench specifically to make Part 1's chip-side argument harder to walk back. They still had work to do.

Five programmers, one truth—the cross-validation matrix

With the GQ-4x4 ruled in as honest (at least tonight) and chip vintage ruled out, the rest of the bench rig still had something to contribute: a multi-source agreement check on the chip contents themselves.

Part 1 had built its case on a single audit-trail comparison per chip: read the chip on a programmer, save the binary, and run fc /b against the source file using a tool that doesn't know what a programmer is. That's the load-bearing audit trail for any production EPROM workflow, and it works because the comparison tool (fc /b) is fully independent of the programmer hardware. Part 2's expanded rig let me run that same logic across four programmers, and add a fifth chip-side reading source via the BackBit chip tester.

The four programmers each have completely independent read electronics:

  • EMP-20 #1—Rev. K, 1999. Parallel-port interface, DOS software, family-module board for the 27Cxxx series. The trusted veteran of the bench.
  • EMP-20 #2—Rev. A→H upgrade, 1993. Same software, same module, completely different vintage hardware. The new cross-check arrival.
  • GQ-4x4—USB programmer from MCUmall. Software Re. 7.38. The Part 1 suspect.
  • XGECU T48—the current-generation USB programmer from XGECU, the TL866 line's successor. Different software (XGecu Pro), different read path, different USB controller from the GQ-4x4.

Plus the fifth source: the BackBit chip tester, which reads a bare EPROM dropped into its ZIF socket and returns a 16-bit checksum on the LCD display. Different again—it's not a programmer at all, it's a chip-side validator originally designed to verify ROMs out of game cartridges, and its read electronics share nothing in particular with any of the four programmers.

BackBit chip tester handheld with drop-in ZIF socket for reading bare EPROMs
The BackBit chip tester, designed by Evie Salomon. Drop-in ZIF socket reads a bare EPROM and returns a 16-bit checksum on the LCD.

If the chips in the test batch were bit-perfect, every one of those five reading sources should say so independently.

For the T48 readback pass, I dropped each chip into the T48's ZIF socket, loaded the M27C512 device profile in XGecu Pro, read the chip, and saved the buffer to disk. Blue 81 → T48_B81.BIN, blue 82 → T48_B82.BIN, and so on through blue 84. Then green 41 through green 44—same drill, saved as T48_G41.BIN through T48_G44.BIN. Eight chips, eight readbacks, with the T48 software reporting a stable 4.99V over its USB 2.0 high-speed link the whole time.

The cross-comparison matrix for the T48 layer:

fc /b T48_B81.BIN CRTDIAG.BIN   → No differences
fc /b T48_B81.BIN EMP1_B81.BIN  → No differences
fc /b T48_B81.BIN GQ_B81_A.BIN  → No differences
fc /b T48_B82.BIN CRTDIAG.BIN   → No differences
fc /b T48_B82.BIN EMP1_B82.BIN  → No differences
(... through blue 84 ...)
fc /b T48_G41.BIN CRTDIAG.BIN   → No differences
fc /b T48_G42.BIN CRTDIAG.BIN   → No differences
fc /b T48_G43.BIN CRTDIAG.BIN   → No differences
fc /b T48_G44.BIN CRTDIAG.BIN   → No differences

Thirteen comparisons. All clean. The T48 read every chip identically to the source file, the EMP-20 #1 readback, and the GQ-4x4 readback. Whatever the GQ was up to on the bad night in Part 1, it isn't up to it now—and a completely independent USB programmer of a different lineage confirms what every other reading source has been saying.
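
With readbacks from several programmers, an alternative to pairwise comparisons is to group the files by content digest: if every readback of a chip lands in one group, all sources agree. This is a sketch of that idea, not what was run on the bench. SHA-256 and the names group_by_digest and all_agree are my own choices here.

```python
import hashlib
from collections import defaultdict


def group_by_digest(readbacks: dict[str, bytes]) -> dict[str, list[str]]:
    """Group readback names by the SHA-256 digest of their contents."""
    groups = defaultdict(list)
    for name, data in readbacks.items():
        groups[hashlib.sha256(data).hexdigest()].append(name)
    return dict(groups)


def all_agree(readbacks: dict[str, bytes]) -> bool:
    """True when every readback has identical contents."""
    return len(group_by_digest(readbacks)) == 1


# Stand-in bytes; in practice these would be loaded from files such as
# EMP1_B81.BIN, GQ_B81_A.BIN, and T48_B81.BIN.
chip_81 = {"EMP1": b"\xa9" * 8, "GQ": b"\xa9" * 8, "T48": b"\xa9" * 8}
print(all_agree(chip_81))  # True
```

A disagreeing source shows up as a second group, which also tells you which programmer is the odd one out.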

Finally, the BackBit. Blue 84 dropped into the ZIF socket—the chip allocated specifically to this layer of the matrix. I pressed the read button. The LCD reported a 16-bit checksum on the chip's contents.

It matched 2B66. The lower 16 bits of the GQ-4x4's 008C2B66 reading from the start of the night. The same checksum the EMP-20 had reported after each burn. The same chip contents, read through a fifth independent hardware path and confirmed to a 16-bit precision the LCD could display in two seconds.
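
The relationship between the two checksum widths is a one-line mask. The sketch below assumes both devices compute a plain additive byte sum, which this post doesn't confirm. Only the observation that 2B66 is the lower 16 bits of 008C2B66 comes from the bench.

```python
def byte_sum(data: bytes) -> int:
    """Additive checksum: the sum of all bytes (assumed algorithm)."""
    return sum(data)


def low16(checksum: int) -> int:
    """Keep only the lower 16 bits, as a 16-bit display would show."""
    return checksum & 0xFFFF


print(f"{low16(0x008C2B66):04X}")  # 2B66
```

A 16-bit match is a weaker check than a byte-for-byte diff, since many different images share the same lower 16 bits, which is why it sits alongside fc /b instead of replacing it.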

That's the matrix.

Reading source             | Chips read         | fc /b vs source / cross | Differences
EMP-20 #1 (Rev. K, 1999)   | 4 chips, 4 reads   | 4 comparisons           | 0
EMP-20 #2 (Rev. A→H, 1993) | 4 chips, 4 reads   | 8 (incl. vs EMP-20 #1)  | 0
GQ-4x4 (all conditions)    | 7 chips, 28 reads  | 41 comparisons          | 0
XGECU T48                  | 8 chips, 8 reads   | 13 comparisons          | 0
BackBit                    | 1 chip, 1 checksum | 1 vs 2B66               | 0
Total                      | ~45 reads          | ~67 comparisons         | 0
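
The totals can be checked with a quick tally. The figures below are copied from the table and from the read counts described earlier in the post; nothing here is new data.

```python
rows = [
    # (reading source, reads, comparisons)
    ("EMP-20 #1", 4, 4),
    ("EMP-20 #2", 4, 8),
    ("GQ-4x4", 28, 41),
    ("XGECU T48", 8, 13),
    ("BackBit", 1, 1),
]
total_reads = sum(reads for _, reads, _ in rows)
total_comparisons = sum(comps for _, _, comps in rows)

# GQ-4x4 reads by phase: blue 81-83 x3, D/E/F, G/H/J/K, green 41-44 x3.
gq_reads = 3 * 3 + 3 + 4 + 4 * 3

print(total_reads, total_comparisons, gq_reads)  # 45 67 28
```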

Five reading sources of fundamentally different hardware lineages (a 1999 parallel-port DOS programmer, a 1993 parallel-port DOS programmer field-upgraded to Rev. H, two different generations of USB programmer, and a chip-side cart validator), all agreeing on every chip they read.

The Part 1 verdict that the chips were bit-perfect is no longer resting on a single audit-trail comparison. It's resting on five.

What I ruled out

The variables I'd tested through the bench session and across the cross-validation matrix:

  • USB port (left vs right). Both ports on the host tested. No misreads on either.
  • Inline USB tester (impedance + connector hops). Tested with the tester in the path and removed from it. No misreads under either configuration.
  • Chip vintage / age-since-burn. Fresh blue blanks burned an hour earlier and already-burned green reference chips from an earlier batch read identically clean.
  • Chip-batch / sub-lot. Chips drawn from the workshop's M27C512 sub-lot read clean throughout.
  • GQ-4x4 software version. Software Re. 7.38, USB Driver Re. 3.0; the GQ-4x4 application reports as updated when checked against MCUmall's server.
  • EMP-20 truth baseline. Two units of distinctly different vintages—a 1999 Rev. K and a 1993 Rev. A→H upgrade—agree bit-perfect against the source file and against each other. The EMP-20 as a reference programmer isn't an assumption resting on past bench experience anymore. It's been empirically demonstrated against an independent unit.
  • Chip contents themselves. Five independent reading sources confirm every byte of every chip in the test batch matches the source file.

None of the variables I could think to flip produced a misread on the GQ-4x4. That's a real result, even if it's a negative one.

One point worth being precise about: a clean test session doesn't mean Part 1's failures weren't real. The four chips that misread on the GQ-4x4 in Part 1 misread for a reason. The bit-flip math was correct, the cross-validation against the EMP-20 was correct, the fc /b audit trail against the source file was correct. The failure mode just isn't reproducing tonight with the variables I can identify and control.

What I couldn't test

A few variables are still in play and deserve to be called out, because intellectual honesty matters in a null-result post.

  • The original Part 1 failing chips themselves. Those four chips are in finished diagnostic cartridges on shelves, doing their job. I can't pull them out for a re-read without cracking open carts I value. If the failure was something specific to those four physical chips—a marginal cell, a manufacturing micro-defect that pushes data-line drive strength right to the edge of the GQ-4x4's read threshold—I haven't tested for it.
  • ZIF grip variance. I considered a deliberate-misseating test—seat the chip with the ZIF lever barely closed, just making contact—to see if marginal socket contact would surface the failure. I decided against it for the post. "I broke it on purpose with bad socket contact" is a weaker story than the data I already have.
  • Deliberately marginal USB rigging. A powered hub with poor regulation, a long unshielded cable, a host machine with other USB devices pulling current—any of these could plausibly push USB power past the GQ-4x4's threshold. I didn't test them because the rig I actually have works.
  • Time-passage on the GQ-4x4's own electronics. It's possible the read circuitry has shifted since Part 1—a marginal trace settling differently after a power cycle, a thermal effect, an aging discrete component happening to land on the right side of a threshold. I can't isolate that without internal probe work that's out of scope for a Bench Notes post.

Each of these is a future Bench Notes seed if any of them ever turn into a reproduced failure. For now, the variables I could control are clean.

Why the audit trail still wins—single-programmer verify and intermittent EPROM faults

This is the part of the post the original hypothesis would have walked right past.

If the bench experiments had landed the diagnosis—if they'd identified marginal Vcc, or a specific firmware bug, or an aging buffer IC as the cause of the Part 1 misreads—the lesson would have been narrow. "Here's what was wrong with this specific programmer; fix it (or don't trust it) accordingly." Useful, but bounded.

The null result is broader, and the lesson it forces is sharper.

The GQ-4x4 misread chips for real in Part 1. The chips were bit-perfect. The four chips, the offsets, the bit-flip math, the audit-trail confirmation—all of that was correct data, captured on the night it happened. Tonight, the same programmer, with chips burned to the same source file, behaves perfectly across forty-five reads under every condition I could think to test. The fault hasn't been diagnosed. It hasn't been reproduced. It has, for all practical purposes, moved.

Which leads to the load-bearing observation: you cannot rely on being able to reproduce an intermittent fault later.

If a single programmer reports clean verify and clean checksum on every chip in a batch you're shipping—and you ship on that evidence alone—and one of those chips lands in someone else's hardware and fails—you cannot go back to your programmer the next morning and say "let me check whether it would have misread that chip if I'd done it again." The fault might not be there anymore. The conditions that produced the bad verify might have shifted while you slept. Maybe you swapped USB ports during cleanup. Maybe the host machine rebooted. Maybe the chip got jostled in transit. Maybe nothing identifiable—and the programmer is just behaving differently because that's what marginal hardware does. The night you should have checked has passed. You either have the evidence from that night, or you don't.

This is why the audit trail extension I described in Part 1 isn't paranoid documentation. It's the only evidence you can rely on at all.

Recap of the audit trail: for every chip in a production batch, after the standard 2 → 1 → 3 → N burn workflow, do three more steps:

  • Option 4—read the device back into the programmer's buffer.
  • Option 9—save the buffer to disk as chipNN.bin.
  • DOS prompt: run fc /b chipNN.bin source.bin against the original file. Expected output: No differences.

Three independent confirmations per chip: internal verify (programmer compares chip to its own buffer), checksum match (the device checksum agrees with the source file's), and now a filesystem-level byte-for-byte diff against the source file using a tool that has no idea what a programmer is. The third one is the load-bearing piece, because it breaks the chain-of-trust loop the programmer can't break on its own.

Two minutes per chip. Twenty-four extra minutes on a twelve-chip batch. The output is an evidence file you can show anyone—a customer, a community member, yourself a year from now—that proves what every chip in the batch contained at the moment it left your bench.
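
If the readback files end up on a modern machine, the per-chip evidence can be collected into one record per batch. This is a hedged sketch of what such a record might look like, not part of the DOS workflow: the SHA-256 digest, the JSON layout, and the names audit_record and audit_batch are all my own additions.

```python
import hashlib
import json


def audit_record(chip_id: str, readback: bytes, source: bytes) -> dict:
    """Summarize one chip's readback against the source image."""
    return {
        "chip": chip_id,
        "bytes": len(readback),
        "sha256": hashlib.sha256(readback).hexdigest(),
        "matches_source": readback == source,
    }


def audit_batch(readbacks: dict[str, bytes], source: bytes) -> str:
    """Build a JSON evidence file covering every chip in the batch."""
    records = [audit_record(cid, data, source) for cid, data in sorted(readbacks.items())]
    return json.dumps(records, indent=2)


# Stand-in bytes; real use would load each chipNN.bin and source.bin
# with pathlib.Path(...).read_bytes().
source = bytes(range(256))
print(audit_batch({"81": source, "82": source}, source))
```

The digest is a convenience for spotting later tampering or file mix-ups. The byte-for-byte equality check is still what carries the evidence.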

Part 1 framed this as belt-and-suspenders for bench EPROM work. Part 2 sharpens it: the audit trail isn't belt-and-suspenders. It's the belt. Single-programmer verify alone is the kind of belt you might find undone tomorrow and never know it.

Tonight's null result on the GQ-4x4 is, paradoxically, the strongest argument for the workflow I've made in any Bench Notes post yet. The programmer that lied on the night of Part 1's session is telling the truth tonight, with no diagnostic intervention from me. If Part 1's audit trail had only been a 2 → 1 → 3 → N burn and a single device checksum, I'd have been left with two programmers disagreeing about four chips and no way to tell which one was wrong. The fc /b against source saved that batch. Nothing else could have.

That's the lesson. The audit trail wins because the gremlin moves.

Workflow update—the cross-validation methodology after Part 2

The audit trail itself doesn't change. The four-keystroke burn workflow (2 → 1 → 3 → N), the readback to a chipNN.bin file (Option 4 + Option 9), and the fc /b chipNN.bin source.bin filesystem comparison are the load-bearing per-chip discipline for any EPROM that has to be right when it leaves the bench. That's exactly what it was at the end of Part 1, and Part 2's null result reinforces it rather than revising it.

What does change is the rig around it.

The second EMP-20 retires the single-point-of-failure status the workshop had been operating under. With two Needham EMP-20s on the bench—different vintages, different revisions, same software, same family modules—a chip that needs cross-EMP-20 verification can get it without leaving the workshop. The Part 2 cross-verify pass (read on EMP-20 #1, re-read on EMP-20 #2, fc /b between the readbacks) is now a workflow option for high-stakes batches, not a one-time experiment.

The XGECU T48 stays on the bench as a third, independent programmer for cross-validation when a batch warrants it. The BackBit's drop-in ZIF socket is in the validation toolkit now, available as a chip-side checksum sanity check that costs about three seconds per chip.

And the framing of the workflow itself has matured. Part 1 made the case for the audit-trail extension as belt-and-suspenders for production work—better safe than sorry. Part 2 makes the harder case: it's not an optional belt. Intermittent faults move. The fc /b against source on the day of the burn is the only evidence that the chip was right when it left the bench. Without that file, you don't have an audit trail. You have a story.

What you can take away from Part 2

If you burn EPROMs that go into real hardware—replacement BIOSes, arcade ROM swaps, retro hardware repair, community PCB builds—these are the lessons Part 2 sharpens:

  1. Single-programmer verify isn't enough. Even a programmer with a perfect track record can be misreading on a given night without showing it in the verify dialog. The intermittent failure rate Part 1 captured (4 of 12) wasn't visible in any single chip's verify; it only surfaced through the cross-pass on a second programmer.
  2. The fc /b audit trail is the only evidence that survives the night. If a programmer misreads a chip and you ship the chip on that programmer's clean verify alone, you cannot reconstruct the fault later by re-running the verify. The fault may have moved on. The only evidence that holds up is the binary file you captured at the moment of the burn.
  3. Intermittent faults are real and they don't always reproduce on demand. The post you're reading is itself the proof of that. Part 1 documented real misreads. Part 2 ran nearly seventy fc /b comparisons across five programmers and couldn't get the failure to come back. Both are real findings, they don't contradict each other, and both should change how you treat single-programmer verify.
  4. When you can afford a second reading source, use it. A second EMP-20, a second USB programmer, a chip-side cart tester, even a friend's bench programmer at a swap meet—any second independent source is the cheapest insurance against the kind of failure mode Part 1 surfaced. You don't need five programmers like the cross-validation matrix in this post. You need two that are independent of each other.

The total overhead is still about two minutes per chip beyond bare verify. That's the cost of knowing—really knowing, with binary evidence you can show anyone—what your chips contained when they left your bench.

For chips going into hardware you can't easily revisit, that's still an obvious yes.

Frequently asked — intermittent EPROM read faults and cross-validation

Why didn't the GQ-4x4 misread when you tested it again in Part 2? That's the central question of this post, and the honest answer is I don't know — and that's the point. The GQ-4x4 read every test chip bit-perfect across twenty-eight sequential reads, on two USB ports, with and without the inline tester, on fresh and aged chips. None of the variables I could control produced the Part 1 failure. Whatever set of conditions caused those misreads has shifted since, and I can't identify what specifically. Intermittent faults move; that's the nature of the failure mode.

Does an intermittent EPROM read fault mean my chip is bad? Not necessarily — and that's the whole reason to cross-validate. An EPROM verify error tells you the programmer's readback disagrees with its buffer. The disagreement can be a bad chip, an incomplete burn, or the programmer misreading a perfectly good chip. The way to find out which one is to compare the chip against the original source file using a tool independent of the programmer — DOS fc /b is the simplest one, and it's what the audit trail in this post relies on.
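As a rough illustration of what a byte-level compare reports, the Python sketch below lists the offsets where a readback and a source image disagree, plus the XOR of the two bytes so the flipped bits are visible. The function name diff_bytes and its output fields are hypothetical. The example bytes reuse the Part 1 fingerprint (0xA2 read where 0xA9 was expected at offset 0x000002), on the assumption that the Buffer value in the verify dialog is the source byte.

```python
def diff_bytes(readback: bytes, source: bytes, limit: int = 16):
    """Return (mismatches, same_length) for two binary images.

    mismatches holds at most `limit` entries, one per differing offset
    within the overlapping length of the two images.
    """
    mismatches = []
    for offset, (got, want) in enumerate(zip(readback, source)):
        if got != want:
            flipped = got ^ want  # set bits mark the positions that differ
            mismatches.append({
                "offset": offset,
                "readback": got,
                "source": want,
                "xor": flipped,
                "bits_flipped": bin(flipped).count("1"),
            })
            if len(mismatches) >= limit:
                break
    return mismatches, len(readback) == len(source)


# Toy four-byte images carrying the Part 1 fingerprint at offset 2.
source_image = bytes([0x00, 0x01, 0xA9, 0x03])
readback_image = bytes([0x00, 0x01, 0xA2, 0x03])

found, same_length = diff_bytes(readback_image, source_image)
for m in found:
    print(
        f"offset 0x{m['offset']:06X}: readback=0x{m['readback']:02X} "
        f"source=0x{m['source']:02X} xor=0x{m['xor']:02X} "
        f"({m['bits_flipped']} bits)"
    )
```

For the toy images this reports a single mismatch at offset 0x000002 with an XOR of 0x0B, meaning three bit positions differ. A diff like this only says that the two files disagree; deciding whether the chip or the programmer is at fault still takes a read from a second, independent source.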

What's the minimum viable EPROM cross-validation workflow? Two reading sources, one binary compare. Burn the chip on one programmer; read the chip back on a different programmer (or the same programmer at a different time); binary-compare the readback against the original source file with fc /b. If both reads match the source, you have a real audit trail. You don't need five programmers like the matrix in this post — you need two that are independent of each other.

How can I tell if my EPROM programmer is misreading chips? Read the same chip three times in a row, save three separate files, and fc /b them against each other and against the source file. If the three readbacks differ from each other at any offset, the programmer's read electronics are intermittent. If they match each other but differ from the source file, either the programmer is misreading deterministically or the chip's contents really do differ from the source, and a read on a second programmer tells you which. If everything matches, the programmer is reading the chip correctly under current conditions, which doesn't guarantee it will tomorrow.
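The repeated-read check can be sketched in a few lines of Python. This is an illustration under assumptions, not a tool from this post: the function names and the three outcome labels ("unstable", "stable_match", "stable_mismatch") are made up for the example.

```python
from pathlib import Path


def classify_reads(reads, source):
    """Classify repeated reads of one chip against the source image.

    reads  -- list of bytes objects, one per read of the same chip
    source -- bytes of the original source file
    """
    if len(reads) < 2:
        raise ValueError("need at least two reads of the same chip")
    first = reads[0]
    if any(r != first for r in reads[1:]):
        # The reads disagree with each other: something is intermittent.
        return "unstable"
    if first == source:
        # Consistent and correct, under current conditions only.
        return "stable_match"
    # Consistent but wrong: a deterministic misread, or the chip itself
    # differs from the source. A second programmer is needed to tell.
    return "stable_mismatch"


def classify_files(read_paths, source_path):
    """Same check, reading the images from disk."""
    reads = [Path(p).read_bytes() for p in read_paths]
    return classify_reads(reads, Path(source_path).read_bytes())
```

With three saved readbacks, classify_files(["read1.bin", "read2.bin", "read3.bin"], "source.bin") returns one of the three labels. A "stable_match" result carries the same caveat as the paragraph above: it describes tonight's reads, not tomorrow's.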


That's the close of the Cross-Validation series. Part 1: When Two EPROM Programmers Disagree—A Cross-Validation Workflow. Companion: Reading an EPROM Verify Error—A Bit-Flip Primer. The printable workshop reference card that came out of these posts is available alongside this post—link in the footer once it's staged.

I'm Jeffrey Mays. Bench Notes is where I write up the actual workshop work—burns, builds, repairs, the occasional unreproducible gremlin. Subscribe to catch the next bench session.
