When Thermal Cycling Exposed What Milestones Couldn't See
The E-GMP platform ICCU recalls affecting more than 200,000 vehicles should not be read as a semiconductor failure.
It is a structural failure.
Public records confirm two NHTSA recall waves. The official language is precise: it attributes the MOSFET damage inside the Integrated Charging Control Unit to "transient high voltage and thermal cycling". NHTSA documentation describes the defect as cumulative damage from compound stresses operating simultaneously.
That phrase matters. This is not one failure mode. It is two stresses interacting, and the interaction is what kills the device.
The Technical Context
Wide-bandgap semiconductors resolved the electrical constraints that limited previous generations of power electronics. SiC MOSFETs operate at higher switching frequencies, handle greater voltage stress, and dissipate less energy than silicon devices.
The bottleneck shifted.
What was once an electrical problem became a thermal stability problem. Higher switching frequency reshapes heat generation. That reshapes cooling requirements. Cooling reshapes packaging. Packaging reshapes vibration behaviour. Vibration feeds back into solder fatigue and long-term reliability.
Control logic shapes transient thermal loading. EMI mitigation reshapes packaging and heat rejection again. The system became genuinely interdependent.
SiC has a Young's modulus roughly three times that of silicon. That stiffness raises thermo-mechanical stress in the package during junction temperature swings. Research shows that under identical packaging and test conditions, SiC devices in conventional packaging designs reach roughly one-quarter to one-third of the power cycling life of their silicon equivalents.
The failure modes that matter now are cumulative. Thermal cycling. Local hotspots. Substrate delamination. Bond wire fatigue. Interface material degradation.
The question is no longer whether a component survives a test. It is how the whole system behaves under years of interacting stresses.
Anatomy of the ICCU Failure
The ICCU manages power flow between the high-voltage traction battery and the 12-volt auxiliary system. Community teardown evidence suggests a high-frequency SiC-based converter architecture operating at approximately 300 kHz in an active clamp forward topology.
That "transient high voltage and thermal cycling" appear together, as a single causal mechanism, tells you what happened.
The thermal model that qualified the MOSFET selection was almost certainly built against a steady-state or quasi-steady-state duty cycle assumption. The voltage transient profile that actually appears at the start and end of 12-volt charge cycles was specified by a different team against a different operating envelope.
Both assumptions were probably correct in isolation.
The coupling between them was not owned by anyone.
Transient voltage spikes arrive precisely when the junction is already thermally loaded from a prior charge event. That compound stress only shows up in long-duration field operation across varied charge patterns.
A voltage spike arriving when the junction is at 60 °C has different consequences from the same spike arriving when the junction is at 130 °C. The device's ability to absorb energy, the avalanche margin, and the gate oxide stress response all degrade with temperature in non-linear ways.
SiC MOSFETs have known threshold voltage instability under repetitive switching stress. The failure data points toward a device being used in a topology where the actual stress profile drove gate oxide degradation faster than the qualification matrix predicted.
The dependency that was missed is not just between thermal and voltage teams. It is between the device qualification assumptions provided by the semiconductor supplier, the actual operating profile imposed by the converter topology, and the charge cycle calibration.
Three parties. Three documents. One device. No single owner of the interaction.
The Invisible Dependency Chain
The issue is not usually missing communication. It is communication flowing in the wrong shape for the decision it needs to support.
Between thermal and calibration teams:
What flows: a thermal envelope specification. Maximum junction temperature, maximum case temperature, allowable continuous power dissipation, and a duty cycle profile assumed for the analysis.
What does not flow: the assumed time-at-temperature distribution underneath the envelope. The thermal team's qualification almost certainly assumed a specific distribution of operating points across the device's life. The calibration team's actual control strategy produces a different distribution, because it is optimised for charge speed, NVH, EMI compliance, and user experience.
SiC MOSFET life is driven by cumulative thermal cycling. Cumulative damage is a function of how many temperature swings the device sees and how long it spends in each temperature band, not only of how high the peak is.
The calibration team had no reason to know this was the binding question. The thermal team had no visibility into how the calibration would actually be tuned by the time it shipped.
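A minimal sketch of why the distribution matters more than the peak, assuming a Coffin-Manson relation with an Arrhenius term for cycles to failure and Miner's linear rule for damage accumulation. Neither model is confirmed as the one used on this programme. The constants are illustrative, not device data, and the histogram bins junction temperature swings by cycle count as a stand-in for a time-at-temperature distribution.

```python
import math

# Hypothetical Coffin-Manson-Arrhenius constants, chosen only for illustration.
A = 3.0e12         # scale factor (assumed)
ALPHA = 5.0        # exponent on junction temperature swing (assumed)
EA_EV = 0.1        # activation energy in eV (assumed)
K_B_EV = 8.617e-5  # Boltzmann constant in eV/K


def cycles_to_failure(delta_tj_k, t_mean_c):
    """Cycles to failure for one swing amplitude and mean junction temperature."""
    t_mean_k = t_mean_c + 273.15
    return A * delta_tj_k ** (-ALPHA) * math.exp(EA_EV / (K_B_EV * t_mean_k))


def miner_damage(histogram):
    """Miner's rule: sum n_i / N_i over (swing K, mean temp C, cycle count) bins."""
    return sum(n / cycles_to_failure(dt, tm) for dt, tm, n in histogram)


def peak_temperature(histogram):
    """Highest junction temperature reached by any bin (mean plus half the swing)."""
    return max(tm + dt / 2 for dt, tm, _ in histogram)


# Two hypothetical lifetime profiles with the same peak and the same total cycles.
assumed = [(30, 80, 90_000), (60, 100, 10_000)]   # profile used at qualification
shipped = [(30, 80, 60_000), (60, 100, 40_000)]   # profile produced by calibration

print("peak (C):", peak_temperature(assumed), peak_temperature(shipped))
print("damage assumed:", round(miner_damage(assumed), 3))
print("damage shipped:", round(miner_damage(shipped), 3))
```

Both profiles share the same peak junction temperature and the same total cycle count. Only the mix of swings differs, and under these assumed constants that mix alone changes the accumulated damage substantially.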
Between voltage transient and thermal teams:
What flows: a transient voltage specification at defined ports of the device. Peak amplitude, rise time, repetition rate. The thermal team receives this as input to their power dissipation calculation.
What does not flow: the temporal correlation between transient events and the thermal state of the junction at the moment the transient arrives.
The transient team specifies the spike. The thermal team specifies the steady state. Neither team specifies the joint distribution.
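The gap between two marginal specifications and one joint distribution can be made concrete with a small sketch. Everything here is hypothetical: the thresholds, the ranges, and the synthetic samples are placeholders, not ICCU data. The point is only that two datasets can satisfy the same per-team specifications while differing sharply in how often a large spike lands on a hot junction.

```python
import random

HOT_C = 120.0    # hypothetical "thermally loaded" junction threshold, deg C
SPIKE_V = 900.0  # hypothetical transient amplitude of concern, volts


def marginals(events):
    """Per-team view: fraction of hot samples and fraction of large spikes."""
    n = len(events)
    p_hot = sum(1 for tj, _ in events if tj >= HOT_C) / n
    p_spike = sum(1 for _, v in events if v >= SPIKE_V) / n
    return p_hot, p_spike


def coincidence_rate(events):
    """Joint view: fraction of events where a large spike meets a hot junction."""
    hits = sum(1 for tj, v in events if tj >= HOT_C and v >= SPIKE_V)
    return hits / len(events)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
temps = [rng.uniform(40.0, 140.0) for _ in range(10_000)]
volts = [rng.uniform(600.0, 1000.0) for _ in range(10_000)]

# Same temperature samples, same voltage samples, different pairing in time.
independent = list(zip(temps, rng.sample(volts, len(volts))))
correlated = list(zip(sorted(temps), sorted(volts)))  # spikes track temperature

print("marginals, independent:", marginals(independent))
print("marginals, correlated: ", marginals(correlated))
print("coincidence, independent:", coincidence_rate(independent))
print("coincidence, correlated: ", coincidence_rate(correlated))
```

The marginals are identical by construction, so a thermal review and a transient review would each sign off both datasets. Only the coincidence rate, which neither review computes, tells them apart.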
Between calibration and voltage transient teams:
What flows: the calibration team designs the charge initiation and termination sequences, which determine when and how transients are generated.
What does not flow: the recognition that the calibration team has the most leverage over the transient profile, but has no objective function that includes transient minimisation. Their optimisation targets are charge time, user-perceived smoothness, regulatory compliance, and battery health.
Transient amplitude on the DC-DC converter MOSFETs is not in their objective function because it is owned by another team.
The seam where all three teams meet is the joint distribution of junction temperature, transient amplitude, and time across the device's operating life under the actual shipped calibration and the actual customer usage pattern.
That joint distribution is the thing that predicts cumulative MOSFET stress.
It is not held by any single team. It is not produced by any single tool. On most automotive programmes it does not appear as a deliverable on any gateway document.
Why Milestone Reporting Failed
Automotive gateways are organised around deliverables that can be signed off by a single function. Thermal sign-off. Calibration sign-off. EMC sign-off. Reliability sign-off.
The joint distribution is not a sign-off. It is a continuous prediction that needs to be maintained and reviewed at every gate.
Programme management systems are not built to track artefacts of this shape. They are built to track binary completion of single-owner deliverables.
Each functional team is measured on whether they hit their own gates against their own constraints. None of them is measured on whether the joint behaviour of their three outputs produces a reliable device in the field.
The reliability team is measured on test completion, not on whether the test framework was adequate to find this class of failure. The chief engineer is measured on launch date and unit cost. The supplier is measured on PPM defect rate at end of line, not on field failure rate three years out.
There is no role on the programme whose performance review hinges on the joint distribution being right.
The work to produce it is, in a strict organisational sense, nobody's job.
The Coordination Gap
Six structural barriers prevent the joint distribution from existing.
Ownership. No single role typically owns cumulative cross-domain stress prediction. The natural owner would be a systems engineer with cross-functional authority, but on many programmes that capability is junior, understaffed, or politically subordinate to functional leads.
Toolchain fragmentation. Producing the joint distribution requires integrating thermal models, control-system models, power-electronics simulation, and customer usage distributions across incompatible environments and time assumptions. Most programmes can do this reactively after a failure. Few can do it continuously during development.
Gateway structure. The joint distribution is not a sign-off. It is a continuously evolving prediction that should persist from concept freeze through production validation. Most programme structures have no mechanism to keep such an artefact alive.
Incentives. Functional teams are measured against local constraints and local deliverables, not against cumulative interaction behaviour in the field three years later. Rational local optimisation can still produce globally unstable systems.
Supplier interface. Semiconductor qualification, Tier-1 topology decisions, OEM calibration strategy, and final thermal behaviour all lock at different points in time across different commercial boundaries. By the time the actual coupled behaviour becomes visible, upstream commitments are economically difficult to reopen.
Time pressure. The time available for thinking artefacts has compressed faster than the time available for deliverable artefacts because deliverables are visible to gateway reviews and thinking is not.
The Remediation Sequence
The first recall focused heavily on calibration changes and software limits. The second expanded into thermal management improvements, hardware replacement, diagnostics, and fusing changes.
The progression is consistent with an organisation discovering the coupling incrementally, one functional boundary at a time.
If the underlying issue was a cumulative interaction-state problem from the beginning, then calibration-only remediation would naturally underperform because it adjusts one variable whilst leaving the interaction structure intact.
The April 2026 litigation alleging replacement units may still carry the same defect is consistent with that reading.
The programme appears to have treated the failure through functional ownership boundaries whilst the failure mechanism itself existed between those boundaries.
Left-Shifting the Discovery
Left-shifting discovery of the ICCU failure would have meant five specific moments at which someone with refusal authority could have halted the programme and forced production of the missing artefact.
MOSFET selection review, roughly 2019: A chief systems engineer asks what joint distribution of junction temperature and transient voltage amplitude this device will see across its operating life under the planned calibration strategy, and how that compares to the qualification matrix. The correct decision is to refuse part lock until the joint distribution can be produced against a candidate calibration profile.
Calibration freeze, roughly 2020: The systems role requires the calibration team to publish the resulting transient profile, and requires the power electronics team to re-evaluate the MOSFET stress against this actual profile rather than the assumed profile used at selection.
Validation framework design, roughly 2020 to 2021: The systems role requires the validation framework itself to be reviewed against the failure modes the cumulative interaction-state register predicts. If the register predicts that cumulative thermal-transient interaction is the dominant failure driver, the validation must include a test profile that specifically combines transient events with correlated thermal states.
Production validation review, roughly mid-2021: The systems role examines accelerated test data not for pass-fail outcome but for degradation trajectory. Early degradation signatures consistent with cumulative damage should trigger an extended test programme rather than launch approval.
Early warranty signal, late 2022: The systems role recognises that the damage pattern is the signature the cumulative interaction-state hypothesis would have predicted, had it been formally tracked, and refuses to allow production to continue under the existing build configuration until the failure mechanism is confirmed or ruled out.
None of these five refusals required predicting the specific failure mode. Each required only the discipline of refusing to proceed without evidence that a known unknown had been closed.
The reason none of these refusals happened is not that the engineers involved lacked competence. It is that no role existed with both the cross-functional visibility to ask the question and the authority to stop the programme on the answer.
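The difference between reading test data for pass-fail outcome and reading it for degradation trajectory, as described for the production validation review, can be sketched in a few lines. This assumes a drift parameter such as threshold voltage shift is logged during accelerated testing and that a straight-line extrapolation is an acceptable first look. Both are assumptions: real drift is often non-linear, and every number below is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit, returning (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def cycles_to_limit(cycles, drift_mv, limit_mv):
    """Extrapolate the fitted trend to the cycle count where drift hits the limit."""
    slope, intercept = fit_line(cycles, drift_mv)
    if slope <= 0:
        return float("inf")  # no upward trend observed
    return (limit_mv - intercept) / slope


def gate_decision(cycles, drift_mv, limit_mv, required_cycles):
    """Pass-fail check first, then a trajectory check against required life."""
    if max(drift_mv) >= limit_mv:
        return "fail"
    if cycles_to_limit(cycles, drift_mv, limit_mv) < required_cycles:
        return "extend test programme"
    return "proceed"


# Hypothetical test log: cycle counts and measured drift in millivolts.
cycles = [0, 10_000, 20_000, 30_000]
slow_drift = [0, 40, 80, 120]
fast_drift = [0, 60, 120, 180]

print(gate_decision(cycles, slow_drift, limit_mv=500, required_cycles=100_000))
print(gate_decision(cycles, fast_drift, limit_mv=500, required_cycles=100_000))
```

Both logs pass a pure pass-fail reading, because neither reaches the limit during the test. Only the trajectory check separates the one that is on course to cross the limit before the required life.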
The Structural Pattern Across Industries
This structural shape appears across industries.
The Boeing 737 MAX MCAS failure lived in the joint distribution of control law authority, sensor architecture, pilot training assumptions, and certification documentation. No role owned the joint artefact.
The Takata airbag inflator failure lived in the joint distribution of propellant chemistry, regional climate exposure, vehicle age, and supplier qualification standards. The recall scope expanded repeatedly because each expansion was driven by new field data exposing a wider region of the joint distribution than the previous remediation had assumed.
The GM ignition switch failure lived in the joint distribution of mechanical tolerance, key weight, driving conditions, and the airbag deployment logic that depended on power state. The thirteen-year gap between first internal awareness and recall is explained by the joint artefact never existing at any gate.
The Samsung Note 7 battery failure lived in the joint distribution of cell chemistry margins, packaging-induced mechanical stress, charge profile aggressiveness, and supplier quality variation across two cell vendors. The remediation sequence was structurally identical to the ICCU pattern: a single-domain remedy that did not close the failure because the failure lived between domains.
Five elements appear in each case. A component-level failure mode that is technically describable in single-domain language. An actual mechanism that requires the joint distribution of variables owned by separate functions. A qualification framework that validated the component against generic stress profiles rather than programme-specific joint distributions. A first remediation that addresses one domain and does not close the failure. An expanded second remediation that broadens scope but still operates within the original categorical apparatus.
The cases differ in industry, technology, and timescale. The structural shape is consistent.
What This Means for Other Programmes
The first artefact I would demand on any similar programme is the cumulative interaction-state register.
For every component on the programme whose failure mode is cumulative rather than instantaneous: a named owner, a stated joint distribution of stress variables that drives its degradation, the source of each input to that distribution, the assumed customer usage profile underneath it, and the comparison against the qualification envelope of the candidate device.
One row per component. One owner per row. One joint distribution per row. One delta against qualification per row.
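As a rough illustration of what one row might look like as a data structure, the sketch below encodes the fields listed above and a check that blocks a gate when any of them is missing. The class name, field names, and the scalar delta are all hypothetical choices for illustration, not an established schema.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class RegisterRow:
    """One row of a hypothetical cumulative interaction-state register."""
    component: str
    owner: str                            # exactly one named owner
    stress_variables: tuple[str, ...]     # variables in the joint distribution
    input_sources: dict[str, str]         # stress variable -> supplying function
    usage_profile: str                    # assumed customer usage profile
    delta_vs_qualification: float | None  # assumed scalar margin; None = not evaluated

    def blocking_issues(self) -> list[str]:
        """List the reasons this row cannot support a gateway sign-off."""
        issues = []
        if not self.owner.strip():
            issues.append("no named owner")
        if len(self.stress_variables) < 2:
            issues.append("joint distribution needs at least two stress variables")
        missing = [v for v in self.stress_variables if v not in self.input_sources]
        if missing:
            issues.append("no input source for: " + ", ".join(missing))
        if not self.usage_profile.strip():
            issues.append("no usage profile")
        if self.delta_vs_qualification is None:
            issues.append("delta against qualification not evaluated")
        return issues


def gate_can_close(register: list[RegisterRow]) -> bool:
    """A gate closes only when every row is free of blocking issues."""
    return all(not row.blocking_issues() for row in register)


row = RegisterRow(
    component="DC-DC converter MOSFET",
    owner="",
    stress_variables=("junction temperature", "transient amplitude"),
    input_sources={"junction temperature": "thermal"},
    usage_profile="assumed charge pattern mix",
    delta_vs_qualification=None,
)
print(row.blocking_issues())
print(gate_can_close([row]))
```

The example row fails on three counts: it has no owner, no declared source for the transient amplitude, and no evaluated delta. That is the state the essay argues the ICCU programme was in at every gate.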
The register tests whether the programme has constructed the categorical apparatus required to see cumulative interaction-state failures. A thermal model can be excellent and still miss the joint distribution because the model lives inside thermal's frame. A DFMEA can be exhaustive and still miss it because DFMEAs are organised around component failure modes, not interaction failure modes.
The register also tests organisational structure. Producing it requires someone to sit above thermal, calibration, power electronics, reliability, and supplier quality, with the authority to ask each function to declare their assumptions about the others' outputs.
If no such role exists, the register cannot be produced.
The structural change that would prevent the next ICCU-class failure is the creation of a named role with cross-functional authority over interaction-state artefacts, reporting at a level senior enough that functional vice presidents cannot overrule its judgement on gating decisions.
Not a new process. Not a new committee. A role, with a name, a reporting line, budget authority, and explicit power to refuse gateway sign-off on the basis of an inadequately maintained interaction-state register.
Until that role exists, every other change is theatre, because the structural failure is not the absence of processes but the absence of authority to enforce them across functional boundaries.
The Path Forward
The E-GMP ICCU recall sequence should not be read as an isolated semiconductor or supplier failure.
It is a visible warning that tightly coupled EV architectures are beginning to exceed the governance structures used to develop them.
Public evidence suggests the failure mechanism emerged not from a single incorrect engineering decision, but from the absence of a role with cross-functional authority to continuously govern the cumulative interaction-state between thermal behaviour, voltage transient profile, calibration strategy, supplier qualification envelope, and real-world usage distribution across the device lifecycle.
Each function likely passed its own gate. The unresolved risk lived between gates, inside the joint distribution no single function owned.
The lesson for other OEMs is not simply to improve component validation. It is to recognise that many current EV programmes may still lack an institutional mechanism for continuously governing cumulative interaction-state behaviour across functional and supplier boundaries.
The cost of installing that capability is a senior cross-functional role with refusal authority at concept freeze, design freeze, and production validation.
The cost of waiting is discovering the structural gap through field exposure, regulatory scrutiny, and litigation rather than through governance.
The OEMs that install this capability before mandate will shape the standards that eventually govern the rest of the industry.
Everyone else will inherit them.