Power Delivery Affecting Performance At 7nm

BY BRIAN BAILEY, October 11, 2018 in Semiconductor Engineering

Complex interactions and dependencies at 7nm and beyond can create unexpected performance drops in chips that cannot always be caught by signoff tools.

This isn’t for lack of effort. The amount of time spent trying to determine if an advanced-node chip will work after it is fabricated has been rising steadily for several process nodes. Additional design rules handle everything from variation to power, and the rules deck has been getting thicker as each new process is released. Yet surprises still lurk when silicon comes back, even when every design rule has been met and the chip has passed every form of signoff.

One particularly troublesome area involves the power delivery network (PDN). To distill it to its simplest form, resistance is going up because of decreasing dimensions. That causes more IR drop, which in turn affects timing, sometimes in unexpected ways. Chips are coming back that are not able to run at intended clock speed.

Techniques used in the past to mitigate this type of problem, such as over-dimensioning or decoupling capacitors, no longer work or are becoming cost-prohibitive. And methodologies that in the past used static analysis techniques are being forced to consider dynamic analysis just to find some of the problem areas.

Resistance
“When you want that many functions on silicon you have to scale down the transistor sizes, and every time you go down in size the resistance is proportionally going up,” says Jerry Zhao, product management director in the Digital and Signoff Group at Cadence. “The size impact is that you have more voltage drop consumed in the grid. Do I deliver enough voltage to the transistors that they can be functional?”
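As a rough illustration of Zhao's point, the resistance of a rectangular wire follows R = rho * L / (W * T), so shrinking both width and thickness raises resistance faster than either dimension shrinks. The sketch below is not taken from any tool or foundry data. The wire dimensions, current, 0.7x shrink factor and bulk-copper resistivity are all assumed values, and real wires at these dimensions are more resistive than the bulk figure suggests.

```python
# Illustrative only: every number here is an assumption, not foundry data.
BULK_COPPER_RESISTIVITY = 1.7e-8  # ohm*m; nanoscale wires are worse than bulk


def wire_resistance(length_m, width_m, thickness_m, rho=BULK_COPPER_RESISTIVITY):
    """Resistance of a rectangular wire: R = rho * L / (W * T)."""
    return rho * length_m / (width_m * thickness_m)


def ir_drop(current_a, resistance_ohm):
    """Voltage lost across the wire: V = I * R."""
    return current_a * resistance_ohm


# Hypothetical 10 um rail segment, before and after a 0.7x shrink of W and T.
old_r = wire_resistance(10e-6, 40e-9, 80e-9)
new_r = wire_resistance(10e-6, 40e-9 * 0.7, 80e-9 * 0.7)

current_a = 100e-6  # assumed current drawn through the segment
for label, r in (("before shrink", old_r), ("after shrink", new_r)):
    print(f"{label}: R = {r:.1f} ohm, IR drop = {ir_drop(current_a, r) * 1e3:.2f} mV")
```

With the same current and length, the shrunken wire loses roughly twice the voltage, and that loss comes out of a supply that is itself getting smaller.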

This is becoming especially problematic at metal layers 0 and 1 at 7/5nm. “The lower levels of metal are so thin that they are very resistive,” says João Geada, chief technologist for ANSYS. “The upper layers have the same rules as before, but as it gets lower and lower, they have much more limited access to the rail supply. The local behavior starts to get a little unpredictable. With 7nm and below, traditional design teams that have been very good at producing working silicon are starting to have surprises because the delivery system is just not good enough for these nodes.”

That is not the only thing changing with the new technologies. “Getting it right is an order of magnitude more difficult than in previous technologies,” says Scott Johnson, principal technical product manager at ANSYS. “Not only do you have a very disjointed power delivery system below the thick metals, but you are really reducing the voltage levels dramatically.”

New effects are creeping in, as well. “While IR drop is fast becoming a dominant factor in determining the chip frequency, aggressive interconnect scaling has increased the average current density and the resistance per unit length of wires and on-chip inductance,” adds Magdy Abadir, vice president of marketing for Helic.

To compound the problem, solutions can create their own issues. “The high resistance of vias requires the use of additional vias, but is mitigated to some extent by the use of via pillars,” explains Prasad Subramaniam, vice president for AI Platform Infrastructure at eSilicon. “Increased cell density allows the use of larger logic blocks, which in turn generate large dynamic current variations. This requires the use of a denser power mesh for mitigation. As more metal resources get diverted for power delivery at the higher layers, there is a fine balance that needs to be made between power delivery and routing congestion/timing.”


Fig. 1: TSMC’s 7nm finFETs. Source: TSMC.

Additional troubles
Close proximity may not be anyone’s friend. “The definition of what is close together has become fuzzy because it is not just on the shared rail,” says Johnson. “The resistance of these grids is very high, so even though you are putting in lots of metal, far more metal than traditionally dedicated to the power grid relative to the routing and metal 1 and metal 0, your resistive influence is now much less predictable. You could be 4 rails away from metal 0 and still be extremely sensitive resistively to each other for simultaneous switching events.”

The notion of closeness also is becoming less predictable. “In high-performance SoCs, the average number of switching transistors in each clock cycle continues to increase and the corresponding current peaks continue to escalate,” explains Abadir. “Similarly, rise and fall times continue to get faster. This means that di/dt is rapidly increasing. The increase in both IR drop and L di/dt induces magnetic fields. These are transmitted by antennas naturally formed by the SoC layout structures, the bond interconnections and the package layers. This gives rise to the electromagnetic coupling. Ignoring those magnetic coupling effects can be catastrophic and, as evident from several recent experiences, can lead to costly silicon failures.”

The addition of analog may make things worse. “We often need a different voltage set for the I/O pads and bond rings compared to the internal circuits or possibly multiple sets of voltage domains across the internal circuits,” explains Fionn Sheerin, principal product marketing engineer with Microchip’s Analog Power and Interface Division. “That complicates the routing within the chip. It complicates the power requirements for the chip, which adds extra board level requirements. If we are doing voltage conversion within the device, then that is an extra power generation headache.”

That view is echoed around the industry. “More power domains are necessary for different analog components like radio interfaces, high speed SerDes or ADCs or DACs,” says Andy Heinig, group manager system integration at Fraunhofer EAS. “At this point it is very difficult to route all these power domains to a limited number of I/Os on the chip-package interface and at the same time to route this power on the limited number of layers in the advanced package variants. Typically used approaches like power planes are impossible for some of the domains because of the limited number of layers. Sometimes even the access to the bump is very difficult.”

At some point that power has to come from a source. “The chip guy is not solving all of the problems,” warns Zhao. “This is especially true for power. The metal layers, wires, start from the battery and go through the board, package, pad on the die, and then through a massive PDN. It is amazing how complex that delivery system is. You do not want to drop a lot of voltage. You do not want to consume a lot of power unnecessarily. You have to analyze it as a single unit.”

But it involves more than just power. “PCBs need to deliver power to high speed ICs that is solid up to the frequencies where the device’s internal capacitance can carry the load,” adds Todd Westerhoff, product marketing manager in the Board System Division of Mentor, a Siemens Business. “The high frequency currents required inside the IC can’t be delivered through the device’s package pins due to their mounting loop inductance, so decoupling on the package and the die must meet the demand for current above a certain frequency.”
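Westerhoff's point can be approximated with a common PDN rule of thumb, which the article itself does not spell out: pick a target impedance from the allowed ripple and the transient current, then find the frequency at which a series inductance alone exceeds that target. Above that frequency, current has to come from decoupling closer to the load. The supply voltage, ripple budget, transient current and loop inductance below are assumed values chosen only for illustration.

```python
import math


def target_impedance(vdd_v, ripple_fraction, transient_current_a):
    """Rule-of-thumb PDN target: Z = Vdd * allowed ripple / transient current."""
    return vdd_v * ripple_fraction / transient_current_a


def inductive_limit_hz(z_target_ohm, loop_inductance_h):
    """Frequency where |Z_L| = 2*pi*f*L reaches the target impedance."""
    return z_target_ohm / (2 * math.pi * loop_inductance_h)


# Assumed example: 0.8 V rail, 5% ripple budget, 10 A transient, 50 pH loop.
z_target = target_impedance(0.8, 0.05, 10.0)
f_limit = inductive_limit_hz(z_target, 50e-12)
print(f"target impedance: {z_target * 1e3:.1f} mOhm")
print(f"path is inductance-limited above about {f_limit / 1e6:.1f} MHz")
```

This ignores resistance and capacitor resonances, so it only indicates the order of magnitude at which package and die decoupling must take over.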

The perfect storm
As process geometry has evolved, one thing that has stopped following Moore’s Law is threshold voltages.

“These have not really changed since 16nm,” says Geada. “But there is a continuous pressure to lower supply voltage because that is one of the easiest ways to lower the power footprint. So you have these competing pressures where threshold voltage hasn’t changed, supply voltage is dropping, you have less and less headroom on every cell, and you have this unpredictable behavior of the supply because of the metals and local resistance and simultaneous switching. All of this is making timing unpredictable unless you have the ability to have timing be aware of the voltage conditions. It is not only that you have to be nimble about how your power grid is behaving. You have to know the impact of the power grid on timing.”
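The shrinking headroom Geada describes can be illustrated with the alpha-power law, a textbook approximation in which gate delay scales as Vdd / (Vdd - Vth)^alpha. This model is brought in here for illustration and is not something the article or Geada cites. The threshold voltage, the alpha exponent and the supply values are assumptions.

```python
def relative_delay(vdd_v, vth_v=0.35, alpha=1.3):
    """Alpha-power-law delay, up to a constant: Vdd / (Vdd - Vth)**alpha."""
    if vdd_v <= vth_v:
        raise ValueError("model only applies above threshold")
    return vdd_v / (vdd_v - vth_v) ** alpha


def delay_penalty(vdd_v, drop_v, vth_v=0.35, alpha=1.3):
    """Fractional delay increase caused by a supply drop of drop_v volts."""
    nominal = relative_delay(vdd_v, vth_v, alpha)
    drooped = relative_delay(vdd_v - drop_v, vth_v, alpha)
    return drooped / nominal - 1.0


# Same assumed 50 mV drop, applied at a high and a low supply voltage.
for vdd in (0.9, 0.6):
    print(f"Vdd = {vdd:.2f} V: delay grows by {delay_penalty(vdd, 0.05) * 100:.1f}%")
```

Under these assumptions the same 50 mV drop costs several times more delay at the lower supply, which is why a grid that was adequate at a higher voltage can produce timing surprises once the supply is lowered.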

With 7nm, many factors that used to be separate concerns have become interlinked, such as timing, power and heat. “In the past, thermal would be something that you would worry about for physical breakage and long-term effects,” says Johnson. “Usually, the entire die would be across the same thermal gradient, but that is no longer true. You are looking at thermal gradients that affect timing and paths that have never been considered before. You are looking at inductive effects, coupling effects from TSVs. What are you going to do about EMI?”

On top of that, there is far more complexity inside of these chips, which makes it more difficult to tackle problems in isolation.

“Our customers are trying to squeeze more functionality into a 7nm chip, and the chip is getting too big,” said Navraj Nandra, senior director of product marketing interface IP at Synopsys. “This is forcing people to consider chip-to-chip or chip-on-chip or die-to-die types of solutions. In addition, there is a push to get more signals on the periphery and to reduce the number of power/ground. This is a debate between the power integrity engineer and the person responsible for that signoff and the architects of the chip, who want as much signal and functionality onto and off of the chip as possible.”

Analysis
Analysis starts with worst case. “There are applications where we can take some educated guesses as to typical or common use cases and boundary use cases,” explains Microchip’s Sheerin. “We can run analysis specific to those use cases. In devices that involve software, that gets notably more complicated because we have to take a guess at what the software will be doing and what computations are going to be more common or less common.”

Avoiding problems requires careful analysis. “For power analysis, you need activity,” says Zhao. “Activity comes from two methodologies, vectorless or vector-based. You can do a system boot or play a video game. That is where real activity can be seen. Which window of that activity will provide the maximum power will typically dictate the maximum power on the die, the highest temperature on the die and the worst timing numbers on the die.”
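Finding the window of activity that produces the maximum power is, at its simplest, a sliding-window search over a per-cycle power trace. The sketch below is a minimal version of that idea and not how any particular signoff tool works. The function name, the trace values and the window length are hypothetical.

```python
def peak_power_window(power_trace_mw, window):
    """Return (start_index, average_power) of the highest-power window."""
    if window <= 0 or window > len(power_trace_mw):
        raise ValueError("window must be between 1 and the trace length")
    running = sum(power_trace_mw[:window])
    best_sum, best_start = running, 0
    for end in range(window, len(power_trace_mw)):
        # Slide one cycle: add the new sample, drop the oldest one.
        running += power_trace_mw[end] - power_trace_mw[end - window]
        if running > best_sum:
            best_sum, best_start = running, end - window + 1
    return best_start, best_sum / window


# Hypothetical per-cycle power samples (mW) from a vector-based run.
trace = [1, 2, 10, 12, 3, 1]
start, avg_mw = peak_power_window(trace, window=2)
print(f"worst window starts at cycle {start}, average {avg_mw} mW")
```

In practice the trace would come from simulation or emulation of a real workload, and the window length would be chosen to match the time constants of the power grid or the thermal response being studied.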

But that is no longer enough. “The coupling of voltage drop and static timing becomes a cornerstone of electrical signoff,” Zhao explains. “You cannot do signoff of them separately. Even if you back-annotate each instance with effective voltage supply along the critical path and do signoff, when the silicon comes back it still fails. The coupling is beyond just a simple annotation of voltage drop—it is more like the sensitivity of the critical path to the variation of voltage changes.”

Perhaps even more disconcerting is the fact that worst case may not be where you think it is.

“The traditional flow up to about 16nm was that you did your power grid analysis at the high-power corner,” warns Geada. “At low voltage, the system consumes less power, the voltage drops lower, and the system becomes very sensitive to any small amount of voltage drop. You are operating most of the cells at threshold or near threshold. So very small changes in drops can cause exponential changes in delays which can cause timing surprises on those corners. At high-voltage corners, you have the clash between voltage timing and temperature. So, you cannot just look at one timing corner and apply it to a different power corner or the other way around.”

Paths can remain hidden. “For some paths, traditional methods will make you believe that it is not a critical path,” confirms Zhao. “It has good timing numbers. In reality, that path may be very sensitive to voltage variation. Thus, a pattern will cause that sensitivity to show up and that pattern will generate timing, which violates signoff.”
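A toy example shows how a path with comfortable nominal slack can still fail. It assumes a simple linear model in which each path loses a fixed amount of slack per millivolt of local supply drop. Real voltage-aware timing is more involved, and the path names, slack values, sensitivities and drop figures here are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Path:
    name: str
    nominal_slack_ps: float       # slack reported at the nominal supply
    sensitivity_ps_per_mv: float  # assumed slack lost per mV of local drop


def slack_under_drop(path, drop_mv):
    """Linear approximation of slack once the local supply has dropped."""
    return path.nominal_slack_ps - path.sensitivity_ps_per_mv * drop_mv


def find_violations(paths, drop_mv_by_path):
    """Names of paths whose slack goes negative under their local drop."""
    return sorted(
        p.name for p in paths
        if slack_under_drop(p, drop_mv_by_path[p.name]) < 0
    )


paths = [
    Path("tight_but_robust", nominal_slack_ps=5.0, sensitivity_ps_per_mv=0.05),
    Path("relaxed_but_sensitive", nominal_slack_ps=30.0, sensitivity_ps_per_mv=1.0),
]
drops = {"tight_but_robust": 40.0, "relaxed_but_sensitive": 40.0}
print(find_violations(paths, drops))
```

Sorted by nominal slack alone, the second path would never make the critical list, yet it is the one that violates timing once the drop is applied.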

Avoidance
What can design teams do? “You cannot design the power grid independent of the rest of the design and in particular, the power grid and timing are certainly no longer decoupled,” says Geada. “This is not something that can safely be engineered by having timing margins. You have to analyze the behavior of the power grid at the critical timing corners rather than analyzing the power grid and power just from the perspective of what is my peak power for this design and what is my large di/dt events. You also have to analyze the behavior of the power grid with respect to timing, and timing with respect to the power grid.”

That also requires an understanding of tradeoffs between short-term and long-term impacts of such factors as heat. “In many cases, the answer is to build in thermal protections by design, and then you can push closer to your boundary conditions without being concerned about reliability or quality issues,” says Sheerin. “So if you build a portion of your circuit to monitor the temperature of the die and allow it to shut down gracefully, then you have covered a lot of worst-case scenarios and you can start being more aggressive with the rest of the thermal design.”

Chip, package and board are all part of a coupled system. “It is very important to start the power delivery strategy very early in the project and also to include the package into the study,” advises Heinig. “Often the power strategy is developed without consideration of the package. An early chip package co-optimization enabled by an assembly design kit is very helpful to avoid later trouble.”

Automation
EDA can solve at least some of this problem, but there will still be a need for people in the loop.

“When you do an implementation, you must consider the potential hazards because of power drop,” says Zhao. “IR drop or hotspots can be reduced or removed by doing local optimization of your placement. The same for timing. So, if there is a timing problem, the tool can do a local ECO to fix that problem. What we need to do as a tool company is to further improve the technology to give more prediction combined with machine learning technologies to guide the user through it. In the future, automation will be able to take care of most of the problem—but talented engineers will still be required.”

Johnson agrees. “We thought the initial coupling of functionality and timing was an intractable problem in 2005, but solutions were built into the P&R algorithms by 2007. It may not be that quick of a transition this time, but I would expect that the system would evolve. However, by the time we are there, say hello to 5nm and 3nm. This wavefront of problems will be with us for quite some time. 7nm may catch up and automate some of this, but 5nm is presenting a new series of unexpected issues.”

And this is just one reason why scaling is becoming more difficult. “We have never run into a problem that is this coupled and complex,” says Johnson. “The industry will respond, but the complexity is orders of magnitude tougher than for previous generations. This is the first time that I have seen something that is so cross-domain coupled.”