Aging In Advanced Nodes
October 11, 2018
L-R: João Geada, Hany Elhak, Christoph Sohrmann, Magdy Abadir, Mick Tegethoff, Naseer Khan.
SE: What do design teams need to know about aging and reliability to make sure that those designs will do what they need to do?
Geada: There are fundamentally two things that need to be dealt with, and one of them is really tricky. On one side, you have to model what’s going to happen on silicon that we haven’t aged. FinFETs are, in practice, about five years old, and we need to predict what’s going to happen to them 15 years down the road. We don’t know. We have the physical models, we have the theory, we can run pure physical simulations, but we don’t really know how these designs are going to last 15 years. We’re building models and conjectures from forced aging to try to have some knowledge of what’s going to happen to finFETs. This is a really tricky problem the foundries have to address. It’s much easier for planar. We have a lot more history and a lot more knowledge. That’s the one side. On the other side, the design problem is really tricky because aging is extremely context-sensitive. It depends on exactly what the design is doing, for how long, and there aren’t that many tools right now that can handle that complexity. So that’s one of the things that is a challenge. People have been trying to deal with it by margining, but from our perspective, this is not a margin-able problem.
Elhak: We are at a turning point with aging now. Aging, as an analysis, has been around for more than 20 years. I’ve been involved in this with a small group of other people in the Silicon Integration Initiative’s Compact Model Coalition (CMC), which meets every few months. It’s basically the whole universe of people interested in aging. Today, aging is becoming a mainstream analysis. There are two factors driving this. The first is advanced nodes. Aging is more prominent with finFETs and advanced nodes than on older nodes with planar transistors, which makes it more of a mainstream analysis even for consumer applications. The other factor is that there is a turning point now in the semiconductor industry, where automotive is driving the growth. It’s not consumer and telecommunications as it used to be, and for automotive the requirements for lifetime are much higher than for consumer. In consumer, we’re talking about 3 years or 5 years. In industrial, it can be 5 to 10 years. In automotive, the lifetime requirement is more than 15 years. Suddenly, everybody needs to analyze aging, so it’s becoming mainstream. From one direction it’s finFETs and advanced nodes driving this analysis into the mainstream. In another dimension, it’s the automotive industry and its long lifetime requirements driving it. There are more people interested in aging than 3, 5 or 20 years ago, when aging was first becoming an analysis. It’s no longer the reliability team working in isolation. Aging is becoming an analysis that every designer needs to do.
Sohrmann: We’ve been working on aging for a very long time, and we were trying to push it toward the partners and design teams we work with. There was always a certain interest in this topic, but not the need to do it. This is changing at the moment. We see users now coming to us and saying there are solutions—all the vendors have some type of interface where aging models can be added, and things like that. But they want to know how to characterize those models. Within the CMC there is discussion about whether there should be one model that fits all, or individual user-defined models, so there is still quite a lot of uncertainty about how to do that. Is there a BSIM model for aging, where we just have to fit the parameters and then we are all set? I doubt it. To be honest, there are still so many physical effects, like recovery effects, and so much physics going on, that we probably won’t get away with a single model. So it is very individual. And particularly for the automotive industry, with ISO 26262, there are requirements coming that mandate this. Design teams somehow have to verify the quality after 10 or 15 years of operation. We also feel this is changing at the moment.
Khan: From our point of view, whenever we talk to different customers, it depends on the application they are looking into. If they’re an ASIC supplier designing for a particular customer, they go talk to that customer and ask what the expected lifetime is for a particular application. Depending on whether it’s 10 years or 15 years, they come back with how to tackle such a problem. One way is to take measures inside the design. One of the questions they bring to Moortec is whether we can monitor different locations in the design. What is actually happening after so many hours of running? So then we’re trying to actually solve the problem. We cannot really take the same circuitry, run it over and over again, and then see how it will react to aging effects. But we can try to monitor, as closely as possible, to see if there’s any correlation between what we measure or monitor versus what’s happening inside the design. There are various ways we’re trying to tackle this. One is the kind of loading a device would have, but then there are device effects like mismatch. There are also the metals and interconnects. There is electromigration that causes a device to fail at a particular time. We’re looking at the things we know we have to live with and asking whether we can monitor them, making sure we are aware of the scenario so we can deal with it in the right way.
Abadir: There is a significant aging impact on the design. Designs do age. Devices and wires change over time, so there is a very good chance they will fail as a result after some time. The question of reliability is how long I can expect my design to remain functional over time. Here we need to take these aging effects into consideration, and how they impact the reliability of the design. People in automotive, avionics, and other markets in which customers require these guarantees especially need to take these impacts into consideration.
Tegethoff: As geometries shrink, fewer foundries are available to choose from, and design teams from different companies will find themselves designing on the same technology. The winning teams or IP companies will be those that extract the most from the technology, which inevitably leads to transistors being used under higher-stress conditions. Whereas PVT constraints were sufficient in the past, maximizing design performance over the life of a design now means simulation validation is becoming more and more circuit-specific. Statistical analysis is one area of growth. There also is a need for reliability analysis. Design teams already are aware that aging happens, but the effects have been handled by adding sufficient margin during design. Now they need to know more about how the transistors’ behavior will change under different stress conditions so they can assess how important that change is to the design. This requires a technology that is well-characterized for reliability, which is not always available from the foundry. Many larger companies develop their own reliability knowledge to help their design teams.
SE: What tools are available to do this today? Are customers aware of them? What is the traction for tool usage here?
Geada: There’s been a major inflection. Two years ago, aging was a very small piece of the market. Only very specific sub-teams—and they were generally really tiny teams—cared. Everybody and their dog cares about aging now because of the electrification of everything. ADAS is clearly the really big target everybody knows about, but there are a whole bunch of devices that we’re now thinking of putting into people. The concerns are the same because it’s life-critical. Then there are some of these industrial applications. If you put some electronics in a windmill somewhere, you’re not going to go out there and maintain it. It’s prohibitively expensive to replace a $10 part when you need a helicopter and a couple of hours to go find it. And so everybody now suddenly cares about aging. We’re talking to a much broader range of teams with a very different skill set and attitude than before. Aging and reliability used to be a very expert community; now you’re talking to the general design market, and they’re not talking about just this tool. They want to know how to prove their design works and how to prove their design has a reasonable yield. And we haven’t mentioned that there’s a bathtub curve. There are two sides of aging: early failures, and the tail end of aging. There is electrical overstress, and then you subdivide it into EM, ESD, and conventional aging, which is just a degradation in performance. And then there’s time-dependent breakdown, which means that if you age for long enough, the device just dies an old-age death. How do you approach all of these physical effects? How do you map them into a design? And when somebody asks, ‘Is your chip going to last 5, 10, 15, 20 years?’, you need an answer that is more than just hand-waving.
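The bathtub curve mentioned here can be made concrete with a small sketch: the overall failure (hazard) rate is the sum of a decreasing infant-mortality term, a flat random-failure term, and a rising wear-out term. All parameter values below are illustrative assumptions, not data from any process.

```python
# Toy bathtub-curve hazard rate: high early (infant mortality),
# flat in mid-life (random failures), rising at end of life (wear-out).
# Every constant here is an assumed, illustrative value.

def hazard(t_years,
           infant=(0.025, 0.5),   # (amplitude, shape < 1): decreasing term
           random_rate=0.002,     # constant mid-life failure rate
           wearout=(0.04, 4.0),   # (amplitude, shape > 1): increasing term
           t_wear=15.0):          # assumed wear-out timescale in years
    a1, b1 = infant
    a2, b2 = wearout
    t = max(t_years, 1e-6)        # avoid the t = 0 singularity
    return (a1 * t ** (b1 - 1)
            + random_rate
            + a2 * (t / t_wear) ** (b2 - 1))

# Rate falls through early life, bottoms out, then climbs at wear-out.
for t in (0.1, 1, 5, 10, 20):
    print(t, round(hazard(t), 4))
```

With these toy numbers the minimum of the curve sits around mid-life; shifting `t_wear` moves the wear-out knee, which is effectively what longer automotive lifetime requirements demand.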
Tegethoff: There are a number of proprietary tools available on the market from all the leading EDA vendors. TSMC offers its own EDA-vendor-agnostic solution, a strategy we believe will be adopted by more foundries following the Compact Model Coalition’s recent release of the Open Model Interface (OMI).
Abadir: The tools are not well integrated into the design flow. In the past, people had reliability calculators that looked at a design, how many transistors there are, the types of wires, the type of technology, crunched the probability of failure associated with each of these components, and computed when the design would die. For example, I have so many transistors, and these types of transistors have a reliability model specifying the average time to failure, say three years or five years. Some of these models take into consideration how much activity they see—an average amount of activity. And that’s where the calculations sometimes go wrong. It turned out that these calculations would give you some bounds, three years, four, five years, whatever. However, when you study what’s really happening with aging, there are a lot of phenomena that get impacted. For example, metals may start getting thinner, transistors will get slower, gate delays will increase, and wire delays will get longer. So there’s a good expectation that at some point in time, if the clock keeps coming in at the same time, the delay of the path will exceed the clock period and I will start getting the wrong results. That’s the most common kind of failure mechanism. In the past, it used to be that one of the metals would get thinner until eventually there is a break and the circuit would die. Or you have memory bits that start failing. You may have some repair, some structure, some tolerance, and eventually you exceed your ability to repair and tolerate, and you die. It’s all of these. From my EM (electromagnetic) hat, inductance gets bigger as the metal shrinks. And with inductance getting higher, EM coupling gets stronger.
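The calculator-style approach described here can be sketched in a few lines: each component class gets an assumed average failure rate in FIT (failures per 10^9 device-hours), the rates are summed over instance counts, and a constant-rate exponential model predicts survival. The constant-rate assumption is exactly the simplification that, as noted above, can go wrong; all numbers are invented for illustration.

```python
# Hedged sketch of a classic reliability calculator: sum per-component
# failure rates (FIT) and apply a constant-rate exponential lifetime
# model. All FIT values and instance counts are illustrative assumptions.

import math

FIT_PER_INSTANCE = {          # assumed failures per 1e9 device-hours
    "core_transistor": 1e-6,
    "io_transistor": 5e-6,
    "via": 2e-7,
}

def chip_failure_rate_fit(counts):
    """Total chip failure rate: sum of instance count x per-instance FIT."""
    return sum(FIT_PER_INSTANCE[k] * n for k, n in counts.items())

def prob_survive(counts, years):
    """Survival probability under a constant-rate (exponential) model."""
    hours = years * 365 * 24
    lam = chip_failure_rate_fit(counts) / 1e9   # FIT -> failures per hour
    return math.exp(-lam * hours)

counts = {"core_transistor": 2e9, "io_transistor": 1e6, "via": 5e9}
print(chip_failure_rate_fit(counts))   # total FIT for this hypothetical chip
print(prob_survive(counts, 5))         # survival probability after 5 years
print(prob_survive(counts, 15))        # after 15 years: noticeably lower
</imports>```

A real analysis would replace the constant rate with activity- and temperature-dependent degradation, which is the gap the rest of this discussion is about.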
SE: Another aspect of this is how to analyze transistors that have not aged, right?
Tegethoff: The alternative to aging simulation is to monitor the bias conditions of individual transistors to ensure they are within predefined limits. This is typically done by using assert checks on the blocks as they are designed. Of course, this alternative may lead to over-design.
Abadir: These are all based on physical models, and the tools will just use old models to try to estimate what will happen to these new devices. There are lots of reasons why these models—even for devices that we’ve seen before, that we’ve designed before—may not hold up in today’s type of usage. For example, we have never subjected our designs to the level of activity that we’re seeing today. It used to be that the electronics would do their work 10% of the time and sit idle the rest. Now, it’s always active. My phone is always doing something. The servers are working overtime, and aging has a direct correlation to activity. Unlike exercise for humans, activity is not good fitness for the device. The more it works, the more it will age. Heat is also an important parameter in aging, since higher temperature causes a device to age faster.
Elhak: The way we, as the EDA industry, deal with aging is that, first of all, there are the aging models. The aging model is extracted from the actual silicon from the foundry, so basically you have test chips that you measure at different points of time. Aging is not just a question of time. It’s time and stress. The aging phenomenon depends on the voltages and currents flowing through these transistors, so you need to have a good understanding of how these transistors will age with time and stress. This is measured in the lab, and then we end up extracting aging models. These models are basically equations that say, ‘For this amount of stress on this transistor, this is how the parameters of its transistor model will change. For example, this is how V threshold will change. This is how the mobility will change according to certain voltages and currents flowing through the transistor.’
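The kind of equation described here, mapping stress to a parameter shift, can be sketched as a stress- and temperature-dependent threshold-voltage drift. The power-law/Arrhenius form and every constant below are illustrative assumptions standing in for foundry-extracted fitting parameters, not any real model.

```python
# Hedged sketch of an aging-model equation: for a given stress voltage,
# temperature and duration, compute the shift in a transistor-model
# parameter (here Vth). The functional form and all constants are
# assumed for illustration; a foundry fits these from silicon.

import math

K_BOLTZMANN = 8.617e-5  # Boltzmann constant in eV/K

def delta_vth(v_stress, temp_k, t_hours,
              a=5e-3, gamma=3.0, ea=0.1, n=0.16):
    """Threshold-voltage shift (V) after t_hours of stress.

    a, gamma, ea and n play the role of fitted model parameters
    (assumed values here).
    """
    return (a
            * v_stress ** gamma                        # voltage acceleration
            * math.exp(-ea / (K_BOLTZMANN * temp_k))   # thermal acceleration
            * t_hours ** n)                            # sub-linear time drift

# More voltage, more heat, or more time -> larger shift.
print(delta_vth(0.8, 300, 10 * 365 * 24))   # 10 years at nominal conditions
print(delta_vth(0.9, 400, 10 * 365 * 24))   # hotter and higher V: ages faster
```

The same template applies to other degraded parameters, such as mobility, each with its own fitted constants per process.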
SE: Are these models specific to each foundry process?
Elhak: It is very specific to the process. It is specific to each foundry, to every process in each foundry. And all transistors don’t age the same. Just like humans, if you have more stress you will age faster than people with less stress.
Geada: Another thing that matters is the statistical effect. Two identical transistors with exactly the same stress may age differently, depending on exactly how the atoms lined up on that transistor.
Sohrmann: This is a research topic. But there are statistics on combined aging effects that make this even more complex.
Elhak: This is a very interesting point, because until now everybody in the design community was either analyzing aging or analyzing process variation. So I would run aging analysis or Monte Carlo analysis. Nobody actually takes into account the correlation between aging and Monte Carlo.
Sohrmann: You would need to measure it, and that’s even harder. You have aging, variability, recovery, correlation effects and local variation. So where to start? If you have this list of effects, what’s the most important?
Geada: This is one of the trickier ones. Right now, I’m actually not aware of a fab that is actually characterizing the interrelationship between those two. The aging models that are currently available do not play well with variability and vice versa. It’s a big data problem of enormous dimensions.
Sohrmann: We did this in terms of a research project, but no foundry is really willing because customers are not asking for this particular thing.
Geada: One of the things I keep talking about, which is a different path on the variability side, is that the local process variation that increases threshold voltage and resistance also enhances aging, for the exact same reasons. If the transistor doesn’t switch, it maintains the electric field across it for longer, and that exacerbates aging. So the two effects are surprisingly connected.
Khan: Is it the same for the same process? Wouldn’t it be a different model for a different process, a different process flavor, and a different device in that process?
SE: Does the foundry provide new models for every new process?
Elhak: This goes back to the ‘how.’ We started with building and characterizing that aging model, which is tied to a process. It’s specific to a process node, and it basically says, for specific values of stress, this is how the parameters of the transistor will change. That’s at the process level, and that’s provided by the foundry. Now, as a circuit designer, what you need to know is how much stress is applied to your transistors, and that’s what aging simulation does. Aging simulation is a two-step process. The first step is that you run a fresh simulation before aging, and you apply the stress. You apply the stimulus, and you see how every transistor will respond: what will be the current flowing through that transistor, and what will be the voltage applied to the gate of that transistor, depending on its specific function in the chip. So every transistor on the chip will have a different level of stress.
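The two-step flow described here can be sketched with a toy stand-in for a SPICE simulator: a fresh simulation records the stress each device actually sees, an aging model degrades each device's parameters accordingly, and the circuit would then be re-simulated with the aged parameters. The data model and the degradation constant are assumptions for illustration.

```python
# Toy sketch of the two-step aging-simulation flow:
# step 1: fresh simulation records per-device stress (duty cycle, voltage)
# step 2: an assumed aging model degrades each device's Vth
# step 3 (not shown): re-simulate the circuit with the aged parameters.

from dataclasses import dataclass

@dataclass
class Transistor:
    name: str
    vth: float               # threshold voltage (V)
    duty_cycle: float = 0.0  # fraction of time under stress (set in step 1)
    v_gate: float = 0.0      # gate stress voltage seen (set in step 1)

def fresh_simulation(devices, stimulus):
    """Step 1: record the stress each device sees under the stimulus."""
    for d in devices:
        d.duty_cycle, d.v_gate = stimulus[d.name]

def apply_aging(devices, years, k=1e-3):
    """Step 2: degrade Vth with a toy stress-dependent model (k assumed)."""
    for d in devices:
        d.vth += k * d.duty_cycle * d.v_gate * years ** 0.2

devices = [Transistor("m1", 0.35), Transistor("m2", 0.35)]
# m1 is held static at full gate voltage (e.g. clock-gated); m2 toggles.
stimulus = {"m1": (1.0, 0.8), "m2": (0.5, 0.8)}

fresh_simulation(devices, stimulus)
apply_aging(devices, years=10)
for d in devices:
    print(d.name, round(d.vth, 4))   # m1 degrades more than m2
```

The point the flow captures is that identical transistors end up with different aged parameters purely because of what the circuit made them do.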
Tegethoff: It is possible that the foundry will tweak or change, not only the aging model parameter library, but also the aging model equations from one process to the other to accurately model the aging effect for that process. However, the foundry won’t be changing the model for every test chip that comes through. That is why some design houses will rely on their own CAD teams to modify or completely change the foundry aging models to best fit the design house’s application specific requirements.
SE: Where does the stress stimulus come from?
Elhak: From the stimulus and the operation of the chip.
Geada: There was actually a paper published at DAC by HiSilicon about exactly how to model the stress. There are two conventional approaches. The dominant one up to now has been to run a whole bunch of vectors that you think model the worst case. The problem is that modern designs are extremely power-conscious, power-aware and power-limited, so everybody applies techniques to reduce power. The most common technique is clock gating. Clock gating, unfortunately, exacerbates stress, because all of a sudden you are leaving a part of your design static, and that’s actually the worst case for aging: now you’re applying stress for a long period of time without any recovery effect. Then all of a sudden a reset signal fires, or somebody crosses in front of my car, or a signal that hasn’t toggled for a couple of hours reports there’s something there, and now the chip has to wake up that circuitry and respond. At the precise moment when it’s critical, the chip is actually responding at its slowest, because it’s just then recovering from a major stress condition. However, as that paper showed, there is now a way to do this vector-less, by automatically determining for every individual condition the worst-case gating condition that will have exacerbated the stress pattern.
Khan: But if we shut it down for a longer period of time, then it has to boot up at that stage, and you could say we have left it in stress for a longer period of time. Whereas if we allow it sufficient time to be up and running in the right state, then we save the power.
Geada: Regardless of how much we like science, this is engineering. We don’t necessarily need to understand it as well as we can measure it. We can work around it. So if we know the worst-case behaviors, we can add circuitry or change the design so that these conditions do not hang around long enough to cause a problem. From the old reliability know-how, there are techniques for clock gating that doesn’t stay gated one way or the other forever. Every once in a while it puts in a recovery blip so that all the signals toggle, and the circuit never goes to its worst case. But these techniques cost something, so you want to apply them to relieve problems rather than blanket them everywhere.
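The recovery-blip idea described here can be sketched as a clock gate that refuses to stay gated indefinitely: after a bounded run of gated cycles it forces one ungated cycle so the downstream signals toggle. The interface and the `max_gated` bound are illustrative assumptions, not a real gating cell.

```python
# Toy sketch of a "recovery blip" clock gate: it never stays gated for
# more than max_gated consecutive cycles; when the limit is reached it
# forces one clock edge so gated logic toggles and gets a recovery
# window. Purely illustrative, not a real clock-gating cell.

def gated_clock(enable_stream, max_gated=4):
    """Yield True on cycles where the clock fires, inserting forced blips."""
    gated_run = 0
    for enable in enable_stream:
        if enable:
            gated_run = 0
            yield True                 # normal ungated cycle
        elif gated_run + 1 >= max_gated:
            gated_run = 0
            yield True                 # forced recovery blip
        else:
            gated_run += 1
            yield False                # gated (static) cycle

# Even a long idle stretch still produces a clock edge every 4th cycle.
print(list(gated_clock([False] * 10)))
```

The cost Geada mentions shows up directly: each forced blip is a clock edge, and its switching power, that pure clock gating would have saved.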