
What opportunities are there for improving the current processor architecture?

2025-07-25


For years, processors were designed for performance, and performance answered to little else. Performance still matters, but now it has to be weighed against power consumption.

If small gains in performance result in a disproportionate increase in power consumption, designers may need to forgo those improvements in favor of more energy-efficient options. While current architectures continue to steadily improve in both performance and power consumption, further gains are becoming increasingly difficult.

"Everyone is redesigning their microarchitectures to see how they can improve them to control power consumption," said Prakash Madhvapathy, director of product marketing for audio/voice DSP at Cadence Tensilica.

Many processor features designed to increase computational throughput, such as out-of-order execution, add complex circuitry that increases power consumption and circuit area. Such improvements might not be accepted today because of the power cost. So what opportunities are there for our current processor architectures?

Efficient implementation is not good enough

Many efforts to improve efficiency involve better designs of existing architectures, and there is still some progress to be made in this regard. "There are a lot of power-saving techniques on the implementation side," said Marc Swinnen, director of product marketing at Ansys.

One very basic approach is to use process improvements to do more with less power. "Moore's Law is not dead," said Swinnen. "We are still getting smaller process technologies, and that is always the number one way to reduce power. It will soon run out, but it is not completely gone yet."

This can also drive process decisions. "When you choose a process node, you also need to consider energy efficiency," said Madhvapathy. "22nm is basically 28nm, but with much better energy characteristics." He noted that 12nm is another popular node for efficient designs. 

3D-ICs offer a new power point between monolithic chips and PCB-level assemblies. "A 3D-IC will consume more power than a monolithic chip, but it is lower power and higher speed than multiple chips connected by traditional PCB traces," Swinnen pointed out.

Co-packaged optics (CPO) brings optics closer to the silicon, which can also reduce power consumption, but this has been a long time in the making. "CPO has been around for a long time, but it was difficult to justify the technical complexity economically, and the trade-offs in the end were not necessarily favorable," Swinnen explained. "That seems to be changing. Partly because the technology is getting better, and partly because the need for high-speed digital communications has become so strong that people are willing to pay more for it." 

Not all techniques are practical

Some implementation techniques sound interesting, but they come with their own challenges. Asynchronous design is one of them. "On the plus side, each register talks to the next register as fast as it can," Swinnen explained. "There's no central clock, so the whole clocking architecture goes away. You don't have slack time, where one data path waits for another. It's been around for decades, but it hasn't been a breakthrough except in specific cases because performance is unpredictable. It's a guess as to what the timing is going to be, and it can be slightly different on every chip because of process variations."

It's also unclear whether it actually saves power in the end. "Self-timed handshakes mean the flip-flops have to be much more complex," Swinnen said. "When you factor all that in, all the flip-flops consume more power. There's also the question of, ‘Does all this complexity and unpredictability really save you much power in the end?' All in all, it doesn't really work as a design approach."

Power can also be reduced by suppressing spurious, or glitch, power through data and clock gating. "That increases area, but the impact on spurious power can be quite large," Madhvapathy said.

This requires analysis to determine the main contributors. "Not only does it measure the power consumption of a glitch, it can also identify what caused it," Swinnen noted.

Ultimately, the impact at the implementation level is limited. "There's a limit to how far you can go at the implementation level, which is ironic because that's where most of the power-saving effort is focused," Swinnen said. "The biggest benefit is actually at the architectural level."

Expensive features

Artificial intelligence (AI) computing has pushed design teams up against the memory wall. Given the industry's focus on AI training and inference, a great deal of effort has gone into getting trillions of parameters where they need to be, when they need to be, without "burning the house down." But the processors themselves also consume energy, and other workloads will strike a different balance between execution power and data-movement power.

While clock frequencies continue to climb gradually, those increases no longer deliver the performance gains they once did. The real aim of recent improvements has been to keep the processor as busy as possible. Three architectural features illustrate the complex changes made to achieve this: speculative execution (driven by branch prediction), out-of-order execution, and limited parallelism.

The purpose of speculative execution is to avoid stalling at a branch instruction while waiting for the result that determines which path to follow. Waiting for that result delays execution until the system can fetch the instructions selected by the branch outcome, possibly from DRAM. Instead, the processor speculatively follows one path, ideally the most likely one. Usually the branch resolves as predicted, but sometimes it doesn't. At that point, the speculative work must be rolled back and execution restarted down the other path (including fetching its instructions, potentially from DRAM).

Branch prediction is often accompanied by out-of-order execution, a feature that allows instructions to execute in a different order than they appear in the program. The idea is that while one instruction is stalled waiting for data, a later instruction may already be ready, provided that later instruction does not depend on the stalled one. One of the main limitations of the serial programming paradigm is that instructions must be written in order even when no dependency exists between them. Out-of-order execution is therefore a complex mechanism that starts instructions early while ensuring the original program semantics are preserved.
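As a simple illustration (hypothetical C, not from the article), the snippet below shows the kind of dependency an out-of-order core must respect: the second statement cannot start before the first, while the third is free to execute whenever its operands are ready.

```c
/* Hypothetical illustration of the dependencies an out-of-order core must
 * respect. Within each iteration, 'b' depends on 'a', so it cannot start
 * before the load feeding 'a' completes. 'c' has no such dependency, so the
 * hardware may execute it while that load is still pending, as long as
 * results are committed in the original program order. */
#include <stddef.h>

void kernel(const int *x, const int *y, const int *z, int *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        int a = x[i] * 3;     /* may stall on the memory access            */
        int b = a + 7;        /* dependent: must wait for 'a'              */
        int c = y[i] - z[i];  /* independent: free to execute out of order */
        out[i] = b + c;
    }
}
```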

Area vs. Performance

These systems are not simple, and their costs may be disproportionate to their benefits, depending on how they are built. "For example, a branch predictor keeps a list of taken branches," said Russ Klein, program director in Siemens EDA's High-Level Synthesis division. "Like a cache, that list typically uses the lowest N bits of the branch target as a hash key pointing to a list of taken branches. So N could be 4 or 16 or more, and the number of entries in the list could be 1 or 2 or 32. You could store the full target branch address, or maybe just the lowest 12 or 16 bits. A larger, more detailed taken branch memory would give better performance, but obviously takes more space (and power)."
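A minimal sketch of the structure Klein describes might look like the following (hypothetical C; the 10 index bits and 16-bit partial tags are arbitrary choices for illustration, not any real design):

```c
/* Minimal direct-mapped branch target buffer (hypothetical sketch, not any
 * real design). The low INDEX_BITS of the branch address index the table, a
 * partial tag guards against aliasing, and the stored target is what the
 * fetch unit would speculatively jump to on a predicted-taken branch. */
#include <stdint.h>
#include <stdbool.h>

#define INDEX_BITS  10                       /* Klein's "N": 2^N entries    */
#define NUM_ENTRIES (1u << INDEX_BITS)

typedef struct {
    uint16_t tag;       /* partial tag: 16 address bits above the index bits */
    uint32_t target;    /* predicted branch target address                   */
    bool     valid;
} btb_entry_t;

static btb_entry_t btb[NUM_ENTRIES];

/* Record a branch that was actually taken. */
void btb_update(uint32_t branch_pc, uint32_t target) {
    uint32_t idx = branch_pc & (NUM_ENTRIES - 1);
    btb[idx].tag    = (uint16_t)(branch_pc >> INDEX_BITS);
    btb[idx].target = target;
    btb[idx].valid  = true;
}

/* Predict: returns true and fills *target if this branch is expected taken. */
bool btb_lookup(uint32_t branch_pc, uint32_t *target) {
    uint32_t idx = branch_pc & (NUM_ENTRIES - 1);
    if (btb[idx].valid && btb[idx].tag == (uint16_t)(branch_pc >> INDEX_BITS)) {
        *target = btb[idx].target;
        return true;
    }
    return false;
}
```

Enlarging INDEX_BITS or storing full target addresses improves prediction coverage, but every added bit of state is area and switching power, which is exactly the trade-off Klein describes.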

The resulting benefits would vary accordingly. "A small, simple branch predictor can make a processor 15 percent faster, while a large, complex predictor can give 30 percent better performance. But it can be 10 times (or more) larger than a small, simple predictor," Klein explained. "In terms of area, who cares, but in terms of power, it's a real problem."

Cadence improved performance by refactoring some codecs to produce code with fewer branches. "We saw about 5 to 15 percent better performance," Madhvapathy said. "Fewer than 5 percent of the instructions in the codecs were branches, and there were almost none in the inner execution loops, where we used ZOLs (zero-overhead loops)."
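The kind of refactoring described here can be illustrated with a hypothetical example (not Cadence's actual codec code): saturating a sample with if/else puts data-dependent branches in the inner loop, while an arithmetic formulation of the same clamp leaves a straight-line body that a zero-overhead loop can iterate without branching.

```c
/* Hypothetical example of removing branches from an inner loop (not Cadence's
 * actual codec code). The if/else version takes two data-dependent branches
 * per sample; the branchless version computes the same 0..255 clamp with
 * arithmetic (assuming the common arithmetic right shift of negative ints),
 * leaving a straight-line body a zero-overhead loop can iterate. */
#include <stddef.h>
#include <stdint.h>

void clamp_branchy(const int *in, uint8_t *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        int v = in[i];
        if (v < 0)        out[i] = 0;
        else if (v > 255) out[i] = 255;
        else              out[i] = (uint8_t)v;
    }
}

void clamp_branchless(const int *in, uint8_t *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        int v = in[i];
        v &= ~(v >> 31);                 /* negative values become 0        */
        v |= (255 - v) >> 31;            /* values above 255 saturate       */
        out[i] = (uint8_t)(v & 0xFF);
    }
}
```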

More generally, the company saw more branches in typical programs. "About 20% of the instructions in real code are branches," Madhvapathy said, "and those represent opportunities for speculative execution. The performance gain can be 30% or more because the average number of instructions executed per cycle increases significantly - even if half of them are predicted successfully. The total overhead [of branch prediction and out-of-order execution] is probably between 20% and 30%."

Klein recalled Tilera founder Anant Agarwal discussing the "Kill Rule." "The Kill Rule states that if you're going to add a feature to your CPU, and the area it adds is greater than the performance gain you're going to get, then you don't add that feature," he said.

Parallel computing is the "easy" answer

Parallelism is obviously another way to improve performance, but the parallelism available in current processors is limited. There are two ways that today's mainstream processors provide parallelism - by instantiating multiple cores, and by multiple functional units within a core.

A functional unit is the descendant of the simple arithmetic logic unit (ALU); it executes the actual instructions. A given functional unit can usually do more than simple math, and may also handle multiplication, division, address generation, and even branches. With multiple such units, while one unit is busy, another may be able to process a different instruction, possibly out of order.

Different processors have different numbers of functional units, and code analysis helps determine the combination and distribution of instruction support among them. This helps parallelize instruction execution where possible, but processor overhead - such as instruction fetch - occurs serially.

Truly parallelizing computation is one of the best opportunities for improving performance, and with simpler cores it can also be more power efficient. But such solutions are not new. Processors with large numbers of cores were commercialized more than a decade ago, yet they failed to gain traction.

Few algorithms are fully parallelizable. Those that are parallelizable are often called "embarrassingly parallel." Every other algorithm has a mix of parallelizable code and segments that must run serially. Amdahl's law identifies these serial sections as the ultimate limiting factor. Some programs can be highly parallelized, and some can't. But even when an algorithm doesn't appear to be parallel, there may be other opportunities.
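Amdahl's law can be stated as speedup(n) = 1 / ((1 - p) + p / n), where p is the parallelizable fraction of the work and n is the number of cores. The short sketch below (illustrative numbers, not from the article) shows how even a 5% serial remainder caps the achievable speedup at 20x regardless of core count:

```c
/* Amdahl's law: speedup(n) = 1 / ((1 - p) + p / n), where p is the
 * parallelizable fraction and n the number of cores. Illustrative numbers,
 * not from the article: with 95% of the work parallelized, the 5% serial
 * remainder caps the speedup at 20x no matter how many cores are added. */
#include <stdio.h>

static double amdahl_speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void) {
    const double p = 0.95;                       /* parallelizable fraction */
    const int cores[] = {1, 4, 16, 64, 256, 1024};
    for (int i = 0; i < 6; i++)
        printf("%4d cores -> %5.1fx speedup\n", cores[i], amdahl_speedup(p, cores[i]));
    return 0;   /* the speedups approach, but never reach, 1 / (1 - p) = 20x */
}
```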

Fractals are an example. "Your f(x) is f(x-1)," Klein explained, "and each pixel is calculated individually in a long serial chain. But if you're processing an image, you have 1024 x 1024, or whatever the image size is, so you have a lot of opportunities for parallelism [by calculating multiple pixels at the same time]."
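A sketch of the pattern Klein describes might look like the following (hypothetical C; the OpenMP pragma is an assumption, and without it the code simply runs serially). Each pixel's escape-time iteration is a serial chain in which every value depends on the previous one, but the pixels themselves are independent, so the outer loops can be spread across cores:

```c
/* Hypothetical sketch of the pattern Klein describes. Each pixel's
 * escape-time iteration is a serial chain (each z depends on the previous z),
 * but the pixels are independent, so the outer loops can be distributed
 * across cores. The OpenMP pragma is an assumption; compiled without OpenMP
 * it is ignored and the code runs serially. */
#include <complex.h>

#define WIDTH    1024
#define HEIGHT   1024
#define MAX_ITER 256

void mandelbrot(int image[HEIGHT][WIDTH]) {
    #pragma omp parallel for collapse(2) schedule(dynamic)
    for (int y = 0; y < HEIGHT; y++) {
        for (int x = 0; x < WIDTH; x++) {
            double complex c = (-2.0 + 3.0 * x / WIDTH)
                             + (-1.5 + 3.0 * y / HEIGHT) * I;
            double complex z = 0;
            int iter = 0;
            while (iter < MAX_ITER && cabs(z) <= 2.0) {
                z = z * z + c;      /* serial chain: depends on the previous z */
                iter++;
            }
            image[y][x] = iter;     /* pixels are independent of one another   */
        }
    }
}
```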

Today, data center servers have processors with up to about 100 cores. But unlike earlier manycore processors, they typically aren't devoted to a single program. Instead, they run multiple programs for the different users sharing the cloud.

The problem with parallelization

Even when algorithms can be parallelized, the problem is that the processors must be programmed in parallel. That usually means explicitly managing the parallelism in the code, for example by calling pthreads. This is much more cumbersome than typical programming, requiring knowledge of data dependencies to ensure that the original in-order semantics are preserved. Although some tools exist to help with this, none have made it into mainstream software development.
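For example, even something as simple as summing an array becomes noticeably more cumbersome when the parallelism is managed explicitly with pthreads. The sketch below (hypothetical, not from the article) shows the partitioning, thread creation, and joining the programmer must handle, along with the knowledge that the partial sums are independent:

```c
/* Hypothetical sketch (not from the article) of explicit parallelism with
 * pthreads: the programmer must partition the data, launch and join the
 * threads, and know that the partial sums are independent. None of this is
 * managed automatically by the compiler or the hardware. */
#include <pthread.h>
#include <stdio.h>
#include <stddef.h>

#define N_THREADS  4
#define N_ELEMENTS 1000000

static double data[N_ELEMENTS];

typedef struct { size_t begin, end; double partial; } chunk_t;

static void *sum_chunk(void *arg) {
    chunk_t *c = (chunk_t *)arg;
    c->partial = 0.0;
    for (size_t i = c->begin; i < c->end; i++)
        c->partial += data[i];                  /* no shared writes: independent */
    return NULL;
}

int main(void) {
    for (size_t i = 0; i < N_ELEMENTS; i++)
        data[i] = 1.0;                          /* something to sum              */

    pthread_t threads[N_THREADS];
    chunk_t   chunks[N_THREADS];
    size_t    per_thread = N_ELEMENTS / N_THREADS;

    for (int t = 0; t < N_THREADS; t++) {       /* partition the work explicitly */
        chunks[t].begin = t * per_thread;
        chunks[t].end   = (t == N_THREADS - 1) ? N_ELEMENTS : (t + 1) * per_thread;
        pthread_create(&threads[t], NULL, sum_chunk, &chunks[t]);
    }

    double total = 0.0;
    for (int t = 0; t < N_THREADS; t++) {       /* wait for and combine results  */
        pthread_join(threads[t], NULL);
        total += chunks[t].partial;
    }
    printf("sum = %.0f\n", total);
    return 0;
}
```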

In addition, manually managing parallelism may require writing different programs for different processors. If more threads are needed than a given processor can manage in hardware, the program may still run, but not optimally. Falling back on software-scheduled threads can hurt performance because of context-switching overhead.

The biggest problem is that software developers are dismissive of explicit parallel programming. There is a strong expectation that anything new can be programmed using current programming methods. "Software people have completely rejected the concept of a 100-core processor, except for one area where we are starting to see it creep in - GPUs and TPUs," Klein observed.

This is largely why manycore processors failed commercially. Even then, parallelization is primarily about performance. Reducing power requires modest cores and aggressive power-management strategies so that idle cores don't consume energy. Parallelization can then help recover the overall performance lost by making the individual cores simpler.

"My argument is that an array of very large numbers of very simple CPUs is the way to go, but it does require a change in programming approach," he said. "My only hope for that to happen is that AI can create parallel compilers, which we as an industry have never been able to do."

The practical way we deal with algorithms that bottleneck on general-purpose processors today is to use accelerators as non-blocking offloads, so that the accelerator can efficiently handle its task while the CPU does something else (or sleeps).
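The sketch below illustrates that offload pattern in a deliberately simplified way: the "accelerator" is just a worker thread standing in for real offload hardware, and the function names are invented for illustration. The CPU submits the job, continues with other work, and blocks only when the result is needed.

```c
/* Hypothetical sketch of a non-blocking offload. The "accelerator" here is a
 * worker thread standing in for real offload hardware, and the names are
 * invented for illustration. The CPU submits the job, keeps doing other work,
 * and blocks only when it actually needs the result. */
#include <pthread.h>
#include <stdio.h>

typedef struct { const int *in; int n; long result; } job_t;

static void *accelerator_run(void *arg) {       /* stand-in for offload hardware */
    job_t *job = (job_t *)arg;
    job->result = 0;
    for (int i = 0; i < job->n; i++)
        job->result += job->in[i];
    return NULL;
}

int main(void) {
    int data[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    job_t job = { data, 8, 0 };
    pthread_t accel;

    pthread_create(&accel, NULL, accelerator_run, &job);  /* submit, don't block */

    printf("CPU doing other work (or sleeping) while the offload runs...\n");

    pthread_join(accel, NULL);                   /* wait only when the result   */
    printf("offload result = %ld\n", job.result);/* is actually needed          */
    return 0;
}
```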

Accelerators can be broad or narrow

Accelerators of various types have been around for decades. Today, much attention is being paid to those that can speed up training and inference, which require very specific, intensive computations. But these types of accelerators are not new.

"Heterogeneous computing combines processing cores to deliver optimized power and performance," said Paul Karazuba, vice president of marketing at Expedera. "This obviously includes NPUs. NPUs tackle all AI processing, bypassing less efficient CPUs and GPUs. However, not all NPUs are created equal - not only in approach, but also in architecture and utilization."

That's because some accelerators can be highly specific, even custom, while others remain more general. "If the AI workload is well-known and stable, a custom NPU can significantly improve power and cost efficiency," Karazuba continued. "If you need flexibility to support multiple models or future AI trends, a general-purpose NPU is more adaptable and easier to integrate with existing software ecosystems."

Customizing an accelerator will allow it to be more specifically tailored to its workload, an effort that should improve energy efficiency.

"One way to improve processor subsystem efficiency, especially NPUs, is to create more application-focused NPUs rather than more general-purpose NPUs," Karazuba said. "Custom NPUs typically use specialized MAC arrays and execution pipelines that may be tuned for specific data types and model structures. General-purpose NPUs contain configurable compute units, support multiple data types, and typically handle a wider range of layers and operators."

Removing functionality that is not necessary for a given task can have a significant effect. In real-world applications, Expedera has typically seen processor efficiency (measured in TOPS/W) improve by about 3x to 4x, and utilization (defined as actual throughput/theoretical maximum throughput) improve by more than 2x after deploying custom NPUs.

What happens when we run out of ideas?

Clearly, there are still some opportunities to improve the efficiency of processors and processing subsystems. But we may run the risk of running out of ideas in the not-too-distant future. What happens then?

That's where a new processor architecture might come in handy. However, given the large ecosystem that current architectures rely on, such a change is not easy. Fortunately, some new architectural ideas do exist, although they may mean giving up some of that commonality.


Source: semiengineering

Reference link: https://semiengineering.com/can-todays-processor-architectures-be-made-more-efficient/

