Why Does My Octa-Core Phone Suck?

An Illustrated Overview of Factors Affecting Processor Performance


It is almost spring of 2016, nearly a quarter century after what was one of the most significant events in cell phone history.

24 years ago, in November 1992, IBM unveiled a prototype of the Simon at the COMDEX trade show.

Figure 1: IBM Simon Touch Screen Cellular Phone

The Simon was possibly the world’s first, large scale, commercial experiment at fusing a cellular phone with a PDA.

Well, the experiment was a flop.

The Simon spent all of 6 months on the market, selling a grand total of 50,000 units before being withdrawn. For perspective, compare that with the Nokia 101, which was launched the same year and went on to sell over 5 million units.

But in its failure, the Simon gave us all a glimpse into our smartphone-laced future. By today’s standards, the IBM Simon sported a puny 16 MHz, 16-bit x86-compatible CPU, just 1 MB of RAM and 1 MB of storage. But get this: a 100% touch screen display! And it ran a version of DOS (minus the command prompt, of course!). In its specifications, the Simon was light years ahead of its time.

It turns out, the world did take notice.

In the decades that followed the Simon’s demise, mobile processor clock speeds have exploded, rocketing from the humble 16 MHz of the Simon in 1992 to the blisteringly fast 2.2 GHz per core of the Qualcomm Snapdragon 820 launched in 2016.

Over the same period, the CPU clock on desktop and server class processors has also been ticking faster and faster.

Just consider the following two examples:

Intel introduced the 486 DX2 processor in 1992. This was a 32 bit chip that could run at up to 66 MHz speed.

Figure 2A: Intel 486 DX2 66 MHz

Figure 2B: Die of Intel 80486 DX2-66

Incidentally, this was also the chip that I plugged into the first computer I assembled!

Fast forward 23 years, and you come face to face with Intel’s Core i7-6700K. This beast of a CPU can effortlessly deliver 4000 MHz on each of its four cores.

Let’s look at this again.

Figure 3: Growth of Processor Speeds

66 MHz of processing power on the desktop in 1992 has grown to 4000 MHz of processing power per core in 2015.

16 MHz of processing power on the cell phone in 1992 has grown to 2200 MHz of processing power per core in 2016.

You can see that while the i7-6700K is a seriously fast chip, mobile chips such as the Qualcomm Snapdragon 800 series or, for that matter, even Intel’s own Atom series mobile processors seem to be catching up pretty quickly in terms of raw megahertz.

So can we all start dreaming of ditching our laptops and desktops, and hooking up our Samsung Galaxys and iPhones to a Bluetooth keyboard, a mouse and a flat screen monitor?

Shall we also, dare I say, begin replacing the thousands of power hungry rack mounted servers in the data centers of the world with an equal or smaller number of ultra power efficient mobile processors?

Not even by a long shot.

You see, comparing two processors using only their clock speed is like racing a car against a horse and concluding that whichever crosses the finish line first is the better performer. In reality, performance depends on what you want to do.

For processors, performance is measured using benchmark tests. Each test determines how well a processor compares with a ‘benchmark’ value in performing specific kinds of tasks, be it web browsing, power consumption, graphics rendering, or just hardcore number crunching.

Let’s compare the benchmark numbers of two processors from Intel.

The first one is the Intel Xeon E3-1535M v5. This processor was launched by Intel in Q3 of 2015 and is based on Intel’s sixth generation Skylake micro-architecture. Intel advertises it as a “Mobile Workstation” processor. Presumably, this means it is the sort of processor that ought to come to mind naturally if your goal is to build a seriously powerful, top-of-the-line laptop.

We’ll compare the Xeon E3-1535M v5 with the Intel Atom x7-Z8750. The latter is an 8000 series, pure-play mobile (read: smartphone and tablet) processor that Intel launched in Q1 of 2016.

So here are the benchmarks comparing these processors:

Figure 4: Comparison of benchmarks between Intel Xeon and Atom

As you can see, the Xeon E3-1535M v5 demolishes the Intel Atom x7-Z8750 in every single benchmark.

Let’s dig a little deeper into the causes of these performance differences.

Let’s begin with the CPU clock speed which is, if not the most important, definitely the most ‘visible’ of all performance parameters.

The effect of clock speed on CPU performance

Figure 5: Clock waveform

Each time the clock ticks, an electrical pulse races through the internal circuitry of the processor. It switches millions of transistor gates on or off, flips registers open or closed, and moves bits of information into and out of the memory caches and the RAM.

The faster the clock ticks, the more of these pulses race through the CPU per second. For example, in a CPU running at 1 MHz, 1 million of these electrical pulses race through the CPU’s circuitry each second. A CPU running at 1 GHz has 1 billion clock pulses running through it every second, i.e. 1000 times as many.

Imagine each clock pulse as a short burst of current flowing down a wire. The faster the clock ticks, the more of these bursts race through the wire each second, and the more electrons speed through with them. Consequently, there are more collisions between the speeding electrons and the semiconductor material of the CPU, and therefore more heat is generated every second. Thus, operating a CPU at a high clock speed heats it up more than running it at a lower clock speed.

A fundamental equation in electricity, applicable to direct current systems, is:

Power in Watts = Voltage in Volts x Current in Amperes

Thus the power consumed by the CPU is proportional to both its operating voltage and its operating current. The operating current is, roughly speaking, proportional to the clock speed.

For an electronic circuit such as the CPU, the actual power equation is the following:

Power = [C x F x V x V]  + P(static)

Where C is the capacitance of the transistor gates which in turn is a function of how tiny they are. F is the operating frequency, V is the operating voltage and P(static) is the static power demand of CPU.

Thus, running the CPU at a higher voltage, or higher clock speed will make it consume more power, and therefore also generate more heat.
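To get a feel for this equation, here is a minimal Python sketch. The capacitance, voltage and static power figures are made-up round numbers chosen purely for illustration; they are not measurements of any real chip.

```python
def cpu_power_watts(capacitance_farads, frequency_hz, voltage_volts, static_watts):
    """Power = C x F x V x V + P(static)."""
    dynamic_watts = capacitance_farads * frequency_hz * voltage_volts ** 2
    return dynamic_watts + static_watts


# Hypothetical chip: 1 nF of switched capacitance and 0.5 W of static power.
C = 1e-9
P_STATIC = 0.5

baseline = cpu_power_watts(C, 1.0e9, 1.0, P_STATIC)       # 1 GHz at 1.0 V
faster = cpu_power_watts(C, 2.0e9, 1.0, P_STATIC)         # clock doubled
higher_volt = cpu_power_watts(C, 1.0e9, 1.2, P_STATIC)    # voltage raised by 20%

print(f"1 GHz at 1.0 V: {baseline:.2f} W")
print(f"2 GHz at 1.0 V: {faster:.2f} W")
print(f"1 GHz at 1.2 V: {higher_volt:.2f} W")
```

With these assumed values, doubling the clock adds 1 W of dynamic power, while a 20% voltage bump adds 0.44 W, because the voltage enters the equation squared.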

This simple fact will work against you if your source of power is modest and finite, for example a battery. It will also land you in trouble if the means at your disposal to dissipate the heat are very limited. Say your CPU is fitted into such a tight little space that there is no room for a CPU fan to keep things cool. Now guess where such a constrained environment exists? In a mobile device, of course! Hence, mobile processor designers have always striven to keep operating voltages low and clock speeds subdued, at the cost of reduced performance.

A significant downside of restraining clock speeds in mobile processors is, of course, that they run slower than their desktop cousins. The lower the clock speed, the fewer the instructions executed per second and, by that crude measure, the lower the performance of the processor.

In fact, even desktop processors such as the Intel Core i3, i5 and i7 can dynamically increase or decrease the clock speed (as well as the operating voltage) based on whether or not the CPU is doing some heavy lifting. The processor jargon for this capability is Dynamic Voltage and Frequency Scaling (DVFS for short). Intel prefers to call it SpeedStep, or EIST (Enhanced Intel SpeedStep Technology). ‘SpeedStepping’ not only makes the processor more energy efficient, but also reduces the overall cooling requirements and the cooling cost.
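The idea behind DVFS can be sketched as a toy “governor” that picks the slowest operating point able to keep up with demand. This is only an illustration: the table of frequency and voltage pairs is hypothetical, and real implementations such as SpeedStep are considerably more sophisticated.

```python
# Hypothetical operating points: (frequency in MHz, voltage in volts),
# sorted from slowest to fastest.
OPERATING_POINTS = [(800, 0.80), (1600, 0.95), (2400, 1.10), (3200, 1.25)]


def pick_operating_point(required_mhz):
    """Return the slowest (frequency, voltage) pair that still meets demand."""
    for freq_mhz, volts in OPERATING_POINTS:
        if freq_mhz >= required_mhz:
            return freq_mhz, volts
    return OPERATING_POINTS[-1]  # demand exceeds the top speed: run flat out


def relative_dynamic_power(freq_mhz, volts):
    """Dynamic power relative to the fastest point, using F x V x V."""
    top_freq, top_volts = OPERATING_POINTS[-1]
    return (freq_mhz * volts ** 2) / (top_freq * top_volts ** 2)


for demand in (500, 2000, 5000):
    freq, volts = pick_operating_point(demand)
    share = relative_dynamic_power(freq, volts)
    print(f"demand {demand} MHz -> run at {freq} MHz, {volts} V "
          f"({share:.0%} of peak dynamic power)")
```

In this made-up table, idling along at the lowest operating point costs about a tenth of the peak dynamic power, which is the whole point of scaling frequency and voltage together.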

The effect of multiple cores on CPU performance

Figure 6: A four core processor with a private L1 cache and shared L2 cache

We saw how increasing the clock speed increases the heat generated by the CPU. CPU designers have come up with a neat trick to get around this constraint, at least to some extent: simply “splitting” the CPU into two or more cores, with each core operating at a lower frequency and voltage. Because the clock runs slower in these cores, the processor runs less hot, thereby reducing cooling requirements and cost. In mobile phones, this results in phones that can be more ‘tightly packed’ and thinner.

Such multi-core processors also have the advantage of true parallelism, i.e. two threads in a program, or even entire processes, can execute on different processor cores at the same time. This latter feature can be exploited only if the application running on the multi-core processor has been written in a multi-threaded manner, and there aren’t a whole lot of dependencies between the different running threads. Crucially, the operating system should also recognize the presence of multiple cores and intelligently schedule threads and processes on them in parallel, so as to reduce overall execution time.
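The power argument for multiple cores can be checked with the C x F x V x V term from the power equation. The sketch below uses hypothetical frequencies and voltages, ignores static power, and assumes the workload splits perfectly across the two cores, so treat the result as a best case.

```python
def dynamic_power(cores, freq_mhz, volts, capacitance=1.0):
    """Total dynamic power in arbitrary units: cores x C x F x V x V."""
    return cores * capacitance * freq_mhz * volts ** 2


# Hypothetical designs that deliver the same total clock cycles per second.
single = dynamic_power(cores=1, freq_mhz=2000, volts=1.2)
dual = dynamic_power(cores=2, freq_mhz=1000, volts=0.9)

print(f"dual-core / single-core dynamic power: {dual / single:.2f}")
```

Under these assumptions, the dual-core design supplies the same total number of clock cycles per second for a little over half the dynamic power, because each core can run at a lower voltage.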

Multi-core architectures also lend themselves to some interesting configurations such as ARM’s big.LITTLE architecture. In a typical big.LITTLE design, half the cores are designed for a higher frequency and the other half for a lower frequency. The high frequency core(s) are activated by the OS only when a burst of computation intensive workload is to be executed. At other times, the processor cruises along on the lower speed, lower power cores. Conceptually, this is not very different from Intel’s SpeedStep technology.

We have seen the role that raw clock speed can play on a processor’s performance. Now let’s look at how clock speed can also be a very deceptive indicator of processor performance.

Cycles Per Instruction, and Millions of Instructions Per Second

Let’s get introduced to two more performance parameters.

  1. CPI which is, not the Consumer Price Index, but Clock Cycles Per Instruction, and
  2. MIPS or Millions of Instructions Per Second.

But what exactly is an ‘instruction’?

Take a look at the following very simple program:

helloworld

When you compile and run it, this is what the processor executes:

assembly_instructions1 … around 4000 more such lines!

Each one of these lines contains an assembly language instruction that a processor can understand and execute.

Figure 7: Structure of an Assembly Language Instruction

Individual instructions do very simple things such as adding the contents of two registers, or jumping to a memory address based on the value of a register.

As you can see, even a tiny program translates into thousands of instructions. Each one of these instructions further breaks down into a sequence of binary ones and zeros. It is these ones and zeros that ultimately flip transistors on and off, store and fetch data in memory, and turn pixels on and off on the screen.

It can take several CPU clock ticks to completely execute each one of these instructions. CPI indicates the number of clock ticks needed to execute one such instruction. Bear in mind that a processor will require a different number of clock cycles for different types of instructions. For example, Compare (cmp) instructions may have a different CPI than Jump (je, jne, etc.) instructions.

With this knowledge under our belt, let’s look at a simple relationship between clock speed, CPI and MIPS:

Clock Speed in Mega Hertz = Average Clock Cycles Per Instruction x Millions of Instructions per Second, i.e.

MHz clock speed = CPI x MIPS

This simple equation produces a surprising result.

Take the example of two processors, P1 and P2. Let’s say P1’s designers have created a marvelously efficient instruction set architecture (ISA). This feat of engineering allows P1 to execute its instruction set at an average CPI of just 5 cycles per instruction. In comparison, let’s say processor P2 is less efficient, with a CPI of 8 cycles per instruction. Consequently, P1 can execute more instructions per second than P2. Let’s say P1’s MIPS rating is 160 million instructions per second, while P2’s MIPS rating is lower, at 100 million instructions per second.

As per the above equation, P1’s clock speed is 5 x 160 = 800 MHz, and P2’s clock speed is 8 x 100, i.e. also 800 MHz.

Note that processor clock speeds are normally not derived from CPI and MIPS! In fact it’s quite the other way around. But you can see how knowing only the clock speed says so little about the underlying performance of the processor. Indeed, here we have two processors rated at the same clock speed, but with different performance characteristics.
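Turning the relationship around gives MIPS = MHz clock speed / CPI. A minimal Python sketch using the P1 and P2 figures from the example (the function name is just for illustration):

```python
def mips(clock_mhz, cpi):
    """Millions of instructions per second, from MHz clock speed = CPI x MIPS."""
    return clock_mhz / cpi


p1_mips = mips(clock_mhz=800, cpi=5)
p2_mips = mips(clock_mhz=800, cpi=8)

print(f"P1: {p1_mips:.0f} MIPS, P2: {p2_mips:.0f} MIPS at the same 800 MHz clock")
```

Same clock, yet P1 gets through 160 million instructions every second against P2’s 100 million.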

Before we forge into other aspects of CPU performance, let’s see if we can calculate the time needed by processors P1 and P2 to execute our little program. Assume that the program compiles into 4000 instructions. We already know that the average Clock Cycles Per Instruction for P1 is 5, and for P2 it’s 8.

Thus the total number of clock cycles needed by P1 to execute our toy program is 4000 x 5 = 20000, while for P2 it is 4000 x 8 = 32000.

Now all we need to do is to divide these numbers by the respective clock speeds, which is 800 Megahertz in each case.

Therefore the time required by P1 to execute the program is 20000 / (800 x 1000000) seconds, which is 25 microseconds, while P2 can execute it in 32000 / (800 x 1000000) seconds, which is 40 microseconds.

What we just did was to use the CPU performance equation, viz.:

Execution time = instruction count x CPI / clock speed
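Plugging the P1 and P2 numbers into the performance equation is a one-liner in Python. This is just the arithmetic from the example written out, with the clock speed expressed in Hz so that the result comes out in seconds.

```python
def execution_time_seconds(instruction_count, cpi, clock_hz):
    """Execution time = instruction count x CPI / clock speed."""
    return instruction_count * cpi / clock_hz


CLOCK_HZ = 800 * 1_000_000  # 800 MHz for both processors

p1_seconds = execution_time_seconds(4000, cpi=5, clock_hz=CLOCK_HZ)
p2_seconds = execution_time_seconds(4000, cpi=8, clock_hz=CLOCK_HZ)

print(f"P1: {p1_seconds * 1e6:.0f} microseconds")
print(f"P2: {p2_seconds * 1e6:.0f} microseconds")
```

The toy program takes 25 microseconds on P1 and 40 microseconds on P2, even though both run at 800 MHz.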

Sub-scalar, scalar and super-scalar processors

But CPI by itself can be a deceptive measure. You see, the above calculations assume that the CPU is executing instructions strictly one after the other, i.e. serially. This results in a Clock Cycles Per Instruction of greater than one, i.e. CPI > 1, or an Instructions Per Clock Cycle (IPC, the inverse of CPI) of less than 1. Such processors are known as sub-scalar processors. But what if the processor could execute instructions in parallel? Let’s look at what might well be the world’s simplest program, only six instructions long:

assembly_instructions2

Remember that P1 can execute instructions at an average rate of 1 instruction every 5 clock cycles. (i.e. Cycles Per Instruction=5). Therefore, P1 will need 30 clock cycles to complete the execution of our super simple program consisting of only 6 instructions. Now let’s bring in a new processor P3 with the same average CPI of 5, but P3’s designers have come up with a way to parallelize instruction execution, as shown in Figure 8 below:

Figure 8: Instruction Level Parallelism. Scalar execution. CPI = 1.0

In Figure 8, you can see that P3 starts executing the first instruction, ‘push rbp’, at clock cycle 1, and the next instruction, ‘sub rsp 30h’, at clock cycle 2, i.e. even before the ‘push’ instruction finishes execution, and so on. P3 can thereby finish executing our little program in just 10 clock cycles, as against 30 clock cycles for P1, even though both P1 and P3 need 5 clock cycles on average to execute a single instruction.

What processor P3 is doing is known as Instruction Level Parallelism (ILP). P3’s designers have figured out a way to break each instruction into a pipeline of operations. In this case, the pipeline is, on average, five operations long per instruction. Each of these operations requires a different kind of execution unit, such as an ALU, an FPU or a bit shifter. Pipelining allows P3 to have multiple instructions in flight at once, moving through the various kinds of execution units provided by the processor in assembly line fashion.

From the fifth clock cycle onwards, this technique allows P3 to finish executing one instruction per clock cycle. Thus the effective clock Cycles Per Instruction that P3 can achieve with this technique is 1, an enormous improvement over P1’s CPI of 5. P3 is said to have a scalar architecture.
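The cycle counts for P1 and P3 follow from a simple formula: a pipeline that is D stages deep needs D cycles to complete the first instruction and one more cycle for each instruction after that. Here is a minimal sketch, assuming an ideal pipeline with no stalls (the function names are only illustrative):

```python
def serial_cycles(instructions, cycles_per_instruction):
    """Cycles when each instruction must finish before the next one starts."""
    return instructions * cycles_per_instruction


def pipelined_cycles(instructions, pipeline_depth):
    """Cycles for an ideal pipeline: fill it once, then retire one per cycle."""
    if instructions == 0:
        return 0
    return pipeline_depth + (instructions - 1)


p1_cycles = serial_cycles(6, 5)       # P1: no overlap between instructions
p3_cycles = pipelined_cycles(6, 5)    # P3: five-stage pipeline

# Over a long run, the pipeline fill cost becomes negligible.
long_run_cpi = pipelined_cycles(1_000_000, 5) / 1_000_000

print(f"P1: {p1_cycles} cycles, P3: {p3_cycles} cycles")
print(f"P3 effective CPI over a million instructions: {long_run_cpi:.6f}")
```

For the six-instruction program this gives 30 cycles for P1 and 10 for P3, and over a long instruction stream P3’s effective CPI approaches 1.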

But wait, P3’s designers can do even better than this.

What if P3 has multiple execution units available for each type of operation? This single feature will allow P3 to put into execution more than 1 instruction at the same time. See Figure 9 below for how this would work.

Figure 9: Instruction Level Parallelism. Superscalar execution. CPI < 1.0

In Figure 9, P3 is shown executing the first three instructions all at the same time. P3 can do this because it has 3 Instruction Units (IUs) available for each type of operation that the instructions break down into. But when it comes to executing the fourth instruction, it finds that it has run out of IUs of the kind that the fourth instruction requires, so it waits for one cycle for such an IU to become free. This technique allows P3 to execute more than one instruction per clock cycle. In the above figure, we can see it executing 3 instructions at a time. After the fifth clock cycle, P3 finishes 3 instructions during every subsequent clock cycle. Thus the effective Instructions Per Cycle (IPC) for P3 is 3, and therefore the effective Cycles Per Instruction (CPI) is 1/3, i.e. < 1. In this case P3 is said to have a super-scalar architecture.
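The pipeline arithmetic extends naturally to the super-scalar case. The sketch below is an idealized model: it assumes P3 can always issue 3 instructions together and never stalls, so it ignores waits like the one-cycle delay described for the fourth instruction in Figure 9. Real cycle counts would therefore be somewhat higher.

```python
import math


def superscalar_cycles(instructions, pipeline_depth, issue_width):
    """Cycles for an ideal pipeline that starts `issue_width` instructions per cycle."""
    if instructions == 0:
        return 0
    groups = math.ceil(instructions / issue_width)  # batches issued together
    return pipeline_depth + (groups - 1)


six_instr_cycles = superscalar_cycles(6, pipeline_depth=5, issue_width=3)

n = 3_000_000
effective_cpi = superscalar_cycles(n, pipeline_depth=5, issue_width=3) / n

print(f"6 instructions, no stalls: {six_instr_cycles} cycles")
print(f"effective CPI over a long run: {effective_cpi:.4f}")
```

In this stall-free model the six-instruction program needs 6 cycles, and the effective CPI over a long run settles at about 1/3, matching an IPC of 3.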

The effect of memory throughput

While we are in the realm of semiconductor circuitry, let’s also look at one more very important parameter that determines how many instructions the processor can execute in a unit of time, i.e. the instruction throughput. A large fraction of the instructions that a processor executes result in data being fetched from RAM, data being stored in RAM, or both. Consequently, the memory read/write throughput, i.e. how many bytes of data can be fetched from or stored in memory at one time, and how fast the fetch/store can be done, has an enormous influence on the execution time of an instruction. Among other things, the memory bandwidth or throughput depends on the MHz clock speed supported by the RAM chips and on the width of the data bus, for example whether it can move 32 bits or 64 bits in and out of RAM at a time.

Consequently, the performance of a processor is intrinsically tied with the throughput of the RAM chips that it has to talk to.
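As a rough back-of-the-envelope sketch, peak memory bandwidth can be estimated as memory clock x transfers per clock x bus width in bytes. The 800 MHz memory clock below is a hypothetical figure, and the estimate is a theoretical ceiling that ignores latency and other real-world overheads.

```python
def peak_bandwidth_mb_per_s(clock_mhz, bus_width_bits, transfers_per_clock=1):
    """Theoretical peak bandwidth in MB/s (1 MB taken as 10**6 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return clock_mhz * transfers_per_clock * bytes_per_transfer


# Hypothetical 800 MHz memory clock, one transfer per clock tick.
narrow_bus = peak_bandwidth_mb_per_s(800, bus_width_bits=32)
wide_bus = peak_bandwidth_mb_per_s(800, bus_width_bits=64)

print(f"32-bit bus: {narrow_bus:.0f} MB/s")
print(f"64-bit bus: {wide_bus:.0f} MB/s")
```

At the same memory clock, doubling the bus width from 32 to 64 bits doubles the peak figure, from 3200 MB/s to 6400 MB/s.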

The dark horse

With this discussion about CPI, IPC, instruction level parallelism, pipelining and what not, you may be tempted to declare that the processor’s micro-architecture is the ultimate influence on its performance. Well, let me jolt you out of that particular belief. You see, there is a dark horse that I have not introduced to you yet, and that is the compiler. Remember our simple program that was translated into 4000 or so run-time instructions? What if the compiler writers created an especially efficient compiler which translated the same program into just 2000 instructions instead of 4000? The processor would have a full 50% less instruction workload to deal with, a single fact that by itself would have a terrific impact on performance.
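To put a number on the compiler’s contribution, here is a small sketch that reuses P1’s figures (CPI of 5 at 800 MHz). It assumes, simplistically, that the leaner compiler output has the same average CPI as the original.

```python
def execution_time_us(instruction_count, cpi, clock_mhz):
    """Execution time in microseconds (one MHz is one cycle per microsecond)."""
    return instruction_count * cpi / clock_mhz


original_us = execution_time_us(4000, cpi=5, clock_mhz=800)
optimized_us = execution_time_us(2000, cpi=5, clock_mhz=800)

# Clock speed the original 4000-instruction build would need to match that.
equivalent_clock_mhz = 4000 * 5 / optimized_us

print(f"4000 instructions: {original_us} us")
print(f"2000 instructions: {optimized_us} us")
print(f"clock needed to match without the better compiler: {equivalent_clock_mhz:.0f} MHz")
```

In other words, on these assumptions a compiler that halves the instruction count is worth as much as doubling the clock from 800 MHz to 1600 MHz, without the extra heat.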

Bringing it all together

In summary, smart processor designers use techniques such as the above while designing the micro-architecture of the processor, so as to deliver performance that is astonishingly higher than that of their peers, even at a paradoxically lower clock speed rating.

How do such factors as described above reflect on the performance benchmark numbers?

Figure 10 below compares the single-core and multi-core benchmarks of two sets of processors. All processors within each set have the same number of cores and similar megahertz clock speeds. However, due to factors such as the ones we looked at earlier, their performance benchmark values, when compared on a single-core to single-core and multi-core to multi-core basis, are very different.

Figure 10: Geekbench 3 performance benchmark comparison

One more thing…

We end by circling back to the topic of this article, namely, why do desktop processors still rule the roost over their mobile counterparts when it comes to delivering raw performance?

As you may have noticed by now, a common thread that runs throughout our discussion on processor performance is the heat generated by the processor. As with any piece of machinery, our own human body included, the heat generated, more than anything else, puts a limit to the maximum performance that the machine can deliver, and the duration over which it can deliver this performance without slowing down. There is a way to quantify the heat generated using a parameter called the Thermal Design Power (TDP). Given how important the TDP rating is, not surprisingly, the semiconductor industry is quite divided over both the definition of TDP, and the correct way to measure it.

A reasonably well accepted definition is that TDP specifies the maximum heat per second (measured in Watts) generated by the processor that can be successfully siphoned away by whatever cooling system is employed.

For example, if you plan to use an Intel Xeon E3-1535M v5 with a TDP rating of 45 Watts in your computer, the cooling system you employ had better be able to dissipate 45 Joules of energy per second on a sustained basis. Otherwise, better keep a fire extinguisher handy!

As we have seen earlier, the higher the clock speed the CPU runs at, the more heat it emits. Therefore the clock speed rating of the processor is usually tied to its TDP rating. The Intel Xeon E3-1535M v5 has 4 cores, each one rated at a base frequency of 2900 MHz. Since its TDP is 45 W, the processor is expected to generate 45 Joules of heat per second when all 4 cores simultaneously operate at 2900 MHz on a sustained basis, under an “Intel defined high complexity workload”.

Alas, this is where mobile processors once again hit a major performance air pocket.

To illustrate, the Intel Atom x7-Z8750, a top-of-the-line mobile processor from Intel, has an SDP (Scenario Design Power, similar to TDP) rating of just 2 Watts! This is expected. After all, how can you possibly dissipate several dozen Watts from a slim mobile phone case without any cooling fins, air pockets or cooling fans?! If you cannot dissipate more than a certain amount of heat, you cannot operate the processor at more than a certain clock frequency on a sustained basis. All other aspects of processor design remaining the same, mobile processors take a big hit on performance just because of the drastically lower sustained clock speed at which the processor can be operated in a mobile device.
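The link between heat budget and sustained clock speed can be sketched by solving the power equation, Power = [C x F x V x V] + P(static), for F. The capacitance, voltage and static power values below are hypothetical round numbers, not the actual parameters of the Xeon or the Atom; only the 45 W and 2 W budgets come from the ratings quoted above.

```python
def max_sustained_mhz(power_budget_watts, capacitance_farads, volts, static_watts):
    """Highest clock (in MHz) whose total power stays within the budget."""
    dynamic_budget = power_budget_watts - static_watts
    if dynamic_budget <= 0:
        return 0.0  # static power alone already uses up the whole budget
    return dynamic_budget / (capacitance_farads * volts ** 2) / 1e6


# Hypothetical chip parameters, kept identical for both budgets.
C = 1e-8         # switched capacitance in farads
V = 1.0          # operating voltage in volts
P_STATIC = 1.0   # static power in watts

laptop_mhz = max_sustained_mhz(45, C, V, P_STATIC)  # 45 W budget
phone_mhz = max_sustained_mhz(2, C, V, P_STATIC)    # 2 W budget

print(f"45 W budget: about {laptop_mhz:.0f} MHz sustained")
print(f" 2 W budget: about {phone_mhz:.0f} MHz sustained")
```

With everything else held equal, this toy chip sustains roughly 4400 MHz on a 45 W budget but only about 100 MHz on 2 W. Real mobile chips claw much of that back by also lowering the voltage and the capacitance, which this sketch deliberately keeps fixed.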


Copyright information

Figure 1: IBM Simon

Image Attribution: Bcos47

Publication Date: 30 June 2012

Online At: http://commons.wikimedia.org/wiki/File:IBM_Simon_Personal_Communicator.png

Copyright Information: Image has been released into the public domain by author Bcos47 as follows: “I grant anyone the right to use this work for any purpose, without any conditions, unless such conditions are required by law.”

Figure 2A: Intel i486 DX2 66Mhz

Image Attribution: Henry Mühlpfordt

Publication Date: 27 March 2007

Online At: https://commons.wikimedia.org/wiki/File:Intel_i486_dx2_66mhz_2007_03_27.jpg

Copyright Information: Image Copyright Henry Mühlpfordt. This image has been licensed for commercial use by Henry Mühlpfordt under the Creative Commons Attribution-ShareAlike 2.0 Generic Licence. To view a copy of this licence, visit http://creativecommons.org/licenses/by-sa/2.0/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.

Figure 2B: Die of Intel 80486 DX2-66

Image Attribution: Pauli Rautakorpi

Publication Date: 25 September 2013

Online At: https://commons.wikimedia.org/wiki/File:Intel_80486_DX2_P24S_die.JPG

Copyright Information: Image Copyright Pauli Rautakorpi. This image has been licensed for commercial use by Pauli Rautakorpi under the Creative Commons Attribution 3.0 Unported Licence. To view a copy of this licence, visit https://creativecommons.org/licenses/by/3.0/deed.en or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.

All other images in this article are Copyright © 2016 Sachin Date, and are licensed by me for use under the Creative Commons Attribution-ShareAlike 4.0 International License. To view a copy of this licence, visit http://creativecommons.org/licenses/by-sa/4.0/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.
