A closer look at Samsung’s “neural network” M1 CPU
Some of Samsung’s Galaxy S7 series and Note 7 handsets come sporting the company’s own Exynos 8890 processor this year. The release of this chip marked a major first for Samsung, as it is the first processor to feature the company’s custom-designed M1 CPU core, code-named Mongoose. At the Hot Chips 2016 conference, Samsung revealed more information about its latest processor, including details about an interesting “neural network” CPU design.
As we know, the Exynos 8890 is an octa-core processor built from four Samsung M1 CPU cores clocked between 2.3 and 2.6GHz, four 1.6GHz ARM Cortex-A53 cores, and an ARM Mali-T880 MP12 GPU. The M1 CPU core is the result of a three-year design cycle and was developed completely from scratch.
We now also know that the CPU features a 4-way 64KB L1 cache, a 2MB L2 cache, and support for full out-of-order execution including loads and stores, much like ARM’s latest Cortex-A73. There are seven integer execution ports with their own schedulers, plus two pipelines that share a scheduler for advanced SIMD, NEON, and cryptographic instructions. Interestingly, the M1 decodes and dispatches four instructions per cycle, whereas ARM went for just a two-wide decode pipe with its Cortex-A73. ARM decreased this from three in the Cortex-A72, as the company believes a narrower decoder is more energy efficient while still catering well enough for mobile applications. Samsung seems to disagree.
So far Samsung’s M1 seems fairly familiar for a high-performance big.LITTLE core, but the M1 CPU begins to differentiate itself from the ARM CPUs that we are familiar with thanks to advanced branch prediction. Samsung describes this simply as a “neural network”.
Before we delve in any further, let’s go over some basics. Branch prediction is an important part of a CPU circuit, as it can improve the flow of instructions by guessing ahead of time what will happen at common “if-then-else” decision points (branches). If a branch is predicted correctly, the CPU can be fed instructions continuously, allowing it to maximize its potential rather than having to wait to see what happens, which would incur a delay.
Branch prediction circuitry is incredibly complex and varies a lot between processor designs. Usually companies don’t disclose their designs because of this, but Samsung seems happy to boast about its development.
Samsung’s design supports indirect jumping for multi-way and conditional branches, estimation of two branches per cycle, and a dedicated loop predictor. The neural networking part seems to come in with the use of a “perceptron” as an alternative to the commonly used two-bit prediction counter. The use of a perceptron engine in a CPU is not entirely new, as AMD and Intel already use similar ideas, but this is the cutting edge of branch prediction design.
Instead of assigning branches a likelihood value from 0 to 3 based on recently seen branch instructions, a perceptron algorithm keeps track of branch likelihood by learning from previous outcomes and predictions.
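To make the 0 to 3 likelihood value concrete, here is a minimal Python sketch of a generic two-bit saturating counter. It is a textbook illustration rather than anything Samsung has described, and the class name, the starting state, and the helper run_counter are all assumptions made for the example.

```python
class TwoBitCounter:
    """Generic two-bit saturating counter (illustrative, not Samsung's design).

    States 0 and 1 predict "not taken"; states 2 and 3 predict "taken".
    """

    def __init__(self, state=2):
        # Starting at 2 ("weakly taken") is an arbitrary choice for this sketch.
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step towards the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def run_counter(outcomes):
    """Hypothetical helper: count mispredictions over a list of outcomes."""
    counter = TwoBitCounter()
    misses = 0
    for taken in outcomes:
        if counter.predict() != taken:
            misses += 1
        counter.update(taken)
    return misses


# A loop branch: taken nine times, then not taken once, repeated three times.
loop_branch = ([True] * 9 + [False]) * 3
# A branch that strictly alternates between taken and not taken.
alternating = [True, False] * 10

print(run_counter(loop_branch))
print(run_counter(alternating))
```

In this sketch the counter only misses the loop exits, but it misses every second outcome of the alternating branch, because it keeps no record of the pattern beyond its own state.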
Put simply, perceptron branch prediction guesses an outcome based on an assigned branch weighting. This value can be adjusted over time based on whether the outcome was correctly guessed or not, in order to make better predictions in the future. This operates as a feedback loop and imitates the way that our brains learn from experience. There’s quite a good (technical) paper on this to read here, if you’re interested.
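The feedback loop described above can be sketched in a few lines of Python. This is a simplified, generic perceptron predictor in the style found in published research, not Samsung’s actual circuit. The history length, the training threshold, and all names here are assumptions picked for illustration.

```python
class PerceptronPredictor:
    """Simplified perceptron branch predictor (illustrative sketch only)."""

    def __init__(self, history_length=4, threshold=8):
        # weights[0] is a bias; the rest pair up with recent branch outcomes.
        self.weights = [0] * (history_length + 1)
        # History entries are +1 (taken) or -1 (not taken), newest first.
        self.history = [-1] * history_length
        # Keep training while the output is weaker than this (assumed value).
        self.threshold = threshold

    def _output(self):
        # Weighted sum of the bias and the recorded outcomes.
        return self.weights[0] + sum(
            w * x for w, x in zip(self.weights[1:], self.history)
        )

    def predict(self):
        # A non-negative output means "predict taken".
        return self._output() >= 0

    def update(self, taken):
        y = self._output()
        t = 1 if taken else -1
        # Adjust weights on a wrong guess or when confidence is still low.
        if (y >= 0) != taken or abs(y) <= self.threshold:
            self.weights[0] += t
            for i, x in enumerate(self.history):
                self.weights[i + 1] += t * x
        # Shift the new outcome into the history.
        self.history = [t] + self.history[:-1]


def run_perceptron(outcomes):
    """Hypothetical helper: count mispredictions over a list of outcomes."""
    predictor = PerceptronPredictor()
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# A branch that strictly alternates between taken and not taken.
print(run_perceptron([True, False] * 10))
```

Because the weights tie the prediction to recent outcomes, the sketch settles on the alternating pattern after a brief warm-up. A lone two-bit counter with no history would mispredict such a pattern at least half the time.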
The benefit is that a perceptron should correctly predict branch outcomes more consistently, avoiding wasted cycles and time spent reloading saved states, thereby making the most of a CPU’s performance potential. Furthermore, a perceptron design doesn’t use as much die space or as many resources as increasingly complex bit counters.
For a three-year project, the M1 core and the overall Exynos 8890 package seem quite accomplished. Samsung was always going to try something new with its in-house CPU design, and it’s very interesting to see that a considerable amount of effort has been put into branch prediction, especially given the relatively short from-scratch development time.
The result of this effort is that Samsung’s M1 CPU is specifically designed to cut down on the processing time wasted by incorrect branch assumptions. This is not only important for maximizing processing performance in a more limited mobile package, but also for keeping power consumption to a minimum by not wasting cycles. That said, we can’t really say how much better, if at all, this is than the designs used by ARM or Qualcomm.
Samsung’s M1 is certainly an interesting and promising step for the company. Developing its own CPU design clearly signals an intention to escape from dependence on ARM and Qualcomm, and the second generation design is likely to be even more competitive than the M1.