Great study from +Gary Sims​ on +Android Authority​!

#supercurioBlog



Fact or Fiction: Android apps only use one CPU core
We have had multi-core processors in our PCs for over a decade, and today they are considered the norm. At first it was dual-core, then quad-core, and today companies like Intel and AMD offer high-end desktop processors with 6 or even 8 cores. Smartphone processors have a similar history. Dual-core energy-efficient processors from ARM arrived about 5 years ago, and since then we have seen the release of ARM-based 4-, 6- and 8-core processors. Howev…

Source post on Google+

Published by

François Simond

Mobile engineer & analyst specialized in display, camera color calibration, and audio tuning

44 thoughts on “Great study from +Gary Sims​ on +Android Authority​!”

  1. I'm not quite sure what to think of this article. The whole system of course profits from more cores, because your app is never alone.
    But also, out of the box every process has one main thread for drawing. If you do everything on that thread, multiple cores are less relevant to you than one really beefy core.
    Also, the typical Android dev writing in Java will never have to think about core count, because he won't assign threads to a specific core.
    I am afraid this article will raise more questions in the non-developer community…

  2. I think +Gary Sims was very clever to observe and highlight the behavior of the Chrome browser, as it already uses multiple cores optimally.

    Not all apps really benefit from higher computing capabilities in day-to-day usage. There will be a difference, but not a night-and-day one in experience.
    However, browsing on mobile is never fast enough, due to the ever-increasing complexity of client-side JavaScript and higher-resolution images.
    That's why, with the browser alone, we have a definite answer to the question asked 😊

  3. +Michael Panzer exactly my thoughts. The whole article is written on the premise that when you see multiple cores being used while you're running app X or Y, it means the app itself is making use of them: wrong. There's a whole system behind the actual app, and Android does that job by itself; it doesn't necessarily mean the app is even spawning new threads. The reality is that, unless you need to do some heavier work that would block the main thread, you'll always use the UI thread, as the sketch below illustrates.
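
    To make that concrete, here's a minimal plain-JVM sketch (no Android framework involved; the class name and workload are made up for illustration). The same total work runs first on one thread, then on a pool sized to the core count; only in the second case can the kernel scheduler keep every core busy.

      import java.util.ArrayList;
      import java.util.List;
      import java.util.concurrent.ExecutorService;
      import java.util.concurrent.Executors;
      import java.util.concurrent.Future;

      public class CoreDemo {
          // Simulated CPU-bound unit of work (a simple LCG loop).
          static long crunch(long seed) {
              long x = seed;
              for (int i = 0; i < 200_000_000; i++) {
                  x = x * 6364136223846793005L + 1442695040888963407L;
              }
              return x;
          }

          public static void main(String[] args) throws Exception {
              int cores = Runtime.getRuntime().availableProcessors();

              // 1) Everything on one thread: only one core can be busy,
              //    no matter how many the SoC has.
              long t0 = System.nanoTime();
              for (int i = 0; i < cores; i++) crunch(i);
              System.out.printf("1 thread : %d ms%n", (System.nanoTime() - t0) / 1_000_000);

              // 2) The same total work handed to a pool sized to the core
              //    count: now the scheduler can spread it across every core.
              ExecutorService pool = Executors.newFixedThreadPool(cores);
              t0 = System.nanoTime();
              List<Future<Long>> pending = new ArrayList<>();
              for (int i = 0; i < cores; i++) {
                  final long seed = i;
                  pending.add(pool.submit(() -> crunch(seed)));
              }
              for (Future<Long> f : pending) f.get(); // wait for all tasks
              System.out.printf("%d threads: %d ms%n", cores, (System.nanoTime() - t0) / 1_000_000);
              pool.shutdown();
          }
      }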

  4. +François Simond sure, Chrome is a good example, but every application benefits from more cores and computing power. When you have an open system like Android, where a lot of services can be running while you try to get your frames done in under 16 ms, more is always better.
    So even if your app never creates a new thread, it's nice to be able to keep the others away from the resources you need.

    Nevertheless, it's a nicely written article with a lot of explanation.

  5. +Francisco Franco it's not what I got from the article, quite the opposite actually. See the "Android" section:
    "Because of processes like the SurfaceFlinger, Android benefits from multi-core processors without a specific app actually being multi-threaded by design. Also because there are lots of things always happening in the background, like sync and widgets, then Android as a whole benefits from using a multi-core processor."

    Then it's pretty well explained that instead of using fewer cores at higher frequencies, it's more power efficient to spread the work over more cores running at their best efficiency point.
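
    As a rough first-order illustration of why that works (the standard CMOS dynamic power model, not figures from the article): switching power scales as P ≈ α·C·V²·f, and the highest stable frequency falls roughly with the supply voltage V. So two cores at half the frequency, each at a correspondingly reduced voltage, do the same total work while the V² term cuts the energy per instruction substantially. That's the rationale for spreading work wide and slow rather than narrow and fast.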

  6. Alright, so who is going to write this article about more RAM?
    BTW, I want a phone that has like 16 GB of RAM, boots from fast flash storage, and then lives completely in RAM. Of course, persisting to flash from time to time is OK.

  7. I don't think the people saying you don't need 8 cores are all saying it just because it's multiple cores.

    Many (myself included) say it because using, say, 4 bigger cores in the same die area is a better use of that die area. When comparing apples to apples on lithography (there is no other ARM chipset on Samsung 14nm to compare directly to), the octa designs have shown NO advantage in power efficiency or performance over even tweeners like Krait and Apple's Swift/Cyclone in real-world use. And having bigger cores that are faster 100% of the time is more favorable than catering to the few apps that fully utilize a cluster once in a blue moon.

  8. +Jonathan Franklin I know what you're getting at. I still remember +Brian Klug talking about the topic of die size, core count and cost. Apple's SoCs are not really comparable, because they don't need to be competitive; they just need to serve Apple's own needs. There is a lot more to be said about this topic that most people wouldn't even understand.

  9. +Jonathan Franklin it would definitely be interesting to try out Apple CPUs running an Android port, but we can't.

    In the meantime, the cores powering Android devices are quite different.
    As we've seen, the A57 cores are not competitive in terms of power efficiency.
    The alternative we have now is using many A53 cores, and fortunately Android has been designed from the ground up to exploit parallelism, which makes it an okay solution until a big core proves itself worthy. (Will it be the A72, or a Qualcomm or Samsung creation? We'll see!)

  10. The article itself is well written and fine, BTW.

    Just that a lot of variables exist there, and there aren't a lot of great real-world examples in terms of processors, because all the vendors have made marketing-driven decisions.

    Chrome and most of your graphics-intensive games are also unique in that they don't really reflect how most Java apps are written or perform. Being native affords them more flexibility in utilizing threads. They're very legit everyday use cases, but they're also the existing best-case scenario in terms of CPU utilization.

  11. +Michael Panzer exactly. And consider just how much of Android's system-level improvement has hinged completely on things like RenderScript being able to throw a specific workload across multiple cores where it previously could not.

    Huge improvements have come from that, but it also shows how much work is left to truly utilize these designs to their fullest; we're still (as many OSes are) in the baby-step phase of that.

  12. Just because the cores get used doesn't mean in any way that it's not stupid.

    8 A53s in a mobile SoC is stupid. The MediaTek parts that used 8 A5 cores: stupid.

    Putting 4 A15, 4 A57, or 4 A72 cores in a mobile SoC is stupid.

    Drawing conclusions about big.LITTLE from a design that isn't even big.LITTLE….

    Drawing efficiency conclusions without actually measuring how much power is used under a controlled workload.

    But I get it. Cores are being used, therefore it's good!

  13. Disagree completely. The conclusions he drew from the article…. The "anybody who says an 8-core…" line… Throw your hands up in the air.

    This entire big.LITTLE, 8-cores-and-so-on angle is irrelevant if all you wanted to do was show apps running on more than one core. That clearly wasn't the objective of the article.

  14. +François Simond isn't the whole point of RS that you don't care, and shouldn't alter, where it's executed?

    I ran the benchmark on my N5 and N10. The results are quite strange, and the Samsung SoCs' results are also strange in proportion to other SoCs…

  15. No idea if it's possible, but I don't think running RS on a Hexagon would make all that much sense. Hexagon is a DSP core with 3-way SMT and only a moderate clock speed. For typical compiler-generated code it will not match most ARM application processors, and it lacks the parallelism of a GPU.

  16. +Brian Z If you look at the AnandTech tests or some of the ARM Ltd. presentations, there's maybe a factor of 5-10 difference in power consumption between the A7/A53 and the A15/A57, so if you can run almost anything on the little core and not have performance suffer noticeably, you are very likely to come out ahead. The difference in power consumption is just so large that it's hard for it not to pay off.

    Paired big/small A53 systems are more interesting, though. In theory power consumption can differ a lot based on core layout, but I don't know how good Qualcomm's implementation is. It would be very interesting to see the difference in perf/watt between their big and little A53 designs.
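
    As a back-of-envelope illustration (round made-up numbers, not measurements): if the little core draws 1/5 the power of the big core but delivers only half the performance, a task that takes the big core 1 s at power P takes the little core 2 s at P/5, i.e. 0.4 P·s of energy versus 1 P·s. Even with a 2x slowdown, energy per task drops by more than half, which is why the little cluster wins whenever the extra latency is tolerable.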

  17. As far as I know it's just DVFS. To be honest, those chips aren't interesting to me at all.

    Just stop-gap chips, since their custom cores aren't ready.

  18. +Brian Z Where did you read that? DVFS wouldn't really make sense given that the ranges of clock frequencies are very different. Plus, it would be a very foolish way to implement it. I'm very skeptical unless Qualcomm has come out and said so.

  19. Nowhere. As far as I know, no reviewer who is up to the task has looked at it.

    The source code will reveal plenty, though.

    I don't see them using different transistors and optimizing for a certain clock at all. I'd be shocked if it's anything more than DVFS that differentiates the clusters.

    What I suspect is that simple: get 8 A53 cores that can all run at the higher cluster's rate, then downclock one cluster to a more efficient point on the power/performance curve.

    Optimizing at the hardware level makes no sense to me in this case, especially on the older 28nm LP process in a low-cost chip.

    Take the 610: 4x A53. The 615 clocks one cluster higher, one lower, and calls it a day.

    Anand has said that Qualcomm admitted, when it was announced, that the 615 exists almost solely because of the Chinese market. They wanted an octa-core 64-bit chip out the door.

  20. +Brian Z I wouldn't assume it's DVFS. That would be very foolish of them, and it's usually not how these things are done. If you want a core optimized for a given operating point, typically you'd use one specifically synthesized for the clock targets you're aiming for, rather than taking one designed for a different purpose and trying to shoehorn it in with over- or under-volting. This is really nothing special; most SoCs have many different processor cores on them, each targeting different voltages and cycle times. Doing anything else is just too inefficient.

  21. They aren't going to optimize an off-the-shelf ARM core.

    Those cores are designed by ARM to run within a frequency range, but it's up to the OEMs to decide exactly what they run them at. Usually they push it higher.

    Until recently, mobile SoCs didn't have different cores at all. Unless you want to do some creative counting and count every damn last thing as a core.

    For our purposes, core = application processor, like Krait / Cyclone / Cortex-A5 / A7, etc. They were not mixed until the big.LITTLE debut in the international GS4.

    Tegra 3 had a lower-power companion core, but it was the same core.

    It's all been done with binning and voltage scaling. Not some hardware tuning.

    These Android OEMs shoehorn crap into everything. In fact, that's what they always do. Samsung: we're making a watch, let's shoehorn in an Exynos and cripple it. Android Wear: let's take a Snapdragon 400 and cripple it. Moto 360: same. Google Glass 1.0: same thing.

    Krait cores: everything from the Krait 300 all the way up to what's in the 805 is essentially the same. The CPU-side performance gains come from clock-speed increases enabled by process improvements and other tweaks in the SoC.

    Krait hasn't seen any IPC / clock-for-clock performance improvement since the 200-to-300 transition.

    You gave the SoC vendors and OEMs the benefit of the doubt, when in fact they've been doing foolish nonsense for years. And many have celebrated it! Psssht, my phone has a 2.5 GHz quad core, it pwns that puny little dual core in crApple!

    Speaking of Apple and shoehorning: when Apple needed more for their iPads over the years, they made the X versions. Everybody else: let's just shoehorn in whatever we've got. We even got the joy that was known as Windows RT from that.

  22. +Brian Z I don't think you have an accurate understanding of how ARM processors are integrated into actual silicon products. When someone mentions a specific part like the A53, they're referring to an RTL-level design of a processor, not an actual synthesized part. At the RTL level, no "frequency range" is designated yet; there can't be. That happens at synthesis, where the logical/mathematical design of the processor is translated into a transistor pattern that can be encoded onto a mask. Before that there is no frequency range, because the same RTL can operate over a huge range of clock frequencies and power consumptions, even at the same voltage.

    So when you take "a[n] off the shelf arm core", you don't actually have a physical core; you have your choice of many pre-synthesized parts, and usually an infinite number of not-yet-synthesized ones. Logically, if you need a high-power core, you'd take one synthesized for high-power applications (or maybe even synthesize a new one). If you need low power, you'd take a low-power one. What you wouldn't do is take the same one and use it twice.

    I think that when you read "customized" you probably assumed it meant changing the RTL. I agree this doesn't happen (there is probably no point). But that doesn't mean the two cores are the same…

  23. +Brian Z Regarding the Tegra 3, it might be helpful for you to look at a real die shot of the Tegra 3's 5 cores:

    http://www.techinsights.com/uploadedImages/Public_Website/Content_-_Primary/Teardowns/2012/Google_Nexus_7/NVIDIA-Tegra-3-processor.jpg

    The 5th core is the tiny little box midway across the die near the top. As you can see, it's less than half the size of the other A9s. This is the same logical processor, but they clearly did a lot more than just "binning and voltage scaling": it's a completely different layout. By taking the same logical core but relaxing all the cycle timings, much lower drive currents and longer gate delays can be tolerated. This gives a much smaller, more power-efficient core that doesn't clock nearly as high.

  24. I didn't really disagree with you on anything. However, even though it's not true that most Android apps are single-threaded (at least, that's the conclusion we draw from your article), I think CPU utilization in those apps could be significantly better. In particular, the games you showed off seemed to be optimized for a dual-core CPU. That could be the case, since I remember the iPhone 6 being the first phone on the iOS end that actually broke into the quad-core range of CPU. Given that Temple Run has been around for a while, and practically every mobile game these days is meant to run cross-platform (with a hint of neglect for Windows phones), it could be that the devs are trying to kill two birds with one stone by only shooting for dual-core optimization. It helps that there is still a significant number of older Android phones in use that may only have a dual-core processor.

  25. +Michael Giacomelli in the beginning I didn't understand the logic behind octa-core CPUs with 4 high-frequency A53 cores and 4 low-frequency ones.
    But this is what you describe here.
    I would be really interested in reading documents about this multi-core design, because I wonder how much power efficiency is gained (if any) that way.
    Also, I'd like to learn what exactly differentiates the high-frequency and lower-frequency cores physically.
    Example chip: the S615

  26. +François Simond I'd love to see real documentation too. Truth is, I have no idea how much they save. In theory it could be a lot; there can be a very large difference between different layouts of the same core. But that assumes everything is done correctly: that the kernel schedules properly, that synchronizing state between different cores is efficient, etc. It may be that they don't gain very much once overhead is considered.

    As for how you lay out cores for different design goals, the basic idea is pretty simple. Every gate you go through is a capacitor that has to be charged before it activates. Drive them with more current, or present a smaller load, and they flip faster. Drive them with less current and you use less power. Depending on how much delay you can tolerate, you can do things like use different types of transistors, or stack logic into deeper levels, which increases delay (since the circuit is more sequential and less parallel) but can be more power efficient (fewer gates to charge). For FinFET designs, they even lay out time-critical transistors with multiple fins just so they can drive more current (and of course use more power).

    For the most part this stuff is all handled by software these days, since laying out cores is easy for a computer but hard for a human. It's actually a lot like compiling a program: you can optimize for speed or for memory. Same idea here. Your RTL is like code that can be compiled into all kinds of different circuits depending on the software you use, the constraints you have, etc.
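
    In first-order terms (textbook approximations, not measured values): a gate's delay scales roughly as t ≈ C·V / I_drive, while the energy it burns per switch scales as E ≈ C·V². Widening a transistor (or giving it more fins) raises I_drive and shortens the delay, but it also raises its own capacitance, and with it both the switching energy and the load seen by the previous stage. Synthesizing for a relaxed frequency target lets the tool shrink drive strengths wherever the timing slack allows, which is why the low-clock layout comes out so much smaller and cheaper to switch.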

  27. I understand how it works.

    They are not doing this for every SoC. They make, let's say, their Qualcomm S600 chip. From there it's binned and DVFS'd to fill who knows how many Krait 300 parts.

    Same for the 400 in phones and watches.

    A reference core is a reference core. Depending on what ARM core you have licensed you can get variation; some older ARM reference designs had plenty of things as optional: the caches, NEON support…

    Qualcomm is not tinkering with the transistors. It's shoehorned in. When you strip away the marketing name like "Qualcomm 400" and look at part numbers like the 8026 in the watches, it's the exact same damn part everywhere else it's used. Same for the TI part in the 360.

    For the original Galaxy Gear it's a gimped Exynos 4212 shoehorned in.

    These aren't custom, special-made chips. DVFS and binning.

  28. +Brian Z Sorry, but no, you are misunderstanding this completely. There is no single "reference core". An A53, for example, is just the logical specification for a processor, not any specific core. Over the lifetime of the design, many thousands of different cores will be created from the A53, and individual products may use several A53s with completely different transistor layouts. This isn't a Qualcomm thing; it's just how modern fabs work.

    Are you familiar with the idea of hardware description languages? You may want to take a look at the wiki page to understand what I mean by synthesis.

  29. I am not misunderstanding it. You just won't accept that all these vendors DVFS, bin and shoehorn their way in, rather than making a change to the layout and having the foundry fab it separately for all the different chips in the same family line.

    This is going nowhere.
