Sunday, August 10, 2014

Browser Wars: August 2014, Console Edition

The web browsers on PlayStation platforms, even the PlayStation 4, are extremely under-performant in most standard benchmarks when compared to pretty much any other device -- even years-older Android phones also made by Sony.

One of the most frustrating things about the PlayStation 3, and now the PS Vita and PS4, is the oddly broken web browsers. Seemingly handicapped intentionally in terms of HTML5 functionality, they also perform significantly worse than most comparable (and technically inferior) platforms, to the point where many of Sony's own web sites can make the browsers extremely jerky and/or crash them. (A good example is just loading up the PlayStation Blog and scrolling down the page -- huge pauses and really jerky behavior even when no games are running.) Indeed, a couple of the people I know who have "jailbroken" their PS3 and Vita cited getting a non-gimped web browser on the devices as a motivator. It's a huge pain to try to look up a walkthrough/trophy guide and have to dig out another device just to do it.

Before getting into it, let me thank my pals Corey and Lloyd for running the tests on their Xbox 360 and Xbox One consoles. Lloyd even made YouTube videos showing Fishbowl's performance at varying levels, with some hilarious textual commentary.

Starting with the PS4 (firmware 1.75) and its 8 cores and as many gigabytes of RAM, there are several easy ways to get it to crash or become so unresponsive that you think the system is hung: go to iwaggle3d.net, joystiq.com, or even the PlayStation Blog on some days, and scroll down the page. I even got crashes when trying to watch the new music videos on weirdal.com, never mind that most of the videos wouldn't even play inline or on their natively-hosted web sites (funnyordie.com, nytimes.com, etc.).

Then there's the performance of the Fishbowl benchmark, easily the most embarrassing aspect of the PlayStation platform browsers. The Vita (firmware 3.18) barely gets 1fps (you read that correctly), and the PS4 -- with its 8 cores, 8GB of GDDR5 RAM, and plentiful GPU compute cores -- gets as high as 3fps. For comparison, the iPhone 3GS (running iOS 5.1) gets ~30fps, the iPad 2 (running iOS 7.0.2) gets 60fps, and both the Xbox 360 and Xbox One (both running the latest software as of August 2014) get 60fps. If you want to get the frame rate up on the PS4 and Vita, un-checking the visual components that use alpha (Back, Shine, Front, etc.) can get the Vita up to 3fps and the PS4 up to 20fps. It's bizarre that alpha blending is so incredibly slow, meaning it's not only not GPU accelerated but also severely unoptimized. It's also weird that it's this slow even though the audio isn't playing at all: when testing the Fishbowl benchmark on the Sony Xperia Play running Android 2.3, the audio plays just fine. (Looking at html5test.com, PlayStation browsers report support for HTML5 audio but support no codecs -- not even license-free ones like PCM or Ogg, nor licensed codecs already included on every shipping PS4 and Vita, such as AAC.)

Fishbowl running on PS Vita, getting 1 frame per second (no audio or background video):

Fishbowl running on Xbox 360, getting 60 frames per second (with audio and correctly looping background video):

Fishbowl running on iPad 2, getting nearly 60 frames per second (with audio, but no background video):

Fishbowl running on PlayStation 4, getting 2 frames per second (with background video, but no audio):

Fishbowl running on Xbox One, getting 60 frames per second (with audio and correctly looping background video):

One might say that the HTML5 Fishbowl benchmark is "too new", despite it being about five years old and performing pretty well on five-year-old devices and in the now-defunct Internet Explorer 9, Firefox 17, and Chrome 20 desktop browsers. (That being said, the Android 2.3 web browser on the Xperia Play didn't render the fishbowl at all -- it just played the audio.) There's also the FishIE Tank benchmark, which just demonstrates CSS animations and transitions without using any HTML5 features. In that benchmark, nearly every browser and device gets 60fps -- except the PlayStation platforms, where the Vita gets 7-13fps and the PS4 gets ~20fps with the default of 20 fish being animated at once. Set the FishIE Tank benchmark to 50 fish, and both PlayStation platforms' browsers nosedive in performance, while the iPad 2 and Xbox 360 get 60fps without even breaking a sweat.


FishIE Tank benchmark on PS Vita, getting 6 frames per second:

FishIE Tank benchmark on Xperia Play running Android 2.3, getting 13 frames per second:

FishIE Tank benchmark on iPad 2, getting 57 frames per second:

FishIE Tank benchmark running on PlayStation 4, getting 22 frames per second:

Even non-graphical JavaScript benchmarks continue this trend on the Vita. Looking at the Octane v1 benchmark from Google, the Vita gets the lowest score of any device tested, 15.4, on the Richards test. For comparison, I found the lowest-powered device with the oldest browser I could -- a Kindle 2. The Kindle 2, running the 3.3.x software, gets a score of 51.0 on the Richards part of the Octane v1 benchmark. The abhorrent performance of the Vita browser continues throughout the Octane v1 benchmark (and even into Mozilla's Kraken 1.1 benchmark).

I mention the first test from each benchmark here, but the results are quite consistent across all the tests -- the ones that don't crash the browser, anyway. The Splay test in Octane v1 causes both the PS4 and Vita browsers to run out of memory and make the OS a bit unresponsive, and the Kindle 2 browser also has trouble with that particular test. The iPhone 3GS and iPad 2 also have a bit of trouble, but all those browsers still outperform the Vita in the benchmarks that do complete. The Xbox 360 browser crashes during the Richards test, while the PS3 has decent performance on Richards (2x the Kindle 2) but crashes during the Crypto test. The Android 2.3 browser on the Xperia Play doesn't grok some basic aspect of Octane v1 and redirects to a bogus URL.


Octane v1 benchmark running on Kindle 2, getting 49.8 in the Richards test:

Octane v1 benchmark running on PS Vita, getting 15.4 in the Richards test:

Octane v1 benchmark running on iPad 2, getting 1,742 (yes, 2 orders of magnitude higher than PS Vita) in the Richards test:

Octane v1 benchmark running on Xbox One, getting 150 (10x slower than PS4 and iPad 2) in the Richards test:

Octane v1 benchmark running on PlayStation 4, getting 1,737 (pretty good!) in the Richards test:

The Kraken 1.1 benchmark tests from Mozilla run so slowly on the Vita that I thought the device was hung. The Vita browser finishes the ai-astar test in the Kraken benchmark in 167,658ms -- over an order of magnitude slower than the next-slowest browser. The Xbox 360 was also incredibly slow, but it did complete. The PS3 browser simply crashes with a JavaScript error when trying to run the Kraken benchmark.

Kraken 1.1 benchmark running on Xperia Play with Android 2.3, getting 12,195ms in the ai-astar test:

Kraken 1.1 benchmark running on PS Vita, getting 167,658ms (WTF!?) in the ai-astar test:

Kraken 1.1 benchmark running on Xbox 360, getting 25,223ms in the ai-astar test:

Kraken 1.1 benchmark running on iPad 2, getting 3,805ms in the ai-astar test:

Kraken 1.1 benchmark running on Xbox One, getting 7,902ms in the ai-astar test:

Kraken 1.1 benchmark running on PlayStation 4, getting 10,010ms in the ai-astar test:

The SunSpider 1.0.2 benchmark tells the same story -- the Vita is ridiculously slow, but the PS4 performs comparably to the iPad 2. The Xbox 360 fails to run the 3d-cube test properly, resulting in an artificially low score.


SunSpider 1.0.2 benchmark running on Kindle 2, getting 6,023ms total in the 3D tests:

SunSpider 1.0.2 benchmark running on Xperia Play with Android 2.3, getting 462ms in the 3D test:

SunSpider 1.0.2 benchmark running on PS Vita, getting 4,513ms (an order of magnitude slower than the Xperia Play) in the 3D test:

SunSpider 1.0.2 benchmark running on Xbox One, getting 194ms in the 3D test:

SunSpider 1.0.2 benchmark running on iPad 2, getting 195ms in the 3D test:

SunSpider 1.0.2 benchmark running on PlayStation 4, getting 221ms (pretty good!) in the 3D test:

The good news? The PS4 appears to have a JavaScript JIT in action, but seems to totally lack CPU optimization or GPU acceleration of any kind. The PS3's browser got a major bump to a new version of NetFront (not NX) a few years ago, and many of the media applications on PS3 and PS4 (Netflix, etc.) are really HTML5 applications running a better-optimized build of WebKit than the user-accessible browser application. It seems reasonable that the user-accessible browser on PS3 will eventually get the same WebKit/NetFront NX backend as those applications.

The bad news? The Vita appears to have a severely unoptimized JavaScript *interpreter*, like KHTML (the basis for WebKit) had 5+ years ago. Technically, the Kindle 2's browser also lacks a JavaScript JIT, but it does at least appear to be optimized. My only guess is that the PS Vita web browser is literally an unoptimized debug build, which is at best an embarrassing oversight on the part of Sony's release engineering. (Perhaps some of the Vita homebrew/reversing community can take a look and verify this.)

The ugly news? html5test.com, like acid3.acidtests.org, only tells part of the story: 1) companies that sell browser middleware (NetFront) put crazy hacks in to make the basic tests pass and improve the overall score, but applications that try to actually use the features being tested for will fail spectacularly; and 2) a feature that is present but extremely under-performant in industry-accepted benchmarks or on popular web sites is not very useful. (A good example of this is the current PS3 browser appearing to pass the Acid2 and Acid3 tests, but mis-rendering the BetaFishIE CSS3 animation demo.)

In the meantime, we can hope that Sony will give the dedicated PS3 browser the same WebKit backend used by the new PlayStation Store application, and that the PS4 and Vita will both get reasonably optimized builds of a refreshed NetFront NX browser soon. Since the toolchains for all of these platforms support Link-Time Optimization and Profile-Guided Optimization, just recompiling the existing code with a runtime profile of these benchmarks (as Firefox, Chrome, etc. do) would likely make an enormous difference in performance while also bringing the binary (and therefore resident memory) size down. Let's hope they do it quickly, so that people have one less reason to jailbreak the devices and further open the platforms to increased piracy. It's all about providing value.

Below is a table with some of the relevant benchmark values on the different platforms. The Vita performed the same regardless of whether it was on AC or battery power, with or without a fresh boot. The PS4 performed the same whether or not a game was "suspended" in the background, with or without a fresh boot.

Platform             | Fishbowl | FishIE Tank | Octane v1: Richards | Kraken 1.1: ai-astar | SunSpider 1.0.2: 3d
PlayStation 4 (1.75) | 2fps     | 22fps       | 1,737               | 10,010ms             | 221ms
Xbox 360 (Jun 2014)  | 60fps    | 60fps       | BROKEN              | 25,223ms             | BROKEN
PlayStation 3 (4.60) | BROKEN   | 1fps        | 95                  | CRASH                | 334ms
iPad 2 (7.0.2)       | 57fps    | 60fps       | 1,742               | 3,805ms              | 195ms
Xperia Play (2.3.4)  | N/A      | 13fps       | N/A                 | 12,195ms             | 462ms
Kindle 2 (3.2.1)     | N/A      | BROKEN      | 49.8                | N/A                  | 6,023ms
PS Vita (3.18)       | 1fps     | 6fps        | 15.4                | 167,658ms            | 4,515ms
Xbox One (Aug 2014)  | 60fps    | 60fps       | 150                 | 7,902ms              | 194ms
(Higher is better for Fishbowl, FishIE Tank, and Octane; lower is better for Kraken and SunSpider.)

Saturday, August 09, 2014

spies versus mocks, 2014 edition

Johnny says to learn to use your debugger, kids. Also, chew your gum.

After 6 years of mostly doing low-level C and C++ agile engineering (along with some personal C# projects), I came back into the Ruby and JavaScript fold last year when I accepted a CTO position. I hadn't paid incredibly close attention to the Rails and front-end JavaScript worlds, but I was very much looking forward to seeing how things had improved from a developer practices and tooling perspective.

One interesting thing is that edit/test cycles are so fast at this point with interpreted (read: non-compiled) languages that many developers don't really use the debugger. To them, it's easier to iterate 2 or 3 times by adding console.log() or Rails.logger.* calls, re-running the test (sometimes manually), and checking the output. An older engineer I worked with over a decade ago called this practice of incrementally adding temporary log messages "tombstoning": tombstones are what you trip over when running through a graveyard. This practice isn't really practical in large C/C++ codebases, where even the fastest of build+test cycles is still in the high double-digit seconds.

Despite the process being fast enough to iterate this way thanks to guard-livereload (and other great tools), I still see the waste of context switching in this practice -- especially when the cycle involves going from the code editor to a web browser, sometimes having to refresh manually, and then manually clicking through a use case. With the several people I paired with on Rails, Node.js, and ReactJS+Backbone components over the last year, I had to re-introduce and gently push the practice of using the debugger to set a breakpoint, which cut the number of times we had to re-run the test (and context-switch between several other developer tools). Chrome, Firefox, and Internet Explorer 11 all have fantastic debuggers (each with their own strengths and weaknesses), and RubyMine's Ruby/Rails debugger integration works *extremely* well in version 6.3.x on OS X Mavericks. What I often heard is that both JavaScript and Ruby debuggers had *previously* been flaky and/or hard to use, so people fell back on 1960s-style debugging practices.
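
To make the contrast concrete, here's a minimal sketch of the two approaches, using a hypothetical deliver_all method in a Rails app (the pry gem provides the binding.pry breakpoint; any Ruby debugger works the same way):
--
# Tombstoning: sprinkle temporary log lines, re-run, read the output, repeat.
def deliver_all(invitations)
  Rails.logger.debug "deliver_all: got #{invitations.size} invitations"  # tombstone
  invitations.each do |invitation|
    Rails.logger.debug "delivering #{invitation.inspect}"                # another tombstone
    invitation.deliver
  end
end

# Breakpoint style (alternative version of the same method, shown for contrast):
# one line, then interactively inspect variables and walk the stack.
require 'pry'

def deliver_all(invitations)
  binding.pry   # execution pauses here; examine `invitations`, step, or continue
  invitations.each(&:deliver)
end
--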

Wait, isn't this post supposed to be about Strict Mocks versus Spies? Yes, it was -- and it is. The perception of the maturity and reliability of debuggers, and of tooling in general, plays into why many of the developers I have been working with favor spies over strict mocks. There is also seemingly still a general affinity in the Ruby community toward optimizing code for readability, often by leveraging deep Ruby features and optional syntax to create DSLs that read from left to right, at the expense (or as an afterthought) of other medium-term aspects of developer productivity* -- namely, how efficiently you can figure out what's going on when a test FAILS at runtime.

We have all heard the adage from Bob Martin and Kent Beck that code will be read many more times than it is written, and that this is the economy to optimize for. I totally agree, but like many XP practices, the principle behind this is a reaction against an anti-pattern of "macho" coding where people are proud of their 280-character one-liners with no spaces whatsoever. The principle of being mindful of your current and future peers -- and your future self -- by making code, especially tests, approachable and "as simple as possible, but no simpler" is an important one. Unfortunately, between the DSL mania at the end of the last decade and the perception of poor tooling, people focused purely on the static approachability of code and tests and not on the runtime aspects.

The nice thing about Spies is that they blend well with existing assertions, coming after the arrangement and invocation of the module/class/function being tested. Consistency is good. The downside is when a Spy assertion like this fails:
--
let(:invitation) { double('invitation').as_null_object }

before do
  # act first...
  InvitationSender.send(invitation)
end

it "delivers invitations to both recipients" do
  # ...then verify afterwards, spy-style
  expect(invitation).to have_received(:deliver).twice
end
--

Let's say the test fails because the invitation.deliver method is called three times instead of the specified two. Where did the third call come from? At this point, the next step might be to do a grep/find in the sources looking for references to the deliver method and then eyeball the code, playing computer in our heads to see where the logic fell down. Now, if we are keeping our changes small and our intervals between test feedback short, then both of those practices reduce the scope we have to review manually. Still, let's say that we just upgraded a Ruby gem, Node.js package, or what have you -- there's no easy way to slice a common development task like that any thinner.

Manually reviewing in an instance like this often means trying to follow the call stack down in the IDE and ending up in your third-party gems/modules. This huge context switch is a major productivity killer, and can be even more frustrating if you and your pair were previously cruising along. What we really want to know is, "where did the unexpected call come from?" -- more specifically, "what is the call stack (and conditional branches) that led to the unexpected call?".

This is where strict mocks come in handy: in this instance, when the expected number of calls is exceeded, or a parameter constraint does not match, an exception is thrown/raised and we get a call stack in our console output. If we need more context, we can just re-run the test, under the debugger this time. When the invariant violation for the mock is thrown, we'll be able to switch across the frames of the call stack and inspect local variables right then. It's possible we'll need one more cycle where we set specific breakpoints, but often the cycle ends there -- inspecting the variables in the stack frames and in the global context gives us enough information. We also reduce the context switches between different tools that have differing visual layouts, key bindings, etc.
--
let(:invitation) { double('invitation').as_null_object }

it "delivers invitations to both recipients" do
  # strict mock: set the expectation up front, then exercise the code under test
  expect(invitation).to receive(:deliver).twice
  InvitationSender.send(invitation)
end
--

So, there's a pretty good case for using strict mocks instead of spies when our test is pinning down the *maximum* number of calls expected for a mocked method/function, or constraining the parameters passed to the mock. What about other kinds of constraints, like saying a method is called at *least* a certain number of times? In that case, we can't fail the test until the actions of the object under test are finished -- there's no real way of catching the object misbehaving "in the act" like there is when setting an expectation for a maximum number of calls. So a spy/verify approach is perfectly reasonable, as in the sketch below.
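
Here's what that looks like, reusing the hypothetical invitation double and InvitationSender from above; at_least is a standard rspec-mocks count constraint:
--
let(:invitation) { double('invitation').as_null_object }

it "delivers at least one invitation per recipient" do
  InvitationSender.send(invitation)
  # a minimum-count check can only be verified after the fact --
  # there is no "one call too many" moment to catch in the act
  expect(invitation).to have_received(:deliver).at_least(:twice)
end
--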

At this point, some people may be worried about consistency: if I use strict mocks some of the time (during the setup/arrange part of a test) and spies some of the time, then it will be 1) harder to read and 2) harder for people who are new to testing/mocks/etc. to understand when to use which style. Both points are valid, and I would therefore recommend consistently using strict mocks. Note that it is possible to go overboard with strict mocks, especially with parameter constraints or ordered calls, and end up with tests that are too tightly coupled to the implementation details. This can result in 1) tests that have a near-parallel change rate to the implementation, making for higher test maintenance; and 2) cascading failures where many tests all fail for the same reason -- both classic ways of reducing the ROI of developer testing. Sometimes this is a smell that the class(es) you are mocking have too granular a public interface. Sometimes the tight coupling between tests and implementation details comes from someone with a design-by-contract (DbC) background not realizing the longer-term maintenance issues they might be introducing by adding things to the test that do not relate to the specific scenario the test is documenting. It can also come from someone who's just plain excited about the mock framework being used and wants to try every feature of it. It's also possible to write spies that are tightly coupled to the implementation, so neither style is a panacea for the problem of properly onboarding and mentoring new developers in your codebase and in developer testing practices at large.
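
To illustrate the over-specification trap mentioned above, compare an expectation that encodes incidental details against one pinned to just the behavior the scenario is documenting (a sketch with hypothetical arguments; with and anything are standard rspec-mocks argument constraints):
--
# Over-specified: couples the test to argument details this scenario doesn't care about,
# so refactoring how the address is passed breaks the test even though behavior is unchanged.
expect(invitation).to receive(:deliver).with("alice@example.com", anything).twice

# Scoped to the scenario: only pins down what "delivers invitations to both recipients" means.
expect(invitation).to receive(:deliver).twice
--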

In summary, all Ruby and JavaScript (front-end and Node.js) developers should take half a day and learn to use the debugger in whatever IDE/runtime they've chosen. Add a throw/raise and make sure the debugger stops, inspect the different stack frames, and explore the variables/objects in those frames. Consumers of test/mock frameworks, and the designers of those frameworks, need to be mindful of optimizing not just the writing and reading aspects, but also the debugging experience when a given assertion or mock constraint fails. If the observe->understand->fix->re-test cycle is too expensive, then the longer-term maintenance cost/friction of a reasonably growing test suite will erode the ROI of developer testing and potentially cause a backlash. Like most things when it comes to development practices, the whys and the hows are fairly complex and nuanced -- there's no bumper-sticker way to drum up zealotry.

* I think a big part of why medium- to long-term developer productivity issues don't come into play is that these frameworks are often driven by consultants who never have to deal with a given product/project beyond the first 3-6 months or the first 6-12 full-time engineers. As a result, a lot of the perceptions, practices, and frameworks put forth by shorter-term consultants tend to over-optimize for the front end of a project -- inception and bootstrapping -- and its initial developers.