Protecting Mozilla Firefox users on the web

I have followed Pwn2Own ever since its inception in 2007. For those of you who do not know what Pwn2Own is, it is a competition in which hackers try to exploit software weaknesses in browsers (Internet Explorer, Firefox, Chrome, Safari, etc.): they put up specially crafted webpages and load them in the target browser to try to launch another application, usually calc.exe, earning a monetary reward in return. It usually happens on the sidelines of CanSecWest, a yearly security conference held in Vancouver.

During my university days in Singapore, on the other side of the world, I always followed this competition with anticipation. I told myself that one day, just one day, I would be at the front line, helping to decipher such a problem and to get the fix out to Firefox users around the world as soon as possible.

Over the years, a security researcher by the name of Nils took down Firefox in 2009 (bug 484320) and again in 2010 (bug 555109), while in 2011 nobody took down Firefox.

In 2012, I was on-site in Vancouver and witnessed Willem Pinckaers and Vincenzo Iozzo take down Firefox. However, the bug (720079) had already been identified and fixed through internal processes.

This year, Pwn2Own became the venue for many exploits against major browsers, including Firefox (bug 848644), as well as against plugins commonly used in browsers, such as Flash and Java. The team that took down Firefox this year was VUPEN Security, who also punched holes through Internet Explorer 10, Java and Flash.

Some of my colleagues were present at the conference and relayed information to us live, while I stayed back at the office preparing my machines to diagnose the issue.

===

The following timeline (all times PST) describes my role behind the scenes with respect to the Firefox exploit by VUPEN, on March 6, 2013:

~3pm: Rumblings heard on IRC channels that the Firefox attempt has been moved from its scheduled slot to 5.30pm.

5.30pm: VUPEN gets ready.

~5.54pm: VUPEN takes down Firefox. On-site team gets to work getting details of the exploit.

~7pm: Bug 848644 gets filed.

Looking at the innards of the testcase, and after confirming with team members over IRC that there was no malicious code present (the Proof of Concept (PoC) code just crashes), I managed to reproduce the crash on a fully-patched Windows 7 system.

More analysis from early responders flowed in: information such as the attack vector (Editor) and an Asan stack trace showing the implicated functions (possibly nsHTMLEditRules::GetPromotedPoint).

I took a quick stab at finding the regression range. Using a bisection technique over nightly builds, I found that early January 2012 builds did not crash, whereas early January 2013 builds did.

The testcase seemed tricky at first; until it was eventually found (quite a while later) that one could reliably trigger the crash with a single tab that somehow caused the “pop-up blocked” info bar to show, I had to try the testcase repeatedly, sometimes reloading, sometimes closing and reopening the browser, to trigger the crash.

Using mozregression here might have been a good idea; however, with a crash this intermittent, one incorrect decision about whether a particular build crashed would send the bisection down to an incorrect regression window and waste precious time.
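
For reference, mozregression automates exactly this kind of binary search over nightly builds. As a rough sketch, it could have been pointed at the initial known-good and known-bad dates like this (flag names as in current versions of the tool; the interface back then differed slightly):

    mozregression --good 2012-01-01 --bad 2013-01-01

It downloads the nightly roughly halfway between the two dates, asks you whether that build is good or bad, and repeats – which is exactly why a single wrong answer on an intermittent crash derails the whole bisection.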

Time was of the essence here – the sooner one gets an accurate regression window, the faster a developer can potentially pinpoint the cause of the crash.

I found myself repeatedly downloading and checking builds to see whether they crashed. Sometimes the crash happened immediately on load (with the initial PoC); other times it happened only after a few minutes, or only after a restart.

I eventually settled on the following regression window: the crash happens on the October 6, 2012 nightly, but not on the previous day’s (October 5). I posted a comment so that other people could confirm this, then immediately looked through the hgweb regression window to see if anything stood out – bug 796839 seemed a likely cause, but everything else was still a possibility.
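
For those unfamiliar with the process: the hgweb pushlog lists every changeset that landed between two builds. The query takes roughly the following form, where GOODREV and BADREV are placeholders for the changeset IDs baked into the October 5 and October 6 nightlies:

    https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=GOODREV&tochange=BADREV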

Within that regression window, more clues emerged. The Asan stack trace pointed to nsHTMLEditRules::GetPromotedPoint being part of the bigger picture here, and some detective work showed that a changeset from bug 796839 had changed editor/libeditor/html/nsHTMLEditRules.cpp – the very file in which nsHTMLEditRules::GetPromotedPoint was located.

Coincidence? Possibly. However, this made bug 796839 all the more likely. By this point it was 8pm, approximately one hour after the testcase had been obtained.

I began to consider (and rule out) other possibilities, including bug 795610. Thanks to great work by Nicolas Pierron and his git wizardry, we found that nsHTMLInputElement::SetValueInternal (also implicated in the Asan stack trace) lived in nsHTMLInputElement.cpp, which was modified in that bug. However, this possibility was quickly discounted.

At this point, I got independent verification that the regression window (Oct 5 – Oct 6) was indeed correct. Further checking showed that our Extended Support Release (ESR) builds on version 17 were also affected.

This made bug 796839 extremely likely to be the root cause, because it landed on mozilla-central during the version 18 nightly window but was also backported to mozilla-aurora, which at that time was the version 17 branch. Bug 796839 would thus encompass the patch landing that inadvertently opened up the vulnerability in Firefox.

Independent confirmation of this regressor came at 9pm.

Within two hours, we had gone from having a PoC testcase with no idea what was affected to knowing which patch caused the issue. I thus nominated the fix to be landed on all affected branches.

By about 10pm, the fix was put up for review. After that, lots of great work by various people and teams went towards quick approvals, landing the fix, and QA verification.

Overnight, builds were created, and by late morning the next day the advisory was prepared, with QA about to sign off on the new builds.

At 4pm, a new version of Firefox (19.0.2) was shipped with the fix.

===

Credit must be given to the other Mozilla folks in this effort, who worked late into the night, well outside normal working hours, to make this possible. I am proud to be part of this fabulous team effort.

It certainly has been my honour to have helped keep Mozilla users safe on the web.

Valgrind builds are now green on TBPL

Valgrind builds are now green on TBPL as of this morning!

I filed bug 800435 to get the build unhidden – previously it was hidden on tbpl.mozilla.org (TBPL) because it was always fiercely burning.

Note that some of the builds you see in the screenshot were manually triggered; otherwise, only one per day is automatically scheduled.

What are Valgrind builds?

Running Firefox binaries with Valgrind helps to detect run-time memory management bugs: it finds problems such as use-after-frees (invalid reads/writes), uses of uninitialized values, and memory leaks.
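
To make this concrete, here is a minimal standalone sketch (not Firefox code) of the kind of bug Memcheck catches. Running it under Valgrind reports an invalid read at the marked line, along with the stacks where the memory was freed and originally allocated:

    /* uaf.c – a deliberate use-after-free for Memcheck to flag.
       Build and run:
         gcc -g uaf.c -o uaf
         valgrind --leak-check=full ./uaf                        */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        int *p = malloc(sizeof *p);
        if (!p) return 1;
        *p = 42;
        free(p);
        printf("%d\n", *p); /* Memcheck: "Invalid read of size 4" */
        return 0;
    }

Having both the offending access and the allocation/free stacks in one report is exactly the kind of context that makes these bugs actionable in a codebase the size of Firefox.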

Note that the run-time speed of the application takes a substantial hit – it can take a long time just to start up a Valgrind build. Moreover, a fairly powerful computer running Linux / Mac (preferably Linux) with about 4 GB of memory is recommended. Due to the slowdown, tests are run only once a day, and we currently run only the PGO tests (a small subset of our tests; note that our Valgrind builds are not PGO, though).

How was this accomplished?

It is important to note that a lot of work was put in by other folks a year or two ago to get Valgrind builds showing up at all, before they were hidden by default on TBPL for being perma-red. I then stepped up to help turn the build green, since I had some experience running JavaScript binaries with Valgrind, and we all love greenery. :)

When I first embarked on this about 3-4 weeks ago, I found and helped to fix 3 harness bugs, assisted in upgrading Valgrind twice, and detected 35 potential issues at the time of writing (some of which were intended leaks), with 3 non-sensitive ones already fixed, some being recent regressions, and one other being potentially security-sensitive. With the issues now known and filed as bugs, they were added to suppression files, which also live in the mozilla-central tree. I also accidentally stumbled on a supposed TBPL selfserve bug that turned out to be a Firefox regression.
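
For illustration, a Memcheck suppression entry looks roughly like the following (the name and the second frame are made up for this sketch); any error whose stack matches the listed frames is then silenced on subsequent runs:

    {
       hypothetical-intentional-leak
       Memcheck:Leak
       fun:malloc
       fun:SomeIntentionallyLeakedSingleton
       ...
    }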

How can we help?

This is just a small step forward. In the future, we should ideally extend this coverage further, and we should also incorporate AddressSanitizer (Asan) builds into TBPL. (Asan is a faster memory error detector than Valgrind, and sometimes finds a different class of bugs, but it does not detect uninitialized values the way Valgrind does.)
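
The difference in approach is that Asan instruments code at compile time, whereas Valgrind instruments unmodified binaries at run time. As a rough sketch, the standalone uaf.c example from above could be rebuilt and run under Asan like so (actual Firefox Asan builds involve considerably more configuration):

    clang -fsanitize=address -g uaf.c -o uaf-asan
    ./uaf-asan

The instrumented binary aborts with a heap-use-after-free report at the bad read, typically at a small fraction of Valgrind's run-time overhead.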

Christian Holler [:decoder] has some regular Asan builds but they are only run through the Try servers.

Anything else?

Shout-outs go out to the following people: Julian Seward, Nicholas Nethercote, Jesse Ruderman, Ted Mielczarek, Releng folks Chris AtLee, Nick Thomas and Rail Aliiev, our sheriffs edmorley, philor, RyanVM, and all others whom I have inadvertently left out. Without any of your collaboration and hard work we would be unable to have this set of Valgrind greenery. You folks rock!

Edit: Bug 800435 has been fixed, and Valgrind builds now appear by default – thanks, Ehsan!

Edit 2: It’s been re-hidden because it’s “not a tier-1 platform”.

Bonus: Here’s a video showing the Endeavour Space Shuttle flypast in Mountain View. Just in case you haven’t seen it. :)