Safe HTML and XSS - Stack Overflow

As I've mentioned before, we are using the most excellent WMD Markdown editor, for the reasons I outlined in that post. However, Markdown, per the official spec, supports both HTML syntax and Markdown syntax. You can mix and match both syntaxes freely. This is great if you want to stick with HTML and not learn any of the Markdown syntax, something I've actually argued for in the past. However, I would also argue that Markdown is much less typing for the same effect, and it's easier to read, so it's worth learning. Markdown will save you time in the long run.

Allowing HTML is great for flexibility and choice, but it's perhaps too much of a good thing: you can use any HTML. Try it yourself -- visit the advanced WMD demo and just start keying in whatever kind of wacky HTML you can dream up. Go ahead. Try it.

This is bad. Very, very bad. The WMD control renders exactly the HTML you type, and submits it as-is to the server. Which means we, our webserver, our webpages, could be rendering JavaScript of unknown provenance. That's cross-site scripting (XSS) in a nutshell.
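To make the problem concrete, here is a minimal Python sketch of what "renders exactly the HTML you type" means. Everything in it is hypothetical: the function names, the `comment` wrapper markup, and the payload are made up for illustration and are not how WMD or our pages actually work.

```python
import html

def render_comment_verbatim(user_input: str) -> str:
    # Hypothetical page template: the input is dropped into the markup as-is.
    return '<div class="comment">' + user_input + "</div>"

def render_comment_escaped(user_input: str) -> str:
    # html.escape turns <, >, & and quotes into entities, so the browser
    # displays the text instead of interpreting it as markup.
    return '<div class="comment">' + html.escape(user_input) + "</div>"

# A classic illustrative payload: script that reads the visitor's cookies.
payload = "<script>alert(document.cookie)</script>"

verbatim_page = render_comment_verbatim(payload)
escaped_page = render_comment_escaped(payload)
```

The verbatim version hands a live script tag to every visitor's browser. Escaping everything closes that hole, but it also disables all the HTML that Markdown is supposed to allow, which is why some form of selective sanitizing is needed instead.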

In recent years XSS surpassed buffer overflows to become the most common of all publicly reported security vulnerabilities. [ed: the last time I wrote about this, in early 2007, buffer overflows were more common.] Likely at least 70% of websites are open to XSS attacks on their users. Site administrators rarely fix XSS problems and, when they do, the hole is likely to have been open for more than a month and a half. In general, cross-site scripting holes can be seen as vulnerabilities present in web pages which allow attackers to bypass security mechanisms. By finding clever ways of injecting malicious scripts into web pages, an attacker can gain elevated access privileges to sensitive page content, session cookies, and a variety of other objects.

Incredibly scary stuff. And it's all due to insufficient sanitization of user input, where HTML, or some subset of HTML, is allowed. Check out some of the standard XSS exploits for examples of clever ways hackers can exploit the tiniest of oversights in your HTML input sanitizing. Think there are just five or six ways to build an <a> or <img> tag? Think again. There are hundreds!

So that's my challenge with the WMD editor. I have to write XSS-proof code to sanitize the HTML input on the server before I write it to the database. I'd like your feedback on how best to do this. Here's my general approach, in pseudocode form. Given an arbitrary HTML string:

1. Run a regular expression to match all the HTML <tags> in the HTML string.
2. For each individual tag match, verify that it passes our tag regular expression whitelist.
3. If the tag match does not pass, remove the entire tag from the content.
4. Repeat from step 2 until we're out of tags.
5. Return the sanitized HTML string.
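The steps above can be sketched roughly as follows. This is a Python illustration of the general approach, not the C# SanitizeHtml routine itself. The tag-matching pattern, the particular whitelist of tags, and the names `TAG_PATTERN`, `ALLOWED_TAG_PATTERNS`, `is_allowed`, and `sanitize_html` are all assumptions chosen for the example.

```python
import re

# Step 1: anything that looks like a tag, including an unclosed "<..." that
# runs to the end of the string. (Assumed pattern, for illustration only.)
TAG_PATTERN = re.compile(r"<[^>]*(>|$)")

# Step 2: a hypothetical whitelist. A tag survives only if the WHOLE tag
# matches one of these patterns, so stray attributes disqualify it.
ALLOWED_TAG_PATTERNS = [
    # Bare formatting tags with no attributes at all.
    re.compile(r"</?(b|blockquote|code|em|i|li|ol|p|pre|strong|ul)>", re.IGNORECASE),
    re.compile(r"<br\s?/?>", re.IGNORECASE),
    # Links: a single href attribute, http or https only.
    re.compile(r'<a\shref="https?://[^"<>\s]+">', re.IGNORECASE),
    re.compile(r"</a>", re.IGNORECASE),
]

def is_allowed(tag: str) -> bool:
    """Return True if the tag fully matches one whitelist pattern."""
    return any(pattern.fullmatch(tag) for pattern in ALLOWED_TAG_PATTERNS)

def sanitize_html(html_text: str) -> str:
    """Steps 3 to 5: drop every tag that fails the whitelist, keep the rest."""
    def keep_or_drop(match: re.Match) -> str:
        tag = match.group(0)
        return tag if is_allowed(tag) else ""
    # re.sub visits every tag match in turn, which covers the "repeat" step.
    return TAG_PATTERN.sub(keep_or_drop, html_text)
```

For instance, `sanitize_html('<b>bold</b><script>alert(1)</script>')` keeps the `<b>` pair and strips both script tags, leaving the inner text `alert(1)` behind as harmless plain text. Two limitations of this sketch are worth knowing: removing an opening tag can leave its closing tag dangling (a rejected `<a href="javascript:...">` still leaves `</a>`), and a literal `<` in ordinary text is treated as the start of a tag.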

Update: removed unnecessary extra code; all input is processed by the HTML sanitizer. It's slightly too much code to post here in a blog entry, so I have posted my C# SanitizeHtml routine on RefactorMyCode.com [ed. note: site is spam now, so link has been removed]. Please take a look and let me know what you think. (Scroll to the bottom, however, to see the latest "refactoring".) Help me refactor my code, because I make bad software, with bugs!

I've been itching for an excuse to link to RefactorMyCode for a while. It's a great site for coders, and signing up to submit code is super easy through OpenID -- no redundant account creation necessary! Even if you have no interest whatsoever in my crappy SanitizeHtml function, I encourage you to visit RefactorMyCode [ed. note: Actually, don't. URL campers have it and put something shady there] and consider the value of many internet eyes on a snippet of your code.
