Playing with Bun's URL parser
During the latest SECCON CTF quals, I ended up looking at Bun’s URL parser, hoping to find some weird behavior that could help me solve a challenge.
In particular, we were presented with this code:
const LOCALHOST = new URL(`http://localhost`);
const url = new URL(req.url, LOCALHOST);
if (url.hostname !== LOCALHOST.hostname) {
res.send("Try harder 1");
return;
}
if (url.protocol !== LOCALHOST.protocol) {
res.send("Try harder 2");
return;
}
// ... other chall details ...
res.send(await fetch(url).then((r) => r.text()));
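To make the checks concrete, here is a quick sketch of what they compare, using the standard WHATWG URL class (run under Node here; the inputs are mine, not part of the challenge):

```javascript
// How req.url is resolved against the base, and what the two checks compare.
const LOCALHOST = new URL("http://localhost");

// A plain path keeps the base's host and protocol: both checks pass.
const ok = new URL("/somewhere", LOCALHOST);
console.log(ok.hostname, ok.protocol); // "localhost" "http:"

// An absolute (or protocol-relative) URL replaces the host: the first check fails.
const evil = new URL("//attacker.example/x", LOCALHOST);
console.log(evil.hostname); // "attacker.example"
```

So to reach the fetch(), the attacker-controlled req.url must parse to hostname localhost and protocol http: under this parser.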
After some digging in the source code, it turned out that in Bun the URL
class uses WebKit’s URL parser, while fetch()
uses a custom Bun one, implemented in the url.zig
file. Given this information, inspired by Orange’s good old “A New Era of SSRF” talk, I was determined to find some parsing differentials to solve the challenge.
While this ended up leading to nothing, and in no way to me solving the challenge, I have found plenty of weird (and wrong) parsing behaviors in the custom parser. Sadly for me, but thankfully for everybody else, none of these are currently exploitable, or require a completely unreasonable threat model to be considered security issues. They resulted in a bunch of GitHub issues, though (#16181, #16182, #16183, #17435). Hopefully I haven’t ruined anybody’s future challenge :’).
Before we start
The reason these issues are not exploitable is that, prior to using the custom parser, Bun always† uses the WebKit parser to normalize the URLs. Because of this, all but one of these parsing mistakes cannot be used in practice.
†Actually, there is one case in which URLs are not normalized first, and that’s when parsing the values of the HTTP_PROXY
and HTTPS_PROXY
environment variables. These two URLs are subject to every one of the errors I identified, but requiring an attacker to run Bun processes with a custom environment is a pretty unreasonable scenario, in which URL parsing mistakes would probably be the least of the issues.
If you are interested in the setup I used to debug and test the parser, as well as my struggles with the Zig language, you can read more at the end of the post.
Auth user parsed as part of the host
This is the only issue that isn’t mitigated by WebKit’s normalization, as the URLs that cause it are already in a normalized format.
Let’s start looking at some Zig code!
if (!is_relative_path) {
// if there's no protocol or @, it's ambiguous whether the colon is a port or a username.
if (offset > 0) {
// see https://github.com/oven-sh/bun/issues/1390
const first_at = strings.indexOfChar(base[offset..], '@') orelse 0;
const first_colon = strings.indexOfChar(base[offset..], ':') orelse 0;
if (first_at > first_colon and first_at < (strings.indexOfChar(base[offset..], '/') orelse std.math.maxInt(u32))) {
offset += url.parseUsername(base[offset..]) orelse 0;
offset += url.parsePassword(base[offset..]) orelse 0;
}
}
offset += url.parseHost(base[offset..]) orelse 0;
}
When parsing a URL, after the scheme, we could encounter a :
(colon) for two reasons: either it separates username and password in the authentication, or it separates the hostname from the port. Any other :
should be URL-encoded, and has no special meaning in the URL.
To differentiate these two important cases, Bun takes the index of the first occurrence of a :
and the index of the first occurrence of an @
(at), and uses them to identify the situation accordingly.
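Stripped of the Zig details, the heuristic can be modeled like this (a sketch of the logic above, not Bun's actual code):

```javascript
// A tiny JavaScript model of Bun's heuristic. `rest` is whatever follows
// the scheme, e.g. "user@example.com:1337". Returns whether the parser
// attempts to parse a username/password before the host.
function attemptsUserinfoParsing(rest) {
  // Bun falls back to 0 when the character is absent ("orelse 0" in the Zig code).
  const firstAt = Math.max(rest.indexOf("@"), 0);
  const firstColon = Math.max(rest.indexOf(":"), 0);
  const firstSlash = rest.indexOf("/") === -1 ? Infinity : rest.indexOf("/");
  return firstAt > firstColon && firstAt < firstSlash;
}

console.log(attemptsUserinfoParsing("user:password@example.com")); // true
console.log(attemptsUserinfoParsing("user@example.com:1337"));     // false
```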
Let’s see some examples:
Example 1
example.com
Here there is no :
nor @
, so the values of first_at
and first_colon
fall back to 0
, and the expression first_at > first_colon
evaluates to false. Everything is parsed as the host.
Example 2
example.com:1337
Here there is a :
, but no @
, so the value of first_colon
is greater than first_at
, resulting in everything being parsed as the host.
Example 3
user:password@example.com
Here we have both a :
and an @
, and since the :
is before the @
, we parse the username and password before parsing the host. Everything is correct!
But having a password is totally optional; we could just have a username… What happens in that case?
Example 4
user@example.com
Since the :
index falls back to 0, we are basically in the same situation as before: username and password are parsed before the host, and the password parser correctly handles an empty string.
But what about… this?
Example 5
user@example.com:1337
Now there is a :
, and this :
follows the @
character, meaning that first_at > first_colon
evaluates to false. The user parsing is skipped, and everything is parsed as the host. This results in a URL having hostname user@example.com
and port 1337
, with no auth.
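For comparison, a WHATWG-compliant parser (Node's URL class here, which follows the same spec as the WebKit parser) handles this exact URL as expected:

```javascript
// WHATWG behavior for Example 5: the "@" splits userinfo from host,
// and the ":" after it is the port separator.
const expected = new URL("http://user@example.com:1337");
console.log(expected.username, expected.hostname, expected.port);
// "user" "example.com" "1337"
```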
Impact
The impact of this is very limited. While we cannot register a TLD containing an @
, we can in theory have a custom DNS server responding for a subdomain containing an @
, provided we can convince a client to resolve it. I was pretty sure I managed to do that with an entry in /etc/hosts
, but I can’t seem to reproduce it anymore, so I might be remembering wrong.
In any case, to trick a check like the one from the challenge, we would need to be in complete control of the name server associated with the subdomain of the whitelisted domain. Pretty hard.
IPv6 parsing
One interesting (and useless) extra point about this parsing mistake is that technically we can have a hostname that is not a subdomain of the original domain. Looking at the parseHost()
method:
pub fn parseHost(url: *URL, str: string) ?u31 {
var i: u31 = 0;
//if starts with "[" so its IPV6
if (str.len > 0 and str[0] == '[') {
i = 1;
var ipv6_i: ?u31 = null;
var colon_i: ?u31 = null;
while (i < str.len) : (i += 1) {
ipv6_i = if (ipv6_i == null and str[i] == ']') i else ipv6_i;
// ... omitted for brevity ...
}
url.host = str[0..i];
if (ipv6_i) |ipv6| {
//hostname includes "[" and "]"
url.hostname = str[0 .. ipv6 + 1];
}
// ... omitted for brevity ...
} else {
// ... omitted for brevity ...
}
}
If the host starts with a [
character, then we consider it an IPv6 address, and extract the hostname using the closing ]
. This means that the following URL
[test]@example.com:1337
will be parsed as having hostname [test]
and no auth.
This is useless in practice: we cannot use a real IPv6 address here, since it contains colons and would therefore trigger the username and password parsing. In addition to this, the normalization process URL-encodes every square bracket in the username.
It’s still a “fun fact” :).
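Both halves of that last point can be checked against a WHATWG parser (Node's URL class; `[test]` is my own placeholder, as in the example above):

```javascript
// A real bracketed IPv6 host keeps its brackets in the hostname...
const v6 = new URL("http://[::1]:1337/");
console.log(v6.hostname); // "[::1]"

// ...while brackets appearing in the userinfo get percent-encoded
// during normalization, so Bun's custom parser never sees them.
const bracketed = new URL("http://[test]@example.com/");
console.log(bracketed.username); // "%5Btest%5D"
console.log(bracketed.hostname); // "example.com"
```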
# is not used as a host delimiter like / and ?
When parsing usernames, passwords, and hosts, the /
and ?
characters are used to recognize the end of the host, and the beginning of the path or query:
pub fn parseUsername(url: *URL, str: string) ?u31 {
// ... omitted for brevity ...
for (0..str.len) |i| {
switch (str[i]) {
':', '@' => {
// ... omitted for brevity ...
},
// if we reach a slash or "?", there's no username
'?', '/' => {
return null;
},
else => {},
}
}
return null;
}
pub fn parsePassword(url: *URL, str: string) ?u31 {
// ... omitted for brevity ...
for (0..str.len) |i| {
switch (str[i]) {
'@' => {
// ... omitted for brevity ...
},
// if we reach a slash or "?", there's no password
'?', '/' => {
return null;
},
else => {},
}
}
return null;
}
pub fn parseHost(url: *URL, str: string) ?u31 {
// ... omitted for brevity ...
//if starts with "[" so its IPV6
if (str.len > 0 and str[0] == '[') {
// ... omitted for brevity ...
while (i < str.len) : (i += 1) {
// ... omitted for brevity ...
switch (str[i]) {
// alright, we found the slash or "?"
'?', '/' => {
break;
},
else => {},
}
}
// ... omitted for brevity ...
} else {
// look for the first "/" or "?"
// if we have a slash or "?", anything before that is the host
// anything before the colon is the hostname
// anything after the colon but before the slash is the port
// the origin is the scheme before the slash
var colon_i: ?u31 = null;
while (i < str.len) : (i += 1) {
colon_i = if (colon_i == null and str[i] == ':') i else colon_i;
switch (str[i]) {
// alright, we found the slash or "?"
'?', '/' => {
break;
},
else => {},
}
}
// ... omitted for brevity ...
}
// ... omitted for brevity ...
}
The issue is that the #
character, used to start the fragment part of the URL, should be used as a delimiter as well. With the current implementation, a URL like http://localhost#example.com
is parsed as having the single host localhost#example.com
.
Impact
This has the same impact as the previous issue, with the added difficulty of having to convince a DNS client to resolve a hostname containing a #
character.
In addition to that, the URL normalization turns http://localhost#example.com
into http://localhost/#example.com
, so the issue is mitigated in practice.
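The WHATWG behavior (Node's URL class here) shows both the correct delimiting and the normalization just mentioned:

```javascript
// "#" terminates the host and starts the fragment; serialization
// inserts the "/" path that defeats Bun's buggy parse.
const fragged = new URL("http://localhost#example.com");
console.log(fragged.hostname); // "localhost"
console.log(fragged.hash);     // "#example.com"
console.log(fragged.href);     // "http://localhost/#example.com"
```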
Someone might argue this is not an issue because of the normalization, and there is no need to add this check. The first counter-argument is that, since ?
is used, and it gets the same normalization treatment, #
should be added for consistency.
In addition to that, relying on the normalization is not necessarily a good approach, as it puts a requirement on the caller, and we cannot guarantee that it is performed.
@ can be used as a username-password separator
Maybe you already spotted it in the previous code snippet, but the current implementation of the parser allows the @
character to be used as a separator between the username and the password, as well as between the password and the host. So http://user@password@example.com
would result in a “Basic auth” with username user
and password password
to the host example.com
. This is obviously not standard behavior.
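For reference, the standard WHATWG behavior (Node's URL class here) is to treat only the last @ as the userinfo/host separator, percent-encoding the earlier ones:

```javascript
// Only the LAST "@" delimits userinfo from host; the first one
// becomes %40 inside the username, and there is no password.
const doubled = new URL("http://user@password@example.com");
console.log(doubled.username); // "user%40password"
console.log(doubled.password); // ""
console.log(doubled.hostname); // "example.com"
```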
The reason this works is that there is always an attempt at parsing the password after parsing the username:
if (first_at > first_colon and first_at < (strings.indexOfChar(base[offset..], '/') orelse std.math.maxInt(u32))) {
offset += url.parseUsername(base[offset..]) orelse 0;
offset += url.parsePassword(base[offset..]) orelse 0;
}
This happens even if the username parsing stopped after encountering a @
character, which indicates the start of the hostname, and therefore a password-less login.
pub fn parseUsername(url: *URL, str: string) ?u31 {
// ... omitted for brevity ...
for (0..str.len) |i| {
switch (str[i]) {
':', '@' => {
// we found a username, everything before this point in the slice is a username
url.username = str[0..i];
return @intCast(i + 1);
},
// if we reach a slash or "?", there's no username
'?', '/' => {
return null;
},
else => {},
}
}
return null;
}
Note that, while every other implementation of fetch()
raises an exception when basic credentials are included, Bun accepts them but silently ignores them (#17435). This is technically a violation of the spec. So even if this could survive the normalization process (which it doesn’t), it would still be useless.
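The spec-mandated behavior is easy to check in Node, whose fetch follows it:

```javascript
// The Fetch spec requires constructing a request from a URL that
// includes credentials to throw a TypeError.
let threw = false;
try {
  new Request("http://user:password@example.com/");
} catch (e) {
  threw = e instanceof TypeError;
}
console.log(threw); // true
```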
Extra @s are pushed to the hostname
Expanding on the previous issue, what happens if we keep adding @
chars? We start pushing stuff into the hostname:
http://user@password@extra@example.com
This URL would be parsed as having username user
, password password
, and hostname extra@example.com
. This has the same implications as the very first issue reported, but in this case the WebKit normalization breaks it completely.
“Debugging” Zig
Debugging and testing all of this is not that straightforward. Given that the custom URL parser is not exposed, and Bun always† normalizes with WebKit first, I had to write some Zig code to access it directly and test it.
First things first, we need to understand how to build the code for Bun. The contributing page includes a guide for installing all the dependencies, but I had some issues installing LLVM/Clang 18 on my system, so I created a Dockerfile for an environment ready to build Bun:
FROM ubuntu:22.04
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update
RUN apt-get install -y curl wget lsb-release software-properties-common \
cargo ccache cmake git golang libtool ninja-build pkg-config rustc \
ruby-full xz-utils
# CMAKE
WORKDIR /root
RUN apt-get install -y build-essential libssl-dev
RUN wget https://cmake.org/files/v3.29/cmake-3.29.2.tar.gz
RUN tar -xzvf cmake-3.29.2.tar.gz
WORKDIR /root/cmake-3.29.2
RUN bash bootstrap
RUN make -j$(nproc)
RUN make install
WORKDIR /root
RUN curl -fsSL https://bun.sh/install | bash
RUN wget https://apt.llvm.org/llvm.sh -O llvm.sh
RUN bash llvm.sh 16 all
If you want to use this, you can just run the container with the Bun source directory mounted, and build as usual with bun run build
.
Hopefully this won’t desync too soon with the correct building steps and dependencies, but in case this doesn’t work, you can always check the contributing page for the updated instructions.
Now, what I wanted to do was to write a simple main.zig
file using the URL parser, and use it to test and manually fuzz it. The very first issue with this plan is that I had no idea what I needed to import, nor how to properly build my custom code.
So, the next idea was to take the already existing main.zig
file, replace the main()
function, and run bun run build
, hoping that this way I would have everything I needed.
The issue is that Zig is very annoying: just like Go, it won’t let you have unused variables, switch cases must handle every possible value, etc… It’s far from easy when some of these unused variables are allocators too, passed through a lot of functions, and you don’t know anything about them.
Build Summary: 2/5 steps succeeded; 1 failed
obj transitive failure
├─ zig build-obj bun-debug Debug x86_64-linux-gnu.2.27 1 errors
└─ install generated to bun-zig.o transitive failure
└─ zig build-obj bun-debug Debug x86_64-linux-gnu.2.27 (+2 more reused dependencies)
src/cli.zig:1806:9: error: switch must handle all possibilities
switch (tag) {
^~~~~~
Maybe I did something wrong, but it also felt very aggressive in trying to remove all unused code, failing to include some of the external C implementations. Every piece of Bun code I tried removing translated into a build error. Maybe this isn’t even a Zig issue, and it comes from the way bun run build
is configured, I have no idea.
ld.lld: error: undefined symbol: BakeProdLoad
>>> referenced by BakeGlobalObject.cpp:112 (/app/bun/build/debug/../../src/bake/BakeGlobalObject.cpp:112)
>>> CMakeFiles/bun-debug.dir/src/bake/BakeGlobalObject.cpp.o:(Bake::bakeModuleLoaderFetch(JSC::JSGlobalObject*, JSC::JSModuleLoader*, JSC::JSValue, JSC::JSValue, JSC::JSValue))
clang++: error: linker command failed with exit code 1 (use -v to see invocation)
ninja: build stopped: subcommand failed.
cmake took 4.97 minutes
Command exited: code 1
Adding another layer of frustration, the build process uses a lot of memory; when it fills all my RAM and the kernel kills the process, the result is a compilation error that makes it look like either the code or its dependencies are broken. It took me a while to understand that it was just the OOM killer doing its job, and that I hadn’t broken the source.
Given that all these attempts took quite some time to test, as the compilation process is not really fast on my laptop, in the end I decided to keep every little bit of the Bun code, and hijack an existing CLI command to run my tests. In particular, the bun discord
command just prints a link, so I could add the code I wanted to test to its implementation:
switch (tag) {
.DiscordCommand => {
testURL("http://localhost");
testURL("http://localhost@example.com");
testURL("http://localhost@example.com:1337");
testURL("http://localhost#example.com");
testURL("http://localhost#example.com:1337");
testURL("http://[testinutile]@example.com");
testURL("http://[testinutile]@example.com:1337");
testURL("http://[::ffff:127.0.0.1]@example.com");
testURL("http://[::ffff:127.0.0.1]@example.com:1337");
testURL("http://[::ffff:127.0.0.1]/@example.com");
testURL("http://[::ffff:127.0.0.1]/@example.com:1337");
testURL("http://[::ffff:127.0.0.1]?@example.com");
testURL("http://[::ffff:127.0.0.1]?@example.com:1337");
testURL("http://[::ffff:127.0.0.1]#@example.com");
testURL("http://[::ffff:127.0.0.1]#@example.com:1337");
testURL("http://a@b@c/test");
testURL("http://a:b:c@d/test");
testURL("http://a:b:c@d:e:f@g:1337/test");
testURL("http://a@b@c@d");
return try DiscordCommand.exec(allocator);
},
.HelpCommand => return try HelpCommand.exec(allocator),
.ReservedCommand => return try ReservedCommand.exec(allocator),
pub fn testURL(u: string) void {
const url = bun.URL.parse(u);
std.debug.print("URL: {s}\n", .{u});
std.debug.print("\tHost: {s}\n", .{url.displayHost()});
std.debug.print("\tHostname: {s}\n", .{url.displayHostname()});
std.debug.print("\tUsername: {s}\n", .{url.username});
std.debug.print("\tPassword: {s}\n\n", .{url.password});
}
Yes, I know, putting random prints is not really debugging, but it’s the simplest way I found to see what was going on in the parser.