For the last year, I couldn’t find a fast, low-transfer way of downloading Git files in a web browser. git is a server-side command, and the GitHub REST API, even if it were fast, doesn’t extend beyond GitHub. Yesterday it finally clicked – I needed a Git client built in JavaScript!
This article explains how that client works and discusses specific Git requests and data structures. If you continue reading, you’re in for a treat. If you’re impatient and just want the code, go straight to the code on GitHub. The demo app allows you to fetch specific paths from a Git repository:

Sourcing files from Git is tricky
A large part of WordPress Playground‘s usefulness is in moving data in and out of it. Load a WordPress plugin into Playground, and you can provide a live preview in the WordPress plugin directory. Load a plugin or a theme from a Git repository, and you can develop it or review a Pull Request.
Playground is an in-browser tool, but to run the usual git clone, you need a server. The fantastic Christoph Khouri created https://github-proxy.com/ to run a server-side git clone and serve the results to the browser. It works great and has been useful to me and many others.
There’s just one problem: It runs on the server.
Servers tend to require regular care. The more your project grows, the more you spend on bandwidth, storage, and additional servers. I was worried about scaling the proxy service because Playground is free and open source – it won’t generate any income to spend on scaling a server farm.
What about downloading files directly from the browser using GitHub APIs? After all, you can download any repository as a zip file and the REST API provides fine-grained access to repositories and files. Well, there are three problems with that:
- Git is larger than GitHub. GitHub APIs won’t help with sourcing content from, say, GitLab.
- The zip file is often much larger than the specific 15 files you want to load. The Gutenberg repository is 20MB+, but you may only want to edit 100KB worth of documentation files in Playground.
- The REST API is sadly slow. It doesn’t have an endpoint for “download this subdirectory as a zip file”. You have to list the files in the repo, then in a specific directory, and then request them one by one. As you do that, each fetch() sends an OPTIONS request and a GET request.
Yesterday, as Christoph, Brandon, and I were discussing scaling the server-side machinery for github-proxy.com, something clicked for me. We’re running an entire WordPress in the browser – couldn’t we just run a Git client?
Running a Git Client in the browser
The good news was that isomorphic-git, wasm-git, and a few other projects were already running Git in the browser. The bad news was that none of them supported fetching a subset of files via a sparse checkout. You’d still have to download 20MB of data even if you only wanted 100KB.
However, everything the desktop Git client does, including sparse checkouts, can be done via HTTP by requesting URLs like https://github.com/WordPress/wordpress-playground.git.
The Git documentation was… less than helpful, but eventually it worked! A few hours later I was running Git commands by sending GET and POST requests to the repository URL.
Fetching a hash of the branch
The first command I needed was ls-refs to get the SHA1 hash of the right Git branch. Here’s how you can get it with fetch() for the HEAD branch of the WordPress/gutenberg repo:
const response = await fetch(
    'https://github.com/WordPress/gutenberg.git/git-upload-pack',
    {
        method: 'POST',
        headers: {
            'Accept': 'application/x-git-upload-pack-advertisement',
            'content-type': 'application/x-git-upload-pack-request',
            'Git-Protocol': 'version=2'
        },
        body: [
            `0014command=ls-refs\n`,
            // ^^^^ line length in hex
            `0015agent=git/2.37.3\n`,
            `0017object-format=sha1\n`,
            '0001',
            // ^^^^ command separator
            // Filter the results to only contain the HEAD branch,
            // otherwise it will return all the branches and
            // tags, which may require downloading many
            // megabytes of data:
            `0009peel\n`,
            `0014ref-prefix HEAD\n`,
            '0000',
            // ^^^^ end of request
        ].join(""),
    }
);
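Those four-digit prefixes are pkt-line lengths: every line starts with the total length of that line, prefix included, as four hex digits. Rather than hand-computing them, a tiny helper (my naming, assuming ASCII payloads) can build each line:

```javascript
// Encode a Git pkt-line: a 4-digit hex length (counting the
// 4-byte prefix itself) followed by the payload.
function pktLine(payload) {
  return (payload.length + 4).toString(16).padStart(4, '0') + payload;
}

// pktLine('command=ls-refs\n')    → '0014command=ls-refs\n'
// pktLine('ref-prefix HEAD\n')    → '0014ref-prefix HEAD\n'
```

The special packets `0000` (flush), `0001` (delimiter), and `0002` (response-end) are not lengths at all – they are written literally, which is why the request above appends them as plain strings.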
I won’t go into the details of the Git protocol – the point is that with a few special headers and lines, you can be a Git client. If you paste that fetch() in your devtools while on GitHub.com, it will return a response similar to this:
0032950f5c8239b6e78e9051ec5e845bac5aa863c4cb HEAD
0000
Good! That’s our commit hash.
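The response is pkt-line framed as well: each line carries “&lt;40-char object id&gt; &lt;ref name&gt;”, and `0000` ends the list. A minimal parser sketch (my naming):

```javascript
// Parse an ls-refs response: a series of pkt-lines, each
// carrying "<40-char object id> <ref name>", terminated by
// the "0000" flush packet.
function parseRefs(text) {
  const refs = {};
  let i = 0;
  while (i + 4 <= text.length) {
    const len = parseInt(text.slice(i, i + 4), 16);
    if (len === 0) break; // "0000" – end of the list
    const [oid, name] = text.slice(i + 4, i + len).trim().split(' ');
    refs[name] = oid;
    i += len;
  }
  return refs;
}
```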
Fetching a list of objects at a specific commit
With this, we can fetch the list of objects in that branch:
fetch("https://github.com/wordpress/gutenberg/git-upload-pack", {
    "headers": {
        "accept": "application/x-git-upload-pack-advertisement",
        "content-type": "application/x-git-upload-pack-request",
    },
    "referrer": "http://localhost:8000/",
    "referrerPolicy": "strict-origin-when-cross-origin",
    "body": [
        `0088want 950f5c8239b6e78e9051ec5e845bac5aa863c4cb multi_ack_detailed no-done side-band-64k thin-pack ofs-delta agent=git/2.37.3 filter \n`,
        `0015filter blob:none\n`,
        // ^ the sparse checkout secret sauce –
        // only fetches a list of objects without
        // their content
        `0035shallow 950f5c8239b6e78e9051ec5e845bac5aa863c4cb\n`,
        `000ddeepen 1\n`,
        `0000`,
        `0009done\n`,
    ].join(""),
    "method": "POST"
});
And here’s the response:
00000008NAK
0026Enumerating objects: 2189, done.
0025Counting objects: 0% (1/2189)
...
0032Compressing objects: 100% (1568/1568), done.
2004PACK(binary data)
0040 Total 2189 (delta 1), reused 1550 (delta 0), pack-reused 0
0006�0000
The binary data after PACK is a compressed list of all objects the repository had at commit 950f5c8239b6e78e9051ec5e845bac5aa863c4cb. It is not just the files that changed in that commit. It’s all files.
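Notice how the progress lines and the PACK bytes arrive interleaved: that’s the side-band-64k capability at work. Each pkt-line’s first payload byte names a “band” – 1 for pack data, 2 for progress messages, 3 for fatal errors. A demultiplexer sketch (my naming; the real decoding below is left to isomorphic-git):

```javascript
// Demultiplex a side-band-64k response. Each pkt-line's first
// payload byte is the band: 1 = pack data, 2 = progress,
// 3 = fatal errors. Lines with no band byte in that range
// (e.g. the initial "NAK") are skipped here.
function demuxSideband(bytes) {
  const decoder = new TextDecoder();
  const out = { pack: [], progress: '', error: '' };
  let i = 0;
  while (i + 4 <= bytes.length) {
    // 4 ASCII hex digits: total length of this pkt-line
    const len = parseInt(decoder.decode(bytes.slice(i, i + 4)), 16);
    if (len === 0) { i += 4; continue; } // "0000" flush-pkt
    const band = bytes[i + 4];
    const payload = bytes.slice(i + 5, i + len);
    if (band === 1) out.pack.push(payload);
    else if (band === 2) out.progress += decoder.decode(payload);
    else if (band === 3) out.error += decoder.decode(payload);
    i += len;
  }
  return out;
}
```

Concatenating `out.pack` yields the raw packfile starting with the "PACK" magic.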
The pack format is a binary blob. It’s similar to ZIP in that it encodes a series of objects, each stored as a binary header followed by binary data. Here’s an approximate visual to help grok the idea:
PACK format – an inaccurate explanation
A pack consists of the ASCII string "PACK" followed by binary data structured roughly as follows:
___________________________________
| |
| ASCII string "PACK" |
| Binary data starts |
| Pack Header |
|___________________________________|
| |
| Offset 0x0010 |
| Object 1 Header | (Object type, hash,
| | data length, etc.)
| ________________ |
| | | |
| | Object 1 Data | | (Gzipped data)
| |________________| |
| |
| Offset 0x0050 |
| Object 2 Header |
| |
| ________________ |
| | | |
| | Object 2 Data | | (Gzipped data)
| |________________| |
|___________________________________|
| |
| Pack Footer |
| Binary data ends |
|___________________________________|
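The fixed part of that layout – the 12-byte header – is easy to read by hand: the "PACK" magic, a 4-byte big-endian version (2), and a 4-byte big-endian object count. A sketch:

```javascript
// Read the fixed 12-byte pack header: ASCII magic "PACK",
// a 4-byte big-endian version, and a 4-byte big-endian
// object count.
function parsePackHeader(bytes) {
  const magic = new TextDecoder().decode(bytes.slice(0, 4));
  if (magic !== 'PACK') throw new Error('Not a packfile');
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  return {
    version: view.getUint32(4), // DataView reads big-endian by default
    objectCount: view.getUint32(8),
  };
}
```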
The decoding of the objects themselves is tedious, so I used the decoder provided by the isomorphic-git package:
// streamToIterator, parseUploadPackResponse, collect, and
// GitPackIndex come from isomorphic-git's internals
const iterator = streamToIterator(response.body);
const parsed = await parseUploadPackResponse(iterator);
const packfile = Buffer.from(await collect(parsed.packfile));
const index = await GitPackIndex.fromPack({
    pack: packfile
});
The parsed index object provides information about all the objects encoded in the received packfile. Let’s peek inside:
{
// ...
"hashes": [
"5f4f0a5367476fdb7c98ffa5fa35300ec4c3f48b",
"950f5c8239b6e78e9051ec5e845bac5aa863c4cb",
// ...
],
"offsets": {
"5f4f0a5367476fdb7c98ffa5fa35300ec4c3f48b": 12,
"950f5c8239b6e78e9051ec5e845bac5aa863c4cb": 181,
// ...
},
"offsetCache": {
"12": {
"type": "tree",
"object": "100644 async-http-download.php\u0000��p4��\u0014�g\u0015i��\u0004��\\���100644 async-http.php\u0000�\n�8K�RT������F\u001b8�� (more binary data)"
},
// ...
},
"readDepth": 4,
"externalReadDepth": 0
}
Each object has a type and some data. The decoder stored some objects in the offsetCache and kept track of the others in the form of a hash => offset-in-packfile mapping.
Let’s read the details of the commit from our parsed index:
> const commit = await index.read({
oid: '950f5c8239b6e78e9051ec5e845bac5aa863c4cb'
});
{
"type": "commit",
"object": "tree c7b8440c83b8c987895f9a1949650eb60bccd2ec\nparent b6132f2d381865353e09edf88aa64a0dd042811a\nauthor Adam Zieliński <adam@adamziel.com> 1717689108 +0200\ncommitter Adam Zieliński <adam@adamziel.com> 1717689108 +0200\n\nUpdate rebuild workflow\n"
}
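The tree hash can be plucked out of that commit text with a one-line match – a sketch (my naming):

```javascript
// Pull the tree hash out of the commit microformat: the
// "tree <sha1>" header line of the decoded commit object.
function treeOidOf(commitText) {
  const match = commitText.match(/^tree ([0-9a-f]{40})/m);
  if (!match) throw new Error('No tree header in the commit object');
  return match[1];
}
```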
The read() result is the object type, the hash, and the uncompressed object bytes which, in this case, give us the commit details in a specific microformat. From here, we can get the tree hash and look up its details in the same index we’ve already downloaded:
> const tree = await index.read({ oid: "c7b8440c83b8c987895f9a1949650eb60bccd2ec" })
{
"type": "tree",
"object": "40000 .github\u0000_O\nSgGo�|����50\u000e���40000 (... binary data ...)"
}
The contents of the tree object are a list of files in the repository. Just like commits, trees are encoded in their own microformat. Luckily, isomorphic-git ships the relevant decoders:
> GitTree.from(tree.object).entries()
[
{
"mode": "040000",
"path": ".github",
"oid": "ece277ec006eb517d5c5399d7a5c00b7e61018f1",
"type": "tree"
},
{
"mode": "100644",
"path": "readme.txt",
"oid": "3fe6e3aaf1dc4df204be575041383fc8e2e1e070",
"type": "blob"
},
{
"mode": "040000",
"path": "src",
"oid": "dbc84f20ee64fbd924617b41ee0e66128c9a8d97",
"type": "tree"
},
// ...
]
Yay! That’s the list of files and directories in the repository root, with their hashes! From here, we can recursively retrieve the ones relevant for our sparse checkout.
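That recursive retrieval boils down to walking the path one segment at a time through nested trees. A sketch, where readTree is a hypothetical helper that returns the decoded entries for a tree hash – in the real client it would wrap index.read() and GitTree.from(...).entries():

```javascript
// Walk a slash-separated path down the tree objects to find
// the object id to request. `readTree` (hypothetical) returns
// the decoded entries for a given tree oid.
function resolveOid(readTree, rootTreeOid, path) {
  let oid = rootTreeOid;
  for (const segment of path.split('/')) {
    const entry = readTree(oid).find((e) => e.path === segment);
    if (!entry) throw new Error(`Path not found: ${path}`);
    oid = entry.oid;
  }
  return oid;
}
```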
Fetching full files from specific paths
We’re finally ready to check out a few specific paths. Let’s ask for a blob at readme.txt and a tree at docs/tool:
const response = await fetch("https://github.com/wordpress/gutenberg/git-upload-pack", {
    "headers": {
        "accept": "application/x-git-upload-pack-advertisement",
        "content-type": "application/x-git-upload-pack-request",
    },
    "body": [
        `0081want 28facb763312f40c9ab3251fb91edb87c8476cf9 multi_ack_detailed no-done side-band-64k thin-pack ofs-delta agent=git/2.37.3\n`,
        `0081want 3fe6e3aaf1dc4df204be575041383fc8e2e1e070 multi_ack_detailed no-done side-band-64k thin-pack ofs-delta agent=git/2.37.3\n`,
        `00000009done`
    ].join(""),
    "method": "POST"
});
The response is another packfile to index, but this time each blob comes with its binary contents. Some decoding and recursive processing later, we finally get this:
{
"readme.txt": "=== Gutenberg ===\nContri (...)",
"docs/tool": {
"index.js": "/**\n * External depe (...)",
"manifest.js": "/* eslint no-console (...)"
}
}
Yay! It took some effort, but it was worth it!
Cors proxy and other notes
You’ll still need to run a CORS proxy. The fetch() examples above will work if you try them in devtools on github.com, but you won’t be able to just use them on your site. Git servers typically don’t expose the Access-Control-* headers the browser requires for these requests.
So we need a server after all. Was this a failure, then? No! A CORS proxy is cheaper, simpler, and safer to maintain than a Git service. Also, it can fetch all the files in three fetch() requests instead of the two requests per file the GitHub REST API requires.
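The proxy doesn’t need to understand Git at all – it just forwards the request body to the .git URL and adds the headers browsers insist on. A sketch of the header part (the values are a starting point I’d pick, not a vetted policy):

```javascript
// CORS headers a proxy would add to the forwarded Git response.
// The allowed request headers match the ones used in the
// examples above; tighten Allow-Origin for production use.
function corsHeaders() {
  return {
    'Access-Control-Allow-Origin': '*',
    'Access-Control-Allow-Methods': 'GET, POST, OPTIONS',
    'Access-Control-Allow-Headers': 'Content-Type, Accept, Git-Protocol',
  };
}
```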
Try it yourself
I’ve shared a functional demo that includes a CORS proxy in this repository on GitHub: https://github.com/adamziel/git-sparse-checkout-in-js
Aftermath and closing thoughts
WordPress Playground can now get support for a nearly native sparse Git checkout, and eventually also for more Git commands like commit, rebase, or push.
This implementation:
- Is much faster than the existing “import from GitHub” Playground feature based on GitHub REST API.
- Works with GitLab and other Git providers.
- Can be optimized even further, e.g. by stream-processing the objects as they are downloaded.
I wonder how hard it would be to extend this to SVN and other data sources. It would take tunneling the svn+ssh protocol over a CORS proxy, which might be complex, but perhaps it would be worth it?
I’ve also explored a rough prototype in PHP. Perhaps one day WordPress could support installing and updating plugins from Git?
Also, here are three random things I found helpful along the way:
- Git clone in Haskell
- Git HTTP protocol
- Debug packets:
GIT_TRACE_PACKET=1 git fetch origin trunk
I hope you’ve enjoyed learning about the Git protocol as much as I did. Ping me if you build something fancy on top of this. Happy hacking!