Automating the censorship arms race

Evading oppressive internet censorship is possible, but discovering how is difficult and time-consuming for humans. Geneva is a genetic algorithm that automatically discovers and implements censorship circumvention strategies, many of which were long thought impossible.

By Kevin Bock, Dave Levin, December 2020

Tags: Genetic algorithms, Technology and censorship

Authoritarian nation states censor online communication in an effort to suppress political protests, limit citizens' access to information, and threaten journalists. In blocking the ability to communicate openly and freely, censors stifle change. We developed a new tool, Geneva, that automatically discovers new ways to evade censorship faster than censors can react and patch their systems.

Geneva is far from the first censorship evasion tool. For decades, censors have played a cat-and-mouse game against the activists and researchers who seek to evade them, resulting in an impressive array of protocols, proxies, and patches. Unfortunately, this game has historically favored the censor. The traditional approach to creating a new evasion strategy is to first measure and understand censors' secretive infrastructure, and apply human intuition and ingenuity to hopefully arrive at a clever insight into how to circumvent it. Manual discovery processes can take months or years, and when a censor changes its infrastructure, it can force researchers back to square one, ultimately leading to long lulls in evasion success.

Geneva inverts the discovery process. First, Geneva automatically trains against live censors to discover how to circumvent them. Only then do researchers inspect its findings to glean new insights into how the censors operate. Geneva is fast—typically taking only 1-2 hours to defeat new forms of censorship it has never seen before. Perhaps most importantly, Geneva is free from the biases of human researchers; it has discovered circumvention strategies that prior work (and even we) thought simply "should not" work. Chief among them is an entire new class of censorship evasion strategies, "server-side evasion," by which clients can evade censorship without having to install any evasion software whatsoever (not even Geneva).

In this article, we will describe in more detail how censors work, how Geneva automates the evasion process, and what this means for the future of the anti-censorship arms race.

How Censors Censor

There are many forms of censorship including political pressure, outright blocking of certain protocols, or simply taking large swaths of the internet offline. However, the most pervasive form of censorship takes place inside the network itself. China, Iran, India, Pakistan, and others deploy digital infrastructure throughout their networks that inspects the contents of the network connections flowing into and out of their countries. When they detect specific forbidden keywords or domain names, they automatically tear down or block the network connection.

Censoring in this way is extremely widespread and common, but it is not without its own set of challenges. Effectively tracking and inspecting every connection that crosses their borders requires a large amount of data processing—particularly for a country as large as China. As a result, censors tend to take shortcuts, such as limiting how much state they keep about each connection or how long they keep it. Fundamentally, censors must contend with the "eavesdropper's dilemma," which states that a box in the middle of a connection between two communicating end-hosts cannot perfectly determine whether it observes the same things as the end-hosts (packets can be dropped, altered, or processed differently at the ends). Taking advantage of these shortcomings is the foundation of much work on censorship evasion.

Evasion by Changing Packets

A wide range of popular tools, such as Tor, VPNs, and various protocol obfuscation tools, have been developed to circumvent censorship. All of these require deployment at the client as well as proxies outside the censoring regime. A different approach, evasion through packet manipulation, requires deployment only at the client. The key idea is that the client can introduce new or altered packets into an existing connection to confuse the censor without confusing the server.

The canonical example of a packet-manipulation evasion strategy involves a simple HTTP web connection, and works as follows. Figure 1a shows the normal flow of a TCP/IP connection; the upper blocks show the connection establishment, and the lower blocks represent the connection data itself. Upon detecting a censored term, the censor (see Figure 1b) injects a RST packet to both the client and the server, effectively terminating the connection. When using the canonical evasion strategy, after establishing a TCP connection but before sending the forbidden request, the client inserts a teardown (RST) packet that tells the server it wishes to terminate the connection (see Figure 2a). However, the client ensures the teardown packet does not actually reach the server: it sets the time-to-live (TTL) field in the packet large enough to reach the censor, but small enough that routers drop it before it reaches the server. As a result, the censor believes the connection has been terminated and thus, as one of its resource-saving shortcuts, stops paying attention to future packets on the connection. The server, on the other hand, never receives the RST, and thus the connection stays alive. The client can then freely send its forbidden request; the censor ignores it, and the server fulfills it. This strategy works to this day against the so-called "Great Firewall" (GFW) of China.
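
The TTL arithmetic behind this strategy can be sketched in a few lines of Python. This is a simplified model, not Geneva's code: it assumes a device sitting h hops from the client sees a packet only if the packet's initial TTL is at least h, and the hop counts, `Packet`, `seen_by`, and `pick_rst_ttl` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    flags: str  # e.g. "RST"
    ttl: int    # initial time-to-live set by the client


def seen_by(packet: Packet, hop: int) -> bool:
    """Simplified model: a device `hop` hops away sees the packet
    only if the packet's initial TTL is at least `hop`."""
    return packet.ttl >= hop


def pick_rst_ttl(censor_hop: int, server_hop: int) -> int:
    """Choose a TTL that reaches the censor but expires before the server."""
    if not 0 < censor_hop < server_hop:
        raise ValueError("censor must sit strictly between client and server")
    return censor_hop  # smallest TTL that still reaches the censor


# Hypothetical path: censor 5 hops away, server 12 hops away.
rst = Packet(flags="RST", ttl=pick_rst_ttl(censor_hop=5, server_hop=12))
print(seen_by(rst, 5), seen_by(rst, 12))
```

In practice the client does not know the censor's hop distance in advance, so a workable TTL has to be found by measurement or trial and error.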

To date, packet-manipulation strategies like the TTL-limited RST have been the result of impressive human ingenuity. Researchers hypothesize about how censors operate, measure them, and apply their insights to develop, implement, and test evasion strategies. Unfortunately, this process can take months or years, because the internal details of censors' infrastructure are not made public. Worse yet, when researchers publish their evasion strategies, censors can patch their systems, forcing researchers back to square one.

A Path Forward: Automating Evasion

The goal of our work is to invert this dynamic. We have developed a genetic algorithm called "Geneva" (for Gen-etic eva-sion) that trains directly against censors and automates the discovery of evasion strategies, without having to know anything ahead of time about how the censor works. Not only has Geneva re-derived essentially all previously published packet-manipulation-based evasion strategies in an afternoon, but it has discovered more than 60 evasion strategies in countries around the world.

Censors regularly update their infrastructure, complicating prior evasion work that operated at human-speed; Geneva is an important departure because it operates at machine-speed. Even as censoring regimes in China, Iran, and Kazakhstan have deployed new, never-before-studied censorship systems, Geneva has discovered ways to evade them within hours of being deployed.

How Geneva Evades Censorship

Geneva is a genetic algorithm—a biologically inspired learning system—built to discover new censorship evasion strategies. Genetic algorithms learn through the process of evolution over a series of generations. Within each generation, a population of individuals is evaluated to determine which are fittest and should survive to propagate their genetic material to the next generation. Over time, the genetic algorithm should evolve increasingly fit individuals, eventually arriving at effective solutions to the problem it seeks to solve.

Through this process, Geneva evolves strategies that evade censorship by manipulating the stream of packets that are exchanged during a normal client/server interaction, such that the censor cannot properly interfere with the connection. Geneva's modifications are one-sided: It can either run on the client-side (and modify packets leaving the client), or run on the server-side (and modify packets leaving the server). This allows us to deploy Geneva's strategies just on one side of the connection and evade censorship without requiring deployment on both sides.

Applying learning algorithms to the problem of censorship evasion introduces several challenges. The first major challenge is to identify good building blocks for the algorithm. To manipulate packets, what set of operations should an algorithm be allowed to perform? On one extreme, we could allow it to add, remove, or flip bits in packets. Such a scheme could eventually learn any strategy, but with so many degrees of freedom, it may take an inordinate amount of time to do so. On the other extreme, we could encode known strategies that humans identified in the past; this would be highly efficient, but would limit its ability to discover new strategies. Our first key insight to enable this work was a set of building blocks to balance these two extremes. Specifically, we give Geneva five simple packet-level primitives it can use to modify a given packet:

duplicate—duplicate the packet
fragment—split the packet
tamper—change a header field in the packet
drop—drop the packet
send—send the packet

These five actions can be composed together in a binary tree structure to form more complex strategies (duplicate and fragment have two children; tamper has one; drop/send are leaves). With this, we can implement a censorship evasion strategy as a tree of packet modifications that describe how a sequence of packets should be modified (see Figure 2b).
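
To make the tree structure concrete, here is a minimal sketch of how the five primitives might compose. It is an illustration under stated assumptions, not Geneva's actual implementation: the `Packet` and `Action` classes, the `header`, `value`, and `offset` parameters, and the rule that a missing child means "send" are all hypothetical. A real fragment action would also have to fix up sequence numbers, which this toy ignores.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Packet:
    flags: str
    ttl: int = 64
    payload: bytes = b""


@dataclass
class Action:
    kind: str                         # duplicate, fragment, tamper, drop, send
    left: Optional["Action"] = None
    right: Optional["Action"] = None
    header: str = ""                  # tamper only: which field to change
    value: object = None              # tamper only: the new field value
    offset: int = 0                   # fragment only: where to split the payload

    def run(self, pkt: Packet) -> List[Packet]:
        """Apply this subtree to one packet; return the packets to emit."""
        if self.kind == "send":
            return [pkt]
        if self.kind == "drop":
            return []
        if self.kind == "tamper":
            return _run(self.left, replace(pkt, **{self.header: self.value}))
        if self.kind == "duplicate":
            return _run(self.left, pkt) + _run(self.right, pkt)
        if self.kind == "fragment":
            head = replace(pkt, payload=pkt.payload[:self.offset])
            tail = replace(pkt, payload=pkt.payload[self.offset:])
            return _run(self.left, head) + _run(self.right, tail)
        raise ValueError(f"unknown action: {self.kind}")


def _run(child: Optional[Action], pkt: Packet) -> List[Packet]:
    # Assumption of this sketch: a missing child behaves like "send".
    return [pkt] if child is None else child.run(pkt)


# A tree in the spirit of the TTL-limited RST: duplicate the packet,
# turn the first copy into a short-TTL RST, and send the second unchanged.
ttl_limited_rst = Action(
    "duplicate",
    left=Action("tamper", header="flags", value="RST",
                left=Action("tamper", header="ttl", value=5)),
    right=Action("send"),
)
for p in ttl_limited_rst.run(Packet(flags="ACK")):
    print(p)
```

Keeping drop and send as leaves means every path through a tree ends in a definite decision about what goes on the wire.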

The second key challenge is defining an effective fitness function. Fitness functions are the algorithmic instantiation of "survival of the fittest." They determine which individuals (in this case, censorship evasion strategies) in the population should survive to the next generation, and guide the algorithm to find ones that work. At a high level, Geneva's fitness function punishes strategies for breaking TCP connections and rewards them for successfully defeating censorship. We also reward conciseness in the action trees; this does not improve the overall success of the evasion strategies, but it allows us humans to better understand what the strategies are doing, why they work, and what they tell us about how censors operate.
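
A fitness function with this shape might look like the sketch below. The weights, the `TrialResult` fields, and the exact scoring rule are assumptions made for illustration; the article states only that broken connections are punished, successful evasion is rewarded, and concise trees are preferred.

```python
from dataclasses import dataclass


@dataclass
class TrialResult:
    connection_ok: bool  # did the TCP connection survive the strategy?
    evaded: bool         # did the forbidden request get a real response?
    tree_size: int       # number of actions in the strategy's tree


def fitness(result: TrialResult,
            break_penalty: float = 100.0,
            evade_reward: float = 100.0,
            size_cost: float = 1.0) -> float:
    """Score one trial; all weights here are hypothetical."""
    score = 0.0
    if not result.connection_ok:
        score -= break_penalty      # the strategy broke the connection
    elif result.evaded:
        score += evade_reward       # the strategy defeated the censor
    score -= size_cost * result.tree_size  # prefer concise trees
    return score


print(fitness(TrialResult(connection_ok=True, evaded=True, tree_size=3)))
```

With these weights, a strategy that evades ranks above one that is merely censored, which in turn ranks above one that breaks the connection outright; among strategies that evade, the smaller tree wins.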

Geneva does not require any seed knowledge about prior strategies or how censors operate. Instead, we start with an initial population of action trees generated purely at random. Geneva then runs each one with forbidden requests against a real censor, evaluates their fitness, and mutates and mates those that survive to the next generation. Over a series of generations, Geneva refines the individuals that show some promise into effective, concise evasion strategies.
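
The overall loop can be sketched as a small genetic algorithm. This is a toy under several assumptions: individuals are flat lists of primitive names rather than action trees, `toy_evaluate` is a made-up stand-in for running a strategy against a live censor, and the population size, survivor count, and mutation and mating operators are arbitrary choices.

```python
import random
from typing import Callable, List

PRIMITIVES = ["duplicate", "fragment", "tamper", "drop", "send"]
Strategy = List[str]


def random_strategy(rng: random.Random, max_len: int = 6) -> Strategy:
    """A purely random individual: no seed knowledge of known strategies."""
    return [rng.choice(PRIMITIVES) for _ in range(rng.randint(1, max_len))]


def mutate(s: Strategy, rng: random.Random) -> Strategy:
    """Replace one randomly chosen action with a random primitive."""
    child = list(s)
    child[rng.randrange(len(child))] = rng.choice(PRIMITIVES)
    return child


def mate(a: Strategy, b: Strategy, rng: random.Random) -> Strategy:
    """Join a prefix of one parent to a suffix of the other."""
    child = a[:rng.randint(0, len(a))] + b[rng.randint(0, len(b)):]
    return child or ["send"]  # never return an empty strategy


def evolve(evaluate: Callable[[Strategy], float], generations: int = 20,
           pop_size: int = 30, survivors: int = 10, seed: int = 0) -> Strategy:
    rng = random.Random(seed)
    population = [random_strategy(rng) for _ in range(pop_size)]
    for _ in range(generations):
        # The fittest individuals survive unchanged into the next generation.
        parents = sorted(population, key=evaluate, reverse=True)[:survivors]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(mate(a, b, rng), rng))
        population = parents + children
    return max(population, key=evaluate)


def toy_evaluate(strategy: Strategy) -> float:
    # Hypothetical stand-in for a live trial: pretend that tampering and
    # then sending evades the censor, and charge one point per action.
    evades = "tamper" in strategy and strategy[-1] == "send"
    return (100.0 if evades else 0.0) - len(strategy)


best = evolve(toy_evaluate)
print(best, toy_evaluate(best))
```

Because the survivors carry over unchanged, the best score in the population never decreases from one generation to the next.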

We deployed Geneva client-side inside real-world censoring nation states, and Geneva successfully found 36 novel censorship evasion strategies across four countries: China, India, Iran, and Kazakhstan. This, in turn, will enable activists, users, and researchers to communicate without interference from censors.

Geneva rapidly re-discovered common tactics such as the TTL-limited RST described earlier. Soon, it was finding strategies that many thought would be impossible against the GFW of China. One of these novel strategies manipulated HTTP requests containing a forbidden keyword by segmenting each request into three packets. For instance, it would turn a single packet "GET /?search=ultrasurf" into three packets: "GET /?se", "arch", and "=ultrasurf". (Ultrasurf, the name of a censorship circumvention system, is a commonly censored keyword.) This was surprising because the GFW recombines segmented packets and, moreover, the censored keyword still appears intact in a packet. We hypothesize that this works because of a bug in the GFW.

This demonstrates the two very broad classes of strategies that Geneva finds: gaps in the censors' logic and bugs in their implementation.
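
The three-way segmentation described above is easy to reproduce on the payload alone. The sketch below assumes the request line carries the usual space after GET; the `segment` helper and the byte offsets are illustrative choices, and a real implementation would also set TCP sequence numbers on each segment.

```python
from typing import List, Sequence


def segment(payload: bytes, offsets: Sequence[int]) -> List[bytes]:
    """Split `payload` at the given ascending byte offsets."""
    cuts = [0, *offsets, len(payload)]
    return [payload[a:b] for a, b in zip(cuts, cuts[1:])]


request = b"GET /?search=ultrasurf"
parts = segment(request, [8, 12])
print(parts)  # [b'GET /?se', b'arch', b'=ultrasurf']
```

The forbidden keyword still sits whole inside the last segment, which is why the strategy's success points to an implementation bug rather than a gap in the censor's logic.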

Censorship Evasion From the Server-Side

Prior to Geneva, all censorship evasion techniques required some degree of participation from the client, and this seems natural: to evade censorship, shouldn't the client have to do something?

The prospect of server-side evasion—whereby the server outside the censoring regime evades censorship without any extra client-side software—is a sort of holy grail. It would enable servers to be reachable by all users within a censoring regime, including users who lack the technical knowledge to use evasion tools and users who did not realize they were being censored in the first place.

In a sense, server-side evasion "shouldn't" work. A server typically sends very few packets (as few as a single SYN/ACK packet while establishing a new connection) before the client's forbidden request triggers censorship. Because of the server's limited involvement, it has long been thought that purely server-side censorship evasion was not possible.

Fortunately, Geneva did not know that it "shouldn't" work. We altered Geneva to be able to run from the server-side, and trained it against four countries (China, India, Iran, and Kazakhstan) across five protocols (DNS-over-TCP, FTP, HTTP, HTTPS, and SMTP). To date, it has found 16 server-side strategies, all of which work with completely unmodified clients.

An example of a server-side strategy works as follows (see Figure 3b). In Kazakhstan, instead of sending a standard SYN/ACK packet without a payload, a web server can send two SYN/ACK packets, both containing an HTTP GET request for a non-forbidden keyword. The client ignores these payloads and continues with its request, but the censor appears to get confused as to who in the connection is the server and who is the client. After observing two benign (but actually inconsequential) requests, the censor assumes the connection is legitimate and stops paying attention, allowing the client to issue its forbidden request unhindered. We verified that strategies like this can work for all unmodified clients, including various versions of Windows, macOS, and Linux.
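
The server's side of this exchange amounts to a small rewrite rule on outgoing packets. The following sketch is illustrative only: `Segment`, `rewrite_outbound`, and the benign request for example.com are hypothetical, and the code models just what goes on the wire, not how the censor reacts to it.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Segment:
    flags: str
    payload: bytes = b""


# Hypothetical benign request; any non-forbidden GET would serve.
BENIGN_GET = b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n"


def rewrite_outbound(seg: Segment) -> List[Segment]:
    """Turn one outgoing SYN/ACK into two SYN/ACKs carrying a benign GET.
    Every other segment passes through unchanged."""
    if seg.flags != "SYN/ACK":
        return [seg]
    loaded = replace(seg, payload=BENIGN_GET)
    return [loaded, loaded]


print(rewrite_outbound(Segment("SYN/ACK")))
```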

In addition to strategies like this that confuse the censor into believing the client is the server, many of Geneva's server-side strategies take advantage of esoteric features of the TCP three-way handshake. These features are poorly implemented in middleboxes, but accurately implemented in client operating systems.

Advantage: Geneva

For both client- and server-side strategies, Geneva trains very quickly. Over the past year and a half, we have observed brand new forms of censorship being deployed in Kazakhstan, India, Iran, and China; and for each, Geneva was able to discover evasion strategies within 1-2 hours.

One example of this occurred in February 2020. While performing experiments in Iran, we noticed some strategies that used to work no longer did. Our first thought was that Iran had changed their core censorship infrastructure, perhaps patching some of the bugs that Geneva had identified. After further analysis, however, we discovered Iran had deployed a brand new form of censorship in combination with their pre-existing censorship infrastructure. In particular, they deployed a "protocol filter": a system that uses fingerprints to identify which application protocol each TCP connection is using, and blocks all outbound traffic for any connection that does not match predefined fingerprints. If a connection does match a fingerprint, then it subjects the packets to its standard censorship system. This system allows Iran to crack down on any application they can't already censor.
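
The filter's two-stage logic can be sketched as follows. The real fingerprints are not public, so the prefixes below are purely hypothetical stand-ins (an HTTP method followed by a space, and the first two bytes of a TLS handshake record), as are the function names and the string results.

```python
from typing import Optional

# Hypothetical fingerprints; the real filter's rules are not published.
FINGERPRINTS = {
    "http": (b"GET ", b"POST ", b"HEAD "),
    "tls": (b"\x16\x03",),  # TLS handshake record, major version 3
}


def classify(first_payload: bytes) -> Optional[str]:
    """Return the first protocol whose fingerprint matches, else None."""
    for proto, prefixes in FINGERPRINTS.items():
        if first_payload.startswith(prefixes):
            return proto
    return None


def filter_decision(first_payload: bytes) -> str:
    proto = classify(first_payload)
    if proto is None:
        return "block"               # unrecognized protocol: drop the traffic
    return f"inspect as {proto}"     # hand off to the standard censorship system


for data in (b"GET / HTTP/1.1\r\n", b"\x16\x03\x01\x02\x00", b"SSH-2.0-OpenSSH"):
    print(filter_decision(data))
```

The default-deny branch is what makes such a filter hard on new tools: traffic the censor cannot classify never reaches the destination at all.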

By applying Geneva, we discovered four ways to circumvent the protocol filter and its standard censorship simultaneously. These strategies are unusual, and we can learn more about how the protocol filter works by studying them. For example, one strategy works by sending nine acknowledgement packets during the TCP three-way handshake (this is highly unusual, but allowed); from this (and follow-up experiments), we can infer the protocol filter has a relatively low maximum number of packets it can process for each connection. This information can be used by tool developers to make more resilient tools and by activists to try to identify the original manufacturer of the censorship infrastructure.

Because Geneva operates at machine speed, it allows us to respond rapidly to new forms of censorship by discovering evasion strategies.

The Next Phase of the Arms Race

Geneva shows it is possible to automate the discovery of evasion strategies, putting evaders at a significant advantage over censors—for now. A logical next step for the censors to take is to apply automated techniques of their own. For instance, they could use Geneva-like tools today to start to identify issues in their own infrastructure.

Fortunately, we do not think this would put them at a significant advantage. Broadly speaking, Geneva discovers two kinds of evasion strategies: bugs in the censors' implementation and gaps in the censors' logic. Bugs could potentially be patched with relative ease (assuming they are not in their hardware vendors' proprietary code), but logic gaps are far more difficult to fix. This is because all middleboxes suffer from the "eavesdropper's dilemma." They must operate as if they see the same packets that the two communicating end-hosts see. But this is not always true in general (due to packet drops, errors, etc.), and Geneva holds the upper hand by leveraging those mistakes.

One of the benefits of Geneva is that it is willing to try anomalous packet sequences that are so strange humans might not think to try them. However, this is also one of its potential downsides: many of the strange packets it creates would also make it easier for censors to fingerprint or detect Geneva while training. Geneva could account for this possibility in the fitness function by rewarding packet sequences that conform to "normal" looking packets.

We cannot be certain how the cat-and-mouse game between censors and evaders will change in the future, but we expect that automated agents like Geneva will play a critical role.

To Learn More

For more information about Geneva, including its open-source code and regular posts about new findings, visit https://censorship.ai/.

Authors

Kevin Bock is a Ph.D. candidate at the University of Maryland, studying computer science and network security. Kevin leads the Geneva project, and teaches a penetration testing class at UMD.

Dave Levin is an assistant professor in the Computer Science Department at the University of Maryland. He received an NSF CAREER award and an Undergraduate Research Mentoring award from the National Center for Women & Information Technology (NCWIT). Dave founded the Breakerspace lab at UMD, where he advises over two dozen undergraduate students in their research—many of whom work on Geneva.

Figures

Figure 1. Two network connections between a client and server, with a censor observing their traffic. (a) A normal connection, which begins with a three-way handshake, followed by the client requesting data and receiving a response. (b) The client requests a forbidden keyword, triggering the censor to terminate the connection at both ends.

Figure 2. An example of a client-side evasion strategy found by Geneva. (a) The client sends a RST packet with a short TTL that reaches the censor but not the server, causing the censor to ignore future packets from the client. (b) Geneva's tree representation of this strategy.

Figure 3. Two server-side strategies Geneva uses to confuse the censor. (a) The server sends a RST packet during the handshake to confuse the Great Firewall of China. (b) The server interferes with Kazakhstan's censor by sending two innocuous HTTP GET requests during the handshake; these payloads are ignored by the client, but they are processed by the censor.

Crossroads The ACM Magazine for Students
