GFW Technical Review 10 – Trojan and Statistical Fingerprinting
During WWII, radio technology had matured enough to be widely used for military communication. The issue with radio is that messages are broadcast into the air – not only your intended receiver but everyone within range, including the enemy, can intercept the signal and try to decode the message. Both Allied and Axis forces developed cryptographic schemes to encrypt their radio messages, attempting to camouflage them as random noise. Nazi Germany famously used the Enigma machine for its military communication. However, it turns out that designing a perfect cryptographic scheme is very difficult – the innate patterns of the messages and the characteristics of the cipher can leak information, and these leaks ultimately allowed Allied intelligence to break the German code.
Rather than trying to make radio traffic indecipherable, there is a different approach – planting false intelligence. Right before the invasion of Normandy, the Allied forces carried out a mass deception campaign. They deliberately leaked false information, deployed inflatable dummy tanks and landing craft, and carried out deceptive military exercises, misleading the Nazis into thinking the naval invasion was going to take place at Calais. The campaign was a great success – the Nazis deployed their strongest forces at Calais, a key reason why the Normandy invasion succeeded.
Two approaches: one is to encrypt and conceal, the other is to imitate and disguise. The same principles map to modern censorship circumvention as two strategies: polymorphism and steganography. Polymorphism takes the concealment path – polymorphic proxies make network traffic look like nothing but a stream of completely random bytes, so the GFW cannot classify them. Shadowsocks and VMess took this approach. But similar to how Enigma was broken by Allied Intelligence’s careful analysis and computation, the “look like nothing” proxies unavoidably have recognisable characteristics that the GFW can exploit. In fact, a stream of bytes with high entropy, no recognisable protocol headers, and no cleartext handshake is a signal in itself. By the late 2010s, the GFW had developed increasingly effective techniques for identifying and blocking these fully encrypted protocols – not by breaking their encryption, but by noticing that the traffic looked encrypted and nothing else.
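To make the "high entropy is a signal in itself" point concrete, here is a minimal Python sketch of two generic measurements a classifier could take on the first payload of a connection: Shannon entropy per byte and the share of printable ASCII. The function names and sample inputs are illustrative assumptions, and this is not a reproduction of the GFW's actual rules.

```python
import math
import os
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of `data` in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def printable_fraction(data: bytes) -> float:
    """Return the share of bytes in the printable ASCII range 0x20-0x7e."""
    if not data:
        return 0.0
    return sum(0x20 <= b <= 0x7E for b in data) / len(data)


if __name__ == "__main__":
    http = b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n\r\n"
    random_like = os.urandom(len(http))  # stand-in for a fully encrypted first packet
    for label, payload in (("HTTP request", http), ("random bytes", random_like)):
        print(label, round(shannon_entropy(payload), 2), round(printable_fraction(payload), 2))
```

A cleartext protocol scores low on entropy and high on printable share, while a fully encrypted first packet does the opposite, which is exactly what makes it stand out.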
The response was a philosophical shift in circumvention design, taking the disguise path of steganography. Instead of camouflaging your proxy traffic and hoping it dodges some blacklist, what if you imitate – or even directly use – an existing protocol and disguise yourself as a common class of network traffic? That way, you avoid exposing identifiable characteristics to the GFW. If you pick a sufficiently common protocol and disguise well enough, the cost and collateral damage the GFW would incur in blocking it becomes enormous.
This is the idea behind TLS-based evasion technologies: use TLS directly and disguise proxy traffic as the most common class of traffic on the internet.
Trojan
The core idea of Trojan is simple: use TLS to disguise proxied traffic as normal HTTPS.
A Trojan server is, first and foremost, a real TLS server with a valid certificate for a real domain. When any client connects, the TLS handshake proceeds exactly as it would for any HTTPS website – because it is a standard TLS handshake, handled by a standard TLS stack. There is nothing to fingerprint in the handshake because there is nothing custom about it.
The distinction happens after the handshake completes, inside the encrypted tunnel. The server reads the first few bytes of application data. If those bytes begin with a valid Trojan authentication token – the 56-byte hex encoding of the SHA-224 hash of a pre-shared password – the server treats the connection as a proxy session and forwards traffic to the requested destination.
The Trojan protocol itself is very simple. Since it operates inside the TLS tunnel, there is no need for its own encryption layer. The designer also opted for minimal features, so there is no additional overhead beyond addressing. The “Trojan Request” field simply contains a SOCKS5-style destination header – command, address type, address, and port – and the payload follows immediately.
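The layout described above can be sketched in a few lines of Python. This follows the public Trojan protocol document listed in the references (token, CRLF, request, CRLF, payload). The function names, the password, and the destination are placeholders chosen for illustration, and only the domain-name address type is handled.

```python
import hashlib
import struct

CRLF = b"\r\n"
CMD_CONNECT = 0x01  # TCP connect, same value as in SOCKS5
ATYP_DOMAIN = 0x03  # address is a length-prefixed domain name


def trojan_token(password: str) -> bytes:
    """Return the 56-byte ASCII token: hex(SHA-224(password))."""
    return hashlib.sha224(password.encode("utf-8")).hexdigest().encode("ascii")


def build_trojan_request(password: str, host: str, port: int, payload: bytes = b"") -> bytes:
    """Build the first application-data bytes a client sends inside the TLS tunnel."""
    host_bytes = host.encode("ascii")
    if not 0 < len(host_bytes) <= 255:
        raise ValueError("domain name must be 1-255 bytes")
    if not 0 < port <= 65535:
        raise ValueError("port must be 1-65535")
    destination = (
        bytes([CMD_CONNECT, ATYP_DOMAIN, len(host_bytes)])
        + host_bytes
        + struct.pack("!H", port)  # port in network byte order
    )
    return trojan_token(password) + CRLF + destination + CRLF + payload
```

For example, `build_trojan_request("hunter2", "example.com", 443, b"hello")` yields 56 token bytes, a CRLF, 16 bytes of destination header, another CRLF, and then the payload. Everything is sent as ordinary TLS application data.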
Probe Defense
To defend against active probes, a Trojan server must behave identically to a web server when receiving unauthenticated requests. In such a scenario, the server silently forwards the entire connection to a fallback – typically Nginx or another web server hosting a real website. The unauthenticated client sees a normal webpage and has no way to confirm that the server is anything other than an ordinary HTTPS site.
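The routing decision can be sketched as follows. This is a simplified illustration, not the code of any particular Trojan implementation: `route_connection` and `make_token` are hypothetical names, and a real server would also have to replay the bytes it has already read to the fallback web server.

```python
import hashlib
import hmac
from typing import Iterable

TOKEN_LEN = 56  # length of hex(SHA-224(password))
CRLF = b"\r\n"


def make_token(password: str) -> bytes:
    """Return the hex-encoded SHA-224 token for a configured password."""
    return hashlib.sha224(password.encode("utf-8")).hexdigest().encode("ascii")


def route_connection(first_bytes: bytes, valid_tokens: Iterable[bytes]) -> str:
    """Return "proxy" for an authenticated client, otherwise "fallback"."""
    candidate = first_bytes[:TOKEN_LEN]
    well_formed = first_bytes[TOKEN_LEN:TOKEN_LEN + 2] == CRLF
    authenticated = False
    for token in valid_tokens:
        # compare_digest avoids leaking how many leading characters matched
        if hmac.compare_digest(candidate, token):
            authenticated = True
    return "proxy" if well_formed and authenticated else "fallback"
```

An ordinary `GET / HTTP/1.1` request, a wrong password, and an empty read all take the same "fallback" branch, so a prober only ever sees the website.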
TLS-Based Evasion Family
Trojan was the first widely adopted TLS-based tool for censorship circumvention. A wider family of protocols has since adopted similar concepts.
Cloak
Cloak is a TLS-based pluggable transport intended to be used in conjunction with existing protocols like Shadowsocks or OpenVPN, similar to Tor’s idea of pluggable transports. It wraps the existing protocol inside a TLS channel to provide censorship resistance through TLS indistinguishability.
VLESS
VLESS is an evolution of VMess, stripping away the cryptographic portions along with other simplifications like the removal of the time-sync requirement. It lacks an encryption layer of its own, as it is intended to be paired with TLS for cryptographic security. It is also often paired with a reverse proxy like Nginx for a fallback website mechanism. The resulting architecture looks very similar to Trojan.
The Merits and Costs
TLS-based evasion is a significant step forward from Shadowsocks. It offers benefits across multiple dimensions:
- Obfuscation. The majority of the modern internet runs on TLS. Using TLS for proxy traffic makes protocol-level analysis much more difficult, if not impossible, hindering the GFW’s detection capability. The potential collateral damage of blocking TLS is also far higher.
- Probe resistance. Since you are presenting yourself as a website, placing a real website behind Trojan makes the server extremely probe-resistant.
- Security. TLS is a well-studied, battle-tested cryptographic protocol. It is far safer than self-designed protocols such as early Shadowsocks, which had multiple security vulnerabilities.
Like everything in life, these benefits come with costs:
- Performance. TLS incurs an additional round trip from the handshake, whereas Shadowsocks or VMess have no handshakes at all. This is often not noticeable to the end user, but it does impact performance – especially on long-range, low-quality links, which is the typical setup for proxy servers.
- Complexity. The Trojan protocol itself is not much more complicated than Shadowsocks, but properly setting up a Trojan server requires buying a domain, obtaining and installing certificates, and configuring a backend web server. This is significantly more involved, and many steps require solid knowledge of networking and web infrastructure.
Statistical Fingerprinting
With the introduction of Trojan and other TLS-based evasion tools, and the deployment of various active probe resistance techniques, it became increasingly challenging for the GFW to identify proxy traffic. The traditional methods of protocol identification – specific protocol-level patterns, behaviours, and entropy analysis – became less effective. The GFW began to adopt more sophisticated methods involving statistical analysis and machine learning.
TLS Fingerprinting
As introduced in Post 8, every TLS implementation has specific patterns in the ClientHello and ServerHello packets – from the cipher suites and extensions to various header fields. These patterns can be summarised as the TLS fingerprint of that application. Depending on the cryptographic library used, Trojan and other TLS-based tools also have their own fingerprints that the GFW can use to identify them. If such a fingerprint is uncommon enough to be attributed to a proxy protocol alone, the GFW can confidently block all connections carrying it.
This is particularly troublesome for tools written in Go – such as V2Ray and Trojan-Go – because Go’s standard TLS library has a distinct fingerprint that stands out. The original Trojan is written in C++ on top of OpenSSL and carries that library’s fingerprint instead.
The countermeasure is mimicking the TLS fingerprint of popular applications, usually modern browsers like Chrome or Firefox. Many Go-based clients adopted uTLS – a fork of Go’s standard TLS library that allows the application to customise the ClientHello for mimicry purposes. When adopting this strategy, it is also important that the fingerprint stays reasonably up to date to avoid using outdated signatures that no real browser would produce.
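To show what "summarising a ClientHello as a fingerprint" can look like, here is a minimal sketch of JA3, a publicly documented fingerprint format. JA3 is used here only as a well-known example, and whether the GFW uses JA3 or another scheme is not something the sources establish. The function names are illustrative, and the input is assumed to be already-parsed ClientHello fields.

```python
import hashlib
from typing import Sequence


def is_grease(value: int) -> bool:
    """GREASE values (RFC 8701) look like 0x0a0a, 0x1a1a, ..., 0xfafa."""
    return (value & 0x0F0F) == 0x0A0A and (value >> 8) == (value & 0xFF)


def ja3_string(version: int, ciphers: Sequence[int], extensions: Sequence[int],
               curves: Sequence[int], point_formats: Sequence[int]) -> str:
    """Join the five ClientHello fields as decimal values, skipping GREASE."""
    def join(values: Sequence[int]) -> str:
        return "-".join(str(v) for v in values if not is_grease(v))

    return ",".join([str(version), join(ciphers), join(extensions),
                     join(curves), join(point_formats)])


def ja3_hash(version: int, ciphers: Sequence[int], extensions: Sequence[int],
             curves: Sequence[int], point_formats: Sequence[int]) -> str:
    """Return the MD5 hex digest of the JA3 string (MD5 is part of the format)."""
    text = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(text.encode("ascii")).hexdigest()
```

Because the order of cipher suites and extensions feeds into the hash, two TLS libraries that support the same features but list them differently produce different fingerprints. This is why mimicry has to reproduce a browser's ClientHello exactly rather than approximately.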
Certificate
As a legitimate TLS server, the proxy must present a certificate, which the GFW can read from a TLS 1.2 handshake or fetch by connecting to the server itself (TLS 1.3 encrypts the Certificate message). Even though the GFW cannot determine the server’s purpose from the certificate alone, it can use the SNI to look up the domain’s certificate issuance records, for example in public Certificate Transparency logs, which leak considerable information. There are practically four ways to configure the certificate:
- No certificate. It is possible to run TLS without a certificate, relying purely on a pre-shared secret – known as TLS-PSK. However, this is very uncommon on the internet and creates a unique pattern that is easily recognisable.
- Self-signed certificate. Since the proxy server is not serving the general public, you can use a self-signed certificate and add it to the trusted store on the client machine. The problem is that a self-signed certificate cannot be found in any public registry, making it inherently suspicious.
- Certificate signed by a self-signed CA. A self-signed CA is not in the browser’s default chain of trust and cannot be found in any public registry either. However, this setup is not so uncommon – large organisations and TLS middleboxes tend to use it.
- Public SSL certificate. Free and paid services exist for certificate registration. In general, this is the most similar to a real website. However, the domain and certificate registration record might still be suspicious. A single, fresh certificate for a rarely visited site, issued by a free CA like Let’s Encrypt, is much more suspicious than a bulk registration from a well-known organisation with a reputable, business-oriented CA.
Timing and Packet Length Patterns
Trojan exposes no protocol-level signals, but there are more nuanced patterns that the GFW can follow. Consider a typical scenario: a user opens a TLS connection to the Trojan server, then immediately opens another TLS connection to the real destination inside the Trojan TLS channel. This TLS-over-TLS pattern is very distinctive to proxies.
To identify TLS-over-TLS, the GFW can examine the first few packets after the outer TLS handshake completes. These are handshake packets for the second, inner TLS layer – they have distinct packet lengths compared to typical TLS application data, and the round-trip time of these packets is slightly longer than the initial handshake since they are being routed onward to the real destination.
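The packet-length part of this signal comes down to simple arithmetic. The sketch below assumes the outer tunnel is TLS 1.3 with an AEAD cipher that uses a 16-byte tag (AES-GCM or ChaCha20-Poly1305) and no record padding. The inner record sizes are made-up numbers for illustration, and the Trojan request header that travels with the first record is ignored.

```python
RECORD_HEADER = 5       # content type (1) + legacy version (2) + length (2)
INNER_CONTENT_TYPE = 1  # TLS 1.3 appends the real content type before encrypting
AEAD_TAG = 16           # authentication tag of AES-GCM and ChaCha20-Poly1305
MAX_PLAINTEXT = 2 ** 14  # largest plaintext fragment a TLS record may carry


def outer_record_length(plaintext_len: int) -> int:
    """Wire size of one unpadded TLS 1.3 record carrying `plaintext_len` bytes."""
    if not 0 <= plaintext_len <= MAX_PLAINTEXT:
        raise ValueError("plaintext does not fit in a single TLS record")
    return RECORD_HEADER + plaintext_len + INNER_CONTENT_TYPE + AEAD_TAG


# Hypothetical sizes of the inner handshake records (ClientHello, server flight, Finished).
inner_handshake = [517, 1400, 80]
outer_sizes = [outer_record_length(n) for n in inner_handshake]
```

Without padding, every inner record reappears on the wire enlarged by a constant 22 bytes, so the size sequence of the inner handshake is preserved almost exactly for an observer of the outer connection.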
Modern TLS evasion tools mitigate these patterns by introducing random padding to the payload – especially the first few packets – and adding random timing jitter to the responses.
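A minimal sketch of the padding and jitter idea is shown below. The framing (two 16-bit length fields followed by payload and padding) is a hypothetical format invented for this example and is not the wire format of any particular tool. The default bounds are likewise arbitrary.

```python
import os
import random
import secrets
import struct

HEADER = struct.Struct("!HH")  # payload length, padding length (network byte order)


def pad(payload: bytes, max_padding: int = 255) -> bytes:
    """Wrap `payload` in a frame with a random amount of trailing padding."""
    if len(payload) > 0xFFFF or not 0 <= max_padding <= 0xFFFF:
        raise ValueError("payload or padding bound does not fit in 16 bits")
    pad_len = secrets.randbelow(max_padding + 1)
    return HEADER.pack(len(payload), pad_len) + payload + os.urandom(pad_len)


def unpad(frame: bytes) -> bytes:
    """Recover the payload from a frame produced by pad()."""
    if len(frame) < HEADER.size:
        raise ValueError("frame shorter than its header")
    payload_len, pad_len = HEADER.unpack(frame[:HEADER.size])
    if len(frame) != HEADER.size + payload_len + pad_len:
        raise ValueError("frame length does not match its header")
    return frame[HEADER.size:HEADER.size + payload_len]


def jitter_delay(max_seconds: float = 0.05) -> float:
    """Pick a random delay to wait before sending a response."""
    return random.uniform(0.0, max_seconds)
```

Since the frame travels inside the TLS tunnel, an observer sees only the total length, which now varies from connection to connection even when the inner handshake is identical.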
Behavioural and Contextual Signals
A more subtle trace of proxy usage comes from behavioural patterns. A Trojan user’s network profile is a long-lived, single TLS session targeting a specific foreign IP and port. This is very different from typical internet traffic, where the client opens multiple, relatively short-lived TLS sessions to various destinations. Such analysis can even extend beyond a single day – by cooperating with ISPs, the GFW has the capability to track user behaviour across days and months to produce accurate behavioural profiles.
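As a rough illustration of how such a profile could be computed from connection logs, the sketch below derives three simple features for one client. The `Flow` record, the feature names, and the choice of features are all assumptions made for this example. The actual features used by the GFW are not public.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Flow:
    destination: str   # remote "ip:port"
    duration_s: float  # how long the connection stayed open
    total_bytes: int   # bytes transferred in both directions


def traffic_profile(flows: Iterable[Flow]) -> Dict[str, float]:
    """Summarise one client's flows into a few behavioural features."""
    flows = list(flows)
    if not flows:
        raise ValueError("need at least one flow")
    bytes_per_destination = defaultdict(int)
    for flow in flows:
        bytes_per_destination[flow.destination] += flow.total_bytes
    total = sum(bytes_per_destination.values())
    top = max(bytes_per_destination.values())
    return {
        "distinct_destinations": len(bytes_per_destination),
        "top_destination_share": top / total if total else 0.0,
        "longest_session_s": max(flow.duration_s for flow in flows),
    }
```

A client whose traffic is dominated by one long-lived session to a single foreign address ends up with a high `top_destination_share` and a large `longest_session_s`, whereas ordinary browsing spreads bytes over many short sessions.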
There is also the destination IP itself. Proxy servers are usually deployed on VPS providers, which assign IP addresses from their allocated ranges – ranges that can be looked up in public registration data such as the regional Internet registries’ WHOIS databases. These IP ranges are flagged as higher risk because they are rarely used by popular websites or services, but are frequently used by proxy service providers or individuals deploying their own proxies. Conversely, IP ranges belonging to popular websites and well-known network operators are lower risk – they do not offer self-hosting and blocking them incurs far greater collateral damage.
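Checking a destination against a list of flagged ranges is straightforward. The sketch below uses Python's standard `ipaddress` module. The ranges are reserved documentation prefixes standing in for a hypothetical list of hosting-provider allocations, not real provider data.

```python
import ipaddress

# Placeholder ranges (reserved for documentation), standing in for VPS allocations.
HYPOTHETICAL_HOSTING_RANGES = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_hosting_ip(address: str, ranges=HYPOTHETICAL_HOSTING_RANGES) -> bool:
    """Return True if `address` falls inside any of the flagged ranges."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in ranges)
```

With these placeholder ranges, `is_hosting_ip("203.0.113.7")` is true and `is_hosting_ip("192.0.2.1")` is false. In practice such a lookup would only contribute a risk signal rather than decide anything on its own.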
Active Probing
We described probing Shadowsocks servers in Post 7. Trojan is in principle much more difficult to probe, as it is a standard TLS server – especially when a fallback web server is configured. Nonetheless, for any request the Trojan server has to go through its authentication process, which is distinctive. Research has shown that by closely studying the source code of Trojan implementations, it is possible to send well-crafted, intentionally malformed packets that trigger specific timing behaviours in the authentication logic, distinguishing the server from a standard web server.
A Statistical Approach
None of the above fingerprinting methods can give a reliable classification result on its own, but they become very powerful when combined. Recent leaked documents indicate that the GFW has multiple independent modules – each with a different capability, such as entropy analysis, protocol-level analysis, TLS fingerprinting, domain/SNI tagging, or behavioural study – operating in real time or near-real time on a large pool of logs and metrics. Each module produces a score, and the scores are then combined through lightweight ML models (or even simple linear regression) to classify connections into categories that lead to allowing, blocking, or flagging for further review and probing.
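The combination step can be sketched as a weighted sum followed by two thresholds. Everything concrete here is an assumption for illustration: the module names, the weights, and the thresholds are invented, and the real models and parameters are not public.

```python
from typing import Dict

# Hypothetical weights per module; each module score is assumed to lie in [0, 1].
WEIGHTS = {
    "entropy": 0.15,
    "tls_fingerprint": 0.30,
    "sni_reputation": 0.20,
    "behaviour": 0.35,
}
BLOCK_THRESHOLD = 0.8  # hypothetical
FLAG_THRESHOLD = 0.5   # hypothetical


def combined_score(scores: Dict[str, float]) -> float:
    """Weighted linear combination; modules without a score count as 0."""
    return sum(weight * scores.get(name, 0.0) for name, weight in WEIGHTS.items())


def decide(scores: Dict[str, float]) -> str:
    """Map the combined score to "block", "flag" (further review), or "allow"."""
    score = combined_score(scores)
    if score >= BLOCK_THRESHOLD:
        return "block"
    if score >= FLAG_THRESHOLD:
        return "flag"
    return "allow"
```

With these made-up numbers, a connection that only trips the entropy module is allowed, one that trips both TLS fingerprinting and behavioural analysis is flagged for probing, and one that trips every module is blocked. No single weak signal decides the outcome, which is what makes the combination hard to evade.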
The Arms Race Continues
The adoption of TLS is an important milestone in the development of censorship circumvention technologies. Trojan’s developers described it as an “unidentifiable mechanism that helps you bypass GFW.” Today we know it is not bulletproof. Starting in October 2022, the GFW began blocking various TLS-based circumvention protocols – including Trojan, VLESS, and V2Ray over TLS – on a large scale, deploying the multi-module statistical fingerprinting mechanism described above. In response, new TLS-based protocols and tools like NaïveProxy, XTLS, and REALITY emerged – each attempting to close the remaining fingerprinting gaps that statistical analysis exploits.
References
- M. C. Tschantz, S. Afroz, Anonymous, and V. Paxson, “SoK: Towards Grounding Censorship Circumvention in Empiricism,” 2016 IEEE Symposium on Security and Privacy (SP). https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7546542
- Mingshi Wu, Jackson Sippe, Danesh Sivakumar, Jack Burg, Peter Anderson, Xiaokang Wang, Kevin Bock, Amir Houmansadr, Dave Levin, and Eric Wustrow. How the Great Firewall of China Detects and Blocks Fully Encrypted Traffic. In 32nd USENIX Security Symposium (USENIX Security 23). https://www.usenix.org/conference/usenixsecurity23/presentation/wu-mingshi
- trojan-gfw. The Trojan Protocol. https://trojan-gfw.github.io/trojan/protocol.html
- trojan-gfw. trojan. https://github.com/trojan-gfw/trojan
- cbeuw. Cloak. https://github.com/cbeuw/Cloak
- klzgrad. trojan issue #14, Design discussion. https://github.com/trojan-gfw/trojan/issues/14
- Liuying Lv, Peng Zhou. TrojanProbe: Fingerprinting Trojan tunnel implementations by actively probing crafted HTTP requests. Computers & Security. Volume 148. 2025. https://doi.org/10.1016/j.cose.2024.104147
- refraction-networking. utls. https://github.com/refraction-networking/utls
- Hannes Tschofenig and Pasi Eronen. Pre-Shared Key Ciphersuites for Transport Layer Security (TLS). RFC 4279. https://datatracker.ietf.org/doc/html/rfc4279
- Inside the Great Firewall Part 2: Technical Infrastructure. DomainTools Investigations. https://dti.domaintools.com/research/inside-the-great-firewall-part-2-technical-infrastructure
- Large scale blocking of TLS-based censorship circumvention tools in China. gfw-report. https://github.com/net4people/bbs/issues/129
- Wikipedia. Operation Bodyguard. https://en.wikipedia.org/wiki/Operation_Bodyguard