GFW Technical Review 11 – Statistical Fingerprinting

TLS-based evasion raises the bar sharply. Protocols like Trojan hide inside genuine TLS sessions, and the GFW can no longer find a clean protocol-level signal to blacklist. That argument holds at the protocol level. But the GFW does not have to stay there.

The GFW’s answer has been statistical fingerprinting: profile the context around the connection rather than the connection itself. The signals include TLS implementation quirks, certificate and SNI registration patterns, packet timing and length, and long-term behavior. No single signal is conclusive. Combined through lightweight models, they are accurate enough to flag traffic that looks, byte for byte, like ordinary HTTPS.

TLS Fingerprinting

As introduced in Post 8, every TLS implementation leaves specific patterns in its ClientHello and ServerHello messages, from the cipher suites and extensions to various header fields. These patterns can be summarized as the TLS fingerprint of that application. Depending on the cryptographic library used, Trojan and other TLS-based tools have fingerprints of their own that the GFW can use to identify them. If such a fingerprint is uncommon enough to be attributed to a proxy protocol alone, the GFW can confidently block all connections carrying it.

This is particularly troublesome for tools written in Go, which include many popular circumvention tools (such as Trojan-Go, V2Ray, and Xray): Go’s standard TLS library has a distinct fingerprint that stands out.

The countermeasure is mimicking the TLS fingerprint of popular applications, usually modern browsers like Chrome or Firefox. Many Go-based clients adopted uTLS, a fork of Go’s standard TLS library that allows the application to customize the ClientHello for mimicry purposes. The fingerprint also needs to stay reasonably up to date, since outdated signatures no real browser would produce become a fingerprint of their own.
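To make the idea of a TLS fingerprint concrete, here is a minimal sketch in the style of JA3, a widely used public fingerprinting scheme. JA3 is my choice of illustration; nothing in this series confirms which scheme the GFW actually uses. The field values in the example are hypothetical, not a capture from any real client.

```python
import hashlib
from typing import Iterable


def is_grease(value: int) -> bool:
    """GREASE values (RFC 8701) have the form 0x0a0a, 0x1a1a, ..., 0xfafa."""
    return (value & 0x0F0F) == 0x0A0A and (value >> 8) == (value & 0xFF)


def ja3_string(version: int, ciphers: Iterable[int], extensions: Iterable[int],
               curves: Iterable[int], point_formats: Iterable[int]) -> str:
    """Build the JA3 input: five comma-separated fields, values joined by '-'.

    GREASE values are dropped because browsers randomize them per connection.
    """
    def join(values: Iterable[int]) -> str:
        return "-".join(str(v) for v in values if not is_grease(v))

    return ",".join([str(version), join(ciphers), join(extensions),
                     join(curves), join(point_formats)])


def ja3_hash(version: int, ciphers: Iterable[int], extensions: Iterable[int],
             curves: Iterable[int], point_formats: Iterable[int]) -> str:
    """MD5 of the JA3 string; MD5 is used as a compact label, not for security."""
    text = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(text.encode("ascii")).hexdigest()


# Hypothetical ClientHello fields, chosen only for illustration.
example = (771, [0x0A0A, 4865, 4866, 4867], [0, 23, 65281, 10, 11], [29, 23, 24], [0])
fingerprint = ja3_hash(*example)
```

Two clients built on the same TLS library and configuration produce the same hash, which is why a client that mimics a browser has to reproduce the ordering of every field, not just the set of values.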

Certificate

As a legitimate TLS server, the proxy must present a certificate. In TLS 1.2 and earlier the certificate is sent in cleartext during the handshake; in TLS 1.3 it is encrypted, but the SNI is still visible. Even though the GFW cannot determine the server’s purpose from the certificate alone, it can query the certificate registration record using SNI, which leaks considerable information. There are practically four ways to configure the certificate:

No certificate. It is possible to run TLS without a certificate, relying purely on a pre-shared key, a mode known as TLS-PSK. However, this is very uncommon on the internet and creates a unique pattern that is easily recognizable.
Self-signed certificate. Since the proxy server is not serving the general public, you can use a self-signed certificate and add it to the trusted store on the client machine. The problem is that a self-signed certificate cannot be found in any public registry, making it inherently suspicious.
Certificate signed by a self-signed CA. A self-signed CA is not in the browser’s default chain of trust and cannot be found in any public registry either. However, this setup is not so uncommon: large organizations and TLS middleboxes tend to use it.
Public SSL certificate. Free and paid services exist for certificate registration. In general, this is the most similar to a real website. However, the domain and certificate registration record might still be suspicious. A single, fresh certificate for a rarely visited site, issued by a free CA like Let’s Encrypt, is much more suspicious than a bulk registration from a well-known organization with a reputable, business-oriented CA.
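The four configurations above can be read as a rough ranking of how much attention a certificate attracts. The sketch below encodes that ranking as a toy scoring function. The `CertObservation` type, its field names, and every weight are hypothetical; they only preserve the ordering described in the text and say nothing about real GFW thresholds.

```python
from dataclasses import dataclass


@dataclass
class CertObservation:
    presented: bool          # False models TLS-PSK (no certificate at all)
    issuer: str
    subject: str
    publicly_logged: bool    # found in a public registry
    age_days: int            # days since issuance
    issued_by_free_ca: bool


def certificate_suspicion(cert: CertObservation) -> float:
    """Return a made-up suspicion score in [0, 1]; higher means more suspicious."""
    if not cert.presented:
        return 0.9                      # TLS-PSK: rare on the public internet
    if cert.issuer == cert.subject:
        return 0.8                      # self-signed leaf certificate
    if not cert.publicly_logged:
        return 0.5                      # private CA: unusual, but seen in enterprises
    score = 0.1                         # public certificate baseline
    if cert.issued_by_free_ca:
        score += 0.15
    if cert.age_days < 30:
        score += 0.15                   # freshly issued
    return round(score, 2)


psk = CertObservation(False, "", "", False, 0, False)
self_signed = CertObservation(True, "CN=proxy", "CN=proxy", False, 10, False)
private_ca = CertObservation(True, "CN=Corp CA", "CN=intranet", False, 400, False)
fresh_free = CertObservation(True, "CN=Free CA", "CN=blog.example", True, 3, True)
established = CertObservation(True, "CN=Business CA", "CN=shop.example", True, 300, False)
```

A real classifier would use many more features (domain age, traffic volume to the site, number of names on the certificate), but the shape is the same: each configuration choice moves the score.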

Timing and Packet Length Patterns

Trojan exposes no protocol-level signals, but there are more nuanced patterns that the GFW can follow. A user typically opens a TLS connection to the Trojan server, then immediately opens another TLS connection to the real destination inside that tunnel. This TLS-over-TLS pattern is highly characteristic of proxies.

To identify TLS-over-TLS, the GFW can examine the first few packets after the outer TLS handshake completes. These are handshake packets for the second, inner TLS layer: they have distinct lengths compared to typical TLS application data, and their round-trip time is slightly longer than the initial handshake, since the server has to route them onward to the real destination.
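As a rough illustration of that check, the sketch below looks at the first two records after the outer handshake and at the delay of the first reply. The `Record` type and all thresholds (200 to 700 bytes, 1200 bytes, the 1.2 factor) are hypothetical values picked for the example, not measured numbers.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    from_client: bool
    length: int          # TLS record payload length in bytes


def looks_like_inner_handshake(records: List[Record],
                               outer_rtt_ms: float,
                               first_reply_ms: float) -> bool:
    """Heuristic: a ClientHello-sized record from the client, a certificate-sized
    reply from the server, and a reply slower than the outer handshake RTT."""
    if len(records) < 2:
        return False
    first, second = records[0], records[1]
    hello_like = first.from_client and 200 <= first.length <= 700
    cert_like = (not second.from_client) and second.length >= 1200
    slower_reply = first_reply_ms > outer_rtt_ms * 1.2
    return hello_like and cert_like and slower_reply


# Tunnel-like flow: small hello, large reply, extra hop adds latency.
tunnel = [Record(True, 517), Record(False, 2900), Record(True, 80)]
# Plain HTTPS: an HTTP request followed by a quick, small response.
plain = [Record(True, 120), Record(False, 350)]
```

Each condition on its own produces many false positives; it is the combination of lengths, direction, and timing that makes the pattern useful as one input among several.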

Modern TLS evasion tools mitigate these patterns by padding the payload (especially the first few packets) with random bytes, and adding random timing jitter to the responses.
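A minimal sketch of both mitigations follows. The frame layout (two 16-bit length fields, then payload, then random bytes) is an assumption made up for this example and does not match the wire format of any particular tool; in practice the whole frame would travel inside the encrypted outer TLS stream.

```python
import os
import random
import struct

HEADER = struct.Struct("!HH")   # payload length, padding length (network byte order)


def pad(payload: bytes, max_padding: int = 255) -> bytes:
    """Prefix the payload with its length and append a random amount of random bytes."""
    if len(payload) > 0xFFFF or not 0 <= max_padding <= 0xFFFF:
        raise ValueError("payload and padding must each fit in 16 bits")
    pad_len = random.randint(0, max_padding)
    return HEADER.pack(len(payload), pad_len) + payload + os.urandom(pad_len)


def unpad(frame: bytes) -> bytes:
    """Recover the original payload and discard the padding."""
    if len(frame) < HEADER.size:
        raise ValueError("frame too short")
    payload_len, pad_len = HEADER.unpack(frame[:HEADER.size])
    if len(frame) < HEADER.size + payload_len + pad_len:
        raise ValueError("truncated frame")
    return frame[HEADER.size:HEADER.size + payload_len]


def jitter_delay_ms(base_ms: float = 0.0, max_jitter_ms: float = 30.0) -> float:
    """Delay to wait before sending a response: base plus uniform random jitter."""
    return base_ms + random.uniform(0.0, max_jitter_ms)
```

Padding only the first few frames keeps the bandwidth overhead small while still hiding the inner handshake lengths, which is where the length signal is strongest.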

Behavioral and Contextual Signals

A more subtle trace of proxy usage comes from behavioral patterns. A Trojan user’s network profile is a long-lived, single TLS session targeting a specific foreign IP and port. This is very different from typical internet traffic, where the client opens multiple, relatively short-lived TLS sessions to various destinations. Such analysis can even extend beyond a single day: by cooperating with ISPs, the GFW can track user behavior across days and months to build accurate behavioral profiles.
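The contrast between a tunnel user and an ordinary user can be sketched with two simple features: how concentrated the traffic is on one destination, and how long the longest session lasts. The `Session` type, the feature names, and both thresholds are assumptions for illustration only.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Session:
    dst: str             # "ip:port" of the remote endpoint
    duration_s: float
    bytes_total: int


def behavior_profile(sessions: List[Session]) -> Dict[str, float]:
    """Summarize one client's TLS sessions over an observation window."""
    if not sessions:
        raise ValueError("no sessions observed")
    per_dst: Dict[str, int] = defaultdict(int)
    for s in sessions:
        per_dst[s.dst] += s.bytes_total
    total = sum(per_dst.values())
    top_bytes = max(per_dst.values())
    return {
        "top_share": top_bytes / total if total else 0.0,   # traffic share of busiest destination
        "longest_s": max(s.duration_s for s in sessions),
        "distinct_dsts": float(len(per_dst)),
    }


def looks_like_tunnel(profile: Dict[str, float],
                      share_threshold: float = 0.8,
                      duration_threshold_s: float = 3600.0) -> bool:
    """Flag clients whose traffic is dominated by one long-lived destination."""
    return (profile["top_share"] >= share_threshold
            and profile["longest_s"] >= duration_threshold_s)


proxy_user = [Session("203.0.113.5:443", 7200, 900), Session("198.51.100.9:443", 20, 100)]
browser_user = [Session(f"198.51.100.{i}:443", 30, 200) for i in range(1, 6)]
```

Running the same summary over days or months instead of one window is what turns a noisy signal into a stable profile.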

There is also the destination IP itself. Proxy servers are usually deployed on VPS providers, which assign IP addresses from their allocated ranges; these allocations are publicly recorded in the regional Internet registries’ WHOIS databases. IP ranges from popular VPS providers like Vultr and DigitalOcean are flagged as higher risk because they are rarely used by popular websites or services, but frequently used by proxy service providers or individuals deploying their own proxies. IP ranges belonging to popular websites and well-known network operators are lower risk: they do not offer self-hosting, and blocking them incurs far greater collateral damage.
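Tagging a destination by IP range is a simple prefix lookup. The sketch below uses Python’s standard `ipaddress` module; the ranges are reserved documentation prefixes standing in for real provider allocations, and the risk labels are hypothetical.

```python
import ipaddress

# Placeholder ranges (reserved for documentation), not real provider allocations.
HIGH_RISK_RANGES = [ipaddress.ip_network("203.0.113.0/24")]    # stands in for a VPS provider
LOW_RISK_RANGES = [ipaddress.ip_network("198.51.100.0/24")]    # stands in for a major website


def ip_risk(addr: str) -> str:
    """Return 'high', 'low', or 'unknown' depending on which range contains addr."""
    ip = ipaddress.ip_address(addr)
    if any(ip in net for net in HIGH_RISK_RANGES):
        return "high"
    if any(ip in net for net in LOW_RISK_RANGES):
        return "low"
    return "unknown"
```

For example, `ip_risk("203.0.113.7")` falls in the first placeholder range and is tagged high risk, while an address outside both lists stays unknown and has to be judged on other signals.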

Active Probing

We described probing Shadowsocks servers in Post 7. Trojan is in principle much more difficult to probe, as it is a standard TLS server, especially when a fallback web server is configured. Nonetheless, every request still goes through the Trojan server’s authentication process, which is distinctive. By closely studying the source code of Trojan implementations, it is possible to send well-crafted, intentionally malformed packets that trigger specific timing behaviors in the authentication logic, distinguishing the server from a standard web server.

A Statistical Approach

None of the above fingerprinting methods can give a reliable classification result on its own, but they become very powerful when combined. Recent leaked documents indicate that the GFW runs multiple independent modules (entropy analysis, protocol-level analysis, TLS fingerprinting, domain/SNI tagging, behavioral study), operating in real time or near-real time against a large pool of logs and metrics. Each module produces a score, and the scores are combined through lightweight ML models (or even simple linear regression) to classify connections into categories that lead to allowing, blocking, or flagging for further review and probing.
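Combining per-module scores can be sketched as a weighted sum passed through a logistic function, followed by two thresholds. The logistic form is my choice for the example, since the source only says "lightweight ML models"; the module names mirror the list above, while the weights, bias, and thresholds are invented.

```python
import math
from typing import Dict

# Hypothetical weights for per-module scores in [0, 1].
WEIGHTS = {
    "entropy": 1.0,
    "protocol": 1.5,
    "tls_fingerprint": 2.0,
    "sni_tag": 1.0,
    "behavior": 1.5,
}
BIAS = -4.0   # keeps the probability low when no module fires


def combined_probability(scores: Dict[str, float]) -> float:
    """Weighted sum of module scores squashed into (0, 1) by a logistic function."""
    z = BIAS + sum(w * scores.get(name, 0.0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def classify(scores: Dict[str, float],
             flag_at: float = 0.5, block_at: float = 0.9) -> str:
    """Map the combined probability to allow, flag (for review or probing), or block."""
    p = combined_probability(scores)
    if p >= block_at:
        return "block"
    if p >= flag_at:
        return "flag"
    return "allow"
```

With these made-up numbers, a connection that trips only one module stays below the flagging threshold, while one that trips the TLS fingerprint, behavior, and protocol modules together gets flagged or blocked. That is the sense in which weak signals become strong in combination.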

[Figure: A more complete view of how the modern GFW’s various components connect]

The Arms Race Continues

TLS was a major leap for circumvention. Trojan’s developers described it as an “unidentifiable mechanism that helps you bypass GFW.” Today we know it is not bulletproof. In October 2022, the GFW began large-scale blocking of TLS-based circumvention protocols, including Trojan, VLESS, and V2Ray over TLS, drawing on the multi-module statistical fingerprinting described above. Tools such as NaïveProxy, XTLS, and REALITY have gained ground in response, each aiming to close the remaining fingerprinting gaps that statistical analysis exploits.

References

Liuying Lv, Peng Zhou. TrojanProbe: Fingerprinting Trojan tunnel implementations by actively probing crafted HTTP requests. Computers & Security. Volume 148. 2025. https://doi.org/10.1016/j.cose.2024.104147
refraction-networking. utls. https://github.com/refraction-networking/utls
Hannes Tschofenig and Pasi Eronen. Pre-Shared Key Ciphersuites for Transport Layer Security (TLS). RFC 4279. https://datatracker.ietf.org/doc/html/rfc4279
Inside the Great Firewall Part 2: Technical Infrastructure. DomainTools Investigations. https://dti.domaintools.com/research/inside-the-great-firewall-part-2-technical-infrastructure
Large scale blocking of TLS-based censorship circumvention tools in China. gfw-report. https://github.com/net4people/bbs/issues/129