Some Vulnerabilities are Invisible

Rather than inserting logical bugs, adversaries can attack the encoding of source code files to inject vulnerabilities.

These adversarial encodings produce no visual artifacts.

#include <stdio.h>
#include <stdbool.h>

int main() {
    bool isAdmin = false;
    /* begin admins only */ if (isAdmin) {
        printf("You are an admin.\n");
    /* end admins only */ }
    return 0;
}

The trick

The trick is to use Unicode control characters to reorder tokens in source code at the encoding level.

The reordered tokens display logic that, while semantically correct, diverges from the logic encoded by the logical ordering of the source code tokens.

Compilers and interpreters adhere to the logical ordering of source code, not the visual order.
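As a rough illustration of the gap between logical and visual order, the Python sketch below builds one line with embedded bidirectional control characters and lists them in the order they are stored. The exact encoding of the line is an assumption made for illustration, and `describe_controls` is a hypothetical helper name.

```python
import unicodedata

RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

# Assumed encoding of a line that a bidi-aware renderer may display with
# the `if` statement appearing outside the comment. In logical order,
# everything between /* and */ is comment text.
line = "/*" + RLO + " } " + LRI + "if (isAdmin)" + PDI + " " + LRI + " begin admins only */"


def describe_controls(text):
    """Return (index, code point, Unicode name) for each format character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


for entry in describe_controls(line):
    print(entry)
```

Iterating over the string visits characters in logical order, which is the order a compiler's lexer consumes. How the line is drawn on screen depends on the renderer's bidirectional algorithm.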

The attack

The attack is to use control characters embedded in comments and strings to reorder source code characters in a way that changes the program's logic.

The previous example, for instance, works by making a comment appear as if it were code. In logical order, the `if` statement sits inside the comment:

/* if (isAdmin) { begin admins only */

Adversaries can leverage this deception to commit vulnerabilities that human reviewers will not see.

This attack pattern is tracked as CVE-2021-42574.
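To see what a compiler is left with, the sketch below strips block comments from an assumed encoding of the admin example, working on the logical character order. The control-character placement is an assumption for illustration, and `strip_block_comments` is a deliberately naive, hypothetical helper that ignores string literals.

```python
import re

RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

# Assumed logical encoding of the example's body. The `if` and both braces
# live inside comments, so the printf call is unconditional.
source = (
    "bool isAdmin = false;\n"
    "/*" + RLO + " } " + LRI + "if (isAdmin)" + PDI + " " + LRI + " begin admins only */\n"
    '    printf("You are an admin.\\n");\n'
    "/* end admins only " + RLO + " { " + LRI + "*/\n"
)


def strip_block_comments(code):
    """Naively remove /* ... */ comments, scanning in logical order."""
    return re.sub(r"/\*.*?\*/", "", code, flags=re.DOTALL)


remaining = strip_block_comments(source)
print(remaining)
```

After stripping, the conditional is gone while the `printf` call remains, which matches the behavior the page describes for the compiled program.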

The supply chain

This attack is particularly powerful within the context of software supply chains.

If an adversary successfully commits targeted vulnerabilities into open source code by deceiving human reviewers, downstream software will likely inherit the vulnerability.

The technique

There are multiple techniques that can be used to exploit the visual reordering of source code tokens:

Early Returns cause a function to short-circuit by executing a `return` statement that visually appears to be within a comment.

Commenting-Out causes a comment to visually appear as code, which in turn is not executed.

Stretched Strings cause portions of string literals to visually appear as code, which has the same effect as commenting-out and causes string comparisons to fail.
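The string-comparison failure behind Stretched Strings can be sketched in a few lines of Python. The literal below is an assumed encoding in which text that may render as a trailing comment is really part of the string; `STRETCHED` and `is_restricted` are hypothetical names.

```python
RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

# Assumed stretched literal: a reviewer may see "user" followed by a
# comment, but in logical order the comment text and the control
# characters all sit inside the string.
STRETCHED = "user" + RLO + " " + LRI + "// Check if admin" + PDI + " " + LRI


def is_restricted(access_level):
    """Appears to compare against "user"; actually compares against STRETCHED."""
    return access_level == STRETCHED


print(is_restricted("user"))
print(len("user"), len(STRETCHED))
```

Because the stored literal is longer than the four characters a reviewer reads, a comparison against the plain value `"user"` never matches.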

The variant

A similar attack exists which uses homoglyphs, or characters that appear nearly identical.

#include <iostream>

void sayHello() {
    std::cout << "Hello, World!\n";
}

void sayНello() {
    std::cout << "Bye, World!\n";
}

The above example defines two distinct functions whose names are nearly indistinguishable: the second uses the Cyrillic letter Н (U+041D) in place of the Latin H.

An attacker can define such homoglyph functions in an upstream package imported into the global namespace of the target, which they then call from the victim code.

This attack variant is tracked as CVE-2021-42694.
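A short Python sketch can make the difference between the two identifiers explicit by comparing them character by character. `differing_chars` is a hypothetical helper, and the second name is written with an escape so that the Cyrillic letter is unambiguous.

```python
import unicodedata

latin_name = "sayHello"
homoglyph_name = "say\u041dello"  # U+041D is CYRILLIC CAPITAL LETTER EN


def differing_chars(a, b):
    """Return (index, name in a, name in b) for each position where a and b differ."""
    return [
        (i, unicodedata.name(x), unicodedata.name(y))
        for i, (x, y) in enumerate(zip(a, b))
        if x != y
    ]


print(latin_name == homoglyph_name)
print(differing_chars(latin_name, homoglyph_name))
```

The two strings have the same length and differ only at index 3, so any tool that compares identifiers by code point treats them as unrelated names.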

The defense

Compilers, interpreters, and build pipelines supporting Unicode should throw errors or warnings for unterminated bidirectional control characters in comments or string literals, and for identifiers with mixed-script confusable characters.
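A minimal sketch of both checks in Python follows, under stated assumptions. It works per line rather than per comment or string literal, and it approximates a character's script by the first word of its Unicode name, since the standard library exposes no script property. The function names are hypothetical, and a production check would more likely follow the Unicode security recommendations for confusables.

```python
import unicodedata

EMBEDDINGS = {"\u202a", "\u202b", "\u202d", "\u202e"}  # LRE, RLE, LRO, RLO
ISOLATES = {"\u2066", "\u2067", "\u2068"}              # LRI, RLI, FSI
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING, closes an embedding or override
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE, closes an isolate


def has_unterminated_bidi(line):
    """Return True if the line opens a bidi control that it never closes."""
    stack = []
    for ch in line:
        if ch in EMBEDDINGS or ch in ISOLATES:
            stack.append(ch)
        elif ch == PDF and stack and stack[-1] in EMBEDDINGS:
            stack.pop()
        elif ch == PDI and any(c in ISOLATES for c in stack):
            # A PDI also ends any embeddings opened inside the isolate.
            while stack.pop() not in ISOLATES:
                pass
    return bool(stack)


def scripts_in(identifier):
    """Heuristic: take the first word of each letter's Unicode name as its script."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in identifier
        if ch.isalpha()
    }


def is_mixed_script(identifier):
    """Return True if the identifier's letters span more than one script."""
    return len(scripts_in(identifier)) > 1


print(has_unterminated_bidi("/*\u202e } \u2066if (isAdmin)\u2069 \u2066 begin admins only */"))
print(is_mixed_script("say\u041dello"))
```

The mixed-script test is broader than the recommendation above: it flags every identifier that mixes scripts, not only those containing confusable characters, so a real tool would need an allowlist or a confusables table to limit false positives.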

Language specifications should formally disallow unterminated bidirectional control characters in comments and string literals.

Code editors and repository frontends should make bidirectional control characters and mixed-script confusable characters perceptible with visual symbols or warnings.
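One way a viewer could make such characters perceptible is to substitute a visible tag for each invisible format character before display. The Python sketch below does this for the whole Unicode "Cf" (format) category, which covers the bidirectional controls along with other invisible characters such as zero-width joiners. `reveal` is a hypothetical helper, not the behavior of any particular editor.

```python
import unicodedata


def reveal(text):
    """Replace each format (Cf) character with a visible <U+XXXX> tag."""
    return "".join(
        f"<U+{ord(ch):04X}>" if unicodedata.category(ch) == "Cf" else ch
        for ch in text
    )


print(reveal("/* end admins only \u202e { \u2066*/"))
```

Text without format characters passes through unchanged, so the substitution can run on every line a reviewer sees.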

The paper

Complete details can be found in the related paper.

If you use the paper or anything on this site in your own work, please cite the following:

@inproceedings{boucher_trojansource_2023,
    title = {Trojan {Source}: {Invisible} {Vulnerabilities}},
    author = {Nicholas Boucher and Ross Anderson},
    booktitle = {32nd USENIX Security Symposium (USENIX Security 23)},
    year = {2023},
    address = {Anaheim, CA},
    publisher = {USENIX Association},
    month = aug,
    url = {https://arxiv.org/abs/2111.00169}
}

Trojan Source

Invisible Source Code Vulnerabilities
