liblisa.nl/index.html at main · libLISA/liblisa.nl · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8">
    <link rel="icon" sizes="16x16 32x32 48x48 64x64 256x256" href="/favicon.ico">
    <link rel="canonical" href="https://liblisa.nl/">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <meta name="description" content="libLISA derives x86-64 instruction semantics bottom-up from CPU behavior. The semantics are machine-readable and CPU-specific.">
    <meta property="og:type" content="website">
    <meta property="og:title" content="libLISA - Instruction Discovery and Analysis on x86-64">
    <meta property="og:url" content="https://liblisa.nl">
    <meta property="og:image" content="https://liblisa.nl/logo.png">
    <meta property="og:description" content="libLISA derives x86-64 instruction semantics bottom-up from CPU behavior. The semantics are machine-readable and CPU-specific.">
    <meta name="twitter:card" content="summary">
    <meta property="twitter:domain" content="liblisa.nl">
    <meta property="twitter:url" content="https://liblisa.nl">
    <meta name="twitter:title" content="libLISA - Instruction Discovery and Analysis on x86-64">
    <meta name="twitter:description" content="libLISA derives x86-64 instruction semantics bottom-up from CPU behavior. The semantics are machine-readable and CPU-specific.">
    <meta name="twitter:image" content="https://liblisa.nl/logo.png">
    <title>libLISA - Instruction Discovery and Analysis on x86-64</title>
    <link rel="stylesheet" href="style.css">
  </head>
  <body>
    <div class="header">
        <h1>libLISA - Instruction Discovery and Analysis on x86-64</h1>
        <div class="authors">Jos Craaijo, Freek Verbeek, Binoy Ravindran</div>
    </div>
    <article class="page">
        <p>
            libLISA is a tool that can <i>fully automatically</i> scan instruction space, discover instructions and synthesize their semantics.
            It produces machine-readable, CPU-specific x86-64 instruction semantics.
            It relies on as little human specification as possible:
            specifically, it does not rely on a handwritten (dis)assembler to dictate which instructions are executable on a given CPU, or what their operands are.
        </p>

        <p>
            We presented libLISA at <a href="https://2024.splashcon.org/track/splash-2024-oopsla#event-overview" target="_blank">OOPSLA'24</a>.
        </p>

        <div class="buttonrow">
            <a class="button green" href="liblisa.pdf">
                <img src="file-pdf-regular.svg" class="glyph" alt="PDF file" />
                Read the paper
            </a>
            <a class="button yellow" target="_blank" href="https://explore.liblisa.nl/">
                <img src="map-regular.svg" class="glyph" alt="map icon" />
                Explore the data
            </a>
            <a class="button purple" target="_blank" href="https://github.com/liblisa">
                <img src="github-brands-solid.svg" class="glyph" alt="GitHub" />
                View source code
            </a>
        </div>

        <h2>Motivation</h2>
        <p>
            Even though heavily researched, a full formal model of the x86-64 instruction set is still not available.
            This is caused by the sheer complexity of the x86-64 architecture:
            the informal specification found in <a href="https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html" target="_blank">Intel manuals is roughly 4700 pages</a>, and even these are known to be <a href="https://stefanheule.com/publications/pldi16-strata/" target="_blank">not</a> <a href="https://comsec.ethz.ch/research/hardware-design-security/rememberr/" target="_blank">trustworthy</a>.
        </p>
        <p>
            Specifications of CPU architectures often rely heavily on manual work, which is error-prone and labor-intensive.
            The current <a href="https://github.com/kframework/x86-64-semantics" target="_blank">state-of-the-art formal semantics</a> for x86-64 took 8 man-months to write, and even that specification still contains 34 errors (see Section 5.2 of our paper).
        </p>
        <p>
            This situation becomes even more dire when taking into account that different x86-64 machines will behave differently: not only can they have different instruction sets, but behavior is also allowed to be undefined, in which case the same instruction has different behavior on different machines.
        </p>
        <p>
            libLISA aims to solve this problem by using a CPU as the ground truth, and deriving semantics by observing instruction execution.
            It uses fuzzing and program synthesis to derive semantics automatically.
        </p>
        <p>
            One of the use-cases for libLISA's semantics is verification of other binary analysis tools.
            Errors in instruction semantics often propagate to the final output of a binary analyzer or emulator, making them behave incorrectly.
            Differences between actual CPU behavior and these tools pose a security risk: malware can abuse such differences to obfuscate its behavior, or evade analysis in controlled environments like emulators.
        </p>
        <p>
            Our current focus is on using libLISA's automatically inferred semantics to verify the correctness of disassemblers, emulators and other semantics.
        </p>

        <h2>x86-64 Instruction Semantics</h2>
        <p>
            We analyzed five different architectures: AMD 3900X, AMD 7700X, Intel i9-13900 (p), Intel i9-13900 (e) and Intel Xeon Silver 4110.
            For each architecture, we generated around 120k <i>encodings</i>.
        </p>
        <p>
            The semantics can be accessed in the following two ways:

            <ul>
                <li>Use <a target="_blank" href="https://explore.liblisa.nl/">explore.liblisa.nl</a> to browse the semantics manually.</li>
                <li>Use the <a target="_blank" href="https://github.com/libLISA/liblisa/tree/main/cli/liblisa-semantics-tool"><tt>liblisa-semantics-tool</tt></a> or <a target="_blank" href="https://crates.io/crates/liblisa">the <tt>liblisa</tt> Rust crate</a> to use the semantics programmatically.</li>
            </ul>
        </p>
        <p>
            The generated x86-64 semantics are CPU-specific.
            Around 90% of all encodings is identical on all 5 CPU architectures that we analyzed.
            The remaining 10% differs between architectures.
            We can broadly classify these differences into two categories: instruction set extensions and undefined behavior.
            Some x86-64 instruction set extensions, such as the <tt>SHA1</tt> instructions, are not implemented on all CPUs.
            There are also many differences in how undefined behavior is implemented across different x86-64 CPU architectures.
        </p>
        <p>
            An example of undefined behavior is the <a target="_blank" href="https://explore.liblisa.nl/instruction/0FAFC3"><tt>IMUL</tt> instruction</a>.
            The <a target="_blank" href="https://www.felixcloutier.com/x86/imul#flags-affected">reference manual</a> states: <i>"The SF, ZF, AF, and PF flags are undefined."</i>
            This means that the values of these flags can differ between CPU architectures, even if the instruction is provided with the same inputs.
            As can be seen <a target="_blank" href="https://explore.liblisa.nl/instruction/0FAFC3">in libLISA's semantics explorer</a>, The AMD 3900X and AMD 7700X do not modify these flags.
            The Intel i9-13900 (p) and Intel Xeon Silver 4110 compute the SF and PF correctly, and set the AF and ZF to 0.
            The Intel i9-13900 (e) additionally also computes the ZF flag correctly, and sets only the AF to 0.
        </p>
        <p>
            The AMD 3900X and AMD 7700X often share the same instruction semantics.
            This is likely because the AMD 7700X is a successor of the AMD 3900X.
            Similarly, the Intel i9-13900 (p) and Intel Xeon Silver 4110 often also share semantics, likely because the Intel i9-13900 (p) is also a successor of the Intel Xeon Silver 4110.
        </p>

        <h2>Acknowledgements</h2>
        <p>
            This work is supported by the Defense Advanced Research Projects Agency (DARPA) and Naval Information Warfare Center Pacific (NIWC Pacific) under Contract No. N66001-21-C-4028.
        </p>
    </article>
  </body>
</html>