Understanding Assembly

From lecture notes on low-level programming
Author @asyncze

Assembly languages

An assembly language is a low-level symbolic language with processor-specific instructions and syntax, such as the x86 syntaxes developed at AT&T and Intel.

Instructions, syntax, data types, registers, and hardware support are typically specified in the instruction set architecture (ISA), which represents an abstract model of a computer that processors implement. An assembly program is a sequence of instructions, where each instruction represents an actual operation to be performed by the processor.

In most assembly-like languages, an instruction has the form mnemonic <source>, <destination> in AT&T syntax, or mnemonic <destination>, <source> in Intel syntax. For example, mov %eax, %ebx (AT&T syntax) is the instruction to copy the value from %eax to %ebx.

Mnemonics and directives

Mnemonics tell the CPU what to do, such as mov, add, sub, push, pop, call, and jmp.

The source and destination placeholders are registers (%eax, %esp, or %al), memory locations (0x401000, 8(%ebp), or (%edx, %ecx, 4)), or constants (source only, such as $42 or $0x401000). Below are assembler directives, which are commands for the assembler rather than instructions for the CPU:

.data to identify section with variables

.text to identify section with code

.byte, .word, or .long to define integer as 8, 16, or 32-bit

.ascii or .asciz to define a string without or with a null terminator

There are also labels, which represent symbols at the current address: number: .byte 42 is the same as char number = 42; in the C programming language.
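As a rough C analogy for the directives and labels above, each labelled directive can be read as a variable definition. This is only a sketch: the names number, port, count, raw, and text are hypothetical, and the sizes assume the x86 assembler convention where .word is 16 bits.

```c
#include <stdint.h>

/* number: .byte 42      -> one 8-bit integer */
uint8_t number = 42;

/* port: .word 8080      -> one 16-bit integer */
uint16_t port = 8080;

/* count: .long 100000   -> one 32-bit integer */
uint32_t count = 100000;

/* raw: .ascii "hi"      -> 2 bytes, no terminator */
char raw[2] = { 'h', 'i' };

/* text: .asciz "hi"     -> 3 bytes, the last one is 0 */
char text[] = "hi";
```

The only difference between raw and text is the trailing zero byte, which is what C string functions rely on.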

Registers

A register is a small storage location on the CPU and, in AT&T syntax, is prefixed with %. There are general-purpose registers (such as %eax, and including the stack pointer %esp and frame pointer %ebp), the instruction pointer %eip, and the flags register.

A constant is prefixed with $, and the operand size is specified as a suffix to the mnemonic: b for a byte (8 bits), w for a word (16 bits), and l for a long (a 32-bit integer or a 64-bit floating-point value).

Here's the initial layout and the final layout after executing the mov $1, %eax instruction, i.e. copy the constant 1 and set the rest to zero (note that %al is bits 0-7, %ah is bits 8-15, %ax is bits 0-15, and %eax is bits 0-31):

Initial layout:

%al      %ah
0        8        16             31

Final layout (bit 0 is leftmost):

%al      %ah
10000000 00000000 00000000 00000000
0        8        16             31
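The relationship between %eax and its sub-registers can be sketched in C with shifts and masks. This is an illustrative model, not how the hardware is implemented; the function names reg_al, reg_ah, reg_ax, and mov_al are hypothetical.

```c
#include <stdint.h>

/* %al is bits 0-7 of %eax */
uint8_t reg_al(uint32_t eax) { return (uint8_t)(eax & 0xFFu); }

/* %ah is bits 8-15 of %eax */
uint8_t reg_ah(uint32_t eax) { return (uint8_t)((eax >> 8) & 0xFFu); }

/* %ax is bits 0-15 of %eax */
uint16_t reg_ax(uint32_t eax) { return (uint16_t)(eax & 0xFFFFu); }

/* movb $value, %al: replace bits 0-7 and leave bits 8-31 unchanged */
uint32_t mov_al(uint32_t eax, uint8_t value) {
    return (eax & 0xFFFFFF00u) | value;
}
```

After mov $1, %eax the model gives reg_al(1) == 1 and reg_ah(1) == 0, which matches the final layout.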

In C programs, a memory location can be accessed using a pointer. A pointer is a special variable that stores a memory address; dereferencing a pointer means accessing the value stored at that address. In assembly, the address to access is computed from the contents of a base register and an index register, with an optional constant displacement and scaling factor (the available forms are determined by the architecture and instruction set). In AT&T syntax this is written displacement(base, index, scale) and refers to the address base + index * scale + displacement.
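The address computation can be sketched as a small C function. This is a minimal model assuming 32-bit registers; effective_address is a hypothetical name, and its parameters stand for the base register, index register, scaling factor, and constant displacement.

```c
#include <stdint.h>

/* Address named by displacement(base, index, scale) in AT&T syntax.
   On x86 the scale is 1, 2, 4, or 8. Arithmetic wraps modulo 2^32. */
uint32_t effective_address(uint32_t base, uint32_t index,
                           uint32_t scale, int32_t displacement) {
    return base + index * scale + (uint32_t)displacement;
}
```

For example, 8(%ebp) corresponds to effective_address(ebp, 0, 1, 8), and (%edx, %ecx, 4) corresponds to effective_address(edx, ecx, 4, 0), which is how element ecx of an array of 4-byte values starting at edx is addressed.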

Instructions

Here's an example assembly program to output hello assembly (note that the r* registers are 64-bit).

; assembly program (64-bit Linux, Intel syntax, NASM)
section .text
    global _start       ; entry point for execution (Linux, linked with ld)

_start:
    mov rax, 1          ; write (system call)
    mov rdi, 1          ; stdout
    mov rsi, msg        ; address to output
    mov rdx, 15         ; bytes to output (14 characters plus line feed)
    syscall             ; invoke write
    mov rax, 60         ; exit (system call)
    xor rdi, rdi        ; 0
    syscall             ; invoke exit

section .data
    ; db is raw bytes and line feed \n is 0x0a, or 10
    msg db "hello assembly", 10
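For comparison, the write system call made by the program above can be sketched in C. This assumes a POSIX system; say_hello is a hypothetical helper, and the three arguments to write correspond to the values placed in rdi, rsi, and rdx.

```c
#include <unistd.h>

/* Hypothetical helper: the same write call, made from C.
   fd goes in rdi, the buffer address in rsi, the byte count in rdx. */
ssize_t say_hello(int fd) {
    static const char msg[] = "hello assembly\n"; /* 15 bytes plus a terminator */
    return write(fd, msg, sizeof msg - 1);        /* the terminator is not written */
}
```

The byte count is 15 because the message has 14 characters and one line feed. The exit system call has no return value to inspect, so it is left out of the sketch.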

Data transfer

The data transfer instructions are used to move data between registers, memory, and the stack.

; copy source to destination
mov <source>, <destination>

; swap destinations
xchg <destination>, <destination>

; store source on top of stack
push <source>

; get destination from top of stack
pop <destination>
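The effect of push, pop, and xchg can be sketched with a small C model. This is a simplified illustration: the Stack type and function names are hypothetical, the stack pointer is an array index rather than a byte address, and there are no bounds checks.

```c
#include <stdint.h>

#define STACK_WORDS 16

/* Hypothetical model of a 32-bit stack that grows toward lower indices */
typedef struct {
    uint32_t memory[STACK_WORDS];
    uint32_t esp;  /* index of the top element; STACK_WORDS means empty */
} Stack;

void stack_init(Stack *s) { s->esp = STACK_WORDS; }

/* push <source>: decrement the stack pointer, then store */
void push(Stack *s, uint32_t source) {
    s->esp -= 1;
    s->memory[s->esp] = source;
}

/* pop <destination>: load, then increment the stack pointer */
uint32_t pop(Stack *s) {
    uint32_t value = s->memory[s->esp];
    s->esp += 1;
    return value;
}

/* xchg <destination>, <destination>: swap the two values */
void xchg(uint32_t *a, uint32_t *b) {
    uint32_t tmp = *a;
    *a = *b;
    *b = tmp;
}
```

Values come off the stack in the reverse order they were pushed, and the stack pointer ends where it started once every push has a matching pop.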

Binary arithmetic

The binary arithmetic instructions are used to perform arithmetic operations on binary data.

; addition, destination += source
add <source>, <destination>

; subtraction, destination -= source
sub <source>, <destination>

; increment, destination += 1
inc <destination>

; decrement, destination -= 1
dec <destination>

; negation, destination = -destination
neg <destination>
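The arithmetic instructions operate on fixed-width values, so results wrap around instead of growing. A minimal C sketch for 32-bit operands, with hypothetical function names:

```c
#include <stdint.h>

/* add: destination += source (wraps modulo 2^32) */
uint32_t add32(uint32_t destination, uint32_t source) { return destination + source; }

/* sub: destination -= source (wraps modulo 2^32) */
uint32_t sub32(uint32_t destination, uint32_t source) { return destination - source; }

/* inc: destination += 1 */
uint32_t inc32(uint32_t destination) { return destination + 1u; }

/* dec: destination -= 1 */
uint32_t dec32(uint32_t destination) { return destination - 1u; }

/* neg: two's complement negation, destination = 0 - destination */
uint32_t neg32(uint32_t destination) { return 0u - destination; }
```

For example, adding 1 to 0xFFFFFFFF gives 0, and negating 1 gives 0xFFFFFFFF, which is the bit pattern of -1 when read as a signed value.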

Logical operators

The logical operator instructions are used to perform logical operations on binary data.

; and, destination &= source
and <source>, <destination>

; or, destination |= source
or <source>, <destination>

; exclusive or, destination ^= source
xor <source>, <destination>

; not, destination = ~destination
not <destination>
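These instructions map directly onto the C bitwise operators. A short sketch for 32-bit operands, with hypothetical function names:

```c
#include <stdint.h>

/* and: keep only the bits set in both operands */
uint32_t and32(uint32_t destination, uint32_t source) { return destination & source; }

/* or: keep the bits set in either operand */
uint32_t or32(uint32_t destination, uint32_t source) { return destination | source; }

/* xor: keep the bits set in exactly one operand */
uint32_t xor32(uint32_t destination, uint32_t source) { return destination ^ source; }

/* not: flip every bit */
uint32_t not32(uint32_t destination) { return ~destination; }
```

Since xor32(x, x) is 0 for every x, applying xor to a register and itself is a common way to set it to zero, as xor rdi, rdi does in the example program.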

Unconditional branches

The unconditional branch instructions are used to change the flow of execution.

; jump to address
jmp <address>

; push return address and call function at address
call <address>

; pop return address and return
ret

; raise software interrupt, calling the OS-defined handler represented by const
int <const>
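How jmp, call, and ret change the instruction pointer and the stack can be sketched with a small C model. This is a simplified illustration: the Cpu type is hypothetical, the stack pointer is an array index rather than a byte address, and the address of the instruction following the call is passed in as next_instruction because the model does not decode instruction lengths.

```c
#include <stdint.h>

#define STACK_WORDS 16

/* Hypothetical CPU model: instruction pointer, stack pointer, small stack */
typedef struct {
    uint32_t eip;
    uint32_t esp;                 /* index of the top element */
    uint32_t stack[STACK_WORDS];
} Cpu;

void cpu_init(Cpu *cpu, uint32_t entry) {
    cpu->eip = entry;
    cpu->esp = STACK_WORDS;       /* empty stack */
}

/* jmp <address>: continue at address */
void jmp(Cpu *cpu, uint32_t address) { cpu->eip = address; }

/* call <address>: push the address of the next instruction, then jump */
void call(Cpu *cpu, uint32_t address, uint32_t next_instruction) {
    cpu->esp -= 1;
    cpu->stack[cpu->esp] = next_instruction;
    cpu->eip = address;
}

/* ret: pop the return address and continue there */
void ret(Cpu *cpu) {
    cpu->eip = cpu->stack[cpu->esp];
    cpu->esp += 1;
}
```

The only difference between jmp and call is the saved return address, which is what lets ret find its way back.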

Conditional branches

The conditional branch instructions are used to change the flow of execution based on the value of the flags register.

; jump below (unsigned), %eax < %ebx (label is location)
cmp %ebx, %eax
jb <label>

; jump not less (signed), %eax >= %ebx
cmp %ebx, %eax
jnl <label>

; jump zero, %eax = 0
test %eax, %eax
jz <label>

; jump not sign (sign flag is clear), %eax >= 0
cmp $0, %eax
jns <label>
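The way cmp and test set the flags, and how each jump reads them, can be sketched in C. This is a model of the documented x86 flag behaviour for 32-bit operands; the Flags type and function names are hypothetical, and the argument order follows the AT&T operand order used above.

```c
#include <stdbool.h>
#include <stdint.h>

/* The four status flags the conditional jumps above read */
typedef struct {
    bool cf;  /* carry: unsigned borrow */
    bool zf;  /* zero: result is 0 */
    bool sf;  /* sign: top bit of the result */
    bool of;  /* overflow: signed result does not fit */
} Flags;

/* cmp %ebx, %eax: compute eax - ebx and keep only the flags */
Flags cmp_flags(uint32_t ebx, uint32_t eax) {
    uint32_t result = eax - ebx;
    Flags f;
    f.cf = eax < ebx;
    f.zf = result == 0;
    f.sf = (result >> 31) != 0;
    /* overflow: operands differ in sign and the result's sign differs from eax */
    f.of = (((eax ^ ebx) & (eax ^ result)) >> 31) != 0;
    return f;
}

/* test %ebx, %eax: compute eax & ebx and keep only the flags */
Flags test_flags(uint32_t ebx, uint32_t eax) {
    uint32_t result = eax & ebx;
    Flags f = { false, result == 0, (result >> 31) != 0, false };
    return f;
}

bool jb(Flags f)  { return f.cf; }          /* below (unsigned) */
bool jnl(Flags f) { return f.sf == f.of; }  /* not less (signed) */
bool jz(Flags f)  { return f.zf; }          /* zero */
bool jns(Flags f) { return !f.sf; }         /* not sign */
```

The same bit pattern compares differently depending on the jump: 0xFFFFFFFF is the largest unsigned value, so jb is not taken against 1, but it is -1 when signed, so jnl is not taken either.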

Misc instructions

Other common instructions include lea and nop.

; load effective address (source must be in memory), destination = &source
lea <source>, <destination>

; do nothing
nop
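The difference between lea and mov with a memory source can be sketched in C: lea yields the address, mov reads what is stored there. Because lea only does the address arithmetic, it is also commonly used for plain multiplication and addition. The function names are hypothetical.

```c
#include <stdint.h>

/* lea: destination = &source (computes the address, no memory access) */
const int32_t *lea_element(const int32_t *base, uint32_t index) {
    return &base[index];
}

/* mov: destination = source (reads the value stored at the address) */
int32_t mov_element(const int32_t *base, uint32_t index) {
    return base[index];
}

/* lea (%eax, %eax, 4), %edx: edx = eax + eax * 4, used purely as arithmetic */
uint32_t lea_times5(uint32_t eax) {
    return eax + eax * 4u;
}
```

With int32_t table[4], lea_element(table, 2) is the same pointer as table + 2, while mov_element(table, 2) is the value stored in that slot.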