Understanding Assembly
From lecture notes on low-level programming
Author: @asyncze
An assembly language is a low-level symbolic language with processor-specific instructions and syntax, such as the x86 syntaxes developed at AT&T and Intel.
Instructions, syntax, data types, registers, and hardware support are typically specified in the instruction set architecture (ISA), which represents an abstract model of a computer implementation. An assembly program is a sequence of instructions, where each instruction represents an actual operation to be performed by the processor.
In most assembly-like languages, an instruction has the form `mnemonic <source>, <destination>` in AT&T syntax, or `mnemonic <destination>, <source>` in Intel syntax. For example, `mov %eax, %ebx` is the AT&T instruction that copies the value in `%eax` to `%ebx`.
Mnemonics tell the CPU what to do, such as `mov`, `add`, `sub`, `push`, `pop`, `call`, and `jmp`.
The source and destination operands are registers (`%eax`, `%esp`, or `%al`), memory locations (`0x401000`, `8(%ebp)`, or `(%edx, %ecx, 4)`), or constants (source only, such as `$42` or `$0x401000`).
Below are assembler directives, which are commands for the assembler rather than instructions for the CPU:
- `.data` to identify the section with variables
- `.text` to identify the section with code
- `.byte`, `.word`, or `.long` to define an integer as 8, 16, or 32-bit
- `.ascii` or `.asciz` to define a string without or with a terminating zero byte
There are also labels, which are symbols that stand for the current address. For example, `number: .byte 42` is roughly the same as `char number = 42;` in the C programming language.
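To make the mapping from directives to C concrete, here is a minimal sketch of rough C counterparts. The names `word_value`, `long_value`, `raw`, and `text` are made up for illustration, and the fixed-width types from `<stdint.h>` are used only to pin down the sizes.

```c
#include <stdint.h>

/* Rough C counterparts of the assembler directives above.
   All names except `number` are hypothetical. */
int8_t  number     = 42;       /* number:     .byte 42      (8-bit)  */
int16_t word_value = 1000;     /* word_value: .word 1000    (16-bit) */
int32_t long_value = 100000;   /* long_value: .long 100000  (32-bit) */

/* .ascii stores only the characters, with no terminating zero byte */
char raw[5] = { 'h', 'e', 'l', 'l', 'o' };

/* .asciz stores the characters followed by a zero byte, like a C string */
char text[] = "hello";
```

Here `sizeof raw` is 5 while `sizeof text` is 6, which is the difference between `.ascii` and `.asciz`.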
A register is a small storage location on the CPU and, in AT&T syntax, is prefixed with `%`. There are general-purpose registers (including the stack pointer `%esp` and frame pointer `%ebp`), the instruction pointer `%eip`, and the flags register. The general-purpose registers can be accessed at several widths:
- extended (32-bit) registers include `%eax`, `%ebx`, `%ecx`, `%edx`, `%esi`, and `%edi`
- smaller parts (16-bit), such as `%ax`, `%bx`, `%cx`, `%dx`, `%sp` (stack pointer), `%bp` (frame pointer), `%si`, and `%di`
- lower byte, such as `%al`, `%bl`, `%cl`, and `%dl`
- second byte, such as `%ah`, `%bh`, `%ch`, and `%dh`
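The relationship between `%eax` and its smaller parts can be sketched in C with masks and shifts. This is only a model of which bits each name refers to; the helper names `view_ax`, `view_al`, and `view_ah` are hypothetical.

```c
#include <stdint.h>

/* Each helper returns the part of a 32-bit value that the named
   sub-register of %eax would expose. */

/* %ax: bits 0-15 */
uint16_t view_ax(uint32_t eax) { return (uint16_t)(eax & 0xFFFFu); }

/* %al: bits 0-7 */
uint8_t view_al(uint32_t eax) { return (uint8_t)(eax & 0xFFu); }

/* %ah: bits 8-15 */
uint8_t view_ah(uint32_t eax) { return (uint8_t)((eax >> 8) & 0xFFu); }
```

For example, if `%eax` holds `0x12345678`, then `%ax` reads as `0x5678`, `%al` as `0x78`, and `%ah` as `0x56`.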
A constant is prefixed with `$`, and the operand size is specified as a suffix to the mnemonic: `b` for byte (8-bit), `w` for word (16-bit), and `l` for long (32-bit integer or 64-bit floating point).
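Here's the initial layout and the final layout after executing the `mov $1, %eax` instruction, i.e. copy constant 1 into the low bits and set the rest to zero (note that `%ax` is bits 0-15 and `%eax` is bits 0-31, and that bits are listed with the least significant on the left):

Initial layout (bit positions):

| `%al` | `%ah` | | |
|---|---|---|---|
| 0 | 8 | 16 | 31 |

Final layout:

| `%al` | `%ah` | | |
|---|---|---|---|
| 10000000 | 00000000 | 0 – 0 | 0 – 0 |
| 0 | 8 | 16 | 31 |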
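The effect of the size suffix on the destination can be sketched in C. This is a model under the assumption of 32-bit x86 behavior, where a byte or word write leaves the remaining bits of the register unchanged; the function names are hypothetical.

```c
#include <stdint.h>

/* movl $imm, %eax: replaces all 32 bits */
uint32_t movl_eax(uint32_t eax, uint32_t imm)
{
    (void)eax;                       /* the old value is fully overwritten */
    return imm;
}

/* movw $imm, %ax: replaces bits 0-15, keeps bits 16-31 */
uint32_t movw_ax(uint32_t eax, uint16_t imm)
{
    return (eax & 0xFFFF0000u) | imm;
}

/* movb $imm, %al: replaces bits 0-7, keeps bits 8-31 */
uint32_t movb_al(uint32_t eax, uint8_t imm)
{
    return (eax & 0xFFFFFF00u) | imm;
}

/* movb $imm, %ah: replaces bits 8-15, keeps the rest */
uint32_t movb_ah(uint32_t eax, uint8_t imm)
{
    return (eax & 0xFFFF00FFu) | ((uint32_t)imm << 8);
}
```

Starting from `0xDEADBEEF`, `movl_eax(..., 1)` gives `0x00000001`, while `movb_al(..., 1)` gives `0xDEADBE01`, since only the low byte is written.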
In C programs, a memory location can be accessed using a pointer. A pointer is a variable that stores a memory address; dereferencing a pointer means accessing the value stored at that address. In assembly, the address of a memory operand is computed as base + index × scale + displacement, from the contents of the base and index registers with an optional constant displacement and scaling factor (the allowed combinations are determined by the architecture and instruction set).
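The address computation behind a dereference can be sketched in C with byte-wise pointer arithmetic. This is a minimal illustration, not how a compiler is required to do it; `effective_address` and `load_long` are hypothetical names, and `memcpy` is used so the read does not depend on alignment.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* address = base + index * scale + displacement (all counted in bytes) */
const unsigned char *effective_address(const void *base, size_t index,
                                       size_t scale, ptrdiff_t disp)
{
    return (const unsigned char *)base + index * scale + disp;
}

/* Dereference: read the 32-bit value stored at the computed address. */
int32_t load_long(const void *base, size_t index, size_t scale, ptrdiff_t disp)
{
    int32_t value;
    memcpy(&value, effective_address(base, index, scale, disp), sizeof value);
    return value;
}
```

With `int32_t table[4] = {10, 20, 30, 40};`, the call `load_long(table, 2, 4, 0)` reads `table[2]`, which mirrors a memory operand such as `(%edx, %ecx, 4)` with the table address in `%edx` and 2 in `%ecx`.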
Here's an example assembly program that outputs `hello assembly` (note that the `r*` registers are 64-bit).
    ; assembly program (64-bit Linux, Intel syntax)
    section .text
    global _start              ; entry point for execution
    _start:
        mov rax, 1             ; write (system call)
        mov rdi, 1             ; stdout
        mov rsi, msg           ; address to output
        mov rdx, 15            ; bytes to output (14 characters + line feed)
        syscall                ; invoke write
        mov rax, 60            ; exit (system call)
        xor rdi, rdi           ; exit status 0
        syscall                ; invoke exit
    section .data
    ; db is raw bytes and line feed \n is 0x0a, or 10
    msg db "hello assembly", 10
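For comparison, here is a rough C sketch of the write step, assuming a POSIX system where `write` from `<unistd.h>` wraps the same system call. The function names `msg_length` and `say_hello` are hypothetical. The message is 14 characters plus a line feed, so 15 bytes are needed for the newline to be printed.

```c
#include <stddef.h>
#include <unistd.h>

/* Same bytes as the msg label: 14 characters plus a line feed. */
static const char msg[] = "hello assembly\n";

/* Number of bytes to output; sizeof also counts C's terminating zero byte,
   which is not part of the message, so it is subtracted. */
size_t msg_length(void) { return sizeof msg - 1; }

/* write(1, msg, 15): file descriptor 1 is stdout.
   Returns the number of bytes written, or -1 on error. */
ssize_t say_hello(void) { return write(STDOUT_FILENO, msg, msg_length()); }
```

Returning from `main` with status 0 would then play the role of the `exit` system call in the assembly version.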
The data transfer instructions are used to move data between registers, memory, and the stack.
    ; copy source to destination
    mov <source>, <destination>
    ; swap the contents of two destinations
    xchg <destination>, <destination>
    ; store source on top of stack
    push <source>
    ; load destination from top of stack and remove it
    pop <destination>
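The stack behavior of `push` and `pop` can be modeled with a small array and an index that moves down on push and up on pop. This is a toy sketch with hypothetical names (`Stack`, `stack_init`) and no overflow checks, not a model of real memory.

```c
#include <stdint.h>

#define STACK_WORDS 16

/* Toy stack of 32-bit slots; sp is the index of the current top and
   starts one past the end, so the stack grows toward lower indices. */
typedef struct {
    uint32_t slots[STACK_WORDS];
    int sp;
} Stack;

void stack_init(Stack *s) { s->sp = STACK_WORDS; }

/* push <source>: move the stack pointer down, then store the value */
void push(Stack *s, uint32_t source) { s->slots[--s->sp] = source; }

/* pop <destination>: load the value, then move the stack pointer up */
void pop(Stack *s, uint32_t *destination) { *destination = s->slots[s->sp++]; }

/* xchg: swap the contents of two locations */
void xchg(uint32_t *a, uint32_t *b)
{
    uint32_t tmp = *a;
    *a = *b;
    *b = tmp;
}
```

Pushing two values and popping them in the opposite order swaps them, because the last value pushed is the first one popped.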
The binary arithmetic instructions are used to perform arithmetic operations on binary data.
    ; addition, destination += source
    add <source>, <destination>
    ; subtraction, destination -= source
    sub <source>, <destination>
    ; increment, destination += 1
    inc <destination>
    ; decrement, destination -= 1
    dec <destination>
    ; negation, destination = -destination
    neg <destination>
The logical operator instructions are used to perform logical operations on binary data.
    ; and, destination &= source
    and <source>, <destination>
    ; or, destination |= source
    or <source>, <destination>
    ; exclusive or, destination ^= source
    xor <source>, <destination>
    ; not, destination = ~destination
    not <destination>
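A few relationships between the arithmetic and logical instructions can be checked in C on 32-bit unsigned values, which wrap around the way registers do. The function names below are hypothetical and the sketch ignores the flags these instructions also set.

```c
#include <stdint.h>

/* neg: two's complement negation, the same as not followed by inc */
uint32_t neg32(uint32_t x) { return ~x + 1u; }

/* xor of a register with itself always yields 0, a common way to
   clear a register */
uint32_t xor_self(uint32_t x) { return x ^ x; }

/* sub expressed as add of the negated source: destination - source */
uint32_t sub32(uint32_t destination, uint32_t source)
{
    return destination + neg32(source);
}
```

For instance, `neg32(1)` is `0xFFFFFFFF`, which is how -1 is represented in 32 bits.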
The unconditional branch instructions are used to change the flow of execution.
    ; jump to address
    jmp <address>
    ; push return address and call function at address
    call <address>
    ; pop return address and return
    ret
    ; call OS-defined handler represented by const
    int <const>
The conditional branch instructions are used to change the flow of execution based on the value of the flags register.
    ; jump below (unsigned), %eax < %ebx (label is location)
    cmp %ebx, %eax
    jb <label>
    ; jump not less (signed), %eax >= %ebx
    cmp %ebx, %eax
    jnl <label>
    ; jump zero, %eax = 0
    test %eax, %eax
    jz <label>
    ; jump not sign (sign flag clear), %eax >= 0
    cmp $0, %eax
    jns <label>
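How the same `cmp` can drive both signed and unsigned jumps is easier to see with the flags written out. The sketch below models the carry, zero, sign, and overflow flags for a 32-bit `cmp %ebx, %eax` (which computes `%eax - %ebx` and discards the result); the `Flags` struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { bool cf, zf, sf, of; } Flags;

/* cmp %ebx, %eax: compute eax - ebx and keep only the flags */
Flags cmp32(uint32_t eax, uint32_t ebx)
{
    uint32_t result = eax - ebx;
    Flags f;
    f.cf = eax < ebx;                                    /* unsigned borrow */
    f.zf = result == 0;                                  /* result is zero */
    f.sf = (result >> 31) != 0;                          /* top bit of result */
    f.of = (((eax ^ ebx) & (eax ^ result)) >> 31) != 0;  /* signed overflow */
    return f;
}

bool jb(Flags f)  { return f.cf; }          /* below: unsigned <   */
bool jnl(Flags f) { return f.sf == f.of; }  /* not less: signed >= */
bool jz(Flags f)  { return f.zf; }          /* zero                */
bool jns(Flags f) { return !f.sf; }         /* sign flag clear     */
```

Comparing `1` with `0xFFFFFFFF` shows the difference: `jb` is taken because 1 is below 4294967295 as unsigned, and `jnl` is also taken because 1 is not less than -1 as signed.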
Other common instructions include `lea` and `nop`.
    ; load effective address (source must be in memory), destination = &source
    lea <source>, <destination>
    ; do nothing
    nop
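Because `lea` only computes an address and never reads memory, it is often used for plain arithmetic. A minimal C sketch of that computation on 32-bit register values follows; `lea32` is a hypothetical name.

```c
#include <stdint.h>

/* lea disp(base, index, scale), destination:
   destination = base + index * scale + disp, with no memory access.
   Unsigned arithmetic wraps at 32 bits, as the register would. */
uint32_t lea32(uint32_t base, uint32_t index, uint32_t scale, int32_t disp)
{
    return base + index * scale + (uint32_t)disp;
}
```

For example, `lea32(x, x, 4, 0)` corresponds to `lea (%eax, %eax, 4), %ebx` and yields `5 * x` in a single instruction.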