Assignment 1
Develop a program that takes a binary as input and determines the longest dependency chain/dataflow path through the binary. Assuming every instruction has a delay of 1 cycle before it can forward its output (via registers or memory) to subsequent instructions, how many cycles does it take to execute every instruction in a binary? You don't have to model any real architectural bottlenecks, and can assume perfect caching and branch prediction.
For Reference
Butler, et al., "Single Instruction Stream Parallelism is Greater than Two," ISCA 1991.
David Wall, "Limits of Instruction-Level Parallelism," ASPLOS 1991.
Guidelines
- Assume perfect address/branch prediction.
- Assume no cache miss penalty/stalls.
- Assume infinite Load/Store queue, i.e. ignore memory consistency(ordering of memory operations across different addresses).
- Binaries used for testing will use one syscall - exit(). For simplicity, assume exit depends on all other instructions.
- Programs will be relatively short, e.g. 50K instructions, so don't worry much about optimizing.
- The “YOUR CODE HERE” markings are suggestions. Feel free to modify any of the existing code.
Installation
Download the newest version of pin at: http://www.pintool.org/downloads.html
export PIN=<PIN_DIR_PATH>
Build the pintool with the included makefile
make
If it complains about not being able to attach to process, it's likely because the OS disables inter-process attachment. You can try:
echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope
You'll need the libdwarf.so packaged with pin in your LD_LIBRARY_PATH:
export LD_LIBRARY_PATH=$PIN/intel64/runtime
Instrumenting a program
Use the included Makefile to compile sample programs. Take a look at the multiple 'hello world' programs. How many instructions do you think they'll use?
To instrument a binary with the pintool:
$PIN/intel64/bin/pinbin -t ./dataflow.so -- <PROGRAM>
or the provided run script does the same:
./run.sh <PROGRAM>
Samples
hello64.s input
.section .data helloWorld: .ascii "Hello, World!", "\n" .section .text .globl _start _start: mov $0x2000004, %rax mov $1, %rdi mov $helloWorld, %rsi mov $14, %rdx syscall mov $60, %rax mov $0, %rdi syscall
Pintool output
> ./run.sh sampleprograms/hello64 writes to register 19 gr gr64 0x4000b0 mov rax, 0x2000004 writes to register 12 gr gr64 0x4000b7 mov rdi, 0x1 writes to register 13 gr gr64 0x4000be mov rsi, 0x6000e0 writes to register 17 gr gr64 0x4000c5 mov rdx, 0xe writes to register 35 writes to register 34 0x4000cc syscall writes to register 19 gr gr64 0x4000ce mov rax, 0x3c writes to register 12 gr gr64 0x4000d5 mov rdi, 0x0 writes to register 35 writes to register 34 0x4000dc syscall Application stopped. Shutting down.
loop-slow.s input
This application loops over a string and in each iteration modifies a single character. The loop isn't unrolled.
.section .data helloWorld: .ascii "Hello, World!!" .section .text .globl _start _start: movl $helloWorld, %edi add $14, %edi loop: movl -1(%edi), %eax add $1, %eax movl %eax, -1(%edi) sub $1, %edi cmp $helloWorld, %edi jne loop print: ## Uncomment for debugging # movl $4, %eax # movl $1, %ebx # movl $helloWorld, %ecx # movl $14, %edx # int $0x80 movl $1, %eax movl $0, %ebx int $0x80
Your code should output 19. Why is this?
- The first 2 instructions take 2 cycles
- The loop body has several operations that depend on the subtract from the previous iteration:
- The N(number of iterations) subtracts cyclically feed one another, creating a dependency chain of N steps
- The last subtract has a dependency chain of two - the compare and not-taken conditional jump
- The second to last subtract has a dependency chain of three - the move→add→move chain
- The last two moves - outside the loop - can be issued early, and are thus off the critical path.
- The final syscall (int $0×80) is dependent on all instructions, and thus the loop dataflow.
This means the dataflow depth is 2 + N-1(subtracts) + 3(mov→add→mov) + 1(syscall) = 2 + 13 + 3 + 1 = 19. This is the number your program should print for this loop program as input. You could also get this as 2 + N(subtracts) + 2(cmp→branch) + 1(syscall) since the sub(N-1)→mov→add→move is in parallel with sub(N-1)→sub(N)→cmp→jmp.
loop.s input
This input has a loop which has been unrolled once, and iterates 6 times. Each iteration is completely independent and uses only 3 instructions. How many cycles does this take?
.section .data helloWorld: .ascii "Hello, World!!" .section .text .globl _start _start: movl $helloWorld, %edi movl $13, %edx loop: movl 0(%edi,%edx,1), %eax add $1, %eax movl %eax, 0(%edi,%edx,1) movl -1(%edi,%edx,1), %eax add $1, %eax movl %eax, -1(%edi,%edx,1) sub $2, %edx jns loop print: ## Uncomment for debugging # movl $4, %eax # movl $1, %ebx # movl $helloWorld, %ecx # movl $14, %edx # int $0x80 movl $1, %eax movl $0, %ebx int $0x80