Caches
1. Consider a direct-mapped, 32 B cache with 8 B blocks and a 1 block victim buffer. For the sequence of references given, identify the cache misses and label them as compulsory, capacity or conflict. The addresses are in octal (each digit represents 3 bits) and the cache is originally invalid. Give your results in a table format, like the one shown in class.
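The classification asked for here can be sanity-checked with a small simulator. Below is a minimal Python sketch, not the tool used in class, with two assumptions the problem leaves open: a hit in the victim buffer swaps the block back into its direct-mapped slot, and misses are classified against a same-capacity fully-associative LRU cache (a miss that the fully-associative cache would have hit counts as a conflict miss).

```python
from collections import OrderedDict

BLOCK, SETS = 8, 4                # 8 B blocks, 32 B / 8 B = 4 sets

def simulate(octal_addrs):
    cache = {}                    # set index -> tag (direct-mapped)
    victim = None                 # (index, tag) of the last evicted block
    fa = OrderedDict()            # same-size fully-associative LRU model
    seen = set()                  # every block ever touched (compulsory)
    results = []
    for a in octal_addrs:
        blk = int(a, 8) // BLOCK
        idx, tag = blk % SETS, blk // SETS
        # reference model: would a fully-associative cache have hit?
        fa_hit = blk in fa
        if fa_hit:
            fa.move_to_end(blk)
        else:
            if len(fa) == SETS:
                fa.popitem(last=False)
            fa[blk] = True
        # real cache: direct-mapped slot first, then the victim buffer
        if cache.get(idx) == tag:
            results.append((a, "hit"))
        elif victim == (idx, tag):
            victim, cache[idx] = (idx, cache[idx]), tag  # swap back in
            results.append((a, "hit"))
        else:
            if blk not in seen:
                kind = "compulsory"
            elif fa_hit:
                kind = "conflict"   # more associativity would have hit
            else:
                kind = "capacity"
            results.append((a, kind))
            if idx in cache:
                victim = (idx, cache[idx])
            cache[idx] = tag
        seen.add(blk)
    return results
```

For example, `simulate(["0", "40", "0", "40"])` shows the victim buffer turning a ping-pong conflict pattern between two blocks that map to the same set into hits.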
2. Program A contains 20% loads and 10% stores. Program A executes on a processor that has the following memory hierarchy:
An instruction cache with a hit latency of 1 cycle and a miss rate of 4%.
A write-thru (no write-allocate) data cache with a hit latency of 1 cycle and a miss rate of 10%. The cache has a write-buffer.
A write-back (write-allocate) L2 cache with a hit latency of 10 cycles and a miss rate of 20%. The cache doesn't have a write-buffer and at any given point in time 50% of the blocks are dirty.
A main memory with a hit latency of 50 cycles and a miss rate of 0%.
a. What is the per-instruction latency of the instruction memory hierarchy, i.e., CPII$?
b. What is the per-instruction latency of the data memory hierarchy, i.e., CPID$?
c. What would CPII$ and CPID$ be if the clock speed were doubled?
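One way to organize the arithmetic for problem 2, sketched in Python. The accounting choices here are assumptions, not part of the problem statement: the write buffer fully hides write-through store latency (stores cost one cycle), every L1 miss pays the full average L2 latency, and a dirty L2 victim adds one full memory write on top of the miss.

```python
L1_HIT, L2_HIT, MEM = 1, 10, 50                   # cycles
L1I_MISS, L1D_MISS, L2_MISS = 0.04, 0.10, 0.20    # miss rates
DIRTY = 0.5                                       # fraction of dirty L2 blocks

# Average cycles per L2 access: on an L2 miss, go to memory; half the
# time the victim is dirty and must also be written back (no write buffer).
l2_latency = L2_HIT + L2_MISS * (MEM + DIRTY * MEM)

# Instruction side: every instruction fetches exactly once.
cpi_i = L1_HIT + L1I_MISS * l2_latency

# Data side: 20% loads can miss into L2; stores (10%) are write-through,
# no-write-allocate, behind a write buffer, so assumed to cost 1 cycle.
cpi_d = 0.20 * (L1_HIT + L1D_MISS * l2_latency) + 0.10 * L1_HIT
print(cpi_i, cpi_d)
```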
3. Intel's Itanium architecture has non-temporal load instructions. A non-temporal load is the inverse of a prefetch: it loads a value into a register from the closest component in the memory hierarchy, but doesn't move the block up. What is the point of having non-temporal loads? Write code for a short loop and highlight a memory access in this loop that is a good candidate for implementation with a non-temporal load. Would a compiler be able to tell which loads should be non-temporal and which loads should be "normal"? Does a compiler need to know any cache parameters in order to do this?
4. Processor A has an L2 cache that draws 20mW per access and a direct-mapped, write-back data cache that supports set-wise dynamic resizing. The data cache uses 32B blocks. At 32KB, it has a 4% miss rate and draws 10mW per access. At 16KB, it has an 8% miss rate and draws 5mW per access. Assume that on a size-switch, half the blocks in the cache have to be relocated and that half the blocks are dirty. What is the lowest number of accesses the cache has to spend in "small" mode to offset the power cost of the flushes at the two switches? (Hint: a block invalidation consumes effectively no power, since you just have to clear the valid bit; a flush, on the other hand, consumes power at both the L1 cache and the L2 cache.)
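A sketch of one way to set up the break-even arithmetic for problem 4. Every interpretation below is an assumption the problem leaves to you: a flush is one L1 read plus one L2 write, clean blocks are invalidated for free, mW-per-access is treated as an energy unit per access, and each extra miss in small mode costs one L2 access.

```python
BLOCKS = 32 * 1024 // 32          # 1024 blocks in the full-size cache
L2, BIG, SMALL = 20, 10, 5        # mW per access, treated as energy units

# Shrink 32KB -> 16KB: half the blocks are affected, and half of those
# are dirty; dirty blocks are flushed (L1 read + L2 write), clean ones
# are invalidated for free.
cost_shrink = (BLOCKS // 2 // 2) * (BIG + L2)

# Grow 16KB -> 32KB: same accounting over the 512 blocks of the small
# cache, at the small cache's access energy.
cost_grow = (BLOCKS // 2 // 2 // 2) * (SMALL + L2)

# Per access in small mode we save BIG - SMALL on the hit, but the extra
# misses (8% vs 4%) each cost one L2 access.
saving = (BIG - SMALL) - (0.08 - 0.04) * L2

accesses = (cost_shrink + cost_grow) / saving
print(cost_shrink, cost_grow, accesses)
```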
Virtual Memory
5. Consider a 32-bit machine with 16KB pages and a maximum physical memory size of 1GB. Assume that each PTE contains a PFN only and that each PTE is an integer number of bytes (i.e., even if a PTE is 5 bits, it is stored as 1 byte).
a. How much storage is needed for a single-level page table?
b. How much storage is needed for the first-level table only of a two-level virtual page table? (Assume each 2nd level table is 16KB in size)
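The sizing in problem 5 is mechanical enough to script. The sketch below makes one assumption beyond the problem statement: a first-level entry is also just a PFN rounded up to whole bytes, since page-aligned second-level tables can be named by a PFN alone.

```python
VA_BITS, PAGE, PHYS = 32, 16 * 1024, 1 << 30

offset_bits = PAGE.bit_length() - 1                 # 14-bit page offset
vpn_bits = VA_BITS - offset_bits                    # 18-bit VPN
pfn_bits = (PHYS.bit_length() - 1) - offset_bits    # 16-bit PFN
pte_bytes = (pfn_bits + 7) // 8                     # whole bytes: 2

single_level = (1 << vpn_bits) * pte_bytes          # 2^18 PTEs * 2 B

# Two-level: each 16KB second-level table holds 16KB / 2B = 8192 PTEs,
# i.e. covers 13 bits of the VPN; the first level covers the other 5.
l2_bits = (PAGE // pte_bytes).bit_length() - 1
first_level = (1 << (vpn_bits - l2_bits)) * pte_bytes
print(single_level, first_level)
```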
Simulation/Programming
Before you start, copy the code tree from ~amir/cis501/simplescalar/simulators/ again. There have been some changes and additions.
6. Use CACTI to measure the access latency and power consumption of a cache with the following configuration:
32 KB total capacity
32 B blocks
2-way set-associative
1 read/write port and 1 bank
32-bit address/32-bit data interface
0.09 um technology
1.0 V supply voltage
CACTI minimizes latency and power consumption by trying multiple partitioned configurations. Disable these layout optimizations using the flag n. What happens to latency and power consumption when you do the following?
Double the capacity, but leave the block size and associativity unchanged?
Double the block size, but leave the capacity and associativity unchanged?
Double the associativity, but leave the capacity and block size unchanged?
Double the number of ports?
Double the supply voltage?
Double the feature size?
Do these correspond to the trends we derived in class? If they do not, do you have any theories as to why?
7. Use the SimpleScalar simulator sim-cache to simulate the data side of the memory hierarchy for gcc (SPECint2000). Assume the following parameters:
data cache: 8KB, 32B lines, direct mapped, LRU replacement, write thru, write buffer
data TLB: 4KB pages, 16 entries, fully-associative, LRU replacement
L2 cache: 128KB, 64B lines, 4-way set associative, LRU replacement, write-back, no write-buffer
instruction cache and TLB: none
Use sampling (see the web page for how) to reduce the running time of the simulations. Use 2% sampling at 10M instruction sample granularity with 4% warmup, i.e., skip 470M instructions, warmup the caches (to remove artificial compulsory misses due to sampling boundaries) for 20M instructions, and sample for 10M instructions, cyclically. To handle programs shorter than 500M instructions, make your first skip region size 20M instructions (use -insn:sample and -insn:sample:first to do this).
a. What is the data cache miss rate?
b. What is the data TLB miss rate?
c. What is the L2 cache local miss rate?
d. What is the L2 cache global miss rate (global miss rate is L2 misses / L1 accesses)?
e. What percentage of the L2 misses are dirty?
f. Assume that the hit time of the data cache and TLB is 1 cycle, the hit time of the second level cache is 10 cycles, and the hit time of memory is 100 cycles. Assume a TLB miss will always hit in the L2 cache. What is the average latency of a memory access in this system?
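The latency accounting in part f can be written as a small helper. The miss rates passed in below are placeholders, not simulation results; plug in whatever sim-cache reports. One assumed convention: a TLB miss costs exactly one extra L2 access, since the problem says it always hits there.

```python
def amat(md, mtlb, ml2, l1=1, l2=10, mem=100):
    """Average latency of one data access: every access pays the L1/TLB
    hit time; L1 misses pay the L2 hit time; L2 misses pay memory; a TLB
    miss is one extra L2 access (assumed to always hit in L2)."""
    data = l1 + md * (l2 + ml2 * mem)
    tlb = mtlb * l2
    return data + tlb

# e.g., with hypothetical rates of 5% D$, 1% TLB, 20% L2-local:
print(amat(0.05, 0.01, 0.20))
```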
g. Design and run experiments to separate the miss rates of the (first level) data cache into three numbers, one for each of the different kinds of misses. What are the compulsory, capacity, and conflict miss rates (e.g., compulsory miss rate = compulsory misses / total accesses)?
h. Alpha supports software prefetch instructions. Modify sim-cache.c to disable these prefetches (Hint: you can just prevent them from accessing the cache). What is the effect on the D$ miss rate? Which misses did software prefetching help most? Capacity, compulsory, or conflict?
i. Modify sim-cache.c to replace software prefetches with a next-line hardware prefetcher. Experiment with both an "access-based" prefetcher (i.e., one in which each access triggers a prefetch) and a "miss-based" prefetcher (i.e., one in which only misses trigger prefetches). Which is more effective? Is either more effective than software prefetching? (Hint: you do not have to change cache.c to do this, you can do it entirely in sim-cache.c).
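Before touching sim-cache.c, the difference between the two policies in part i can be seen on a toy model. The sketch below is a stand-in direct-mapped cache, not SimpleScalar code: on a purely sequential stream the access-based prefetcher stays one block ahead of the demand stream, while the miss-based one only prefetches after every other block's miss.

```python
def run(blocks, n_sets, policy):
    """Count demand misses on a toy direct-mapped cache with a
    next-line prefetcher. policy is 'access' or 'miss'."""
    cache, misses = {}, 0
    def touch(blk, demand):
        nonlocal misses
        idx = blk % n_sets
        if cache.get(idx) != blk:     # miss: fill the slot
            if demand:
                misses += 1           # prefetches don't count as misses
            cache[idx] = blk
            return False
        return True
    for b in blocks:
        hit = touch(b, demand=True)
        if policy == "access" or (policy == "miss" and not hit):
            touch(b + 1, demand=False)   # prefetch the next block
    return misses

stream = list(range(16))                 # a purely sequential stream
print(run(stream, 8, "access"), run(stream, 8, "miss"))
```

On this stream the access-based policy takes only the one cold miss, while the miss-based policy misses on every second block; whether that gap survives on gcc's real reference stream is exactly what the experiment should tell you.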
manzzz, hardly 32K; more likely 32 bytes, since this is a problem set.
I doubt I'll solve them all off the top of my head, but I'd be curious to look at the simulations used there. Could you send me the link by PM?