Laboratory manual for TSEA44
Olle Seger, Per Karlström, Andreas Ehliar
Computer Engineering
Department of Electrical Engineering
Linköping University, S-581 83 Linköping, Sweden
Email: olle.seger@liu.se, perk@isy.liu.se, ehliar@isy.liu.
Contents

1 The system
  1.1 Introduction
  1.2 Hardware
    1.2.1 Virtex-II Development board
    1.2.2 Communication/Memory Module
    1.2.3 Virtex-II 4000 FPGA
  1.3 Open RISC
    1.3.1 Top Design
    1.3.2 Structure of the Verilog code
    1.3.3 OR1200 CPU
    1.3.4 The Wishbone Interconnect Bus
    1.3.5 Memory Controller
    1.3.
  3.3
4 Lab task 2 - Design a JPEG accelerator
  4.1 The lab system
  4.2 Proposed architecture
    4.2.
6 Lab task 4 - Custom Instructions
  6.1 Introduction
    6.1.1 Huffman Coding
    6.1.2 The Problem
  6.2 Adding a New Instruction
    6.2.1 Making the Processor Understand
    6.2.2 Adding Special Purpose Registers
    6.2.3 Adding the Required Hardware
  6.3 Proposed Architecture
    6.3.1 Control Unit
    6.3.2 Data Path
    6.3.3 Store Unit
  6.
Chapter 1 The system

1.1 Introduction

This text is intended as a laboratory compendium for the course TSEA44 Computer Hardware - a System On a Chip. We begin with a presentation of the hardware and software used for the laboratory exercises. In case you wonder, the name dafk, seen in many places in this course, comes from DAtorteknik FortsättningsKurs, which is the Swedish name of the first version of this course. Roughly translated, it means advanced course in computer technology.
1.2 Hardware

1.2.1 Virtex-II Development board

In the course we will use a development FPGA board from Avnet Corporation. A block diagram of this board is shown in Figure 1.1. More details are given in the User's Guide [1].

[Figure 1.1: Block diagram of the Avnet main board.]
[Figure 1.2: Block diagram of the Avnet communication and memory module. Components shown: RJ45 with magnetics; IrDA transceiver; USB; PCMCIA; 16 MBytes FLASH (x32), Micron MT28F640J3A; USB 2.0 XCVR, Cypress CY7C68013; 1 MByte SRAM (x32), Cypress CY7C1041V33; 10/100/1000 Ethernet PHY, National DP83861; buffers; 64 MBytes SDRAM (x32), Micron MT48LC16M16A2; AvBus connectors (x2).]

1.2.
Virtex-II devices (1 CLB = 4 slices = max 128 bits). The CLB columns are Array, Slices and Max Distributed RAM; the SelectRAM columns are 18 Kbit Blocks and Max RAM.

Device    System  Array        Slices  Max Distr.   Multiplier  18 Kbit  Max RAM  DCMs  Max I/O
          Gates   Row x Col.           RAM (Kbits)  Blocks      Blocks   (Kbits)        Pads(1)
XC2V40    40K     8 x 8        256     8            4           4        72       4     88
XC2V80    80K     16 x 8       512     16           8           8        144      4     120
XC2V250   250K    24 x 16      1,536   48           24          24       432      8     200
XC2V500   500K    32 x 24      3,072   96           32          32       576      8     264
XC2V1000  1M      40 x 32      5,120   160          40          40       720      8     432
XC2V1500  1.
[Figure 1.3: Virtex-II architectural overview. Blocks shown: DCM, IOB, global clock mux, configurable logic (CLB), programmable I/Os, block SelectRAM, multiplier.]

1.3 Open RISC

1.3.1 Top Design

The computer used in this lab course is built from Verilog modules. Some can be downloaded for free from Open Cores (www.opencores.org) and some were designed by us. This section describes the main system defined in the file dafk.sv, which you will use in lab tasks 2–4. The computer in Figure 1.
[Figure 1.4: a) Virtex-II slice configuration b) Detail of slice (top half).]

1.3.2 Structure of the Verilog code

The structure of the Verilog code closely resembles the block diagram shown in Figure 1.5. Components outside the FPGA are simulated.
[Figure 1.5: An Open RISC computer. Labels in the figure: SRAM 1 MB, SDRAM 64 MB, FLASH 16 MB, Mem Ctrl, OR1200 CPU, 4kBRAM, 4kBROM, UART, JTAG Debug, Parport (LED, DIP switch), PS2 Keyboard, Accelerator, VGA, Leela, Ether Ctrl, PHY and Hub, connected through the Wishbone bus with numbered master and slave ports.]

1.3.3 OR1200 CPU

A block diagram of the OR1200 CPU is shown in Figure 1.6. More information about the CPU can be found in [6, 7].

[Figure 1.6: Block diagram of the OR1200 CPU.]
slave. Furthermore, tristate buffers are not used; instead there are two data buses, one in each direction. The address bus and the data buses are 32 bits wide.

[Figure 1.7: The Wishbone interconnect bus. In this example Master 0 is addressing Slave 1. Master 0 has won the arbitration. Blocks shown: masters, slaves, arbiter (gnt signals) and address decoder.]
The Ethernet IP Core is capable of operating at 10 or 100 Mbps for Ethernet and Fast Ethernet applications. An external PHY is needed for a complete Ethernet solution. In short, the Ethernet controller works as follows. There are 64 transmit buffers and 64 receive buffers. These buffers are typically located in the SRAM.
Command  Explanation
d        display memory content
m        modify memory content
g        go (execute)
l        load Intel hex file
u        boot uClinux (copy from FLASH)

Table 1.2: Some useful commands in the monitor.

A simple program. In this section we will demonstrate how to compile, load and run a C program in the monitor environment. We will use the program in Listing 1.1 as an example.

Listing 1.1: simpleprog

    #include "common.
Listing 1.2: Makefile for simpleprog (Listing 1.1)

    # The name of the program we want to compile
    PROGRAM = simpleprog
    # The directory containing the openrisc support dir
    LIBDIR = ../lib
    INCLUDEDIR = ..
Listing 1.3: Link script for simpleprog (Listing 1.1)

    MEMORY
    {
        vectors : ORIGIN = 0x00000000, LENGTH = 0x00002000
        sdram   : ORIGIN = 0x00002000, LENGTH = 0x03ffe000
    }

    SECTIONS
    {
        .vectors : { *(.vectors) } > vectors
        .text : { *(.text) } > sdram
        .rodata ALIGN(4) : { *(.rodata) } > sdram
        .rodata.str1.1 ALIGN(4) : { *(.rodata.str1.1) } > sdram
        .data ALIGN(4) : { *(.data) } > sdram
        .bss ALIGN(4) : { *(.bss) } > sdram
    }

1.4.
The simulator can also be started in an interactive mode by

    or32-uclinux-sim -f sim.cfg -i prog

In Figure 1.8 we show as an example the simulation of a simple monitor in an xterm window.

[Figure 1.8: Simulation of the bender monitor.]

The command help lists available commands, for instance t (trace):

    >t
    00000100: : 00000000 l.j 0x0 (executed) [time 40ns, #1]
    00000104: : 00000000 l.
The command help will list the built-in shell commands. An important file is /etc/rc, the start-up file, which is shown in Listing 1.4. If you want to change the start-up behavior of µClinux, this is the file to change. In a running µClinux this file resides in a non-writable file system, so a new system must be recompiled on a host computer, downloaded over the serial port and flashed to the flash memory. It is very unlikely that you will have to do this in the course of this lab series.

Listing 1.
Listing 1.5: Program showing contents of a special purpose register.

    #include
    #include
    #include
    #include <sys/types.h>
    #include <sys/stat.h>

    int main(int argc, char *argv[])
    {
        unsigned long val, addr;

        if (argc == 2) {
            addr = strtoul(argv[1], 0, 0);
            /* Read SPR */
            asm("l.
Chapter 2 Lab task 0 - Build a UART in Verilog

2.1 Introduction

In this introductory lab exercise you will learn the HDL Verilog. We require that you are familiar with another HDL, typically VHDL. In our opinion hardware design is done by drawing hardware diagrams, so that the programming in Verilog is just a final simple translation step! You will also get (re)acquainted with the tools used in this course, ModelSim and make (or Xilinx Project Navigator).

2.2 A simple UART

2.2.
2.2.2 The hardware

The system clock is running at 40 MHz. You will need a reset-signal and a send-signal, see Figure 2.2. Both these signals are active-high.

[Figure 2.2: The UART. Ports shown: clk_i, rst_i (SW1), send_i (SW2), switch_i, rx_i, tx_o, led_o.]

Your task is twofold:
• send an ASCII-coded character from the DIP switch to the PC by pressing the switch SW2, see Figure 2.2.
Listing 2.1: Test bench for the UART.

    `timescale 1ns / 10ps
    module lab0_tb();
      reg clk_i;
      reg rst_i;
      reg send_i;
      reg [7:0] switch_i;
      wire [7:0] led_o;
      wire jumper;

      // Instantiate a UART
      lab0 uart(.clk_i(clk_i), .rst_i(rst_i), .rx_i(jumper), .tx_o(jumper),
                .led_o(led_o), .switch_i(switch_i), .send_i(send_i));

      always #12.
2.3 Exercises

Preparation task 1 Draw a HW diagram of the UART. Use simple components like counters, registers, shift registers, and state machines.

Laboration task 1
a) Translate your HW diagram into Verilog code.
b) Simulate your design in ModelSim.
c) Synthesize your design, program the FPGA and test run your design.

2.3.1 Commands

To start the simulator, use the command make sim_lab0. To generate a bitfile to program the FPGA with, use make lab0.
Chapter 3 Lab task 1 - Interfacing to the Wishbone bus

3.1 Introduction

In this lab exercise you will get acquainted with the OR1200 RISC processor and particularly the Wishbone bus. You will do this by designing two modules, a UART and a performance counter module, and interfacing them to the Wishbone bus.

[Figure 3.1: The computer. Blocks shown in lab1.sv: OR1200, Wishbone bus (WB), boot monitor in ROM, RAM, UART (stx_pad_o, srx_pad_i), parallel port (out_pad_o, in_pad_i) and performance counters, with the inputs clk_i and rst_i.]
4. simulate the computer running the benchmark program.
5. design a module containing hardware performance counters (perf_top.sv in the lab skeleton).

3.2 Some Basic Facts on the Wishbone Bus

The Wishbone bus is intended for implementation in FPGAs or ASICs. Typical for such a bus is that multiplexers are used instead of tristate buffers. Two data buses are used, one for each direction, see Figure 3.2a.
3. The master deasserts the wb.stb, wb.cyc and wb.we-signals.
4. The slave deasserts the wb.ack-signal.

For the read cycle, see Figure 3.2c, we have:
1. The master places the address on the bus wb.adr, asserts the wb.stb-signal and the wb.cyc-signal, and deasserts the wb.we-signal.
2. The slave, when ready, decodes the address bus, places the data on the data bus wb.dat_i and asserts the wb.ack-signal.
3. The master deasserts the wb.stb and wb.cyc-signals.
4.
In the return path the addressed slave's s-bus is connected to all the masters. This is handled by the block DEC. The ack-signal is, however, only asserted at the master that won the arbitration.

3.3 A Simple Computer

3.3.1 General

For the lab you will have to download tsea44.tgz if you haven't done so already. Uncompress the archive to your home directory. Inspect the directory hw and you will find:
• the file lab1/lab1_uart_top.
[Figure 3.4: A sketch of the Wishbone interface for the UART. The receive path shows a shift register (shift_rx), the rx register driving wb.dat_i[31:24], and the rx_full flip-flop; the transmit path shows the tx register loaded from wb.dat_o[31:24] (load_tx), a shift register (shift_tx), and the tx_empty flip-flop. Each path has a control unit, and the register accesses are decoded from wb.stb, wb.we, wb.sel[3] and wb.adr[2].]
[Figure 3.5: a) Address map for the UART connected to an 8 bit bus b) Address map for the UART connected to a 32 bit bus. The sel-signals are used to address individual bytes. The maps show the rx/tx register and the tx_empty and rx_full status bits at base address 9000_0000.]

definition of the Wishbone SystemVerilog interface can be found in the appendix section B.5.

Listing 3.1: Lab skeleton lab1_uart_top.
Check mon2.c to see what the monitor does at startup so that you can verify that the hardware does the correct thing.

3.3.4 Test Your Design

In Figure 3.6a we show a test bench for the computer. The only signals that the test bench has to drive in this case are the clk_i and rst_i signals. We check the behavior of the computer by listening to the tx signal from the UART. Part of a testbench has already been written for you in dafk_tb/lab1_tb.v.
3.4 A Benchmark Program

3.4.1 JPEG Compression

We will use the first part, DCT, of the JPEG compression algorithm to test our computer. This section is inspired by [3]. We begin with a short discussion of how DCT works.

3.4.2 Integer DCT

The two dimensional discrete cosine transform (DCT) for an 8 × 8 array a[x, y] is defined as

$$A[u,v] = c[u]\,c[v] \cdot \sum_{x=0}^{7} \sum_{y=0}^{7} a[x,y] \cos\left(\frac{\pi u}{8}\left(x + \frac{1}{2}\right)\right) \cos\left(\frac{\pi v}{8}\left(y + \frac{1}{2}\right)\right)$$ (3.
So far we have presented three ways of computing the 2-D DCT. We compare the computational complexity of the algorithms:

Algorithm           MUL    ADD
Eq (3.1)            4096   4032
Eq (3.2)            1024   892
Loeffler original   224    416

The post multiplication with c[u] has been left out of the table.

The OR1200 CPU has no floating point arithmetic, so the sin/cosine factors and $\sqrt{2}$ in Figure 3.7 must be mapped to integers. We have chosen to multiply with $2^{13}$ and round to the nearest integer.
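The mapping of real-valued factors to integers can be sketched in C as follows. This is only an illustration of scaling by 2^13 with rounding; the names CONST_BITS, to_fixed and fixed_mul are our own and are not taken from dct_sw.c.

```c
#include <assert.h>
#include <stdint.h>

/* Number of fractional bits; 13 corresponds to the scale factor 2^13.
 * The macro name is an assumption made for this sketch. */
#define CONST_BITS 13

/* Map a real-valued factor to an integer: multiply by 2^13 and round
 * to the nearest integer (half away from zero). */
static int32_t to_fixed(double x)
{
    assert(x > -4.0 && x < 4.0);  /* DCT factors are small numbers */
    double scaled = x * (double)(1 << CONST_BITS);
    return (int32_t)(scaled + (scaled >= 0.0 ? 0.5 : -0.5));
}

/* Multiply a sample by a scaled factor and remove the scale again.
 * Half an LSB is added before the shift so that the result is rounded.
 * Note: right-shifting a negative value is implementation-defined in C;
 * this sketch assumes the usual arithmetic shift. */
static int32_t fixed_mul(int32_t sample, int32_t factor)
{
    int64_t product = (int64_t)sample * (int64_t)factor;
    return (int32_t)((product + (1 << (CONST_BITS - 1))) >> CONST_BITS);
}
```

With this scale the factor sqrt(2) becomes 11585, so a multiplication by sqrt(2) turns into an integer multiplication followed by a 13 step shift.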
Preparation task 5 Why do we go through all the trouble of inserting the module in Figure 3.8? Why is it so bad to have 2 multipliers in series?

3.4.3 The Test Program dct_sw

For this lab you will get a test program dct_sw.c, written by us. It is a straightforward implementation of Loeffler's algorithm and computes the 2-D DCT of an 8 × 8 image.
3. contain four 32 bit counters that can be read and written on the addresses 0x9900_0000 to 0x9900_000c.
4. The counter on address 0x9900_0000 shall count the number of clock cycles that m0.cyc and m0.stb are both asserted. The counter on address 0x9900_0004 shall count the number of clock cycles that m0.ack is asserted.
5. The counter on address 0x9900_0008 shall count the number of clock cycles that m1.cyc and m1.stb are both asserted.
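From the software side, the counter addresses listed above could be handled as in the following C sketch. The helper names perf_addr, perf_delta and perf_read are hypothetical and are not part of the lab skeleton; perf_read only makes sense on the target, since it dereferences the physical address.

```c
#include <assert.h>
#include <stdint.h>

#define PERF_BASE 0x99000000u  /* first counter address from the specification */
#define PERF_NUM  4u           /* four 32 bit counters */

/* Byte address of counter idx: 0x9900_0000, _0004, _0008, _000c. */
static uint32_t perf_addr(unsigned idx)
{
    assert(idx < PERF_NUM);
    return PERF_BASE + 4u * idx;
}

/* Number of counts between two readings of a free-running counter.
 * Unsigned subtraction stays correct across one wrap-around. */
static uint32_t perf_delta(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Read a counter through a volatile pointer (target only, hypothetical
 * helper; do not call this on a host PC). */
static inline uint32_t perf_read(unsigned idx)
{
    return *(volatile uint32_t *)(uintptr_t)perf_addr(idx);
}
```

A measurement would then read a counter before and after the code of interest and pass the two values to perf_delta.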
3.6 Useful Commands

We have prepared a makefile based build system that is responsible for both building the monitor firmware and synthesizing the hardware from the RTL source code. You can use it on the Linux computers in Muxen 1. The following targets will be useful for you:
• make lab1 Creates a bit file of the computer in this lab task.
• make sim_lab1 Launches Modelsim on the "lab1" system.
• make sim_uart Launches Modelsim on your UART.
• synthdir/foo.syr: Synthesis report
• synthdir/foo_map.mrp: Map report
• synthdir/foo.par: Place and Route report
• synthdir/foo.twr: Timing analyzer report
(Where foo is the name of the top level file you compiled, as in dafk or lab1).

3.7 How to get Started Writing/Executing C Programs

A good starting point is the program simpleprog situated in the directory firmware. It can be compiled with make in Linux. The executable file is simpleprog.
macros shown in Listing 3.5 to access memory mapped I/O. These macros are defined in both the monitor (mon2.h) and in jpeglib.h, but if you write a small test program you might have to include them in your own source code as well. Using these macros, the program from Listing 3.3 would look like what is shown in Listing 3.6.

Listing 3.5: Recommended macros for memory mapped I/O access.
Chapter 4 Lab task 2 - Design a JPEG accelerator

4.1 The lab system

In this lab task you will learn how to build a hardware accelerator for the JPEG image compression algorithm. In this lab you will use the build target dafk.bit.
[Figure 4.1: Proposed architecture for the 2-D DCT-accelerator. csr is a Control/Status register. Not all wires are shown. Blocks shown: in block RAM with counter, DCT, transpose memory (t_rd and t_wr, 8x12 = 96 bits each), Q2, out block RAM with counter (8x16 = 128 bits), DCT2 control unit, WB ctrl and csr.]

2. A row of the image is read from the in RAM in 2 clock cycles.
to use a block RAM is, in our opinion, to instantiate a library primitive. The code in Listing 4.1 instantiates a block RAM shown in Figure 4.2. SSR is a set/reset signal that only affects the output latches, not the RAM memory cells. DIP and DOP can be used for additional data such as parity bits, but we do not use them in this lab.
    wire [7:0] data_i, data_o;
    wire [3:0] addr_a, addr_b;

    // 1 combinatorial read port
    assign data_o = mem[addr_a];

    // 1 synchronous write port
    always @(posedge clk) begin
      if (we)
        mem[addr_b] <= data_i;
    end

4.2.3 The transpose memory

The transpose memory shall:
• be designed as a Verilog module, with the interface shown in Listing 4.3.
• hold an 8 × 8 × 12 bit image.
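Before writing the Verilog module it can help to have a small software reference model of what a transpose memory does: values are written one row at a time and read back one column at a time. The C sketch below is such a model. The type and function names are hypothetical, and treating the 12 bit values as signed is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define TM_SIZE 8  /* the memory holds an 8 x 8 image */

/* Reference model of the storage; each 12 bit value is kept in an int16_t. */
typedef struct {
    int16_t cell[TM_SIZE][TM_SIZE];
} transpose_mem_t;

/* Write one complete row (8 values, i.e. 8 x 12 = 96 bits in hardware). */
static void tm_write_row(transpose_mem_t *m, unsigned row,
                         const int16_t in[TM_SIZE])
{
    assert(row < TM_SIZE);
    for (unsigned c = 0; c < TM_SIZE; c++) {
        /* Assumption: values are signed and must fit in 12 bits. */
        assert(in[c] >= -2048 && in[c] <= 2047);
        m->cell[row][c] = in[c];
    }
}

/* Read one complete column, which is a row of the transposed image. */
static void tm_read_col(const transpose_mem_t *m, unsigned col,
                        int16_t out[TM_SIZE])
{
    assert(col < TM_SIZE);
    for (unsigned r = 0; r < TM_SIZE; r++)
        out[r] = m->cell[r][col];
}
```

Such a model can also be used to generate expected values for a testbench.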
Laboration task 4 Design and implement the DCT accelerator with a WB interface.

Laboration task 5 Write a testbench for your DCT accelerator.

4.3 Introduction to µClinux

In the remaining labs we are going to run µClinux on the OpenRISC system. The most important difference between µClinux and Linux is that µClinux works without an MMU. This means that there is no memory protection for programs running on µClinux.
servers to be left after you log out since this would prohibit other lab groups from starting a TFTP server.)

4.3.3 Downloading applications via TFTP

In order to download and run the hello application we must use tftp. First, hello has to be copied to the tftp directory in your home directory. After that you can write the following commands in µClinux:

    /> cd /mnt
    /mnt> tftp 192.168.0.62
    tftp> get hello
    Received 28664 bytes in 0.
• jpegtest.c contains the test program we will use.
• testbild.raw is a grayscale image in raw format.
• perfctr.c, perfctr.h This is the place to look if you want to add a new performance counter.
• jcdctmgr.c Contains the main computation loop and definitions of static variables. Also contains the forward_DCT function, which calls the 2D DCT kernel and does the quantization.
• jdct.c Contains the 2D DCT kernel.
• jchuff.c Contains the Huffman and RLE encoder.
• webcam.
4.4.2 The jpegtest application

This is the main test application we are going to use in the lab series. It first reads a raw picture from a file named testbild.raw, encodes it to JPEG format and writes it to an output file which you specify on the command line. It also outputs performance data on how many clock cycles some important functions consumed.
[Figure 4.3: Timestamps for JPEG compression pipeline. Each color coded patch represents the processing of an 8 × 8-block. Colorcodes: DMA+DCT red, readout green, quantization blue and Huffman encoding yellow. The horizontal axis shows clock cycles (× 10^4); the CPU row contains read(N), Q(N) and huff(N), and the ACC row contains dmadct(N).]

4.6 Quantization

4.6.
shifting 17 steps:

$$Y[u,v] = \operatorname{round}\left((A[u,v] - 8192 \cdot \delta[u,v]) \cdot R[u,v] \cdot 2^{-17}\right) =
\begin{pmatrix}
-96 & -3 & 0 & 0 & 0 & 0 & 0 & 0 \\
-24 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
-2 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}, \qquad (4.3)$$

which left only four non-zero coefficients.

4.6.2 Design of a hardware accelerator for quantization

The following piece of code in jcdctmgr.
4.7 Tips and tricks

In this section we have collected some notes that you might find useful.
• If you want to simulate the 2D DCT accelerator together with the rest of the system, you will have to modify the monitor to run your test code right after the system has started. See the directions in section 3.3.3 on how to modify the monitor.
• In order to improve the performance of the code you can remove some of the performance counters.
Chapter 5 Lab task 3

5.1 DMA in the DCT Accelerator

In this lab we will improve the DCT accelerator by using DMA. You can find a specification of how to do DMA on Wishbone in appendix B.

5.1.1 Proposed architecture

In this lab we will modify the DCT accelerator created in lab 2 to use DMA. The idea is that the DMA module will feed the DCT accelerator with data from the system memory, while the CPU is still responsible for reading the data from the DCT accelerator.
[Figure 5.1: The proposed state diagram for the DMA accelerator. States shown: IDLE, GETBLOCK, WAITREADY, WAITREADY_LAST, RELEASEBUS.]

Figure 5.1 shows a state diagram which is suitable for the DMA accelerator. The states are described below:
• IDLE: The DMA module is not doing anything.
• GETBLOCK: The DMA module is fetching an 8x8 block. Once the block is fetched we go to the WAITREADY state and start the DCT transform.
• dma_bram_data: The data we want to write to inmem in jpeg_top.
• dma_bram_addr: The address we want to write the data to.
• dma_bram_we: The write enable signal for inmem.
• dma_start_dct: When this signal is high for one clock cycle, the DMA accelerator will start to transform the current block in inmem.
• dct_busy: This input signal is high from one clock cycle after dma_start_dct has been activated to the moment all results have been written to outmem.
It is important to note that the DCT accelerator should still work as before if DMA is not in use. But you are (of course) allowed to assume that the accelerator will be used either in DMA mode or in regular mode at any one time. This means that the results are allowed to be undefined when both the DMA module and a Wishbone master try to write to inmem in jpeg_top at the same time. You should mainly modify jpeg_dma.sv in this lab but you will also have to modify jpeg_top.
5.2 What to Include in the Lab Report

The lab report should contain all source code that you have written. (The source code should of course be commented.) We would also like you to include a block diagram of your hardware. If you have written any FSM, you should include a state diagram of the FSM.
[Figure residue: block diagram in which a DMA address generator module with a Wishbone master port (wbm.adr, wbm.stb, wbm.cyc, wbm.dat_i, wbm.ack) is added to the DCT accelerator and its Wishbone slave port (wbs.adr, wbs.stb, wbs.ack, wbs.dat_o, wbs.dat_i). The signals dct_busy, dma_start_dct, dma_bram_we, dma_bram_data and dma_bram_addr connect the DMA module to the accelerator.]

Figure 5.
Chapter 6 Lab task 4 - Custom Instructions

6.1 Introduction

In this lab task you will learn how to design and integrate a new instruction into the processor. This part targets the bit alignment problem that arises when writing bit streams to memory. A bit stream is here defined as a stream of bits with no particular memory alignment. In accordance with jpegfiles, the bit pattern to write to the stream will be called the code and the number of bits to write will be called the size of the code.
suggests four Huffman tables: two for the luminance (one for the DC values and one for the AC values) and two for the chrominance (one for the DC values and one for the AC values). The size of the Huffman codes in a JPEG stream varies from 1 to 16 bits.

6.1.2 The Problem

Although the lookup and replace is an easy operation, the problem arises when we want to write the bit stream to the memory.
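To make the alignment problem concrete, the following C sketch packs codes of 1 to 16 bits into a byte buffer in software. It is a generic illustration, not the code from jchuff.c; the type bitwriter_t and the functions bw_init and bw_put are hypothetical names. The sketch writes the most significant bit first and leaves out the byte stuffing that a real JPEG stream needs after a 0xFF byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t *out;    /* destination buffer */
    size_t   cap;    /* capacity of the buffer in bytes */
    size_t   len;    /* number of complete bytes written so far */
    uint32_t acc;    /* pending bits, right-aligned */
    unsigned nbits;  /* number of pending bits (0..7 between calls) */
} bitwriter_t;

static void bw_init(bitwriter_t *bw, uint8_t *buf, size_t cap)
{
    bw->out = buf;
    bw->cap = cap;
    bw->len = 0;
    bw->acc = 0;
    bw->nbits = 0;
}

/* Append the `size` low bits of `code` to the stream, MSB first. */
static void bw_put(bitwriter_t *bw, uint32_t code, unsigned size)
{
    assert(size >= 1 && size <= 16);
    code &= (1u << size) - 1u;            /* keep only the valid bits */
    bw->acc = (bw->acc << size) | code;   /* at most 7 + 16 = 23 bits */
    bw->nbits += size;

    /* Emit every complete byte; the remainder stays in the accumulator. */
    while (bw->nbits >= 8) {
        assert(bw->len < bw->cap);
        bw->out[bw->len++] = (uint8_t)(bw->acc >> (bw->nbits - 8));
        bw->nbits -= 8;
    }
    bw->acc &= (1u << bw->nbits) - 1u;    /* drop the bits already written */
}
```

Every call needs a shift, a merge and a loop with a data dependent number of iterations, which is the kind of work that is cheap in hardware but costly as a sequence of ordinary instructions.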
tions in software. We will add a new group of special purpose registers to the processor, so we must make some changes in the or1200_sprs module. In the file or1200_sprs.sv you will find a `ifdef OR1200_SBIT_IMPL preprocessor directive. Add the missing code inside this directive. Once this is done you should be able to use the special purpose registers, since they have already been implemented for you.

6.2.
6.3.2 Data Path

The data path unit should be defined in the or1200_vlx_dp module; a stub can be found in the file or1200_vlx_dp.sv. The data path unit should contain all logic needed for the data manipulation. This unit will be responsible for aligning and merging the incoming bits with the bits previously stored. Different solutions are possible, taking either one or several clock cycles. This is basically a trade-off between speed and hardware.
purpose register space in group 24 (the group is selected with bits 15 - 11 of a special purpose register address). The addresses of the three special purpose registers in the vlx unit are shown in Table 6.2. If you feel like adding more special purpose registers you are free to do so. You are also free to remap the special purpose registers, e.g. if you want to use a 64 bit buffer instead of the intended 32 bit version.
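The group selection described above determines the address arithmetic: with the group number in bits 15 - 11, group 24 starts at address 24 · 2^11 = 0xC000. The C sketch below shows this calculation. The helper names are hypothetical, and the register offsets used in the usage example are placeholders only, not the actual assignment of the vlx registers.

```c
#include <assert.h>
#include <stdint.h>

#define SPR_GROUP_SHIFT 11u  /* the group number occupies bits 15..11 */
#define SPR_GROUP_VLX   24u  /* group number given in the text */

/* Compose a special purpose register address from a group number and a
 * register index within the group (hypothetical helper). */
static uint16_t spr_addr(unsigned group, unsigned reg)
{
    assert(group < 32u);                    /* 5 bit group field */
    assert(reg < (1u << SPR_GROUP_SHIFT));  /* 11 bit register index */
    return (uint16_t)((group << SPR_GROUP_SHIFT) | reg);
}

/* Extract the group number from a special purpose register address. */
static unsigned spr_group(uint16_t addr)
{
    return (unsigned)addr >> SPR_GROUP_SHIFT;
}
```

For example, spr_addr(SPR_GROUP_VLX, 0) gives 0xC000, and consecutive registers in the group follow at 0xC001, 0xC002 and so on.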
        : output operands
        : input operands
        : list of clobbered registers
    );

The asm volatile keyword instructs the compiler to insert exactly this assembler instruction at exactly this position in the compiled code. The first part of the construct is an assembler template string, e.g. "l.sd 0x0(%0),%1".
6.5.2 Integration into jpegfiles

When you find that the hardware is working and you have written some test programs to verify that the processor can execute your instruction, you are ready for the next step: getting the instruction to work for you in jpegfiles. You need to add code for three phases of operation, as described below. The only file you need to modify is jchuff.c; look for the #ifdef HW_INST blocks.
Code    Name  Explanation                                                       Stands Alone
0xFFC0  SOF0  Start of frame for baseline coded pictures.                       No
0xFFE0  APP0  Application specific data used by JFIF.                           No
0xFFDA  SOS   Start of scan. The image data starts after this marker segment.   No
0xFFDn  RSTn  Restart marker n (n = 0, 1...7), restart decoding after this      Yes
              marker.
0xFFD9  EOI   End of image, data after this marker is ignored.                  Yes

Table 6.3: Important JFIF markers

6.
• How to read from a special purpose register with address 0xC000 in C-code: asm volatile("l.
• What was bad?
• What can we improve for the next year?
• Do you have any other ideas for this course?
• Did you feel that you learned anything of value?
• Any other comments you may have.
• A rough estimation of time spent on the lab tasks.

And of course, the normal parts of a lab report such as a table of contents, an introduction, a conclusion, etc.
Bibliography

[1] Xilinx Virtex-II Development Kit, www.avnet.com
[2] Communications/Memory Module User's Guide, www.avnet.com
[3] Per Karlström, Mikael Andersson: Parallel JPEG Processing with a Hardware Accelerated DSP Processor, LITH-ISY-EX-3548-20004
[4] Loeffler, Ligtenberg and Moschytz: Practical Fast 1-D DCT Algorithms with 11 Multiplications, ICASSP-89, pp. 988-991
[5] Virtex-II Platform FPGAs: Complete Data Sheet, www.xilinx.com
[6] OpenRISC 1000 Architecture Manual, www.opencores.
Appendix A Open RISC Reference Platform

This is the ORP standard memory map. The actual memory map for our system is in section 1.4.1.

A.
A.
Appendix B The Wishbone specification

B.1 Introduction

The Wishbone specification basically dictates the interfaces and how they should behave. The method of connecting the interfaces to each other is very much up to the designer. This chapter contains a brief explanation of the Wishbone specification (which consists of approximately 140 pages). More information can be found in the official specification [9]. Some rules from the specification are cited in the following text.
Table B.1: Wishbone signals (named from the master side).

Name   Direction  Width  Description
adr    M->S       32     Address bus
dat_o  M->S       32     Data bus out
dat_i  M<-S       32     Data bus in
we     M->S       1      Write enable
sel    M->S       4      Byte selects
stb    M->S       1      Strobe signal
cyc    M->S       1      Valid bus cycle
ack    M<-S       1      Bus cycle acknowledgment
cti    M->S       3      Cycle type identifier
bte    M->S       2      Burst type extension
err    M<-S       1      Bus cycle error

B.
B.2.6 cyc

This signal indicates that a valid bus cycle is in progress. This signal should be asserted for the duration of all (consecutive) bus cycles.

B.2.7 ack

This acknowledgment input indicates the normal termination of a bus cycle. Abnormal termination is indicated through the err signal.

B.2.8 cti

The cti signal indicates the current bus cycle type. This signal is described in more detail in the section on incrementing burst cycles, see section B.4.

B.2.
[Figure B.3: Wishbone classical single write cycle. Master signals shown: CLK_I, ADR_O, DAT_O, DAT_I, WE_O, SEL_O, STB_O, CYC_O, ACK_I.]

[Figure B.4: Wishbone classical block cycles. Signals shown: CLK_I, CYC_O, STB_O, ACK_I, with wait states marked WSM and WSS.]

Burst-style accesses can be done through the classical bus cycles. This will (almost always) require wait-states to be inserted at various points during the access. Fig. B.
Table B.2: cti and bte signal values

Signal group  Value    Description
cti           000      Classic cycle
              001      Constant address burst cycle
              010      Incrementing burst cycle
              011-110  Reserved
              111      End of burst
bte           00       Linear burst
              01       4-beat wrap burst
              10       8-beat wrap burst
              11       16-beat wrap burst

[Figure B.5: Wishbone linear increment burst. Master signals shown: CLK_I, CTI_O, BTE_O, ADR_O, DAT_O, DAT_I, WE_O, SEL_O, CYC_O, STB_O, ACK_I. CTI_O is 010 during the burst and 111 for the last transfer, BTE_O is 00, and ADR_O steps through n, n+4, n+8, n+C.]
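The address sequence of an incrementing burst follows from the bte value. The C sketch below computes the next address for each burst type, assuming a 32 bit data bus (4 bytes per beat, as in Table B.1) and word-aligned addresses. The function and constant names are our own and do not come from the specification or the lab code.

```c
#include <assert.h>
#include <stdint.h>

/* bte encodings from Table B.2 (constant names are assumptions). */
enum { BTE_LINEAR = 0, BTE_WRAP4 = 1, BTE_WRAP8 = 2, BTE_WRAP16 = 3 };

/* Next address in an incrementing burst on a 32 bit bus. For a linear
 * burst the address simply advances by 4. For a wrap burst the address
 * advances by 4 but wraps inside an aligned window of 4, 8 or 16 beats. */
static uint32_t wb_next_burst_addr(uint32_t adr, unsigned bte)
{
    assert(bte <= 3u);
    assert((adr & 3u) == 0u);              /* word aligned */

    uint32_t next = adr + 4u;
    if (bte == BTE_LINEAR)
        return next;

    uint32_t window = 4u << (bte + 1u);    /* 16, 32 or 64 bytes */
    uint32_t mask = window - 1u;
    return (adr & ~mask) | (next & mask);  /* keep upper bits, wrap lower */
}
```

A slave that supports bursts can use the same rule to generate the address of the next beat itself.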
Appendix C Tips & Trix

In this appendix we have collected a number of tips and trix that you might find useful.
• If you encounter some weird problems with the hardware you can try to turn the power off to the FPGA system before configuring it. We have had some problems with the FPGA board which can be solved in this manner.
• You do not have to restart Modelsim or recompile all files to get new changes included in the simulation. Just recompile the file you changed and type restart -f in Modelsim.
• If you cannot find certain commands, make sure that the following commands are present in your .bashrc: source /opt/Xilinx/settings.