Vector Processors

International Journal of Computer Applications (0975 – 8887), Volume 20 – No. 4, April 2011

Architecture of SIMD Type Vector Processor

Mohammad Suaib, National Institute of Technology Hamirpur, India
Abel Palaty, National Institute of Technology Hamirpur, India
Kumar Sambhav Pandey, National Institute of Technology Hamirpur, India

ABSTRACT
Throughput and performance are the major constraints in designing system-level models. Because a vector processor uses deeply pipelined functional units, the operations on the elements of a vector are overlapped in the pipeline, but the elements are still processed one by one. Vector processing can be improved by adding parallelism to the execution of these operations so that they are performed simultaneously. This paper presents the design and implementation of a SIMD-Vector processor that applies this parallelism to short vectors of 4 words. The operation on these words is performed simultaneously, i.e. in one cycle, which reduces the clock cycles per instruction (CPI). Implementing parallelism in vector processing requires parallel issue and execution of vector instructions. A vector processor operates on a vector, and a superscalar processor issues multiple instructions at a time; here, parallel pipelines are implemented and then made to support vector data. The SIMD-Vector processor operates on a short vector, say a 4-word vector, in a superscalar fashion: 4 words are fetched at a time and then executed in parallel. This requires replicated functional units; for example, adding two vectors needs multiple adders. We have designed the architecture of a SIMD type vector processor, and all the design parameters are explained.

Keywords
SIMD type Vector processor, vertical and horizontal parallelism, ILP.

1.   INTRODUCTION
Parallel processing is the need of today's architectures.
Parallel processing reduces the execution time of a program. That execution time is determined by three factors: first, the number of instructions executed; second, the number of clock cycles needed to execute each instruction; and third, the length of each clock cycle. Here we try to reduce the number of clock cycles by introducing a new processor, the SIMD type vector processor. Superscalar and VLIW architectures improve performance by reducing the cycles per instruction (CPI). The proposed architecture takes the advantages of a superscalar processor as well as a vector processor. The SIMD-Vector architecture supports in-order issue with out-of-order completion. All vector instructions are issued in order and kept in the instruction cache. After structural and data hazards have been checked, the vector instructions are executed out of order. A reorder buffer is used to write the results in order, so the output sequence is correct.

Technology has changed rapidly and significantly in the past few years. For microprocessor technologies, multimedia applications are now mainstream computing. In this scenario we can improve processor performance by exploiting data-level parallelism (DLP) and instruction-level parallelism (ILP). To exploit DLP, instructions are executed in single instruction, multiple data (SIMD) fashion. We adopt SIMD processing in general-purpose processors [2]. Multimedia applications have a lot of inherent parallelism, so it can be exploited easily by SIMD instructions at low cost and energy overhead.

This promises high theoretical performance, but in practice it is not fully achievable because of some limitations. Adding more processing units to the SIMD-Vector architecture significantly increases the hardware cost as well as the complexity of the processor. As a result, we work on short vectors.
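The three factors listed at the start of this introduction are usually combined into a single product. The line below is a sketch in our own notation; the symbols T_exec, N_instr and t_clk are assumptions for illustration and do not appear in the paper.

```latex
T_{\mathrm{exec}} = N_{\mathrm{instr}} \times \mathrm{CPI} \times t_{\mathrm{clk}}
```

With t_clk held fixed, execution time falls when either the instruction count or the CPI falls; the design discussed here targets the number of clock cycles.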
The SIMD-Vector architecture supports instructions of vector length 4. In this architecture we assume that all instructions are vector instructions of length four. The architecture has 4 execution units, and the four vector elements are processed on four different processing units in parallel, in one clock cycle. Hence we can reduce the clock cycles needed by multimedia applications. To reduce the complexity of the system, chaining is not used to improve the performance of vector processing. If an instruction has a vector length of less than four, the available vector elements are sent to the execution units and the remaining inputs are tied to ground. Short vector implementation introduces a large parallelization overhead, such as loop handling and address generation [1].

There are many examples of SIMD processors, such as IBM's VMX, AMD's 3DNow!, Intel's SSE and Motorola's AltiVec. In such processors we can embed vector processing while taking advantage of a 4-way superscalar processor.

The SIMD-Vector architecture brings new levels of performance and energy efficiency. The paper is organized as follows. Section 2 introduces the motivation for this work. Section 3 describes the SIMD-Vector architecture. Section 4 compares SIMD-Vector with other conventional vector architectures. Section 5 shows the evaluation results. Section 6 concludes the work, and Section 7 outlines future work.

2.   MOTIVATION
A vector ISA packages multiple homogeneous, independent operations into a single short instruction, which results in compact code. The code is compact because a single short vector instruction can describe N operations. This reduces instruction bandwidth requirements.

Reduction in instruction bandwidth: A single vector instruction comprises N operations, thereby reducing the instruction bandwidth. In the proposed scheme, throughput and performance can be enhanced by introducing parallelism.
This can be done by incorporating superscalar issue in vector processing.

Hardware reduction: In a vector instruction the N operations are homogeneous. This saves hardware in the decode and issue stages: the opcode is decoded once, and all N operations can be issued as a group to the same functional unit. In our proposed scheme, this is taken as the basic design constraint.

SIMD extensions and vector architectures are quite similar. The principal difference lies in how instruction control is implemented and in the communication between the execution unit and the memory unit. With the help of pipelining, a vector processor can overlap computation, load and store operations on vector elements, so the vector length may be long and variable. This kind of parallelism is called vertical parallelism. Instruction latency is greater than one cycle per vector element. A SIMD extension, in contrast, duplicates the execution units to perform parallel execution. This type of parallelism is called horizontal parallelism. Because of the hardware cost we cannot add many execution units, so the vector length should be fixed and short.

    for (int a = 0; a < 64; a++) {
        z[a] = x[a] + y[a];
    }
(a) Scalar form

    for (int a = 0; a < 64; a += 4) {
        z[a+3:a] = x[a+3:a] + y[a+3:a];
    }
(b) SIMD-Vector form

In the example above, the scalar form runs 64 iterations, each with an instruction latency of one clock cycle. On the SIMD-Vector architecture the four element operations of one vector instruction execute simultaneously in one clock cycle, so the loop needs only 64/4 = 16 iterations and the total latency is just over 16 cycles.

3.   SIMD TYPE VECTOR PROCESSOR
In this section we describe the architecture of the SIMD-Vector processor, its pipelining and the working of the proposed architecture.

3.1   Proposed Architecture
In the proposed architecture the SIMD unit is the functional unit that performs the vector operations. It is similar to a conventional SIMD unit. An architectural overview of the proposed scheme is given in Figure 1.
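To make the role of the SIMD unit concrete, the loop pair from Section 2 can be written as plain C. This is only a software sketch under our own assumptions: the function names add_scalar and add_vector are hypothetical, and the inner lane loop merely models the four execution units, which the hardware would run in a single clock cycle.

```c
#include <assert.h>
#include <stddef.h>

#define VL 4  /* vector length assumed by the SIMD-Vector design */

/* Scalar form: one element per iteration. Returns the iteration count. */
size_t add_scalar(const int *x, const int *y, int *z, size_t n) {
    size_t iterations = 0;
    for (size_t a = 0; a < n; a++) {
        z[a] = x[a] + y[a];
        iterations++;
    }
    return iterations;
}

/* SIMD-Vector form: each outer iteration stands for one vector instruction.
   The inner loop models the VL lanes; software runs them sequentially,
   whereas the hardware described in the text runs them in parallel. */
size_t add_vector(const int *x, const int *y, int *z, size_t n) {
    assert(n % VL == 0);  /* this sketch handles whole vectors only */
    size_t iterations = 0;
    for (size_t a = 0; a < n; a += VL) {
        for (size_t lane = 0; lane < VL; lane++) {
            z[a + lane] = x[a + lane] + y[a + lane];
        }
        iterations++;
    }
    return iterations;
}
```

Called with n = 64, add_scalar reports 64 iterations and add_vector reports 16, which is the 64/4 reduction described in Section 2.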
For a given set of vector operations, the SIMD unit executes one vector instruction at a time, with its four vector elements processed concurrently. To handle long vector operations we need a smart compiler that vectorizes the instructions; all vectorized instructions should be of length 4. We add an additional unit, called the vector code cache (VCC), to handle long vector operations. We restrict the size of the VCC to 1 KB, which can store 256 operations with 32-bit instruction encoding; that is enough for most multimedia applications. The loop controller generates the loop control signals that complete long vector operations, keeping in mind that 4 operations can be done in one clock cycle. Providing memory locations for all the vector elements is very tedious with a conventional memory system. To supply strided memory locations to the vector elements we need an address generator unit [3]. This address generator unit is connected to the vector register file and to memory via the load-store unit. All remaining units are conventional and have their standard meaning. Figure 2 shows the SIMD unit, which has 4 execution units that can execute 4 operations in parallel in one clock cycle.

Table 1. Architectural parameters

Parameter   Explanation                        Value
B_S         Bit size of SIMD unit              128
B_VRF       Bit size of vector register file   128
B_LS        Bit size of load-store unit        128
B_VE        Bit size of vector element         32
L_V         Vector length                      4

The parameters of the SIMD type vector processor are listed in Table 1. A vector register must hold 4 vector elements of 32 bits each, so the width of the vector register file (VRF) is 128 bits. The SIMD unit is likewise 128 bits wide, as is the load-store unit that forms the memory interface. This type of architecture is well supported by IBM's AltiVec ISA [4] and Intel's SSE ISA. Vector elements are 32 bits long, and the proposed architecture supports instructions of vector length 4.
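The interplay of the vector code cache, the loop controller and the address generator described in this section can be sketched in C. Everything here is an illustrative assumption rather than the authors' design: the function names are hypothetical, memory is modelled as plain arrays indexed by element, and lanes beyond the active count are modelled as zero inputs to mirror the introduction's rule that unused inputs are tied to ground.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VL 4            /* L_V: vector length (Table 1) */
#define INSTR_BITS 32   /* instruction encoding width given in the text */
#define VCC_BYTES 1024  /* vector code cache size: 1 KB */

/* Number of 32-bit instruction encodings that fit in the VCC. */
unsigned vcc_capacity(void) {
    return VCC_BYTES / (INSTR_BITS / 8);
}

/* Hypothetical address generator: one element index per lane,
   starting at base and advancing by stride. */
void gen_addresses(size_t base, size_t stride, size_t idx[VL]) {
    for (size_t lane = 0; lane < VL; lane++) {
        idx[lane] = base + lane * stride;
    }
}

/* One vector add on strided operands. Only the first `count` lanes are
   active; inactive lanes keep zero inputs and are not written back. */
void vadd_strided(const int32_t *x, const int32_t *y, int32_t *z,
                  size_t base, size_t stride, size_t count) {
    assert(count <= VL);
    size_t idx[VL];
    int32_t a[VL] = {0}, b[VL] = {0};
    gen_addresses(base, stride, idx);
    for (size_t lane = 0; lane < count; lane++) {  /* load active lanes */
        a[lane] = x[idx[lane]];
        b[lane] = y[idx[lane]];
    }
    for (size_t lane = 0; lane < VL; lane++) {     /* all lanes compute */
        int32_t sum = a[lane] + b[lane];
        if (lane < count) {
            z[idx[lane]] = sum;                    /* store active lanes */
        }
    }
}

/* Hypothetical loop controller: splits n strided elements into vector
   instructions of length VL and returns how many were issued. */
size_t vadd_long(const int32_t *x, const int32_t *y, int32_t *z,
                 size_t n, size_t stride) {
    size_t issued = 0;
    for (size_t done = 0; done < n; done += VL) {
        size_t left = n - done;
        size_t count = left < VL ? left : VL;
        vadd_strided(x, y, z, done * stride, stride, count);
        issued++;
    }
    return issued;
}
```

For n = 5 elements at stride 2, vadd_long issues two vector instructions: a full one covering element indices 0, 2, 4 and 6, and a partial one with a single active lane at index 8. The arrays must hold at least (n - 1) * stride + 1 elements.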
3.2   Pipelining in SIMD Type Vector Processor
Figure 3 shows how pipelining is exploited in the SIMD-Vector architecture. The x axis shows clock cycles and the y axis shows vector instructions (VI). A five-stage pipeline is shown. The pipeline structure makes clear that there are four functional units, which operate simultaneously on 4 vector elements in one clock cycle.

[Fig 1: Proposed Architecture of SIMD type Vector Processor. Blocks: IF, ID, Loop Controller, SIMD Unit, I Cache, VRF, LD/ST, Address Generator, D Cache, Data Bus.]

[Fig 3: Pipelining in SIMD type Vector Processor]

3.3   Working of SIMD Type Vector Processor
In SIMD-Vector, a superscalar implementation is converted to support vector data instead of scalar data. Implementing parallel operations on vectors requires replicated functional units. The behavior of SIMD-Vector is shown in Figure 4.

[Fig 4: Working of SIMD-Vector processor]

4.   COMPARISON WITH OTHER ARCHITECTURE
In this section we compare the SIMD-Vector architecture with SIMD extensions and vector architectures. The proposed architecture takes the advantages of SIMD as well as vector processors. The width of the SIMD-Vector VRF is much smaller than that of vector architectures implemented in recent single-chip processors [5,6].

[Fig 2: SIMD unit. Blocks: CU; four Regs; PE1, PE2, PE3, PE4; four mem; Data Bus.]

Table 2. Architecture comparison

Feature               SIMD-Vector                    SIMD                      Vector
Vector length         4                              32                        >= 64
Memory access         Automatic address generation   Sequential access         Strided access
Instruction latency   1 cycle per vector element     1 cycle per instruction   1 cycle per element
Parallelism           Combined                       Horizontal                Vertical

5.   EVALUATION
The proposed SIMD-Vector architecture can enhance the performance of the system.
We analyzed the instruction counts of several multimedia operations, namely fast Fourier transform (FFT), matrix multiplication (MAT), finite impulse response (FIR) filter and infinite impulse response (IIR) filter, on scalar, SIMD and SIMD-Vector architectures. The results of the analysis are shown in Figure 5. The figure shows that the number of instructions is lower with the SIMD-Vector architecture.

[Fig 5: Comparison of instruction counts. Vertical axis: instruction count, 0 to 1; benchmarks: FFT, MAT, FIR, IIR; series: Scalar, SIMD, SIMD-Vector.]

6.   CONCLUSION
The SIMD-Vector processor implements parallelism on short vectors of four words. The operation on these words is performed simultaneously, i.e. in one cycle, which reduces the clock cycles per instruction (CPI). Parallelism in vector processing requires superscalar issue of vector instructions. This paper gives the architecture of the proposed processor, which can be exploited in many multimedia applications.

7.   FUTURE WORK
In the future, the parallelism can be extended to support longer vectors with more words. This leads to an increase in hardware, as more parallelism requires more functional units.

8.   REFERENCES
[1]   Shin, J., Hall, M.W., Chame, J.: Superword-Level Parallelism in the Presence of Control Flow. In: CGO 2005, pp. 165-175 (2005).
[2]   Lee, R.: Multimedia Extensions for General-purpose Processors. In: SIPS 1997, pp. 9-23 (1997).
[3]   Talla, D.: Architectural techniques to accelerate multimedia applications on general-purpose processors, Ph.D. Thesis, The University of Texas at Austin (2001).
[4]   Diefendorff, K., et al.: AltiVec Extension to PowerPC Accelerates Media Processing. IEEE Micro 20(2), 85-95 (2000).
[5]   Corbal, J., Espasa, R., Valero, M.: Exploiting a New Level of DLP in Multimedia Applications. In: MICRO 1999 (1999).
[6]   Kozyrakis, C.E., Patterson, D.A.: Scalable Vector Processors for Embedded Systems. IEEE Micro 23(6), 36-45 (2003).
[7]   Yeager, K.: The MIPS R10000 Superscalar Microprocessor. IEEE Micro 16(2), 28-41 (April 1996).
[8]   Smith, J.E., Sohi, G.S.: The Microarchitecture of Superscalar Processors. Proceedings of the IEEE 83(12), 1609-1624 (December 1995).
[9]   Open SystemC Initiative (OSCI), www.systemc.org.