ARM Mali GPUMidgard ArchitectureMathias PalmqvistSenior Software Engineer/ARM MPGLund University, Faculty of Engineering LTHDecember 6th 2016 ARM 2016

Agenda Midgard Architecture OverviewDevice Driver Work buildingGPU Software InterfaceGPU Hardware JobsJob ManagerUnified Shader CoreShader Core:Vertex WorkTiling UnitShader Core: Fragment WorkShader Core: Tripipe Power Consumption & Bandwidth Reduction ARM Lund 2 ARM 2016Power ConsumptionForward Pixel KillTransaction EliminationARM Framebuffer CompressionPixel local storageARM Lund GPU teamsGPU HW/SW developmentParty!Student Oppurtunities

Mathias PalmqvistWork Senior Software Engineer, ARM 5 yearsGPU driver, EGL on Android/X11, Multimedia. Previously at Sony Ericsson 7 yearsInterests Game Development(often non-graphics related)Real-time techniques with mobile focusPersonalAndroid Graphics Integration. Education LTH, M.S.EE (e98). Fourth year at U.C. San Diego.Hardware and graphics main focus.3 ARM 2016Married to California girl from UCSD since 13yearsTwo children, 5 and 2. Aiden and Avery.

Midgard GPU Architecture4 ARM 2016

AcrynomsAMBAAXIAPBACEGPUVPUDPUISASIMD5 ARM 2016Advanced Microcontroller Bus ArchitectureAMBA Advanced eXtensible InterfaceAMBA Advanced Peripherial BusAMBA AXI Coherency ExtensionsGraphics Processing UnitVideo Processing UnitDisplay Processing UnitInstruction Set ArchitectureSingle Instruction Multiple Data

Midgard Architecture Overview Tile-based, Unified Shading ArchitectureLatest Midgard GPU is T880Maximum of 16 shader coresTile size 16x16 (4x4-32x32 internally)GLES 3.2,Vulkan 1.0, CL 1.2, DX FL11 2MSAA 4x, 8x,16xT880 in Samsung GS7 and Huawei P9 First product T604 in 2011q4 6SoC: Samsung Exynos 5250Products: Nexus10, Samsung Chromebook ARM 2016

GPU top orecoreMMUInterconnectL2 Cache #0Driver7 ARM 2016L2 Cache #1External Shared MemoryTiling Unit

Simplified Application-Driver InteractionAppFrame 0Frame 1GPUF0F1ScheduledScheduled8 ARM 2016FlushedF2F0F3ScheduledF1PostDPUFrame 3FlushedFlushedFlushedDriverFrame 2F2PostPostF0F1

Driver building GPU resource structuresAppDriverTex StructureShader1 StructureglTexImageX()Build Texture structureFormat, DimensionsData PointerProgramPointerglShaderX()glBufferX()9Build Shader structureBuild Vertex/Indexbuffer structuresBuffer StructureFormat, DimensionsData PointerShader2 StructureglDraw()glShaderX()glDraw().Build Vertex Job StructureLink with Buffers()Build Fragment Job StructureLink with FB and last Vertex JobFrameBuffer PointerLast Job Pointer ARM ob2BufferPointerStructureIndex PointerBuffer PointerShader PointerIndex PointerShader Pointer

GPU hardware job typesVertex Job(V)Vertex shader running on a set of verticesTiler Job(T)Tiling Unit (Fixed Function) work to split transformed primitivesinto affected tilesFragment Job(F)Single render target job running over all tilesJob ChainA chain of jobsJob Chain 1Job Chain 2VVT10 ARM 2016VTVTTF

GPU job progression11 ARM 2016

GPU Software ruptJobsAPBJobManagerAXIExternal Shared MemoryAXIVertexJob1 StructureVertexJob2 Structure12 ARM 2016TilingJob1 StructureTilingJob2 StructureFragmentJobStructure

Job ManagerVertexJob (96 vertices)TilingJobFragmentJob (Full Frame)Job TaskNFTask3VTask2TTask2FTask2Shader CoreVTask1Shader CoreShader CoreTTask1Tiling UnitExternal Shared Memory13 ARM 2016FTask1(Tile1)FTask2(Tile2)Shader CoreShader CoreShader CoreFTask1

Unified Shader CoreVertex FrontendFragment FrontendExternal Shared MemoryTiler DataTripipeAttributes/Uniforms inVaryings outTexturesFramebufferFragment Backend14 ARM 2016TileTileBufferTileBufferBuffer

Shader Core: Vertex WorkVertex Thread CreatorThread 1Thread 2Thread 3Thread 4 Vertex Threads do not write to tile buffers butdirectly to main memoryVertex Tasks contain multiple of 4 verticesExternal Shared MemoryAttributes/Uniforms inVaryings out15 ARM 2016

Tiling UnitTiler’s goalHierarchy Level 0 - 16x16 tile binsHierarchy Level 1 - 32x32 tile binsHierarchy Level 2 - 64x64 tile bins1234Find out what tiles are covered by aprimtive. Update tile structure with that 025Small primitive - Affect few tiles - Use low hierarchylevel - Save read bandwidthLarge primitive - Affect many tiles - Use highhierarchy level - Save write bandwidthHeuristic approachDetermine what is best given distribution16 ARM 2016

Shader Core: Fragment WorkFragment Front-EndFragment Back-EndPolygon List ReaderDepth-bounds &Hierarchical ZS TestTile WritebackRasterizerTile MemoryEarly ZS TestingBlendingFPK Queue (128 Quads)Fragment Thread Creator17 ARM 2016TripipeLate ZS Testing

Shader Core: Tripipe details SIMD processing core that executesinstructions from the Midgard ISA 3 types of pipes, but 5 total 128-bit operands, float or integer 183 Arithmetic1 Load/Store1 Texture2xFP64, 4xFP32, 8xFP16Up to 256 active threads at the same time ARM 2016

Power ConsumptionandBandwidth Reduction19 ARM 2016

Power Consumption Power comparision Battery life time & heat dissapationStatic and Dynamic power consumption Phone SoC 1-3WTablet 10WDesktop Gfx card 100-200WConstraints Approximate GPU powerconsumption distributionStatic power. Related to technology and area. Improved through power management(gating)Dynamic power. Heavily tied to use case. Responsability of designer.Performance DensityFPS/mm 2. Smaller Area and less dynamic power - possibly more cores20 ARM 2016TexturePipeArithmeticPipesOther

Reduce Overdraw: Forward Pixel KillInsert small FIFO for 2x2 quads afterEarlyZ but before entering tripipe.Reject quads ”in front” before theyreach expensive tripipe execution21 ARM 2016

Transaction Elimination 22Signature/Hash calculated of color tilebuffer before write-out and saved.If it matches previous signature, write-out is skipped saving bandwidth. ARM 2016

ARM Framebuffer Compression/AFBC 23Real-time, lossless, small-area framebuffer compression technologyUsed on both external(window) and internal(FBO) buffers for GPUFormats: RGB/YUV (including 10-bit), depth/stencil ARM 2016

EXT shader pixel local storage Bandwidth Efficient Deferred ShadingDefine G-Buffer in Pixel Local Storage space(in tile memory)G-Buffer initialization24 ARM 2016Lighting accumulationFinal shading

ARM Lund25 ARM 2016

ARM Lund GPU teams Hardware design/verification 15 (Fragment front/back-end)GPU Modelling 17Graphics/Backend Compiler team 16HW/SW regression testing 10Multimedia(GPU VPU DPU) development regression 4ARM Lund total 100. Remaining are Video IP development.Cambridge/Trondheim major GPU development sites26 ARM 2016

GPU HW Development HDL: Verilog System Veriloggit/gerritCluster based RTL simulation frameworkHW Engineers usually with low-spec desktop/laptop. 27Everything runs on massive CPU clusters in Cambridge ARM 2016

GPU SW Development Sw tools: git/gerrit/svn, scons(DDK configuration), codecollaborator, JIRAGPU DDK release cycle: scrum-based. Release every iteration. Regression Server Test Framework: Internal Handles FPGA board flashing, driver building, board allocation and test execution, automaticbug filing with git bisectHardware platforms 28Iteration: 2 monthsSprints: 2 weeks. 4 sprints/iterationARM Juno/VersatileExpress single/multiple FPGA tilesSilicon development platforms with MaliMali GPU Model (bit accurate, close to cycle accurate) ARM 2016

FPGA Lab29 ARM 2016

ARM Lund Open-House Event TONIGHT(6/12) Where What 30Tuesday December 6:th @17.15 Emdalavägen 6, 4:th floorFood/Drinks/SnacksLund Team ”stations”. Talk to a hw designer or compiler engineerGraphics & Video HW/SW presentationsDemosMingle ARM 2016

ARM Lund Student opportunities 31Graduate job offerings – apply through www.arm.com/careersInternship (part/full time) – apply through www.arm.com/careersMaster thesis work – see adds on hand outs, apply to [email protected] you have any questions or specific proposals on how to engage with ARM, justdrop an email to [email protected] ARM 2016

Q &AThe trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited(or its subsidiaries) in the EU and/or elsewhere. All rights reserved. All other marks featured may betrademarks of their respective owners.Copyright 2016 ARM Limited ARM 2016

Backup slides33 ARM 2016

Midgard Shader Core: Fragment Frontend Polygon List Reader / Triangle Setup Depth-bounds and Hierarchical-Z S Testing Cull quads based on ZS values, if possibleThose which pass ZS may cull old quads in FPK QueueBuffer with 128 quads; Freya supports priority quad issue into FTCFragment Thread Creator 34RasterizerEarly ZS TestingFPK Queue Generate 2x2 fragment quads with per per-fragment coverage maskEarly ZS Testing Conservative fast cull of primitives based on line equationsDepth-bounds &Hierarchical ZS TestRasterizer Loads primitives from the tiler polygon list; 13 cycles a trianglePolygon List ReaderSpawn a quad as four fragment threads over four cycles ARM 2016FPK Queue (128 Quads)Fragment Thread Creator

Midgard Shader Core: Fragment Backend Late ZS Testing 35Storage for color and depth stencil dataMali-T760 onwards provide flexible allocation for MRT and/or MSAATile Writeback BlendingBlend up to 4 samples per clock for subset of blend equations 4x MSAA needs 1 cycle, 8x needs 2 cycles, and 16x needs 4 cycles Complex blend operations implemented using blend shadersTile Memory Resolve any outstanding ZS tests we couldn’t do earlyBlending Late ZS TestingRequires one cycle per output pixelProvides the CRC generation for transaction eliminationAFBC compression and YUV writeback supported (Mali-T760 onwards) ARM 2016Tile MemoryTile Writeback

Smart Composition Selectively update parts of a frame Khronos EGL extensions 36KHR partial updateEXT buffer age Producer can utilize age of existing backbuffer contents to draw frame withmultiple bounding boxes Similar to TE, but here we wont even write to the tilebuffer ARM 2016

GPU DDK release cycle: scrum-based. Release every iteration. Iteration: 2 months Sprints: 2 weeks. 4 sprints/iteration Regression Server Test Framework: Internal Handles FPGA board flashing, driver building, board allocation and test execution, automatic bug filing with git bisect