ECE382M.20: SoC, Lab 1 (2024)

The goals of this lab are to:

• Learn the structure of the Darknet source code and compile the code for the ARM platform

• Identify and propose ways to remove the bottleneck of the code when run on the ARM platform

The assignment of this lab includes the following:

• Set up the design and board environment

• Profile the code to identify the time-consuming portions of the code

• Complete an exercise to remove a type of bottleneck

• Isolate modules of Darknet and perform floating- to fixed-point conversion

• Perform additional software optimizations

Lab work for this class can be done either on the ECE Department’s LRC Linux servers or the Ultra96 board. For Lab 1, you can compile the application either on the board or cross-compile it on the LRC servers, but our target is the ARM platform, i.e. all profiling will need to be done on the board itself. Note that for functional testing, you can also natively compile and execute Darknet on any other platform, e.g. an Intel machine.

(a) ECE Linux Servers

We will be using the LRC servers for the class. Instructions for remote access via ssh are listed here: https://wikis.utexas.edu/display/eceit/ECE+Linux+Application+Servers

You can use the /misc/scratch directory on the LRC machines as your own workspace. The scratch directory will not be wiped until the end of the semester. However, scratch space is also not backed up, i.e. use at your own risk. Execute the following commands:

% cd /misc/scratch

% mkdir <yourusername>

For software development targeting the board, we will be using Xilinx’s SDK. This includes the capability to compile and link applications for the board using the aarch64-linux-gnu-gcc cross-compiler toolchain, which is installed on the LRC machines and provided by Xilinx together with their development environment:

% module load xilinx/2018
% source /usr/local/packages/xilinx_2018/vivado_hl/SDK/2018.3/settings64.sh

(b) Boards

Each team will get an Ultra96 board pre-installed with Ubuntu 18.04. You can connect to the board initially from a Linux or Windows host via USB-UART as follows:

1. Power on and connect the Ultra96 board to the host machine using the provided USB-UART serial cable. If the board doesn’t boot automatically, press the Power Button (SW4). The blue Power On and Done LEDs (D1/D2) next to the microSD card socket should be on.

2. On a Linux host, search the kernel messages with the command dmesg | grep tty and look for an indication that the USB-UART is enumerated as a device (typically listed as /dev/ttyUSB1). Connect to the device with the minicom application, using the following command:

% minicom -D /dev/ttyUSB1 -b 115200 -8 -o

The minicom terminal will connect and let you interact with the Ultra96 board terminal. For further details about the board and its bring-up, you can consult the Open HW Wiki.

3. On Windows, go into the Device Manager to find the COM port for the USB connection and use a terminal application like PuTTY to connect with a baud rate of 115200. See this getting started link for general details. If the device driver for the USB UART is not automatically installed, or for further troubleshooting, please see the USB-to-JTAG/UART pod documentation by Avnet.

The login/password will be provided with the board. This account has root access via sudo. To set up Wifi on the board, first put the SSID and pre-shared key (PSK) of the network you want to connect to into the /root/wpa_supplicant.conf file. To generate the PSK from a plain-text password, run:

% wpa_passphrase <ssid> <password>

and copy and paste the PSK entry into /root/wpa_supplicant.conf.

If you are on campus, you need to use the “utexas-iot” network and register the board’s MAC address (stamped onto the Wifi chip) with ITS here. This will give you the PSK value to put into wpa_supplicant.conf. Important: don’t forget to de-register the device from your EID at the end of the semester or you will be on the hook for any shenanigans by future users of the board!

Then start Wifi with:

% sudo /root/wifi.sh

This command may take ~30s to execute, but as long as the SSID and PSK are correct, it should connect. To run an ssh server on the board, you can follow this guide. You can then connect your board to the network and potentially use ssh to access the board remotely via Wifi. Install any necessary tools/libraries as you wish.

Again, you can compile the application directly on the board or cross-compile it on the LRC servers:

a) Get the latest Darknet code from the following link: https://github.com/AlexeyAB/darknet

% git clone https://github.com/AlexeyAB/darknet

b) Go to the Darknet directory

% cd darknet

c) Compile the Darknet sources. If you are cross-compiling for the board on the LRC servers, first update the Makefile to use the correct compiler settings:

CC=aarch64-linux-gnu-gcc
CPP=aarch64-linux-gnu-g++

Then, run make in the Darknet directory:

% make

d) We will be using the pre-trained Tiny YOLO CNN for small and embedded devices. Get the pre-trained weight model from the following link:

% wget https://pjreddie.com/media/files/yolov3-tiny.weights

e) If you cross-compiled on the LRC machines, transfer the darknet executable and all configuration settings (weights file and cfg/ and data/ subdirectories) to the board. Test and run Darknet/YOLO with the following command on the board:

% ./darknet detector test cfg/coco.data cfg/yolov3-tiny.cfg yolov3-tiny.weights data/dog.jpg -save_labels

The -save_labels flag will produce the golden reference output with detected classes and bounding boxes and save it in the file data/dog.txt. You can also look at the generated predictions.jpg for a visual representation of detection results.

For more information, and to get familiar with Darknet concepts and the source code, read the material and go through the following links:

https://pjreddie.com/darknet/yolo/

https://pjreddie.com/media/files/papers/yolo.pdf

a) Before you can profile your program, you must first recompile it specifically for profiling. To do so, add the -pg option to the CFLAGS line in the Makefile. Then, recompile the code.

b) Profile the code using:

% ./darknet detector test cfg/coco.data cfg/yolov3-tiny.cfg yolov3-tiny.weights data/dog.jpg

This command does not overwrite the reference data/dog.txt output file by default (unless the -save_labels option is included). This will allow us to use the original reference output as ground truth to compare against when we start making modifications and optimizations of Darknet as discussed below.

c) Running the program to completion causes a file named gmon.out to be created in the current directory. gprof works by analyzing the data collected during the execution of your program after your program has finished running. gmon.out holds this data in a gprof-readable format.

d) Run gprof as follows:

% gprof darknet gmon.out > darknet.perf

e) Identify the bottleneck of the code based on the execution time of each function. Report your profiling results.

As you probably realize by now, the general matrix-matrix multiply (GEMM) part in the convolutional layers occupies the dominant share of the total execution time. GEMM is known to be a computationally intensive and expensive operation. Now, let’s try to do some optimization to improve the execution speed of the GEMM. Image processing or object detection applications like YOLO in general require algorithms that are typically specified using floating-point operations. However, for power, cost, and performance reasons, they are usually implemented with fixed-point operations either in software or as special-purpose hardware accelerators. To that end, we will convert the floating-point GEMM in Darknet to a fixed-point GEMM.

First, isolate the GEMM as a standalone program from the Darknet code. By default, Darknet’s GEMM uses a float data type. Convert the GEMM data type from floating-point to fixed-point using only integer data types, such as short/long ints (signed or unsigned). This code snippet shows how to perform floating- to fixed-point conversion in C/C++.
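As a rough illustration of what such a conversion can look like, the sketch below stores a real value x as round(x * 2^frac_bits) in a 16-bit signed integer. The helper names float_to_fixed and fixed_to_float, the 16-bit container, and the 8 fractional bits used in the usage note are assumptions made for this example, not something prescribed by Darknet or by the lab; the right word length and scaling depend on the value ranges you observe in the GEMM.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: convert a float to a signed 16-bit fixed-point value
 * with frac_bits fractional bits, rounding to nearest and saturating. */
int16_t float_to_fixed(float x, int frac_bits)
{
    assert(frac_bits >= 0 && frac_bits < 15);
    float scaled = x * (float)(1 << frac_bits);
    /* Round half away from zero without relying on libm. */
    scaled += (scaled >= 0.0f) ? 0.5f : -0.5f;
    /* Saturate instead of wrapping around on overflow. */
    if (scaled > (float)INT16_MAX) return INT16_MAX;
    if (scaled < (float)INT16_MIN) return INT16_MIN;
    return (int16_t)scaled; /* the cast truncates toward zero */
}

/* Hypothetical helper: convert a fixed-point value back to float. */
float fixed_to_float(int16_t q, int frac_bits)
{
    assert(frac_bits >= 0 && frac_bits < 15);
    return (float)q / (float)(1 << frac_bits);
}
```

For example, float_to_fixed(1.5f, 8) yields 384 and fixed_to_float(384, 8) recovers 1.5, while out-of-range inputs saturate rather than wrap around.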

As you are converting the GEMM to fixed-point, a certain amount of accuracy loss is unavoidable. This whole idea of trading off accuracy with execution speed is often called Approximate Computing. In the context of the standalone GEMM, we can define an accuracy metric by the signal-to-noise ratio (SNR). An example for calculating SNR using Matlab is given below, where the output matrices of the floating-point GEMM and fixed-point GEMM are assumed to be cout_flp and cout_fxp, respectively:

ddiff = cout_fxp - cout_flp;

disp(['SNR is ', num2str(10*log10(sum(cout_flp(:).^2)/sum(ddiff(:).^2))), ' dB']);

Try to maximize the SNR of your fixed-point GEMM. Aim to achieve at least 40 dB of SNR. Report the SNR of your converted GEMM. You can use this testbench to report the SNR.

Integrate the fixed-point GEMM back into the Darknet code and explore opportunities for further optimizations in the larger Darknet context. Some hints for possible avenues:

• So far, we have performed the floating- to fixed-point conversion at the GEMM boundary. This incurs conversion overhead on every GEMM call. To gain more significant system-wide performance, you can explore pushing the conversion boundary further beyond the GEMM.

• Hint: When and where is the first time in the code that we operate with floating-point images or weights? Instead of waiting until the GEMM is called to convert to fixed-point, can we convert the values earlier, e.g. the first time we see them?

• More specifically, many of the weight values used in the GEMM are constant. Can we convert the weights into fixed-point constants at compile time (rather than doing run-time conversion)?

• Some pre-processing operations before the GEMM in the convolutional layers fill the matrix C with zeros. The larger matrix C is, the longer this takes. Can we do something smarter? Do we always have to fill with zeros?

• Explore the design space of fixed-point data types. What is the smallest fixed-point data type that you can use during conversion? In general, the smaller the data type, the better in terms of performance. In particular, you can exploit more SIMD parallelism (data packing) with smaller data types (see below).
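To make two of the hints above concrete, here is a minimal sketch of a fixed-point GEMM that accumulates each output element in a local 32-bit variable and writes it to C exactly once, so C never needs to be pre-filled with zeros. It assumes row-major matrices, 16-bit operands that all share the same number of fractional bits, and a result that overwrites C. The name gemm_fixed and its signature are hypothetical and deliberately simpler than Darknet's own gemm interface.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a 32-bit value into the int16_t range. */
static int16_t saturate_i16(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Hypothetical sketch: C (M x N) = A (M x K) * B (K x N), row-major,
 * with every element stored as a Q(frac_bits) 16-bit fixed-point value. */
void gemm_fixed(int M, int N, int K, int frac_bits,
                const int16_t *A, const int16_t *B, int16_t *C)
{
    assert(frac_bits >= 0 && frac_bits < 15);
    const int32_t scale = (int32_t)1 << frac_bits;
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) {
            /* Local accumulator: no need to zero C beforehand.
             * Caveat: for large K the 32-bit sum can overflow; a 64-bit
             * accumulator or fewer fractional bits may be required. */
            int32_t acc = 0;
            for (int k = 0; k < K; ++k)
                acc += (int32_t)A[i * K + k] * (int32_t)B[k * N + j];
            /* Products carry 2*frac_bits fractional bits; rescale once
             * (integer division truncates toward zero). */
            C[i * N + j] = saturate_i16(acc / scale);
        }
    }
}
```

Rescaling once per output element, after the accumulation, also loses less precision than rescaling every individual product.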

Use profiling to measure and guide you towards achieving as much improvement in total Darknet runtime as you can, with as little loss as possible in the detection accuracy of the overall YOLO application that includes your converted fixed-point modules and interfaces. Note that as you are optimizing the entire Darknet software, a certain amount of prediction accuracy loss is expected, as discussed above. That being said, your optimized version should at least be able to predict that there are four objects in the picture: dog, bicycle, car, truck. The prediction accuracy of these four objects might vary, but the accuracy vs. performance tradeoff should be optimized.

To measure accuracy of object detection applications, a commonly used quality metric is the so-called mean Average Precision (mAP), which is essentially the average of the maximum precisions at different recall values. For further theoretical background, refer to this link. Darknet includes the capability to compute the mAP of your modified program as follows:

a) Unfortunately, the mAP computation in Darknet has a bug and crashes if fewer than 4 images are provided. To fix the bug, apply the following patch and recompile Darknet. The patch will also modify Darknet to only report mAP for object classes that are actually included in the provided image test set (as opposed to reporting average detection accuracy across all classes that the CNN was originally trained for, even if those are not tested). Make sure you are in the darknet directory and apply the patch:

b) Put the (relative) paths of the images you want to be included in the mAP computation into a coco_testdev file in the darknet directory. For example:

c) Make sure that the ground truth reference files (e.g. data/dog.txt) are the ones produced by a run of the original, unmodified floating-point Darknet implementation. Then, run your modified fixed-point implementation on the images listed in the coco_testdev file and compute the mAP:

Report the following:

• Total execution time of Darknet using your optimized fixed-point versus the original floating-point implementation.

• mAP of Darknet using your optimized fixed-point version as compared against the original floating-point detection results.
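As background for the mAP figure requested above, the following sketch computes the average precision (AP) for a single class using the classic 11-point interpolation: for each recall threshold 0, 0.1, ..., 1.0 it takes the maximum precision reached at any recall at or above that threshold and averages the eleven values. mAP is then the mean of the per-class APs. This is a generic textbook formulation under stated assumptions (detections already sorted by descending confidence and already labeled as true or false positives); it is not claimed to reproduce Darknet's internal computation, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: 11-point interpolated average precision for one class.
 * is_tp[i] is nonzero if the i-th detection (sorted by descending confidence)
 * matches a ground-truth object; n_gt is the number of ground-truth objects. */
double average_precision_11pt(const int *is_tp, size_t n_det, size_t n_gt)
{
    assert(is_tp != NULL || n_det == 0);
    assert(n_gt > 0);
    double sum = 0.0;
    for (int t = 0; t <= 10; ++t) {
        double recall_threshold = t / 10.0;
        double best_precision = 0.0;
        size_t tp = 0;
        /* Walk down the ranked list, tracking precision and recall. */
        for (size_t i = 0; i < n_det; ++i) {
            if (is_tp[i]) ++tp;
            double recall = (double)tp / (double)n_gt;
            double precision = (double)tp / (double)(i + 1);
            if (recall >= recall_threshold && precision > best_precision)
                best_precision = precision;
        }
        sum += best_precision;
    }
    return sum / 11.0;
}
```

With this definition, a detector that finds every object with no false positives scores 1.0, and each missed object or spurious detection pulls the value down.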

As an extra credit item (that will, however, be very useful for your final project design), start from your fixed-point code base developed in Step 6 and find additional opportunities to implement other optimizations that further improve software performance on the ARM. Report on the optimizations applied and results achieved.

Some suggestions for possible optimizations are:

• Exploiting SIMD vector processing. Many high-performance computing applications exploit vectorized instructions and SIMD processing capabilities such as those of our ARM A53 CPU, which includes a NEON SIMD vector unit. Leverage such hardware capabilities to further improve run-time performance. You can use this link as a starting point:

– https://community.arm.com/android-community/b/android/posts/arm-neon-programming-quick-reference

• Cache locality-aware GEMM optimization. By default, Darknet uses a naïve triple-nested loop to implement the GEMM. This does not consider data reuse opportunities from the underlying cache and memory hierarchies in the ARM platform. Implement a locality-aware GEMM and measure the performance improvement accordingly. See these links as starting points:

– https://github.com/flame/how-to-optimize-gemm/wiki

– https://sites.google.com/lbl.gov/cs267-spr2019/hw-1

• Parallelization and/or pipelining of the Darknet processing chain on our quad-core ARM platform (this may also expose opportunities for exploiting hardware/software parallelism when mapping the GEMM out into hardware in Lab 2 and the final project). This requires a deeper understanding of the Darknet processing chain, specifically to analyze dependencies (and hence parallelization opportunities) among Darknet blocks. Some basic instructions for how to implement parallel processing using the Pthreads library (available both on the board and on the Linux hosts) are available here.

• Use of the ARM Mali-400 MP2 GPU on the board.
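The first two suggestions can be combined in one kernel. The sketch below tiles the three GEMM loops so that each block of A, B and C is reused while it is still in the cache, and on NEON-capable targets it updates four 32-bit accumulators at a time with a widening multiply-accumulate. It assumes row-major matrices, 16-bit fixed-point operands, a 32-bit accumulator matrix C that the caller has initialized (for example to zero), and a caller-chosen tile size. The name gemm_i16_blocked is hypothetical, and the tile size that works best on the A53 has to be found by measurement. Compilers that do not define __ARM_NEON fall back to the scalar inner loop.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

static int min_int(int a, int b) { return a < b ? a : b; }

/* Hypothetical sketch: C (M x N, int32) += A (M x K, int16) * B (K x N, int16),
 * all row-major, processed in block x block tiles for cache reuse. */
void gemm_i16_blocked(int M, int N, int K, int block,
                      const int16_t *A, const int16_t *B, int32_t *C)
{
    assert(block > 0);
    for (int i0 = 0; i0 < M; i0 += block) {
        int i1 = min_int(i0 + block, M);
        for (int k0 = 0; k0 < K; k0 += block) {
            int k1 = min_int(k0 + block, K);
            for (int j0 = 0; j0 < N; j0 += block) {
                int j1 = min_int(j0 + block, N);
                for (int i = i0; i < i1; ++i) {
                    for (int k = k0; k < k1; ++k) {
                        int16_t a = A[i * K + k];
                        int j = j0;
#if defined(__ARM_NEON)
                        /* Four columns per step: c[l] += (int32)b[l] * a. */
                        for (; j + 4 <= j1; j += 4) {
                            int32x4_t c = vld1q_s32(&C[i * N + j]);
                            int16x4_t b = vld1_s16(&B[k * N + j]);
                            c = vmlal_n_s16(c, b, a);
                            vst1q_s32(&C[i * N + j], c);
                        }
#endif
                        /* Scalar loop: remaining columns, or all columns
                         * when NEON is unavailable. */
                        for (; j < j1; ++j)
                            C[i * N + j] += (int32_t)a * (int32_t)B[k * N + j];
                    }
                }
            }
        }
    }
}
```

Keeping the column index j innermost means B and C are accessed with unit stride, which is what both the cache and the vector loads prefer. Any speedup should be confirmed with profiling on the board.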

Talk to us (instructor or TA) if you are interested, have questions or are looking for ideas/advice around any of these topics.

Submit your report and files in Canvas. The report should list the bottlenecks identified during profiling and discuss/propose ways used to remove them. List the differences between the original flp and fxp versions of the Darknet code with respect to what you observed by profiling them. Finally, report on the results of floating-point to fixed-point conversion (Steps 5 and 6, including achieved performance improvements and accuracy analysis) and any additional optimizations you performed (Step 7). Also include the fixed-point code (tarball archives created with tar -czvf of code from Steps 5 and 6/7) as part of your report.

