Optimize AI algorithms without sacrificing accuracy

The ultimate measure of AI’s success will be how much it increases productivity in our daily lives. However, the industry faces huge challenges in measuring progress. The large and constantly evolving set of AI applications forces teams to keep finding the right algorithm, optimizing that algorithm, and finding the right tools. At the same time, complex hardware engineering is updated rapidly, with many different system architectures in play.

Recent History of the AI Hardware Conundrum

A 2019 Stanford report said AI compute demand is accelerating faster than hardware development: “Prior to 2012, AI results closely tracked Moore’s Law, with compute doubling every two years. […] Post-2012, compute has been doubling approximately every 3.4 months.”

Since 2015, when an AI algorithm first beat the human error rate in object identification, large investments in AI hardware have pushed semiconductor IP toward next-generation processing, higher-bandwidth memories, and faster interfaces in an effort to keep up. Figure 1 shows how rapidly an image classification competition progressed once modern neural networks, trained with backpropagation, were introduced in 2012 and paired with Nvidia’s compute-intensive GPU engines.

Fig. 1: After the introduction of modern neural networks in 2012, classification error rates declined rapidly and soon dropped below the human error rate.

AI algorithms

AI algorithms are too large and demanding to run as-is on SoCs designed for consumer products that require low power, small area, and low cost. Therefore, AI algorithms are compressed using techniques such as pruning and quantization. These techniques reduce the memory and computation the system requires, but they also affect accuracy. The technical challenge is to apply compression without degrading accuracy beyond what the application can tolerate.
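To make these two techniques concrete, the minimal sketch below uses PyTorch as an illustrative framework; the toy model, the 50% sparsity level, and the int8 data type are assumptions chosen for demonstration, not details from the article. In a real flow, accuracy would be re-measured on a validation set after each compression step.

```python
# Minimal sketch of pruning and quantization (illustrative only).
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A small stand-in model; a real vision network would be far larger.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(16 * 224 * 224, 10),
)

# Pruning: zero out the 50% smallest-magnitude weights of the linear layer.
# Sparsity reduces compute and storage, but too much pruning costs accuracy.
prune.l1_unstructured(model[3], name="weight", amount=0.5)
prune.remove(model[3], "weight")  # make the sparsity permanent

# Quantization: store Linear weights as int8 instead of float32, cutting
# their memory roughly 4x at some (usually small) accuracy cost.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
```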

Besides the growth in the complexity of AI algorithms, the amount of data required for inference has also increased dramatically because the input data itself has grown. Figure 2 shows the memory and computation required for an optimized vision algorithm designed for a relatively small memory footprint of 6 MB (the memory required for SSD-MobileNet-V1). As the figure shows, the biggest challenge in this particular example is not the size of the AI algorithm but the size of the input data. As pixel counts and color depth increase, memory requirements grow from about 5 MB to over 400 MB for the latest image captures. Today, the latest CMOS image sensor cameras for Samsung mobile phones support up to 108 MP. Such a camera could theoretically require 40 tera operations per second (TOPS) of performance at 30 fps and more than 1.3 GB of memory. In practice, ISP techniques and restricting AI algorithms to particular regions of interest keep requirements below these extremes, and 40 TOPS of performance is not yet available on mobile phones. But this example highlights the complexity and challenges of edge devices, and it also drives sensor interface IP. MIPI CSI-2 specifically targets this problem with region-of-interest capabilities, and MIPI C-PHY/D-PHY continue to increase bandwidth to handle the latest CMOS image sensor data sizes, which are moving toward hundreds of megapixels.

Fig. 2: System requirements for SSD-MobileNet-V1 designed for 6MB of memory, based on pixel size benchmarking results.
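To make the scaling behind Figure 2 concrete, the rough back-of-the-envelope sketch below shows how per-frame memory and compute grow with sensor resolution. The bytes-per-pixel and ops-per-pixel constants are assumed, illustrative values (not figures from the article), chosen so that a 108 MP stream lands near the ~40 TOPS order of magnitude mentioned above.

```python
# Rough scaling estimate: input data, not the model, dominates as
# resolution grows. All constants are illustrative assumptions.

BYTES_PER_PIXEL = 3      # assumed RGB, 8 bits per channel
OPS_PER_PIXEL = 12_000   # assumed per-pixel workload, MobileNet-class detector
FPS = 30

def estimate(megapixels: float) -> tuple[float, float]:
    """Return (frame memory in MB, compute in TOPS) for one camera stream."""
    pixels = megapixels * 1e6
    frame_mb = pixels * BYTES_PER_PIXEL / 1e6
    tops = pixels * OPS_PER_PIXEL * FPS / 1e12
    return frame_mb, tops

for mp in (2, 12, 48, 108):  # e.g., older phone sensors up to a 108 MP sensor
    mem, tops = estimate(mp)
    print(f"{mp:>4} MP: ~{mem:6.0f} MB/frame, ~{tops:5.1f} TOPS at {FPS} fps")
```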

Current solutions compress AI algorithms, compress images, and focus on regions of interest. This makes hardware optimizations extremely complex, especially with SoCs that have limited memory, limited processing, and small power budgets.

Many customers evaluate their AI solutions by comparing existing SoCs using several different methods. Tera operations per second (TOPS) is a leading performance indicator, but additional performance and power metrics, such as the types and quality of operations a chip can handle, give a clearer picture of chip capabilities. Inferences per second is also a leading indicator, but it requires context such as operating frequency and other metrics. Thus, additional benchmarks have been developed to evaluate AI hardware.
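As a simple illustration of why a raw TOPS rating needs this context, the sketch below (with hypothetical utilization and per-model operation counts) converts peak TOPS into effective inferences per second: the same accelerator delivers very different throughput depending on the workload.

```python
# Why TOPS alone is not enough: effective inferences/second depends on the
# model's cost per inference and on achieved utilization of the peak rate.
# All numbers below are illustrative assumptions.

def inferences_per_second(peak_tops: float, utilization: float,
                          gops_per_inference: float) -> float:
    """Effective inferences/second from peak TOPS, achieved utilization,
    and the model's cost in giga-operations per inference."""
    effective_ops = peak_tops * 1e12 * utilization
    return effective_ops / (gops_per_inference * 1e9)

# Same hypothetical 10-TOPS accelerator at 30% utilization, two workloads:
print(inferences_per_second(10, 0.30, 2.4))    # small detector: ~1250 inf/s
print(inferences_per_second(10, 0.30, 250.0))  # large model: ~12 inf/s
```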

There are standardized benchmarks like those from MLPerf/ML Commons and ai-benchmark.com. ML Commons provides measurement rules related to accuracy, speed, and efficiency, which is very important for understanding how well hardware can handle different AI algorithms. As mentioned before, without understanding accuracy goals, compression techniques can be used to fit AI into very small footprints, but there is a tradeoff between accuracy and compression methods. ML Commons also provides common datasets and best practices.

The Computer Vision Lab in Zurich, Switzerland, also provides benchmarks for mobile processors and publishes its results, hardware requirements, and other information for reuse. Its suite includes 78 tests covering over 180 aspects of performance.

An interesting benchmark from Stanford, DAWNBench, has since lent its support to ML Commons’ efforts, but its tests looked not just at an AI performance score but also at the total time it took processors to both train and run inference on AI algorithms. This addresses a key hardware design goal: reducing total cost of ownership (TCO). AI processing time determines whether leasing cloud-based AI or owning edge-computing hardware is more viable for an organization’s overall AI hardware strategy.
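The sketch below illustrates the kind of TCO comparison that such processing-time measurements feed into, contrasting leased cloud compute with owned edge hardware. Every price, volume, and time in it is a hypothetical assumption used only to show how inference time can tip the balance.

```python
# Illustrative cloud-vs-edge total-cost-of-ownership comparison.
# All prices, volumes, and processing times are hypothetical assumptions.

def cloud_cost(inferences: float, seconds_per_inference: float,
               dollars_per_instance_hour: float) -> float:
    """Cost of renting cloud instances for a given inference volume."""
    hours = inferences * seconds_per_inference / 3600
    return hours * dollars_per_instance_hour

def edge_cost(hardware_dollars: float, yearly_power_dollars: float,
              years: float) -> float:
    """Cost of buying and powering edge hardware over its lifetime."""
    return hardware_dollars + yearly_power_dollars * years

yearly_inferences = 500e6
edge = edge_cost(hardware_dollars=40_000, yearly_power_dollars=1_500, years=3)

for sec_per_inf in (0.005, 0.05):  # hypothetical fast vs. slow processing time
    cloud = cloud_cost(yearly_inferences, sec_per_inf, 3.0) * 3  # 3-year total
    print(f"{sec_per_inf * 1000:.0f} ms/inference: "
          f"cloud ~${cloud:,.0f} vs edge ~${edge:,.0f} over 3 years")
```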

Another popular approach is to benchmark with common open-source graphs and models such as ResNet-50. There are three problems with some of these models. First, the dataset images used with ResNet-50 are 256×256, which is not necessarily the resolution used in the final application. Second, the model is older and has fewer layers than most newer models. Third, the model may have been hand-optimized by the processor IP vendor and therefore not represent how the system will perform with other models. Still, many open-source models in use beyond ResNet-50 are more representative of the latest advances in the field and provide good performance indicators.
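For context, benchmarking such an open-source model can be as simple as a timing loop. The sketch below uses torchvision’s ResNet-50 purely as an example harness; the input size, batch size, and iteration counts are arbitrary assumptions, and a meaningful benchmark would also report accuracy and power, as discussed above.

```python
# Minimal throughput measurement for an open-source model (ResNet-50).
import time
import torch
from torchvision.models import resnet50

model = resnet50(weights=None).eval()   # random weights suffice for timing
batch = torch.randn(1, 3, 224, 224)     # standard pretrained input size

with torch.no_grad():
    for _ in range(5):                  # warm-up iterations
        model(batch)
    runs = 20
    start = time.perf_counter()
    for _ in range(runs):
        model(batch)
    elapsed = time.perf_counter() - start

print(f"~{runs / elapsed:.1f} inferences/s on this host")
```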

Finally, custom graphs and models for specific applications are becoming more common. This is ideally the best-case scenario for benchmarking AI hardware, ensuring that optimizations can effectively reduce power and improve performance.

SoC developers have very different goals: some SoCs aim to provide a platform for high-performance AI, some target lower performance, some serve a wide variety of functions, and some address very specific applications. For SoC teams that are unsure which AI models they will need to optimize for, a healthy mix of custom and freely available models provides a good indication of performance and power, and this blend is the most commonly used in today’s market. However, the new benchmarking standards described above appear to be gaining relevance for comparing SoCs once they reach the market.

Pre-silicon assessments

Due to the complexity of optimizations at the edge, AI solution developers today must co-design software and hardware. To do this, they must use the right benchmarking techniques, such as those described above, and they must have tools that allow designers to accurately explore different optimizations of the system, SoC, or semiconductor IP by studying the process node, memories, processors, interfaces, and more.

Synopsys provides efficient tools to simulate, prototype, and benchmark IP, SoCs, and, in some cases, larger systems.

The Synopsys HAPS prototyping solution is commonly used to demonstrate the capabilities and trade-offs of different processor configurations. In particular, Synopsys has shown where bandwidth in the wider AI system, beyond the processor, starts to become a bottleneck, and when additional bandwidth from the sensor input (via MIPI) or from memory access (via LPDDR) may not be optimal for a processing task.

For power simulations, vendor estimates can vary widely, and emulation has proven to be better than simulation and/or static analysis of AI workloads. This is where the Synopsys ZeBu emulation system can play an important role.

Finally, system-level views of the SoC design can be explored with Platform Architect. Initially used for memory and processing performance and power exploration, Platform Architect has recently been increasingly used to understand system-level performance and power in relation to AI. Sensitivity analysis can be performed to identify optimal design parameters using Synopsys IP with predefined models of LPDDRs, ARC processors for AI, memories, and more.

Summary

AI algorithms bring constant changes to hardware, and as these techniques move from the cloud to the edge, the engineering challenges of optimization become more complex. To ensure competitive success, pre-silicon evaluation is becoming increasingly important. Co-designing hardware and software has become a reality, and the right tools and expertise are essential.

Synopsys has a proven portfolio of IP used in many AI SoC designs and an experienced team that develops AI processing solutions, from ASIP Designer to ARC processors. A portfolio of proven Foundation IP, including memory compilers, has been widely adopted for AI SoCs. Interface IP for AI applications ranges from sensor inputs via I3C and MIPI, to chip-to-chip connections via CXL, PCIe, and die-to-die solutions, to networking capabilities via Ethernet.

Finally, Synopsys tools provide a method to utilize expertise, services, and proven intellectual property in an environment best suited to optimize your AI hardware in this ever-changing landscape.

Sharon D. Cole