Pritiks23/NVIDIA-TensorRT-Model-Comparisons

NVIDIA TensorRT Model Comparison Project: Technical Deep Dive

This notebook rigorously demonstrates the end-to-end workflow for optimizing deep learning models for inference acceleration using NVIDIA TensorRT. The process encompasses model definition, conversion to the Open Neural Network Exchange (ONNX) format, and the construction of a highly optimized TensorRT engine, followed by a performance and correctness comparison against the original PyTorch model.

Core Concepts & Implementation:

  1. Environment Setup: Establishing a GPU-accelerated environment, validating CUDA availability, and installing the required libraries: torch, onnx, polygraphy (for streamlined TensorRT engine building and inference), and the TensorRT Python API (historically distributed as nvidia-tensorrt, now as the tensorrt pip package).
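The validation step above can be sketched as a quick dependency and CUDA check. The exact checks the notebook runs are not shown in the text, so this is an illustrative version:

```python
import importlib.util

# Sketch of the environment validation step. The package names come from the
# text; note the TensorRT Python API may be installed as `tensorrt` rather
# than the older `nvidia-tensorrt` metapackage.
required = ("torch", "onnx", "polygraphy", "tensorrt")
status = {pkg: importlib.util.find_spec(pkg) is not None for pkg in required}
for pkg, found in status.items():
    print(f"{pkg}: {'installed' if found else 'missing'}")

# Only query CUDA if torch is actually importable on this machine.
if status["torch"]:
    import torch
    print(f"CUDA available: {torch.cuda.is_available()}")
```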

  2. Model Definition: A simple, representative PyTorch nn.Module (SimpleModel) is defined, encapsulating basic linear layers and ReLU activations. This model serves as the target for optimization, showcasing the generalizability of the TensorRT pipeline.
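A minimal stand-in for SimpleModel might look like the following; the exact layer sizes are assumptions, but the shape (linear layers with ReLU activations) matches the description:

```python
import torch
import torch.nn as nn

# Illustrative version of the notebook's SimpleModel: two linear layers
# joined by a ReLU, operating on flat feature vectors.
class SimpleModel(nn.Module):
    def __init__(self, in_features: int = 32, hidden: int = 64, out_features: int = 10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_features),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = SimpleModel().eval()
out = model(torch.randn(4, 32))
print(out.shape)  # torch.Size([4, 10])
```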

  3. ONNX Export: The PyTorch model is meticulously exported to the ONNX intermediate representation. This critical step involves:

    • Setting the model to evaluation mode (model.eval()).
    • Tracing the model's computation graph with a dummy_input to capture operational dependencies.
    • Specifying an opset_version (e.g., 18) for ONNX compatibility.
    • Enabling do_constant_folding for graph simplification.
    • Defining input_names and output_names for explicit graph nodes.
    • Leveraging dynamic_shapes to allow for variable batch sizes during inference, enhancing flexibility.
  4. TensorRT Optimization: The ONNX model is then ingested by the TensorRT builder to construct an optimized inference engine. This process involves:

    • Graph Parsing: The ONNX graph is parsed into TensorRT's internal representation using trt.OnnxParser.
    • Builder Configuration: A trt.IBuilderConfig (obtained via builder.create_builder_config()) is used to define optimization parameters, including:
      • MemoryPoolType.WORKSPACE: Allocating a workspace for intermediate computations during engine building.
      • FP16 Flag: Enabling mixed-precision (FP16) inference where hardware supports it (builder.platform_has_fast_fp16), significantly boosting performance while maintaining acceptable accuracy.
    • Optimization Profile: An OptimizationProfile is created to define the range of dynamic input shapes (minimum, optimal, maximum) the engine should support. This is crucial for handling dynamic batching efficiently.
    • Engine Serialization: The optimized graph is compiled into a highly efficient, platform-specific TensorRT engine, which is then serialized to a .trt file for later deployment.
  5. Inference and Comparison: A direct comparison is performed between the original PyTorch model and the TensorRT engine:

    • PyTorch Inference: The baseline inference latency is measured using the PyTorch model on the GPU.
    • TensorRT Inference: The optimized TensorRT engine is loaded via polygraphy.backend.trt.TrtRunner to execute inference. TrtRunner efficiently manages input/output buffer allocations and device transfers.
    • Numerical Validation: The outputs from both models are compared using np.allclose with a specified atol (absolute tolerance) to ensure numerical fidelity post-optimization, accounting for potential precision differences (especially with FP16).
    • Performance Metrics: Inference times are precisely measured and a speedup ratio is calculated, quantitatively demonstrating the performance gains achieved by TensorRT optimization. This showcases TensorRT's capability to deliver significant inference acceleration, making models production-ready for demanding, low-latency applications.
