|
| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "source": [ |
| 6 | + "Code summarization, or code explanation, is the task of converting code written in a programming language into a natural-language description. This task has several\n", |
| 7 | + "benefits, such as understanding code without reading through its implementation details and documenting code for better maintenance. To do that, one needs to\n", |
| 8 | + "understand how the code is structured and use that knowledge to generate the summary using various AI-based approaches. In this\n", |
| 9 | + "example, we will use a Large Language Model (LLM), specifically Granite 8B, an open-source model built by IBM. We will show how easily a developer can use\n", |
| 10 | + "CLDK to expose various parts of the code by calling its APIs, without implementing time-intensive program analyses from scratch." |
| 11 | + ], |
| 12 | + "metadata": { |
| 13 | + "collapsed": false |
| 14 | + }, |
| 15 | + "id": "6ad70b81e8957fc0" |
| 16 | + }, |
| 17 | + { |
| 18 | + "cell_type": "markdown", |
| 19 | + "source": [ |
| 20 | + "Step 1: Add all the necessary imports" |
| 21 | + ], |
| 22 | + "metadata": { |
| 23 | + "collapsed": false |
| 24 | + }, |
| 25 | + "id": "15555404790e1411" |
| 26 | + }, |
| 27 | + { |
| 28 | + "cell_type": "code", |
| 29 | + "execution_count": null, |
| 30 | + "outputs": [], |
| 31 | + "source": [ |
| 32 | + "import os\n", |
| 33 | + "from pathlib import Path\n", |
| 34 | + "import ollama\n", |
| 35 | + "from cldk import CLDK\n", |
| 36 | + "from cldk.analysis import AnalysisLevel" |
| 37 | + ], |
| 38 | + "metadata": { |
| 39 | + "collapsed": false |
| 40 | + }, |
| 41 | + "id": "8e8e5de7e5c68020" |
| 42 | + }, |
| 43 | + { |
| 44 | + "cell_type": "markdown", |
| 45 | + "source": [ |
| 46 | + "Step 2: Formulate the LLM prompt. The prompt can be tailored to various needs. In this case, we show a simple example of generating a summary for each\n", |
| 47 | + "method in a Java class." |
| 48 | + ], |
| 49 | + "metadata": { |
| 50 | + "collapsed": false |
| 51 | + }, |
| 52 | + "id": "ffc4ee9a6d27acc2" |
| 53 | + }, |
| 54 | + { |
| 55 | + "cell_type": "code", |
| 56 | + "execution_count": null, |
| 57 | + "outputs": [], |
| 58 | + "source": [ |
| 59 | + "def format_inst(code, focal_method, focal_class, language):\n", |
| 60 | + " \"\"\"\n", |
| 61 | + " Format the instruction for the given focal method and class.\n", |
| 62 | + " \"\"\"\n", |
| 63 | + " inst = f\"Question: Can you write a brief summary for the method `{focal_method}` in the class `{focal_class}` below?\\n\"\n", |
| 64 | + "\n", |
| 65 | + " inst += \"\\n\"\n", |
| 66 | + " inst += f\"```{language}\\n\"\n", |
| 67 | + " inst += code\n", |
| 68 | + " inst += \"```\" if code.endswith(\"\\n\") else \"\\n```\"\n", |
| 69 | + " inst += \"\\n\"\n", |
| 70 | + " return inst" |
| 71 | + ], |
| 72 | + "metadata": { |
| 73 | + "collapsed": false |
| 74 | + }, |
| 75 | + "id": "9e23523c71636727" |
| 76 | + }, |
| 77 | + { |
| 78 | + "cell_type": "markdown", |
| 79 | + "source": [], |
| 80 | + "metadata": { |
| 81 | + "collapsed": false |
| 82 | + }, |
| 83 | + "id": "a4e9cb4e4f00b25c" |
| 84 | + }, |
| 85 | + { |
| 86 | + "cell_type": "markdown", |
| 87 | + "source": [ |
| 88 | + "Step 3: Create a function to call the LLM. There are various ways to achieve that. For illustrative purposes, we use ollama, a library for communicating with locally downloaded models." |
| 89 | + ], |
| 90 | + "metadata": { |
| 91 | + "collapsed": false |
| 92 | + }, |
| 93 | + "id": "dd8439be222b5caa" |
| 94 | + }, |
| 95 | + { |
| 96 | + "cell_type": "code", |
| 97 | + "execution_count": null, |
| 98 | + "outputs": [], |
| 99 | + "source": [ |
| 100 | + "def prompt_ollama(message: str, model_id: str = \"granite-code:8b-instruct\") -> str:\n", |
| 101 | + " \"\"\"Prompt local model on Ollama\"\"\"\n", |
| 102 | + " response_object = ollama.generate(model=model_id, prompt=message)\n", |
| 103 | + " return response_object[\"response\"]" |
| 104 | + ], |
| 105 | + "metadata": { |
| 106 | + "collapsed": false |
| 107 | + }, |
| 108 | + "id": "62807e0cbf985ae6" |
| 109 | + }, |
| 110 | + { |
| 111 | + "cell_type": "markdown", |
| 112 | + "source": [ |
| 113 | + "Step 4: Create an object of CLDK and provide the programming language of the source code." |
| 114 | + ], |
| 115 | + "metadata": { |
| 116 | + "collapsed": false |
| 117 | + }, |
| 118 | + "id": "1022e86e38e12767" |
| 119 | + }, |
| 120 | + { |
| 121 | + "cell_type": "code", |
| 122 | + "execution_count": null, |
| 123 | + "outputs": [], |
| 124 | + "source": [ |
| 125 | + "if __name__ == \"__main__\":\n", |
| 126 | + " # Create a new instance of the CLDK class\n", |
| 127 | + " cldk = CLDK(language=\"java\")" |
| 128 | + ], |
| 129 | + "metadata": { |
| 130 | + "collapsed": false |
| 131 | + }, |
| 132 | + "id": "a2c8bbe4e3244f60" |
| 133 | + }, |
| 134 | + { |
| 135 | + "cell_type": "markdown", |
| 136 | + "source": [ |
| 137 | + "Step 5: CLDK uses different analysis engines: Codeanalyzer (built using WALA and Javaparser), Treesitter, and CodeQL (future). Codeanalyzer is\n", |
| 138 | + "the default analysis engine. CLDK also supports different analysis levels: (a) symbol table, (b) call graph, (c) program dependency graph, and\n", |
| 139 | + "(d) system dependency graph. The analysis level can be selected using the ```AnalysisLevel``` enum. In this example, we will generate summaries for all the methods\n", |
| 140 | + "of an application. To select the application location, set the environment variable ```JAVA_APP_PATH```." |
| 141 | + ], |
| 142 | + "metadata": { |
| 143 | + "collapsed": false |
| 144 | + }, |
| 145 | + "id": "23dd4a6e5d5cb0c5" |
| 146 | + }, |
| 147 | + { |
| 148 | + "cell_type": "code", |
| 149 | + "execution_count": null, |
| 150 | + "outputs": [], |
| 151 | + "source": [ |
| 152 | + "# Create an analysis object over the Java application whose location is given by the JAVA_APP_PATH environment variable\n", |
| 153 | + "analysis = cldk.analysis(project_path=os.environ[\"JAVA_APP_PATH\"], analysis_level=AnalysisLevel.symbol_table)" |
| 154 | + ], |
| 155 | + "metadata": { |
| 156 | + "collapsed": false |
| 157 | + }, |
| 158 | + "id": "fdd09f5e77d4a68a" |
| 159 | + }, |
| 160 | + { |
| 161 | + "cell_type": "markdown", |
| 162 | + "source": [ |
| 163 | + "Step 6: Iterate over all the class files and create the prompt. In this case, we want to provide a customized Java class in the prompt. For instance,\n", |
| 164 | + "\n", |
| 165 | + "```\n", |
| 166 | + "package com.ibm.org;\n", |
| 167 | + "import A.B.C.D;\n", |
| 168 | + "...\n", |
| 169 | + "public class Foo {\n", |
| 170 | + " // code comment\n", |
| 171 | + " public void bar(){ \n", |
| 172 | + " int a;\n", |
| 173 | + " a = baz();\n", |
| 174 | + " // do something\n", |
| 175 | + " }\n", |
| 176 | + " private int baz()\n", |
| 177 | + " {\n", |
| 178 | + " // do something\n", |
| 179 | + " }\n", |
| 180 | + " public String dummy (String a)\n", |
| 181 | + " {\n", |
| 182 | + " // do something\n", |
| 183 | + " } \n", |
| 184 | + "```\n", |
| 185 | + "Given the above class, let's say we want to generate a summary for the ```bar``` method. To help the model understand what it does, we include the callee of this method in the prompt, which in this case is ```baz```. We also remove imports, comments, etc. All of this is done using a single call to the ```sanitize_focal_class``` API. In this process, we also use Treesitter to analyze the code. Once the input code has been sanitized, we call the ```format_inst``` method to create the LLM prompt, which is then passed to the ```prompt_ollama``` method to generate the summary using the LLM." |
| 186 | + ], |
| 187 | + "metadata": { |
| 188 | + "collapsed": false |
| 189 | + }, |
| 190 | + "id": "f148325e92781e13" |
| 191 | + }, |
| 192 | + { |
| 193 | + "cell_type": "code", |
| 194 | + "execution_count": null, |
| 195 | + "outputs": [], |
| 196 | + "source": [ |
| 197 | + "\n", |
| 198 | + "# Iterate over all the files in the project\n", |
| 199 | + "for file_path, class_file in analysis.get_symbol_table().items():\n", |
| 200 | + " class_file_path = Path(file_path).absolute().resolve()\n", |
| 201 | + " # Iterate over all the classes in the file\n", |
| 202 | + " for type_name, type_declaration in class_file.type_declarations.items():\n", |
| 203 | + " # Iterate over all the methods in the class\n", |
| 204 | + " for method in type_declaration.callable_declarations.values():\n", |
| 205 | + " # Read the full source of the class file that contains the method\n", |
| 206 | + " code_body = class_file_path.read_text()\n", |
| 207 | + "\n", |
| 208 | + " # Initialize the treesitter utils for the class file content\n", |
| 209 | + " tree_sitter_utils = cldk.tree_sitter_utils(source_code=code_body)\n", |
| 210 | + "\n", |
| 211 | + " # Sanitize the class so it keeps only what is relevant to the focal method\n", |
| 212 | + " sanitized_class = tree_sitter_utils.sanitize_focal_class(method.declaration)\n", |
| 213 | + "\n", |
| 214 | + " # Format the instruction for the given focal method and class\n", |
| 215 | + " instruction = format_inst(\n", |
| 216 | + " code=sanitized_class,\n", |
| 217 | + " focal_method=method.declaration,\n", |
| 218 | + " focal_class=type_name,\n", |
| 219 | + " language=\"java\"\n", |
| 220 | + " )\n", |
| 221 | + "\n", |
| 222 | + " # Prompt the local model on Ollama (Granite 8B, as stated in the introduction)\n", |
| 223 | + " llm_output = prompt_ollama(\n", |
| 224 | + " message=instruction,\n", |
| 225 | + " model_id=\"granite-code:8b-instruct\",\n", |
| 226 | + " )\n", |
| 227 | + "\n", |
| 228 | + " # Print the instruction and LLM output\n", |
| 229 | + " print(f\"Instruction:\\n{instruction}\")\n", |
| 230 | + " print(f\"LLM Output:\\n{llm_output}\")" |
| 231 | + ], |
| 232 | + "metadata": { |
| 233 | + "collapsed": false |
| 234 | + }, |
| 235 | + "id": "462ef7dceae367ad" |
| 236 | + } |
| 237 | + ], |
| 238 | + "metadata": { |
| 239 | + "kernelspec": { |
| 240 | + "display_name": "Python 3", |
| 241 | + "language": "python", |
| 242 | + "name": "python3" |
| 243 | + }, |
| 244 | + "language_info": { |
| 245 | + "codemirror_mode": { |
| 246 | + "name": "ipython", |
| 247 | + "version": 2 |
| 248 | + }, |
| 249 | + "file_extension": ".py", |
| 250 | + "mimetype": "text/x-python", |
| 251 | + "name": "python", |
| 252 | + "nbconvert_exporter": "python", |
| 253 | + "pygments_lexer": "ipython2", |
| 254 | + "version": "2.7.6" |
| 255 | + } |
| 256 | + }, |
| 257 | + "nbformat": 4, |
| 258 | + "nbformat_minor": 5 |
| 259 | +} |
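
The prompt built in Step 2 can be checked without CLDK or Ollama installed. The following is a minimal sketch that re-declares `format_inst` as it appears in the notebook and feeds it a hand-written Java snippet. The `sample_code` string and the `FENCE` constant are assumptions made for illustration; in the notebook the code argument comes from `sanitize_focal_class`.

```python
# Three backticks, built programmatically so the example embeds cleanly.
FENCE = "`" * 3


def format_inst(code, focal_method, focal_class, language):
    """Format the instruction for the given focal method and class."""
    inst = (
        f"Question: Can you write a brief summary for the method "
        f"`{focal_method}` in the class `{focal_class}` below?\n"
    )
    inst += "\n"
    inst += f"{FENCE}{language}\n"
    inst += code
    # Put the closing fence on its own line whether or not the code
    # already ends with a newline.
    inst += FENCE if code.endswith("\n") else "\n" + FENCE
    inst += "\n"
    return inst


# Hypothetical stand-in for the output of sanitize_focal_class.
sample_code = (
    "public class Foo {\n"
    "    public void bar() {\n"
    "        int a = baz();\n"
    "    }\n"
    "    private int baz() {\n"
    "        return 1;\n"
    "    }\n"
    "}"
)

prompt = format_inst(sample_code, "bar()", "Foo", "java")
print(prompt)
```

Printing the prompt before sending it to the model makes it easy to confirm that the focal method, class name, and fenced code appear as intended, and that code with or without a trailing newline produces the same prompt.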