reverse-engineering-assistant 1.0.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (21)
  1. reverse-engineering-assistant-1.0.0/PKG-INFO +152 -0
  2. reverse-engineering-assistant-1.0.0/README.md +133 -0
  3. reverse-engineering-assistant-1.0.0/pyproject.toml +34 -0
  4. reverse-engineering-assistant-1.0.0/reverse_engineering_assistant/__init__.py +0 -0
  5. reverse-engineering-assistant-1.0.0/reverse_engineering_assistant/api_server_tools/__init__.py +0 -0
  6. reverse-engineering-assistant-1.0.0/reverse_engineering_assistant/api_server_tools/function_tools.py +122 -0
  7. reverse-engineering-assistant-1.0.0/reverse_engineering_assistant/assistant.py +437 -0
  8. reverse-engineering-assistant-1.0.0/reverse_engineering_assistant/assistant_api_server.py +341 -0
  9. reverse-engineering-assistant-1.0.0/reverse_engineering_assistant/configuration.py +151 -0
  10. reverse-engineering-assistant-1.0.0/reverse_engineering_assistant/documents.py +249 -0
  11. reverse-engineering-assistant-1.0.0/reverse_engineering_assistant/model.py +142 -0
  12. reverse-engineering-assistant-1.0.0/reverse_engineering_assistant/reva_exceptions.py +8 -0
  13. reverse-engineering-assistant-1.0.0/reverse_engineering_assistant/tool.py +127 -0
  14. reverse-engineering-assistant-1.0.0/reverse_engineering_assistant/tool_protocol.py +365 -0
  15. reverse-engineering-assistant-1.0.0/reverse_engineering_assistant.egg-info/PKG-INFO +152 -0
  16. reverse-engineering-assistant-1.0.0/reverse_engineering_assistant.egg-info/SOURCES.txt +19 -0
  17. reverse-engineering-assistant-1.0.0/reverse_engineering_assistant.egg-info/dependency_links.txt +1 -0
  18. reverse-engineering-assistant-1.0.0/reverse_engineering_assistant.egg-info/entry_points.txt +4 -0
  19. reverse-engineering-assistant-1.0.0/reverse_engineering_assistant.egg-info/requires.txt +8 -0
  20. reverse-engineering-assistant-1.0.0/reverse_engineering_assistant.egg-info/top_level.txt +1 -0
  21. reverse-engineering-assistant-1.0.0/setup.cfg +4 -0
@@ -0,0 +1,152 @@
+ Metadata-Version: 2.1
+ Name: reverse-engineering-assistant
+ Version: 1.0.0
+ Summary: An AI assistant for reverse engineering tasks
+ Author: サイバーカイダ (cyberkaida)
+ Classifier: License :: OSI Approved :: Apache Software License
+ Classifier: Natural Language :: English
+ Classifier: Operating System :: OS Independent
+ Classifier: Programming Language :: Python
+ Description-Content-Type: text/markdown
+ Requires-Dist: langchain
+ Requires-Dist: llama-cpp-python
+ Requires-Dist: prompt_toolkit
+ Requires-Dist: sentence_transformers
+ Requires-Dist: PyYAML
+ Requires-Dist: pydantic
+ Requires-Dist: rich
+ Requires-Dist: Flask
+
+ # ReVA - Reverse Engineering Assistant
+
+ [✨ A quick demo! ✨](https://asciinema.org/a/626197)
+
+ The reverse engineering assistant (ReVA) is a project to build a disassembler-agnostic AI assistant for
+ reverse engineering tasks. This includes both _offline_ and online inference and a simple architecture.
+
+ ReVA is different from other efforts at building AI assistants for RE tasks because it uses a technique
+ called [embedding](https://openai.com/blog/introducing-text-and-code-embeddings)
+ to give the AI assistant a sort of "long term memory". The model is also given access to a number of tools
+ that are tweaked to perform well with queries provided by the LLM. This allows the model to reason about the whole
+ program, rather than just a single function. The tools are tweaked to lead the AI to examine the program more deeply.
+
+ Using this technique you can ask general questions and get relevant answers. The model prioritises
+ information from the embeddings and tools, but when there is no information it can still respond to generic
+ questions from its training.
+
+ You can ask questions like:
+ - Does this program use encryption?
+ - Draw a class diagram using plantuml syntax.
+ - Rename all the variables in main with descriptive names.
+ - Explain the purpose of the `__mod_init` segment.
+ - What does `mmap` return?
+ - What does the function at address 0x80000 do?
+
+ ## Large Language Model Support
+
+ ReVA is based on [llama-index](https://github.com/jerryjliu/llama_index),
+ which supports a number of models.
+
+ Built-in support is provided for:
+ - [OpenAI](https://platform.openai.com/overview) for online inference and easy setup (needs an OpenAI API key)
+ - [Ollama](https://ollama.ai) and any model it supports for local on-device inference or connecting to a self-hosted remote inference server.
+
+ Limited support is provided for:
+ - [llama-cpp](https://llama-cpp-python.readthedocs.io/en/latest/) and any model it supports for local on-device inference
+ - [text-generation-webui](https://github.com/oobabooga/text-generation-webui) and any model it supports for self-hosted remote inference
+
+ Adding an additional inference server is easy if it is supported by llama-index or langchain (on which llama-index is based).
+
+ See the configuration section for more information about setting the model.
+
+ ## Configuration
+
+ Configuration for the reverse engineering assistant is stored at
+ `~/.config/reverse-engineering-assistant/config.yaml`. If this
+ is not present on first start, a default configuration using
+ OpenAI for inference and the `OPENAI_API_TOKEN` environment
+ variable will be used.
+
+ The most important setting is the `type` top-level setting.
+ This controls what inference service you use. The values are the
+ same as the configuration keys; for example, to use Ollama,
+ set `type` to `ollama` and configure the settings in the `ollama:`
+ section.
+
+ The configuration also contains the prompts used for the models.
+ If you use Ollama or OpenAI these will be processed to fit the
+ model-specific prompt pattern (placing the system prompt in the
+ correct tags, etc).
+
+ For `llama-cpp` and `text-generation-webui` these may need to be
+ configured for your specific model. For this reason Ollama is
+ preferred for self-hosting.
+
+ ## Workflow
+
+ ReVA has a two-step workflow.
+ 1. Generate knowledge base
+ 2. Perform inference
+
+ To generate the knowledge base, use the plugin for your disassembler and run the Assistant script.
+ See [Ghidra Support](#ghidra-support) and [BinaryNinja Support](#binary-ninja-support) below.
+
+ First your disassembler extracts the information required for the knowledge base and embeddings.
+ This involves extracting each function, its decompilation and some metadata. These are written to a "project". This allows
+ multiple programs and data sources to be combined into one set of knowledge for the assistant. For example, multiple malware
+ samples, or a program and its libraries, could be included along with previous RE notes.
+
+ Projects are stored in `~/.cache/reverse-engineering-assistant/projects`. If you make significant changes to your
+ annotations or analysis in your disassembler, you should delete and regenerate your project directory. This cache
+ is a _snapshot_ of the state of your disassembler.
+
+ To ask questions and run inference, a command line tool is provided. Run `revassistant --project ${NAME_OF_YOUR_PROJECT}` to begin the chat session.
+
+ `revassistant` will hash the knowledge base and generate and combine the embeddings into a searchable
+ index. Once this is complete the index is saved to disk and the chat session begins.
+
+ ## Installation
+
+ To install the particular extension for your disassembler see:
+ - [Ghidra Support](#ghidra-support)
+ - [Binary Ninja Support](#binary-ninja-support)
+
+ To install the chat component you can do the following:
+
+ ```sh
+ python3 -m pip install ./reverse-engineering-assistant
+ ```
+
+ The chat can be started with:
+
+ ```sh
+ revassistant --project ${NAME_OF_YOUR_PROJECT}
+ ```
+
+
+ # Ghidra Support
+
+ ## Prerequisites
+ - [Ghidrathon](https://github.com/mandiant/Ghidrathon) >= 2.2.0 installed into Ghidra
+
+ ## Usage
+
+ After installation, enable the [Ghidrathon extension](https://github.com/mandiant/Ghidrathon#installing-ghidrathon)
+ and the Ghidra Assistant Extension.
+
+ You can generate the knowledge base by running the Ghidra Assistant analysis from the Analysis menu in the Code Browser.
+
+ # Binary Ninja Support
+
+ Install the ReVA BinaryNinja plugin by opening your BinaryNinja plugin directory (Plugins -> Open Plugin Folder)
+ and copying or symbolic linking the [binary-ninja-assistant](./binary-ninja-assistant) directory into the plugin
+ directory.
+
+ Restart Binary Ninja and "ReVA Push" will be available in the Plugin menu.
+ Press this to push data from BinaryNinja to ReVA, then follow the instructions in the [Workflow section](#workflow).
+ The project name will be the name of the current open file.
+
+ # Support
+
+ Do you like my work? Want to support this project and others? Interested in how this project was designed and built?
+ This project and many others are built live on my stream at https://twitch.tv/cyberkaida !
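
The Workflow text in the hunk above says `revassistant` hashes the knowledge base before building the index. The package's own hashing code is not visible in this part of the diff, so the following is only a minimal sketch of one way a project directory could be fingerprinted to decide whether a saved index is stale. The function name `hash_knowledge_base` and the choice of SHA-256 are assumptions for illustration, not ReVA's implementation.

```python
import hashlib
from pathlib import Path


def hash_knowledge_base(project_dir: Path) -> str:
    """Return a hex digest covering every file under project_dir.

    Hypothetical helper: files are visited in sorted order so the
    digest does not depend on directory listing order.
    """
    digest = hashlib.sha256()
    files = sorted(p for p in project_dir.rglob("*") if p.is_file())
    for path in files:
        data = path.read_bytes()
        # Mix in the relative path and the length so that renames and
        # content moved between files change the digest.
        digest.update(path.relative_to(project_dir).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()
```

If the digest matches the one stored next to a saved index, the index could be reused; otherwise the embeddings would be regenerated.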
@@ -0,0 +1,133 @@
+ # ReVA - Reverse Engineering Assistant
+
+ [✨ A quick demo! ✨](https://asciinema.org/a/626197)
+
+ The reverse engineering assistant (ReVA) is a project to build a disassembler-agnostic AI assistant for
+ reverse engineering tasks. This includes both _offline_ and online inference and a simple architecture.
+
+ ReVA is different from other efforts at building AI assistants for RE tasks because it uses a technique
+ called [embedding](https://openai.com/blog/introducing-text-and-code-embeddings)
+ to give the AI assistant a sort of "long term memory". The model is also given access to a number of tools
+ that are tweaked to perform well with queries provided by the LLM. This allows the model to reason about the whole
+ program, rather than just a single function. The tools are tweaked to lead the AI to examine the program more deeply.
+
+ Using this technique you can ask general questions and get relevant answers. The model prioritises
+ information from the embeddings and tools, but when there is no information it can still respond to generic
+ questions from its training.
+
+ You can ask questions like:
+ - Does this program use encryption?
+ - Draw a class diagram using plantuml syntax.
+ - Rename all the variables in main with descriptive names.
+ - Explain the purpose of the `__mod_init` segment.
+ - What does `mmap` return?
+ - What does the function at address 0x80000 do?
+
+ ## Large Language Model Support
+
+ ReVA is based on [llama-index](https://github.com/jerryjliu/llama_index),
+ which supports a number of models.
+
+ Built-in support is provided for:
+ - [OpenAI](https://platform.openai.com/overview) for online inference and easy setup (needs an OpenAI API key)
+ - [Ollama](https://ollama.ai) and any model it supports for local on-device inference or connecting to a self-hosted remote inference server.
+
+ Limited support is provided for:
+ - [llama-cpp](https://llama-cpp-python.readthedocs.io/en/latest/) and any model it supports for local on-device inference
+ - [text-generation-webui](https://github.com/oobabooga/text-generation-webui) and any model it supports for self-hosted remote inference
+
+ Adding an additional inference server is easy if it is supported by llama-index or langchain (on which llama-index is based).
+
+ See the configuration section for more information about setting the model.
+
+ ## Configuration
+
+ Configuration for the reverse engineering assistant is stored at
+ `~/.config/reverse-engineering-assistant/config.yaml`. If this
+ is not present on first start, a default configuration using
+ OpenAI for inference and the `OPENAI_API_TOKEN` environment
+ variable will be used.
+
+ The most important setting is the `type` top-level setting.
+ This controls what inference service you use. The values are the
+ same as the configuration keys; for example, to use Ollama,
+ set `type` to `ollama` and configure the settings in the `ollama:`
+ section.
+
+ The configuration also contains the prompts used for the models.
+ If you use Ollama or OpenAI these will be processed to fit the
+ model-specific prompt pattern (placing the system prompt in the
+ correct tags, etc).
+
+ For `llama-cpp` and `text-generation-webui` these may need to be
+ configured for your specific model. For this reason Ollama is
+ preferred for self-hosting.
+
+ ## Workflow
+
+ ReVA has a two-step workflow.
+ 1. Generate knowledge base
+ 2. Perform inference
+
+ To generate the knowledge base, use the plugin for your disassembler and run the Assistant script.
+ See [Ghidra Support](#ghidra-support) and [BinaryNinja Support](#binary-ninja-support) below.
+
+ First your disassembler extracts the information required for the knowledge base and embeddings.
+ This involves extracting each function, its decompilation and some metadata. These are written to a "project". This allows
+ multiple programs and data sources to be combined into one set of knowledge for the assistant. For example, multiple malware
+ samples, or a program and its libraries, could be included along with previous RE notes.
+
+ Projects are stored in `~/.cache/reverse-engineering-assistant/projects`. If you make significant changes to your
+ annotations or analysis in your disassembler, you should delete and regenerate your project directory. This cache
+ is a _snapshot_ of the state of your disassembler.
+
+ To ask questions and run inference, a command line tool is provided. Run `revassistant --project ${NAME_OF_YOUR_PROJECT}` to begin the chat session.
+
+ `revassistant` will hash the knowledge base and generate and combine the embeddings into a searchable
+ index. Once this is complete the index is saved to disk and the chat session begins.
+
+ ## Installation
+
+ To install the particular extension for your disassembler see:
+ - [Ghidra Support](#ghidra-support)
+ - [Binary Ninja Support](#binary-ninja-support)
+
+ To install the chat component you can do the following:
+
+ ```sh
+ python3 -m pip install ./reverse-engineering-assistant
+ ```
+
+ The chat can be started with:
+
+ ```sh
+ revassistant --project ${NAME_OF_YOUR_PROJECT}
+ ```
+
+
+ # Ghidra Support
+
+ ## Prerequisites
+ - [Ghidrathon](https://github.com/mandiant/Ghidrathon) >= 2.2.0 installed into Ghidra
+
+ ## Usage
+
+ After installation, enable the [Ghidrathon extension](https://github.com/mandiant/Ghidrathon#installing-ghidrathon)
+ and the Ghidra Assistant Extension.
+
+ You can generate the knowledge base by running the Ghidra Assistant analysis from the Analysis menu in the Code Browser.
+
+ # Binary Ninja Support
+
+ Install the ReVA BinaryNinja plugin by opening your BinaryNinja plugin directory (Plugins -> Open Plugin Folder)
+ and copying or symbolic linking the [binary-ninja-assistant](./binary-ninja-assistant) directory into the plugin
+ directory.
+
+ Restart Binary Ninja and "ReVA Push" will be available in the Plugin menu.
+ Press this to push data from BinaryNinja to ReVA, then follow the instructions in the [Workflow section](#workflow).
+ The project name will be the name of the current open file.
+
+ # Support
+
+ Do you like my work? Want to support this project and others? Interested in how this project was designed and built?
+ This project and many others are built live on my stream at https://twitch.tv/cyberkaida !
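
The Configuration section of the README above describes a top-level `type` setting whose value names the section holding that backend's settings. A minimal sketch of that lookup follows, assuming the YAML file has already been parsed into a plain dictionary. The function name `select_backend_settings` and the `model` sub-key are hypothetical; only `type`, `ollama` and `openai` as backend names come from the README text.

```python
from typing import Any, Dict


def select_backend_settings(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return the settings section named by the top-level `type` key.

    Hypothetical helper illustrating the README's description; the
    package's own configuration loader may behave differently.
    """
    backend = config["type"]
    if backend not in config:
        raise KeyError(f"`type` is {backend!r} but there is no {backend!r} section")
    return config[backend]


# Example configuration; the `model` keys and values are made up.
example_config = {
    "type": "ollama",
    "ollama": {"model": "example-model"},
    "openai": {"model": "another-example-model"},
}
```

With `example_config`, the call `select_backend_settings(example_config)` returns the `ollama` section, and switching `type` to `openai` would select the other section without touching anything else.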
@@ -0,0 +1,34 @@
+ [build-system]
+ requires = ["setuptools", "setuptools-scm"]
+ build-backend = "setuptools.build_meta"
+
+ [project]
+ name = "reverse-engineering-assistant"
+ readme = "README.md"
+ version = "1.0.0"
+ authors = [
+     {name="サイバーカイダ (cyberkaida)"},
+ ]
+ description = "An AI assistant for reverse engineering tasks"
+ classifiers = [
+     "License :: OSI Approved :: Apache Software License",
+     "Natural Language :: English",
+     "Operating System :: OS Independent",
+     "Programming Language :: Python",
+ ]
+ dependencies = [
+     "langchain",
+     "llama-cpp-python",
+     "prompt_toolkit",
+     "sentence_transformers",
+     "PyYAML",
+     "pydantic",
+     "rich",
+     "Flask",
+ ]
+
+ [project.scripts]
+ revassistant = "reverse_engineering_assistant.assistant:main"
+ reva-serve = "reverse_engineering_assistant.assistant_api_server:main"
+ reva-server = "reverse_engineering_assistant.assistant_api_server:main"
+
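
The `[project.scripts]` table above maps each command name to a `module:function` reference; note that `reva-serve` and `reva-server` point at the same function. As a rough illustration of how such a reference is resolved when a console script runs, here is a small sketch. The helper name `resolve_entry_point` is hypothetical, and the real wrapper scripts are generated by the installer rather than by code like this.

```python
import importlib
from typing import Any


def resolve_entry_point(spec: str) -> Any:
    """Import the module named before ':' and return the object after it."""
    module_name, _, attr_path = spec.partition(":")
    obj = importlib.import_module(module_name)
    # The part after ':' may be a dotted path, e.g. "Class.method".
    for part in filter(None, attr_path.split(".")):
        obj = getattr(obj, part)
    return obj


# Demonstrated with a standard-library target so the sketch runs
# without the package installed.
dumps = resolve_entry_point("json:dumps")
```

Calling `resolve_entry_point("reverse_engineering_assistant.assistant:main")` in an environment where the package is installed would return the function behind the `revassistant` command.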
@@ -0,0 +1,122 @@
+
+ from pathlib import Path
+ from typing import Dict, List, Optional
+ from ..assistant import AssistantProject, RevaTool, BaseLLM, register_tool
+ from ..tool_protocol import RevaGetDecompilation, RevaGetDecompilationResponse, RevaGetFunctionCount, RevaGetFunctionCountResponse, RevaGetDefinedFunctionList, RevaGetDefinedFunctionListResponse
+
+ from ..reva_exceptions import RevaToolException
+
+ import logging
+
+
+ @register_tool
+ class RevaDecompilationIndex(RevaTool):
+     """
+     An index of decompiled functions available to the
+     reverse engineering assistant.
+     """
+     index_name = "decompilation"
+     description = "Used for retrieving decompiled functions"
+     logger = logging.getLogger("reverse_engineering_assistant.RevaDecompilationIndex")
+
+     def __init__(self, project: AssistantProject, llm: BaseLLM) -> None:
+         super().__init__(project, llm)
+         self.description = "Used for retrieving decompiled functions"
+         self.tool_functions = [
+             self.get_decompilation_for_function,
+             self.get_defined_function_list_paginated,
+             self.get_defined_function_count,
+         ]
+
+     def get_decompilation_for_function(self, function_name_or_address: str | int) -> Dict[str, str]:
+         """
+         Return the decompilation for the given function. The function can be specified by name or address.
+         Hint: It is too slow to decompile _all_ functions, so use get_defined_function_list_paginated to get a list of functions
+         and be sure to specify the function name or address exactly.
+         """
+         from ..assistant_api_server import RevaCallbackHandler, to_send_to_tool
+
+
+         # First normalise the argument: ints pass through, strings are parsed as hex
+         address: Optional[int] = None
+         name: Optional[str] = None
+         try:
+             address = function_name_or_address if isinstance(function_name_or_address, int) else int(function_name_or_address, 16)
+             if address <= 0:
+                 raise RevaToolException("Address must be > 0 and in hex format")
+         except ValueError:
+             name = function_name_or_address
+
+         if address is None and name is None:
+             raise RevaToolException("function_name_or_address must be an address or function name")
+
+         # Now we can ask the tool
+         get_decompilation_message = RevaGetDecompilation(address=address, function=name)
+         callback_handler = RevaCallbackHandler(self.project, get_decompilation_message)
+         to_send_to_tool.put(callback_handler)
+         self.logger.debug(f"Waiting for response to {get_decompilation_message.json()}")
+         response: RevaGetDecompilationResponse = callback_handler.wait()
+
+         if response.error_message:
+             raise RevaToolException(response.error_message, send_to_llm=True)
+
+         if not isinstance(response, RevaGetDecompilationResponse):
+             raise ValueError(f"Expected a RevaGetDecompilationResponse, got {response}")
+
+         # Finally we can return the response
+         return {
+             "function": response.function,
+             "function_signature": response.function_signature,
+             "address": hex(response.address),
+             "decompilation": response.decompilation,
+             "variables": response.variables,
+         }
+
+
+     def get_defined_function_list_paginated(self, page: int, page_size: int = 20) -> List[str]:
+         """
+         Return a paginated list of functions in the index. Use get_defined_function_count to get the total number of functions.
+         page is 1 indexed. To get the first page, set page to 1. Do not set page to 0.
+         """
+         from ..assistant_api_server import RevaCallbackHandler, to_send_to_tool
+
+         if isinstance(page, str):
+             page = int(page)
+         if isinstance(page_size, str):
+             page_size = int(page_size)
+         if page < 1:
+             raise ValueError("`page` is 1 indexed, page must be >= 1")
+
+         get_function_list_message = RevaGetDefinedFunctionList(page=page, page_size=page_size)
+         callback_handler = RevaCallbackHandler(self.project, get_function_list_message)
+         to_send_to_tool.put(callback_handler)
+
+         self.logger.debug(f"Waiting for response to {get_function_list_message.json()}")
+         response = callback_handler.wait()
+         if response.error_message:
+             raise RevaToolException(response.error_message, send_to_llm=True)
+
+         if not isinstance(response, RevaGetDefinedFunctionListResponse):
+             raise RevaToolException(f"Expected a RevaGetDefinedFunctionListResponse, got {response}")
+
+         return response.function_list
+
+     def get_defined_function_count(self) -> int:
+         """
+         Return the total number of defined functions in the program.
+         """
+         from ..assistant_api_server import RevaCallbackHandler, to_send_to_tool
+
+         get_function_count_message = RevaGetFunctionCount()
+         callback_handler = RevaCallbackHandler(self.project, get_function_count_message)
+         to_send_to_tool.put(callback_handler)
+         self.logger.debug(f"Waiting for response to {get_function_count_message.json()}")
+         response = callback_handler.wait()
+
+         if response.error_message:
+             raise RevaToolException(response.error_message, send_to_llm=True)
+
+         if not isinstance(response, RevaGetFunctionCountResponse):
+             raise ValueError(f"Expected a RevaGetFunctionCountResponse, got {response}")
+
+         return response.function_count
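
Two conventions in `function_tools.py` are easy to get wrong when calling these tools: a string argument is tried as a hexadecimal address before it is treated as a name, and pages are 1 indexed. The sketch below isolates both so they can be exercised without the API server. The helper names `normalise_function_reference` and `paginate` are hypothetical, `ValueError` stands in for `RevaToolException`, and the slicing only illustrates 1-indexed paging; in the package the actual page is produced by the disassembler side.

```python
from typing import List, Optional, Sequence, Tuple, Union


def normalise_function_reference(ref: Union[str, int]) -> Tuple[Optional[int], Optional[str]]:
    """Split a reference into (address, name); exactly one is not None."""
    if isinstance(ref, int):
        address = ref
    else:
        try:
            # Base 16 accepts both "80000" and "0x80000".
            address = int(ref, 16)
        except ValueError:
            # Not valid hex, so treat it as a function name.
            return None, ref
    if address <= 0:
        raise ValueError("Address must be > 0 and in hex format")
    return address, None


def paginate(items: Sequence[str], page: int, page_size: int = 20) -> List[str]:
    """Return the 1-indexed page of items; page 1 is the first page."""
    if page < 1:
        raise ValueError("`page` is 1 indexed, page must be >= 1")
    start = (page - 1) * page_size
    return list(items[start:start + page_size])
```

One consequence of trying hex first is that a function whose name consists only of hex digits, such as `add`, is interpreted as the address `0xadd` rather than as a name.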