PyPI - chainforge - Versions diffs - 0.1.0__tar.gz - Mend

chainforge 0.1.0__tar.gz

Files changed (42)

chainforge-0.1.0/LICENSE.md ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2023 Ian Arawjo
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

chainforge-0.1.0/MANIFEST.in ADDED

@@ -0,0 +1 @@
+graft chainforge/react-server/build

chainforge-0.1.0/PKG-INFO ADDED

@@ -0,0 +1,15 @@
+Metadata-Version: 2.1
+Name: chainforge
+Version: 0.1.0
+Summary: A Visual Programming Environment for Prompt Engineering
+Home-page: https://github.com/ianarawjo/ChainForge/
+Author: Ian Arawjo
+Classifier: Development Status :: 3 - Alpha
+Classifier: Intended Audience :: Developers
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.7
+Classifier: Programming Language :: Python :: 3.8
+Classifier: Programming Language :: Python :: 3.9
+Requires-Python: >=3.7
+License-File: LICENSE.md
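
Reviewer note: PKG-INFO uses RFC 822 style `Key: value` headers, with repeated keys such as `Classifier` appearing once per value. The following is a minimal sketch of reading those fields with Python's standard `email.parser` module. The `read_metadata` helper and the trimmed `PKG_INFO` string are illustrative assumptions, not part of the chainforge package.

```python
from email.parser import Parser

# A trimmed copy of the PKG-INFO fields shown in the diff above.
PKG_INFO = """\
Metadata-Version: 2.1
Name: chainforge
Version: 0.1.0
Summary: A Visual Programming Environment for Prompt Engineering
Classifier: Development Status :: 3 - Alpha
Classifier: Programming Language :: Python :: 3.7
Requires-Python: >=3.7
"""


def read_metadata(text):
    """Parse core-metadata text; repeated fields (e.g. Classifier) become lists.

    Hypothetical helper for illustration only.
    """
    msg = Parser().parsestr(text)
    return {
        "name": msg["Name"],
        "version": msg["Version"],
        "requires_python": msg["Requires-Python"],
        "classifiers": msg.get_all("Classifier", []),
    }


meta = read_metadata(PKG_INFO)
print(meta["name"], meta["version"], meta["requires_python"])
```

Single-valued fields are read with item access, while `get_all` collects every occurrence of a repeated field.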

chainforge-0.1.0/README.md ADDED

@@ -0,0 +1,77 @@
+# ⛓️🛠️ ChainForge
+**An open-source visual programming environment for battle-testing prompts to LLMs.**
+<img width="1615" alt="Screen Shot 2023-05-17 at 2 45 17 PM" src="https://github.com/ianarawjo/ChainForge/assets/5251713/96aecea7-cf05-4064-8f83-20a524449af7">
+ChainForge is a data flow prompt engineering environment for analyzing and evaluating LLM responses. Much as Jupyter Notebooks are geared towards early-stage exploration, ChainForge is geared towards early-stage, quick-and-dirty exploration of prompts and response quality that goes beyond ad-hoc chatting with individual LLMs. With ChainForge (CF), you can:
+ - Query multiple LLMs at once to sketch prompt ideas and test variations quickly and effectively.
+ - Compare response quality across prompt variations and across models to choose the best prompt and model for your use case.
+ - Set up an evaluation metric (scoring function) and immediately visualize results across prompts, prompt parameters, and models.
+**This is an open alpha of ChainForge.** We currently support the models GPT3.5, GPT4, Claude, and Alpaca 7B (through [Dalai](https://github.com/cocktailpeanut/dalai)) at default settings. Try it and let us know what you think! :)
+ChainForge is built on [ReactFlow](https://reactflow.dev) and [Flask](https://flask.palletsprojects.com/en/2.3.x/).
+# Installation
+To get started with the ChainForge alpha, see the [Installation Guide](https://github.com/ianarawjo/ChainForge/blob/main/GUIDE.md). In the near future, we will upload to PyPI as an official package.
+## Example evaluation flows
+We've prepared a couple of example flows to give you a sense of what's possible with ChainForge.
+Import them, then:
+ - Run any Prompt node(s) to query the LLM(s),
+ - Run any Evaluator nodes to score responses.
+Note that right now, **exporting a CF flow does not save cached responses.**
+# Features
+A key goal of ChainForge is facilitating **comparison** and **evaluation** of prompts and models, and (in the near future) prompt chains. Basic features are:
+- **Prompt permutations**: Set up a prompt template and feed it variations of input variables. ChainForge will prompt all selected LLMs with all possible permutations of the input prompt, so that you can get a better sense of prompt quality. You can also chain prompt templates at arbitrary depth (e.g., to compare templates).
+- **Evaluation nodes**: Probe LLM responses in a chain and test them (classically) for some desired behavior. At a basic level, this is based on Python scripts. We plan to add preset evaluator nodes for common use cases in the near future (e.g., named-entity recognition). Note that you can also chain LLM responses into prompt templates to help evaluate outputs cheaply before more extensive evaluation methods.
+- **Visualization nodes**: Visualize evaluation results on plots like box-and-whisker and 3D scatterplots.
+Taken together, these three features let you easily:
+  - **Compare across prompts and prompt parameters**: Choose the best set of prompts that maximizes your eval target metrics (e.g., lowest code error rate). Or, see how changing parameters in a prompt template affects the quality of responses.
+  - **Compare across models**: Compare responses for every prompt across models.
+# Development
+ChainForge is being developed by research scientists at Harvard University in the [Harvard HCI](https://hci.seas.harvard.edu) group:
+- [Ian Arawjo](http://ianarawjo.com/index.html)
+- [Priyan Vaithilingam](https://priyan.info)
+- [Elena Glassman](https://glassmanlab.seas.harvard.edu/glassman.html)
+We provide ongoing releases of this tool in the hopes that others find it useful for their projects.
+## Future Planned Features
+- **Model settings**: Change settings for individual models, so one can test across the same model with different settings
+- **Compare across response batches**: Run an evaluator over all N responses generated for each prompt, to measure factors like variability or parseability (e.g., how many code outputs pass a basic smell test?)
+- **System prompts**: Ability to change the system prompt for models that support it (e.g., ChatGPT). Try out different system prompts and compare response quality.
+- **Collapse nodes**: Nodes should be collapsible, to save screen space.
+- **LMQL and Microsoft guidance nodes**: Support for prompt pipelines that involve LMQL and {{guidance}} code, esp. inspecting masked response variables.
+- **AI assistance for prompt engineering**: Spur creative ideas and quickly iterate on variations of prompts through interaction with GPT4.
+- **Compare fine-tuned to base models**: Beyond comparing between different models like Alpaca and ChatGPT, support comparison between versions of the same model (e.g., a base model and a fine-tuned one). Help users detect where fine-tuning resulted in any 'breaking changes' elsewhere.
+- **Export to code**: In the future, export prompts and (potentially) chains using a programming API like LangChain.
+- **Dark mode**: A dark mode theme
+- **Compare across chains**: If a prompt P is used *across* chains C1, C2, etc., how does changing it affect all downstream events?
+See a feature you'd like that isn't here? Open an [Issue](https://github.com/ianarawjo/ChainForge/issues).
+## Inspiration and Links
+ChainForge is meant to be general-purpose, and is not developed for a specific API or LLM back-end. Our ultimate goal is integration into other tools for the systematic evaluation and auditing of LLMs. We hope to help others who are developing prompt-analysis flows in LLMs, or otherwise auditing LLM outputs. This project was inspired by our own use case, but also shares some camaraderie with two related (closed-source) research projects, both led by [Sherry Wu](https://www.cs.cmu.edu/~sherryw/):
+- "PromptChainer: Chaining Large Language Model Prompts through Visual Programming" (Wu et al., CHI ’22 LBW) [Video](https://www.youtube.com/watch?v=p6MA8q19uo0)
+- "AI Chains: Transparent and Controllable Human-AI Interaction by Chaining Large Language Model Prompts" (Wu et al., CHI ’22)
+Unlike these projects, we are focusing on supporting evaluation across prompts, prompt parameters, and models.
+## How to collaborate?
+We are looking for open-source collaborators. The best way to do this, at the moment, is simply to implement the requested feature / bug fix and submit a Pull Request. If you want to report a bug or request a feature, open an [Issue](https://github.com/ianarawjo/ChainForge/issues).
+# License
+ChainForge is released under the MIT License.
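
Reviewer note: the README's "prompt permutations" feature fills a template with every combination of input values. The following is a rough sketch of that idea, assuming "all possible permutations" means the Cartesian product of the variable values. `permute_prompts`, the `$name` placeholder syntax (from the standard `string.Template`), and the sample inputs are hypothetical; this is not the package's `PromptPermutationGenerator`, whose source is not shown in this section.

```python
from itertools import product
from string import Template


def permute_prompts(template, variables):
    """Yield (filled_prompt, assignment) for every combination of variable values.

    Hypothetical helper for illustration; not ChainForge's own implementation.
    """
    names = sorted(variables)  # fixed ordering so the output is deterministic
    for values in product(*(variables[name] for name in names)):
        assignment = dict(zip(names, values))
        yield Template(template).substitute(assignment), assignment


prompts = list(permute_prompts(
    "Translate '$phrase' into $language.",
    {
        "phrase": ["good morning", "thank you"],
        "language": ["French", "Spanish", "German"],
    },
))
print(len(prompts))  # 2 phrases x 3 languages
for text, _ in prompts:
    print(text)
```

Each filled prompt would then be sent to every selected model, so the number of queries grows as the product of the value counts times the number of models.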

chainforge-0.1.0/chainforge/__init__.py ADDED

@@ -0,0 +1 @@
+from .app import main

chainforge-0.1.0/chainforge/app.py ADDED

@@ -0,0 +1,123 @@
+import json, os, asyncio, sys, argparse, threading
+from dataclasses import dataclass
+from statistics import mean, median, stdev
+from flask import Flask, request, jsonify
+from flask_cors import CORS
+from flask_socketio import SocketIO
+from chainforge.flask_app import run_server
+from chainforge.promptengine.query import PromptLLM, PromptLLMDummy
+from chainforge.promptengine.template import PromptTemplate, PromptPermutationGenerator
+from chainforge.promptengine.utils import LLM, is_valid_filepath, get_files_at_dir, create_dir_if_not_exists
+# Setup the socketio app
+app = Flask(__name__)
+# Initialize Socket.IO with CORS enabled
+socketio = SocketIO(app, cors_allowed_origins="*", async_mode="gevent")
+# The cache base directory
+CACHE_DIR = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'cache')
+# Wait a max of a full 3 minutes (180 seconds) for the response count to update, before exiting.
+MAX_WAIT_TIME = 180
+def countdown():
+    n = 10
+    while n > 0:
+        socketio.sleep(0.5)
+        socketio.emit('response', n, namespace='/queryllm')
+        n -= 1
+@socketio.on('queryllm', namespace='/queryllm')
+def readCounts(data):
+    id = data['id']
+    max_count = data['max']
+    tempfilepath = os.path.join(CACHE_DIR, f'_temp_{id}.txt')
+    # Check that temp file exists. If it doesn't, something went wrong with setup on Flask's end:
+    if not os.path.exists(tempfilepath):
+        print(f"Error: Temp file not found at path {tempfilepath}. Cannot stream querying progress.")
+        socketio.emit('finish', 'temp file not found', namespace='/queryllm')
+        # Stop here; otherwise the loop below would hit FileNotFoundError and report 'success'.
+        return
+    i = 0
+    last_n = 0
+    init_run = True
+    while i < MAX_WAIT_TIME and last_n < max_count:
+        # Open the temp file to read the progress so far:
+        try:
+            with open(tempfilepath, 'r') as f:
+                queries = json.load(f)
+        except FileNotFoundError:
+            # If the temp file was deleted during execution, the Flask 'queryllm' func must've terminated successfully:
+            socketio.emit('finish', 'success', namespace='/queryllm')
+            return
+        # Calculate the total sum of responses
+        # TODO: This is a naive approach; we need to make this more complex and factor in caching in future
+        n = sum(int(count) for count in queries.values())
+        # If something's changed...
+        if init_run or last_n != n:
+            i = 0
+            last_n = n
+            init_run = False
+            # Update the React front-end with the current progress
+            socketio.emit('response', queries, namespace='/queryllm')
+        else:
+            i += 0.1
+        # Wait a bit before reading the file again
+        socketio.sleep(0.1)
+    if i >= MAX_WAIT_TIME:
+        print(f"Error: Waited maximum {MAX_WAIT_TIME} seconds for response count to update. Exited prematurely.")
+        socketio.emit('finish', 'max_wait_reached', namespace='/queryllm')
+    else:
+        print("All responses loaded!")
+        socketio.emit('finish', 'success', namespace='/queryllm')
+# Start socketio server on the given port
+def run_socketio_server(socketio, port):
+    socketio.run(app, host="localhost", port=port)
+# Main Chainforge start
+def main():
+    parser = argparse.ArgumentParser(description='Chainforge command line tool')
+    # Serve command
+    subparsers = parser.add_subparsers(dest='serve')
+    serve_parser = subparsers.add_parser('serve', help='Start Chainforge server')
+    # Turn on to disable all outbound LLM API calls and replace them with dummy calls
+    # that return random strings of ASCII characters. Useful for testing interface without wasting $$.
+    serve_parser.add_argument('--dummy-responses',
+        help="""Disables queries to LLMs, replacing them with spoofed responses composed of random ASCII characters.
+                Produces each dummy response at random intervals between 0.1 and 3 seconds.""",
+        dest='dummy_responses',
+        action='store_true')
+    # TODO: Reimplement this where the React server is given the backend's port before loading.
+    # serve_parser.add_argument('--port', help='The port to run the server on. Defaults to 8000.', type=int, default=8000, nargs='?')
+    args = parser.parse_args()
+    # Currently only support the 'serve' command...
+    if not args.serve:
+        parser.print_help()
+        sys.exit(0)
+    port = 8000 # args.port if args.port else 8000
+    # Spin up separate thread for socketio app, on port+1 (8001 default)
+    print(f"Serving SocketIO server on port {port+1}...")
+    t1 = threading.Thread(target=run_socketio_server, args=[socketio, port+1])
+    t1.start()
+    print(f"Serving Flask server on port {port}...")
+    run_server(host="localhost", port=port, cmd_args=args)
+if __name__ == "__main__":
+    main()
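
Reviewer note: `readCounts` above polls a JSON temp file of per-LLM response counts, resets its idle timer whenever the total changes, and gives up after `MAX_WAIT_TIME` seconds without progress. The following is a standalone sketch of that loop with the Socket.IO calls replaced by a callback so it can be run and tested in isolation. `poll_progress`, its parameters, and the demo file contents are assumptions made for illustration, not part of the package.

```python
import json
import os
import tempfile
import time


def poll_progress(path, expected_total, max_idle=1.0, interval=0.05, on_update=print):
    """Poll a JSON file of {llm_name: response_count} until done or idle too long.

    Returns 'success', or 'max_wait_reached' if nothing changed for max_idle seconds.
    Hypothetical stand-in for the Socket.IO handler in the diff above.
    """
    idle = 0.0
    last_total = None
    while last_total is None or last_total < expected_total:
        try:
            with open(path, "r") as f:
                counts = json.load(f)
        except FileNotFoundError:
            # Same assumption as the original: the writer deletes the file when it finishes.
            return "success"
        total = sum(int(n) for n in counts.values())
        if total != last_total:
            idle = 0.0  # progress was made, so reset the idle timer
            last_total = total
            on_update(counts)
        else:
            idle += interval
            if idle >= max_idle:
                return "max_wait_reached"
        time.sleep(interval)
    return "success"


with tempfile.TemporaryDirectory() as tmp:
    progress_file = os.path.join(tmp, "_temp_demo.txt")
    with open(progress_file, "w") as f:
        json.dump({"gpt4": 2, "claude": 1}, f)
    # All 3 expected responses are already present, so this finishes at once.
    finished = poll_progress(progress_file, expected_total=3, on_update=lambda c: None)
    # 5 responses are expected but the file never changes, so this times out.
    stalled = poll_progress(progress_file, expected_total=5, max_idle=0.2,
                            on_update=lambda c: None)

print(finished, stalled)
```

Like the original, the sketch does not guard against reading the file while it is being written; a production version would probably also need to handle `json.JSONDecodeError`.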