PyPI - graphrag - Versions diffs - 0.1.1.dev4__tar.gz → 0.1.2.dev48__tar.gz - Mend

graphrag 0.1.1.dev4__tar.gz → 0.1.2.dev48__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (399)

{graphrag-0.1.1.dev4 → graphrag-0.1.2.dev48}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: graphrag
-Version: 0.1.1.dev4
+Version: 0.1.2.dev48
 Summary:
 License: MIT
 Author: Alonso Guevara Fernández
@@ -21,7 +21,7 @@ Requires-Dist: devtools (>=0.12.2,<0.13.0)
 Requires-Dist: environs (>=11.0.0,<12.0.0)
 Requires-Dist: fastparquet (>=2024.2.0,<2025.0.0)
 Requires-Dist: graspologic (>=3.4.1,<4.0.0)
-Requires-Dist: lancedb (>=0.9.0,<0.10.0)
+Requires-Dist: lancedb (>=0.10.0,<0.11.0)
 Requires-Dist: nest-asyncio (>=1.6.0,<2.0.0) ; platform_system == "Windows"
 Requires-Dist: networkx (>=3,<4)
 Requires-Dist: nltk (==3.8.1)
@@ -35,8 +35,8 @@ Requires-Dist: pyyaml (>=6.0.1,<7.0.0)
 Requires-Dist: rich (>=13.6.0,<14.0.0)
 Requires-Dist: scipy (==1.12.0)
 Requires-Dist: swifter (>=1.4.0,<2.0.0)
-Requires-Dist: tenacity (>=8.2.3,<9.0.0)
-Requires-Dist: textual (>=0.70.0,<0.71.0)
+Requires-Dist: tenacity (>=8.5.0,<9.0.0)
+Requires-Dist: textual (>=0.72.0,<0.73.0)
 Requires-Dist: tiktoken (>=0.7.0,<0.8.0)
 Requires-Dist: typing-extensions (>=4.12.2,<5.0.0)
 Requires-Dist: uvloop (>=0.19.0,<0.20.0) ; platform_system != "Windows"
@@ -44,9 +44,25 @@ Description-Content-Type: text/markdown
 # GraphRAG
-👉 [Use the GraphRAG Accelerator solution](https://github.com/Azure-Samples/graphrag-accelerator)
-👉 [Microsoft Research Blog Post](https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/)
-👉 [Read the docs](https://microsoft.github.io/graphrag)
+👉 [Use the GraphRAG Accelerator solution](https://github.com/Azure-Samples/graphrag-accelerator) <br/>
+👉 [Microsoft Research Blog Post](https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/)<br/>
+👉 [Read the docs](https://microsoft.github.io/graphrag)<br/>
+👉 [GraphRAG Arxiv](https://arxiv.org/pdf/2404.16130)
+<div align="left">
+  <a href="https://pypi.org/project/graphrag/">
+    <img alt="PyPI - Version" src="https://img.shields.io/pypi/v/graphrag">
+  </a>
+  <a href="https://pypi.org/project/graphrag/">
+    <img alt="PyPI - Downloads" src="https://img.shields.io/pypi/dm/graphrag">
+  </a>
+  <a href="https://github.com/microsoft/graphrag/issues">
+    <img alt="GitHub Issues" src="https://img.shields.io/github/issues/microsoft/graphrag">
+  </a>
+  <a href="https://github.com/microsoft/graphrag/discussions">
+    <img alt="GitHub Discussions" src="https://img.shields.io/github/discussions/microsoft/graphrag">
+  </a>
+</div>
 ## Overview
@@ -62,10 +78,13 @@ To get started with the GraphRAG system we recommend trying the [Solution Accele
 This repository presents a methodology for using knowledge graph memory structures to enhance LLM outputs. Please note that the provided code serves as a demonstration and is not an officially supported Microsoft offering.
+⚠️ *Warning: GraphRAG indexing can be an expensive operation, please read all of the documentation to understand the process and costs involved, and start small.*
 ## Diving Deeper
 - To learn about our contribution guidelines, see [CONTRIBUTING.md](./CONTRIBUTING.md)
 - To start developing _GraphRAG_, see [DEVELOPING.md](./DEVELOPING.md)
+- Join the conversation and provide feedback in the [GitHub Discussions tab!](https://github.com/microsoft/graphrag/discussions)
 ## Prompt Tuning

{graphrag-0.1.1.dev4 → graphrag-0.1.2.dev48}/README.md RENAMED

@@ -1,8 +1,24 @@
 # GraphRAG
-👉 [Use the GraphRAG Accelerator solution](https://github.com/Azure-Samples/graphrag-accelerator)
-👉 [Microsoft Research Blog Post](https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/)
-👉 [Read the docs](https://microsoft.github.io/graphrag)
+👉 [Use the GraphRAG Accelerator solution](https://github.com/Azure-Samples/graphrag-accelerator) <br/>
+👉 [Microsoft Research Blog Post](https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/)<br/>
+👉 [Read the docs](https://microsoft.github.io/graphrag)<br/>
+👉 [GraphRAG Arxiv](https://arxiv.org/pdf/2404.16130)
+<div align="left">
+  <a href="https://pypi.org/project/graphrag/">
+    <img alt="PyPI - Version" src="https://img.shields.io/pypi/v/graphrag">
+  </a>
+  <a href="https://pypi.org/project/graphrag/">
+    <img alt="PyPI - Downloads" src="https://img.shields.io/pypi/dm/graphrag">
+  </a>
+  <a href="https://github.com/microsoft/graphrag/issues">
+    <img alt="GitHub Issues" src="https://img.shields.io/github/issues/microsoft/graphrag">
+  </a>
+  <a href="https://github.com/microsoft/graphrag/discussions">
+    <img alt="GitHub Discussions" src="https://img.shields.io/github/discussions/microsoft/graphrag">
+  </a>
+</div>
 ## Overview
@@ -18,10 +34,13 @@ To get started with the GraphRAG system we recommend trying the [Solution Accele
 This repository presents a methodology for using knowledge graph memory structures to enhance LLM outputs. Please note that the provided code serves as a demonstration and is not an officially supported Microsoft offering.
+⚠️ *Warning: GraphRAG indexing can be an expensive operation, please read all of the documentation to understand the process and costs involved, and start small.*
 ## Diving Deeper
 - To learn about our contribution guidelines, see [CONTRIBUTING.md](./CONTRIBUTING.md)
 - To start developing _GraphRAG_, see [DEVELOPING.md](./DEVELOPING.md)
+- Join the conversation and provide feedback in the [GitHub Discussions tab!](https://github.com/microsoft/graphrag/discussions)
 ## Prompt Tuning

{graphrag-0.1.1.dev4 → graphrag-0.1.2.dev48}/graphrag/config/create_graphrag_config.py RENAMED

@@ -112,6 +112,9 @@ def create_graphrag_config(
                 proxy=reader.str("proxy") or base.proxy,
                 model=reader.str("model") or base.model,
                 max_tokens=reader.int(Fragment.max_tokens) or base.max_tokens,
+                temperature=reader.float(Fragment.temperature) or base.temperature,
+                top_p=reader.float(Fragment.top_p) or base.top_p,
+                n=reader.int(Fragment.n) or base.n,
                 model_supports_json=reader.bool(Fragment.model_supports_json)
                 or base.model_supports_json,
                 request_timeout=reader.float(Fragment.request_timeout)
@@ -134,13 +137,27 @@ def create_graphrag_config(
         config: LLMConfigInput, base: LLMParameters
     ) -> LLMParameters:
         with reader.use(config.get("llm")):
+            api_type = reader.str(Fragment.type) or defs.EMBEDDING_TYPE
+            api_type = LLMType(api_type) if api_type else defs.LLM_TYPE
             api_key = reader.str(Fragment.api_key) or base.api_key
-            api_base = reader.str(Fragment.api_base) or base.api_base
-            api_version = reader.str(Fragment.api_version) or base.api_version
+            # In a unique events where:
+            # - same api_bases for LLM and embeddings (both Azure)
+            # - different api_bases for LLM and embeddings (both Azure)
+            # - LLM uses Azure OpenAI, while embeddings uses base OpenAI (this one is important)
+            # - LLM uses Azure OpenAI, while embeddings uses third-party OpenAI-like API
+            api_base = (
+                reader.str(Fragment.api_base) or base.api_base
+                if _is_azure(api_type)
+                else reader.str(Fragment.api_base)
+            )
+            api_version = (
+                reader.str(Fragment.api_version) or base.api_version
+                if _is_azure(api_type)
+                else reader.str(Fragment.api_version)
+            )
             api_organization = reader.str("organization") or base.organization
             api_proxy = reader.str("proxy") or base.proxy
-            api_type = reader.str(Fragment.type) or defs.EMBEDDING_TYPE
-            api_type = LLMType(api_type) if api_type else defs.LLM_TYPE
             cognitive_services_endpoint = (
                 reader.str(Fragment.cognitive_services_endpoint)
                 or base.cognitive_services_endpoint
@@ -246,6 +263,10 @@ def create_graphrag_config(
                     type=llm_type,
                     model=reader.str(Fragment.model) or defs.LLM_MODEL,
                     max_tokens=reader.int(Fragment.max_tokens) or defs.LLM_MAX_TOKENS,
+                    temperature=reader.float(Fragment.temperature)
+                    or defs.LLM_TEMPERATURE,
+                    top_p=reader.float(Fragment.top_p) or defs.LLM_TOP_P,
+                    n=reader.int(Fragment.n) or defs.LLM_N,
                     model_supports_json=reader.bool(Fragment.model_supports_json),
                     request_timeout=reader.float(Fragment.request_timeout)
                     or defs.LLM_REQUEST_TIMEOUT,
@@ -420,7 +441,7 @@ def create_graphrag_config(
         community_report_config = values.get("community_reports") or {}
         with (
-            reader.envvar_prefix(Section.community_report),
+            reader.envvar_prefix(Section.community_reports),
             reader.use(community_report_config),
         ):
             community_reports_model = CommunityReportsConfig(
@@ -474,6 +495,10 @@ def create_graphrag_config(
                 or defs.LOCAL_SEARCH_TOP_K_MAPPED_ENTITIES,
                 top_k_relationships=reader.int("top_k_relationships")
                 or defs.LOCAL_SEARCH_TOP_K_RELATIONSHIPS,
+                temperature=reader.float("llm_temperature")
+                or defs.LOCAL_SEARCH_LLM_TEMPERATURE,
+                top_p=reader.float("llm_top_p") or defs.LOCAL_SEARCH_LLM_TOP_P,
+                n=reader.int("llm_n") or defs.LOCAL_SEARCH_LLM_N,
                 max_tokens=reader.int(Fragment.max_tokens)
                 or defs.LOCAL_SEARCH_MAX_TOKENS,
                 llm_max_tokens=reader.int("llm_max_tokens")
@@ -485,6 +510,10 @@ def create_graphrag_config(
             reader.envvar_prefix(Section.global_search),
         ):
             global_search_model = GlobalSearchConfig(
+                temperature=reader.float("llm_temperature")
+                or defs.GLOBAL_SEARCH_LLM_TEMPERATURE,
+                top_p=reader.float("llm_top_p") or defs.GLOBAL_SEARCH_LLM_TOP_P,
+                n=reader.int("llm_n") or defs.GLOBAL_SEARCH_LLM_N,
                 max_tokens=reader.int(Fragment.max_tokens)
                 or defs.GLOBAL_SEARCH_MAX_TOKENS,
                 data_max_tokens=reader.int("data_max_tokens")
@@ -550,16 +579,19 @@ class Fragment(str, Enum):
     max_retries = "MAX_RETRIES"
     max_retry_wait = "MAX_RETRY_WAIT"
     max_tokens = "MAX_TOKENS"
+    temperature = "TEMPERATURE"
+    top_p = "TOP_P"
+    n = "N"
     model = "MODEL"
     model_supports_json = "MODEL_SUPPORTS_JSON"
     prompt_file = "PROMPT_FILE"
     request_timeout = "REQUEST_TIMEOUT"
-    rpm = "RPM"
+    rpm = "REQUESTS_PER_MINUTE"
     sleep_recommendation = "SLEEP_ON_RATE_LIMIT_RECOMMENDATION"
     storage_account_blob_url = "STORAGE_ACCOUNT_BLOB_URL"
     thread_count = "THREAD_COUNT"
     thread_stagger = "THREAD_STAGGER"
-    tpm = "TPM"
+    tpm = "TOKENS_PER_MINUTE"
     type = "TYPE"
@@ -570,7 +602,7 @@ class Section(str, Enum):
     cache = "CACHE"
     chunk = "CHUNK"
     claim_extraction = "CLAIM_EXTRACTION"
-    community_report = "COMMUNITY_REPORT"
+    community_reports = "COMMUNITY_REPORTS"
     embedding = "EMBEDDING"
     entity_extraction = "ENTITY_EXTRACTION"
     graphrag = "GRAPHRAG"
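
The embedding hunk above only falls back to the base LLM's `api_base` and `api_version` when the embedding type is an Azure one. A minimal sketch of that selection rule follows. The enum below is a hypothetical stand-in for the package's `LLMType`, its member values are assumed, and `is_azure` only approximates what the `_is_azure` helper referenced in the diff presumably does.

```python
from enum import Enum
from typing import Optional


class LLMType(str, Enum):
    # Hypothetical stand-in for the package's LLMType enum; values are assumed.
    OpenAIEmbedding = "openai_embedding"
    AzureOpenAIEmbedding = "azure_openai_embedding"


def is_azure(api_type: LLMType) -> bool:
    # Assumed behaviour of the `_is_azure` helper referenced in the diff.
    return api_type is LLMType.AzureOpenAIEmbedding


def resolve_api_base(
    own: Optional[str], inherited: Optional[str], api_type: LLMType
) -> Optional[str]:
    # `a or b if cond else c` parses as `(a or b) if cond else c`,
    # so the inherited value is consulted only for Azure types.
    return own or inherited if is_azure(api_type) else own


# Azure embeddings with no api_base of their own inherit the LLM's value.
print(resolve_api_base(None, "https://llm.example", LLMType.AzureOpenAIEmbedding))
# Non-Azure embeddings never inherit it, so an Azure LLM endpoint does not leak.
print(resolve_api_base(None, "https://llm.example", LLMType.OpenAIEmbedding))
```

Under these assumptions, the case the diff comment calls important (Azure OpenAI for the LLM, base OpenAI for embeddings) resolves to no `api_base` for the embeddings unless one is set explicitly.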

{graphrag-0.1.1.dev4 → graphrag-0.1.2.dev48}/graphrag/config/defaults.py RENAMED

@@ -23,6 +23,9 @@ ENCODING_MODEL = "cl100k_base"
 LLM_TYPE = LLMType.OpenAIChat
 LLM_MODEL = "gpt-4-turbo-preview"
 LLM_MAX_TOKENS = 4000
+LLM_TEMPERATURE = 0
+LLM_TOP_P = 1
+LLM_N = 1
 LLM_REQUEST_TIMEOUT = 180.0
 LLM_TOKENS_PER_MINUTE = 0
 LLM_REQUESTS_PER_MINUTE = 0
@@ -42,19 +45,19 @@ EMBEDDING_TARGET = TextEmbeddingTarget.required
 CACHE_TYPE = CacheType.file
 CACHE_BASE_DIR = "cache"
-CHUNK_SIZE = 300
+CHUNK_SIZE = 1200
 CHUNK_OVERLAP = 100
 CHUNK_GROUP_BY_COLUMNS = ["id"]
 CLAIM_DESCRIPTION = (
     "Any claims or facts that could be relevant to information discovery."
 )
-CLAIM_MAX_GLEANINGS = 0
+CLAIM_MAX_GLEANINGS = 1
 CLAIM_EXTRACTION_ENABLED = False
 MAX_CLUSTER_SIZE = 10
 COMMUNITY_REPORT_MAX_LENGTH = 2000
 COMMUNITY_REPORT_MAX_INPUT_LENGTH = 8000
 ENTITY_EXTRACTION_ENTITY_TYPES = ["organization", "person", "geo", "event"]
-ENTITY_EXTRACTION_MAX_GLEANINGS = 0
+ENTITY_EXTRACTION_MAX_GLEANINGS = 1
 INPUT_FILE_TYPE = InputFileType.text
 INPUT_TYPE = InputType.file
 INPUT_BASE_DIR = "input"
@@ -87,9 +90,15 @@ LOCAL_SEARCH_CONVERSATION_HISTORY_MAX_TURNS = 5
 LOCAL_SEARCH_TOP_K_MAPPED_ENTITIES = 10
 LOCAL_SEARCH_TOP_K_RELATIONSHIPS = 10
 LOCAL_SEARCH_MAX_TOKENS = 12_000
+LOCAL_SEARCH_LLM_TEMPERATURE = 0
+LOCAL_SEARCH_LLM_TOP_P = 1
+LOCAL_SEARCH_LLM_N = 1
 LOCAL_SEARCH_LLM_MAX_TOKENS = 2000
 # Global Search
+GLOBAL_SEARCH_LLM_TEMPERATURE = 0
+GLOBAL_SEARCH_LLM_TOP_P = 1
+GLOBAL_SEARCH_LLM_N = 1
 GLOBAL_SEARCH_MAX_TOKENS = 12_000
 GLOBAL_SEARCH_DATA_MAX_TOKENS = 12_000
 GLOBAL_SEARCH_MAP_MAX_TOKENS = 1000

{graphrag-0.1.1.dev4 → graphrag-0.1.2.dev48}/graphrag/config/models/claim_extraction_config.py RENAMED

@@ -43,7 +43,9 @@ class ClaimExtractionConfig(LLMConfig):
             "type": ExtractClaimsStrategyType.graph_intelligence,
             "llm": self.llm.model_dump(),
             **self.parallelization.model_dump(),
-            "extraction_prompt": (Path(root_dir) / self.prompt).read_text()
+            "extraction_prompt": (Path(root_dir) / self.prompt)
+            .read_bytes()
+            .decode(encoding="utf-8")
             if self.prompt
             else None,
             "claim_description": self.description,

{graphrag-0.1.1.dev4 → graphrag-0.1.2.dev48}/graphrag/config/models/community_reports_config.py RENAMED

@@ -38,7 +38,9 @@ class CommunityReportsConfig(LLMConfig):
             "type": CreateCommunityReportsStrategyType.graph_intelligence,
             "llm": self.llm.model_dump(),
             **self.parallelization.model_dump(),
-            "extraction_prompt": (Path(root_dir) / self.prompt).read_text()
+            "extraction_prompt": (Path(root_dir) / self.prompt)
+            .read_bytes()
+            .decode(encoding="utf-8")
             if self.prompt
             else None,
             "max_report_length": self.max_length,

{graphrag-0.1.1.dev4 → graphrag-0.1.2.dev48}/graphrag/config/models/entity_extraction_config.py RENAMED

@@ -38,7 +38,9 @@ class EntityExtractionConfig(LLMConfig):
             "type": ExtractEntityStrategyType.graph_intelligence,
             "llm": self.llm.model_dump(),
             **self.parallelization.model_dump(),
-            "extraction_prompt": (Path(root_dir) / self.prompt).read_text()
+            "extraction_prompt": (Path(root_dir) / self.prompt)
+            .read_bytes()
+            .decode(encoding="utf-8")
             if self.prompt
             else None,
             "max_gleanings": self.max_gleanings,

{graphrag-0.1.1.dev4 → graphrag-0.1.2.dev48}/graphrag/config/models/global_search_config.py RENAMED

@@ -11,6 +11,18 @@ import graphrag.config.defaults as defs
 class GlobalSearchConfig(BaseModel):
     """The default configuration section for Cache."""
+    temperature: float | None = Field(
+        description="The temperature to use for token generation.",
+        default=defs.GLOBAL_SEARCH_LLM_TEMPERATURE,
+    )
+    top_p: float | None = Field(
+        description="The top-p value to use for token generation.",
+        default=defs.GLOBAL_SEARCH_LLM_TOP_P,
+    )
+    n: int | None = Field(
+        description="The number of completions to generate.",
+        default=defs.GLOBAL_SEARCH_LLM_N,
+    )
     max_tokens: int = Field(
         description="The maximum context size in tokens.",
         default=defs.GLOBAL_SEARCH_MAX_TOKENS,

{graphrag-0.1.1.dev4 → graphrag-0.1.2.dev48}/graphrag/config/models/llm_parameters.py RENAMED

@@ -25,6 +25,18 @@ class LLMParameters(BaseModel):
         description="The maximum number of tokens to generate.",
         default=defs.LLM_MAX_TOKENS,
     )
+    temperature: float | None = Field(
+        description="The temperature to use for token generation.",
+        default=defs.LLM_TEMPERATURE,
+    )
+    top_p: float | None = Field(
+        description="The top-p value to use for token generation.",
+        default=defs.LLM_TOP_P,
+    )
+    n: int | None = Field(
+        description="The number of completions to generate.",
+        default=defs.LLM_N,
+    )
     request_timeout: float = Field(
         description="The request timeout to use.", default=defs.LLM_REQUEST_TIMEOUT
     )

{graphrag-0.1.1.dev4 → graphrag-0.1.2.dev48}/graphrag/config/models/local_search_config.py RENAMED

@@ -31,6 +31,18 @@ class LocalSearchConfig(BaseModel):
         description="The top k mapped relations.",
         default=defs.LOCAL_SEARCH_TOP_K_RELATIONSHIPS,
     )
+    temperature: float | None = Field(
+        description="The temperature to use for token generation.",
+        default=defs.LOCAL_SEARCH_LLM_TEMPERATURE,
+    )
+    top_p: float | None = Field(
+        description="The top-p value to use for token generation.",
+        default=defs.LOCAL_SEARCH_LLM_TOP_P,
+    )
+    n: int | None = Field(
+        description="The number of completions to generate.",
+        default=defs.LOCAL_SEARCH_LLM_N,
+    )
     max_tokens: int = Field(
         description="The maximum tokens.", default=defs.LOCAL_SEARCH_MAX_TOKENS
     )

{graphrag-0.1.1.dev4 → graphrag-0.1.2.dev48}/graphrag/config/models/summarize_descriptions_config.py RENAMED

@@ -34,7 +34,9 @@ class SummarizeDescriptionsConfig(LLMConfig):
             "type": SummarizeStrategyType.graph_intelligence,
             "llm": self.llm.model_dump(),
             **self.parallelization.model_dump(),
-            "summarize_prompt": (Path(root_dir) / self.prompt).read_text()
+            "summarize_prompt": (Path(root_dir) / self.prompt)
+            .read_bytes()
+            .decode(encoding="utf-8")
             if self.prompt
             else None,
             "max_summary_length": self.max_length,

{graphrag-0.1.1.dev4 → graphrag-0.1.2.dev48}/graphrag/index/__main__.py RENAMED

@@ -28,9 +28,10 @@ if __name__ == "__main__":
     )
     parser.add_argument(
         "--root",
-        help="If no configuration is defined, the root directory to use for input data and output data",
+        help="If no configuration is defined, the root directory to use for input data and output data. Default value: the current directory",
         # Only required if config is not defined
         required=False,
+        default=".",
         type=str,
     )
     parser.add_argument(
@@ -62,9 +63,16 @@ if __name__ == "__main__":
         help="Create an initial configuration in the given path.",
         action="store_true",
     )
+    parser.add_argument(
+        "--overlay-defaults",
+        help="Overlay default configuration values on a provided configuration file (--config).",
+        action="store_true",
+    )
     args = parser.parse_args()
+    if args.overlay_defaults and not args.config:
+        parser.error("--overlay-defaults requires --config")
     index_cli(
         root=args.root,
         verbose=args.verbose or False,
@@ -76,5 +84,6 @@ if __name__ == "__main__":
         emit=args.emit,
         dryrun=args.dryrun or False,
         init=args.init or False,
+        overlay_defaults=args.overlay_defaults or False,
         cli=True,
     )
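
The argument handling added above can be exercised in isolation. The sketch below reproduces only the three options involved (`--config`, `--root`, `--overlay-defaults`); the program name and the `parse` wrapper are hypothetical, and the rest of the real command line is omitted.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    # "indexer-sketch" is a placeholder program name, not the real entry point.
    parser = argparse.ArgumentParser(prog="indexer-sketch")
    parser.add_argument("--config", required=False, type=str)
    # New in this version: --root defaults to the current directory.
    parser.add_argument("--root", required=False, default=".", type=str)
    parser.add_argument("--overlay-defaults", action="store_true")
    return parser


def parse(argv: List[str]) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # Overlaying defaults only makes sense on top of an explicit config file.
    if args.overlay_defaults and not args.config:
        parser.error("--overlay-defaults requires --config")
    return args


args = parse(["--config", "settings.yaml", "--overlay-defaults"])
print(args.root, args.config, args.overlay_defaults)
```

Calling `parse(["--overlay-defaults"])` without `--config` goes through `parser.error`, which prints a usage message and exits with status 2.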

{graphrag-0.1.1.dev4 → graphrag-0.1.2.dev48}/graphrag/index/cli.py RENAMED

@@ -69,7 +69,7 @@ def redact(input: dict) -> str:
 def index_cli(
-    root: str | None,
+    root: str,
     init: bool,
     verbose: bool,
     resume: str | None,
@@ -79,19 +79,24 @@ def index_cli(
     config: str | None,
     emit: str | None,
     dryrun: bool,
+    overlay_defaults: bool,
     cli: bool = False,
 ):
     """Run the pipeline with the given config."""
-    root = root or ""
     run_id = resume or time.strftime("%Y%m%d-%H%M%S")
     _enable_logging(root, run_id, verbose)
     progress_reporter = _get_progress_reporter(reporter)
     if init:
         _initialize_project_at(root, progress_reporter)
         sys.exit(0)
-    pipeline_config: str | PipelineConfig = config or _create_default_config(
-        root, verbose, dryrun or False, progress_reporter
-    )
+    if overlay_defaults:
+        pipeline_config: str | PipelineConfig = _create_default_config(
+            root, config, verbose, dryrun or False, progress_reporter
+        )
+    else:
+        pipeline_config: str | PipelineConfig = config or _create_default_config(
+            root, None, verbose, dryrun or False, progress_reporter
+        )
     cache = NoopPipelineCache() if nocache else None
     pipeline_emit = emit.split(",") if emit else None
     encountered_errors = False
@@ -180,11 +185,11 @@ def _initialize_project_at(path: str, reporter: ProgressReporter) -> None:
     dotenv = root / ".env"
     if not dotenv.exists():
-        with settings_yaml.open("w") as file:
-            file.write(INIT_YAML)
+        with settings_yaml.open("wb") as file:
+            file.write(INIT_YAML.encode(encoding="utf-8", errors="strict"))
-    with dotenv.open("w") as file:
-        file.write(INIT_DOTENV)
+    with dotenv.open("wb") as file:
+        file.write(INIT_DOTENV.encode(encoding="utf-8", errors="strict"))
     prompts_dir = root / "prompts"
     if not prompts_dir.exists():
@@ -192,34 +197,48 @@ def _initialize_project_at(path: str, reporter: ProgressReporter) -> None:
     entity_extraction = prompts_dir / "entity_extraction.txt"
     if not entity_extraction.exists():
-        with entity_extraction.open("w") as file:
-            file.write(GRAPH_EXTRACTION_PROMPT)
+        with entity_extraction.open("wb") as file:
+            file.write(
+                GRAPH_EXTRACTION_PROMPT.encode(encoding="utf-8", errors="strict")
+            )
     summarize_descriptions = prompts_dir / "summarize_descriptions.txt"
     if not summarize_descriptions.exists():
-        with summarize_descriptions.open("w") as file:
-            file.write(SUMMARIZE_PROMPT)
+        with summarize_descriptions.open("wb") as file:
+            file.write(SUMMARIZE_PROMPT.encode(encoding="utf-8", errors="strict"))
     claim_extraction = prompts_dir / "claim_extraction.txt"
     if not claim_extraction.exists():
-        with claim_extraction.open("w") as file:
-            file.write(CLAIM_EXTRACTION_PROMPT)
+        with claim_extraction.open("wb") as file:
+            file.write(
+                CLAIM_EXTRACTION_PROMPT.encode(encoding="utf-8", errors="strict")
+            )
     community_report = prompts_dir / "community_report.txt"
     if not community_report.exists():
-        with community_report.open("w") as file:
-            file.write(COMMUNITY_REPORT_PROMPT)
+        with community_report.open("wb") as file:
+            file.write(
+                COMMUNITY_REPORT_PROMPT.encode(encoding="utf-8", errors="strict")
+            )
 def _create_default_config(
-    root: str, verbose: bool, dryrun: bool, reporter: ProgressReporter
+    root: str,
+    config: str | None,
+    verbose: bool,
+    dryrun: bool,
+    reporter: ProgressReporter,
 ) -> PipelineConfig:
-    """Create a default config if none is provided."""
+    """Overlay default values on an existing config or create a default config if none is provided."""
+    if config and not Path(config).exists():
+        msg = f"Configuration file {config} does not exist"
+        raise ValueError
     if not Path(root).exists():
         msg = f"Root directory {root} does not exist"
         raise ValueError(msg)
-    parameters = _read_config_parameters(root, reporter)
+    parameters = _read_config_parameters(root, config, reporter)
     log.info(
         "using default configuration: %s",
         redact(parameters.model_dump()),
@@ -237,27 +256,35 @@ def _create_default_config(
     return result
-def _read_config_parameters(root: str, reporter: ProgressReporter):
+def _read_config_parameters(root: str, config: str | None, reporter: ProgressReporter):
     _root = Path(root)
-    settings_yaml = _root / "settings.yaml"
+    settings_yaml = (
+        Path(config)
+        if config and Path(config).suffix in [".yaml", ".yml"]
+        else _root / "settings.yaml"
+    )
     if not settings_yaml.exists():
         settings_yaml = _root / "settings.yml"
-    settings_json = _root / "settings.json"
+    settings_json = (
+        Path(config)
+        if config and Path(config).suffix == ".json"
+        else _root / "settings.json"
+    )
     if settings_yaml.exists():
         reporter.success(f"Reading settings from {settings_yaml}")
-        with settings_yaml.open("r") as file:
+        with settings_yaml.open("rb") as file:
             import yaml
-            data = yaml.safe_load(file)
+            data = yaml.safe_load(file.read().decode(encoding="utf-8", errors="strict"))
             return create_graphrag_config(data, root)
     if settings_json.exists():
         reporter.success(f"Reading settings from {settings_json}")
-        with settings_json.open("r") as file:
+        with settings_json.open("rb") as file:
             import json
-            data = json.loads(file.read())
+            data = json.loads(file.read().decode(encoding="utf-8", errors="strict"))
             return create_graphrag_config(data, root)
     reporter.success("Reading settings from environment variables")
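
Two things change in `_read_config_parameters`: the settings file can now come from the `config` path (selected by suffix), and files are read as bytes and decoded explicitly as UTF-8 rather than relying on the platform's default text encoding. A minimal sketch of both, assuming the same candidate order as the hunk above; the function names are hypothetical, and only the JSON branch is parsed so the example needs nothing beyond the standard library.

```python
import json
from pathlib import Path
from typing import Optional


def pick_settings_file(root: str, config: Optional[str]) -> Optional[Path]:
    """Return the file that would be read, or None for the env-var fallback."""
    _root = Path(root)
    # An explicit YAML config replaces <root>/settings.yaml as the first candidate.
    yaml_path = (
        Path(config)
        if config and Path(config).suffix in (".yaml", ".yml")
        else _root / "settings.yaml"
    )
    if not yaml_path.exists():
        yaml_path = _root / "settings.yml"
    # An explicit JSON config replaces <root>/settings.json.
    json_path = (
        Path(config)
        if config and Path(config).suffix == ".json"
        else _root / "settings.json"
    )
    if yaml_path.exists():
        return yaml_path
    if json_path.exists():
        return json_path
    return None


def read_json_utf8(path: Path) -> dict:
    # Decode explicitly so the result does not depend on the locale encoding.
    return json.loads(path.read_bytes().decode(encoding="utf-8", errors="strict"))
```

One consequence visible in this sketch: because the YAML candidates are checked first, a `settings.yaml` under `root` would be chosen even when `config` points at a `.json` file. Separately, the missing-config branch of `_create_default_config` as shown in the diff builds `msg` but raises a bare `ValueError`, so that message is not attached to the exception.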

{graphrag-0.1.1.dev4 → graphrag-0.1.2.dev48}/graphrag/index/graph/extractors/claims/claim_extractor.py RENAMED

@@ -10,6 +10,7 @@ from typing import Any
 import tiktoken
+import graphrag.config.defaults as defs
 from graphrag.index.typing import ErrorHandlerFn
 from graphrag.llm import CompletionLLM
@@ -80,7 +81,9 @@ class ClaimExtractor:
         self._input_resolved_entities_key = (
             input_resolved_entities_key or "resolved_entities"
         )
-        self._max_gleanings = max_gleanings if max_gleanings is not None else 0
+        self._max_gleanings = (
+            max_gleanings if max_gleanings is not None else defs.CLAIM_MAX_GLEANINGS
+        )
         self._on_error = on_error or (lambda _e, _s, _d: None)
         # Construct the looping arguments

{graphrag-0.1.1.dev4 → graphrag-0.1.2.dev48}/graphrag/index/graph/extractors/graph/graph_extractor.py RENAMED

@@ -14,6 +14,7 @@ from typing import Any
 import networkx as nx
 import tiktoken
+import graphrag.config.defaults as defs
 from graphrag.index.typing import ErrorHandlerFn
 from graphrag.index.utils import clean_str
 from graphrag.llm import CompletionLLM
@@ -78,7 +79,11 @@ class GraphExtractor:
         )
         self._entity_types_key = entity_types_key or "entity_types"
         self._extraction_prompt = prompt or GRAPH_EXTRACTION_PROMPT
-        self._max_gleanings = max_gleanings if max_gleanings is not None else 0
+        self._max_gleanings = (
+            max_gleanings
+            if max_gleanings is not None
+            else defs.ENTITY_EXTRACTION_MAX_GLEANINGS
+        )
         self._on_error = on_error or (lambda _e, _s, _d: None)
         # Construct the looping arguments
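
Both extractors now fall back to the configured default (raised from 0 to 1 in `defaults.py`) only when `max_gleanings` is `None`. The sketch below shows why the `is not None` test matters once the default is nonzero; the function names are hypothetical and the constant simply mirrors the new default value.

```python
from typing import Optional

DEFAULT_MAX_GLEANINGS = 1  # mirrors the new default value shown in defaults.py


def resolve_max_gleanings(value: Optional[int]) -> int:
    # Same shape as the extractor code: an explicit 0 is respected.
    return value if value is not None else DEFAULT_MAX_GLEANINGS


def resolve_with_or(value: Optional[int]) -> int:
    # Contrast case: `or` treats 0 as missing and silently applies the default.
    return value or DEFAULT_MAX_GLEANINGS


print(resolve_max_gleanings(None))  # default applies
print(resolve_max_gleanings(0))     # gleaning stays disabled
print(resolve_with_or(0))           # default applies even though 0 was requested
```

The same distinction applies wherever the configuration code above combines a reader call with `or` and a nonzero default: a value of 0 would be replaced by the default.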

{graphrag-0.1.1.dev4 → graphrag-0.1.2.dev48}/graphrag/index/init_content.py RENAMED

@@ -24,6 +24,9 @@ llm:
   # max_retry_wait: {defs.LLM_MAX_RETRY_WAIT}
   # sleep_on_rate_limit_recommendation: true # whether to sleep when azure suggests wait-times
   # concurrent_requests: {defs.LLM_CONCURRENT_REQUESTS} # the number of parallel inflight requests that may be made
+  # temperature: {defs.LLM_TEMPERATURE} # temperature for sampling
+  # top_p: {defs.LLM_TOP_P} # top-p sampling
+  # n: {defs.LLM_N} # Number of completions to generate
 parallelization:
   stagger: {defs.PARALLELIZATION_STAGGER}
@@ -90,7 +93,7 @@ entity_extraction:
   ## async_mode: override the global async_mode settings for this task
   prompt: "prompts/entity_extraction.txt"
   entity_types: [{",".join(defs.ENTITY_EXTRACTION_ENTITY_TYPES)}]
-  max_gleanings: 0
+  max_gleanings: {defs.ENTITY_EXTRACTION_MAX_GLEANINGS}
 summarize_descriptions:
   ## llm: override the global llm settings for this task
@@ -108,7 +111,7 @@ claim_extraction:
   description: "{defs.CLAIM_DESCRIPTION}"
   max_gleanings: {defs.CLAIM_MAX_GLEANINGS}
-community_report:
+community_reports:
   ## llm: override the global llm settings for this task
   ## parallelization: override the global parallelization settings for this task
   ## async_mode: override the global async_mode settings for this task
@@ -141,9 +144,15 @@ local_search:
   # conversation_history_max_turns: {defs.LOCAL_SEARCH_CONVERSATION_HISTORY_MAX_TURNS}
   # top_k_mapped_entities: {defs.LOCAL_SEARCH_TOP_K_MAPPED_ENTITIES}
   # top_k_relationships: {defs.LOCAL_SEARCH_TOP_K_RELATIONSHIPS}
+  # llm_temperature: {defs.LOCAL_SEARCH_LLM_TEMPERATURE} # temperature for sampling
+  # llm_top_p: {defs.LOCAL_SEARCH_LLM_TOP_P} # top-p sampling
+  # llm_n: {defs.LOCAL_SEARCH_LLM_N} # Number of completions to generate
   # max_tokens: {defs.LOCAL_SEARCH_MAX_TOKENS}
 global_search:
+  # llm_temperature: {defs.GLOBAL_SEARCH_LLM_TEMPERATURE} # temperature for sampling
+  # llm_top_p: {defs.GLOBAL_SEARCH_LLM_TOP_P} # top-p sampling
+  # llm_n: {defs.GLOBAL_SEARCH_LLM_N} # Number of completions to generate
   # max_tokens: {defs.GLOBAL_SEARCH_MAX_TOKENS}
   # data_max_tokens: {defs.GLOBAL_SEARCH_DATA_MAX_TOKENS}
   # map_max_tokens: {defs.GLOBAL_SEARCH_MAP_MAX_TOKENS}
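
The init template interpolates values from the defaults module, so replacing the hard-coded `max_gleanings: 0` with a `{defs...}` placeholder keeps the generated `settings.yaml` in step with `defaults.py`. A small sketch of that mechanism, using a hypothetical `SimpleNamespace` in place of the real `defs` module and only a fragment of the template:

```python
from types import SimpleNamespace

# Hypothetical stand-in for graphrag.config.defaults, limited to values shown in the diff.
defs = SimpleNamespace(
    LLM_TEMPERATURE=0,
    LLM_TOP_P=1,
    LLM_N=1,
    ENTITY_EXTRACTION_MAX_GLEANINGS=1,
)

# Fragment of an f-string template in the style of the init content above.
SNIPPET = f"""\
llm:
  # temperature: {defs.LLM_TEMPERATURE} # temperature for sampling
  # top_p: {defs.LLM_TOP_P} # top-p sampling
  # n: {defs.LLM_N} # Number of completions to generate

entity_extraction:
  max_gleanings: {defs.ENTITY_EXTRACTION_MAX_GLEANINGS}
"""

print(SNIPPET)
```

With the placeholder in place, a later change to the default only has to be made in one module.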
