greenmining 1.1.7.tar.gz → 1.1.9.tar.gz
This diff shows the content changes between two publicly released versions of the package, as published to a supported registry. It is provided for informational purposes only and reflects the packages exactly as they appear in their public registry.
- {greenmining-1.1.7 → greenmining-1.1.9}/CHANGELOG.md +1 -1
- {greenmining-1.1.7/greenmining.egg-info → greenmining-1.1.9}/PKG-INFO +10 -10
- {greenmining-1.1.7 → greenmining-1.1.9}/README.md +9 -9
- {greenmining-1.1.7 → greenmining-1.1.9}/greenmining/__init__.py +1 -1
- {greenmining-1.1.7 → greenmining-1.1.9}/greenmining/analyzers/metrics_power_correlator.py +1 -1
- {greenmining-1.1.7 → greenmining-1.1.9}/greenmining/analyzers/power_regression.py +0 -1
- {greenmining-1.1.7 → greenmining-1.1.9}/greenmining/analyzers/qualitative_analyzer.py +1 -1
- {greenmining-1.1.7 → greenmining-1.1.9}/greenmining/analyzers/statistical_analyzer.py +0 -32
- greenmining-1.1.9/greenmining/config.py +91 -0
- {greenmining-1.1.7 → greenmining-1.1.9}/greenmining/controllers/repository_controller.py +77 -16
- {greenmining-1.1.7 → greenmining-1.1.9}/greenmining/energy/codecarbon_meter.py +0 -21
- {greenmining-1.1.7 → greenmining-1.1.9}/greenmining/energy/cpu_meter.py +1 -1
- {greenmining-1.1.7 → greenmining-1.1.9}/greenmining/gsf_patterns.py +41 -0
- {greenmining-1.1.7 → greenmining-1.1.9}/greenmining/models/aggregated_stats.py +1 -1
- {greenmining-1.1.7 → greenmining-1.1.9}/greenmining/models/commit.py +0 -1
- {greenmining-1.1.7 → greenmining-1.1.9}/greenmining/models/repository.py +1 -1
- {greenmining-1.1.7 → greenmining-1.1.9}/greenmining/services/commit_extractor.py +2 -41
- {greenmining-1.1.7 → greenmining-1.1.9}/greenmining/services/data_aggregator.py +1 -6
- {greenmining-1.1.7 → greenmining-1.1.9}/greenmining/services/data_analyzer.py +1 -57
- {greenmining-1.1.7 → greenmining-1.1.9}/greenmining/services/local_repo_analyzer.py +1 -2
- {greenmining-1.1.7 → greenmining-1.1.9}/greenmining/services/reports.py +1 -6
- {greenmining-1.1.7 → greenmining-1.1.9}/greenmining/utils.py +0 -87
- {greenmining-1.1.7 → greenmining-1.1.9/greenmining.egg-info}/PKG-INFO +10 -10
- {greenmining-1.1.7 → greenmining-1.1.9}/greenmining.egg-info/SOURCES.txt +0 -4
- {greenmining-1.1.7 → greenmining-1.1.9}/pyproject.toml +1 -1
- greenmining-1.1.7/greenmining/__version__.py +0 -3
- greenmining-1.1.7/greenmining/config.py +0 -200
- greenmining-1.1.7/greenmining/services/github_fetcher.py +0 -2
- {greenmining-1.1.7 → greenmining-1.1.9}/LICENSE +0 -0
- {greenmining-1.1.7 → greenmining-1.1.9}/MANIFEST.in +0 -0
- {greenmining-1.1.7 → greenmining-1.1.9}/greenmining/__main__.py +0 -0
- {greenmining-1.1.7 → greenmining-1.1.9}/greenmining/analyzers/__init__.py +0 -0
- {greenmining-1.1.7 → greenmining-1.1.9}/greenmining/analyzers/code_diff_analyzer.py +0 -0
- {greenmining-1.1.7 → greenmining-1.1.9}/greenmining/analyzers/temporal_analyzer.py +0 -0
- {greenmining-1.1.7 → greenmining-1.1.9}/greenmining/analyzers/version_power_analyzer.py +0 -0
- {greenmining-1.1.7 → greenmining-1.1.9}/greenmining/controllers/__init__.py +0 -0
- {greenmining-1.1.7 → greenmining-1.1.9}/greenmining/energy/__init__.py +0 -0
- {greenmining-1.1.7 → greenmining-1.1.9}/greenmining/energy/base.py +0 -0
- {greenmining-1.1.7 → greenmining-1.1.9}/greenmining/energy/carbon_reporter.py +0 -0
- {greenmining-1.1.7 → greenmining-1.1.9}/greenmining/energy/rapl.py +0 -0
- {greenmining-1.1.7 → greenmining-1.1.9}/greenmining/models/__init__.py +0 -0
- {greenmining-1.1.7 → greenmining-1.1.9}/greenmining/models/analysis_result.py +0 -0
- {greenmining-1.1.7 → greenmining-1.1.9}/greenmining/presenters/__init__.py +0 -0
- {greenmining-1.1.7 → greenmining-1.1.9}/greenmining/presenters/console_presenter.py +0 -0
- {greenmining-1.1.7 → greenmining-1.1.9}/greenmining/services/__init__.py +0 -0
- {greenmining-1.1.7 → greenmining-1.1.9}/greenmining/services/github_graphql_fetcher.py +0 -0
- {greenmining-1.1.7 → greenmining-1.1.9}/greenmining.egg-info/dependency_links.txt +0 -0
- {greenmining-1.1.7 → greenmining-1.1.9}/greenmining.egg-info/requires.txt +0 -0
- {greenmining-1.1.7 → greenmining-1.1.9}/greenmining.egg-info/top_level.txt +0 -0
- {greenmining-1.1.7 → greenmining-1.1.9}/setup.cfg +0 -0
- {greenmining-1.1.7 → greenmining-1.1.9}/setup.py +0 -0
--- greenmining-1.1.7/greenmining.egg-info/PKG-INFO
+++ greenmining-1.1.9/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: greenmining
-Version: 1.1.7
+Version: 1.1.9
 Summary: An empirical Python library for Mining Software Repositories (MSR) in Green IT research
 Author-email: Adam Bouafia <a.bouafia@student.vu.nl>
 License: MIT
@@ -68,9 +68,9 @@ An empirical Python library for Mining Software Repositories (MSR) in Green IT research
 
 `greenmining` is a research-grade Python library designed for **empirical Mining Software Repositories (MSR)** studies in **Green IT**. It enables researchers and practitioners to:
 
-- **Mine repositories at scale** - Fetch and analyze GitHub repositories via GraphQL API with configurable filters
-
-- **Classify green commits** - Detect
+- **Mine repositories at scale** - Search, Fetch and analyze GitHub repositories via GraphQL API with configurable filters
+
+- **Classify green commits** - Detect 124 sustainability patterns from the Green Software Foundation (GSF) catalog
 - **Analyze any repository by URL** - Direct Git-based analysis with support for private repositories
 - **Measure energy consumption** - RAPL, CodeCarbon, and CPU Energy Meter backends for power profiling
 - **Carbon footprint reporting** - CO2 emissions calculation with 20+ country profiles and cloud region support
@@ -113,7 +113,7 @@ docker pull adambouafia/greenmining:latest
 from greenmining import GSF_PATTERNS, is_green_aware, get_pattern_by_keywords
 
 # Check available patterns
-print(f"Total patterns: {len(GSF_PATTERNS)}") #
+print(f"Total patterns: {len(GSF_PATTERNS)}") # 124 patterns across 15 categories
 
 # Detect green awareness in commit messages
 commit_msg = "Optimize Redis caching to reduce energy consumption"
@@ -670,8 +670,8 @@ config = Config(
 
 ### Core Capabilities
 
-- **Pattern Detection**:
-- **Keyword Analysis**:
+- **Pattern Detection**: 124 sustainability patterns across 15 categories from the GSF catalog
+- **Keyword Analysis**: 332 green software detection keywords
 - **Repository Fetching**: GraphQL API with date, star, and language filters
 - **URL-Based Analysis**: Direct Git-based analysis from GitHub URLs (HTTPS and SSH)
 - **Batch Processing**: Parallel analysis of multiple repositories with configurable workers
@@ -739,7 +739,7 @@ print(f"Equivalent: {report.tree_months:.2f} tree-months to offset")
 
 ### Pattern Database
 
-**
+**124 green software patterns based on:**
 - Green Software Foundation (GSF) Patterns Catalog
 - VU Amsterdam 2024 research on ML system sustainability
 - ICSE 2024 conference papers on sustainable software
@@ -749,11 +749,11 @@ print(f"Equivalent: {report.tree_months:.2f} tree-months to offset")
 - **Coverage**: 67% of patterns actively detect in real-world commits
 - **Accuracy**: 100% true positive rate for green-aware commits
 - **Categories**: 15 distinct sustainability domains covered
-- **Keywords**:
+- **Keywords**: 332 detection terms across all patterns
 
 ## GSF Pattern Categories
 
-**
+**124 patterns across 15 categories:**
 
 ### 1. Cloud (40 patterns)
 Auto-scaling, serverless computing, right-sizing instances, region selection for renewable energy, spot instances, idle resource detection, cloud-native architectures
--- greenmining-1.1.7/README.md
+++ greenmining-1.1.9/README.md
@@ -11,9 +11,9 @@ An empirical Python library for Mining Software Repositories (MSR) in Green IT research
 
 `greenmining` is a research-grade Python library designed for **empirical Mining Software Repositories (MSR)** studies in **Green IT**. It enables researchers and practitioners to:
 
-- **Mine repositories at scale** - Fetch and analyze GitHub repositories via GraphQL API with configurable filters
-
-- **Classify green commits** - Detect
+- **Mine repositories at scale** - Search, Fetch and analyze GitHub repositories via GraphQL API with configurable filters
+
+- **Classify green commits** - Detect 124 sustainability patterns from the Green Software Foundation (GSF) catalog
 - **Analyze any repository by URL** - Direct Git-based analysis with support for private repositories
 - **Measure energy consumption** - RAPL, CodeCarbon, and CPU Energy Meter backends for power profiling
 - **Carbon footprint reporting** - CO2 emissions calculation with 20+ country profiles and cloud region support
@@ -56,7 +56,7 @@ docker pull adambouafia/greenmining:latest
 from greenmining import GSF_PATTERNS, is_green_aware, get_pattern_by_keywords
 
 # Check available patterns
-print(f"Total patterns: {len(GSF_PATTERNS)}") #
+print(f"Total patterns: {len(GSF_PATTERNS)}") # 124 patterns across 15 categories
 
 # Detect green awareness in commit messages
 commit_msg = "Optimize Redis caching to reduce energy consumption"
@@ -613,8 +613,8 @@ config = Config(
 
 ### Core Capabilities
 
-- **Pattern Detection**:
-- **Keyword Analysis**:
+- **Pattern Detection**: 124 sustainability patterns across 15 categories from the GSF catalog
+- **Keyword Analysis**: 332 green software detection keywords
 - **Repository Fetching**: GraphQL API with date, star, and language filters
 - **URL-Based Analysis**: Direct Git-based analysis from GitHub URLs (HTTPS and SSH)
 - **Batch Processing**: Parallel analysis of multiple repositories with configurable workers
@@ -682,7 +682,7 @@ print(f"Equivalent: {report.tree_months:.2f} tree-months to offset")
 
 ### Pattern Database
 
-**
+**124 green software patterns based on:**
 - Green Software Foundation (GSF) Patterns Catalog
 - VU Amsterdam 2024 research on ML system sustainability
 - ICSE 2024 conference papers on sustainable software
@@ -692,11 +692,11 @@ print(f"Equivalent: {report.tree_months:.2f} tree-months to offset")
 - **Coverage**: 67% of patterns actively detect in real-world commits
 - **Accuracy**: 100% true positive rate for green-aware commits
 - **Categories**: 15 distinct sustainability domains covered
-- **Keywords**:
+- **Keywords**: 332 detection terms across all patterns
 
 ## GSF Pattern Categories
 
-**
+**124 patterns across 15 categories:**
 
 ### 1. Cloud (40 patterns)
 Auto-scaling, serverless computing, right-sizing instances, region selection for renewable energy, spot instances, idle resource detection, cloud-native architectures
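The README quick-start excerpt above breaks off right after `commit_msg` is defined. A minimal sketch of how the imported helpers might be called; the import line, the pattern count, and `commit_msg` come from the diff, while the helpers' exact return values are assumptions:

```python
# Hedged sketch: continues the README quick-start shown in the hunk above.
# is_green_aware() and get_pattern_by_keywords() signatures are assumed;
# only the import line is confirmed by this diff.
from greenmining import GSF_PATTERNS, is_green_aware, get_pattern_by_keywords

print(f"Total patterns: {len(GSF_PATTERNS)}")  # 124 patterns across 15 categories

commit_msg = "Optimize Redis caching to reduce energy consumption"
print(is_green_aware(commit_msg))           # expected True: caching + energy keywords
print(get_pattern_by_keywords(commit_msg))  # expected: the matching GSF pattern(s)
```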
--- greenmining-1.1.7/greenmining/analyzers/statistical_analyzer.py
+++ greenmining-1.1.9/greenmining/analyzers/statistical_analyzer.py
@@ -135,38 +135,6 @@ class StatisticalAnalyzer:
             "significant": bool(p_value < 0.05),
         }
 
-    def pattern_adoption_rate_analysis(self, commits_df: pd.DataFrame) -> Dict[str, Any]:
-        # Analyze pattern adoption rates over repository lifetime.
-        results = {}
-
-        for pattern in commits_df["pattern"].unique():
-            pattern_commits = commits_df[commits_df["pattern"] == pattern].sort_values("date")
-
-            if len(pattern_commits) == 0:
-                continue
-
-            # Time to first adoption
-            first_adoption = pattern_commits.iloc[0]["date"]
-            repo_start = commits_df["date"].min()
-            ttfa_days = (first_adoption - repo_start).days
-
-            # Adoption frequency over time
-            monthly_adoption = pattern_commits.set_index("date").resample("ME").size()
-
-            # Pattern stickiness (months with at least one adoption)
-            total_months = len(commits_df.set_index("date").resample("ME").size())
-            active_months = len(monthly_adoption[monthly_adoption > 0])
-            stickiness = active_months / total_months if total_months > 0 else 0
-
-            results[pattern] = {
-                "ttfa_days": ttfa_days,
-                "total_adoptions": len(pattern_commits),
-                "stickiness": stickiness,
-                "monthly_adoption_rate": monthly_adoption.mean(),
-            }
-
-        return results
-
     def _interpret_correlations(self, significant_pairs: List[Dict[str, Any]]) -> str:
         # Generate interpretation of correlation results.
         if not significant_pairs:
--- /dev/null
+++ greenmining-1.1.9/greenmining/config.py
@@ -0,0 +1,91 @@
+import os
+from pathlib import Path
+from typing import Any, Dict, List
+
+from dotenv import load_dotenv
+
+
+def _load_yaml_config(yaml_path: Path) -> Dict[str, Any]:
+    # Load configuration from YAML file if it exists.
+    if not yaml_path.exists():
+        return {}
+    try:
+        import yaml
+
+        with open(yaml_path, "r") as f:
+            return yaml.safe_load(f) or {}
+    except ImportError:
+        return {}
+    except Exception:
+        return {}
+
+
+class Config:
+    # Configuration class for loading from env vars and YAML.
+
+    def __init__(self, env_file: str = ".env", yaml_file: str = "greenmining.yaml"):
+        # Initialize configuration from environment and YAML file.
+        env_path = Path(env_file)
+        if env_path.exists():
+            load_dotenv(env_path)
+        else:
+            load_dotenv()
+
+        # Load YAML config
+        yaml_path = Path(yaml_file)
+        self._yaml_config = _load_yaml_config(yaml_path)
+
+        # GitHub API Configuration
+        self.GITHUB_TOKEN = os.getenv("GITHUB_TOKEN")
+        if not self.GITHUB_TOKEN or self.GITHUB_TOKEN == "your_github_pat_here":
+            raise ValueError("GITHUB_TOKEN not set. Please set it in .env file or environment.")
+
+        # Search Configuration (YAML: sources.search.*)
+        yaml_search = self._yaml_config.get("sources", {}).get("search", {})
+
+        self.SUPPORTED_LANGUAGES: List[str] = yaml_search.get(
+            "languages",
+            [
+                "Python",
+                "JavaScript",
+                "TypeScript",
+                "Java",
+                "C++",
+                "C#",
+                "Go",
+                "Rust",
+                "PHP",
+                "Ruby",
+                "Swift",
+                "Kotlin",
+                "Scala",
+                "R",
+                "MATLAB",
+                "Dart",
+                "Lua",
+                "Perl",
+                "Haskell",
+                "Elixir",
+            ],
+        )
+
+        # Repository Limits
+        self.MIN_STARS = yaml_search.get("min_stars", int(os.getenv("MIN_STARS", "100")))
+        self.MAX_REPOS = int(os.getenv("MAX_REPOS", "100"))
+
+        # Output Configuration (YAML: output.directory)
+        yaml_output = self._yaml_config.get("output", {})
+        self.OUTPUT_DIR = Path(yaml_output.get("directory", os.getenv("OUTPUT_DIR", "./data")))
+        self.OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
+
+        # File Paths
+        self.REPOS_FILE = self.OUTPUT_DIR / "repositories.json"
+
+    def __repr__(self) -> str:
+        # String representation of configuration (hiding sensitive data).
+        return (
+            f"Config("
+            f"MAX_REPOS={self.MAX_REPOS}, "
+            f"OUTPUT_DIR={self.OUTPUT_DIR}"
+            f")"
+        )
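Since 1.1.9 drops the module-level `get_config()` singleton (the deleted 1.1.7 `config.py` appears at the end of this diff), callers construct `Config` directly. A minimal usage sketch based only on the attributes visible in the hunk above:

```python
# Hedged sketch: driving the slimmed-down 1.1.9 Config shown above.
# Raises ValueError unless GITHUB_TOKEN is set in .env or the environment.
from greenmining.config import Config

config = Config(env_file=".env", yaml_file="greenmining.yaml")

print(config.MIN_STARS)            # 100 unless YAML sources.search.min_stars or MIN_STARS overrides it
print(config.MAX_REPOS)            # 100 unless the MAX_REPOS env var is set
print(config.SUPPORTED_LANGUAGES)  # the 20-language default list from the hunk above
print(config.OUTPUT_DIR)           # ./data by default; created during __init__
print(config.REPOS_FILE)           # <OUTPUT_DIR>/repositories.json
print(config)                      # repr deliberately omits the token
```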
--- greenmining-1.1.7/greenmining/controllers/repository_controller.py
+++ greenmining-1.1.9/greenmining/controllers/repository_controller.py
@@ -1,6 +1,9 @@
-# Repository Controller - Handles repository fetching operations.
-
-
+# Repository Controller - Handles repository fetching + cloning operations.
+import os
+import re
+import shutil
+from pathlib import Path
+from typing import List, Dict
 
 from greenmining.config import Config
 from greenmining.models.repository import Repository
@@ -15,23 +18,81 @@ class RepositoryController:
         # Initialize controller with configuration.
         self.config = config
         self.graphql_fetcher = GitHubGraphQLFetcher(config.GITHUB_TOKEN)
-
-
-
-
-
-
-
-
-
-
-
-
+        self.repos_dir = Path.cwd() / "greenmining_repos"
+
+    def _sanitize_repo_name(self, repo: Repository, index: int = 0) -> str:
+        """Safe unique dir name: owner_repo[_index]. Handles case collisions."""
+        base = re.sub(r'[^a-z0-9-]', '_', repo.full_name.replace('/', '_').lower())
+        name = f"{base}_{index}" if index else base
+        path = self.repos_dir / name
+        counter = 1
+        while path.exists():
+            name = f"{base}_{counter}"
+            path = self.repos_dir / name
+            counter += 1
+        return name
+
+    def clone_repositories(
+        self,
+        repositories: List[Repository],
+        github_token: str = None,
+        cleanup: bool = True,
+        depth: int = 1  # Shallow clone
+    ) -> List[Dict]:
+        """Clone repos to ./greenmining_repos/ with unique sanitized names."""
+        self.repos_dir.mkdir(exist_ok=True)
+        if cleanup:
+            shutil.rmtree(self.repos_dir, ignore_errors=True)
+            self.repos_dir.mkdir(exist_ok=True)
+            colored_print(f"Cleaned {self.repos_dir}", "yellow")
+
+        results = []
+        for i, repo in enumerate(repositories, 1):
+            safe_name = self._sanitize_repo_name(repo, i)
+            clone_path = self.repos_dir / safe_name
+
+            colored_print(f"[{i}/{len(repositories)}] Cloning {repo.full_name} → {safe_name}", "cyan")
+
+            url = f"https://{github_token}@github.com/{repo.full_name}.git" if github_token else repo.url
+            cmd = ["git", "clone", f"--depth={depth}", "-v", url, str(clone_path)]
+
+            import subprocess
+            try:
+                subprocess.check_call(cmd, cwd=self.repos_dir.parent)
+                colored_print(f"{safe_name}", "green")
+                results.append({
+                    "full_name": repo.full_name,
+                    "local_path": str(clone_path),
+                    "success": True
+                })
+            except subprocess.CalledProcessError as e:
+                colored_print(f"{safe_name}: {e}", "red")
+                results.append({
+                    "full_name": repo.full_name,
+                    "local_path": str(clone_path),
+                    "success": False,
+                    "error": str(e)
+                })
+
+        # Save map for analyze_repositories
+        save_json_file(results, self.repos_dir / "clone_results.json")
+        success_rate = sum(1 for r in results if r["success"]) / len(results) * 100
+        colored_print(f"Cloned: {success_rate:.1f}% ({self.repos_dir}/clone_results.json)", "green")
+        return results
+
+
+
+
+
+    def fetch_repositories(self, max_repos: int = None, min_stars: int = None,
+                           languages: list[str] = None, keywords: str = None,
+                           created_after: str = None, created_before: str = None,
+                           pushed_after: str = None, pushed_before: str = None) -> list[Repository]:
         # Fetch repositories from GitHub using GraphQL API.
         max_repos = max_repos or self.config.MAX_REPOS
         min_stars = min_stars or self.config.MIN_STARS
         languages = languages or self.config.SUPPORTED_LANGUAGES
-        keywords = keywords
+        keywords = keywords
 
         colored_print(f"Fetching up to {max_repos} repositories...", "cyan")
         colored_print(f"  Keywords: {keywords}", "cyan")
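A short sketch of the resulting fetch-then-clone flow, combining the two method signatures from the hunks above; the filter values and the cloned-repository count are illustrative:

```python
# Hedged sketch: 1.1.9 fetch + shallow-clone flow per the hunks above.
from greenmining.config import Config
from greenmining.controllers.repository_controller import RepositoryController

config = Config()
controller = RepositoryController(config)

repos = controller.fetch_repositories(max_repos=5, min_stars=500, keywords="microservices")

# depth=1 shallow clones land in ./greenmining_repos/<owner_repo_N>/ and a
# per-repo success map is written to greenmining_repos/clone_results.json.
results = controller.clone_repositories(repos, github_token=config.GITHUB_TOKEN)
for r in results:
    print(r["full_name"], "->", r["local_path"], "ok" if r["success"] else r["error"])
```

Note that `cleanup=True` (the default) wipes `greenmining_repos/` before cloning; pass `cleanup=False` to keep clones from earlier runs.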
--- greenmining-1.1.7/greenmining/energy/codecarbon_meter.py
+++ greenmining-1.1.9/greenmining/energy/codecarbon_meter.py
@@ -124,24 +124,3 @@ class CodeCarbonMeter(EnergyMeter):
             end_time=datetime.fromtimestamp(end_time),
         )
 
-    def get_carbon_intensity(self) -> Optional[float]:
-        # Get current carbon intensity for the configured region.
-        if not self._codecarbon_available:
-            return None
-
-        try:
-            from codecarbon import EmissionsTracker
-
-            # Create temporary tracker to get carbon intensity
-            tracker = EmissionsTracker(
-                project_name="carbon_check",
-                country_iso_code=self.country_iso_code,
-                save_to_file=False,
-                log_level="error",
-            )
-            tracker.start()
-            tracker.stop()
-
-            return getattr(tracker, "_carbon_intensity", None)
-        except Exception:
-            return None
--- greenmining-1.1.7/greenmining/gsf_patterns.py
+++ greenmining-1.1.9/greenmining/gsf_patterns.py
@@ -254,6 +254,35 @@ GSF_PATTERNS = {
         "description": "Choose hardware optimized for energy efficiency",
         "sci_impact": "Direct reduction in energy consumption",
     },
+    "match_preconfigured_server": {
+        "name": "Match Utilization Requirements with Pre-configured Servers",
+        "category": "cloud",
+        "keywords": [
+            "pre-configured server",
+            "energy proportionality",
+            "server utilization",
+            "oversized server",
+            "underutilized server",
+            "server consolidation",
+        ],
+        "description": "Select pre-configured servers that match utilization needs; one highly utilized server is more energy-efficient than two underutilized ones",
+        "sci_impact": "Higher utilization improves energy proportionality; fewer servers reduces embodied carbon",
+    },
+    "optimize_customer_device_impact": {
+        "name": "Optimize Impact on Customer Devices and Equipment",
+        "category": "cloud",
+        "keywords": [
+            "customer device",
+            "backward compatible",
+            "backwards compatible",
+            "older hardware",
+            "device lifetime",
+            "older browser",
+            "end-of-life hardware",
+        ],
+        "description": "Design software to extend customer hardware lifetimes through backward compatibility with older devices, browsers, and operating systems",
+        "sci_impact": "Extending device lifetimes reduces embodied carbon; optimizing for older hardware may also reduce energy intensity",
+    },
     # ==================== WEB PATTERNS (15+) ====================
     "avoid_chaining_requests": {
         "name": "Avoid Chaining Critical Requests",
@@ -1555,6 +1584,18 @@ GREEN_KEYWORDS = [
     "workload",
     "overhead",
     "footprint",
+    # Server utilization & customer device patterns
+    "pre-configured server",
+    "energy proportionality",
+    "server consolidation",
+    "underutilized server",
+    "oversized server",
+    "backward compatible",
+    "backwards compatible",
+    "customer device",
+    "device lifetime",
+    "older browser",
+    "end-of-life hardware",
 ]
 
 
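Both new entries follow the existing `GSF_PATTERNS` dict shape (name, category, keywords, description, sci_impact), so they are reachable with plain dictionary access; a small sketch grounded in the hunk above:

```python
# Sketch: inspecting the two GSF patterns added in 1.1.9.
# The dict structure is taken directly from the hunk above.
from greenmining import GSF_PATTERNS

for key in ("match_preconfigured_server", "optimize_customer_device_impact"):
    pattern = GSF_PATTERNS[key]
    print(f"{pattern['name']} [{pattern['category']}]")
    print("  keywords:", ", ".join(pattern["keywords"]))
    print("  SCI impact:", pattern["sci_impact"])
```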
--- greenmining-1.1.7/greenmining/services/commit_extractor.py
+++ greenmining-1.1.9/greenmining/services/commit_extractor.py
@@ -2,21 +2,17 @@
 
 from __future__ import annotations
 
-import json
 from datetime import datetime, timedelta
 from pathlib import Path
-from typing import Any
+from typing import Any
 
 from github import Github
 from tqdm import tqdm
 
-from greenmining.config import get_config
 from greenmining.models.repository import Repository
 from greenmining.utils import (
     colored_print,
     format_timestamp,
-    load_json_file,
-    print_banner,
     retry_on_exception,
     save_json_file,
 )
@@ -110,8 +106,7 @@ class CommitExtractor:
         try:
             # Get repository from GitHub API
             if not self.github:
-
-                self.github = Github(config.GITHUB_TOKEN)
+                raise ValueError("github_token is required for commit extraction")
 
             gh_repo = self.github.get_repo(repo_name)
 
@@ -143,40 +138,6 @@ class CommitExtractor:
 
         return commits
 
-    def _extract_commit_metadata(self, commit, repo_name: str) -> dict[str, Any]:
-        # Extract metadata from commit object.
-        # Get modified files
-        files_changed = []
-        lines_added = 0
-        lines_deleted = 0
-
-        try:
-            for modified_file in commit.modified_files:
-                files_changed.append(modified_file.filename)
-                lines_added += modified_file.added_lines
-                lines_deleted += modified_file.deleted_lines
-        except Exception:
-            pass
-
-        return {
-            "commit_id": commit.hash,
-            "repo_name": repo_name,
-            "date": commit.committer_date.isoformat(),
-            "author": commit.author.name,
-            "author_email": commit.author.email,
-            "message": commit.msg.strip(),
-            "files_changed": files_changed[:20],  # Limit to 20 files
-            "lines_added": lines_added,
-            "lines_deleted": lines_deleted,
-            "insertions": lines_added,
-            "deletions": lines_deleted,
-            "is_merge": commit.merge,
-            "branches": (
-                list(commit.branches) if hasattr(commit, "branches") and commit.branches else []
-            ),
-            "in_main_branch": commit.in_main_branch if hasattr(commit, "in_main_branch") else True,
-        }
-
     def _extract_commit_metadata_from_github(self, commit, repo_name: str) -> dict[str, Any]:
         # Extract metadata from GitHub API commit object.
         # Get modified files and stats
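The middle hunk above is a behavior change: 1.1.7 silently built a client from the global config when `self.github` was unset, while 1.1.9 raises instead, so the token must be supplied when the extractor is constructed. A hedged sketch of the underlying PyGithub calls the class relies on (the repository name is a hypothetical example; `CommitExtractor`'s constructor is not shown in this diff):

```python
# Hedged sketch: 1.1.9 requires an authenticated client up front; there is
# no fallback to config.GITHUB_TOKEN during extraction anymore.
import os
from github import Github  # PyGithub, as imported in the hunk above

gh = Github(os.environ["GITHUB_TOKEN"])       # authenticate explicitly
gh_repo = gh.get_repo("octocat/Hello-World")  # hypothetical repo_name
print(gh_repo.full_name, gh_repo.pushed_at)
```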
--- greenmining-1.1.7/greenmining/services/data_aggregator.py
+++ greenmining-1.1.9/greenmining/services/data_aggregator.py
@@ -2,26 +2,21 @@
 
 from __future__ import annotations
 
-import json
 from collections import defaultdict
 from pathlib import Path
-from typing import Any
+from typing import Any
 
 import pandas as pd
 
 from greenmining.analyzers import (
     StatisticalAnalyzer,
     TemporalAnalyzer,
-    QualitativeAnalyzer,
 )
-from greenmining.config import get_config
 from greenmining.models.repository import Repository
 from greenmining.utils import (
     colored_print,
     format_number,
     format_percentage,
-    load_json_file,
-    print_banner,
     save_csv_file,
     save_json_file,
 )
--- greenmining-1.1.7/greenmining/services/data_analyzer.py
+++ greenmining-1.1.9/greenmining/services/data_analyzer.py
@@ -2,18 +2,15 @@
 
 from __future__ import annotations
 
-import json
-import re
 from collections import Counter
 from pathlib import Path
-from typing import Any
+from typing import Any
 
 from tqdm import tqdm
 
 from greenmining.analyzers import (
     CodeDiffAnalyzer,
 )
-from greenmining.config import get_config
 from greenmining.gsf_patterns import (
     GREEN_KEYWORDS,
     GSF_PATTERNS,
@@ -22,11 +19,7 @@ from greenmining.gsf_patterns import (
 )
 from greenmining.utils import (
     colored_print,
-    create_checkpoint,
     format_timestamp,
-    load_checkpoint,
-    load_json_file,
-    print_banner,
     save_json_file,
 )
 
@@ -156,55 +149,6 @@ class DataAnalyzer:
 
         return result
 
-    def _check_green_awareness(self, message: str, files: list[str]) -> tuple[bool, Optional[str]]:
-        # Check if commit explicitly mentions green/energy concerns.
-        # Check message for green keywords
-        for keyword in self.GREEN_KEYWORDS:
-            if keyword in message:
-                # Extract context around keyword
-                pattern = rf".{{0,30}}{re.escape(keyword)}.{{0,30}}"
-                match = re.search(pattern, message, re.IGNORECASE)
-                if match:
-                    evidence = match.group(0).strip()
-                    return True, f"Keyword '{keyword}': {evidence}"
-
-        # Check file names for patterns
-        cache_files = [f for f in files if "cache" in f or "redis" in f]
-        if cache_files:
-            return True, f"Modified cache-related file: {cache_files[0]}"
-
-        perf_files = [f for f in files if "performance" in f or "optimization" in f]
-        if perf_files:
-            return True, f"Modified performance file: {perf_files[0]}"
-
-        return False, None
-
-    def _detect_known_pattern(self, message: str, files: list[str]) -> tuple[Optional[str], str]:
-        # Detect known green software pattern.
-        matches = []
-
-        # Check each pattern
-        for pattern_name, keywords in self.GREEN_PATTERNS.items():
-            for keyword in keywords:
-                if keyword in message:
-                    # Calculate confidence based on specificity
-                    confidence = "HIGH" if len(keyword) > 10 else "MEDIUM"
-                    matches.append((pattern_name, confidence, len(keyword)))
-
-        # Check file names for pattern hints
-        all_files = " ".join(files)
-        for pattern_name, keywords in self.GREEN_PATTERNS.items():
-            for keyword in keywords:
-                if keyword in all_files:
-                    matches.append((pattern_name, "MEDIUM", len(keyword)))
-
-        if not matches:
-            return "NONE DETECTED", "NONE"
-
-        # Return most specific match (longest keyword)
-        matches.sort(key=lambda x: x[2], reverse=True)
-        return matches[0][0], matches[0][1]
-
     def save_results(self, results: list[dict[str, Any]], output_file: Path):
         # Save analysis results to JSON file.
         # Calculate summary statistics
--- greenmining-1.1.7/greenmining/services/local_repo_analyzer.py
+++ greenmining-1.1.9/greenmining/services/local_repo_analyzer.py
@@ -5,13 +5,12 @@ from __future__ import annotations
 import os
 import re
 import shutil
-import subprocess
 import tempfile
 from concurrent.futures import ThreadPoolExecutor, as_completed
 from dataclasses import dataclass, field
 from datetime import datetime, timedelta
 from pathlib import Path
-from typing import Any, Dict, List, Optional
+from typing import Any, Dict, List, Optional
 
 from pydriller import Repository
 from pydriller.metrics.process.change_set import ChangeSet
--- greenmining-1.1.7/greenmining/services/reports.py
+++ greenmining-1.1.9/greenmining/services/reports.py
@@ -1,20 +1,15 @@
 # Report generation for green mining analysis.
-"""Report generation module for GreenMining analysis results."""
 
 from __future__ import annotations
 
-import json
 from datetime import datetime
 from pathlib import Path
-from typing import Any
+from typing import Any
 
-from greenmining.config import get_config
 from greenmining.utils import (
     colored_print,
     format_number,
     format_percentage,
-    load_json_file,
-    print_banner,
 )
 
 
--- greenmining-1.1.7/greenmining/utils.py
+++ greenmining-1.1.9/greenmining/utils.py
@@ -38,41 +38,12 @@ def save_json_file(data: dict[str, Any], path: Path, indent: int = 2) -> None:
         json.dump(data, f, indent=indent, ensure_ascii=False)
 
 
-def load_csv_file(path: Path) -> pd.DataFrame:
-    # Load CSV file as pandas DataFrame.
-    if not path.exists():
-        raise FileNotFoundError(f"File not found: {path}")
-
-    return pd.read_csv(path)
-
-
 def save_csv_file(df: pd.DataFrame, path: Path) -> None:
     # Save DataFrame to CSV file.
     path.parent.mkdir(parents=True, exist_ok=True)
     df.to_csv(path, index=False, encoding="utf-8")
 
 
-def estimate_tokens(text: str) -> int:
-    # Estimate number of tokens in text.
-    return len(text) // 4
-
-
-def estimate_cost(tokens: int, model: str = "claude-sonnet-4-20250514") -> float:
-    # Estimate API cost based on token usage.
-    # Claude Sonnet 4 pricing (as of Dec 2024)
-    # Input: $3 per million tokens
-    # Output: $15 per million tokens
-    # Average estimate: assume 50% input, 50% output
-
-    if "sonnet" in model.lower():
-        input_cost = 3.0 / 1_000_000  # per token
-        output_cost = 15.0 / 1_000_000  # per token
-        avg_cost = (input_cost + output_cost) / 2
-        return tokens * avg_cost
-
-    return 0.0
-
-
 def retry_on_exception(
     max_retries: int = 3,
     delay: float = 2.0,
@@ -124,14 +95,6 @@ def colored_print(text: str, color: str = "white") -> None:
     print(f"{color_code}{text}{Style.RESET_ALL}")
 
 
-def handle_github_rate_limit(response) -> None:
-    # Handle GitHub API rate limiting.
-    if hasattr(response, "status") and response.status == 403:
-        colored_print("GitHub API rate limit exceeded!", "red")
-        colored_print("Please wait or use an authenticated token.", "yellow")
-        raise Exception("GitHub API rate limit exceeded")
-
-
 def format_number(num: int) -> str:
     # Format large numbers with thousand separators.
     return f"{num:,}"
@@ -140,53 +103,3 @@ def format_number(num: int) -> str:
 def format_percentage(value: float, decimals: int = 1) -> str:
     # Format percentage value.
     return f"{value:.{decimals}f}%"
-
-
-def format_duration(seconds: float) -> str:
-    # Format duration in human-readable format.
-    if seconds < 60:
-        return f"{int(seconds)}s"
-    elif seconds < 3600:
-        minutes = int(seconds / 60)
-        secs = int(seconds % 60)
-        return f"{minutes}m {secs}s"
-    else:
-        hours = int(seconds / 3600)
-        minutes = int((seconds % 3600) / 60)
-        return f"{hours}h {minutes}m"
-
-
-def truncate_text(text: str, max_length: int = 100) -> str:
-    # Truncate text to maximum length.
-    if len(text) <= max_length:
-        return text
-    return text[: max_length - 3] + "..."
-
-
-def create_checkpoint(checkpoint_file: Path, data: dict[str, Any]) -> None:
-    # Create checkpoint file for resuming operations.
-    save_json_file(data, checkpoint_file)
-    colored_print(f"Checkpoint saved: {checkpoint_file}", "green")
-
-
-def load_checkpoint(checkpoint_file: Path) -> Optional[dict[str, Any]]:
-    # Load checkpoint data if exists.
-    if checkpoint_file.exists():
-        try:
-            return load_json_file(checkpoint_file)
-        except Exception as e:
-            colored_print(f"Failed to load checkpoint: {e}", "yellow")
-    return None
-
-
-def print_banner(title: str) -> None:
-    # Print formatted banner.
-    colored_print("\n" + "=" * 60, "cyan")
-    colored_print(f"  {title}", "cyan")
-    colored_print("=" * 60 + "\n", "cyan")
-
-
-def print_section(title: str) -> None:
-    # Print section header.
-    colored_print(f"\n  {title}", "blue")
-    colored_print("-" * 60, "blue")
--- greenmining-1.1.7/PKG-INFO
+++ greenmining-1.1.9/greenmining.egg-info/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: greenmining
-Version: 1.1.7
+Version: 1.1.9
 Summary: An empirical Python library for Mining Software Repositories (MSR) in Green IT research
 Author-email: Adam Bouafia <a.bouafia@student.vu.nl>
 License: MIT
@@ -68,9 +68,9 @@ An empirical Python library for Mining Software Repositories (MSR) in Green IT research
 
 `greenmining` is a research-grade Python library designed for **empirical Mining Software Repositories (MSR)** studies in **Green IT**. It enables researchers and practitioners to:
 
-- **Mine repositories at scale** - Fetch and analyze GitHub repositories via GraphQL API with configurable filters
-
-- **Classify green commits** - Detect
+- **Mine repositories at scale** - Search, Fetch and analyze GitHub repositories via GraphQL API with configurable filters
+
+- **Classify green commits** - Detect 124 sustainability patterns from the Green Software Foundation (GSF) catalog
 - **Analyze any repository by URL** - Direct Git-based analysis with support for private repositories
 - **Measure energy consumption** - RAPL, CodeCarbon, and CPU Energy Meter backends for power profiling
 - **Carbon footprint reporting** - CO2 emissions calculation with 20+ country profiles and cloud region support
@@ -113,7 +113,7 @@ docker pull adambouafia/greenmining:latest
 from greenmining import GSF_PATTERNS, is_green_aware, get_pattern_by_keywords
 
 # Check available patterns
-print(f"Total patterns: {len(GSF_PATTERNS)}") #
+print(f"Total patterns: {len(GSF_PATTERNS)}") # 124 patterns across 15 categories
 
 # Detect green awareness in commit messages
 commit_msg = "Optimize Redis caching to reduce energy consumption"
@@ -670,8 +670,8 @@ config = Config(
 
 ### Core Capabilities
 
-- **Pattern Detection**:
-- **Keyword Analysis**:
+- **Pattern Detection**: 124 sustainability patterns across 15 categories from the GSF catalog
+- **Keyword Analysis**: 332 green software detection keywords
 - **Repository Fetching**: GraphQL API with date, star, and language filters
 - **URL-Based Analysis**: Direct Git-based analysis from GitHub URLs (HTTPS and SSH)
 - **Batch Processing**: Parallel analysis of multiple repositories with configurable workers
@@ -739,7 +739,7 @@ print(f"Equivalent: {report.tree_months:.2f} tree-months to offset")
 
 ### Pattern Database
 
-**
+**124 green software patterns based on:**
 - Green Software Foundation (GSF) Patterns Catalog
 - VU Amsterdam 2024 research on ML system sustainability
 - ICSE 2024 conference papers on sustainable software
@@ -749,11 +749,11 @@ print(f"Equivalent: {report.tree_months:.2f} tree-months to offset")
 - **Coverage**: 67% of patterns actively detect in real-world commits
 - **Accuracy**: 100% true positive rate for green-aware commits
 - **Categories**: 15 distinct sustainability domains covered
-- **Keywords**:
+- **Keywords**: 332 detection terms across all patterns
 
 ## GSF Pattern Categories
 
-**
+**124 patterns across 15 categories:**
 
 ### 1. Cloud (40 patterns)
 Auto-scaling, serverless computing, right-sizing instances, region selection for renewable energy, spot instances, idle resource detection, cloud-native architectures
--- greenmining-1.1.7/greenmining.egg-info/SOURCES.txt
+++ greenmining-1.1.9/greenmining.egg-info/SOURCES.txt
@@ -6,7 +6,6 @@ pyproject.toml
 setup.py
 ./greenmining/__init__.py
 ./greenmining/__main__.py
-./greenmining/__version__.py
 ./greenmining/config.py
 ./greenmining/gsf_patterns.py
 ./greenmining/utils.py
@@ -37,13 +36,11 @@ setup.py
 ./greenmining/services/commit_extractor.py
 ./greenmining/services/data_aggregator.py
 ./greenmining/services/data_analyzer.py
-./greenmining/services/github_fetcher.py
 ./greenmining/services/github_graphql_fetcher.py
 ./greenmining/services/local_repo_analyzer.py
 ./greenmining/services/reports.py
 greenmining/__init__.py
 greenmining/__main__.py
-greenmining/__version__.py
 greenmining/config.py
 greenmining/gsf_patterns.py
 greenmining/utils.py
@@ -79,7 +76,6 @@ greenmining/services/__init__.py
 greenmining/services/commit_extractor.py
 greenmining/services/data_aggregator.py
 greenmining/services/data_analyzer.py
-greenmining/services/github_fetcher.py
 greenmining/services/github_graphql_fetcher.py
 greenmining/services/local_repo_analyzer.py
 greenmining/services/reports.py
--- greenmining-1.1.7/pyproject.toml
+++ greenmining-1.1.9/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "greenmining"
-version = "1.1.7"
+version = "1.1.9"
 description = "An empirical Python library for Mining Software Repositories (MSR) in Green IT research"
 readme = "README.md"
 requires-python = ">=3.9"
--- greenmining-1.1.7/greenmining/config.py
+++ /dev/null
@@ -1,200 +0,0 @@
-import os
-from pathlib import Path
-from typing import Any, Dict, List, Optional
-
-from dotenv import load_dotenv
-
-
-def _load_yaml_config(yaml_path: Path) -> Dict[str, Any]:
-    # Load configuration from YAML file if it exists.
-    if not yaml_path.exists():
-        return {}
-    try:
-        import yaml
-
-        with open(yaml_path, "r") as f:
-            return yaml.safe_load(f) or {}
-    except ImportError:
-        return {}
-    except Exception:
-        return {}
-
-
-class Config:
-    # Configuration class for loading from env vars and YAML.
-
-    def __init__(self, env_file: str = ".env", yaml_file: str = "greenmining.yaml"):
-        # Initialize configuration from environment and YAML file.
-        # Load environment variables
-        env_path = Path(env_file)
-        if env_path.exists():
-            load_dotenv(env_path)
-        else:
-            load_dotenv()  # Load from system environment
-
-        # Load YAML config (takes precedence for certain options)
-        yaml_path = Path(yaml_file)
-        self._yaml_config = _load_yaml_config(yaml_path)
-
-        # GitHub API Configuration
-        self.GITHUB_TOKEN = os.getenv("GITHUB_TOKEN")
-        if not self.GITHUB_TOKEN or self.GITHUB_TOKEN == "your_github_pat_here":
-            raise ValueError("GITHUB_TOKEN not set. Please set it in .env file or environment.")
-
-        # Analysis Type
-        self.ANALYSIS_TYPE = "keyword_heuristic"
-
-        # Search and Processing Configuration (YAML: sources.search.keywords)
-        yaml_search = self._yaml_config.get("sources", {}).get("search", {})
-        self.GITHUB_SEARCH_KEYWORDS = yaml_search.get(
-            "keywords", ["microservices", "microservice-architecture", "cloud-native"]
-        )
-
-        # Supported Languages (YAML: sources.search.languages)
-        self.SUPPORTED_LANGUAGES = yaml_search.get(
-            "languages",
-            [
-                "Java",
-                "Python",
-                "Go",
-                "JavaScript",
-                "TypeScript",
-                "C#",
-                "Rust",
-            ],
-        )
-
-        # Repository and Commit Limits (YAML: extraction.*)
-        yaml_extraction = self._yaml_config.get("extraction", {})
-        self.MIN_STARS = yaml_search.get("min_stars", int(os.getenv("MIN_STARS", "100")))
-        self.MAX_REPOS = int(os.getenv("MAX_REPOS", "100"))
-        self.COMMITS_PER_REPO = yaml_extraction.get(
-            "max_commits", int(os.getenv("COMMITS_PER_REPO", "50"))
-        )
-        self.DAYS_BACK = yaml_extraction.get("days_back", int(os.getenv("DAYS_BACK", "730")))
-        self.SKIP_MERGES = yaml_extraction.get("skip_merges", True)
-
-        # Analysis Configuration (YAML: analysis.*)
-        yaml_analysis = self._yaml_config.get("analysis", {})
-        self.ENABLE_NLP_ANALYSIS = os.getenv("ENABLE_NLP_ANALYSIS", "false").lower() == "true"
-        self.ENABLE_TEMPORAL_ANALYSIS = (
-            os.getenv("ENABLE_TEMPORAL_ANALYSIS", "false").lower() == "true"
-        )
-        self.TEMPORAL_GRANULARITY = os.getenv("TEMPORAL_GRANULARITY", "quarter")
-        self.ENABLE_ML_FEATURES = os.getenv("ENABLE_ML_FEATURES", "false").lower() == "true"
-        self.VALIDATION_SAMPLE_SIZE = int(os.getenv("VALIDATION_SAMPLE_SIZE", "30"))
-
-        # PyDriller options (YAML: analysis.process_metrics, etc.)
-        self.PROCESS_METRICS_ENABLED = yaml_analysis.get(
-            "process_metrics", os.getenv("PROCESS_METRICS_ENABLED", "true").lower() == "true"
-        )
-        self.STRUCTURAL_METRICS_ENABLED = yaml_analysis.get(
-            "structural_metrics", os.getenv("STRUCTURAL_METRICS_ENABLED", "true").lower() == "true"
-        )
-        self.DMM_ENABLED = yaml_analysis.get(
-            "delta_maintainability", os.getenv("DMM_ENABLED", "true").lower() == "true"
-        )
-
-        # Temporal Filtering
-        self.CREATED_AFTER = os.getenv("CREATED_AFTER")
-        self.CREATED_BEFORE = os.getenv("CREATED_BEFORE")
-        self.PUSHED_AFTER = os.getenv("PUSHED_AFTER")
-        self.PUSHED_BEFORE = os.getenv("PUSHED_BEFORE")
-        self.COMMIT_DATE_FROM = os.getenv("COMMIT_DATE_FROM")
-        self.COMMIT_DATE_TO = os.getenv("COMMIT_DATE_TO")
-        self.MIN_COMMITS = int(os.getenv("MIN_COMMITS", "0"))
-        self.ACTIVITY_WINDOW_DAYS = int(os.getenv("ACTIVITY_WINDOW_DAYS", "730"))
-
-        # Analysis Configuration
-        self.BATCH_SIZE = int(os.getenv("BATCH_SIZE", "10"))
-
-        # Processing Configuration
-        self.TIMEOUT_SECONDS = int(os.getenv("TIMEOUT_SECONDS", "30"))
-        self.MAX_RETRIES = int(os.getenv("MAX_RETRIES", "3"))
-        self.RETRY_DELAY = 2
-        self.EXPONENTIAL_BACKOFF = True
-
-        # Output Configuration (YAML: output.directory)
-        yaml_output = self._yaml_config.get("output", {})
-        self.OUTPUT_DIR = Path(yaml_output.get("directory", os.getenv("OUTPUT_DIR", "./data")))
-        self.OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
-
-        # File Paths
-        self.REPOS_FILE = self.OUTPUT_DIR / "repositories.json"
-        self.COMMITS_FILE = self.OUTPUT_DIR / "commits.json"
-        self.ANALYSIS_FILE = self.OUTPUT_DIR / "analysis_results.json"
-        self.AGGREGATED_FILE = self.OUTPUT_DIR / "aggregated_statistics.json"
-        self.CSV_FILE = self.OUTPUT_DIR / "green_analysis_results.csv"
-        self.REPORT_FILE = self.OUTPUT_DIR / "green_microservices_analysis.md"
-        self.CHECKPOINT_FILE = self.OUTPUT_DIR / "checkpoint.json"
-
-        # Direct Repository URL Support (YAML: sources.urls)
-        yaml_urls = self._yaml_config.get("sources", {}).get("urls", [])
-        env_urls = self._parse_repository_urls(os.getenv("REPOSITORY_URLS", ""))
-        self.REPOSITORY_URLS: List[str] = yaml_urls if yaml_urls else env_urls
-
-        # Clone path (YAML: extraction.clone_path)
-        self.CLONE_PATH = Path(
-            yaml_extraction.get("clone_path", os.getenv("CLONE_PATH", "/tmp/greenmining_repos"))
-        )
-        self.CLEANUP_AFTER_ANALYSIS = os.getenv("CLEANUP_AFTER_ANALYSIS", "true").lower() == "true"
-
-        # Energy Measurement (YAML: energy.*)
-        yaml_energy = self._yaml_config.get("energy", {})
-        self.ENERGY_ENABLED = yaml_energy.get(
-            "enabled", os.getenv("ENERGY_ENABLED", "false").lower() == "true"
-        )
-        self.ENERGY_BACKEND = yaml_energy.get("backend", os.getenv("ENERGY_BACKEND", "rapl"))
-        self.CARBON_TRACKING = yaml_energy.get(
-            "carbon_tracking", os.getenv("CARBON_TRACKING", "false").lower() == "true"
-        )
-        self.COUNTRY_ISO = yaml_energy.get("country_iso", os.getenv("COUNTRY_ISO", "USA"))
-
-        # Power profiling (YAML: energy.power_profiling.*)
-        yaml_power = yaml_energy.get("power_profiling", {})
-        self.POWER_PROFILING_ENABLED = yaml_power.get("enabled", False)
-        self.POWER_TEST_COMMAND = yaml_power.get("test_command", None)
-        self.POWER_REGRESSION_THRESHOLD = yaml_power.get("regression_threshold", 5.0)
-
-        # Logging
-        self.VERBOSE = os.getenv("VERBOSE", "false").lower() == "true"
-        self.LOG_FILE = self.OUTPUT_DIR / "mining.log"
-
-    def _parse_repository_urls(self, urls_str: str) -> List[str]:
-        # Parse comma-separated repository URLs from environment variable.
-        if not urls_str:
-            return []
-        return [url.strip() for url in urls_str.split(",") if url.strip()]
-
-    def validate(self) -> bool:
-        # Validate that all required configuration is present.
-        required_attrs = ["GITHUB_TOKEN", "MAX_REPOS", "COMMITS_PER_REPO"]
-
-        for attr in required_attrs:
-            if not getattr(self, attr, None):
-                raise ValueError(f"Missing required configuration: {attr}")
-
-        return True
-
-    def __repr__(self) -> str:
-        # String representation of configuration (hiding sensitive data).
-        return (
-            f"Config("
-            f"MAX_REPOS={self.MAX_REPOS}, "
-            f"COMMITS_PER_REPO={self.COMMITS_PER_REPO}, "
-            f"BATCH_SIZE={self.BATCH_SIZE}, "
-            f"OUTPUT_DIR={self.OUTPUT_DIR}"
-            f")"
-        )
-
-
-# Global config instance
-_config_instance = None
-
-
-def get_config(env_file: str = ".env") -> Config:
-    # Get or create global configuration instance.
-    global _config_instance
-    if _config_instance is None:
-        _config_instance = Config(env_file)
-    return _config_instance