greenmining 1.1.9__py3-none-any.whl → 1.2.0__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,865 +0,0 @@
1
- Metadata-Version: 2.4
2
- Name: greenmining
3
- Version: 1.1.9
4
- Summary: An empirical Python library for Mining Software Repositories (MSR) in Green IT research
5
- Author-email: Adam Bouafia <a.bouafia@student.vu.nl>
6
- License: MIT
7
- Project-URL: Homepage, https://github.com/adam-bouafia/greenmining
8
- Project-URL: Documentation, https://github.com/adam-bouafia/greenmining#readme
9
- Project-URL: Linkedin, https://www.linkedin.com/in/adam-bouafia/
10
- Project-URL: Repository, https://github.com/adam-bouafia/greenmining
11
- Project-URL: Issues, https://github.com/adam-bouafia/greenmining/issues
12
- Project-URL: Changelog, https://github.com/adam-bouafia/greenmining/blob/main/CHANGELOG.md
13
- Keywords: green-software,gsf,msr,mining-software-repositories,green-it,sustainability,carbon-footprint,energy-efficiency,repository-analysis,github-analysis,pydriller,empirical-software-engineering
14
- Classifier: Development Status :: 3 - Alpha
15
- Classifier: Intended Audience :: Developers
16
- Classifier: Intended Audience :: Science/Research
17
- Classifier: Topic :: Software Development :: Quality Assurance
18
- Classifier: Topic :: Scientific/Engineering :: Information Analysis
19
- Classifier: License :: OSI Approved :: MIT License
20
- Classifier: Programming Language :: Python :: 3
21
- Classifier: Programming Language :: Python :: 3.9
22
- Classifier: Programming Language :: Python :: 3.10
23
- Classifier: Programming Language :: Python :: 3.11
24
- Classifier: Programming Language :: Python :: 3.12
25
- Classifier: Programming Language :: Python :: 3.13
26
- Classifier: Operating System :: OS Independent
27
- Requires-Python: >=3.9
28
- Description-Content-Type: text/markdown
29
- License-File: LICENSE
30
- Requires-Dist: PyGithub
31
- Requires-Dist: PyDriller
32
- Requires-Dist: pandas
33
- Requires-Dist: colorama
34
- Requires-Dist: tabulate
35
- Requires-Dist: tqdm
36
- Requires-Dist: matplotlib
37
- Requires-Dist: plotly
38
- Requires-Dist: python-dotenv
39
- Requires-Dist: requests
40
- Provides-Extra: dev
41
- Requires-Dist: pytest; extra == "dev"
42
- Requires-Dist: pytest-cov; extra == "dev"
43
- Requires-Dist: pytest-mock; extra == "dev"
44
- Requires-Dist: black; extra == "dev"
45
- Requires-Dist: ruff; extra == "dev"
46
- Requires-Dist: mypy; extra == "dev"
47
- Requires-Dist: build; extra == "dev"
48
- Requires-Dist: twine; extra == "dev"
49
- Provides-Extra: energy
50
- Requires-Dist: psutil; extra == "energy"
51
- Requires-Dist: codecarbon; extra == "energy"
52
- Provides-Extra: docs
53
- Requires-Dist: sphinx; extra == "docs"
54
- Requires-Dist: sphinx-rtd-theme; extra == "docs"
55
- Requires-Dist: myst-parser; extra == "docs"
56
- Dynamic: license-file
57
-
58
- # greenmining
59
-
60
- An empirical Python library for Mining Software Repositories (MSR) in Green IT research.
61
-
62
- [![PyPI](https://img.shields.io/pypi/v/greenmining)](https://pypi.org/project/greenmining/)
63
- [![Python](https://img.shields.io/pypi/pyversions/greenmining)](https://pypi.org/project/greenmining/)
64
- [![License](https://img.shields.io/github/license/adam-bouafia/greenmining)](LICENSE)
65
- [![Documentation](https://img.shields.io/badge/docs-readthedocs-blue)](https://greenmining.readthedocs.io/)
66
-
67
- ## Overview
68
-
69
- `greenmining` is a research-grade Python library designed for **empirical Mining Software Repositories (MSR)** studies in **Green IT**. It enables researchers and practitioners to:
70
-
71
- - **Mine repositories at scale** - Search, fetch, and analyze GitHub repositories via the GraphQL API with configurable filters
72
-
73
- - **Classify green commits** - Detect 124 sustainability patterns from the Green Software Foundation (GSF) catalog
74
- - **Analyze any repository by URL** - Direct Git-based analysis with support for private repositories
75
- - **Measure energy consumption** - RAPL, CodeCarbon, and CPU Energy Meter backends for power profiling
76
- - **Carbon footprint reporting** - CO2 emissions calculation with 20+ country profiles and cloud region support
77
- - **Power regression detection** - Identify commits that increased energy consumption
78
- - **Method-level analysis** - Per-method complexity and metrics via Lizard integration
79
- - **Version power comparison** - Compare power consumption across software versions
80
- - **Generate research datasets** - Statistical analysis, temporal trends, and publication-ready reports
81
-
82
- Whether you're conducting MSR research, analyzing green software adoption, or measuring the energy footprint of codebases, GreenMining provides the empirical toolkit you need.
83
-
84
- ## Installation
85
-
86
- ### Via pip
87
-
88
- ```bash
89
- pip install greenmining
90
- ```
91
-
92
- ### From source
93
-
94
- ```bash
95
- git clone https://github.com/adam-bouafia/greenmining.git
96
- cd greenmining
97
- pip install -e .
98
- ```
99
-
100
- ### With Docker
101
-
102
- ```bash
103
- docker pull adambouafia/greenmining:latest
104
- ```
105
-
106
- ## Quick Start
107
-
108
- ### Python API
109
-
110
- #### Basic Pattern Detection
111
-
112
- ```python
113
- from greenmining import GSF_PATTERNS, is_green_aware, get_pattern_by_keywords
114
-
115
- # Check available patterns
116
- print(f"Total patterns: {len(GSF_PATTERNS)}") # 124 patterns across 15 categories
117
-
118
- # Detect green awareness in commit messages
119
- commit_msg = "Optimize Redis caching to reduce energy consumption"
120
- if is_green_aware(commit_msg):
121
-     patterns = get_pattern_by_keywords(commit_msg)
122
-     print(f"Matched patterns: {patterns}")
123
-     # Output: ['Cache Static Data', 'Use Efficient Cache Strategies']
124
- ```
125
-
126
- #### Fetch Repositories with Custom Keywords
127
-
128
- ```python
129
- from greenmining import fetch_repositories
130
-
131
- # Fetch repositories with custom search keywords
132
- repos = fetch_repositories(
133
- github_token="your_github_token", # Required: GitHub personal access token
134
- max_repos=50, # Maximum number of repositories to fetch
135
- min_stars=500, # Minimum star count filter
136
- keywords="kubernetes cloud-native", # Search keywords (space-separated)
137
- languages=["Python", "Go"], # Programming language filters
138
- created_after="2020-01-01", # Filter by creation date (YYYY-MM-DD)
139
- created_before="2024-12-31", # Filter by creation date (YYYY-MM-DD)
140
- pushed_after="2023-01-01", # Filter by last push date (YYYY-MM-DD)
141
- pushed_before="2024-12-31" # Filter by last push date (YYYY-MM-DD)
142
- )
143
-
144
- print(f"Found {len(repos)} repositories")
145
- for repo in repos[:5]:
146
- print(f"- {repo.full_name} ({repo.stars} stars)")
147
- ```
148
-
149
- **Parameters:**
150
- - `github_token` (str, required): GitHub personal access token for API authentication
151
- - `max_repos` (int, default=100): Maximum number of repositories to fetch
152
- - `min_stars` (int, default=100): Minimum GitHub stars filter
153
- - `keywords` (str, default="microservices"): Space-separated search keywords
154
- - `languages` (list[str], optional): Programming language filters (e.g., ["Python", "Go", "Java"])
155
- - `created_after` (str, optional): Filter repos created after date (format: "YYYY-MM-DD")
156
- - `created_before` (str, optional): Filter repos created before date (format: "YYYY-MM-DD")
157
- - `pushed_after` (str, optional): Filter repos pushed after date (format: "YYYY-MM-DD")
158
- - `pushed_before` (str, optional): Filter repos pushed before date (format: "YYYY-MM-DD")
159
-
160
- #### Analyze Repository Commits
161
-
162
- ```python
163
- from greenmining.services.commit_extractor import CommitExtractor
164
- from greenmining.services.data_analyzer import DataAnalyzer
165
- from greenmining import fetch_repositories
166
-
167
- # Fetch repositories with custom keywords
168
- repos = fetch_repositories(
169
- github_token="your_token",
170
- max_repos=10,
171
- keywords="serverless edge-computing"
172
- )
173
-
174
- # Initialize commit extractor with parameters
175
- extractor = CommitExtractor(
176
- exclude_merge_commits=True, # Skip merge commits (default: True)
177
- exclude_bot_commits=True, # Skip bot commits (default: True)
178
- min_message_length=10 # Minimum commit message length (default: 10)
179
- )
180
-
181
- # Initialize analyzer with advanced features
182
- analyzer = DataAnalyzer(
183
- enable_diff_analysis=False, # Enable code diff analysis (slower but more accurate)
184
- patterns=None, # Custom pattern dict (default: GSF_PATTERNS)
185
- batch_size=10 # Batch processing size (default: 10)
186
- )
187
-
188
- # Extract commits from first repo
189
- commits = extractor.extract_commits(
190
- repository=repos[0], # PyGithub Repository object
191
- max_commits=50, # Maximum commits to extract per repository
192
- since=None, # Start date filter (datetime object, optional)
193
- until=None # End date filter (datetime object, optional)
194
- )
- ```
195
-
196
- **CommitExtractor Parameters:**
197
- - `exclude_merge_commits` (bool, default=True): Skip merge commits during extraction
198
- - `exclude_bot_commits` (bool, default=True): Skip commits from bot accounts
199
- - `min_message_length` (int, default=10): Minimum length for commit message to be included
200
-
201
- **DataAnalyzer Parameters:**
202
- - `enable_diff_analysis` (bool, default=False): Enable code diff analysis (slower)
203
- - `patterns` (dict, optional): Custom pattern dictionary (default: GSF_PATTERNS)
204
- - `batch_size` (int, default=10): Number of commits to process in each batch
205
-
206
- ```python
- # Analyze commits for green patterns
207
- results = []
208
- for commit in commits:
209
-     result = analyzer.analyze_commit(commit)
210
-     if result['green_aware']:
211
-         results.append(result)
212
-         print(f"Green commit found: {commit.message[:50]}...")
213
-         print(f" Patterns: {result['known_pattern']}")
214
- ```
215
-
216
- #### Access Sustainability Patterns Data
217
-
218
- ```python
219
- from greenmining import GSF_PATTERNS
220
-
221
- # Get all patterns by category
222
- cloud_patterns = {
223
- pid: pattern for pid, pattern in GSF_PATTERNS.items()
224
- if pattern['category'] == 'cloud'
225
- }
226
- print(f"Cloud patterns: {len(cloud_patterns)}") # 40 patterns
227
-
228
- ai_patterns = {
229
- pid: pattern for pid, pattern in GSF_PATTERNS.items()
230
- if pattern['category'] == 'ai'
231
- }
232
- print(f"AI/ML patterns: {len(ai_patterns)}") # 19 patterns
233
-
234
- # Get pattern details
235
- cache_pattern = GSF_PATTERNS['gsf_001']
236
- print(f"Pattern: {cache_pattern['name']}")
237
- print(f"Category: {cache_pattern['category']}")
238
- print(f"Keywords: {cache_pattern['keywords']}")
239
- print(f"Impact: {cache_pattern['sci_impact']}")
240
-
241
- # List all available categories
242
- categories = set(p['category'] for p in GSF_PATTERNS.values())
243
- print(f"Available categories: {sorted(categories)}")
244
- # Output: ['ai', 'async', 'caching', 'cloud', 'code', 'data',
245
- # 'database', 'general', 'infrastructure', 'microservices',
246
- # 'monitoring', 'network', 'networking', 'resource', 'web']
247
- ```
248
-
249
- #### Advanced Analysis: Temporal Trends
250
-
251
- ```python
252
- from greenmining.services.data_aggregator import DataAggregator
253
- from greenmining.analyzers.temporal_analyzer import TemporalAnalyzer
254
- from greenmining.analyzers.qualitative_analyzer import QualitativeAnalyzer
255
-
256
- # Initialize aggregator with all advanced features
257
- aggregator = DataAggregator(
258
- config=None, # Config object (optional)
259
- enable_stats=True, # Enable statistical analysis (correlations, trends)
260
- enable_temporal=True, # Enable temporal trend analysis
261
- temporal_granularity="quarter" # Time granularity: day/week/month/quarter/year
262
- )
263
-
264
- # Optional: Configure temporal analyzer separately
265
- temporal_analyzer = TemporalAnalyzer(
266
- granularity="quarter" # Time period granularity for grouping commits
267
- )
268
-
269
- # Optional: Configure qualitative analyzer for validation sampling
270
- qualitative_analyzer = QualitativeAnalyzer(
271
- sample_size=30, # Number of samples for manual validation
272
- stratify_by="pattern" # Stratification method: pattern/repository/time/random
273
- )
274
-
275
- # Aggregate results with temporal insights
276
- aggregated = aggregator.aggregate(
277
- analysis_results=analysis_results, # List of analysis result dictionaries
278
- repositories=repositories # List of PyGithub repository objects
279
- )
- ```
280
-
281
- **DataAggregator Parameters:**
282
- - `config` (Config, optional): Configuration object
283
- - `enable_stats` (bool, default=False): Enable pattern correlations and effect size analysis
284
- - `enable_temporal` (bool, default=False): Enable temporal trend analysis over time
285
- - `temporal_granularity` (str, default="quarter"): Time granularity (day/week/month/quarter/year)
286
-
287
- **TemporalAnalyzer Parameters:**
288
- - `granularity` (str, default="quarter"): Time period for grouping (day/week/month/quarter/year)
289
-
290
- **QualitativeAnalyzer Parameters:**
291
- - `sample_size` (int, default=30): Number of commits to sample for validation
292
- - `stratify_by` (str, default="pattern"): Stratification method (pattern/repository/time/random)
293
-
294
- ```python
- # Access temporal analysis results
295
- temporal = aggregated['temporal_analysis']
296
- print(f"Time periods analyzed: {len(temporal['periods'])}")
297
-
298
- # View pattern adoption trends over time
299
- for period_data in temporal['periods']:
300
- print(f"{period_data['period']}: {period_data['commit_count']} commits, "
301
- f"{period_data['green_awareness_rate']:.1%} green awareness")
302
-
303
- # Access pattern evolution insights
304
- evolution = temporal.get('pattern_evolution', {})
305
- print(f"Emerging patterns: {evolution.get('emerging', [])}")
306
- print(f"Stable patterns: {evolution.get('stable', [])}")
307
- ```
308
-
309
- #### Generate Custom Reports
310
-
311
- ```python
312
- from greenmining.services.data_aggregator import DataAggregator
313
- from greenmining.config import Config
314
-
315
- config = Config()
316
- aggregator = DataAggregator(config)
317
-
318
- # Load analysis results
319
- results = aggregator.load_analysis_results()
320
-
321
- # Generate statistics
322
- stats = aggregator.calculate_statistics(results)
323
- print(f"Total commits analyzed: {stats['total_commits']}")
324
- print(f"Green-aware commits: {stats['green_aware_count']}")
325
- print(f"Top patterns: {stats['top_patterns'][:5]}")
326
-
327
- # Export to CSV
328
- aggregator.export_to_csv(results, "output.csv")
329
- ```
330
-
331
- #### URL-Based Repository Analysis
332
-
333
- ```python
334
- from greenmining.services.local_repo_analyzer import LocalRepoAnalyzer
335
-
336
- analyzer = LocalRepoAnalyzer(
337
- max_commits=200,
338
- cleanup_after=True,
339
- )
340
-
341
- result = analyzer.analyze_repository("https://github.com/pallets/flask")
342
-
343
- print(f"Repository: {result.name}")
344
- print(f"Commits analyzed: {result.total_commits}")
345
- print(f"Green-aware: {result.green_commits} ({result.green_commit_rate:.1%})")
346
-
347
- for commit in result.commits[:5]:
348
-     if commit.green_aware:
349
-         print(f" {commit.message[:60]}...")
350
- ```
351
-
352
- #### Batch Analysis with Parallelism
353
-
354
- ```python
355
- from greenmining import analyze_repositories
356
-
357
- results = analyze_repositories(
358
- urls=[
359
- "https://github.com/kubernetes/kubernetes",
360
- "https://github.com/istio/istio",
361
- "https://github.com/envoyproxy/envoy",
362
- ],
363
- max_commits=100,
364
- parallel_workers=3,
365
- energy_tracking=True,
366
- energy_backend="auto",
367
- )
368
-
369
- for result in results:
370
- print(f"{result.name}: {result.green_commit_rate:.1%} green")
371
- ```
372
-
373
- #### Private Repository Analysis
374
-
375
- ```python
376
- from greenmining.services.local_repo_analyzer import LocalRepoAnalyzer
377
-
378
- # HTTPS with token
379
- analyzer = LocalRepoAnalyzer(github_token="ghp_xxxx")
380
- result = analyzer.analyze_repository("https://github.com/company/private-repo")
381
-
382
- # SSH with key
383
- analyzer = LocalRepoAnalyzer(ssh_key_path="~/.ssh/id_rsa")
384
- result = analyzer.analyze_repository("git@github.com:company/private-repo.git")
385
- ```
386
-
387
- #### Power Regression Detection
388
-
389
- ```python
390
- from greenmining.analyzers import PowerRegressionDetector
391
-
392
- detector = PowerRegressionDetector(
393
- test_command="pytest tests/ -x",
394
- energy_backend="rapl",
395
- threshold_percent=5.0,
396
- iterations=5,
397
- )
398
-
399
- regressions = detector.detect(
400
- repo_path="/path/to/repo",
401
- baseline_commit="v1.0.0",
402
- target_commit="HEAD",
403
- )
404
-
405
- for regression in regressions:
406
- print(f"Commit {regression.sha[:8]}: +{regression.power_increase:.1f}%")
407
- ```
408
-
409
- #### Version Power Comparison
410
-
411
- ```python
412
- from greenmining.analyzers import VersionPowerAnalyzer
413
-
414
- analyzer = VersionPowerAnalyzer(
415
- test_command="pytest tests/",
416
- energy_backend="rapl",
417
- iterations=10,
418
- warmup_iterations=2,
419
- )
420
-
421
- report = analyzer.analyze_versions(
422
- repo_path="/path/to/repo",
423
- versions=["v1.0", "v1.1", "v1.2", "v2.0"],
424
- )
425
-
426
- print(report.summary())
427
- print(f"Trend: {report.trend}")
428
- print(f"Most efficient: {report.most_efficient}")
429
- ```
430
-
431
- #### Metrics-to-Power Correlation
432
-
433
- ```python
434
- from greenmining.analyzers import MetricsPowerCorrelator
435
-
436
- correlator = MetricsPowerCorrelator()
437
- correlator.fit(
438
- metrics=["complexity", "nloc", "code_churn"],
439
- metrics_values={
440
- "complexity": [10, 20, 30, 40],
441
- "nloc": [100, 200, 300, 400],
442
- "code_churn": [50, 100, 150, 200],
443
- },
444
- power_measurements=[5.0, 8.0, 12.0, 15.0],
445
- )
446
-
447
- print(f"Pearson: {correlator.pearson}")
448
- print(f"Spearman: {correlator.spearman}")
449
- print(f"Feature importance: {correlator.feature_importance}")
450
- ```
451
-
452
- #### Pipeline Batch Analysis
453
-
454
- ```python
455
- from greenmining.controllers.repository_controller import RepositoryController
456
- from greenmining.config import Config
457
-
458
- config = Config()
459
- controller = RepositoryController(config)
460
-
461
- # Run full pipeline programmatically
462
- controller.fetch_repositories(max_repos=50)
463
- controller.extract_commits(max_commits=100)
464
- controller.analyze_commits()
465
- controller.aggregate_results()
466
- controller.generate_report()
467
-
468
- print("Analysis complete! Check data/ directory for results.")
469
- ```
470
-
471
- #### Complete Working Example: Full Pipeline
472
-
473
- This complete, production-ready example demonstrates the entire analysis pipeline; in our testing it analyzed 100 repositories (30,543 commits).
474
-
475
- ```python
476
- import os
477
- from pathlib import Path
478
- from dotenv import load_dotenv
479
-
480
- # Load environment variables
481
- load_dotenv()
482
-
483
- # Import from greenmining package
484
- from greenmining import fetch_repositories
485
- from greenmining.services.commit_extractor import CommitExtractor
486
- from greenmining.services.data_analyzer import DataAnalyzer
487
- from greenmining.services.data_aggregator import DataAggregator
488
-
489
- # Configuration
490
- token = os.getenv("GITHUB_TOKEN")
491
- output_dir = Path("results")
492
- output_dir.mkdir(exist_ok=True)
493
-
494
- # STAGE 1: Fetch Repositories
495
- print("Fetching repositories...")
496
- repositories = fetch_repositories(
497
- github_token=token,
498
- max_repos=100,
499
- min_stars=10,
500
- keywords="software engineering",
501
- )
502
- print(f"Fetched {len(repositories)} repositories")
503
-
504
- # STAGE 2: Extract Commits
505
- print("\nExtracting commits...")
506
- extractor = CommitExtractor(
507
- github_token=token,
508
- max_commits=1000,
509
- skip_merges=True,
510
- days_back=730,
511
- timeout=120,
512
- )
513
- all_commits = extractor.extract_from_repositories(repositories)
514
- print(f"Extracted {len(all_commits)} commits")
515
-
516
- # Save commits
517
- extractor.save_results(
518
- all_commits,
519
- output_dir / "commits.json",
520
- len(repositories)
521
- )
522
-
523
- # STAGE 3: Analyze Commits
524
- print("\nAnalyzing commits...")
525
- analyzer = DataAnalyzer(
526
- enable_diff_analysis=False, # Set to True for detailed code analysis (slower)
527
- )
528
- analyzed_commits = analyzer.analyze_commits(all_commits)
529
-
530
- # Count green-aware commits
531
- green_count = sum(1 for c in analyzed_commits if c.get("green_aware", False))
532
- green_percentage = (green_count / len(analyzed_commits) * 100) if analyzed_commits else 0
533
- print(f"Analyzed {len(analyzed_commits)} commits")
534
- print(f"Green-aware: {green_count} ({green_percentage:.1f}%)")
535
-
536
- # Save analysis
537
- analyzer.save_results(analyzed_commits, output_dir / "analyzed.json")
538
-
539
- # STAGE 4: Aggregate Results
540
- print("\nAggregating results...")
541
- aggregator = DataAggregator(
542
- enable_stats=True,
543
- enable_temporal=True,
544
- temporal_granularity="quarter",
545
- )
546
- results = aggregator.aggregate(analyzed_commits, repositories)
547
-
548
- # STAGE 5: Save Results
549
- print("\nSaving results...")
550
- aggregator.save_results(
551
- results,
552
- output_dir / "aggregated.json",
553
- output_dir / "aggregated.csv",
554
- analyzed_commits
555
- )
556
-
557
- # Print summary
558
- print("\n" + "="*80)
559
- print("ANALYSIS COMPLETE")
560
- print("="*80)
561
- aggregator.print_summary(results)
562
- print(f"\nResults saved in: {output_dir.absolute()}")
563
- ```
564
-
565
- **What this example does:**
566
-
567
- 1. **Fetches repositories** from GitHub based on keywords and filters
568
- 2. **Extracts commits** from each repository (up to 1000 per repo)
569
- 3. **Analyzes commits** for green software patterns
570
- 4. **Aggregates results** with temporal analysis and statistics
571
- 5. **Saves results** to JSON and CSV files for further analysis
572
-
573
- **Expected output files:**
574
- - `commits.json` - All extracted commits with metadata
575
- - `analyzed.json` - Commits analyzed for green patterns
576
- - `aggregated.json` - Summary statistics and pattern distributions
577
- - `aggregated.csv` - Tabular format for spreadsheet analysis
578
- - `metadata.json` - Experiment configuration and timing
579
-
580
- **Performance:** This pipeline successfully processed 100 repositories (30,543 commits) in approximately 6.4 hours, identifying 7,600 green-aware commits (24.9%).
581
-
582
- ### Docker Usage
583
-
584
- ```bash
585
- # Interactive shell with Python
586
- docker run -it -v $(pwd)/data:/app/data \
587
- adambouafia/greenmining:latest python
588
-
589
- # Run Python script
590
- docker run -v $(pwd)/data:/app/data \
591
- adambouafia/greenmining:latest python your_script.py
592
- ```
593
-
594
- ## Configuration
595
-
596
- ### Environment Variables
597
-
598
- Create a `.env` file or set environment variables:
599
-
600
- ```bash
601
- # Required
602
- GITHUB_TOKEN=your_github_personal_access_token
603
-
604
- # Optional - Repository Fetching
605
- MAX_REPOS=100
606
- MIN_STARS=100
607
- SUPPORTED_LANGUAGES=Python,Java,Go,JavaScript,TypeScript
608
- SEARCH_KEYWORDS=microservices
609
-
610
- # Optional - Commit Extraction
611
- COMMITS_PER_REPO=50
612
- EXCLUDE_MERGE_COMMITS=true
613
- EXCLUDE_BOT_COMMITS=true
614
-
615
- # Optional - Analysis Features
616
- ENABLE_DIFF_ANALYSIS=false
617
- BATCH_SIZE=10
618
-
619
- # Optional - Temporal Analysis
620
- ENABLE_TEMPORAL=true
621
- TEMPORAL_GRANULARITY=quarter
622
- ENABLE_STATS=true
623
-
624
- # Optional - Output
625
- OUTPUT_DIR=./data
626
- REPORT_FORMAT=markdown
627
- ```
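-
- If you prefer to drive the Python API from these variables yourself, the sketch below shows one way to wire them into `fetch_repositories`. It is a minimal sketch, assuming only the documented `fetch_repositories` parameters and the variable names listed above.
-
- ```python
- import os
-
- from dotenv import load_dotenv
- from greenmining import fetch_repositories
-
- # Read GITHUB_TOKEN, MAX_REPOS, MIN_STARS, etc. from a local .env file
- load_dotenv()
-
- repos = fetch_repositories(
-     github_token=os.getenv("GITHUB_TOKEN"),
-     max_repos=int(os.getenv("MAX_REPOS", "100")),
-     min_stars=int(os.getenv("MIN_STARS", "100")),
-     keywords=os.getenv("SEARCH_KEYWORDS", "microservices"),
-     languages=os.getenv("SUPPORTED_LANGUAGES", "Python").split(","),
- )
- print(f"Fetched {len(repos)} repositories")
- ```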
628
-
629
- ### Config Object Parameters
630
-
631
- ```python
632
- from greenmining.config import Config
633
-
634
- config = Config(
635
- # GitHub API
636
- github_token="your_token", # GitHub personal access token (required)
637
-
638
- # Repository Fetching
639
- max_repos=100, # Maximum repositories to fetch
640
- min_stars=100, # Minimum star threshold
641
- supported_languages=["Python", "Go"], # Language filters
642
- search_keywords="microservices", # Default search keywords
643
-
644
- # Commit Extraction
645
- max_commits=50, # Commits per repository
646
- exclude_merge_commits=True, # Skip merge commits
647
- exclude_bot_commits=True, # Skip bot commits
648
- min_message_length=10, # Minimum commit message length
649
-
650
- # Analysis Options
651
- enable_diff_analysis=False, # Enable code diff analysis
652
- batch_size=10, # Batch processing size
653
-
654
- # Temporal Analysis
655
- enable_temporal=True, # Enable temporal trend analysis
656
- temporal_granularity="quarter", # day/week/month/quarter/year
657
- enable_stats=True, # Enable statistical analysis
658
-
659
- # Output Configuration
660
- output_dir="./data", # Output directory path
661
- repos_file="repositories.json", # Repositories filename
662
- commits_file="commits.json", # Commits filename
663
- analysis_file="analysis_results.json", # Analysis results filename
664
- stats_file="aggregated_statistics.json", # Statistics filename
665
- report_file="green_analysis.md" # Report filename
666
- )
667
- ```
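-
- A customized `Config` can then be handed to the pipeline controller described in the Pipeline Batch Analysis section above. The following is a minimal sketch, assuming only the `RepositoryController` methods shown there.
-
- ```python
- from greenmining.controllers.repository_controller import RepositoryController
-
- # Reuse the config object built above to drive the full pipeline
- controller = RepositoryController(config)
- controller.fetch_repositories(max_repos=50)
- controller.extract_commits(max_commits=100)
- controller.analyze_commits()
- controller.aggregate_results()
- controller.generate_report()
- ```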
668
-
669
- ## Features
670
-
671
- ### Core Capabilities
672
-
673
- - **Pattern Detection**: 124 sustainability patterns across 15 categories from the GSF catalog
674
- - **Keyword Analysis**: 332 green software detection keywords
675
- - **Repository Fetching**: GraphQL API with date, star, and language filters
676
- - **URL-Based Analysis**: Direct Git-based analysis from GitHub URLs (HTTPS and SSH)
677
- - **Batch Processing**: Parallel analysis of multiple repositories with configurable workers
678
- - **Private Repository Support**: Authentication via SSH keys or GitHub tokens
679
- - **Energy Measurement**: RAPL, CodeCarbon, and CPU Energy Meter backends
680
- - **Carbon Footprint Reporting**: CO2 emissions with 20+ country profiles and cloud region support (AWS, GCP, Azure)
681
- - **Power Regression Detection**: Identify commits that increased energy consumption
682
- - **Metrics-to-Power Correlation**: Pearson and Spearman analysis between code metrics and power
683
- - **Version Power Comparison**: Compare power consumption across software versions with trend detection
684
- - **Method-Level Analysis**: Per-method complexity metrics via Lizard integration
685
- - **Source Code Access**: Before/after source code for refactoring detection
686
- - **Full Process Metrics**: All 8 process metrics (ChangeSet, CodeChurn, CommitsCount, ContributorsCount, ContributorsExperience, HistoryComplexity, HunksCount, LinesCount)
687
- - **Statistical Analysis**: Correlations, effect sizes, and temporal trends
688
- - **Multi-format Output**: Markdown reports, CSV exports, JSON data
689
- - **Docker Support**: Pre-built images for containerized analysis
690
-
691
- ### Energy Measurement
692
-
693
- greenmining includes built-in energy measurement capabilities for tracking the carbon footprint of your analysis:
694
-
695
- #### Backend Options
696
-
697
- | Backend | Platform | Metrics | Requirements |
698
- |---------|----------|---------|--------------|
699
- | **RAPL** | Linux (Intel/AMD) | CPU/RAM energy (Joules) | `/sys/class/powercap/` access |
700
- | **CodeCarbon** | Cross-platform | Energy + Carbon emissions (gCO2) | `pip install codecarbon` |
701
- | **CPU Meter** | All platforms | Estimated CPU energy (Joules) | Optional: `pip install psutil` |
702
- | **Auto** | All platforms | Best available backend | Automatic detection |
703
-
704
- #### Python API
705
-
706
- ```python
707
- from greenmining.energy import RAPLEnergyMeter, CPUEnergyMeter, get_energy_meter
708
-
709
- # Auto-detect best backend
710
- meter = get_energy_meter("auto")
711
- meter.start()
712
- # ... run analysis ...
713
- result = meter.stop()
714
- print(f"Energy: {result.joules:.2f} J")
715
- print(f"Power: {result.watts_avg:.2f} W")
716
-
717
- # Integrated energy tracking during analysis
718
- from greenmining.services.local_repo_analyzer import LocalRepoAnalyzer
719
-
720
- analyzer = LocalRepoAnalyzer(energy_tracking=True, energy_backend="auto")
721
- result = analyzer.analyze_repository("https://github.com/pallets/flask")
722
- print(f"Analysis energy: {result.energy_metrics['joules']:.2f} J")
723
- ```
724
-
725
- #### Carbon Footprint Reporting
726
-
727
- ```python
728
- from greenmining.energy import CarbonReporter
729
-
730
- reporter = CarbonReporter(
731
- country_iso="USA",
732
- cloud_provider="aws",
733
- region="us-east-1",
734
- )
735
- report = reporter.generate_report(total_joules=3600.0)
736
- print(f"CO2: {report.total_emissions_kg * 1000:.4f} grams")
737
- print(f"Equivalent: {report.tree_months:.2f} tree-months to offset")
738
- ```
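-
- The meter and the reporter compose naturally: the joules returned by `meter.stop()` can be passed straight to `generate_report`. This is a minimal sketch, assuming only the `get_energy_meter` and `CarbonReporter` calls shown above.
-
- ```python
- from greenmining.energy import CarbonReporter, get_energy_meter
-
- meter = get_energy_meter("auto")
- meter.start()
- # ... run your analysis workload here ...
- measurement = meter.stop()
-
- # Convert the measured energy into an emissions estimate for a US grid profile
- reporter = CarbonReporter(country_iso="USA")
- report = reporter.generate_report(total_joules=measurement.joules)
- print(f"CO2: {report.total_emissions_kg * 1000:.4f} grams")
- ```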
739
-
740
- ### Pattern Database
741
-
742
- **124 green software patterns based on:**
743
- - Green Software Foundation (GSF) Patterns Catalog
744
- - VU Amsterdam 2024 research on ML system sustainability
745
- - ICSE 2024 conference papers on sustainable software
746
-
747
- ### Detection Performance
748
-
749
- - **Coverage**: 67% of patterns produce matches in real-world commits
750
- - **Accuracy**: 100% true positive rate for green-aware commits
751
- - **Categories**: 15 distinct sustainability domains covered
752
- - **Keywords**: 332 detection terms across all patterns
753
-
754
- ## GSF Pattern Categories
755
-
756
- **124 patterns across 15 categories:**
757
-
758
- ### 1. Cloud (40 patterns)
759
- Auto-scaling, serverless computing, right-sizing instances, region selection for renewable energy, spot instances, idle resource detection, cloud-native architectures
760
-
761
- ### 2. Web (17 patterns)
762
- CDN usage, caching strategies, lazy loading, asset compression, image optimization, minification, code splitting, tree shaking, prefetching
763
-
764
- ### 3. AI/ML (19 patterns)
765
- Model optimization, pruning, quantization, edge inference, batch optimization, efficient training, model compression, hardware acceleration, green ML pipelines
766
-
767
- ### 4. Database (5 patterns)
768
- Indexing strategies, query optimization, connection pooling, prepared statements, database views, denormalization for efficiency
769
-
770
- ### 5. Networking (8 patterns)
771
- Protocol optimization, connection reuse, HTTP/2, gRPC, efficient serialization, compression, persistent connections
772
-
773
- ### 6. Network (6 patterns)
774
- Request batching, GraphQL optimization, API gateway patterns, circuit breakers, rate limiting, request deduplication
775
-
776
- ### 7. Caching (2 patterns)
777
- Multi-level caching, cache invalidation strategies, data deduplication, distributed caching
778
-
779
- ### 8. Resource (2 patterns)
780
- Resource limits, dynamic allocation, memory management, CPU throttling
781
-
782
- ### 9. Data (3 patterns)
783
- Efficient serialization formats, pagination, streaming, data compression
784
-
785
- ### 10. Async (3 patterns)
786
- Event-driven architecture, reactive streams, polling elimination, non-blocking I/O
787
-
788
- ### 11. Code (4 patterns)
789
- Algorithm optimization, code efficiency, garbage collection tuning, memory profiling
790
-
791
- ### 12. Monitoring (3 patterns)
792
- Energy monitoring, performance profiling, APM tools, observability patterns
793
-
794
- ### 13. Microservices (4 patterns)
795
- Service decomposition, colocation strategies, graceful shutdown, service mesh optimization
796
-
797
- ### 14. Infrastructure (4 patterns)
798
- Alpine containers, Infrastructure as Code, renewable energy regions, container optimization
799
-
800
- ### 15. General (8 patterns)
801
- Feature flags, incremental processing, precomputation, background jobs, workflow optimization
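-
- The breakdown above can be reproduced directly from the pattern database. The sketch below uses only the `GSF_PATTERNS` structure shown in the Quick Start, where each pattern carries a `category` field.
-
- ```python
- from collections import Counter
-
- from greenmining import GSF_PATTERNS
-
- # Tally how many patterns fall into each category
- category_counts = Counter(p["category"] for p in GSF_PATTERNS.values())
- for category, count in category_counts.most_common():
-     print(f"{category}: {count} patterns")
- ```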
802
-
803
- ## Output Files
804
-
805
- All outputs are saved to the `data/` directory:
806
-
807
- - `repositories.json` - Repository metadata
808
- - `commits.json` - Extracted commit data
809
- - `analysis_results.json` - Pattern analysis results
810
- - `aggregated_statistics.json` - Summary statistics
811
- - `green_analysis_results.csv` - CSV export for spreadsheets
812
- - `green_microservices_analysis.md` - Final report
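-
- These files are plain JSON and CSV, so they can be loaded directly for further analysis. The sketch below assumes only the file names listed above, not any particular JSON schema.
-
- ```python
- import json
- from pathlib import Path
-
- import pandas as pd
-
- data_dir = Path("data")
-
- # Summary statistics (structure depends on the analysis that was run)
- stats = json.loads((data_dir / "aggregated_statistics.json").read_text())
- print(f"Top-level keys: {sorted(stats)}")
-
- # Tabular export, ready for spreadsheet-style analysis
- df = pd.read_csv(data_dir / "green_analysis_results.csv")
- print(df.head())
- ```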
813
-
814
- ## Development
815
-
816
- ```bash
817
- # Clone repository
818
- git clone https://github.com/adam-bouafia/greenmining.git
819
- cd greenmining
820
-
821
- # Install development dependencies
822
- pip install -e ".[dev]"
823
-
824
- # Run tests
825
- pytest tests/
826
-
827
- # Run with coverage
828
- pytest --cov=greenmining tests/
829
-
830
- # Format code
831
- black greenmining/ tests/
832
- ruff check greenmining/ tests/
833
- ```
834
-
835
- ## Requirements
836
-
837
- - Python 3.9+
838
- - PyGithub >= 2.1.1
839
- - gitpython >= 3.1.0
840
- - lizard >= 1.17.0
841
- - pandas >= 2.2.0
842
-
843
- **Optional dependencies:**
844
-
845
- ```bash
846
- pip install greenmining[energy] # psutil, codecarbon (energy measurement)
847
- pip install greenmining[dev] # pytest, black, ruff, mypy (development)
848
- ```
849
-
850
- ## License
851
-
852
- MIT License - See [LICENSE](LICENSE) for details.
853
-
854
- ## Contributing
855
-
856
- Contributions are welcome! Please open an issue or submit a pull request.
857
-
858
- ## Links
859
-
860
- - **GitHub**: https://github.com/adam-bouafia/greenmining
861
- - **PyPI**: https://pypi.org/project/greenmining/
862
- - **Docker Hub**: https://hub.docker.com/r/adambouafia/greenmining
863
- - **Documentation**: https://github.com/adam-bouafia/greenmining#readme
864
-
865
-