hdsp-jupyter-extension 2.0.11__py3-none-any.whl → 2.0.13__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (81)
  1. agent_server/langchain/MULTI_AGENT_ARCHITECTURE.md +1114 -0
  2. agent_server/langchain/__init__.py +2 -2
  3. agent_server/langchain/agent.py +72 -33
  4. agent_server/langchain/agent_factory.py +400 -0
  5. agent_server/langchain/agent_prompts/__init__.py +25 -0
  6. agent_server/langchain/agent_prompts/athena_query_prompt.py +71 -0
  7. agent_server/langchain/agent_prompts/planner_prompt.py +85 -0
  8. agent_server/langchain/agent_prompts/python_developer_prompt.py +123 -0
  9. agent_server/langchain/agent_prompts/researcher_prompt.py +38 -0
  10. agent_server/langchain/custom_middleware.py +652 -195
  11. agent_server/langchain/hitl_config.py +34 -10
  12. agent_server/langchain/middleware/__init__.py +24 -0
  13. agent_server/langchain/middleware/code_history_middleware.py +412 -0
  14. agent_server/langchain/middleware/description_injector.py +150 -0
  15. agent_server/langchain/middleware/skill_middleware.py +298 -0
  16. agent_server/langchain/middleware/subagent_events.py +171 -0
  17. agent_server/langchain/middleware/subagent_middleware.py +329 -0
  18. agent_server/langchain/prompts.py +96 -101
  19. agent_server/langchain/skills/data_analysis.md +236 -0
  20. agent_server/langchain/skills/data_loading.md +158 -0
  21. agent_server/langchain/skills/inference.md +392 -0
  22. agent_server/langchain/skills/model_training.md +318 -0
  23. agent_server/langchain/skills/pyspark.md +352 -0
  24. agent_server/langchain/subagents/__init__.py +20 -0
  25. agent_server/langchain/subagents/base.py +173 -0
  26. agent_server/langchain/tools/__init__.py +3 -0
  27. agent_server/langchain/tools/jupyter_tools.py +58 -20
  28. agent_server/langchain/tools/lsp_tools.py +1 -1
  29. agent_server/langchain/tools/shared/__init__.py +26 -0
  30. agent_server/langchain/tools/shared/qdrant_search.py +175 -0
  31. agent_server/langchain/tools/tool_registry.py +219 -0
  32. agent_server/langchain/tools/workspace_tools.py +197 -0
  33. agent_server/routers/config.py +40 -1
  34. agent_server/routers/langchain_agent.py +818 -337
  35. {hdsp_jupyter_extension-2.0.11.data → hdsp_jupyter_extension-2.0.13.data}/data/share/jupyter/labextensions/hdsp-agent/build_log.json +1 -1
  36. {hdsp_jupyter_extension-2.0.11.data → hdsp_jupyter_extension-2.0.13.data}/data/share/jupyter/labextensions/hdsp-agent/package.json +7 -2
  37. hdsp_jupyter_extension-2.0.11.data/data/share/jupyter/labextensions/hdsp-agent/static/frontend_styles_index_js.2d9fb488c82498c45c2d.js → hdsp_jupyter_extension-2.0.13.data/data/share/jupyter/labextensions/hdsp-agent/static/frontend_styles_index_js.037b3c8e5d6a92b63b16.js +1108 -179
  38. hdsp_jupyter_extension-2.0.13.data/data/share/jupyter/labextensions/hdsp-agent/static/frontend_styles_index_js.037b3c8e5d6a92b63b16.js.map +1 -0
  39. jupyter_ext/labextension/static/lib_index_js.58c1e128ba0b76f41f04.js → hdsp_jupyter_extension-2.0.13.data/data/share/jupyter/labextensions/hdsp-agent/static/lib_index_js.5449ba3c7e25177d2987.js +3916 -8128
  40. hdsp_jupyter_extension-2.0.13.data/data/share/jupyter/labextensions/hdsp-agent/static/lib_index_js.5449ba3c7e25177d2987.js.map +1 -0
  41. hdsp_jupyter_extension-2.0.11.data/data/share/jupyter/labextensions/hdsp-agent/static/remoteEntry.9da31d1134a53b0c4af5.js → hdsp_jupyter_extension-2.0.13.data/data/share/jupyter/labextensions/hdsp-agent/static/remoteEntry.a8e0b064eb9b1c1ff463.js +17 -17
  42. hdsp_jupyter_extension-2.0.13.data/data/share/jupyter/labextensions/hdsp-agent/static/remoteEntry.a8e0b064eb9b1c1ff463.js.map +1 -0
  43. {hdsp_jupyter_extension-2.0.11.dist-info → hdsp_jupyter_extension-2.0.13.dist-info}/METADATA +1 -1
  44. {hdsp_jupyter_extension-2.0.11.dist-info → hdsp_jupyter_extension-2.0.13.dist-info}/RECORD +75 -51
  45. jupyter_ext/_version.py +1 -1
  46. jupyter_ext/handlers.py +59 -8
  47. jupyter_ext/labextension/build_log.json +1 -1
  48. jupyter_ext/labextension/package.json +7 -2
  49. jupyter_ext/labextension/static/{frontend_styles_index_js.2d9fb488c82498c45c2d.js → frontend_styles_index_js.037b3c8e5d6a92b63b16.js} +1108 -179
  50. jupyter_ext/labextension/static/frontend_styles_index_js.037b3c8e5d6a92b63b16.js.map +1 -0
  51. hdsp_jupyter_extension-2.0.11.data/data/share/jupyter/labextensions/hdsp-agent/static/lib_index_js.58c1e128ba0b76f41f04.js → jupyter_ext/labextension/static/lib_index_js.5449ba3c7e25177d2987.js +3916 -8128
  52. jupyter_ext/labextension/static/lib_index_js.5449ba3c7e25177d2987.js.map +1 -0
  53. jupyter_ext/labextension/static/{remoteEntry.9da31d1134a53b0c4af5.js → remoteEntry.a8e0b064eb9b1c1ff463.js} +17 -17
  54. jupyter_ext/labextension/static/remoteEntry.a8e0b064eb9b1c1ff463.js.map +1 -0
  55. hdsp_jupyter_extension-2.0.11.data/data/share/jupyter/labextensions/hdsp-agent/static/frontend_styles_index_js.2d9fb488c82498c45c2d.js.map +0 -1
  56. hdsp_jupyter_extension-2.0.11.data/data/share/jupyter/labextensions/hdsp-agent/static/lib_index_js.58c1e128ba0b76f41f04.js.map +0 -1
  57. hdsp_jupyter_extension-2.0.11.data/data/share/jupyter/labextensions/hdsp-agent/static/remoteEntry.9da31d1134a53b0c4af5.js.map +0 -1
  58. jupyter_ext/labextension/static/frontend_styles_index_js.2d9fb488c82498c45c2d.js.map +0 -1
  59. jupyter_ext/labextension/static/lib_index_js.58c1e128ba0b76f41f04.js.map +0 -1
  60. jupyter_ext/labextension/static/remoteEntry.9da31d1134a53b0c4af5.js.map +0 -1
  61. {hdsp_jupyter_extension-2.0.11.data → hdsp_jupyter_extension-2.0.13.data}/data/etc/jupyter/jupyter_server_config.d/hdsp_jupyter_extension.json +0 -0
  62. {hdsp_jupyter_extension-2.0.11.data → hdsp_jupyter_extension-2.0.13.data}/data/share/jupyter/labextensions/hdsp-agent/install.json +0 -0
  63. {hdsp_jupyter_extension-2.0.11.data → hdsp_jupyter_extension-2.0.13.data}/data/share/jupyter/labextensions/hdsp-agent/static/node_modules_emotion_use-insertion-effect-with-fallbacks_dist_emotion-use-insertion-effect-wi-3ba6b80.c095373419d05e6f141a.js +0 -0
  64. {hdsp_jupyter_extension-2.0.11.data → hdsp_jupyter_extension-2.0.13.data}/data/share/jupyter/labextensions/hdsp-agent/static/node_modules_emotion_use-insertion-effect-with-fallbacks_dist_emotion-use-insertion-effect-wi-3ba6b80.c095373419d05e6f141a.js.map +0 -0
  65. {hdsp_jupyter_extension-2.0.11.data → hdsp_jupyter_extension-2.0.13.data}/data/share/jupyter/labextensions/hdsp-agent/static/node_modules_emotion_use-insertion-effect-with-fallbacks_dist_emotion-use-insertion-effect-wi-3ba6b81.61e75fb98ecff46cf836.js +0 -0
  66. {hdsp_jupyter_extension-2.0.11.data → hdsp_jupyter_extension-2.0.13.data}/data/share/jupyter/labextensions/hdsp-agent/static/node_modules_emotion_use-insertion-effect-with-fallbacks_dist_emotion-use-insertion-effect-wi-3ba6b81.61e75fb98ecff46cf836.js.map +0 -0
  67. {hdsp_jupyter_extension-2.0.11.data → hdsp_jupyter_extension-2.0.13.data}/data/share/jupyter/labextensions/hdsp-agent/static/style.js +0 -0
  68. {hdsp_jupyter_extension-2.0.11.data → hdsp_jupyter_extension-2.0.13.data}/data/share/jupyter/labextensions/hdsp-agent/static/vendors-node_modules_babel_runtime_helpers_esm_extends_js-node_modules_emotion_serialize_dist-051195.e2553aab0c3963b83dd7.js +0 -0
  69. {hdsp_jupyter_extension-2.0.11.data → hdsp_jupyter_extension-2.0.13.data}/data/share/jupyter/labextensions/hdsp-agent/static/vendors-node_modules_babel_runtime_helpers_esm_extends_js-node_modules_emotion_serialize_dist-051195.e2553aab0c3963b83dd7.js.map +0 -0
  70. {hdsp_jupyter_extension-2.0.11.data → hdsp_jupyter_extension-2.0.13.data}/data/share/jupyter/labextensions/hdsp-agent/static/vendors-node_modules_emotion_cache_dist_emotion-cache_browser_development_esm_js.24edcc52a1c014a8a5f0.js +0 -0
  71. {hdsp_jupyter_extension-2.0.11.data → hdsp_jupyter_extension-2.0.13.data}/data/share/jupyter/labextensions/hdsp-agent/static/vendors-node_modules_emotion_cache_dist_emotion-cache_browser_development_esm_js.24edcc52a1c014a8a5f0.js.map +0 -0
  72. {hdsp_jupyter_extension-2.0.11.data → hdsp_jupyter_extension-2.0.13.data}/data/share/jupyter/labextensions/hdsp-agent/static/vendors-node_modules_emotion_react_dist_emotion-react_browser_development_esm_js.19ecf6babe00caff6b8a.js +0 -0
  73. {hdsp_jupyter_extension-2.0.11.data → hdsp_jupyter_extension-2.0.13.data}/data/share/jupyter/labextensions/hdsp-agent/static/vendors-node_modules_emotion_react_dist_emotion-react_browser_development_esm_js.19ecf6babe00caff6b8a.js.map +0 -0
  74. {hdsp_jupyter_extension-2.0.11.data → hdsp_jupyter_extension-2.0.13.data}/data/share/jupyter/labextensions/hdsp-agent/static/vendors-node_modules_emotion_styled_dist_emotion-styled_browser_development_esm_js.661fb5836f4978a7c6e1.js +0 -0
  75. {hdsp_jupyter_extension-2.0.11.data → hdsp_jupyter_extension-2.0.13.data}/data/share/jupyter/labextensions/hdsp-agent/static/vendors-node_modules_emotion_styled_dist_emotion-styled_browser_development_esm_js.661fb5836f4978a7c6e1.js.map +0 -0
  76. {hdsp_jupyter_extension-2.0.11.data → hdsp_jupyter_extension-2.0.13.data}/data/share/jupyter/labextensions/hdsp-agent/static/vendors-node_modules_mui_material_index_js.985697e0162d8d088ca2.js +0 -0
  77. {hdsp_jupyter_extension-2.0.11.data → hdsp_jupyter_extension-2.0.13.data}/data/share/jupyter/labextensions/hdsp-agent/static/vendors-node_modules_mui_material_index_js.985697e0162d8d088ca2.js.map +0 -0
  78. {hdsp_jupyter_extension-2.0.11.data → hdsp_jupyter_extension-2.0.13.data}/data/share/jupyter/labextensions/hdsp-agent/static/vendors-node_modules_mui_material_utils_createSvgIcon_js.1f5038488cdfd8b3a85d.js +0 -0
  79. {hdsp_jupyter_extension-2.0.11.data → hdsp_jupyter_extension-2.0.13.data}/data/share/jupyter/labextensions/hdsp-agent/static/vendors-node_modules_mui_material_utils_createSvgIcon_js.1f5038488cdfd8b3a85d.js.map +0 -0
  80. {hdsp_jupyter_extension-2.0.11.dist-info → hdsp_jupyter_extension-2.0.13.dist-info}/WHEEL +0 -0
  81. {hdsp_jupyter_extension-2.0.11.dist-info → hdsp_jupyter_extension-2.0.13.dist-info}/licenses/LICENSE +0 -0
@@ -0,0 +1,236 @@
+ ---
+ name: data-analysis
+ description: DataFrame operation optimization. Use for memory-intensive operations such as groupby, merge, and pivot. Covers vectorized operations, query optimization, and memory-efficient patterns.
+ ---
+
+ # Data Analysis Optimization Guide
+
+ A guide to running pandas DataFrame analysis operations in a memory-efficient way.
+
+ ## Resource Tiers
+
+ ### TIER_SMALL: DataFrame < 1GB, plenty of free RAM
+ Plain pandas operations are fine.
+
+ ### TIER_MEDIUM: DataFrame 1-5GB
+ Vectorized operations and query optimization are required.
+
+ ### TIER_LARGE: DataFrame > 5GB, or under memory pressure
+ Use Dask/Polars or chunked processing. A tier-selection sketch follows below.
+
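+ As a minimal sketch of how these tiers can be picked at runtime — the helper name `pick_tier` and the use of in-memory size as the criterion are assumptions for illustration, not part of this skill:
+
+ ```python
+ import pandas as pd
+
+ def pick_tier(df: pd.DataFrame) -> str:
+     """Map a DataFrame's in-memory size to the tiers above (hypothetical helper)."""
+     size_gb = df.memory_usage(deep=True).sum() / 1024**3
+     if size_gb < 1:
+         return "TIER_SMALL"
+     if size_gb <= 5:
+         return "TIER_MEDIUM"
+     return "TIER_LARGE"
+ ```
+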
+ ---
+
+ ## 1. Vectorized Operations
+
+ ### Bad: row-by-row loop
+ ```python
+ # Slow - never do this
+ for i in range(len(df)):
+     df.loc[i, "new_col"] = df.loc[i, "col1"] * 2
+ ```
+
+ ### Good: vectorization
+ ```python
+ # Fast - always prefer this
+ df["new_col"] = df["col1"] * 2
+ ```
+
+ ### Conditional operations
+ ```python
+ # Bad: apply
+ df["category"] = df["value"].apply(lambda x: "high" if x > 100 else "low")
+
+ # Good: np.where (10x+ faster)
+ import numpy as np
+ df["category"] = np.where(df["value"] > 100, "high", "low")
+
+ # Multiple conditions: np.select (checked in order, first match wins)
+ conditions = [
+     df["value"] > 100,
+     df["value"] > 50,
+ ]
+ choices = ["high", "medium"]
+ df["category"] = np.select(conditions, choices, default="low")
+ ```
+
+ ---
+
+ ## 2. GroupBy Optimization
+
+ ### Basics
+ ```python
+ # sort=False skips the cost of sorting group keys
+ result = df.groupby("category", sort=False)["value"].sum()
+
+ # Multiple aggregations in a single pass
+ result = df.groupby("category", sort=False).agg({
+     "value": ["sum", "mean", "count"],
+     "amount": "sum"
+ })
+ ```
+
+ ### Large GroupBy (under memory pressure)
+ ```python
+ # Option A: numba acceleration (requires numba installed)
+ # engine="numba" JIT-compiles the function; it must accept (values, index) arrays
+ def custom_agg(values, index):
+     return values.sum()
+
+ result = df.groupby("category")["value"].agg(custom_agg, engine="numba")
+
+ # Option B: Dask
+ import dask.dataframe as dd
+ ddf = dd.from_pandas(df, npartitions=4)
+ result = ddf.groupby("category")["value"].sum().compute()
+ ```
+
+ ---
+
+ ## 3. Merge/Join Optimization
+
+ ### Memory-efficient merge
+ ```python
+ # 1. Select only the needed columns before merging
+ df1_subset = df1[["key", "needed_col1", "needed_col2"]]
+ df2_subset = df2[["key", "needed_col3"]]
+ result = pd.merge(df1_subset, df2_subset, on="key")
+
+ # 2. Put the small table on the left (memory-friendly)
+ result = pd.merge(small_df, large_df, on="key", how="left")
+ ```
+
+ ### Large merge (under memory pressure)
+ ```python
+ # Chunked merge: join the small table onto the large one, one chunk at a time
+ def chunked_merge(large_df, small_df, on, chunksize=100_000):
+     chunks = []
+     for start in range(0, len(large_df), chunksize):
+         chunk = large_df.iloc[start:start + chunksize]
+         merged = pd.merge(chunk, small_df, on=on, how="left")
+         chunks.append(merged)
+     return pd.concat(chunks, ignore_index=True)
+
+ result = chunked_merge(large_df, small_df, on="key")
+ ```
+
+ ---
+
+ ## 4. Query Optimization
+
+ ### eval() (fast on large DataFrames)
+ ```python
+ # Plain approach
+ df["c"] = df["a"] + df["b"]
+ df["d"] = df["c"] * 2
+
+ # eval(): avoids materializing intermediate results
+ df = df.eval("""
+ c = a + b
+ d = c * 2
+ """)
+ ```
+
+ ### query() (filtering)
+ ```python
+ # Plain approach
+ result = df[(df["col1"] > 10) & (df["col2"] == "active")]
+
+ # query(): faster on large frames and more readable
+ result = df.query("col1 > 10 and col2 == 'active'")
+
+ # Referencing local variables with @
+ threshold = 10
+ status = "active"
+ result = df.query("col1 > @threshold and col2 == @status")
+ ```
+
+ ---
+
+ ## 5. Pivot/Unpivot Optimization
+
+ ### Pivot Table
+ ```python
+ # Basic usage
+ pivot = df.pivot_table(
+     values="amount",
+     index="date",
+     columns="category",
+     aggfunc="sum",
+     fill_value=0
+ )
+
+ # Under memory pressure: pivot chunk by chunk, then re-aggregate
+ def chunked_pivot(df, chunksize=100_000):
+     results = []
+     for start in range(0, len(df), chunksize):
+         chunk = df.iloc[start:start + chunksize]
+         # Same arguments as the basic usage above
+         pivot = chunk.pivot_table(
+             values="amount", index="date", columns="category",
+             aggfunc="sum", fill_value=0
+         )
+         results.append(pivot)
+     # Partial pivots share the same index, so sum them back together
+     return pd.concat(results).groupby(level=0).sum()
+ ```
+
+ ---
+
+ ## 6. Memory Management
+
+ ### Delete objects you no longer need
+ ```python
+ import gc
+
+ # Delete intermediate results
+ del intermediate_df
+ gc.collect()
+
+ # Drop columns in place
+ df.drop(columns=["unneeded_col"], inplace=True)
+ ```
+
+ ### Saving memory while keeping the original
+ ```python
+ # Selecting columns with a list always returns a new DataFrame; with
+ # copy-on-write enabled (pandas >= 2.0) it shares memory until modified
+ subset = df[["col1", "col2"]]
+ subset = df[["col1", "col2"]].copy()  # explicit copy (separate memory)
+ ```
+
+ ---
+
+ ## 7. Operation Speed Comparison
+
+ | Operation | Slow Method | Fast Method | Speedup |
+ |-----------|-------------|-------------|---------|
+ | Conditional assignment | apply(lambda) | np.where | 10-100x |
+ | String operations | apply(str) | .str accessor | 5-20x |
+ | Repeated computation | for loop | vectorized | 100-1000x |
+ | Multiple aggregations | several groupby calls | single .agg() | 2-5x |
+ | Filtering | boolean indexing | .query() | 1.5-3x |
+
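+ To sanity-check the first row of this table, here is a minimal micro-benchmark sketch; the column name, data size, and threshold are illustrative only, not part of this skill:
+
+ ```python
+ import time
+ import numpy as np
+ import pandas as pd
+
+ # 1M random integers in [0, 200) -- synthetic data for the comparison
+ df = pd.DataFrame({"value": np.random.randint(0, 200, size=1_000_000)})
+
+ start = time.time()
+ slow = df["value"].apply(lambda x: "high" if x > 100 else "low")
+ print(f"apply:    {time.time() - start:.3f}s")
+
+ start = time.time()
+ fast = np.where(df["value"] > 100, "high", "low")
+ print(f"np.where: {time.time() - start:.3f}s")
+ ```
+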
+ ---
+
+ ## 8. Measuring Performance
+
+ ```python
+ import time
+
+ # Wall-clock timing
+ start = time.time()
+ result = df.groupby("category")["value"].sum()
+ print(f"Elapsed: {time.time() - start:.2f}s")
+
+ # Memory profiling in Jupyter (pip install memory_profiler)
+ %load_ext memory_profiler
+ %memit df.groupby("category")["value"].sum()
+ ```
+
+ ---
+
+ ## Quick Reference
+
+ ```python
+ # Large-scale analysis checklist
+ # 1. Optimize dtypes (see the data_loading skill)
+ # 2. Select only the columns you need
+ # 3. Vectorize (np.where/np.select instead of apply)
+ # 4. Use eval()/query()
+ # 5. Pass sort=False to groupby
+ # 6. Delete intermediate results (del + gc.collect())
+ # 7. Switch to Dask/Polars when memory runs out
+ ```
+
@@ -0,0 +1,158 @@
+ ---
+ name: data-loading
+ description: Large-file loading optimization. Use when a CSV/Parquet file exceeds 100MB or memory is tight. Covers chunking, sampling, dtype optimization, and switching to Dask/Polars.
+ ---
+
+ # Data Loading Optimization Guide
+
+ A guide to memory-efficient loading of large datasets.
+
+ ## Resource Tiers
+
+ ### TIER_SMALL: file < 100MB, plenty of free RAM
+ Load directly; no special optimization needed.
+
+ ```python
+ import pandas as pd
+ df = pd.read_csv("data.csv")
+ # or
+ df = pd.read_parquet("data.parquet")
+ ```
+
+ ### TIER_MEDIUM: file 100MB - 1GB
+ Optimize dtypes and load only the columns you need.
+
+ ```python
+ import pandas as pd
+
+ # 1. Load only the needed columns (up to ~90% memory savings)
+ df = pd.read_csv("data.csv", usecols=["col1", "col2", "col3"])
+
+ # 2. Specify optimized dtypes
+ dtype_map = {
+     "id": "int32",               # int64 → int32 (50% savings)
+     "category_col": "category",  # string → category (90%+ savings)
+     "float_col": "float32",      # float64 → float32 (50% savings)
+ }
+ df = pd.read_csv("data.csv", dtype=dtype_map)
+
+ # 3. With Parquet (built-in compression, column selection)
+ df = pd.read_parquet("data.parquet", columns=["col1", "col2"])
+ ```
+
+ ### TIER_LARGE: file > 1GB, or under memory pressure
+ Use chunking or Dask/Polars.
+
+ #### Option A: Chunking (simple aggregations)
+ ```python
+ import pandas as pd
+
+ # Process chunk by chunk (memory footprint: one chunk at a time)
+ chunks = pd.read_csv("large_data.csv", chunksize=100_000)
+
+ # Example: aggregate per chunk, then combine
+ total_count = 0
+ for chunk in chunks:
+     total_count += len(chunk[chunk["status"] == "active"])
+
+ print(f"Active records: {total_count}")
+ ```
+
+ #### Option B: Dask (complex operations, groupby, etc.)
+ ```python
+ import dask.dataframe as dd
+
+ # Load with Dask (lazy evaluation, memory-efficient)
+ ddf = dd.read_csv("large_data.csv")
+
+ # Use it like pandas (chunked under the hood)
+ result = ddf.groupby("category")["value"].mean().compute()
+ ```
+
+ #### Option C: Polars (high-performance alternative)
+ ```python
+ import polars as pl
+
+ # Polars: Rust-based, often several times faster than pandas
+ df = pl.read_csv("large_data.csv")
+
+ # Or lazy mode (lets the engine optimize the plan and reduce memory)
+ df = pl.scan_csv("large_data.csv").filter(
+     pl.col("date") > "2024-01-01"
+ ).collect()
+ ```
+
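+ Routing between these tiers only needs the on-disk file size. A minimal sketch — the helper name `loading_tier` is hypothetical, and the thresholds simply mirror the tiers above:
+
+ ```python
+ import os
+
+ def loading_tier(path: str) -> str:
+     """Map a file's on-disk size to the tiers above (hypothetical helper)."""
+     size_mb = os.path.getsize(path) / 1024**2
+     if size_mb < 100:
+         return "TIER_SMALL"
+     if size_mb <= 1024:
+         return "TIER_MEDIUM"
+     return "TIER_LARGE"
+ ```
+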
+ ## dtype Optimization Details
+
+ | Original Type | Optimized Type | Memory Savings | When to Use |
+ |---------------|----------------|----------------|-------------|
+ | int64 | int32 | 50% | values within ±2.1 billion |
+ | int64 | int16 | 75% | values within ±32,767 |
+ | int64 | int8 | 87.5% | values within ±127 |
+ | float64 | float32 | 50% | ~7 significant digits of precision is enough |
+ | object (string) | category | 90%+ | unique values < 50% of rows |
+
+ ### Automatic dtype optimization helper
+ ```python
+ def optimize_dtypes(df):
+     """Downcast each column of a DataFrame to the smallest safe dtype."""
+     for col in df.columns:
+         col_type = df[col].dtype
+
+         if col_type == "int64":
+             if df[col].min() >= 0:
+                 if df[col].max() <= 255:
+                     df[col] = df[col].astype("uint8")
+                 elif df[col].max() <= 65535:
+                     df[col] = df[col].astype("uint16")
+                 else:
+                     df[col] = df[col].astype("uint32")
+             else:
+                 if df[col].min() >= -128 and df[col].max() <= 127:
+                     df[col] = df[col].astype("int8")
+                 elif df[col].min() >= -32768 and df[col].max() <= 32767:
+                     df[col] = df[col].astype("int16")
+                 else:
+                     df[col] = df[col].astype("int32")
+
+         elif col_type == "float64":
+             df[col] = df[col].astype("float32")
+
+         elif col_type == "object":
+             num_unique = df[col].nunique()
+             num_total = len(df[col])
+             if num_unique / num_total < 0.5:  # fewer than 50% unique values
+                 df[col] = df[col].astype("category")
+
+     return df
+ ```
+
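+ Usage is a one-liner; pairing it with the memory check at the end of this guide makes the effect visible (the before/after pattern here is illustrative):
+
+ ```python
+ before = df.memory_usage(deep=True).sum() / 1024**2
+ df = optimize_dtypes(df)
+ after = df.memory_usage(deep=True).sum() / 1024**2
+ print(f"{before:.1f} MB -> {after:.1f} MB")
+ ```
+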
+ ## File Format Recommendations
+
+ | Format | Read Speed | Write Speed | Compression | Best For |
+ |--------|------------|-------------|-------------|----------|
+ | CSV | Slow | Slow | None | compatibility, simple data |
+ | Parquet | Fast | Fast | Excellent | large-scale analysis, column selection |
+ | Feather | Fastest | Fastest | Good | data exchange between pandas processes |
+ | HDF5 | Fast | Fast | Good | multidimensional arrays |
+
+ ### CSV → Parquet conversion (one-time cost, fast loads afterwards)
+ ```python
+ import pandas as pd
+
+ # Convert once
+ df = pd.read_csv("data.csv")
+ df.to_parquet("data.parquet", compression="snappy")
+
+ # Fast loads from then on
+ df = pd.read_parquet("data.parquet")
+ ```
+
+ ## Checking Memory Usage
+
+ ```python
+ # Total DataFrame memory usage
+ print(df.memory_usage(deep=True).sum() / 1024**2, "MB")
+
+ # Per-column memory usage
+ print(df.memory_usage(deep=True) / 1024**2)
+ ```
+