agent-starter-pack 0.0.1b0__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.


This version of agent-starter-pack might be problematic.

Files changed (162)
  1. agent_starter_pack-0.0.1b0.dist-info/METADATA +143 -0
  2. agent_starter_pack-0.0.1b0.dist-info/RECORD +162 -0
  3. agent_starter_pack-0.0.1b0.dist-info/WHEEL +4 -0
  4. agent_starter_pack-0.0.1b0.dist-info/entry_points.txt +2 -0
  5. agent_starter_pack-0.0.1b0.dist-info/licenses/LICENSE +201 -0
  6. agents/agentic_rag_vertexai_search/README.md +22 -0
  7. agents/agentic_rag_vertexai_search/app/agent.py +145 -0
  8. agents/agentic_rag_vertexai_search/app/retrievers.py +79 -0
  9. agents/agentic_rag_vertexai_search/app/templates.py +53 -0
  10. agents/agentic_rag_vertexai_search/notebooks/evaluating_langgraph_agent.ipynb +1561 -0
  11. agents/agentic_rag_vertexai_search/template/.templateconfig.yaml +14 -0
  12. agents/agentic_rag_vertexai_search/tests/integration/test_agent.py +57 -0
  13. agents/crewai_coding_crew/README.md +34 -0
  14. agents/crewai_coding_crew/app/agent.py +86 -0
  15. agents/crewai_coding_crew/app/crew/config/agents.yaml +39 -0
  16. agents/crewai_coding_crew/app/crew/config/tasks.yaml +37 -0
  17. agents/crewai_coding_crew/app/crew/crew.py +71 -0
  18. agents/crewai_coding_crew/notebooks/evaluating_crewai_agent.ipynb +1571 -0
  19. agents/crewai_coding_crew/notebooks/evaluating_langgraph_agent.ipynb +1561 -0
  20. agents/crewai_coding_crew/template/.templateconfig.yaml +12 -0
  21. agents/crewai_coding_crew/tests/integration/test_agent.py +47 -0
  22. agents/langgraph_base_react/README.md +9 -0
  23. agents/langgraph_base_react/app/agent.py +73 -0
  24. agents/langgraph_base_react/notebooks/evaluating_langgraph_agent.ipynb +1561 -0
  25. agents/langgraph_base_react/template/.templateconfig.yaml +13 -0
  26. agents/langgraph_base_react/tests/integration/test_agent.py +48 -0
  27. agents/multimodal_live_api/README.md +50 -0
  28. agents/multimodal_live_api/app/agent.py +86 -0
  29. agents/multimodal_live_api/app/server.py +193 -0
  30. agents/multimodal_live_api/app/templates.py +51 -0
  31. agents/multimodal_live_api/app/vector_store.py +55 -0
  32. agents/multimodal_live_api/template/.templateconfig.yaml +15 -0
  33. agents/multimodal_live_api/tests/integration/test_server_e2e.py +254 -0
  34. agents/multimodal_live_api/tests/load_test/load_test.py +40 -0
  35. agents/multimodal_live_api/tests/unit/test_server.py +143 -0
  36. src/base_template/.gitignore +197 -0
  37. src/base_template/Makefile +37 -0
  38. src/base_template/README.md +91 -0
  39. src/base_template/app/utils/tracing.py +143 -0
  40. src/base_template/app/utils/typing.py +115 -0
  41. src/base_template/deployment/README.md +123 -0
  42. src/base_template/deployment/cd/deploy-to-prod.yaml +98 -0
  43. src/base_template/deployment/cd/staging.yaml +215 -0
  44. src/base_template/deployment/ci/pr_checks.yaml +51 -0
  45. src/base_template/deployment/terraform/apis.tf +34 -0
  46. src/base_template/deployment/terraform/build_triggers.tf +122 -0
  47. src/base_template/deployment/terraform/dev/apis.tf +42 -0
  48. src/base_template/deployment/terraform/dev/iam.tf +90 -0
  49. src/base_template/deployment/terraform/dev/log_sinks.tf +66 -0
  50. src/base_template/deployment/terraform/dev/providers.tf +29 -0
  51. src/base_template/deployment/terraform/dev/storage.tf +76 -0
  52. src/base_template/deployment/terraform/dev/variables.tf +126 -0
  53. src/base_template/deployment/terraform/dev/vars/env.tfvars +21 -0
  54. src/base_template/deployment/terraform/iam.tf +130 -0
  55. src/base_template/deployment/terraform/locals.tf +50 -0
  56. src/base_template/deployment/terraform/log_sinks.tf +72 -0
  57. src/base_template/deployment/terraform/providers.tf +35 -0
  58. src/base_template/deployment/terraform/service_accounts.tf +42 -0
  59. src/base_template/deployment/terraform/storage.tf +100 -0
  60. src/base_template/deployment/terraform/variables.tf +202 -0
  61. src/base_template/deployment/terraform/vars/env.tfvars +43 -0
  62. src/base_template/pyproject.toml +113 -0
  63. src/base_template/tests/unit/test_utils/test_tracing_exporter.py +140 -0
  64. src/cli/commands/create.py +534 -0
  65. src/cli/commands/setup_cicd.py +730 -0
  66. src/cli/main.py +35 -0
  67. src/cli/utils/__init__.py +35 -0
  68. src/cli/utils/cicd.py +662 -0
  69. src/cli/utils/gcp.py +120 -0
  70. src/cli/utils/logging.py +51 -0
  71. src/cli/utils/template.py +644 -0
  72. src/data_ingestion/README.md +79 -0
  73. src/data_ingestion/data_ingestion_pipeline/components/ingest_data.py +175 -0
  74. src/data_ingestion/data_ingestion_pipeline/components/process_data.py +321 -0
  75. src/data_ingestion/data_ingestion_pipeline/pipeline.py +58 -0
  76. src/data_ingestion/data_ingestion_pipeline/submit_pipeline.py +184 -0
  77. src/data_ingestion/pyproject.toml +17 -0
  78. src/data_ingestion/uv.lock +999 -0
  79. src/deployment_targets/agent_engine/app/agent_engine_app.py +238 -0
  80. src/deployment_targets/agent_engine/app/utils/gcs.py +42 -0
  81. src/deployment_targets/agent_engine/deployment_metadata.json +4 -0
  82. src/deployment_targets/agent_engine/notebooks/intro_reasoning_engine.ipynb +869 -0
  83. src/deployment_targets/agent_engine/tests/integration/test_agent_engine_app.py +120 -0
  84. src/deployment_targets/agent_engine/tests/load_test/.results/.placeholder +0 -0
  85. src/deployment_targets/agent_engine/tests/load_test/.results/report.html +264 -0
  86. src/deployment_targets/agent_engine/tests/load_test/.results/results_exceptions.csv +1 -0
  87. src/deployment_targets/agent_engine/tests/load_test/.results/results_failures.csv +1 -0
  88. src/deployment_targets/agent_engine/tests/load_test/.results/results_stats.csv +3 -0
  89. src/deployment_targets/agent_engine/tests/load_test/.results/results_stats_history.csv +22 -0
  90. src/deployment_targets/agent_engine/tests/load_test/README.md +42 -0
  91. src/deployment_targets/agent_engine/tests/load_test/load_test.py +100 -0
  92. src/deployment_targets/agent_engine/tests/unit/test_dummy.py +22 -0
  93. src/deployment_targets/cloud_run/Dockerfile +29 -0
  94. src/deployment_targets/cloud_run/app/server.py +128 -0
  95. src/deployment_targets/cloud_run/deployment/terraform/artifact_registry.tf +22 -0
  96. src/deployment_targets/cloud_run/deployment/terraform/dev/service_accounts.tf +20 -0
  97. src/deployment_targets/cloud_run/tests/integration/test_server_e2e.py +192 -0
  98. src/deployment_targets/cloud_run/tests/load_test/.results/.placeholder +0 -0
  99. src/deployment_targets/cloud_run/tests/load_test/README.md +79 -0
  100. src/deployment_targets/cloud_run/tests/load_test/load_test.py +85 -0
  101. src/deployment_targets/cloud_run/tests/unit/test_server.py +142 -0
  102. src/deployment_targets/cloud_run/uv.lock +6952 -0
  103. src/frontends/live_api_react/frontend/package-lock.json +19405 -0
  104. src/frontends/live_api_react/frontend/package.json +56 -0
  105. src/frontends/live_api_react/frontend/public/favicon.ico +0 -0
  106. src/frontends/live_api_react/frontend/public/index.html +62 -0
  107. src/frontends/live_api_react/frontend/public/robots.txt +3 -0
  108. src/frontends/live_api_react/frontend/src/App.scss +189 -0
  109. src/frontends/live_api_react/frontend/src/App.test.tsx +25 -0
  110. src/frontends/live_api_react/frontend/src/App.tsx +205 -0
  111. src/frontends/live_api_react/frontend/src/components/audio-pulse/AudioPulse.tsx +64 -0
  112. src/frontends/live_api_react/frontend/src/components/audio-pulse/audio-pulse.scss +68 -0
  113. src/frontends/live_api_react/frontend/src/components/control-tray/ControlTray.tsx +217 -0
  114. src/frontends/live_api_react/frontend/src/components/control-tray/control-tray.scss +201 -0
  115. src/frontends/live_api_react/frontend/src/components/logger/Logger.tsx +241 -0
  116. src/frontends/live_api_react/frontend/src/components/logger/logger.scss +133 -0
  117. src/frontends/live_api_react/frontend/src/components/logger/mock-logs.ts +151 -0
  118. src/frontends/live_api_react/frontend/src/components/side-panel/SidePanel.tsx +161 -0
  119. src/frontends/live_api_react/frontend/src/components/side-panel/side-panel.scss +285 -0
  120. src/frontends/live_api_react/frontend/src/contexts/LiveAPIContext.tsx +48 -0
  121. src/frontends/live_api_react/frontend/src/hooks/use-live-api.ts +115 -0
  122. src/frontends/live_api_react/frontend/src/hooks/use-media-stream-mux.ts +23 -0
  123. src/frontends/live_api_react/frontend/src/hooks/use-screen-capture.ts +72 -0
  124. src/frontends/live_api_react/frontend/src/hooks/use-webcam.ts +69 -0
  125. src/frontends/live_api_react/frontend/src/index.css +28 -0
  126. src/frontends/live_api_react/frontend/src/index.tsx +35 -0
  127. src/frontends/live_api_react/frontend/src/multimodal-live-types.ts +242 -0
  128. src/frontends/live_api_react/frontend/src/react-app-env.d.ts +17 -0
  129. src/frontends/live_api_react/frontend/src/reportWebVitals.ts +31 -0
  130. src/frontends/live_api_react/frontend/src/setupTests.ts +21 -0
  131. src/frontends/live_api_react/frontend/src/utils/audio-recorder.ts +111 -0
  132. src/frontends/live_api_react/frontend/src/utils/audio-streamer.ts +270 -0
  133. src/frontends/live_api_react/frontend/src/utils/audioworklet-registry.ts +43 -0
  134. src/frontends/live_api_react/frontend/src/utils/multimodal-live-client.ts +329 -0
  135. src/frontends/live_api_react/frontend/src/utils/store-logger.ts +64 -0
  136. src/frontends/live_api_react/frontend/src/utils/utils.ts +86 -0
  137. src/frontends/live_api_react/frontend/src/utils/worklets/audio-processing.ts +73 -0
  138. src/frontends/live_api_react/frontend/src/utils/worklets/vol-meter.ts +65 -0
  139. src/frontends/live_api_react/frontend/tsconfig.json +25 -0
  140. src/frontends/streamlit/frontend/side_bar.py +213 -0
  141. src/frontends/streamlit/frontend/streamlit_app.py +263 -0
  142. src/frontends/streamlit/frontend/style/app_markdown.py +37 -0
  143. src/frontends/streamlit/frontend/utils/chat_utils.py +67 -0
  144. src/frontends/streamlit/frontend/utils/local_chat_history.py +125 -0
  145. src/frontends/streamlit/frontend/utils/message_editing.py +59 -0
  146. src/frontends/streamlit/frontend/utils/multimodal_utils.py +217 -0
  147. src/frontends/streamlit/frontend/utils/stream_handler.py +282 -0
  148. src/frontends/streamlit/frontend/utils/title_summary.py +77 -0
  149. src/resources/containers/data_processing/Dockerfile +25 -0
  150. src/resources/locks/uv-agentic_rag_vertexai_search-agent_engine.lock +4684 -0
  151. src/resources/locks/uv-agentic_rag_vertexai_search-cloud_run.lock +5799 -0
  152. src/resources/locks/uv-crewai_coding_crew-agent_engine.lock +5509 -0
  153. src/resources/locks/uv-crewai_coding_crew-cloud_run.lock +6688 -0
  154. src/resources/locks/uv-langgraph_base_react-agent_engine.lock +4595 -0
  155. src/resources/locks/uv-langgraph_base_react-cloud_run.lock +5710 -0
  156. src/resources/locks/uv-multimodal_live_api-cloud_run.lock +5665 -0
  157. src/resources/setup_cicd/cicd_variables.tf +36 -0
  158. src/resources/setup_cicd/github.tf +85 -0
  159. src/resources/setup_cicd/providers.tf +39 -0
  160. src/utils/generate_locks.py +135 -0
  161. src/utils/lock_utils.py +82 -0
  162. src/utils/watch_and_rebuild.py +190 -0
@@ -0,0 +1,79 @@
1
+ # Data Ingestion Pipeline
2
+
3
+ This pipeline automates the ingestion of data into Vertex AI Search, streamlining the process of building Retrieval Augmented Generation (RAG) applications.
4
+
5
+ It orchestrates the complete workflow: loading data, chunking it into manageable segments, generating embeddings using Vertex AI Embeddings, and importing the processed data into your Vertex AI Search datastore.
6
+
7
+ You can trigger the pipeline for an initial data load or schedule it to run periodically, ensuring your search index remains current. Vertex AI Pipelines provides the orchestration and monitoring capabilities for this process.
8
+
9
+ ## Prerequisites
10
+
11
+ Before running the data ingestion pipeline, ensure you have completed the following:
12
+
13
+ 1. **Set up Dev Terraform:** Follow the instructions in the parent [deployment/README.md - Dev Deployment section](../deployment/README.md#dev-deployment) to provision the necessary resources in your development environment using Terraform. This includes deploying a datastore and configuring the required permissions.
14
+
15
+ ## Running the Data Ingestion Pipeline
16
+
17
+ After setting up the Terraform infrastructure, you can test the data ingestion pipeline.
18
+
19
+ > **Note:** The initial pipeline execution might take longer while your project is being configured for Vertex AI Pipelines.
20
+
21
+ **Steps:**
22
+
23
+ **a. Navigate to the `data_ingestion` directory:**
24
+
25
+ ```bash
26
+ cd data_ingestion
27
+ ```
28
+
29
+ **b. Install Dependencies:**
30
+
31
+ Install the required Python dependencies using uv:
32
+
33
+ ```bash
34
+ uv sync --frozen
35
+ ```
36
+
37
+ **c. Execute the Pipeline:**
38
+
39
+ Run the following command to execute the data ingestion pipeline. Replace the placeholder values with your actual project details.
40
+
41
+ ```bash
42
+ PROJECT_ID="YOUR_PROJECT_ID"
43
+ REGION="us-central1"
44
+ DATA_STORE_REGION="us"
45
+ uv run data_ingestion_pipeline/submit_pipeline.py \
46
+ --project-id=$PROJECT_ID \
47
+ --region=$REGION \
48
+ --data-store-region=$DATA_STORE_REGION \
49
+ --data-store-id="sample-datastore" \
50
+ --service-account="vertexai-pipelines-sa@$PROJECT_ID.iam.gserviceaccount.com" \
51
+ --pipeline-root="gs://$PROJECT_ID-pipeline-artifacts" \
52
+ --pipeline-name="data-ingestion-pipeline"
53
+ ```
54
+
55
+ **Parameter Explanation:**
56
+
57
+ * `--project-id`: Your Google Cloud project ID.
58
+ * `--region`: The region where Vertex AI Pipelines will run (e.g., `us-central1`).
59
+ * `--data-store-region`: The region for Vertex AI Search operations (e.g., `us` or `eu`).
60
+ * `--data-store-id`: The ID of your Vertex AI Search datastore.
61
+ * `--service-account`: The service account email used for pipeline execution. Ensure this service account has the necessary permissions (e.g., Vertex AI User, Storage Object Admin).
62
+ * `--pipeline-root`: The Google Cloud Storage (GCS) bucket for storing pipeline artifacts.
63
+ * `--pipeline-name`: A descriptive name for your pipeline.
64
+ * `--schedule-only` (Optional): If specified, the pipeline will only be scheduled and not executed immediately. Requires `--cron-schedule`.
65
+ * `--cron-schedule` (Optional): A cron expression defining the pipeline's schedule (e.g., `"0 9 * * 1"` for every Monday at 9:00 AM UTC).
66
+
67
+ **d. Pipeline Scheduling and Execution:**
68
+
69
+ By default, the pipeline executes immediately. To schedule it for periodic execution without triggering an immediate run, use the `--schedule-only` flag together with `--cron-schedule`. If no schedule exists, one is created; if one already exists, its cron expression is updated to the provided value.
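+
+ For example, to create or update a weekly schedule without running the pipeline immediately, you can reuse the variables from step c (a sketch; the cron value is illustrative):
+
+ ```bash
+ uv run data_ingestion_pipeline/submit_pipeline.py \
+ --project-id=$PROJECT_ID \
+ --region=$REGION \
+ --data-store-region=$DATA_STORE_REGION \
+ --data-store-id="sample-datastore" \
+ --service-account="vertexai-pipelines-sa@$PROJECT_ID.iam.gserviceaccount.com" \
+ --pipeline-root="gs://$PROJECT_ID-pipeline-artifacts" \
+ --pipeline-name="data-ingestion-pipeline" \
+ --schedule-only \
+ --cron-schedule="0 9 * * 1"
+ ```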
70
+
71
+ **e. Monitoring Pipeline Progress:**
72
+
73
+ The pipeline's configuration and execution status will be printed to the console. For detailed monitoring, use the Vertex AI Pipelines dashboard in the Google Cloud Console. This dashboard provides real-time insights into the pipeline's progress, logs, and any potential issues.
74
+
75
+ ## Testing Your RAG Application
76
+
77
+ Once the data ingestion pipeline completes successfully, you can test your RAG application with Vertex AI Search.
78
+
79
+ > **Troubleshooting:** If you encounter the error `"google.api_core.exceptions.InvalidArgument: 400 The embedding field path: embedding not found in schema"` after the initial data ingestion, wait a few minutes and try again. This delay allows Vertex AI Search to fully index the ingested data.
@@ -0,0 +1,175 @@
1
+ # Copyright 2025 Google LLC
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+
15
+ from kfp.dsl import Dataset, Input, component
16
+
17
+
18
+ @component(
19
+ base_image="us-docker.pkg.dev/production-ai-template/starter-pack/data_processing:0.1"
20
+ )
21
+ def ingest_data(
22
+ project_id: str,
23
+ data_store_region: str,
24
+ input_files: Input[Dataset],
25
+ data_store_id: str,
26
+ embedding_dimension: int = 768,
27
+ embedding_column: str = "embedding",
28
+ ) -> None:
29
+ """Process and ingest documents into Vertex AI Search datastore.
30
+
31
+ Args:
32
+ project_id: Google Cloud project ID
33
+ data_store_region: Region for Vertex AI Search
34
+ input_files: Input dataset containing documents
35
+ data_store_id: ID of target datastore
36
+ embedding_column: Name of embedding column in schema
37
+ """
38
+ import json
39
+ import logging
40
+
41
+ from google.api_core.client_options import ClientOptions
42
+ from google.cloud import discoveryengine
43
+
44
+ def update_schema_as_json(
45
+ original_schema: str,
46
+ embedding_dimension: int,
47
+ field_name: str | None = None,
48
+ ) -> str:
49
+ """Update datastore schema JSON to include embedding field.
50
+
51
+ Args:
52
+ original_schema: Original schema JSON string
53
+ field_name: Name of embedding field to add
54
+
55
+ Returns:
56
+ Updated schema JSON string
57
+ """
58
+ original_schema_dict = json.loads(original_schema)
59
+
60
+ if original_schema_dict.get("properties") is None:
61
+ original_schema_dict["properties"] = {}
62
+
63
+ if field_name:
64
+ field_schema = {
65
+ "type": "array",
66
+ "keyPropertyMapping": "embedding_vector",
67
+ "dimension": embedding_dimension,
68
+ "items": {"type": "number"},
69
+ }
70
+ original_schema_dict["properties"][field_name] = field_schema
71
+
72
+ return json.dumps(original_schema_dict)
73
+
74
+ def update_data_store_schema(
75
+ project_id: str,
76
+ location: str,
77
+ data_store_id: str,
78
+ field_name: str | None = None,
79
+ client_options: ClientOptions | None = None,
80
+ ) -> None:
81
+ """Update datastore schema to include embedding field.
82
+
83
+ Args:
84
+ project_id: Google Cloud project ID
85
+ location: Google Cloud location
86
+ data_store_id: Target datastore ID
87
+ embedding_column: Name of embedding column
88
+ client_options: Client options for API
89
+ """
90
+ schema_client = discoveryengine.SchemaServiceClient(
91
+ client_options=client_options
92
+ )
93
+ collection = "default_collection"
94
+
95
+ name = f"projects/{project_id}/locations/{location}/collections/{collection}/dataStores/{data_store_id}/schemas/default_schema"
96
+
97
+ schema = schema_client.get_schema(
98
+ request=discoveryengine.GetSchemaRequest(name=name)
99
+ )
100
+ new_schema_json = update_schema_as_json(
101
+ original_schema=schema.json_schema,
102
+ embedding_dimension=embedding_dimension,
103
+ field_name=field_name,
104
+ )
105
+ new_schema = discoveryengine.Schema(json_schema=new_schema_json, name=name)
106
+
107
+ operation = schema_client.update_schema(
108
+ request=discoveryengine.UpdateSchemaRequest(
109
+ schema=new_schema, allow_missing=True
110
+ )
111
+ )
112
+ logging.info(f"Waiting for schema update operation: {operation.operation.name}")
113
+ operation.result()
114
+
115
+ def add_data_in_store(
116
+ project_id: str,
117
+ location: str,
118
+ data_store_id: str,
119
+ input_files_uri: str,
120
+ client_options: ClientOptions | None = None,
121
+ ) -> None:
122
+ """Import documents into datastore.
123
+
124
+ Args:
125
+ project_id: Google Cloud project ID
126
+ location: Google Cloud location
127
+ data_store_id: Target datastore ID
128
+ input_files_uri: URI of input files
129
+ client_options: Client options for API
130
+ """
131
+ client = discoveryengine.DocumentServiceClient(client_options=client_options)
132
+
133
+ parent = client.branch_path(
134
+ project=project_id,
135
+ location=location,
136
+ data_store=data_store_id,
137
+ branch="default_branch",
138
+ )
139
+
140
+ request = discoveryengine.ImportDocumentsRequest(
141
+ parent=parent,
142
+ gcs_source=discoveryengine.GcsSource(
143
+ input_uris=[input_files_uri],
144
+ data_schema="document",
145
+ ),
146
+ reconciliation_mode=discoveryengine.ImportDocumentsRequest.ReconciliationMode.FULL,
147
+ )
148
+
149
+ operation = client.import_documents(request=request)
150
+ logging.info(f"Waiting for import operation: {operation.operation.name}")
151
+ operation.result()
152
+
153
+ client_options = ClientOptions(
154
+ api_endpoint=f"{data_store_region}-discoveryengine.googleapis.com"
155
+ )
156
+
157
+ logging.info("Updating data store schema...")
158
+ update_data_store_schema(
159
+ project_id=project_id,
160
+ location=data_store_region,
161
+ data_store_id=data_store_id,
162
+ field_name=embedding_column,
163
+ client_options=client_options,
164
+ )
165
+ logging.info("Schema updated successfully")
166
+
167
+ logging.info("Importing data into store...")
168
+ add_data_in_store(
169
+ project_id=project_id,
170
+ location=data_store_region,
171
+ data_store_id=data_store_id,
172
+ client_options=client_options,
173
+ input_files_uri=input_files.uri,
174
+ )
175
+ logging.info("Data import completed")
@@ -0,0 +1,321 @@
1
+ # Copyright 2025 Google LLC
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+
15
+ """
16
+ This component is derived from the notebook:
17
+ https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/retrieval-augmented_generation/scalable_rag_with_bigframes.ipynb
18
+
19
+ It leverages BigQuery for data processing. We also suggest looking at remote functions for enhanced scalability.
20
+ """
21
+
22
+ from kfp.dsl import Dataset, Output, component
23
+
24
+
25
+ @component(
26
+ base_image="us-docker.pkg.dev/production-ai-template/starter-pack/data_processing:0.1"
27
+ )
28
+ def process_data(
29
+ project_id: str,
30
+ schedule_time: str,
31
+ output_files: Output[Dataset],
32
+ is_incremental: bool = True,
33
+ look_back_days: int = 1,
34
+ chunk_size: int = 1500,
35
+ chunk_overlap: int = 20,
36
+ destination_dataset: str = "stackoverflow_data",
37
+ destination_table: str = "incremental_questions_embeddings",
38
+ deduped_table: str = "questions_embeddings",
39
+ location: str = "us-central1",
40
+ embedding_column: str = "embedding",
41
+ ) -> None:
42
+ """Process StackOverflow questions and answers by:
43
+ 1. Fetching data from BigQuery
44
+ 2. Converting HTML to markdown
45
+ 3. Splitting text into chunks
46
+ 4. Generating embeddings
47
+ 5. Storing results in BigQuery
48
+ 6. Exporting to JSONL
49
+
50
+ Args:
51
+ project_id: Google Cloud project ID
+ schedule_time: Schedule time (ISO format) injected by Vertex AI Pipelines
+ output_files: Output dataset path
52
+ is_incremental: Whether to process only recent data
53
+ look_back_days: Number of days to look back for incremental processing
54
+ chunk_size: Size of text chunks
55
+ chunk_overlap: Overlap between chunks
56
+ destination_dataset: BigQuery dataset for storing results
57
+ destination_table: Table for storing incremental results
58
+ deduped_table: Table for storing deduplicated results
59
+ location: BigQuery location
60
+ """
61
+ import logging
62
+ from datetime import datetime, timedelta
63
+
64
+ import backoff
65
+ import bigframes.ml.llm as llm
66
+ import bigframes.pandas as bpd
67
+ import google.api_core.exceptions
68
+ import swifter
69
+ from google.cloud import bigquery
70
+ from langchain.text_splitter import RecursiveCharacterTextSplitter
71
+ from markdownify import markdownify
72
+
73
+ # Initialize logging
74
+ logging.basicConfig(level=logging.INFO)
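+ # Importing swifter registers the `.swifter` accessor used for the accelerated
+ # apply calls below; logging it also keeps the import referenced.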
75
+ logging.info(f"Using {swifter} for apply operations.")
76
+
77
+ # Initialize clients
78
+ logging.info("Initializing clients...")
79
+ bq_client = bigquery.Client(project=project_id, location=location)
80
+ bpd.options.bigquery.project = project_id
81
+ bpd.options.bigquery.location = location
82
+ logging.info("Clients initialized.")
83
+
84
+ # Set date range for data fetch
85
+ schedule_time_dt: datetime = datetime.fromisoformat(
86
+ schedule_time.replace("Z", "+00:00")
87
+ )
88
+ if schedule_time_dt.year == 1970:
89
+ logging.warning(
90
+ "Pipeline schedule not set. Setting schedule_time to current date."
91
+ )
92
+ schedule_time_dt = datetime.now()
93
+
94
+ # Note: The following line sets the schedule time 5 years back to allow sample data to be present.
95
+ # For your use case, please comment out the following line to use the actual schedule time.
96
+ schedule_time_dt = schedule_time_dt - timedelta(days=5 * 365)
97
+
98
+ START_DATE: datetime = schedule_time_dt - timedelta(
99
+ days=look_back_days
100
+ ) # Start date for data processing window
101
+ END_DATE: datetime = schedule_time_dt # End date for data processing window
102
+
103
+ logging.info(f"Date range set: START_DATE={START_DATE}, END_DATE={END_DATE}")
104
+
105
+ def fetch_stackoverflow_data(
106
+ dataset_suffix: str, start_date: str, end_date: str
107
+ ) -> bpd.DataFrame:
108
+ """Fetch StackOverflow data from BigQuery."""
109
+ query = f"""
110
+ SELECT
111
+ creation_date,
112
+ last_edit_date,
113
+ question_id,
114
+ question_title,
115
+ question_body AS question_text,
116
+ answers
117
+ FROM `production-ai-template.stackoverflow_qa_{dataset_suffix}.stackoverflow_python_questions_and_answers`
118
+ WHERE TRUE
119
+ {f'AND TIMESTAMP_TRUNC(creation_date, DAY) BETWEEN TIMESTAMP("{start_date}") AND TIMESTAMP("{end_date}")' if is_incremental else ""}
120
+ """
121
+ logging.info("Fetching StackOverflow data from BigQuery...")
122
+ return bpd.read_gbq(query)
123
+
124
+ def convert_html_to_markdown(html: str) -> str:
125
+ """Convert HTML into Markdown for easier parsing and rendering after LLM response."""
126
+ return markdownify(html).strip()
127
+
128
+ def create_answers_markdown(answers: list) -> str:
129
+ """Convert each answer's HTML to markdown and concatenate into a single markdown text."""
130
+ answers_md = ""
131
+ for index, answer_record in enumerate(answers):
132
+ answers_md += (
133
+ f"\n\n## Answer {index + 1}:\n" # Answer number is H2 heading size
134
+ )
135
+ answers_md += convert_html_to_markdown(answer_record["body"])
136
+ return answers_md
137
+
138
+ def create_table_if_not_exist(
139
+ df: bpd.DataFrame,
140
+ project_id: str,
141
+ dataset_id: str,
142
+ table_id: str,
143
+ partition_column: str,
144
+ location: str = location,
145
+ ) -> None:
146
+ """Create BigQuery table with time partitioning if it doesn't exist."""
147
+ table_schema = bq_client.get_table(df.head(0).to_gbq()).schema
148
+ table = bigquery.Table(
149
+ f"{project_id}.{dataset_id}.{table_id}", schema=table_schema
150
+ )
151
+ table.time_partitioning = bigquery.TimePartitioning(
152
+ type_=bigquery.TimePartitioningType.DAY, field=partition_column
153
+ )
154
+
155
+ dataset = bigquery.Dataset(f"{project_id}.{dataset_id}")
156
+ dataset.location = location
157
+ bq_client.create_dataset(dataset, exists_ok=True)
158
+ bq_client.create_table(table=table, exists_ok=True)
159
+
160
+ # Fetch and preprocess data
161
+ logging.info("Fetching and preprocessing data...")
162
+ df = fetch_stackoverflow_data(
163
+ start_date=START_DATE.strftime("%Y-%m-%d"),
164
+ end_date=END_DATE.strftime("%Y-%m-%d"),
165
+ dataset_suffix=location.lower().replace("-", "_"),
166
+ )
167
+ df = (
168
+ df.sort_values("last_edit_date", ascending=False)
169
+ .drop_duplicates("question_id")
170
+ .reset_index(drop=True)
171
+ )
172
+ logging.info("Data fetched and preprocessed.")
173
+
174
+ # Convert content to markdown
175
+ logging.info("Converting content to markdown...")
176
+
177
+ # Create markdown fields efficiently
178
+ df["question_title_md"] = (
179
+ "# " + df["question_title"] + "\n"
180
+ ) # Title is H1 heading size
181
+ df["question_text_md"] = (
182
+ df["question_text"].to_pandas().swifter.apply(convert_html_to_markdown) + "\n"
183
+ )
184
+ df["answers_md"] = df["answers"].to_pandas().swifter.apply(create_answers_markdown)
185
+
186
+ # Create a column containing the whole markdown text
187
+ df["full_text_md"] = (
188
+ df["question_title_md"] + df["question_text_md"] + df["answers_md"]
189
+ )
190
+ logging.info("Content converted to markdown.")
191
+
192
+ # Keep only necessary columns
193
+ df = df[["last_edit_date", "question_id", "question_text", "full_text_md"]]
194
+
195
+ # Split text into chunks
196
+ logging.info("Splitting text into chunks...")
197
+ text_splitter = RecursiveCharacterTextSplitter(
198
+ chunk_size=chunk_size,
199
+ chunk_overlap=chunk_overlap,
200
+ length_function=len,
201
+ )
202
+
203
+ df["text_chunk"] = (
204
+ df["full_text_md"]
205
+ .to_pandas()
206
+ .astype(object)
207
+ .swifter.apply(text_splitter.split_text)
208
+ )
209
+ logging.info("Text split into chunks.")
210
+
211
+ # Create chunk IDs and explode chunks into rows
212
+ logging.info("Creating chunk IDs and exploding chunks into rows...")
213
+ chunk_ids = [
214
+ str(idx) for text_chunk in df["text_chunk"] for idx in range(len(text_chunk))
215
+ ]
216
+ df = df.explode("text_chunk").reset_index(drop=True)
217
+ df["chunk_id"] = df["question_id"].astype("string") + "__" + chunk_ids
218
+ logging.info("Chunk IDs created and chunks exploded.")
219
+
220
+ # Generate embeddings
221
+ logging.info("Generating embeddings...")
222
+
223
+ # The first invocation in a new project might fail due to permission propagation.
224
+ @backoff.on_exception(
225
+ backoff.expo, google.api_core.exceptions.InvalidArgument, max_tries=10
226
+ )
227
+ def create_embedder() -> llm.TextEmbeddingGenerator:
228
+ return llm.TextEmbeddingGenerator(model_name="text-embedding-005")
229
+
230
+ embedder = create_embedder()
231
+
232
+ embeddings_df = embedder.predict(df["text_chunk"])
233
+ logging.info("Embeddings generated.")
234
+
235
+ df = df.assign(
236
+ embedding=embeddings_df["ml_generate_embedding_result"],
237
+ embedding_statistics=embeddings_df["ml_generate_embedding_statistics"],
238
+ embedding_status=embeddings_df["ml_generate_embedding_status"],
239
+ creation_timestamp=datetime.now(),
240
+ )
241
+
242
+ # Store results in BigQuery
243
+ PARTITION_DATE_COLUMN = "creation_timestamp"
244
+
245
+ # Create and populate incremental table
246
+ logging.info("Creating and populating incremental table...")
247
+ create_table_if_not_exist(
248
+ df=df,
249
+ project_id=project_id,
250
+ dataset_id=destination_dataset,
251
+ table_id=destination_table,
252
+ partition_column=PARTITION_DATE_COLUMN,
253
+ )
254
+
255
+ if_exists_mode = "append" if is_incremental else "replace"
256
+ df.to_gbq(
257
+ destination_table=f"{destination_dataset}.{destination_table}",
258
+ if_exists=if_exists_mode,
259
+ )
260
+ logging.info("Incremental table created and populated.")
261
+
262
+ # Create deduplicated table
263
+ logging.info("Creating deduplicated table...")
264
+ df_questions = bpd.read_gbq(
265
+ f"{destination_dataset}.{destination_table}", use_cache=False
266
+ )
267
+ max_date_df = (
268
+ df_questions.groupby("question_id")["creation_timestamp"].max().reset_index()
269
+ )
270
+ df_questions_dedup = max_date_df.merge(
271
+ df_questions, how="inner", on=["question_id", "creation_timestamp"]
272
+ )
273
+
274
+ create_table_if_not_exist(
275
+ df=df_questions_dedup,
276
+ project_id=project_id,
277
+ dataset_id=destination_dataset,
278
+ table_id=deduped_table,
279
+ partition_column=PARTITION_DATE_COLUMN,
280
+ )
281
+
282
+ df_questions_dedup.to_gbq(
283
+ destination_table=f"{destination_dataset}.{deduped_table}",
284
+ if_exists="replace",
285
+ )
286
+ logging.info("Deduplicated table created and populated.")
287
+
288
+ # Export to JSONL
289
+ logging.info("Exporting to JSONL...")
290
+
291
+ export_query = f"""
292
+ SELECT
293
+ chunk_id as id,
294
+ TO_JSON_STRING(STRUCT(
295
+ chunk_id as id,
296
+ embedding as {embedding_column},
297
+ text_chunk as content,
298
+ question_id,
299
+ CAST(creation_timestamp AS STRING) as creation_timestamp,
300
+ CAST(last_edit_date AS STRING) as last_edit_date,
301
+ question_text,
302
+ full_text_md
303
+ )) as json_data
304
+ FROM
305
+ `{project_id}.{destination_dataset}.{deduped_table}`
306
+ WHERE
307
+ chunk_id IS NOT NULL
308
+ AND embedding IS NOT NULL
309
+ """
310
+ export_df_id = bpd.read_gbq(export_query).to_gbq()
311
+
312
+ output_files.uri = output_files.uri + "*.jsonl"
313
+
314
+ job_config = bigquery.ExtractJobConfig()
315
+ job_config.destination_format = bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON
316
+
317
+ extract_job = bq_client.extract_table(
318
+ export_df_id, output_files.uri, job_config=job_config
319
+ )
320
+ extract_job.result()
321
+ logging.info("Exported to JSONL.")
@@ -0,0 +1,58 @@
1
+ # Copyright 2025 Google LLC
2
+ #
3
+ # Licensed under the Apache License, Version 2.0 (the "License");
4
+ # you may not use this file except in compliance with the License.
5
+ # You may obtain a copy of the License at
6
+ #
7
+ # http://www.apache.org/licenses/LICENSE-2.0
8
+ #
9
+ # Unless required by applicable law or agreed to in writing, software
10
+ # distributed under the License is distributed on an "AS IS" BASIS,
11
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12
+ # See the License for the specific language governing permissions and
13
+ # limitations under the License.
14
+
15
+ from data_ingestion_pipeline.components.ingest_data import ingest_data
16
+ from data_ingestion_pipeline.components.process_data import process_data
17
+ from kfp import dsl
18
+
19
+
20
+ @dsl.pipeline(description="A pipeline to run ingestion of new data into the datastore")
21
+ def pipeline(
22
+ project_id: str,
23
+ location: str,
24
+ data_store_region: str,
25
+ data_store_id: str,
26
+ is_incremental: bool = True,
27
+ look_back_days: int = 1,
28
+ chunk_size: int = 1500,
29
+ chunk_overlap: int = 20,
30
+ destination_dataset: str = "stackoverflow_data",
31
+ destination_table: str = "incremental_questions_embeddings",
32
+ deduped_table: str = "questions_embeddings",
33
+ ) -> None:
34
+ """Processes data and ingests it into a datastore for RAG Retrieval"""
35
+
36
+ # Process the data and generate embeddings
37
+ processed_data = process_data(
38
+ project_id=project_id,
39
+ schedule_time=dsl.PIPELINE_JOB_SCHEDULE_TIME_UTC_PLACEHOLDER,
40
+ is_incremental=is_incremental,
41
+ look_back_days=look_back_days,
42
+ chunk_size=chunk_size,
43
+ chunk_overlap=chunk_overlap,
44
+ destination_dataset=destination_dataset,
45
+ destination_table=destination_table,
46
+ deduped_table=deduped_table,
47
+ location=location,
48
+ embedding_column="embedding",
49
+ )
50
+
51
+ # Ingest the processed data into Vertex AI Search datastore
52
+ ingest_data(
53
+ project_id=project_id,
54
+ data_store_region=data_store_region,
55
+ input_files=processed_data.output,
56
+ data_store_id=data_store_id,
57
+ embedding_column="embedding",
58
+ )