RubyGems - broadlistening - Versions diffs - 0.7.0 - Mend

broadlistening 0.7.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (38)

checksums.yaml +7 -0
data/.rspec +3 -0
data/.rubocop.yml +3 -0
data/CHANGELOG.md +40 -0
data/CLAUDE.md +112 -0
data/LICENSE +24 -0
data/LICENSE-AGPLv3.txt +661 -0
data/README.md +195 -0
data/Rakefile +77 -0
data/exe/broadlistening +6 -0
data/lib/broadlistening/argument.rb +136 -0
data/lib/broadlistening/cli.rb +196 -0
data/lib/broadlistening/comment.rb +128 -0
data/lib/broadlistening/compatibility.rb +375 -0
data/lib/broadlistening/config.rb +190 -0
data/lib/broadlistening/context.rb +180 -0
data/lib/broadlistening/csv_loader.rb +109 -0
data/lib/broadlistening/hierarchical_clustering.rb +142 -0
data/lib/broadlistening/kmeans.rb +185 -0
data/lib/broadlistening/llm_client.rb +84 -0
data/lib/broadlistening/pipeline.rb +129 -0
data/lib/broadlistening/planner.rb +114 -0
data/lib/broadlistening/provider.rb +97 -0
data/lib/broadlistening/spec_loader.rb +86 -0
data/lib/broadlistening/status.rb +132 -0
data/lib/broadlistening/steps/aggregation.rb +228 -0
data/lib/broadlistening/steps/base_step.rb +42 -0
data/lib/broadlistening/steps/clustering.rb +103 -0
data/lib/broadlistening/steps/embedding.rb +40 -0
data/lib/broadlistening/steps/extraction.rb +73 -0
data/lib/broadlistening/steps/initial_labelling.rb +85 -0
data/lib/broadlistening/steps/merge_labelling.rb +93 -0
data/lib/broadlistening/steps/overview.rb +36 -0
data/lib/broadlistening/version.rb +5 -0
data/lib/broadlistening.rb +44 -0
data/schema/hierarchical_result.json +152 -0
data/sig/broadlistening.rbs +4 -0
metadata +194 -0

data/README.md ADDED

@@ -0,0 +1,195 @@
+# Broadlistening
+A Ruby implementation of the Broadlistening pipeline from Kouchou-AI (広聴AI). It uses an LLM to cluster and analyze public comments.
+## Overview
+Broadlistening is a pipeline for analyzing large volumes of comments and opinions with AI. It runs the following steps:
+1. **Extraction** - Extracts the main opinions from each comment with an LLM
+2. **Embedding** - Converts the extracted opinions into vectors
+3. **Clustering** - UMAP + KMeans + hierarchical clustering
+4. **Initial Labelling** - Labels each cluster with an LLM
+5. **Merge Labelling** - Merges labels hierarchically
+6. **Overview** - Generates an overall summary with an LLM
+7. **Aggregation** - Assembles the results and outputs them as JSON
+## Installation
+### Add to your Gemfile
+```ruby
+gem 'broadlistening'
+```
+Or install directly from GitHub:
+```ruby
+gem 'broadlistening', github: 'takahashim/broadlistening-ruby'
+```
+### Install dependencies
+```bash
+bundle install
+```
+## Usage
+### Basic usage
+```ruby
+require 'broadlistening'
+# Prepare the comment data
+comments = [
+  { id: "1", body: "Measures to address environmental problems are needed", proposal_id: "123" },
+  { id: "2", body: "I would like to see better public transportation", proposal_id: "123" },
+  # ...
+]
+# Run the pipeline
+pipeline = Broadlistening::Pipeline.new(
+  api_key: ENV['OPENAI_API_KEY'],
+  model: "gpt-4o-mini",
+  cluster_nums: [5, 15]
+)
+result = pipeline.run(comments)
+# Read the results
+puts result[:overview]
+puts result[:clusters]
+```
+### Example usage with Rails
+```ruby
+# app/jobs/analysis_job.rb
+class AnalysisJob < ApplicationJob
+  queue_as :analysis
+  def perform(proposal_id)
+    proposal = Proposal.find(proposal_id)
+    comments = proposal.comments.map do |c|
+      { id: c.id, body: c.body, proposal_id: c.proposal_id }
+    end
+    pipeline = Broadlistening::Pipeline.new(
+      api_key: ENV['OPENAI_API_KEY'],
+      model: "gpt-4o-mini",
+      cluster_nums: [5, 15]
+    )
+    result = pipeline.run(comments)
+    proposal.create_analysis_result!(
+      result_data: result,
+      comment_count: comments.size
+    )
+  end
+end
+```
+### Configuration options
+```ruby
+Broadlistening::Pipeline.new(
+  api_key: "your-api-key",          # OpenAI API key (required)
+  model: "gpt-4o-mini",             # LLM model (default: gpt-4o-mini)
+  embedding_model: "text-embedding-3-small",  # Embedding model
+  cluster_nums: [5, 15],            # Number of clusters at each hierarchy level (default: [5, 15])
+  workers: 10,                      # Number of parallel workers
+  prompts: {                        # Custom prompts (optional)
+    extraction: "...",
+    initial_labelling: "...",
+    merge_labelling: "...",
+    overview: "..."
+  }
+)
+```
+## Output format
+The pipeline result is a Hash with the following structure:
+```ruby
+{
+  arguments: [
+    {
+      arg_id: "A1_0",
+      argument: "Measures to address environmental problems are needed",
+      x: 0.5,           # UMAP x-coordinate
+      y: 0.3,           # UMAP y-coordinate
+      cluster_ids: ["0", "1_0", "2_3"]  # IDs of the clusters this argument belongs to
+    },
+    # ...
+  ],
+  clusters: [
+    {
+      level: 0,
+      id: "0",
+      label: "全体",  # "Overall"
+      description: "",
+      count: 100,
+      parent: nil
+    },
+    {
+      level: 1,
+      id: "1_0",
+      label: "Environment and energy",
+      description: "Opinions on environmental issues and energy policy",
+      count: 25,
+      parent: "0"
+    },
+    # ...
+  ],
+  relations: [
+    { arg_id: "A1_0", comment_id: "1", proposal_id: "123" },
+    # ...
+  ],
+  comment_count: 50,
+  argument_count: 100,
+  overview: "Overview text of the analysis...",
+  config: { model: "gpt-4o-mini", ... }
+}
+```
+## Dependencies
+- Ruby >= 3.1.0
+- activesupport >= 7.0
+- numo-narray ~> 0.9
+- ruby-openai ~> 7.0
+- parallel ~> 1.20
+- rice ~> 4.6.0
+- umappp ~> 0.2
+### Installing umappp
+umappp includes a C++ native extension, so a C++ compiler is required at install time:
+```bash
+# macOS
+CXX=clang++ gem install umappp
+# Linux
+gem install umappp
+```
+**Note**: Rice 4.7.x has compatibility issues, so use Rice 4.6.x.
+## Development
+```bash
+# Setup
+bin/setup
+# Run the tests
+bundle exec rspec
+# Console
+bin/console
+```
+## License
+AGPL 3.0
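
Reviewer note: the README's output format can be read back with plain Hash operations. The sketch below is not part of the gem. It assumes the result uses symbol keys exactly as the README shows, and `RESULT` and `arguments_by_cluster` are hypothetical names with made-up sample data.

```ruby
# Hypothetical sample shaped like the README's output format (not real pipeline output).
RESULT = {
  arguments: [
    { arg_id: "A1_0", argument: "Measures to address environmental problems are needed",
      cluster_ids: ["0", "1_0", "2_3"] },
    { arg_id: "A2_0", argument: "Public transportation should be improved",
      cluster_ids: ["0", "1_1", "2_5"] },
    { arg_id: "A2_1", argument: "More bus routes are needed",
      cluster_ids: ["0", "1_1", "2_5"] }
  ],
  clusters: [
    { level: 0, id: "0", label: "Overall", parent: nil },
    { level: 1, id: "1_0", label: "Environment and energy", parent: "0" },
    { level: 1, id: "1_1", label: "Transport", parent: "0" }
  ]
}.freeze

# Returns { cluster label => [argument texts] } for all clusters at the given level.
def arguments_by_cluster(result, level:)
  result[:clusters]
    .select { |cluster| cluster[:level] == level }
    .to_h do |cluster|
      members = result[:arguments].select { |arg| arg[:cluster_ids].include?(cluster[:id]) }
      [cluster[:label], members.map { |arg| arg[:argument] }]
    end
end

# Print each level-1 cluster with its member arguments.
arguments_by_cluster(RESULT, level: 1).each do |label, texts|
  puts "#{label} (#{texts.size})"
  texts.each { |text| puts "  - #{text}" }
end
```

Because every argument lists its cluster ID at each level in `cluster_ids`, the same helper works for any level without walking the `parent` links.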

data/Rakefile ADDED

@@ -0,0 +1,77 @@
+# frozen_string_literal: true
+require "bundler/gem_tasks"
+require "rspec/core/rake_task"
+RSpec::Core::RakeTask.new(:spec)
+RSpec::Core::RakeTask.new("spec:compatibility") do |t|
+  t.pattern = "spec/compatibility/**/*_spec.rb"
+end
+namespace :compatibility do
+  desc "Validate Kouchou-AI Python output structure"
+  task :validate_python do
+    require "bundler/setup"
+    require "broadlistening"
+    require "json"
+    python_path = File.expand_path(
+      "../server/broadlistening/pipeline/outputs/example-hierarchical-polis/hierarchical_result.json",
+      __dir__
+    )
+    unless File.exist?(python_path)
+      puts "Error: Python output not found at: #{python_path}"
+      exit 1
+    end
+    output = JSON.parse(File.read(python_path))
+    errors = Broadlistening::Compatibility.validate_output(output)
+    if errors.empty?
+      puts "Valid: #{python_path}"
+      puts ""
+      puts "Stats:"
+      puts "  Arguments: #{output['arguments'].size}"
+      puts "  Clusters: #{output['clusters'].size}"
+      puts "  Levels: #{output['clusters'].map { |c| c['level'] }.uniq.sort.join(', ')}"
+      puts "  Has overview: #{!output['overview'].to_s.strip.empty?}"
+    else
+      puts "Invalid: #{python_path}"
+      errors.each { |e| puts "  - #{e}" }
+      exit 1
+    end
+  end
+  desc "Compare Python and Ruby outputs"
+  task :compare, [ :python_file, :ruby_file ] do |_t, args|
+    require "bundler/setup"
+    require "broadlistening"
+    python_file = args[:python_file]
+    ruby_file = args[:ruby_file]
+    unless python_file && ruby_file
+      puts "Usage: rake compatibility:compare[python_output.json,ruby_output.json]"
+      exit 1
+    end
+    [ python_file, ruby_file ].each do |file|
+      unless File.exist?(file)
+        puts "Error: File not found: #{file}"
+        exit 1
+      end
+    end
+    report = Broadlistening::Compatibility.compare_outputs(
+      python_output: python_file,
+      ruby_output: ruby_file
+    )
+    puts report.summary
+    exit(report.compatible? ? 0 : 1)
+  end
+end
+task default: :spec
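
Reviewer note: the statistics printed by `compatibility:validate_python` can be reproduced without the gem. The sketch below assumes a parsed `hierarchical_result.json` with string keys, as the task reads it. `SAMPLE_JSON` and `output_stats` are hypothetical names, and the sample data is made up.

```ruby
require "json"

# Hypothetical, minimal stand-in for a hierarchical_result.json file.
SAMPLE_JSON = <<~JSON
  {
    "arguments": [{"arg_id": "A1_0"}, {"arg_id": "A2_0"}],
    "clusters": [
      {"level": 0, "id": "0"},
      {"level": 1, "id": "1_0"},
      {"level": 1, "id": "1_1"}
    ],
    "overview": "  "
  }
JSON

# Computes the same four figures the Rake task prints.
def output_stats(output)
  {
    arguments: output["arguments"].size,
    clusters: output["clusters"].size,
    levels: output["clusters"].map { |c| c["level"] }.uniq.sort,
    # A missing or whitespace-only overview counts as absent.
    has_overview: !output["overview"].to_s.strip.empty?
  }
end

puts output_stats(JSON.parse(SAMPLE_JSON))
```

The `to_s.strip.empty?` check treats `nil`, an empty string, and a whitespace-only string the same way, so the task never raises on a missing `overview` key.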

data/exe/broadlistening ADDED

@@ -0,0 +1,6 @@
+#!/usr/bin/env ruby
+# frozen_string_literal: true
+require "broadlistening"
+Broadlistening::CLI.new(ARGV).run

data/lib/broadlistening/argument.rb ADDED

@@ -0,0 +1,136 @@
+# frozen_string_literal: true
+module Broadlistening
+  # Represents an extracted argument (opinion) from a comment.
+  #
+  # Arguments are created during the extraction step and enriched through
+  # subsequent pipeline steps (embedding, clustering).
+  #
+  # @example Creating an argument
+  #   arg = Argument.new(arg_id: "A1_0", argument: "We need more parks", comment_id: "1")
+  #   arg.embedding = [0.1, 0.2, 0.3]  # Added by embedding step
+  #   arg.x = 0.5                       # Added by clustering step
+  class Argument
+    attr_accessor :arg_id, :argument, :comment_id,
+                  :embedding, :x, :y, :cluster_ids,
+                  :attributes, :url, :properties
+    def initialize(
+      arg_id:,
+      argument:,
+      comment_id:,
+      embedding: nil,
+      x: nil,
+      y: nil,
+      cluster_ids: nil,
+      attributes: nil,
+      url: nil,
+      properties: nil
+    )
+      @arg_id = arg_id
+      @argument = argument
+      @comment_id = comment_id
+      @embedding = embedding
+      @x = x
+      @y = y
+      @cluster_ids = cluster_ids
+      @attributes = attributes
+      @url = url
+      @properties = properties
+    end
+    # Create an Argument from a hash
+    #
+    # @param hash [Hash] Input hash with argument data
+    # @return [Argument]
+    def self.from_hash(hash)
+      new(
+        arg_id: hash[:arg_id] || hash["arg_id"],
+        argument: hash[:argument] || hash["argument"],
+        comment_id: hash[:comment_id] || hash["comment_id"],
+        embedding: hash[:embedding] || hash["embedding"],
+        x: hash[:x] || hash["x"],
+        y: hash[:y] || hash["y"],
+        cluster_ids: hash[:cluster_ids] || hash["cluster_ids"],
+        attributes: hash[:attributes] || hash["attributes"],
+        url: hash[:url] || hash["url"],
+        properties: hash[:properties] || hash["properties"]
+      )
+    end
+    # Create an Argument from a Comment during extraction
+    #
+    # @param comment [Comment] Source comment
+    # @param opinion_text [String] Extracted opinion text
+    # @param index [Integer] Opinion index within the comment
+    # @return [Argument]
+    def self.from_comment(comment, opinion_text, index)
+      new(
+        arg_id: "A#{comment.id}_#{index}",
+        argument: opinion_text,
+        comment_id: comment.id,
+        attributes: comment.attributes,
+        url: comment.source_url,
+        properties: comment.properties
+      )
+    end
+    # Convert to hash for serialization
+    #
+    # @return [Hash]
+    def to_h
+      {
+        arg_id: @arg_id,
+        argument: @argument,
+        comment_id: @comment_id,
+        embedding: @embedding,
+        x: @x,
+        y: @y,
+        cluster_ids: @cluster_ids,
+        attributes: @attributes,
+        url: @url,
+        properties: @properties
+      }.compact
+    end
+    # Convert to hash with only embedding data (for embeddings.json)
+    #
+    # @return [Hash]
+    def to_embedding_h
+      {
+        arg_id: @arg_id,
+        embedding: @embedding
+      }
+    end
+    # Convert to hash with only clustering data (for clustering.json)
+    #
+    # @return [Hash]
+    def to_clustering_h
+      {
+        arg_id: @arg_id,
+        x: @x,
+        y: @y,
+        cluster_ids: @cluster_ids
+      }
+    end
+    # Check if argument belongs to a specific cluster
+    #
+    # @param cluster_id [String] Cluster ID to check
+    # @return [Boolean]
+    def in_cluster?(cluster_id)
+      @cluster_ids&.include?(cluster_id) || false
+    end
+    # Extract numeric comment_id from arg_id if comment_id is not set
+    #
+    # @return [Integer]
+    def comment_id_int
+      return @comment_id.to_i if @comment_id
+      match = @arg_id&.match(/\AA(\d+)_/)
+      match ? match[1].to_i : 0
+    end
+  end
+end
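
Reviewer note: `from_comment` builds IDs as `A<comment id>_<index>`, and `comment_id_int` parses them back with `/\AA(\d+)_/`. The sketch below isolates that round trip without loading the gem. `build_arg_id` and `comment_id_from_arg_id` are hypothetical helpers that only mirror the logic shown in the diff.

```ruby
# Same pattern as in Argument#comment_id_int: "A", then digits, then "_".
ARG_ID_PATTERN = /\AA(\d+)_/

# Mirrors the ID format used by Argument.from_comment.
def build_arg_id(comment_id, index)
  "A#{comment_id}_#{index}"
end

# Mirrors the fallback branch of Argument#comment_id_int:
# returns the numeric comment ID, or 0 when the ID does not match.
def comment_id_from_arg_id(arg_id)
  match = arg_id&.match(ARG_ID_PATTERN)
  match ? match[1].to_i : 0
end

puts build_arg_id("1", 0)
puts comment_id_from_arg_id("A42_3")
puts comment_id_from_arg_id("Aabc-123_0")
```

Note that the pattern only accepts digits, so an argument built from a non-numeric comment ID (a UUID, for example) falls back to 0 when `comment_id` itself is not set.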

data/lib/broadlistening/cli.rb ADDED

@@ -0,0 +1,196 @@
+# frozen_string_literal: true
+require "optparse"
+require "json"
+require "pathname"
+require "fileutils"
+module Broadlistening
+  class CLI
+    PIPELINE_DIR = Pathname.new(__dir__).parent.parent / "outputs"
+    attr_reader :options
+    def initialize(argv = ARGV)
+      @argv = argv
+      @options = {
+        force: false,
+        only: nil,
+        skip_interaction: false
+      }
+    end
+    def run
+      parse_options
+      validate_config_path
+      config = load_config
+      validate_config(config)
+      output_dir = determine_output_dir
+      ensure_output_dir(output_dir)
+      unless @options[:skip_interaction]
+        show_plan(config, output_dir)
+        confirm_execution || exit(0)
+      end
+      execute_pipeline(config, output_dir)
+    rescue Broadlistening::Error => e
+      $stderr.puts "Error: #{e.message}"
+      exit 1
+    rescue Interrupt
+      $stderr.puts "\nInterrupted"
+      exit 130
+    end
+    private
+    def parse_options
+      parser = OptionParser.new do |opts|
+        opts.banner = "Usage: broadlistening CONFIG [options]"
+        opts.separator ""
+        opts.separator "Run the broadlistening pipeline with the specified configuration."
+        opts.separator ""
+        opts.separator "Options:"
+        opts.on("-f", "--force", "Force re-run all steps regardless of previous execution") do
+          @options[:force] = true
+        end
+        opts.on("-o", "--only STEP", "Run only the specified step (e.g., extraction, embedding, clustering, etc.)") do |step|
+          @options[:only] = step.to_sym
+        end
+        opts.on("--skip-interaction", "Skip the interactive confirmation prompt and run pipeline immediately") do
+          @options[:skip_interaction] = true
+        end
+        opts.on("-h", "--help", "Show this help message") do
+          puts opts
+          exit 0
+        end
+        opts.on("-v", "--version", "Show version") do
+          puts "broadlistening #{Broadlistening::VERSION}"
+          exit 0
+        end
+      end
+      parser.parse!(@argv)
+      @config_path = @argv.first
+    end
+    def validate_config_path
+      unless @config_path
+        $stderr.puts "Error: CONFIG is required"
+        $stderr.puts "Usage: broadlistening CONFIG [options]"
+        exit 1
+      end
+      unless File.exist?(@config_path)
+        $stderr.puts "Error: Config file not found: #{@config_path}"
+        exit 1
+      end
+    end
+    def load_config
+      Config.from_file(@config_path)
+    rescue JSON::ParserError => e
+      raise Broadlistening::ConfigurationError, "Invalid JSON in config file: #{e.message}"
+    end
+    def validate_config(config)
+      raise Broadlistening::ConfigurationError, "Missing required field 'input' in config" unless config.input
+      raise Broadlistening::ConfigurationError, "Missing required field 'question' in config" unless config.question
+      raise Broadlistening::ConfigurationError, "Input file not found: #{config.input}" unless File.exist?(config.input)
+    end
+    def determine_output_dir
+      # Same as the Python version: derive the output directory from the config file name
+      # e.g., "config/my_report.json" -> "outputs/my_report"
+      config_basename = File.basename(@config_path, ".*")
+      PIPELINE_DIR / config_basename
+    end
+    def ensure_output_dir(output_dir)
+      FileUtils.mkdir_p(output_dir) unless output_dir.exist?
+    end
+    def show_plan(config, output_dir)
+      puts "So, here is what I am planning to run:"
+      planner = create_planner(config, output_dir)
+      plan = planner.create_plan(force: @options[:force], only: @options[:only])
+      plan.each do |step|
+        status = step[:run] ? "RUN" : "SKIP"
+        puts "  #{step[:step]}: #{status} (#{step[:reason]})"
+      end
+      puts ""
+    end
+    def confirm_execution
+      print "Looks good? Press enter to continue or Ctrl+C to abort."
+      $stdin.gets
+      true
+    rescue Interrupt
+      puts ""
+      false
+    end
+    def create_planner(config, output_dir)
+      status = Status.new(output_dir)
+      Planner.new(config: config, status: status, output_dir: output_dir)
+    end
+    def execute_pipeline(config, output_dir)
+      comments = load_comments(config.input)
+      pipeline = Pipeline.new(config)
+      setup_progress_output
+      result = pipeline.run(
+        comments,
+        output_dir: output_dir.to_s,
+        force: @options[:force],
+        only: @options[:only]
+      )
+      puts ""
+      puts "Pipeline completed."
+      result
+    end
+    def load_comments(input_path)
+      case File.extname(input_path).downcase
+      when ".csv"
+        CsvLoader.load(input_path)
+      when ".json"
+        JSON.parse(File.read(input_path), symbolize_names: true)
+      else
+        raise Broadlistening::ConfigurationError, "Unsupported input format: #{File.extname(input_path)}"
+      end
+    end
+    def setup_progress_output
+      ActiveSupport::Notifications.subscribe("step.broadlistening") do |*, payload|
+        puts "Running step: #{payload[:step]}"
+      end
+      ActiveSupport::Notifications.subscribe("step.skip.broadlistening") do |*, payload|
+        puts "Skipping '#{payload[:step]}'"
+      end
+      ActiveSupport::Notifications.subscribe("progress.broadlistening") do |*, payload|
+        step = payload[:step]
+        current = payload[:current]
+        total = payload[:total]
+        print "\r  #{step}: #{current}/#{total}"
+        puts "" if current == total
+      end
+    end
+  end
+end
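
Reviewer note: the CLI's argument handling and output directory naming can be tried in isolation. The sketch below uses only the standard library and is not the gem's code. `parse_cli` and `output_dir_for` are hypothetical helpers, and the relative `outputs` base directory is an assumption made for the example (the gem resolves `PIPELINE_DIR` relative to its own install location).

```ruby
require "optparse"
require "pathname"

# Hypothetical helper mirroring the CLI's three flags.
# Returns [options, config_path] without mutating argv.
def parse_cli(argv)
  options = { force: false, only: nil, skip_interaction: false }
  parser = OptionParser.new do |opts|
    opts.on("-f", "--force") { options[:force] = true }
    opts.on("-o", "--only STEP") { |step| options[:only] = step.to_sym }
    opts.on("--skip-interaction") { options[:skip_interaction] = true }
  end
  # OptionParser#parse returns the remaining non-option arguments.
  rest = parser.parse(argv)
  [options, rest.first]
end

# Hypothetical helper: "config/my_report.json" -> "outputs/my_report".
def output_dir_for(config_path, base: Pathname.new("outputs"))
  base / File.basename(config_path, ".*")
end

options, config_path = parse_cli(["config/my_report.json", "--only", "extraction", "-f"])
puts options
puts output_dir_for(config_path)
```

Since the directory name comes only from the config file's basename, two configs with the same file name in different folders would share one output directory.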