bigrecord 0.1.0 → 0.1.1

@@ -1,6 +1,7 @@
1
1
  = Big Record
2
2
 
3
- A Ruby Object/Data Mapper for distributed column-oriented data stores (inspired by BigTable) such as HBase. Intended to work as a drop-in for Rails applications.
3
+ A Ruby Object/Data Mapper for distributed column-oriented data stores (inspired by BigTable) such as HBase. Intended
4
+ to work as a drop-in for Rails applications.
4
5
 
5
6
  == Features
6
7
  * Dynamic schemas (due to the schema-less design of BigTable).
@@ -12,22 +13,33 @@ A Ruby Object/Data Mapper for distributed column-oriented data stores (inspired
12
13
 
13
14
  == Motivations
14
15
 
15
- BigTable, and by extension, Bigrecord isn't right for everyone. A great introductory article discussing this topic can be found at http://blog.rapleaf.com/dev/?p=26 explaining why you would or wouldn't use BigTable. The rule of thumb, however, is that if your data model is simple or can fit into a standard RDBMS, then you probably don't need it.
16
+ BigTable, and by extension Bigrecord, isn't right for everyone. A great introductory article discussing this topic can
17
+ be found at http://blog.rapleaf.com/dev/?p=26 explaining why you would or wouldn't use BigTable. The rule of thumb,
18
+ however, is that if your data model is simple or can fit into a standard RDBMS, then you probably don't need it.
16
19
 
17
20
  Beyond this though, there are two basic motivations that almost immediately demand a BigTable model database:
18
- 1. Your data is highly dynamic in nature and would not fit in a schema bound model, or you cannot define a schema ahead of time.
19
- 2. You know that your database will grow to tens or hundreds of gigabytes, and can't afford big iron servers. Instead, you'd like to scale horizontally across many commodity servers.
21
+ 1. Your data is highly dynamic in nature and would not fit in a schema bound model, or you cannot define a schema ahead
22
+ of time.
23
+ 2. You know that your database will grow to tens or hundreds of gigabytes, and can't afford big iron servers. Instead,
24
+ you'd like to scale horizontally across many commodity servers.
20
25
 
21
- == Requirements
26
+ == Components
22
27
 
23
- * Big Record: Ruby Object/Data Mapper. Inspired and architected similarly to Active Record.
24
- * Big Record Driver: JRuby application that bridges Ruby and Java (through JRuby's Drb protocol) to interact with Java-based data stores and their native APIs. Required for HBase and Cassandra. This application can be run from a separate server than your Rails application.
25
- * JRuby 1.1.6+ is needed to run Big Record Driver.
26
- * Any other requirements needed to run Hadoop, HBase or your data store of choice.
28
+ * Bigrecord: Ruby Object/Data Mapper. Inspired and architected similarly to Active Record.
27
29
 
28
- == Optional Requirements
30
+ == Optional Component
29
31
 
30
- * Big Index (highly recommended): Due to the nature of Big Table data stores, some limitations occur while using Big Record standalone when compared to Active Record. Some major limitations include the inability to query for data other than with the row ID, indexing, searching, and dynamic finders (find_by_attribute_name). Since these data access patterns are vital for most Rails applications to function, Big Index was created to address these issues, and bring the feature set more up to par with Active Record. Please refer to the <tt>Big Index</tt> package for more information and its requirements.
32
+ * Bigrecord Driver: Consists of a JRuby server component that bridges Ruby and Java (through the DRb protocol) to
33
+ interact with Java-based data stores and their native APIs. Clients that connect to the DRb server can be of any Ruby
34
+ type (JRuby, MRI, etc). Currently, this is used only for HBase to serve as a connection alternative to Thrift or
35
+ Stargate. This application can be run on a separate server from your Rails application.
36
+
37
+ * Bigindex [http://github.com/openplaces/bigindex]: Due to the nature of BigTable databases, some limitations are
38
+ present while using Bigrecord standalone when compared to Active Record. Some major limitations include the inability
39
+ to query for data by anything other than the row ID, and the lack of indexing, searching, and dynamic finders (find_by_attribute_name). Since
40
+ these data access patterns are vital for most Rails applications to function, Bigindex was created to address these
41
+ issues, and bring the feature set more up to par with Active Record. Please refer to the <tt>Bigindex</tt> package for
42
+ more information and its requirements.
31
43
 
32
44
  == Getting Started
33
45
 
data/VERSION CHANGED
@@ -1 +1 @@
1
- 0.1.0
1
+ 0.1.1
@@ -2,11 +2,15 @@
2
2
 
3
3
  == Data store information
4
4
 
5
- The default settings for the Bigrecord specs can be found at spec/connections/bigrecord.yml with each environment broken down by the data store type (Hbase and Cassandra at the time of writing). These are the minimal settings required to connect to each data store, and should be modified freely to reflect your own system configurations.
5
+ The default settings for the Bigrecord specs can be found at spec/connections/bigrecord.yml with each environment
6
+ broken down by the data store type (Hbase and Cassandra at the time of writing). These are the minimal settings
7
+ required to connect to each data store, and should be modified freely to reflect your own system configurations.
6
8
 
7
9
  == Data store migration
8
10
 
9
- There are also migrations to create the necessary tables for the specs to run. To ensure migrations are functioning properly before actually running the migrations, you can run: spec spec/unit/migration_spec.rb. Alternatively, you can manually create the tables according to the migration files under: spec/lib/migrations
11
+ There are also migrations to create the necessary tables for the specs to run. To ensure migrations are functioning
12
+ properly before actually running the migrations, you can run: spec spec/unit/migration_spec.rb. Alternatively, you
13
+ can manually create the tables according to the migration files under: spec/lib/migrations
10
14
 
11
15
  Migrations have their own log file for debugging purposes. It's created under: bigrecord/migrate.log
12
16
 
@@ -31,6 +35,8 @@ To run a specific spec, you can run the following command from the bigrecord roo
31
35
 
32
36
  == Debugging
33
37
 
34
- If any problems or failures arise during the unit tests, please refer to the log files before submitting it as an issue. Often, it's a simple matter of forgetting to turn on BigrecordDriver, the tables weren't created, or configurations weren't set properly.
38
+ If any problems or failures arise during the unit tests, please refer to the log files before submitting an
39
+ issue. Often, the cause is simple: BigrecordDriver wasn't started, the tables weren't created, or the
40
+ configurations weren't set properly.
35
41
 
36
42
  The log file for specs is created under: <bigrecord root>/spec/debug.log
@@ -0,0 +1,65 @@
1
+ == Setting up Cassandra
2
+
3
+ To quickly get started with development, you can set up Cassandra to run as a single node cluster on your local system.
4
+
5
+ (1) Download and unpack the most recent release of Cassandra from http://cassandra.apache.org/download/
6
+
7
+ (2) Add a <Keyspace></Keyspace> entry into your (cassandra-dir)/conf/storage-conf.xml configuration file named after
8
+ your application, and create <ColumnFamily> entries corresponding to each model you wish to add. The following is an
9
+ example of the Bigrecord keyspace that the spec suite runs against:
10
+
11
+ <Keyspace Name="Bigrecord">
12
+ <ColumnFamily Name="animals" CompareWith="UTF8Type" />
13
+ <ColumnFamily Name="books" CompareWith="UTF8Type" />
14
+ <ColumnFamily Name="companies" CompareWith="UTF8Type" />
15
+ <ColumnFamily Name="employees" CompareWith="UTF8Type" />
16
+ <ColumnFamily Name="novels" CompareWith="UTF8Type" />
17
+ <ColumnFamily Name="zoos" CompareWith="UTF8Type" />
18
+
19
+ <ReplicaPlacementStrategy>org.apache.cassandra.locator.RackUnawareStrategy</ReplicaPlacementStrategy>
20
+
21
+ <ReplicationFactor>1</ReplicationFactor>
22
+
23
+ <EndPointSnitch>org.apache.cassandra.locator.EndPointSnitch</EndPointSnitch>
24
+ </Keyspace>
25
+
26
+ You can also see {file:guides/storage-conf.rdoc guides/storage-conf.rdoc} for an example of a full configuration. More
27
+ documentation on setting up Cassandra can be found at http://wiki.apache.org/cassandra/GettingStarted
28
+
29
+ (3) Install the Cassandra Rubygem:
30
+
31
+ $ [sudo] gem install cassandra
32
+
33
+ (4) Start up Cassandra:
34
+ $ (cassandra-dir)/bin/cassandra -f
35
+
36
+
37
+ == Setting up Bigrecord
38
+
39
+ (1) Add the following line into the Rails::Initializer.run do |config| block:
40
+
41
+ config.gem "bigrecord", :source => "http://gemcutter.org"
42
+
43
+ and run the following command to install all the gems listed for your Rails app:
44
+
45
+ [sudo] rake gems:install
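
For context, on Rails 2.x the config.gem declaration from step (1) sits inside the existing Rails::Initializer.run block in config/environment.rb. A minimal sketch of the placement (only the config.gem line itself comes from this guide):

  # config/environment.rb (Rails 2.x)
  Rails::Initializer.run do |config|
    # ... other application settings ...
    config.gem "bigrecord", :source => "http://gemcutter.org"
  end
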
46
+
47
+ (2) Bootstrap Bigrecord into your project:
48
+
49
+ script/generate bigrecord
50
+
51
+ (3) Edit the config/bigrecord.yml[.sample] file in your Rails root with the information corresponding to your Cassandra
52
+ install (keyspace should correspond to the one you defined in step 2 of "Setting up Cassandra" above):
53
+
54
+ development:
55
+ adapter: cassandra
56
+ keyspace: Bigrecord
57
+ servers: localhost:9160
58
+ production:
59
+ adapter: cassandra
60
+ keyspace: Bigrecord
61
+ servers:
62
+ - server1:9160
63
+ - server2:9160
64
+
65
+ Note: 9160 is the default port for Cassandra's Thrift server.
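
As an optional sanity check that the single-node cluster is reachable, the cassandra gem installed in step (3) can be used directly. A sketch, assuming the Bigrecord keyspace and the books column family from the example above:

  require 'rubygems'
  require 'cassandra'

  # Connect to the local Thrift server (default port 9160) using the Bigrecord keyspace.
  client = Cassandra.new('Bigrecord', 'localhost:9160')

  # Simple round trip against one of the column families defined in storage-conf.xml.
  client.insert(:books, 'test-row', { 'attribute:title' => 'Hello Thar' })
  client.get(:books, 'test-row')    # => { "attribute:title" => "Hello Thar" }
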
@@ -1,15 +1,22 @@
1
- = Deploying Big Record
1
+ = Deploying Big Record with HBase
2
2
 
3
- Stargate is a new implementation for HBase's web service front-end, and as such, is not currently recommended for deployment.
3
+ Stargate is a new implementation of HBase's web service front-end and, as such, is not currently recommended for
4
+ deployment.
4
5
 
5
- We here at Openplaces have developed Bigrecord Driver, which uses JRuby to interact with HBase via the native Java API and connect to Bigrecord through the DRb protocol. This method is slightly more complicated to setup, but preliminary benchmarks show that it runs faster (especially for scanner functionality).
6
+ We here at Openplaces have developed Bigrecord Driver, which uses JRuby to interact with HBase via the native
7
+ Java API and connect to Bigrecord through the DRb protocol. This method is slightly more complicated to set up,
8
+ but preliminary benchmarks show that it runs faster (especially for scanner functionality).
6
9
 
7
10
  == Instructions
8
- * Your database should already be set up (please refer to the database's own documentation) with the required information known such as the zookeeper quorum/port, etc. in order for Bigrecord to connect to it.
11
+ * Your database should already be set up (please refer to the database's own documentation) with the required
12
+ connection information (such as the zookeeper quorum/port) known so that Bigrecord can connect to it.
13
+
9
14
  * Bigrecord Driver (if your database requires it for connecting)
10
15
  * JRuby 1.1.6+ is needed to run Bigrecord Driver.
11
16
 
12
- Install the Bigrecord Driver gem and its dependencies, then start up a DRb server. Please refer the Bigrecord Driver documentation for more detailed instructions. (http://github.com/openplaces/bigrecord/blob/master/bigrecord-driver/README.rdoc)
17
+ Install the Bigrecord Driver gem and its dependencies, then start up a DRb server. Please refer to the Bigrecord Driver
18
+ documentation for more detailed instructions.
19
+ (http://github.com/openplaces/bigrecord/blob/master/bigrecord-driver/README.rdoc)
13
20
 
14
21
  Edit your bigrecord.yml config file as follows:
15
22
 
@@ -1,50 +1,10 @@
1
1
  = Getting Started
2
2
 
3
+ == Install HBase or Cassandra
3
4
 
4
- == Setting up HBase and Stargate
5
-
6
- To quickly get started with development, you can set up HBase to run as a single server on your local computer, along with Stargate, its RESTful web service front-end.
7
-
8
- (1) Download and unpack the most recent release of HBase from http://hadoop.apache.org/hbase/releases.html#Download
9
-
10
- (2) Edit (hbase-dir)/conf/hbase-env.sh and uncomment/modify the following line to correspond to your Java home path:
11
- export JAVA_HOME=/usr/lib/jvm/java-6-sun
12
-
13
- (3) Copy (hbase-dir)/contrib/stargate/hbase-<version>-stargate.jar into <hbase-dir>/lib
14
-
15
- (4) Copy all the files in the (hbase-dir)/contrib/stargate/lib folder into <hbase-dir>/lib
16
-
17
- (5) Start up HBase:
18
- $ (hbase-dir)/bin/start-hbase.sh
19
-
20
- (6)Start up Stargate (append "-p 1234" at the end if you want to change the port):
21
- $ (hbase-dir)/bin/hbase org.apache.hadoop.hbase.stargate.Main
22
-
23
-
24
- == Setting up Bigrecord
25
-
26
- (1) Install the Bigrecord Driver gem and its dependencies, then start up a DRb server. Please see the Bigrecord Driver documentation for more detailed instructions. (http://github.com/openplaces/bigrecord/blob/master/bigrecord-driver/README.rdoc)
27
-
28
- (2) Add the following line into the Rails::Initializer.run do |config| block:
29
-
30
- config.gem "bigrecord", :source => "http://gemcutter.org"
31
-
32
- and run the following command to install all the gems listed for your Rails app:
33
-
34
- [sudo] rake gems:install
35
-
36
- (3) Bootstrap Bigrecord into your project:
37
-
38
- script/generate bigrecord
39
-
40
- (4) Edit the config/bigrecord.yml[.sample] file in your Rails root to the information corresponding to the Stargate server.
41
-
42
- development:
43
- adapter: hbase_rest
44
- api_address: http://localhost:8080
45
-
46
- Note: 8080 is the default port that Stargate starts up on. Make sure you modify this if you changed the port from the default.
5
+ * HBase: {file:guides/hbase_install.rdoc guides/hbase_install.rdoc}
47
6
 
7
+ * Cassandra: {file:guides/cassandra_install.rdoc guides/cassandra_install.rdoc}
48
8
 
49
9
  == Usage
50
10
 
@@ -54,7 +14,8 @@ Once Bigrecord is working in your Rails project, you can use the following gener
54
14
 
55
15
  script/generate bigrecord_model ModelName
56
16
 
57
- This will add a model in app/models and a migration file in db/bigrecord_migrate. Note: This generator does not accept attributes.
17
+ This will add a model in app/models and a migration file in db/bigrecord_migrate. Note: This generator does not
18
+ accept attributes.
58
19
 
59
20
  script/generate bigrecord_migration MigrationName
60
21
 
@@ -62,11 +23,19 @@ Creates a Bigrecord specific migration and adds it into db/bigrecord_migrate
62
23
 
63
24
  === {BigRecord::Migration Migration File}
64
25
 
65
- Although column-oriented databases are generally schema-less, certain ones (like Hbase) require the creation of tables and column families ahead of time. The individual columns, however, are defined in the model itself and can be modified dynamically without the need for migrations.
26
+ Note: Cassandra cannot change the ColumnFamily schema while it is running; the schema can only be edited in the
27
+ storage-conf.xml configuration while the cluster is down. Future versions of Cassandra plan to support this.
28
+
29
+ Although column-oriented databases are generally schema-less, certain ones (like Hbase) require the creation of
30
+ tables and column families ahead of time. The individual columns, however, are defined in the model itself and can
31
+ be modified dynamically without the need for migrations.
66
32
 
67
- Unless you're familiar with column families, the majority of use cases work perfectly fine within one column family. When you generate a bigrecord_model, it will default to creating the :attribute column family.
33
+ Unless you're familiar with column families, the majority of use cases work perfectly fine within one column family.
34
+ When you generate a bigrecord_model, it will default to creating the :attribute column family.
68
35
 
69
- The following is a standard migration file that creates a table called "Books" with the default column family :attribute that has the following option of 100 versions and uses the 'lzo' compression scheme. Leave any options blank for the default value.
36
+ The following is a standard migration file that creates a table called "Books" with the default column family
37
+ :attribute that has the following option of 100 versions and uses the 'lzo' compression scheme. Leave any options
38
+ blank for the default value.
70
39
 
71
40
  class CreateBooks < BigRecord::Migration
72
41
  def self.up
@@ -80,12 +49,15 @@ The following is a standard migration file that creates a table called "Books" w
80
49
  end
81
50
  end
82
51
 
83
- === HBase column family options (HBase specific)
52
+ ==== HBase column family options (HBase specific)
84
53
 
85
- * versions: integer. By default, Hbase will store 3 versions of changes for any column family. Changing this value on the creation will change this behavior.
86
- * compression: 'none', 'gz', 'lzo'. Defaults to 'none'. Since Hbase 0.20, column families can be stored using compression. The compression scheme you define here must be installed on the Hbase servers!
54
+ * versions: integer. By default, Hbase will store 3 versions of changes for any column family. Changing this value
55
+ at creation time will change this behavior.
87
56
 
88
- === Migrating
57
+ * compression: 'none', 'gz', 'lzo'. Defaults to 'none'. Since Hbase 0.20, column families can be stored using
58
+ compression. The compression scheme you define here must be installed on the Hbase servers!
59
+
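
Putting the two options together, the CreateBooks migration sketched earlier would declare its column family with both settings. This is a hypothetical sketch: the family declaration helper and its exact option names are assumed here, since the body of the example migration is not shown in this excerpt.

  class CreateBooks < BigRecord::Migration
    def self.up
      create_table :books, :force => true do |t|
        # Assumed helper: default column family keeping 100 versions, stored with lzo compression.
        t.family :attribute, :versions => 100, :compression => 'lzo'
      end
    end

    def self.down
      drop_table :books
    end
  end
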
60
+ ==== Migrating
89
61
 
90
62
  Run the following rake task to migrate your tables and column families up to the latest version:
91
63
 
@@ -93,7 +65,8 @@ Run the following rake task to migrate your tables and column families up to the
93
65
 
94
66
  === {BigRecord::ConnectionAdapters::Column Column and Attribute Definition}
95
67
 
96
- Now that you have your tables and column families all set up, you can begin adding columns to your model. The following is an example of a model named book.rb
68
+ Now that you have your tables and column families all set up, you can begin adding columns to your model. The
69
+ following is an example of a model named book.rb
97
70
 
98
71
  class Book < BigRecord::Base
99
72
  column 'attribute:title', :string
@@ -102,11 +75,16 @@ Now that you have your tables and column families all set up, you can begin addi
102
75
  column :links, :string, :collection => true
103
76
  end
104
77
 
105
- This simple model defines 4 columns of type string. An important thing to notice here is that the first column 'attribute:title' had the column family prepended to it. This is identical to just passing the symbol :title to the column method, and the default behaviour is to prepend the column family (attribute) automatically if one is not defined. Furthermore, in Hbase, there's the option of storing collections for a given column. This will return an array for the links attribute on a Book record.
78
+ This simple model defines 4 columns of type string. An important thing to notice here is that the first column
79
+ 'attribute:title' has the column family prepended to it. This is identical to just passing the symbol :title to
80
+ the column method, and the default behaviour is to prepend the column family (attribute) automatically if one is
81
+ not defined. Furthermore, in Hbase, there's the option of storing collections for a given column. This will return
82
+ an array for the links attribute on a Book record.
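
Given the Active Record-style interface described here, working with such a record might look like the following sketch (attribute accessors and save are assumed to mirror Active Record behaviour):

  book = Book.new
  book.title  = 'Hello Thar'
  book.author = 'Greg'
  book.links  = ['link1', 'link2']   # a collection column holds an array
  book.save

  book.links                         # => ["link1", "link2"]
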
106
83
 
107
84
  === {BigRecord::BrAssociations Associations}
108
85
 
109
- There are also associations available in Bigrecord, as well as the ability to associate to Activerecord models. The following are a few models demonstrating this:
86
+ There are also associations available in Bigrecord, as well as the ability to associate with Activerecord models. The
87
+ following are a few models demonstrating this:
110
88
 
111
89
  animal.rb
112
90
  class Animal < BigRecord::Base
@@ -124,13 +102,18 @@ animal.rb
124
102
  belongs_to :trainer, :foreign_key => :trainer_id
125
103
  end
126
104
 
127
- In this example, an Animal is related to Zoo and Trainer. Both Animal and Zoo are Bigrecord models, and Trainer is an Activerecord model. Notice here that we need to define both the association field for storing the information and the association itself. It's also important to remember that Bigrecord models have their IDs stored as string, and Activerecord models use integers.
105
+ In this example, an Animal is related to Zoo and Trainer. Both Animal and Zoo are Bigrecord models, and Trainer is
106
+ an Activerecord model. Notice here that we need to define both the association field for storing the information and
107
+ the association itself. It's also important to remember that Bigrecord models have their IDs stored as strings, while
108
+ Activerecord models use integers.
128
109
 
129
- Once the association columns are defined, you define the associations themselves with either belongs_to_bigrecord or belongs_to_many and defining the :foreign_key (this is required for all associations).
110
+ Once the association columns are defined, you define the associations themselves with either belongs_to_bigrecord or
111
+ belongs_to_many, specifying the :foreign_key option (this is required for all associations).
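
A condensed sketch of that pattern, combining the column declarations with the associations (the belongs_to_bigrecord line is assumed from the description above rather than copied from the excerpt):

  class Animal < BigRecord::Base
    column :name,       :string
    column :zoo_id,     :string    # association field for a Bigrecord model (string ID)
    column :trainer_id, :integer   # association field for an Activerecord model (integer ID)

    belongs_to_bigrecord :zoo,     :foreign_key => :zoo_id
    belongs_to           :trainer, :foreign_key => :trainer_id
  end
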
130
112
 
131
113
  === {BigRecord::ConnectionAdapters::View Specifying return columns}
132
114
 
133
- There are two ways to define specific columns to be returned with your models: 1) at the model level and 2) during the query.
115
+ There are two ways to define specific columns to be returned with your models: 1) at the model level and 2) during
116
+ the query.
134
117
 
135
118
  (1) At the model level, a collection of columns are called named views, and are defined like the following:
136
119
 
@@ -147,7 +130,8 @@ There are two ways to define specific columns to be returned with your models: 1
147
130
  view :default, :title, :author, :description
148
131
  end
149
132
 
150
- Now, whenever you work with a Book record, it will only returned the columns you specify according to the view option you pass. i.e.
133
+ Now, whenever you work with a Book record, it will only return the columns you specify according to the :view option
134
+ you pass. i.e.
151
135
 
152
136
  >> Book.find(:first, :view => :front_page)
153
137
  => #<Book id: "2e13f182-1085-495e-9841-fe5c84ae9992", attribute:title: "Hello Thar", attribute:author: "Greg">
@@ -158,10 +142,11 @@ Now, whenever you work with a Book record, it will only returned the columns you
158
142
  >> Book.find(:first, :view => :default)
159
143
  => #<Book id: "2e13f182-1085-495e-9841-fe5c84ae9992", attribute:description: "Masterpiece!", attribute:title: "Hello Thar", attribute:links: ["link1", "link2", "link3", "link4"], attribute:author: "Greg">
160
144
 
161
- Note: A Bigrecord model will return all the columns within the default column family (when :view option is left blank, for example). You can override the :default name view to change this behaviour.
162
-
145
+ Note: A Bigrecord model will return all the columns within the default column family (when the :view option is left
146
+ blank, for example). You can override the :default named view to change this behaviour.
163
147
 
164
- (2) If you don't want to define named views ahead of time, you can just pass an array of columns to the :columns option and it will return only those attributes:
148
+ (2) If you don't want to define named views ahead of time, you can just pass an array of columns to the :columns
149
+ option and it will return only those attributes:
165
150
 
166
151
  >> Book.find(:first, :columns => [:author, :description])
167
152
  => #<Book id: "2e13f182-1085-495e-9841-fe5c84ae9992", attribute:description: "Masterpiece!", attribute:author: "Greg">
@@ -170,4 +155,5 @@ As you may have noticed, this functionality is synonymous with the :select optio
170
155
 
171
156
  === {BigRecord::Embedded Embedded Records}
172
157
 
173
- === At this point, usage patterns for a Bigrecord model are similar to that of an Activerecord model, and much of that documentation applies as well. Please refer to those and see if they work!
158
+ === At this point, usage patterns for a Bigrecord model are similar to those of an Activerecord model, and much of that
159
+ documentation applies as well. Please refer to those and see if they work!
@@ -0,0 +1,48 @@
1
+ == Setting up HBase and Stargate
2
+
3
+ To quickly get started with development, you can set up HBase to run as a single server on your local computer,
4
+ along with Stargate, its RESTful web service front-end.
5
+
6
+ (1) Download and unpack the most recent release of HBase from http://hadoop.apache.org/hbase/releases.html#Download
7
+
8
+ (2) Edit (hbase-dir)/conf/hbase-env.sh and uncomment/modify the following line to correspond to your Java home path:
9
+ export JAVA_HOME=/usr/lib/jvm/java-6-sun
10
+
11
+ (3) Copy (hbase-dir)/contrib/stargate/hbase-<version>-stargate.jar into <hbase-dir>/lib
12
+
13
+ (4) Copy all the files in the (hbase-dir)/contrib/stargate/lib folder into <hbase-dir>/lib
14
+
15
+ (5) Start up HBase:
16
+ $ (hbase-dir)/bin/start-hbase.sh
17
+
18
+ (6) Start up Stargate (append "-p 1234" at the end if you want to change the port):
19
+ $ (hbase-dir)/bin/hbase org.apache.hadoop.hbase.stargate.Main
20
+
21
+
22
+ == Setting up Bigrecord
23
+
24
+ (1) Install the Bigrecord Driver gem and its dependencies, then start up a DRb server. Please see the Bigrecord Driver
25
+ documentation for more detailed instructions.
26
+ (http://github.com/openplaces/bigrecord/blob/master/bigrecord-driver/README.rdoc)
27
+
28
+ (2) Add the following line into the Rails::Initializer.run do |config| block:
29
+
30
+ config.gem "bigrecord", :source => "http://gemcutter.org"
31
+
32
+ and run the following command to install all the gems listed for your Rails app:
33
+
34
+ [sudo] rake gems:install
35
+
36
+ (3) Bootstrap Bigrecord into your project:
37
+
38
+ script/generate bigrecord
39
+
40
+ (4) Edit the config/bigrecord.yml[.sample] file in your Rails root with the information corresponding to the Stargate
41
+ server.
42
+
43
+ development:
44
+ adapter: hbase_rest
45
+ api_address: http://localhost:8080
46
+
47
+ Note: 8080 is the default port that Stargate starts up on. Make sure you modify this if you changed the port from
48
+ the default.
@@ -0,0 +1,310 @@
1
+ Example storage-conf.xml file:
2
+
3
+ <!--
4
+ ~ Licensed to the Apache Software Foundation (ASF) under one
5
+ ~ or more contributor license agreements. See the NOTICE file
6
+ ~ distributed with this work for additional information
7
+ ~ regarding copyright ownership. The ASF licenses this file
8
+ ~ to you under the Apache License, Version 2.0 (the
9
+ ~ "License"); you may not use this file except in compliance
10
+ ~ with the License. You may obtain a copy of the License at
11
+ ~
12
+ ~ http://www.apache.org/licenses/LICENSE-2.0
13
+ ~
14
+ ~ Unless required by applicable law or agreed to in writing,
15
+ ~ software distributed under the License is distributed on an
16
+ ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
17
+ ~ KIND, either express or implied. See the License for the
18
+ ~ specific language governing permissions and limitations
19
+ ~ under the License.
20
+ -->
21
+ <Storage>
22
+ <!--======================================================================-->
23
+ <!-- Basic Configuration -->
24
+ <!--======================================================================-->
25
+
26
+ <!--
27
+ ~ The name of this cluster. This is mainly used to prevent machines in
28
+ ~ one logical cluster from joining another.
29
+ -->
30
+ <ClusterName>Local Testing</ClusterName>
31
+
32
+ <!--
33
+ ~ Turn on to make new [non-seed] nodes automatically migrate the right data
34
+ ~ to themselves. (If no InitialToken is specified, they will pick one
35
+ ~ such that they will get half the range of the most-loaded node.)
36
+ ~ If a node starts up without bootstrapping, it will mark itself bootstrapped
37
+ ~ so that you can't subsequently accidently bootstrap a node with
38
+ ~ data on it. (You can reset this by wiping your data and commitlog
39
+ ~ directories.)
40
+ ~
41
+ ~ Off by default so that new clusters and upgraders from 0.4 don't
42
+ ~ bootstrap immediately. You should turn this on when you start adding
43
+ ~ new nodes to a cluster that already has data on it. (If you are upgrading
44
+ ~ from 0.4, start your cluster with it off once before changing it to true.
45
+ ~ Otherwise, no data will be lost but you will incur a lot of unnecessary
46
+ ~ I/O before your cluster starts up.)
47
+ -->
48
+ <AutoBootstrap>false</AutoBootstrap>
49
+
50
+ <!--
51
+ ~ Keyspaces and ColumnFamilies:
52
+ ~ A ColumnFamily is the Cassandra concept closest to a relational
53
+ ~ table. Keyspaces are separate groups of ColumnFamilies. Except in
54
+ ~ very unusual circumstances you will have one Keyspace per application.
55
+
56
+ ~ There is an implicit keyspace named 'system' for Cassandra internals.
57
+ -->
58
+ <Keyspaces>
59
+ <Keyspace Name="Bigrecord">
60
+ <ColumnFamily Name="animals" CompareWith="UTF8Type" />
61
+ <ColumnFamily Name="books" CompareWith="UTF8Type" />
62
+ <ColumnFamily Name="companies" CompareWith="UTF8Type" />
63
+ <ColumnFamily Name="employees" CompareWith="UTF8Type" />
64
+ <ColumnFamily Name="novels" CompareWith="UTF8Type" />
65
+ <ColumnFamily Name="zoos" CompareWith="UTF8Type" />
66
+
67
+ <ReplicaPlacementStrategy>org.apache.cassandra.locator.RackUnawareStrategy</ReplicaPlacementStrategy>
68
+
69
+ <ReplicationFactor>1</ReplicationFactor>
70
+
71
+ <EndPointSnitch>org.apache.cassandra.locator.EndPointSnitch</EndPointSnitch>
72
+ </Keyspace>
73
+ </Keyspaces>
74
+
75
+ <!--
76
+ ~ Authenticator: any IAuthenticator may be used, including your own as long
77
+ ~ as it is on the classpath. Out of the box, Cassandra provides
78
+ ~ org.apache.cassandra.auth.AllowAllAuthenticator and,
79
+ ~ org.apache.cassandra.auth.SimpleAuthenticator
80
+ ~ (SimpleAuthenticator uses access.properties and passwd.properties by
81
+ ~ default).
82
+ ~
83
+ ~ If you don't specify an authenticator, AllowAllAuthenticator is used.
84
+ -->
85
+ <Authenticator>org.apache.cassandra.auth.AllowAllAuthenticator</Authenticator>
86
+
87
+ <!--
88
+ ~ Partitioner: any IPartitioner may be used, including your own as long
89
+ ~ as it is on the classpath. Out of the box, Cassandra provides
90
+ ~ org.apache.cassandra.dht.RandomPartitioner,
91
+ ~ org.apache.cassandra.dht.OrderPreservingPartitioner, and
92
+ ~ org.apache.cassandra.dht.CollatingOrderPreservingPartitioner.
93
+ ~ (CollatingOPP colates according to EN,US rules, not naive byte
94
+ ~ ordering. Use this as an example if you need locale-aware collation.)
95
+ ~ Range queries require using an order-preserving partitioner.
96
+ ~
97
+ ~ Achtung! Changing this parameter requires wiping your data
98
+ ~ directories, since the partitioner can modify the sstable on-disk
99
+ ~ format.
100
+ -->
101
+ <Partitioner>org.apache.cassandra.dht.RandomPartitioner</Partitioner>
102
+
103
+ <!--
104
+ ~ If you are using an order-preserving partitioner and you know your key
105
+ ~ distribution, you can specify the token for this node to use. (Keys
106
+ ~ are sent to the node with the "closest" token, so distributing your
107
+ ~ tokens equally along the key distribution space will spread keys
108
+ ~ evenly across your cluster.) This setting is only checked the first
109
+ ~ time a node is started.
110
+
111
+ ~ This can also be useful with RandomPartitioner to force equal spacing
112
+ ~ of tokens around the hash space, especially for clusters with a small
113
+ ~ number of nodes.
114
+ -->
115
+ <InitialToken></InitialToken>
116
+
117
+ <!--
118
+ ~ Directories: Specify where Cassandra should store different data on
119
+ ~ disk. Keep the data disks and the CommitLog disks separate for best
120
+ ~ performance
121
+ -->
122
+ <CommitLogDirectory>/var/lib/cassandra/commitlog</CommitLogDirectory>
123
+ <DataFileDirectories>
124
+ <DataFileDirectory>/var/lib/cassandra/data</DataFileDirectory>
125
+ </DataFileDirectories>
126
+
127
+
128
+ <!--
129
+ ~ Addresses of hosts that are deemed contact points. Cassandra nodes
130
+ ~ use this list of hosts to find each other and learn the topology of
131
+ ~ the ring. You must change this if you are running multiple nodes!
132
+ -->
133
+ <Seeds>
134
+ <Seed>127.0.0.1</Seed>
135
+ </Seeds>
136
+
137
+
138
+ <!-- Miscellaneous -->
139
+
140
+ <!-- Time to wait for a reply from other nodes before failing the command -->
141
+ <RpcTimeoutInMillis>10000</RpcTimeoutInMillis>
142
+ <!-- Size to allow commitlog to grow to before creating a new segment -->
143
+ <CommitLogRotationThresholdInMB>128</CommitLogRotationThresholdInMB>
144
+
145
+
146
+ <!-- Local hosts and ports -->
147
+
148
+ <!--
149
+ ~ Address to bind to and tell other nodes to connect to. You _must_
150
+ ~ change this if you want multiple nodes to be able to communicate!
151
+ ~
152
+ ~ Leaving it blank leaves it up to InetAddress.getLocalHost(). This
153
+ ~ will always do the Right Thing *if* the node is properly configured
154
+ ~ (hostname, name resolution, etc), and the Right Thing is to use the
155
+ ~ address associated with the hostname (it might not be).
156
+ -->
157
+ <ListenAddress>localhost</ListenAddress>
158
+ <!-- internal communications port -->
159
+ <StoragePort>7000</StoragePort>
160
+
161
+ <!--
162
+ ~ The address to bind the Thrift RPC service to. Unlike ListenAddress
163
+ ~ above, you *can* specify 0.0.0.0 here if you want Thrift to listen on
164
+ ~ all interfaces.
165
+ ~
166
+ ~ Leaving this blank has the same effect it does for ListenAddress,
167
+ ~ (i.e. it will be based on the configured hostname of the node).
168
+ -->
169
+ <ThriftAddress>localhost</ThriftAddress>
170
+ <!-- Thrift RPC port (the port clients connect to). -->
171
+ <ThriftPort>9160</ThriftPort>
172
+ <!--
173
+ ~ Whether or not to use a framed transport for Thrift. If this option
174
+ ~ is set to true then you must also use a framed transport on the
175
+ ~ client-side, (framed and non-framed transports are not compatible).
176
+ -->
177
+ <ThriftFramedTransport>false</ThriftFramedTransport>
178
+
179
+
180
+ <!--======================================================================-->
181
+ <!-- Memory, Disk, and Performance -->
182
+ <!--======================================================================-->
183
+
184
+ <!--
185
+ ~ Access mode. mmapped i/o is substantially faster, but only practical on
186
+ ~ a 64bit machine (which notably does not include EC2 "small" instances)
187
+ ~ or relatively small datasets. "auto", the safe choice, will enable
188
+ ~ mmapping on a 64bit JVM. Other values are "mmap", "mmap_index_only"
189
+ ~ (which may allow you to get part of the benefits of mmap on a 32bit
190
+ ~ machine by mmapping only index files) and "standard".
191
+ ~ (The buffer size settings that follow only apply to standard,
192
+ ~ non-mmapped i/o.)
193
+ -->
194
+ <DiskAccessMode>auto</DiskAccessMode>
195
+
196
+ <!--
197
+ ~ Size of compacted row above which to log a warning. (If compacted
198
+ ~ rows do not fit in memory, Cassandra will crash. This is explained
199
+ ~ in http://wiki.apache.org/cassandra/CassandraLimitations and is
200
+ ~ scheduled to be fixed in 0.7.)
201
+ -->
202
+ <RowWarningThresholdInMB>512</RowWarningThresholdInMB>
203
+
204
+ <!--
205
+ ~ Buffer size to use when performing contiguous column slices. Increase
206
+ ~ this to the size of the column slices you typically perform.
207
+ ~ (Name-based queries are performed with a buffer size of
208
+ ~ ColumnIndexSizeInKB.)
209
+ -->
210
+ <SlicedBufferSizeInKB>64</SlicedBufferSizeInKB>
211
+
212
+ <!--
213
+ ~ Buffer size to use when flushing memtables to disk. (Only one
214
+ ~ memtable is ever flushed at a time.) Increase (decrease) the index
215
+ ~ buffer size relative to the data buffer if you have few (many)
216
+ ~ columns per key. Bigger is only better _if_ your memtables get large
217
+ ~ enough to use the space. (Check in your data directory after your
218
+ ~ app has been running long enough.) -->
219
+ <FlushDataBufferSizeInMB>32</FlushDataBufferSizeInMB>
220
+ <FlushIndexBufferSizeInMB>8</FlushIndexBufferSizeInMB>
221
+
222
+ <!--
223
+ ~ Add column indexes to a row after its contents reach this size.
224
+ ~ Increase if your column values are large, or if you have a very large
225
+ ~ number of columns. The competing causes are, Cassandra has to
226
+ ~ deserialize this much of the row to read a single column, so you want
227
+ ~ it to be small - at least if you do many partial-row reads - but all
228
+ ~ the index data is read for each access, so you don't want to generate
229
+ ~ that wastefully either.
230
+ -->
231
+ <ColumnIndexSizeInKB>64</ColumnIndexSizeInKB>
232
+
233
+ <!--
234
+ ~ Flush memtable after this much data has been inserted, including
235
+ ~ overwritten data. There is one memtable per column family, and
236
+ ~ this threshold is based solely on the amount of data stored, not
237
+ ~ actual heap memory usage (there is some overhead in indexing the
238
+ ~ columns).
239
+ -->
240
+ <MemtableThroughputInMB>64</MemtableThroughputInMB>
241
+ <!--
242
+ ~ Throughput setting for Binary Memtables. Typically these are
243
+ ~ used for bulk load so you want them to be larger.
244
+ -->
245
+ <BinaryMemtableThroughputInMB>256</BinaryMemtableThroughputInMB>
246
+ <!--
247
+ ~ The maximum number of columns in millions to store in memory per
248
+ ~ ColumnFamily before flushing to disk. This is also a per-memtable
249
+ ~ setting. Use with MemtableThroughputInMB to tune memory usage.
250
+ -->
251
+ <MemtableOperationsInMillions>0.3</MemtableOperationsInMillions>
252
+ <!--
253
+ ~ The maximum time to leave a dirty memtable unflushed.
254
+ ~ (While any affected columnfamilies have unflushed data from a
255
+ ~ commit log segment, that segment cannot be deleted.)
256
+ ~ This needs to be large enough that it won't cause a flush storm
257
+ ~ of all your memtables flushing at once because none has hit
258
+ ~ the size or count thresholds yet. For production, a larger
259
+ ~ value such as 1440 is recommended.
260
+ -->
261
+ <MemtableFlushAfterMinutes>60</MemtableFlushAfterMinutes>
262
+
263
+ <!--
264
+ ~ Unlike most systems, in Cassandra writes are faster than reads, so
265
+ ~ you can afford more of those in parallel. A good rule of thumb is 2
266
+ ~ concurrent reads per processor core. Increase ConcurrentWrites to
267
+ ~ the number of clients writing at once if you enable CommitLogSync +
268
+ ~ CommitLogSyncDelay. -->
269
+ <ConcurrentReads>8</ConcurrentReads>
270
+ <ConcurrentWrites>32</ConcurrentWrites>
271
+
272
+ <!--
273
+ ~ CommitLogSync may be either "periodic" or "batch." When in batch
274
+ ~ mode, Cassandra won't ack writes until the commit log has been
275
+ ~ fsynced to disk. It will wait up to CommitLogSyncBatchWindowInMS
276
+ ~ milliseconds for other writes, before performing the sync.
277
+
278
+ ~ This is less necessary in Cassandra than in traditional databases
279
+ ~ since replication reduces the odds of losing data from a failure
280
+ ~ after writing the log entry but before it actually reaches the disk.
281
+ ~ So the other option is "periodic," where writes may be acked immediately
282
+ ~ and the CommitLog is simply synced every CommitLogSyncPeriodInMS
283
+ ~ milliseconds.
284
+ -->
285
+ <CommitLogSync>periodic</CommitLogSync>
286
+ <!--
287
+ ~ Interval at which to perform syncs of the CommitLog in periodic mode.
288
+ ~ Usually the default of 10000ms is fine; increase it if your i/o
289
+ ~ load is such that syncs are taking excessively long times.
290
+ -->
291
+ <CommitLogSyncPeriodInMS>10000</CommitLogSyncPeriodInMS>
292
+ <!--
293
+ ~ Delay (in milliseconds) during which additional commit log entries
294
+ ~ may be written before fsync in batch mode. This will increase
295
+ ~ latency slightly, but can vastly improve throughput where there are
296
+ ~ many writers. Set to zero to disable (each entry will be synced
297
+ ~ individually). Reasonable values range from a minimal 0.1 to 10 or
298
+ ~ even more if throughput matters more than latency.
299
+ -->
300
+ <!-- <CommitLogSyncBatchWindowInMS>1</CommitLogSyncBatchWindowInMS> -->
301
+
302
+ <!--
303
+ ~ Time to wait before garbage-collection deletion markers. Set this to
304
+ ~ a large enough value that you are confident that the deletion marker
305
+ ~ will be propagated to all replicas by the time this many seconds has
306
+ ~ elapsed, even in the face of hardware failures. The default value is
307
+ ~ ten days.
308
+ -->
309
+ <GCGraceSeconds>10</GCGraceSeconds>
310
+ </Storage>
@@ -68,7 +68,7 @@ module BigRecord
68
68
  def update_raw(table_name, row, values, timestamp)
69
69
  result = nil
70
70
  log "UPDATE #{table_name} SET #{values.inspect if values} WHERE ROW=#{row};" do
71
- result = @connection.insert(table_name, row, data_to_cassandra_format(values), {:consistency => Cassandra::Consistency::QUORUM})
71
+ result = @connection.insert(table_name, row, values, {:consistency => Cassandra::Consistency::QUORUM})
72
72
  end
73
73
  result
74
74
  end
@@ -84,8 +84,7 @@ module BigRecord
84
84
  def get_raw(table_name, row, column, options={})
85
85
  result = nil
86
86
  log "SELECT (#{column}) FROM #{table_name} WHERE ROW=#{row};" do
87
- super_column, name = column.split(":")
88
- result = @connection.get(table_name, row, super_column, name)
87
+ result = @connection.get(table_name, row, column)
89
88
  end
90
89
  result
91
90
  end
@@ -103,33 +102,33 @@ module BigRecord
103
102
 
104
103
  def get_columns_raw(table_name, row, columns, options={})
105
104
  result = {}
106
-
105
+
107
106
  log "SELECT (#{columns.join(", ")}) FROM #{table_name} WHERE ROW=#{row};" do
108
- requested_columns = columns_to_cassandra_format(columns)
109
- super_columns = requested_columns.keys
107
+ prefix_mode = false
108
+ prefixes = []
110
109
 
111
- if super_columns.size == 1 && requested_columns[super_columns.first].size > 0
112
- column_names = requested_columns[super_columns.first]
110
+ columns.each do |name|
111
+ prefix, name = name.split(":")
112
+ prefixes << prefix+":" unless prefixes.include?(prefix+":")
113
+ prefix_mode = name.blank?
114
+ end
113
115
 
114
- values = @connection.get_columns(table_name, row, super_columns.first, column_names)
116
+ if prefix_mode
117
+ prefixes.sort!
118
+ values = @connection.get(table_name, row, {:start => prefixes.first, :finish => prefixes.last + "~"})
115
119
 
116
- result["id"] = row if values && values.compact.size > 0
117
- column_names.each_index do |id|
118
- full_key = super_columns.first + ":" + column_names[id].to_s
119
- result[full_key] = values[id] unless values[id].nil?
120
+ result["id"] = row if values && values.size > 0
121
+
122
+ values.each do |key,value|
123
+ result[key] = value unless value.blank?
120
124
  end
121
125
  else
122
- values = @connection.get_columns(table_name, row, super_columns)
126
+ values = @connection.get_columns(table_name, row, columns)
127
+
123
128
  result["id"] = row if values && values.compact.size > 0
124
- super_columns.each_index do |id|
125
- next if values[id].nil?
126
-
127
- values[id].each do |column_name, value|
128
- next if value.nil?
129
-
130
- full_key = super_columns[id] + ":" + column_name
131
- result[full_key] = value
132
- end
129
+
130
+ columns.each_index do |id|
131
+ result[columns[id].to_s] = values[id] unless values[id].blank?
133
132
  end
134
133
  end
135
134
  end
@@ -144,11 +143,11 @@ module BigRecord
144
143
  row_cols.each do |key,value|
145
144
  begin
146
145
  result[key] =
147
- if key == 'id'
148
- value
149
- else
150
- deserialize(value)
151
- end
146
+ if key == 'id'
147
+ value
148
+ else
149
+ deserialize(value)
150
+ end
152
151
  rescue Exception => e
153
152
  puts "Could not load column value #{key} for row=#{row.name}"
154
153
  end
@@ -160,9 +159,9 @@ module BigRecord
160
159
  result = []
161
160
  log "SCAN (#{columns.join(", ")}) FROM #{table_name} WHERE START_ROW=#{start_row} AND STOP_ROW=#{stop_row} LIMIT=#{limit};" do
162
161
  options = {}
163
- options[:start] = start_row if start_row
164
- options[:finish] = stop_row if stop_row
165
- options[:count] = limit if limit
162
+ options[:start] = start_row unless start_row.blank?
163
+ options[:finish] = stop_row unless stop_row.blank?
164
+ options[:count] = limit unless limit.blank?
166
165
 
167
166
  keys = @connection.get_range(table_name, options)
168
167
 
@@ -172,14 +171,9 @@ module BigRecord
172
171
  row = {}
173
172
  row["id"] = key.key
174
173
 
175
- key.columns.each do |s_col|
176
- super_column = s_col.super_column
177
- super_column_name = super_column.name
178
-
179
- super_column.columns.each do |column|
180
- full_key = super_column_name + ":" + column.name
181
- row[full_key] = column.value
182
- end
174
+ key.columns.each do |col|
175
+ column = col.column
176
+ row[column.name] = column.value
183
177
  end
184
178
 
185
179
  result << row if row.keys.size > 1
@@ -266,31 +260,6 @@ module BigRecord
266
260
 
267
261
  protected
268
262
 
269
- def data_to_cassandra_format(data = {})
270
- super_columns = {}
271
-
272
- data.each do |name, value|
273
- super_column, column = name.split(":")
274
- super_columns[super_column.to_s] = {} unless super_columns.has_key?(super_column.to_s)
275
- super_columns[super_column.to_s][column.to_s] = value
276
- end
277
-
278
- return super_columns
279
- end
280
-
281
- def columns_to_cassandra_format(column_names = [])
282
- super_columns = {}
283
-
284
- column_names.each do |name|
285
- super_column, sub_column = name.split(":")
286
-
287
- super_columns[super_column.to_s] = [] unless super_columns.has_key?(super_column.to_s)
288
- super_columns[super_column.to_s] << sub_column
289
- end
290
-
291
- return super_columns
292
- end
293
-
294
263
  def log(str, name = nil)
295
264
  if block_given?
296
265
  if @logger and @logger.level <= Logger::INFO
@@ -346,4 +315,4 @@ module BigRecord
346
315
  end
347
316
  end
348
317
  end
349
- end
318
+ end
@@ -1,7 +1,7 @@
1
- hbase_rest:
1
+ hbase:
2
2
  adapter: hbase_rest
3
3
  api_address: http://localhost:8080
4
- hbase:
4
+ hbase_brd:
5
5
  adapter: hbase
6
6
  zookeeper_quorum: localhost
7
7
  zookeeper_client_port: 2181
metadata CHANGED
@@ -5,8 +5,8 @@ version: !ruby/object:Gem::Version
5
5
  segments:
6
6
  - 0
7
7
  - 1
8
- - 0
9
- version: 0.1.0
8
+ - 1
9
+ version: 0.1.1
10
10
  platform: ruby
11
11
  authors:
12
12
  - openplaces.org
@@ -14,7 +14,7 @@ autorequire:
14
14
  bindir: bin
15
15
  cert_chain: []
16
16
 
17
- date: 2010-04-27 00:00:00 -04:00
17
+ date: 2010-05-05 00:00:00 -04:00
18
18
  default_executable:
19
19
  dependencies:
20
20
  - !ruby/object:Gem::Dependency
@@ -77,8 +77,11 @@ extra_rdoc_files:
77
77
  - LICENSE
78
78
  - README.rdoc
79
79
  - guides/bigrecord_specs.rdoc
80
+ - guides/cassandra_install.rdoc
80
81
  - guides/deployment.rdoc
81
82
  - guides/getting_started.rdoc
83
+ - guides/hbase_install.rdoc
84
+ - guides/storage-conf.rdoc
82
85
  files:
83
86
  - Rakefile
84
87
  - VERSION
@@ -92,8 +95,11 @@ files:
92
95
  - generators/bigrecord_model/templates/model.rb
93
96
  - generators/bigrecord_model/templates/model_spec.rb
94
97
  - guides/bigrecord_specs.rdoc
98
+ - guides/cassandra_install.rdoc
95
99
  - guides/deployment.rdoc
96
100
  - guides/getting_started.rdoc
101
+ - guides/hbase_install.rdoc
102
+ - guides/storage-conf.rdoc
97
103
  - init.rb
98
104
  - install.rb
99
105
  - lib/big_record.rb