RubyGems - twitter-text - Versions diffs - 1.14.7 → 2.0.0 - Mend

twitter-text 1.14.7 → 2.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (27)

checksums.yaml +5 -5
data/.rspec +1 -1
data/README.md +104 -33
data/lib/assets/tld_lib.yml +1 -0
data/lib/twitter-text.rb +2 -0
data/lib/twitter-text/autolink.rb +4 -4
data/lib/twitter-text/configuration.rb +53 -0
data/lib/twitter-text/deprecation.rb +1 -1
data/lib/twitter-text/extractor.rb +31 -1
data/lib/twitter-text/regex.rb +13 -13
data/lib/twitter-text/validation.rb +155 -43
data/lib/twitter-text/weighted_range.rb +18 -0
data/spec/autolinking_spec.rb +161 -161
data/spec/configuration_spec.rb +91 -0
data/spec/extractor_spec.rb +92 -72
data/spec/hithighlighter_spec.rb +15 -15
data/spec/regex_spec.rb +7 -7
data/spec/rewriter_spec.rb +110 -109
data/spec/spec_helper.rb +13 -15
data/spec/test_urls.rb +6 -4
data/spec/twitter_text_spec.rb +2 -2
data/spec/unicode_spec.rb +10 -10
data/spec/validation_spec.rb +35 -11
data/test/conformance_test.rb +14 -0
data/twitter-text.gemspec +11 -9
metadata +53 -32
data/lib/assets/tld_lib.yml +0 -1565

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
-SHA1:
-  metadata.gz: f1dd5437a51b3767c45499c3d5d4b438bd1b7ba1
-  data.tar.gz: 0c7bcee79f7fc1e955cad42ddba8c9318c885ae5
+SHA256:
+  metadata.gz: 92e1f709304c7902186bbe50ff5f7d215059d292a4e8730b9cdff12210dff1aa
+  data.tar.gz: fd50deede86bb5ba1a47ff214350f86a928ed59926438d3361475f3640ff8531
 SHA512:
-  metadata.gz: 6a9ad3b3b822e358070f6722e4d88a362c05016584739d290beafe2e8763aaf202e9198a16855b140acefd2b87f9b1d29990254422750353a786437198f5f8c8
-  data.tar.gz: 4d1ea6e3fd1a158bfcaaad04454723145bdb66c991a51e499015aaaa135fb9792120391032a7ee269cf1d37a71d4cddd9c03511a78026c66830c007c286372f7
+  metadata.gz: 85f39c5bd4d9c58b863d5e9490618ee941a528ab8fd23a463857a206d53ba50a4235cb7e287245a0e3bb66bb78955b98cf6973f1ed5e2ec5741090ef34a77c52
+  data.tar.gz: 6a0133f3acd0a34742435777f4fc276df4639066adc80b39d3a4b84f77ec73eb0772fae0ff52f20e3af752b089ca976de3b215d83241944c2fd0ef8f9823ba85

data/.rspec CHANGED

@@ -1,2 +1,2 @@
 --color
---format=nested
+--format=documentation

data/README.md CHANGED

@@ -1,16 +1,82 @@
 # twitter-text
-![hello](https://img.shields.io/gem/v/twitter-text.svg)
+![](https://img.shields.io/gem/v/twitter-text.svg)
-A gem that provides text processing routines for Twitter Tweets. The major
-reason for this is to unify the various auto-linking and extraction of
-usernames, lists, hashtags and URLs.
+This is the Ruby implementation of the twitter-text parsing
+library. The library has methods to parse Tweets and calculate length,
+validity, parse @mentions, #hashtags, URLs, and more.
-## Extraction Examples
+## Setup
+Installation uses bundler.
-# Extraction
 ```
+% gem install bundler
+% bundle install
+```
+## Conformance tests
+To run the Conformance test suite from the command line via rake:
+```
+% rake test:conformance:run
+```
+You can also run the rspec tests in the `spec` directory:
+```
+% rspec spec
+```
+# Length validation
+twitter-text 2.0 introduces configuration files that define how Tweets
+are parsed for length. This allows for backwards compatibility and
+flexibility going forward. Old-style traditional 140-character parsing
+is defined by the v1.json configuration file, whereas v2.json is
+updated for "weighted" Tweets where ranges of Unicode code points can
+have independent weights aside from the default weight. The sum of all
+code points, each weighted appropriately, should not exceed the max
+weighted length.
+Some old methods from twitter-text 1.0 have been marked deprecated,
+such as the `tweet_length()` method. The new API is based on the
+following method, `parse_tweet()`
+```ruby
+def parse_tweet(text, options = {}) { ... }
+```
+This method takes a string as input and returns a results object that
+contains information about the
+string. `Twitter::Validation::ParseResults` object includes:
+* `:weighted_length`: the overall length of the tweet with code points
+weighted per the ranges defined in the configuration file.
+* `:permillage`: indicates the proportion (per thousand) of the weighted
+length in comparison to the max weighted length. A value > 1000
+indicates input text that is longer than the allowable maximum.
+* `:valid`: indicates if input text length corresponds to a valid
+result.
+* `:display_range_start, :display_range_end`: An array of two unicode code point
+indices identifying the inclusive start and exclusive end of the
+displayable content of the Tweet. For more information, see
+the description of `display_text_range` here:
+[Tweet updates](https://developer.twitter.com/en/docs/tweets/tweet-updates)
+* `:valid_range_start, :valid_range_end`: An array of two unicode code point
+indices identifying the inclusive start and exclusive end of the valid
+content of the Tweet. For more information on the extended Tweet
+payload see [Tweet updates](https://developer.twitter.com/en/docs/tweets/tweet-updates)
+## Extraction Examples
+# Extraction
+```ruby
 class MyClass
   include Twitter::Extractor
   usernames = extract_mentioned_screen_names("Mentioning @twitter and @jack")
@@ -18,9 +84,9 @@ class MyClass
 end
 ```
-# Extraction with a block argument
-```ruby
+### Extraction with a block argument
+```ruby
 class MyClass
   include Twitter::Extractor
   extract_reply_screen_name("@twitter are you hiring?").do |username|
@@ -31,8 +97,9 @@ end
 ## Auto-linking Examples
-# Auto-link
-```
+### Auto-link
+```ruby
 class MyClass
   include Twitter::Autolink
@@ -40,14 +107,14 @@ class MyClass
 end
 ```
-# For Ruby on Rails you want to add this to app/helpers/application_helper.rb
-```
+### For Ruby on Rails you want to add this to app/helpers/application_helper.rb
+```ruby
 module ApplicationHelper
   include Twitter::Autolink
 end
 ```
-# Now the auto_link function is available in every view. So in index.html.erb:
+### Now the auto_link function is available in every view. So in index.html.erb:
 ```ruby
 <%= auto_link("link @user, please #request") %>
 ```
@@ -90,33 +157,37 @@ words should work equally well.
 Use to provide emphasis around the "hits" returned from the Search API, built
 to work against text that has been auto-linked already.
-### Thanks
+## Issues
-Thanks to everybody who has filed issues, provided feedback or contributed
-patches. Patches courtesy of:
+Have a bug? Please create an issue here on GitHub!
-*   At Twitter …
-    *   Matt Sanford - http://github.com/mzsanford
-    *   Raffi Krikorian - http://github.com/r
-    *   Ben Cherry - http://github.com/bcherry
-    *   Patrick Ewing - http://github.com/hoverbird
-    *   Jeff Smick - http://github.com/sprsquish
-    *   Kenneth Kufluk - https://github.com/kennethkufluk
-    *   Keita Fujii - https://github.com/keitaf
-    *   Yoshimasa Niwa - https://github.com/niw
+<https://github.com/twitter/twitter-text/issues>
+## Authors
-*   Patches from the community …
-    *   Jean-Philippe Bougie - http://github.com/jpbougie
-    *   Erik Michaels-Ober - https://github.com/sferik
+### V2.0
+* David LaMacchia (<https://github.com/dlamacchia>)
+* Yoshimasa Niwa (<https://github.com/niw>)
+* Sudheer Guntupalli (<https://github.com/sudhee>)
+* Kaushik Lakshmikanth (<https://github.com/kaushlakers>)
+* Jose Antonio Marquez Russo (<https://github.com/joseeight>)
+* Lee Adams (<https://github.com/leeaustinadams>)
-*   Anyone who has filed an issue. It helps. Really.
+### Previous authors
+* Matt Sanford (<http://github.com/mzsanford>)
+* Raffi Krikorian (<http://github.com/r>)
+* Ben Cherry (<http://github.com/bcherry>)
+* Patrick Ewing (<http://github.com/hoverbird>)
+* Jeff Smick (<http://github.com/sprsquish>)
+* Kenneth Kufluk (<https://github.com/kennethkufluk>)
+* Keita Fujii (<https://github.com/keitaf>)
+* Jean-Philippe Bougie (<http://github.com/jpbougie>)
+* Erik Michaels-Ober (<https://github.com/sferik>)
-### Copyright and License
+## License
-**Copyright 2011 Twitter, Inc.**
+Copyright 2012-2017 Twitter, Inc and other contributors
-Licensed under the Apache License, Version 2.0:
-http://www.apache.org/licenses/LICENSE-2.0
+Licensed under the [Apache License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0)

data/lib/assets/tld_lib.yml ADDED

@@ -0,0 +1 @@
+lib/assets/../../../conformance/tld_lib.yml

data/lib/twitter-text.rb CHANGED

@@ -15,6 +15,8 @@ end
   autolink
   extractor
   unicode
+  weighted_range
+  configuration
   validation
   hit_highlighter
 ).each do |name|

data/lib/twitter-text/autolink.rb CHANGED

@@ -1,4 +1,4 @@
-# encoding: UTF-8
+# encoding: utf-8
 require 'set'
 require 'twitter-text/hash_helper'
@@ -21,9 +21,9 @@ module Twitter
     # Default URL base for auto-linked lists
     DEFAULT_LIST_URL_BASE = "https://twitter.com/".freeze
     # Default URL base for auto-linked hashtags
-    DEFAULT_HASHTAG_URL_BASE = "https://twitter.com/#!/search?q=%23".freeze
+    DEFAULT_HASHTAG_URL_BASE = "https://twitter.com/search?q=%23".freeze
     # Default URL base for auto-linked cashtags
-    DEFAULT_CASHTAG_URL_BASE = "https://twitter.com/#!/search?q=%24".freeze
+    DEFAULT_CASHTAG_URL_BASE = "https://twitter.com/search?q=%24".freeze
     # Default attributes for invisible span tag
     DEFAULT_INVISIBLE_TAG_ATTRS = "style='position:absolute;left:-9999px;'".freeze
@@ -286,7 +286,7 @@ module Twitter
       # wrap the ellipses in a tco-ellipsis class and provide an onCopy handler that sets display:none on
       # everything with the tco-ellipsis class.
       #
-      # Exception: pic.twitter.com images, for which expandedUrl = "https://twitter.com/#!/username/status/1234/photo/1
+      # Exception: pic.twitter.com images, for which expandedUrl = "https://twitter.com/username/status/1234/photo/1
       # For those URLs, display_url is not a substring of expanded_url, so we don't do anything special to render the elided parts.
       # For a pic.twitter.com URL, the only elided part will be the "https://", so this is fine.
       display_url_sans_ellipses = display_url.gsub("…", "")

data/lib/twitter-text/configuration.rb ADDED

@@ -0,0 +1,53 @@
+# encoding: UTF-8
+module Twitter
+  class Configuration
+    require 'json'
+    PARSER_VERSION_CLASSIC = "v1"
+    PARSER_VERSION_DEFAULT = "v2"
+    class << self
+      attr_accessor :default_configuration
+    end
+    attr_reader :version, :max_weighted_tweet_length, :scale
+    attr_reader :default_weight, :transformed_url_length, :ranges
+    CONFIG_V1 = File.join(
+      File.expand_path('../../../../config', __FILE__), # project root
+      "#{PARSER_VERSION_CLASSIC}.json"
+    )
+    CONFIG_V2 = File.join(
+      File.expand_path('../../../../config', __FILE__), # project root
+      "#{PARSER_VERSION_DEFAULT}.json"
+    )
+    def self.parse_string(string, options = {})
+      JSON.parse(string, options.merge(symbolize_names: true))
+    end
+    def self.parse_file(filename)
+      string = File.open(filename, 'rb') { |f| f.read }
+      parse_string(string)
+    end
+    def self.configuration_from_file(filename)
+      config = parse_file(filename)
+      config ? Twitter::Configuration.new(config) : nil
+    end
+    def initialize(config = {})
+      @version = config[:version]
+      @max_weighted_tweet_length = config[:maxWeightedTweetLength]
+      @scale = config[:scale]
+      @default_weight = config[:defaultWeight]
+      @transformed_url_length = config[:transformedURLLength]
+      @ranges = config[:ranges].map { |range| Twitter::WeightedRange.new(range) } if config.key?(:ranges) && config[:ranges].is_a?(Array)
+    end
+    self.default_configuration = Twitter::Configuration.configuration_from_file(Twitter::Configuration::CONFIG_V2)
+  end
+end
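
The key mapping in `initialize` can be tried in isolation with the sketch below. `SketchConfiguration` is a hypothetical stand-in for `Twitter::Configuration`, and the sample JSON values are assumptions, not copied from v1.json or v2.json. One deliberate difference: the sketch falls back to an empty array when `:ranges` is missing, whereas the code above leaves `@ranges` as `nil` in that case.

```ruby
require 'json'

# Hypothetical stand-in for Twitter::Configuration, reduced to the key mapping.
class SketchConfiguration
  attr_reader :version, :max_weighted_tweet_length, :scale,
              :default_weight, :transformed_url_length, :ranges

  def self.from_json(string)
    # symbolize_names turns "maxWeightedTweetLength" into :maxWeightedTweetLength.
    new(JSON.parse(string, symbolize_names: true))
  end

  def initialize(config = {})
    @version = config[:version]
    @max_weighted_tweet_length = config[:maxWeightedTweetLength]
    @scale = config[:scale]
    @default_weight = config[:defaultWeight]
    @transformed_url_length = config[:transformedURLLength]
    # Unlike the hunk above, default to an empty array when :ranges is absent.
    @ranges = config[:ranges].is_a?(Array) ? config[:ranges] : []
  end
end

# Assumed sample document; the values are illustrative only.
sample = '{"version": 1, "maxWeightedTweetLength": 140, "scale": 1, ' \
         '"defaultWeight": 1, "transformedURLLength": 23}'
config = SketchConfiguration.from_json(sample)
```

The empty-array fallback matters for callers that iterate over `ranges` without a nil check, as `parse_tweet` in validation.rb does with `ranges.each`.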

data/lib/twitter-text/deprecation.rb CHANGED

@@ -7,7 +7,7 @@ module Twitter
       alias_method(deprecated_method, method)
       define_method method do |*args, &block|
-        warn message
+        warn message unless $TESTING
         send(deprecated_method, *args, &block)
       end
     end
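
For context, the wrapping pattern this hunk touches can be sketched as follows. `SketchDeprecation`, `SketchValidator`, and the message wording are hypothetical; only the `alias_method` / `define_method` / `warn ... unless $TESTING` shape is taken from the lines above.

```ruby
# Hypothetical re-creation of the wrapping pattern; names are illustrative.
module SketchDeprecation
  def deprecate(method, new_method = nil)
    deprecated_method = :"deprecated_#{method}"
    message = "Deprecation: `#{method}` is deprecated."
    message += " Please use `#{new_method}` instead." if new_method

    # Keep the original body under a new name, then redefine the public name
    # as a wrapper that warns before delegating.
    alias_method(deprecated_method, method)
    define_method(method) do |*args, &block|
      warn message unless $TESTING
      send(deprecated_method, *args, &block)
    end
  end
end

class SketchValidator
  extend SketchDeprecation

  def old_length(text)
    text.length
  end
  deprecate :old_length, :new_length
end

$TESTING = true # silence the warning, as a spec helper might
result = SketchValidator.new.old_length("abc")
```

With `$TESTING` set, the wrapper still delegates to the original method but no longer writes to standard error.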

data/lib/twitter-text/extractor.rb CHANGED

@@ -1,4 +1,5 @@
-# encoding: UTF-8
+# encoding: utf-8
+require 'idn'
 class String
   # Helper function to count the character length by first converting to an
@@ -47,6 +48,15 @@ module Twitter
   # A module for including Tweet parsing in a class. This module provides function for the extraction and processing
   # of usernames, lists, URLs and hashtags.
   module Extractor extend self
+    # Maximum URL length as defined by Twitter's backend.
+    MAX_URL_LENGTH = 4096
+    # The maximum t.co path length that the Twitter backend supports.
+    MAX_TCO_SLUG_LENGTH = 40
+    URL_PROTOCOL_LENGTH = "https://".length
     # Remove overlapping entities.
     # This returns a new array with no overlapping entities.
     def remove_overlapping_entities(entities)
@@ -201,6 +211,7 @@ module Twitter
           next if !options[:extract_url_without_protocol] || before =~ Twitter::Regex[:invalid_url_without_protocol_preceding_chars]
           last_url = nil
           domain.scan(Twitter::Regex[:valid_ascii_domain]) do |ascii_domain|
+            next unless is_valid_domain(url.length, ascii_domain, protocol)
             last_url = {
               :url => ascii_domain,
               :indices => [start_position + $~.char_begin(0),
@@ -225,9 +236,13 @@ module Twitter
         else
           # In the case of t.co URLs, don't allow additional path characters
           if url =~ Twitter::Regex[:valid_tco_url]
+            next if $1 && $1.length > MAX_TCO_SLUG_LENGTH
             url = $&
             end_position = start_position + url.char_length
           end
+          next unless is_valid_domain(url.length, domain, protocol)
           urls << {
             :url => url,
             :indices => [start_position, end_position]
@@ -324,5 +339,20 @@ module Twitter
       tags.each{|tag| yield tag[:cashtag], tag[:indices].first, tag[:indices].last} if block_given?
       tags
     end
+    def is_valid_domain(url_length, domain, protocol)
+      begin
+        raise ArgumentError.new("invalid empty domain") unless domain
+        original_domain_length = domain.length
+        encoded_domain = IDN::Idna.toASCII(domain)
+        updated_domain_length = encoded_domain.length
+        url_length += (updated_domain_length - original_domain_length) if (updated_domain_length > original_domain_length)
+        url_length += URL_PROTOCOL_LENGTH unless protocol
+        url_length <= MAX_URL_LENGTH
+      rescue Exception
+        # On error don't consider this a valid domain.
+        return false
+      end
+    end
   end
 end
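
The length arithmetic in `is_valid_domain` can be exercised without the `idn` dependency using the sketch below. The helper name and the extra `encoded_domain` parameter are assumptions made for illustration: the caller supplies the ASCII form that `IDN::Idna.toASCII` would produce in the real method.

```ruby
SKETCH_MAX_URL_LENGTH = 4096
SKETCH_URL_PROTOCOL_LENGTH = "https://".length

# encoded_domain is passed in directly so the sketch needs no IDN library;
# in the hunk above it comes from IDN::Idna.toASCII(domain).
def sketch_url_length_ok?(url_length, domain, encoded_domain, protocol)
  return false if domain.nil? || encoded_domain.nil?

  # Only count growth: a Punycode form longer than the Unicode form lengthens the URL.
  growth = encoded_domain.length - domain.length
  url_length += growth if growth > 0
  # A URL typed without a protocol is measured as if "https://" were prepended.
  url_length += SKETCH_URL_PROTOCOL_LENGTH unless protocol
  url_length <= SKETCH_MAX_URL_LENGTH
end

idn_ok = sketch_url_length_ok?("https://bücher.example/x".length,
                               "bücher.example", "xn--bcher-kva.example", "https://")
too_long = sketch_url_length_ok?(4090, "example.com", "example.com", nil)
```

The second call shows the boundary case: a 4090-character URL without a protocol is treated as 4098 characters and rejected.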

data/lib/twitter-text/regex.rb CHANGED

@@ -1,4 +1,4 @@
-# encoding: UTF-8
+# encoding: utf-8
 module Twitter
   # A collection of regular expressions for parsing Tweet text. The regular expression
@@ -62,10 +62,10 @@ module Twitter
     major, minor, _patch = RUBY_VERSION.split('.')
     if major.to_i >= 2 || major.to_i == 1 && minor.to_i >= 9 || (defined?(RUBY_ENGINE) && ["jruby", "rbx"].include?(RUBY_ENGINE))
-      REGEXEN[:list_name] = /[a-zA-Z][a-zA-Z0-9_\-\u0080-\u00ff]{0,24}/
+      REGEXEN[:list_name] = /[a-z][a-z0-9_\-\u0080-\u00ff]{0,24}/i
     else
       # This line barfs at compile time in Ruby 1.9, JRuby, or Rubinius.
-      REGEXEN[:list_name] = eval("/[a-zA-Z][a-zA-Z0-9_\\-\x80-\xff]{0,24}/")
+      REGEXEN[:list_name] = eval("/[a-z][a-z0-9_\\-\x80-\xff]{0,24}/i")
     end
     # Latin accented characters
@@ -148,17 +148,17 @@ module Twitter
     # Used in Extractor for final filtering
     REGEXEN[:end_hashtag_match] = /\A(?:[#＃]|:\/\/)/o
-    REGEXEN[:valid_mention_preceding_chars] = /(?:[^a-zA-Z0-9_!#\$%&*@＠]|^|(?:^|[^a-zA-Z0-9_+~.-])[rR][tT]:?)/o
+    REGEXEN[:valid_mention_preceding_chars] = /(?:[^a-z0-9_!#\$%&*@＠]|^|(?:^|[^a-z0-9_+~.-])[rR][tT]:?)/io
     REGEXEN[:at_signs] = /[@＠]/
     REGEXEN[:valid_mention_or_list] = /
       (#{REGEXEN[:valid_mention_preceding_chars]})  # $1: Preceeding character
       (#{REGEXEN[:at_signs]})                       # $2: At mark
-      ([a-zA-Z0-9_]{1,20})                          # $3: Screen name
-      (\/[a-zA-Z][a-zA-Z0-9_\-]{0,24})?             # $4: List (optional)
-    /ox
-    REGEXEN[:valid_reply] = /^(?:#{REGEXEN[:spaces]})*#{REGEXEN[:at_signs]}([a-zA-Z0-9_]{1,20})/o
+      ([a-z0-9_]{1,20})                             # $3: Screen name
+      (\/[a-z][a-zA-Z0-9_\-]{0,24})?                # $4: List (optional)
+    /iox
+    REGEXEN[:valid_reply] = /^(?:#{REGEXEN[:spaces]})*#{REGEXEN[:at_signs]}([a-z0-9_]{1,20})/io
     # Used in Extractor for final filtering
-    REGEXEN[:end_mention_match] = /\A(?:#{REGEXEN[:at_signs]}|#{REGEXEN[:latin_accents]}|:\/\/)/o
+    REGEXEN[:end_mention_match] = /\A(?:#{REGEXEN[:at_signs]}|#{REGEXEN[:latin_accents]}|:\/\/)/io
     # URL related hash regex collection
     REGEXEN[:valid_url_preceding_chars] = /(?:[^A-Z0-9@＠$#＃#{INVALID_CHARACTERS.join('')}]|^)/io
@@ -196,12 +196,12 @@ module Twitter
     # This is used in Extractor
     REGEXEN[:valid_ascii_domain] = /
-      (?:(?:[A-Za-z0-9\-_]|#{REGEXEN[:latin_accents]})+\.)+
+      (?:(?:[a-z0-9\-_]|#{REGEXEN[:latin_accents]})+\.)+
       (?:#{REGEXEN[:valid_gTLD]}|#{REGEXEN[:valid_ccTLD]}|#{REGEXEN[:valid_punycode]})
     /iox
     # This is used in Extractor for stricter t.co URL extraction
-    REGEXEN[:valid_tco_url] = /^https?:\/\/t\.co\/[a-z0-9]+/i
+    REGEXEN[:valid_tco_url] = /^https?:\/\/t\.co\/([a-z0-9]+)/i
     # This is used in Extractor to filter out unwanted URLs.
     REGEXEN[:invalid_short_domain] = /\A#{REGEXEN[:valid_domain_name]}#{REGEXEN[:valid_ccTLD]}\Z/io
@@ -209,7 +209,7 @@ module Twitter
     REGEXEN[:valid_port_number] = /[0-9]+/
-    REGEXEN[:valid_general_url_path_chars] = /[a-z\p{Cyrillic}0-9!\*';:=\+\,\.\$\/%#\[\]\-_~&\|@#{LATIN_ACCENTS}]/io
+    REGEXEN[:valid_general_url_path_chars] = /[a-z\p{Cyrillic}0-9!\*';:=\+\,\.\$\/%#\[\]\p{Pd}_~&\|@#{LATIN_ACCENTS}]/io
     # Allow URL paths to contain up to two nested levels of balanced parens
     #  1. Used in Wikipedia URLs like /Primer_(film)
     #  2. Used in IIS sessions like /S(dfd346)/
@@ -260,7 +260,7 @@ module Twitter
     REGEXEN[:valid_cashtag] = /(^|#{REGEXEN[:spaces]})(\$)(#{REGEXEN[:cashtag]})(?=$|\s|[#{PUNCTUATION_CHARS}])/i
     # These URL validation pattern strings are based on the ABNF from RFC 3986
-    REGEXEN[:validate_url_unreserved] = /[a-z\p{Cyrillic}0-9\-._~]/i
+    REGEXEN[:validate_url_unreserved] = /[a-z\p{Cyrillic}0-9\p{Pd}._~]/i
     REGEXEN[:validate_url_pct_encoded] = /(?:%[0-9a-f]{2})/i
     REGEXEN[:validate_url_sub_delims] = /[!$&'()*+,;=]/i
     REGEXEN[:validate_url_pchar] = /(?:
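
The effect of replacing `\-` with `\p{Pd}` (Unicode dash punctuation) can be checked with a short sketch. The two character classes are adapted from the `validate_url_unreserved` change above with `\p{Cyrillic}` dropped for brevity, and the constant names are hypothetical.

```ruby
# Character classes adapted from the hunk above, minus \p{Cyrillic} for brevity.
OLD_UNRESERVED = /[a-z0-9\-._~]/i
NEW_UNRESERVED = /[a-z0-9\p{Pd}._~]/i

hyphen = "-"       # U+002D HYPHEN-MINUS
en_dash = "\u2013" # U+2013 EN DASH, also in the Pd category

old_matches = [hyphen, en_dash].map { |c| OLD_UNRESERVED.match?(c) }
new_matches = [hyphen, en_dash].map { |c| NEW_UNRESERVED.match?(c) }

# The /i flag lets a lowercase-only class accept uppercase input,
# which is what allows the a-zA-Z ranges to be shortened to a-z.
screen_name_ok = /\A[a-z0-9_]{1,20}\z/i.match?("TwitterDev")
```

The old class accepts only the ASCII hyphen, while the new one also accepts other dash characters such as the en dash.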

data/lib/twitter-text/validation.rb CHANGED

@@ -2,65 +2,114 @@ require 'unf'
 module Twitter
   module Validation extend self
-    MAX_LENGTH = 140
     DEFAULT_TCO_URL_LENGTHS = {
       :short_url_length => 23,
-      :short_url_length_https => 23,
-      :characters_reserved_per_media => 23
-    }.freeze
+    }
-    # Returns the length of the string as it would be displayed. This is equivilent to the length of the Unicode NFC
-    # (See: http://www.unicode.org/reports/tr15). This is needed in order to consistently calculate the length of a
-    # string no matter which actual form was transmitted. For example:
-    #
-    #     U+0065  Latin Small Letter E
-    # +   U+0301  Combining Acute Accent
-    # ----------
-    # =   2 bytes, 2 characters, displayed as é (1 visual glyph)
-    #     … The NFC of {U+0065, U+0301} is {U+00E9}, which is a single chracter and a +display_length+ of 1
-    #
-    # The string could also contain U+00E9 already, in which case the canonicalization will not change the value.
-    #
-    def tweet_length(text, options = {})
-      options = DEFAULT_TCO_URL_LENGTHS.merge(options)
+    # :weighted_length the weighted length of tweet based on weights specified in the config
+    # :valid If tweet is valid
+    # :permillage permillage of the tweet over the max length specified in config
+    # :valid_range_start beginning of valid text
+    # :valid_range_end End index of valid part of the tweet text (inclusive)
+    # :display_range_start beginning index of display text
+    # :display_range_end end index of display text (inclusive)
+    class ParseResults < Hash
-      length = text.to_nfc.unpack("U*").length
+      RESULT_PARAMS = [:weighted_length, :valid, :permillage, :valid_range_start, :valid_range_end, :display_range_start, :display_range_end]
-      Twitter::Extractor.extract_urls_with_indices(text) do |url, start_position, end_position|
-        length += start_position - end_position
-        length += url.downcase =~ /^https:\/\// ? options[:short_url_length_https] : options[:short_url_length]
+      def self.empty
+        return ParseResults.new(weighted_length: 0, permillage: 0, valid: true, display_range_start: 0, display_range_end: 0, valid_range_start: 0, valid_range_end: 0)
       end
-      length
+      def initialize(params = {})
+        RESULT_PARAMS.each do |key|
+          super[key] = params[key] if params.key?(key)
+        end
+      end
     end
-    # Check the <tt>text</tt> for any reason that it may not be valid as a Tweet. This is meant as a pre-validation
-    # before posting to api.twitter.com. There are several server-side reasons for Tweets to fail but this pre-validation
-    # will allow quicker feedback.
-    #
-    # Returns <tt>false</tt> if this <tt>text</tt> is valid. Otherwise one of the following Symbols will be returned:
-    #
-    #   <tt>:too_long</tt>:: if the <tt>text</tt> is too long
-    #   <tt>:empty</tt>:: if the <tt>text</tt> is nil or empty
-    #   <tt>:invalid_characters</tt>:: if the <tt>text</tt> contains non-Unicode or any of the disallowed Unicode characters
-    def tweet_invalid?(text)
-      return :empty if !text || text.empty?
+    # Parse input text and return hash with descriptive parameters populated.
+    def parse_tweet(text, options = {})
+      options = DEFAULT_TCO_URL_LENGTHS.merge(options)
+      config = options[:config] || Twitter::Configuration.default_configuration
+      normalized_text = text.to_nfc
+      normalized_text_length = normalized_text.char_length
+      unless (normalized_text_length > 0)
+        ParseResults.empty()
+      end
+      scale = config.scale
+      max_weighted_tweet_length = config.max_weighted_tweet_length
+      scaled_max_weighted_tweet_length = max_weighted_tweet_length * scale
+      transformed_url_length = config.transformed_url_length * scale
+      ranges = config.ranges
+      url_entities = Twitter::Extractor.extract_urls_with_indices(normalized_text)
+      has_invalid_chars = false
+      weighted_count = 0
+      offset = 0
+      display_offset = 0
+      valid_offset = 0
+      while offset < normalized_text_length
+        # Reset the default char weight each pass through the loop
+        char_weight = config.default_weight
+        url_entities.each do |url_entity|
+          if url_entity[:indices].first == offset
+            url_length = url_entity[:indices].last - url_entity[:indices].first
+            weighted_count += transformed_url_length
+            offset += url_length
+            display_offset += url_length
+            if weighted_count <= scaled_max_weighted_tweet_length
+              valid_offset += url_length
+            end
+            # Finding a match breaks the loop; order of ranges matters.
+            break
+          end
+        end
+        if offset < normalized_text_length
+          code_point = normalized_text[offset]
+          ranges.each do |range|
+            if range.contains?(code_point.unpack("U").first)
+              char_weight = range.weight
+              break
+            end
+          end
+          weighted_count += char_weight
+          has_invalid_chars = contains_invalid?(normalized_text[offset]) unless has_invalid_chars
+          char_count = code_point.char_length
+          offset += char_count
+          display_offset += char_count
+          if !has_invalid_chars && (weighted_count <= scaled_max_weighted_tweet_length)
+            valid_offset += char_count
+          end
+        end
+      end
+      normalized_text_offset = text.char_length - normalized_text.char_length
+      scaled_weighted_length = weighted_count / scale
+      is_valid = !has_invalid_chars && (scaled_weighted_length <= max_weighted_tweet_length)
+      permillage = scaled_weighted_length * 1000 / max_weighted_tweet_length
+      return ParseResults.new(weighted_length: scaled_weighted_length, permillage: permillage, valid: is_valid, display_range_start: 0, display_range_end: (display_offset + normalized_text_offset - 1), valid_range_start: 0, valid_range_end: (valid_offset + normalized_text_offset - 1))
+    end
+    def contains_invalid?(text)
+      return false if !text || text.empty?
       begin
-        return :too_long if tweet_length(text) > MAX_LENGTH
-        return :invalid_characters if Twitter::Regex::INVALID_CHARACTERS.any?{|invalid_char| text.include?(invalid_char) }
+        return true if Twitter::Regex::INVALID_CHARACTERS.any?{|invalid_char| text.include?(invalid_char) }
       rescue ArgumentError
         # non-Unicode value.
-        return :invalid_characters
+        return true
       end
       return false
     end
-    def valid_tweet_text?(text)
-      !tweet_invalid?(text)
-    end
     def valid_username?(username)
       return false if !username || username.empty?
@@ -102,6 +151,69 @@ module Twitter
              (!unicode_domains && valid_match?(authority, Twitter::Regex[:validate_url_authority]))
     end
+    # These methods are deprecated, will be removed in future.
+    extend Deprecation
+    MAX_LENGTH_LEGACY = 140
+    # DEPRECATED: Please use parse_text instead.
+    #
+    # Returns the length of the string as it would be displayed. This is equivilent to the length of the Unicode NFC
+    # (See: http://www.unicode.org/reports/tr15). This is needed in order to consistently calculate the length of a
+    # string no matter which actual form was transmitted. For example:
+    #
+    #     U+0065  Latin Small Letter E
+    # +   U+0301  Combining Acute Accent
+    # ----------
+    # =   2 bytes, 2 characters, displayed as é (1 visual glyph)
+    #     … The NFC of {U+0065, U+0301} is {U+00E9}, which is a single chracter and a +display_length+ of 1
+    #
+    # The string could also contain U+00E9 already, in which case the canonicalization will not change the value.
+    #
+    def tweet_length(text, options = {})
+      options = DEFAULT_TCO_URL_LENGTHS.merge(options)
+      length = text.to_nfc.unpack("U*").length
+      Twitter::Extractor.extract_urls_with_indices(text) do |url, start_position, end_position|
+        length += start_position - end_position
+        length += options[:short_url_length] if url.length > 0
+      end
+      length
+    end
+    deprecate :tweet_length, :parse_tweet
+    # DEPRECATED: Please use parse_text instead.
+    #
+    # Check the <tt>text</tt> for any reason that it may not be valid as a Tweet. This is meant as a pre-validation
+    # before posting to api.twitter.com. There are several server-side reasons for Tweets to fail but this pre-validation
+    # will allow quicker feedback.
+    #
+    # Returns <tt>false</tt> if this <tt>text</tt> is valid. Otherwise one of the following Symbols will be returned:
+    #
+    #   <tt>:too_long</tt>:: if the <tt>text</tt> is too long
+    #   <tt>:empty</tt>:: if the <tt>text</tt> is nil or empty
+    #   <tt>:invalid_characters</tt>:: if the <tt>text</tt> contains non-Unicode or any of the disallowed Unicode characters
+    def tweet_invalid?(text)
+      return :empty if !text || text.empty?
+      begin
+        return :too_long if tweet_length(text) > MAX_LENGTH_LEGACY
+        return :invalid_characters if Twitter::Regex::INVALID_CHARACTERS.any?{|invalid_char| text.include?(invalid_char) }
+      rescue ArgumentError
+        # non-Unicode value.
+        return :invalid_characters
+      end
+      return false
+    end
+    deprecate :tweet_invalid?, :parse_tweet
+    def valid_tweet_text?(text)
+      !tweet_invalid?(text)
+    end
+    deprecate :valid_tweet_text?, :parse_tweet
     private
     def valid_match?(string, regex, optional=false)