twitter-text 1.14.7 → 2.0.0

Files changed (27)

checksums.yaml +5 -5
data/.rspec +1 -1
data/README.md +104 -33
data/lib/assets/tld_lib.yml +1 -0
data/lib/twitter-text.rb +2 -0
data/lib/twitter-text/autolink.rb +4 -4
data/lib/twitter-text/configuration.rb +53 -0
data/lib/twitter-text/deprecation.rb +1 -1
data/lib/twitter-text/extractor.rb +31 -1
data/lib/twitter-text/regex.rb +13 -13
data/lib/twitter-text/validation.rb +155 -43
data/lib/twitter-text/weighted_range.rb +18 -0
data/spec/autolinking_spec.rb +161 -161
data/spec/configuration_spec.rb +91 -0
data/spec/extractor_spec.rb +92 -72
data/spec/hithighlighter_spec.rb +15 -15
data/spec/regex_spec.rb +7 -7
data/spec/rewriter_spec.rb +110 -109
data/spec/spec_helper.rb +13 -15
data/spec/test_urls.rb +6 -4
data/spec/twitter_text_spec.rb +2 -2
data/spec/unicode_spec.rb +10 -10
data/spec/validation_spec.rb +35 -11
data/test/conformance_test.rb +14 -0
data/twitter-text.gemspec +11 -9
metadata +53 -32
data/lib/assets/tld_lib.yml +0 -1565

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
-SHA1:
-  metadata.gz: f1dd5437a51b3767c45499c3d5d4b438bd1b7ba1
-  data.tar.gz: 0c7bcee79f7fc1e955cad42ddba8c9318c885ae5
+SHA256:
+  metadata.gz: 92e1f709304c7902186bbe50ff5f7d215059d292a4e8730b9cdff12210dff1aa
+  data.tar.gz: fd50deede86bb5ba1a47ff214350f86a928ed59926438d3361475f3640ff8531
 SHA512:
-  metadata.gz: 6a9ad3b3b822e358070f6722e4d88a362c05016584739d290beafe2e8763aaf202e9198a16855b140acefd2b87f9b1d29990254422750353a786437198f5f8c8
-  data.tar.gz: 4d1ea6e3fd1a158bfcaaad04454723145bdb66c991a51e499015aaaa135fb9792120391032a7ee269cf1d37a71d4cddd9c03511a78026c66830c007c286372f7
+  metadata.gz: 85f39c5bd4d9c58b863d5e9490618ee941a528ab8fd23a463857a206d53ba50a4235cb7e287245a0e3bb66bb78955b98cf6973f1ed5e2ec5741090ef34a77c52
+  data.tar.gz: 6a0133f3acd0a34742435777f4fc276df4639066adc80b39d3a4b84f77ec73eb0772fae0ff52f20e3af752b089ca976de3b215d83241944c2fd0ef8f9823ba85

data/.rspec CHANGED

@@ -1,2 +1,2 @@
 --color
---format=nested
+--format=documentation

data/README.md CHANGED

@@ -1,16 +1,82 @@
 # twitter-text
-![hello](https://img.shields.io/gem/v/twitter-text.svg)
+![](https://img.shields.io/gem/v/twitter-text.svg)
-A gem that provides text processing routines for Twitter Tweets. The major
-reason for this is to unify the various auto-linking and extraction of
-usernames, lists, hashtags and URLs.
+This is the Ruby implementation of the twitter-text parsing
+library. The library has methods to parse Tweets and calculate length,
+validity, parse @mentions, #hashtags, URLs, and more.
-## Extraction Examples
+## Setup
+Installation uses bundler.
-# Extraction
 ```
+% gem install bundler
+% bundle install
+```
+## Conformance tests
+To run the Conformance test suite from the command line via rake:
+```
+% rake test:conformance:run
+```
+You can also run the rspec tests in the `spec` directory:
+```
+% rspec spec
+```
+# Length validation
+twitter-text 2.0 introduces configuration files that define how Tweets
+are parsed for length. This allows for backwards compatibility and
+flexibility going forward. Old-style traditional 140-character parsing
+is defined by the v1.json configuration file, whereas v2.json is
+updated for "weighted" Tweets where ranges of Unicode code points can
+have independent weights aside from the default weight. The sum of all
+code points, each weighted appropriately, should not exceed the max
+weighted length.
+Some old methods from twitter-text 1.0 have been marked deprecated,
+such as the `tweet_length()` method. The new API is based on the
+following method, `parse_tweet()`
+```ruby
+def parse_tweet(text, options = {}) { ... }
+```
+This method takes a string as input and returns a results object that
+contains information about the
+string. `Twitter::Validation::ParseResults` object includes:
+* `:weighted_length`: the overall length of the tweet with code points
+weighted per the ranges defined in the configuration file.
+* `:permillage`: indicates the proportion (per thousand) of the weighted
+length in comparison to the max weighted length. A value > 1000
+indicates input text that is longer than the allowable maximum.
+* `:valid`: indicates if input text length corresponds to a valid
+result.
+* `:display_range_start, :display_range_end`: An array of two unicode code point
+indices identifying the inclusive start and exclusive end of the
+displayable content of the Tweet. For more information, see
+the description of `display_text_range` here:
+[Tweet updates](https://developer.twitter.com/en/docs/tweets/tweet-updates)
+* `:valid_range_start, :valid_range_end`: An array of two unicode code point
+indices identifying the inclusive start and exclusive end of the valid
+content of the Tweet. For more information on the extended Tweet
+payload see [Tweet updates](https://developer.twitter.com/en/docs/tweets/tweet-updates)
+## Extraction Examples
+# Extraction
+```ruby
 class MyClass
   include Twitter::Extractor
   usernames = extract_mentioned_screen_names("Mentioning @twitter and @jack")
@@ -18,9 +84,9 @@ class MyClass
 end
 ```
-# Extraction with a block argument
-```ruby
+### Extraction with a block argument
+```ruby
 class MyClass
   include Twitter::Extractor
   extract_reply_screen_name("@twitter are you hiring?").do |username|
@@ -31,8 +97,9 @@ end
 ## Auto-linking Examples
-# Auto-link
-```
+### Auto-link
+```ruby
 class MyClass
   include Twitter::Autolink
@@ -40,14 +107,14 @@ class MyClass
 end
 ```
-# For Ruby on Rails you want to add this to app/helpers/application_helper.rb
-```
+### For Ruby on Rails you want to add this to app/helpers/application_helper.rb
+```ruby
 module ApplicationHelper
   include Twitter::Autolink
 end
 ```
-# Now the auto_link function is available in every view. So in index.html.erb:
+### Now the auto_link function is available in every view. So in index.html.erb:
 ```ruby
 <%= auto_link("link @user, please #request") %>
 ```
@@ -90,33 +157,37 @@ words should work equally well.
 Use to provide emphasis around the "hits" returned from the Search API, built
 to work against text that has been auto-linked already.
-### Thanks
+## Issues
-Thanks to everybody who has filed issues, provided feedback or contributed
-patches. Patches courtesy of:
+Have a bug? Please create an issue here on GitHub!
-*   At Twitter …
-    *   Matt Sanford - http://github.com/mzsanford
-    *   Raffi Krikorian - http://github.com/r
-    *   Ben Cherry - http://github.com/bcherry
-    *   Patrick Ewing - http://github.com/hoverbird
-    *   Jeff Smick - http://github.com/sprsquish
-    *   Kenneth Kufluk - https://github.com/kennethkufluk
-    *   Keita Fujii - https://github.com/keitaf
-    *   Yoshimasa Niwa - https://github.com/niw
+<https://github.com/twitter/twitter-text/issues>
+## Authors
-*   Patches from the community …
-    *   Jean-Philippe Bougie - http://github.com/jpbougie
-    *   Erik Michaels-Ober - https://github.com/sferik
+### V2.0
+* David LaMacchia (<https://github.com/dlamacchia>)
+* Yoshimasa Niwa (<https://github.com/niw>)
+* Sudheer Guntupalli (<https://github.com/sudhee>)
+* Kaushik Lakshmikanth (<https://github.com/kaushlakers>)
+* Jose Antonio Marquez Russo (<https://github.com/joseeight>)
+* Lee Adams (<https://github.com/leeaustinadams>)
-*   Anyone who has filed an issue. It helps. Really.
+### Previous authors
+* Matt Sanford (<http://github.com/mzsanford>)
+* Raffi Krikorian (<http://github.com/r>)
+* Ben Cherry (<http://github.com/bcherry>)
+* Patrick Ewing (<http://github.com/hoverbird>)
+* Jeff Smick (<http://github.com/sprsquish>)
+* Kenneth Kufluk (<https://github.com/kennethkufluk>)
+* Keita Fujii (<https://github.com/keitaf>)
+* Jean-Philippe Bougie (<http://github.com/jpbougie>)
+* Erik Michaels-Ober (<https://github.com/sferik>)
-### Copyright and License
+## License
-**Copyright 2011 Twitter, Inc.**
+Copyright 2012-2017 Twitter, Inc and other contributors
-Licensed under the Apache License, Version 2.0:
-http://www.apache.org/licenses/LICENSE-2.0
+Licensed under the [Apache License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0)
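
Note on the README's new "Length validation" section: the weighted-length arithmetic it describes can be sketched in a few lines of Ruby. This is a minimal illustration, not the gem's implementation. It skips URL handling, NFC normalization and invalid-character checks. The method names `sketch_config` and `sketch_weighted_length`, the range key names, and every number in the configuration hash are assumptions chosen for the example, not values read from v2.json.

```ruby
# Illustrative configuration. All values here are assumptions for this
# sketch, not a copy of the gem's v2.json.
def sketch_config
  {
    scale: 100,
    max_weighted_tweet_length: 280,
    default_weight: 200,
    ranges: [{ start: 0, end: 4351, weight: 100 }]
  }
end

# Sums a per-code-point weight, then scales it down, following the
# description in the README text above (URLs are not handled here).
def sketch_weighted_length(text, config = sketch_config)
  weighted = text.each_char.sum do |char|
    code_point = char.ord
    # The first range containing the code point decides the weight.
    range = config[:ranges].find { |r| code_point.between?(r[:start], r[:end]) }
    range ? range[:weight] : config[:default_weight]
  end
  length = weighted / config[:scale]
  max = config[:max_weighted_tweet_length]
  { weighted_length: length, permillage: length * 1000 / max, valid: length <= max }
end
```

With these assumed weights, `sketch_weighted_length("hello")` gives a weighted length of 5 and a permillage of 17, and a string of 281 ASCII letters is reported as not valid with a permillage above 1000.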

data/lib/assets/tld_lib.yml ADDED

@@ -0,0 +1 @@
+lib/assets/../../../conformance/tld_lib.yml

data/lib/twitter-text.rb CHANGED

@@ -15,6 +15,8 @@ end
   autolink
   extractor
   unicode
+  weighted_range
+  configuration
   validation
   hit_highlighter
 ).each do |name|

data/lib/twitter-text/autolink.rb CHANGED

@@ -1,4 +1,4 @@
-# encoding: UTF-8
+# encoding: utf-8
 require 'set'
 require 'twitter-text/hash_helper'
@@ -21,9 +21,9 @@ module Twitter
     # Default URL base for auto-linked lists
     DEFAULT_LIST_URL_BASE = "https://twitter.com/".freeze
     # Default URL base for auto-linked hashtags
-    DEFAULT_HASHTAG_URL_BASE = "https://twitter.com/#!/search?q=%23".freeze
+    DEFAULT_HASHTAG_URL_BASE = "https://twitter.com/search?q=%23".freeze
     # Default URL base for auto-linked cashtags
-    DEFAULT_CASHTAG_URL_BASE = "https://twitter.com/#!/search?q=%24".freeze
+    DEFAULT_CASHTAG_URL_BASE = "https://twitter.com/search?q=%24".freeze
     # Default attributes for invisible span tag
     DEFAULT_INVISIBLE_TAG_ATTRS = "style='position:absolute;left:-9999px;'".freeze
@@ -286,7 +286,7 @@ module Twitter
       # wrap the ellipses in a tco-ellipsis class and provide an onCopy handler that sets display:none on
       # everything with the tco-ellipsis class.
       #
-      # Exception: pic.twitter.com images, for which expandedUrl = "https://twitter.com/#!/username/status/1234/photo/1
+      # Exception: pic.twitter.com images, for which expandedUrl = "https://twitter.com/username/status/1234/photo/1
       # For those URLs, display_url is not a substring of expanded_url, so we don't do anything special to render the elided parts.
       # For a pic.twitter.com URL, the only elided part will be the "https://", so this is fine.
       display_url_sans_ellipses = display_url.gsub("…", "")

data/lib/twitter-text/configuration.rb ADDED

@@ -0,0 +1,53 @@
+# encoding: UTF-8
+module Twitter
+  class Configuration
+    require 'json'
+    PARSER_VERSION_CLASSIC = "v1"
+    PARSER_VERSION_DEFAULT = "v2"
+    class << self
+      attr_accessor :default_configuration
+    end
+    attr_reader :version, :max_weighted_tweet_length, :scale
+    attr_reader :default_weight, :transformed_url_length, :ranges
+    CONFIG_V1 = File.join(
+      File.expand_path('../../../../config', __FILE__), # project root
+      "#{PARSER_VERSION_CLASSIC}.json"
+    )
+    CONFIG_V2 = File.join(
+      File.expand_path('../../../../config', __FILE__), # project root
+      "#{PARSER_VERSION_DEFAULT}.json"
+    )
+    def self.parse_string(string, options = {})
+      JSON.parse(string, options.merge(symbolize_names: true))
+    end
+    def self.parse_file(filename)
+      string = File.open(filename, 'rb') { |f| f.read }
+      parse_string(string)
+    end
+    def self.configuration_from_file(filename)
+      config = parse_file(filename)
+      config ? Twitter::Configuration.new(config) : nil
+    end
+    def initialize(config = {})
+      @version = config[:version]
+      @max_weighted_tweet_length = config[:maxWeightedTweetLength]
+      @scale = config[:scale]
+      @default_weight = config[:defaultWeight]
+      @transformed_url_length = config[:transformedURLLength]
+      @ranges = config[:ranges].map { |range| Twitter::WeightedRange.new(range) } if config.key?(:ranges) && config[:ranges].is_a?(Array)
+    end
+    self.default_configuration = Twitter::Configuration.configuration_from_file(Twitter::Configuration::CONFIG_V2)
+  end
+end
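
Note on configuration.rb: the added class reads a JSON file with symbolized keys and maps the camelCase keys onto snake_case readers. The standalone sketch below mirrors that mapping without depending on the gem. The helper names `sketch_config_json` and `sketch_load_config` are hypothetical, the JSON values are invented for the example, and the shape of each range entry is an assumption, since `Twitter::WeightedRange` is not shown in this part of the diff.

```ruby
require 'json'

# Hypothetical configuration document; the numbers are illustrative only.
def sketch_config_json
  <<~JSON
    {
      "version": 2,
      "maxWeightedTweetLength": 280,
      "scale": 100,
      "defaultWeight": 200,
      "transformedURLLength": 23,
      "ranges": [
        { "start": 0, "end": 4351, "weight": 100 }
      ]
    }
  JSON
end

# Parses the JSON with symbol keys and copies the same keys that
# Configuration#initialize reads in the diff above.
def sketch_load_config(string)
  raw = JSON.parse(string, symbolize_names: true)
  {
    version: raw[:version],
    max_weighted_tweet_length: raw[:maxWeightedTweetLength],
    scale: raw[:scale],
    default_weight: raw[:defaultWeight],
    transformed_url_length: raw[:transformedURLLength],
    # As in the diff, ranges are only set when the key holds an Array.
    ranges: raw[:ranges].is_a?(Array) ? raw[:ranges] : nil
  }
end
```

In the gem itself, going by the diff, a non-default configuration would be built with `Twitter::Configuration.configuration_from_file(Twitter::Configuration::CONFIG_V1)` and handed to `parse_tweet` through the `:config` option.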

data/lib/twitter-text/deprecation.rb CHANGED

@@ -7,7 +7,7 @@ module Twitter
       alias_method(deprecated_method, method)
       define_method method do |*args, &block|
-        warn message
+        warn message unless $TESTING
         send(deprecated_method, *args, &block)
       end
     end

data/lib/twitter-text/extractor.rb CHANGED

@@ -1,4 +1,5 @@
-# encoding: UTF-8
+# encoding: utf-8
+require 'idn'
 class String
   # Helper function to count the character length by first converting to an
@@ -47,6 +48,15 @@ module Twitter
   # A module for including Tweet parsing in a class. This module provides function for the extraction and processing
   # of usernames, lists, URLs and hashtags.
   module Extractor extend self
+    # Maximum URL length as defined by Twitter's backend.
+    MAX_URL_LENGTH = 4096
+    # The maximum t.co path length that the Twitter backend supports.
+    MAX_TCO_SLUG_LENGTH = 40
+    URL_PROTOCOL_LENGTH = "https://".length
     # Remove overlapping entities.
     # This returns a new array with no overlapping entities.
     def remove_overlapping_entities(entities)
@@ -201,6 +211,7 @@ module Twitter
           next if !options[:extract_url_without_protocol] || before =~ Twitter::Regex[:invalid_url_without_protocol_preceding_chars]
           last_url = nil
           domain.scan(Twitter::Regex[:valid_ascii_domain]) do |ascii_domain|
+            next unless is_valid_domain(url.length, ascii_domain, protocol)
             last_url = {
               :url => ascii_domain,
               :indices => [start_position + $~.char_begin(0),
@@ -225,9 +236,13 @@ module Twitter
         else
           # In the case of t.co URLs, don't allow additional path characters
           if url =~ Twitter::Regex[:valid_tco_url]
+            next if $1 && $1.length > MAX_TCO_SLUG_LENGTH
             url = $&
             end_position = start_position + url.char_length
           end
+          next unless is_valid_domain(url.length, domain, protocol)
           urls << {
             :url => url,
             :indices => [start_position, end_position]
@@ -324,5 +339,20 @@ module Twitter
       tags.each{|tag| yield tag[:cashtag], tag[:indices].first, tag[:indices].last} if block_given?
       tags
     end
+    def is_valid_domain(url_length, domain, protocol)
+      begin
+        raise ArgumentError.new("invalid empty domain") unless domain
+        original_domain_length = domain.length
+        encoded_domain = IDN::Idna.toASCII(domain)
+        updated_domain_length = encoded_domain.length
+        url_length += (updated_domain_length - original_domain_length) if (updated_domain_length > original_domain_length)
+        url_length += URL_PROTOCOL_LENGTH unless protocol
+        url_length <= MAX_URL_LENGTH
+      rescue Exception
+        # On error don't consider this a valid domain.
+        return false
+      end
+    end
   end
 end
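
Note on `is_valid_domain`: the length check it adds can be followed without the `idn` dependency if the ASCII-encoded domain is passed in directly. The sketch below is a simplified stand-in under that assumption. The method name `sketch_url_length_ok?` and its `encoded_domain` parameter are hypothetical, and the error handling of the original (rescuing any exception and returning false) is reduced to a nil check.

```ruby
# Simplified stand-in for the length logic in is_valid_domain.
# encoded_domain is assumed to be the already ASCII-encoded form of domain;
# the gem obtains it from IDN::Idna.toASCII instead.
def sketch_url_length_ok?(url_length, domain, encoded_domain, protocol, max_url_length: 4096)
  return false unless domain && encoded_domain

  # Only growth from encoding counts; a shorter encoded form is ignored.
  growth = encoded_domain.length - domain.length
  url_length += growth if growth > 0

  # URLs extracted without a protocol are charged for an implied "https://".
  url_length += "https://".length unless protocol

  url_length <= max_url_length
end
```

The point of the design is that a URL is measured as it would be sent: with its domain in encoded form and with a protocol prefix, even when the text being parsed contains neither.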

data/lib/twitter-text/regex.rb CHANGED

@@ -1,4 +1,4 @@
-# encoding: UTF-8
+# encoding: utf-8
 module Twitter
   # A collection of regular expressions for parsing Tweet text. The regular expression
@@ -62,10 +62,10 @@ module Twitter
     major, minor, _patch = RUBY_VERSION.split('.')
     if major.to_i >= 2 || major.to_i == 1 && minor.to_i >= 9 || (defined?(RUBY_ENGINE) && ["jruby", "rbx"].include?(RUBY_ENGINE))
-      REGEXEN[:list_name] = /[a-zA-Z][a-zA-Z0-9_\-\u0080-\u00ff]{0,24}/
+      REGEXEN[:list_name] = /[a-z][a-z0-9_\-\u0080-\u00ff]{0,24}/i
     else
       # This line barfs at compile time in Ruby 1.9, JRuby, or Rubinius.
-      REGEXEN[:list_name] = eval("/[a-zA-Z][a-zA-Z0-9_\\-\x80-\xff]{0,24}/")
+      REGEXEN[:list_name] = eval("/[a-z][a-z0-9_\\-\x80-\xff]{0,24}/i")
     end
     # Latin accented characters
@@ -148,17 +148,17 @@ module Twitter
     # Used in Extractor for final filtering
     REGEXEN[:end_hashtag_match] = /\A(?:[#＃]|:\/\/)/o
-    REGEXEN[:valid_mention_preceding_chars] = /(?:[^a-zA-Z0-9_!#\$%&*@＠]|^|(?:^|[^a-zA-Z0-9_+~.-])[rR][tT]:?)/o
+    REGEXEN[:valid_mention_preceding_chars] = /(?:[^a-z0-9_!#\$%&*@＠]|^|(?:^|[^a-z0-9_+~.-])[rR][tT]:?)/io
     REGEXEN[:at_signs] = /[@＠]/
     REGEXEN[:valid_mention_or_list] = /
       (#{REGEXEN[:valid_mention_preceding_chars]})  # $1: Preceeding character
       (#{REGEXEN[:at_signs]})                       # $2: At mark
-      ([a-zA-Z0-9_]{1,20})                          # $3: Screen name
-      (\/[a-zA-Z][a-zA-Z0-9_\-]{0,24})?             # $4: List (optional)
-    /ox
-    REGEXEN[:valid_reply] = /^(?:#{REGEXEN[:spaces]})*#{REGEXEN[:at_signs]}([a-zA-Z0-9_]{1,20})/o
+      ([a-z0-9_]{1,20})                             # $3: Screen name
+      (\/[a-z][a-zA-Z0-9_\-]{0,24})?                # $4: List (optional)
+    /iox
+    REGEXEN[:valid_reply] = /^(?:#{REGEXEN[:spaces]})*#{REGEXEN[:at_signs]}([a-z0-9_]{1,20})/io
     # Used in Extractor for final filtering
-    REGEXEN[:end_mention_match] = /\A(?:#{REGEXEN[:at_signs]}|#{REGEXEN[:latin_accents]}|:\/\/)/o
+    REGEXEN[:end_mention_match] = /\A(?:#{REGEXEN[:at_signs]}|#{REGEXEN[:latin_accents]}|:\/\/)/io
     # URL related hash regex collection
     REGEXEN[:valid_url_preceding_chars] = /(?:[^A-Z0-9@＠$#＃#{INVALID_CHARACTERS.join('')}]|^)/io
@@ -196,12 +196,12 @@ module Twitter
     # This is used in Extractor
     REGEXEN[:valid_ascii_domain] = /
-      (?:(?:[A-Za-z0-9\-_]|#{REGEXEN[:latin_accents]})+\.)+
+      (?:(?:[a-z0-9\-_]|#{REGEXEN[:latin_accents]})+\.)+
       (?:#{REGEXEN[:valid_gTLD]}|#{REGEXEN[:valid_ccTLD]}|#{REGEXEN[:valid_punycode]})
     /iox
     # This is used in Extractor for stricter t.co URL extraction
-    REGEXEN[:valid_tco_url] = /^https?:\/\/t\.co\/[a-z0-9]+/i
+    REGEXEN[:valid_tco_url] = /^https?:\/\/t\.co\/([a-z0-9]+)/i
     # This is used in Extractor to filter out unwanted URLs.
     REGEXEN[:invalid_short_domain] = /\A#{REGEXEN[:valid_domain_name]}#{REGEXEN[:valid_ccTLD]}\Z/io
@@ -209,7 +209,7 @@ module Twitter
     REGEXEN[:valid_port_number] = /[0-9]+/
-    REGEXEN[:valid_general_url_path_chars] = /[a-z\p{Cyrillic}0-9!\*';:=\+\,\.\$\/%#\[\]\-_~&\|@#{LATIN_ACCENTS}]/io
+    REGEXEN[:valid_general_url_path_chars] = /[a-z\p{Cyrillic}0-9!\*';:=\+\,\.\$\/%#\[\]\p{Pd}_~&\|@#{LATIN_ACCENTS}]/io
     # Allow URL paths to contain up to two nested levels of balanced parens
     #  1. Used in Wikipedia URLs like /Primer_(film)
     #  2. Used in IIS sessions like /S(dfd346)/
@@ -260,7 +260,7 @@ module Twitter
     REGEXEN[:valid_cashtag] = /(^|#{REGEXEN[:spaces]})(\$)(#{REGEXEN[:cashtag]})(?=$|\s|[#{PUNCTUATION_CHARS}])/i
     # These URL validation pattern strings are based on the ABNF from RFC 3986
-    REGEXEN[:validate_url_unreserved] = /[a-z\p{Cyrillic}0-9\-._~]/i
+    REGEXEN[:validate_url_unreserved] = /[a-z\p{Cyrillic}0-9\p{Pd}._~]/i
     REGEXEN[:validate_url_pct_encoded] = /(?:%[0-9a-f]{2})/i
     REGEXEN[:validate_url_sub_delims] = /[!$&'()*+,;=]/i
     REGEXEN[:validate_url_pchar] = /(?:

data/lib/twitter-text/validation.rb CHANGED

@@ -2,65 +2,114 @@ require 'unf'
 module Twitter
   module Validation extend self
-    MAX_LENGTH = 140
     DEFAULT_TCO_URL_LENGTHS = {
       :short_url_length => 23,
-      :short_url_length_https => 23,
-      :characters_reserved_per_media => 23
-    }.freeze
+    }
-    # Returns the length of the string as it would be displayed. This is equivilent to the length of the Unicode NFC
-    # (See: http://www.unicode.org/reports/tr15). This is needed in order to consistently calculate the length of a
-    # string no matter which actual form was transmitted. For example:
-    #
-    #     U+0065  Latin Small Letter E
-    # +   U+0301  Combining Acute Accent
-    # ----------
-    # =   2 bytes, 2 characters, displayed as é (1 visual glyph)
-    #     … The NFC of {U+0065, U+0301} is {U+00E9}, which is a single chracter and a +display_length+ of 1
-    #
-    # The string could also contain U+00E9 already, in which case the canonicalization will not change the value.
-    #
-    def tweet_length(text, options = {})
-      options = DEFAULT_TCO_URL_LENGTHS.merge(options)
+    # :weighted_length the weighted length of tweet based on weights specified in the config
+    # :valid If tweet is valid
+    # :permillage permillage of the tweet over the max length specified in config
+    # :valid_range_start beginning of valid text
+    # :valid_range_end End index of valid part of the tweet text (inclusive)
+    # :display_range_start beginning index of display text
+    # :display_range_end end index of display text (inclusive)
+    class ParseResults < Hash
-      length = text.to_nfc.unpack("U*").length
+      RESULT_PARAMS = [:weighted_length, :valid, :permillage, :valid_range_start, :valid_range_end, :display_range_start, :display_range_end]
-      Twitter::Extractor.extract_urls_with_indices(text) do |url, start_position, end_position|
-        length += start_position - end_position
-        length += url.downcase =~ /^https:\/\// ? options[:short_url_length_https] : options[:short_url_length]
+      def self.empty
+        return ParseResults.new(weighted_length: 0, permillage: 0, valid: true, display_range_start: 0, display_range_end: 0, valid_range_start: 0, valid_range_end: 0)
       end
-      length
+      def initialize(params = {})
+        RESULT_PARAMS.each do |key|
+          super[key] = params[key] if params.key?(key)
+        end
+      end
     end
-    # Check the <tt>text</tt> for any reason that it may not be valid as a Tweet. This is meant as a pre-validation
-    # before posting to api.twitter.com. There are several server-side reasons for Tweets to fail but this pre-validation
-    # will allow quicker feedback.
-    #
-    # Returns <tt>false</tt> if this <tt>text</tt> is valid. Otherwise one of the following Symbols will be returned:
-    #
-    #   <tt>:too_long</tt>:: if the <tt>text</tt> is too long
-    #   <tt>:empty</tt>:: if the <tt>text</tt> is nil or empty
-    #   <tt>:invalid_characters</tt>:: if the <tt>text</tt> contains non-Unicode or any of the disallowed Unicode characters
-    def tweet_invalid?(text)
-      return :empty if !text || text.empty?
+    # Parse input text and return hash with descriptive parameters populated.
+    def parse_tweet(text, options = {})
+      options = DEFAULT_TCO_URL_LENGTHS.merge(options)
+      config = options[:config] || Twitter::Configuration.default_configuration
+      normalized_text = text.to_nfc
+      normalized_text_length = normalized_text.char_length
+      unless (normalized_text_length > 0)
+        ParseResults.empty()
+      end
+      scale = config.scale
+      max_weighted_tweet_length = config.max_weighted_tweet_length
+      scaled_max_weighted_tweet_length = max_weighted_tweet_length * scale
+      transformed_url_length = config.transformed_url_length * scale
+      ranges = config.ranges
+      url_entities = Twitter::Extractor.extract_urls_with_indices(normalized_text)
+      has_invalid_chars = false
+      weighted_count = 0
+      offset = 0
+      display_offset = 0
+      valid_offset = 0
+      while offset < normalized_text_length
+        # Reset the default char weight each pass through the loop
+        char_weight = config.default_weight
+        url_entities.each do |url_entity|
+          if url_entity[:indices].first == offset
+            url_length = url_entity[:indices].last - url_entity[:indices].first
+            weighted_count += transformed_url_length
+            offset += url_length
+            display_offset += url_length
+            if weighted_count <= scaled_max_weighted_tweet_length
+              valid_offset += url_length
+            end
+            # Finding a match breaks the loop; order of ranges matters.
+            break
+          end
+        end
+        if offset < normalized_text_length
+          code_point = normalized_text[offset]
+          ranges.each do |range|
+            if range.contains?(code_point.unpack("U").first)
+              char_weight = range.weight
+              break
+            end
+          end
+          weighted_count += char_weight
+          has_invalid_chars = contains_invalid?(normalized_text[offset]) unless has_invalid_chars
+          char_count = code_point.char_length
+          offset += char_count
+          display_offset += char_count
+          if !has_invalid_chars && (weighted_count <= scaled_max_weighted_tweet_length)
+            valid_offset += char_count
+          end
+        end
+      end
+      normalized_text_offset = text.char_length - normalized_text.char_length
+      scaled_weighted_length = weighted_count / scale
+      is_valid = !has_invalid_chars && (scaled_weighted_length <= max_weighted_tweet_length)
+      permillage = scaled_weighted_length * 1000 / max_weighted_tweet_length
+      return ParseResults.new(weighted_length: scaled_weighted_length, permillage: permillage, valid: is_valid, display_range_start: 0, display_range_end: (display_offset + normalized_text_offset - 1), valid_range_start: 0, valid_range_end: (valid_offset + normalized_text_offset - 1))
+    end
+    def contains_invalid?(text)
+      return false if !text || text.empty?
       begin
-        return :too_long if tweet_length(text) > MAX_LENGTH
-        return :invalid_characters if Twitter::Regex::INVALID_CHARACTERS.any?{|invalid_char| text.include?(invalid_char) }
+        return true if Twitter::Regex::INVALID_CHARACTERS.any?{|invalid_char| text.include?(invalid_char) }
       rescue ArgumentError
         # non-Unicode value.
-        return :invalid_characters
+        return true
       end
       return false
     end
-    def valid_tweet_text?(text)
-      !tweet_invalid?(text)
-    end
     def valid_username?(username)
       return false if !username || username.empty?
@@ -102,6 +151,69 @@ module Twitter
              (!unicode_domains && valid_match?(authority, Twitter::Regex[:validate_url_authority]))
     end
+    # These methods are deprecated, will be removed in future.
+    extend Deprecation
+    MAX_LENGTH_LEGACY = 140
+    # DEPRECATED: Please use parse_text instead.
+    #
+    # Returns the length of the string as it would be displayed. This is equivilent to the length of the Unicode NFC
+    # (See: http://www.unicode.org/reports/tr15). This is needed in order to consistently calculate the length of a
+    # string no matter which actual form was transmitted. For example:
+    #
+    #     U+0065  Latin Small Letter E
+    # +   U+0301  Combining Acute Accent
+    # ----------
+    # =   2 bytes, 2 characters, displayed as é (1 visual glyph)
+    #     … The NFC of {U+0065, U+0301} is {U+00E9}, which is a single chracter and a +display_length+ of 1
+    #
+    # The string could also contain U+00E9 already, in which case the canonicalization will not change the value.
+    #
+    def tweet_length(text, options = {})
+      options = DEFAULT_TCO_URL_LENGTHS.merge(options)
+      length = text.to_nfc.unpack("U*").length
+      Twitter::Extractor.extract_urls_with_indices(text) do |url, start_position, end_position|
+        length += start_position - end_position
+        length += options[:short_url_length] if url.length > 0
+      end
+      length
+    end
+    deprecate :tweet_length, :parse_tweet
+    # DEPRECATED: Please use parse_text instead.
+    #
+    # Check the <tt>text</tt> for any reason that it may not be valid as a Tweet. This is meant as a pre-validation
+    # before posting to api.twitter.com. There are several server-side reasons for Tweets to fail but this pre-validation
+    # will allow quicker feedback.
+    #
+    # Returns <tt>false</tt> if this <tt>text</tt> is valid. Otherwise one of the following Symbols will be returned:
+    #
+    #   <tt>:too_long</tt>:: if the <tt>text</tt> is too long
+    #   <tt>:empty</tt>:: if the <tt>text</tt> is nil or empty
+    #   <tt>:invalid_characters</tt>:: if the <tt>text</tt> contains non-Unicode or any of the disallowed Unicode characters
+    def tweet_invalid?(text)
+      return :empty if !text || text.empty?
+      begin
+        return :too_long if tweet_length(text) > MAX_LENGTH_LEGACY
+        return :invalid_characters if Twitter::Regex::INVALID_CHARACTERS.any?{|invalid_char| text.include?(invalid_char) }
+      rescue ArgumentError
+        # non-Unicode value.
+        return :invalid_characters
+      end
+      return false
+    end
+    deprecate :tweet_invalid?, :parse_tweet
+    def valid_tweet_text?(text)
+      !tweet_invalid?(text)
+    end
+    deprecate :valid_tweet_text?, :parse_tweet
     private
     def valid_match?(string, regex, optional=false)