twitter-text 1.14.7 → 2.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
- SHA1:
- metadata.gz: f1dd5437a51b3767c45499c3d5d4b438bd1b7ba1
- data.tar.gz: 0c7bcee79f7fc1e955cad42ddba8c9318c885ae5
+ SHA256:
+ metadata.gz: 92e1f709304c7902186bbe50ff5f7d215059d292a4e8730b9cdff12210dff1aa
+ data.tar.gz: fd50deede86bb5ba1a47ff214350f86a928ed59926438d3361475f3640ff8531
  SHA512:
- metadata.gz: 6a9ad3b3b822e358070f6722e4d88a362c05016584739d290beafe2e8763aaf202e9198a16855b140acefd2b87f9b1d29990254422750353a786437198f5f8c8
- data.tar.gz: 4d1ea6e3fd1a158bfcaaad04454723145bdb66c991a51e499015aaaa135fb9792120391032a7ee269cf1d37a71d4cddd9c03511a78026c66830c007c286372f7
+ metadata.gz: 85f39c5bd4d9c58b863d5e9490618ee941a528ab8fd23a463857a206d53ba50a4235cb7e287245a0e3bb66bb78955b98cf6973f1ed5e2ec5741090ef34a77c52
+ data.tar.gz: 6a0133f3acd0a34742435777f4fc276df4639066adc80b39d3a4b84f77ec73eb0772fae0ff52f20e3af752b089ca976de3b215d83241944c2fd0ef8f9823ba85
data/.rspec CHANGED
@@ -1,2 +1,2 @@
  --color
- --format=nested
+ --format=documentation
data/README.md CHANGED
@@ -1,16 +1,82 @@
  # twitter-text

- ![hello](https://img.shields.io/gem/v/twitter-text.svg)
+ ![](https://img.shields.io/gem/v/twitter-text.svg)

- A gem that provides text processing routines for Twitter Tweets. The major
- reason for this is to unify the various auto-linking and extraction of
- usernames, lists, hashtags and URLs.
+ This is the Ruby implementation of the twitter-text parsing
+ library. The library has methods to parse Tweets and calculate length,
+ validity, parse @mentions, #hashtags, URLs, and more.

- ## Extraction Examples
+ ## Setup

+ Installation uses bundler.

- # Extraction
  ```
+ % gem install bundler
+ % bundle install
+ ```
+
+ ## Conformance tests
+
+ To run the Conformance test suite from the command line via rake:
+
+ ```
+ % rake test:conformance:run
+ ```
+
+ You can also run the rspec tests in the `spec` directory:
+
+ ```
+ % rspec spec
+ ```
+
+ # Length validation
+
+ twitter-text 2.0 introduces configuration files that define how Tweets
+ are parsed for length. This allows for backwards compatibility and
+ flexibility going forward. Old-style traditional 140-character parsing
+ is defined by the v1.json configuration file, whereas v2.json is
+ updated for "weighted" Tweets where ranges of Unicode code points can
+ have independent weights aside from the default weight. The sum of all
+ code points, each weighted appropriately, should not exceed the max
+ weighted length.
+
+ Some old methods from twitter-text 1.0 have been marked deprecated,
+ such as the `tweet_length()` method. The new API is based on the
+ following method, `parse_tweet()`
+
+ ```ruby
+ def parse_tweet(text, options = {}) { ... }
+ ```
+
+ This method takes a string as input and returns a results object that
+ contains information about the
+ string. `Twitter::Validation::ParseResults` object includes:
+
+ * `:weighted_length`: the overall length of the tweet with code points
+ weighted per the ranges defined in the configuration file.
+
+ * `:permillage`: indicates the proportion (per thousand) of the weighted
+ length in comparison to the max weighted length. A value > 1000
+ indicates input text that is longer than the allowable maximum.
+
+ * `:valid`: indicates if input text length corresponds to a valid
+ result.
+
+ * `:display_range_start, :display_range_end`: An array of two unicode code point
+ indices identifying the inclusive start and exclusive end of the
+ displayable content of the Tweet. For more information, see
+ the description of `display_text_range` here:
+ [Tweet updates](https://developer.twitter.com/en/docs/tweets/tweet-updates)
+
+ * `:valid_range_start, :valid_range_end`: An array of two unicode code point
+ indices identifying the inclusive start and exclusive end of the valid
+ content of the Tweet. For more information on the extended Tweet
+ payload see [Tweet updates](https://developer.twitter.com/en/docs/tweets/tweet-updates)
+
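To make the new README's API description concrete for readers of this diff, here is a minimal usage sketch. The result values in the comments are illustrative only and assume the bundled v2 configuration (max weighted length 280, Latin code points weighted at one character each); treat them as approximate rather than authoritative.

```ruby
require 'twitter-text'

# ParseResults behaves like a Hash keyed by the symbols listed above.
results = Twitter::Validation.parse_tweet("Mentioning @twitter and @jack")
results[:valid]            # => true
results[:weighted_length]  # => 29 under the assumed v2 weights
results[:permillage]       # => 103, i.e. weighted_length * 1000 / 280
```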
+ ## Extraction Examples
+
+ # Extraction
+ ```ruby
  class MyClass
  include Twitter::Extractor
  usernames = extract_mentioned_screen_names("Mentioning @twitter and @jack")
@@ -18,9 +84,9 @@ class MyClass
  end
  ```

- # Extraction with a block argument
- ```ruby
+ ### Extraction with a block argument

+ ```ruby
  class MyClass
  include Twitter::Extractor
  extract_reply_screen_name("@twitter are you hiring?").do |username|
@@ -31,8 +97,9 @@ end

  ## Auto-linking Examples

- # Auto-link
- ```
+ ### Auto-link
+
+ ```ruby
  class MyClass
  include Twitter::Autolink

@@ -40,14 +107,14 @@ class MyClass
  end
  ```

- # For Ruby on Rails you want to add this to app/helpers/application_helper.rb
- ```
+ ### For Ruby on Rails you want to add this to app/helpers/application_helper.rb
+ ```ruby
  module ApplicationHelper
  include Twitter::Autolink
  end
  ```

- # Now the auto_link function is available in every view. So in index.html.erb:
+ ### Now the auto_link function is available in every view. So in index.html.erb:
  ```ruby
  <%= auto_link("link @user, please #request") %>
  ```
@@ -90,33 +157,37 @@ words should work equally well.
  Use to provide emphasis around the "hits" returned from the Search API, built
  to work against text that has been auto-linked already.

- ### Thanks
+ ## Issues

- Thanks to everybody who has filed issues, provided feedback or contributed
- patches. Patches courtesy of:
+ Have a bug? Please create an issue here on GitHub!

- * At Twitter …
- * Matt Sanford - http://github.com/mzsanford
- * Raffi Krikorian - http://github.com/r
- * Ben Cherry - http://github.com/bcherry
- * Patrick Ewing - http://github.com/hoverbird
- * Jeff Smick - http://github.com/sprsquish
- * Kenneth Kufluk - https://github.com/kennethkufluk
- * Keita Fujii - https://github.com/keitaf
- * Yoshimasa Niwa - https://github.com/niw
+ <https://github.com/twitter/twitter-text/issues>

+ ## Authors

- * Patches from the community …
- * Jean-Philippe Bougie - http://github.com/jpbougie
- * Erik Michaels-Ober - https://github.com/sferik
+ ### V2.0

+ * David LaMacchia (<https://github.com/dlamacchia>)
+ * Yoshimasa Niwa (<https://github.com/niw>)
+ * Sudheer Guntupalli (<https://github.com/sudhee>)
+ * Kaushik Lakshmikanth (<https://github.com/kaushlakers>)
+ * Jose Antonio Marquez Russo (<https://github.com/joseeight>)
+ * Lee Adams (<https://github.com/leeaustinadams>)

- * Anyone who has filed an issue. It helps. Really.
+ ### Previous authors

+ * Matt Sanford (<http://github.com/mzsanford>)
+ * Raffi Krikorian (<http://github.com/r>)
+ * Ben Cherry (<http://github.com/bcherry>)
+ * Patrick Ewing (<http://github.com/hoverbird>)
+ * Jeff Smick (<http://github.com/sprsquish>)
+ * Kenneth Kufluk (<https://github.com/kennethkufluk>)
+ * Keita Fujii (<https://github.com/keitaf>)
+ * Jean-Philippe Bougie (<http://github.com/jpbougie>)
+ * Erik Michaels-Ober (<https://github.com/sferik>)

- ### Copyright and License
+ ## License

- **Copyright 2011 Twitter, Inc.**
+ Copyright 2012-2017 Twitter, Inc and other contributors

- Licensed under the Apache License, Version 2.0:
- http://www.apache.org/licenses/LICENSE-2.0
+ Licensed under the [Apache License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0)
@@ -0,0 +1 @@
+ lib/assets/../../../conformance/tld_lib.yml
@@ -15,6 +15,8 @@ end
  autolink
  extractor
  unicode
+ weighted_range
+ configuration
  validation
  hit_highlighter
  ).each do |name|
@@ -1,4 +1,4 @@
- # encoding: UTF-8
+ # encoding: utf-8

  require 'set'
  require 'twitter-text/hash_helper'
@@ -21,9 +21,9 @@ module Twitter
  # Default URL base for auto-linked lists
  DEFAULT_LIST_URL_BASE = "https://twitter.com/".freeze
  # Default URL base for auto-linked hashtags
- DEFAULT_HASHTAG_URL_BASE = "https://twitter.com/#!/search?q=%23".freeze
+ DEFAULT_HASHTAG_URL_BASE = "https://twitter.com/search?q=%23".freeze
  # Default URL base for auto-linked cashtags
- DEFAULT_CASHTAG_URL_BASE = "https://twitter.com/#!/search?q=%24".freeze
+ DEFAULT_CASHTAG_URL_BASE = "https://twitter.com/search?q=%24".freeze

  # Default attributes for invisible span tag
  DEFAULT_INVISIBLE_TAG_ATTRS = "style='position:absolute;left:-9999px;'".freeze
@@ -286,7 +286,7 @@ module Twitter
  # wrap the ellipses in a tco-ellipsis class and provide an onCopy handler that sets display:none on
  # everything with the tco-ellipsis class.
  #
- # Exception: pic.twitter.com images, for which expandedUrl = "https://twitter.com/#!/username/status/1234/photo/1
+ # Exception: pic.twitter.com images, for which expandedUrl = "https://twitter.com/username/status/1234/photo/1
  # For those URLs, display_url is not a substring of expanded_url, so we don't do anything special to render the elided parts.
  # For a pic.twitter.com URL, the only elided part will be the "https://", so this is fine.
  display_url_sans_ellipses = display_url.gsub("…", "")
@@ -0,0 +1,53 @@
+ # encoding: UTF-8
+
+ module Twitter
+ class Configuration
+ require 'json'
+
+ PARSER_VERSION_CLASSIC = "v1"
+ PARSER_VERSION_DEFAULT = "v2"
+
+ class << self
+ attr_accessor :default_configuration
+ end
+
+ attr_reader :version, :max_weighted_tweet_length, :scale
+ attr_reader :default_weight, :transformed_url_length, :ranges
+
+ CONFIG_V1 = File.join(
+ File.expand_path('../../../../config', __FILE__), # project root
+ "#{PARSER_VERSION_CLASSIC}.json"
+ )
+
+ CONFIG_V2 = File.join(
+ File.expand_path('../../../../config', __FILE__), # project root
+ "#{PARSER_VERSION_DEFAULT}.json"
+ )
+
+ def self.parse_string(string, options = {})
+ JSON.parse(string, options.merge(symbolize_names: true))
+ end
+
+ def self.parse_file(filename)
+ string = File.open(filename, 'rb') { |f| f.read }
+ parse_string(string)
+ end
+
+ def self.configuration_from_file(filename)
+ config = parse_file(filename)
+ config ? Twitter::Configuration.new(config) : nil
+ end
+
+ def initialize(config = {})
+ @version = config[:version]
+ @max_weighted_tweet_length = config[:maxWeightedTweetLength]
+ @scale = config[:scale]
+ @default_weight = config[:defaultWeight]
+ @transformed_url_length = config[:transformedURLLength]
+ @ranges = config[:ranges].map { |range| Twitter::WeightedRange.new(range) } if config.key?(:ranges) && config[:ranges].is_a?(Array)
+ end
+
+ self.default_configuration = Twitter::Configuration.configuration_from_file(Twitter::Configuration::CONFIG_V2)
+ end
+ end
+
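As a brief orientation, the sketch below shows how this new class can be plugged into the parser via the `:config` option read by `parse_tweet` later in this diff. The classic 140-character behaviour described in the comment is an assumption based on v1.json weighting every character equally.

```ruby
# Parse with the classic (v1) rules instead of the v2 default configuration.
v1_config = Twitter::Configuration.configuration_from_file(
  Twitter::Configuration::CONFIG_V1
)

results = Twitter::Validation.parse_tweet("Just setting up my twttr", config: v1_config)
results[:weighted_length]  # => 24, a plain code-point count if v1.json weights each character at 1
```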
@@ -7,7 +7,7 @@ module Twitter

  alias_method(deprecated_method, method)
  define_method method do |*args, &block|
- warn message
+ warn message unless $TESTING
  send(deprecated_method, *args, &block)
  end
  end
@@ -1,4 +1,5 @@
- # encoding: UTF-8
+ # encoding: utf-8
+ require 'idn'

  class String
  # Helper function to count the character length by first converting to an
@@ -47,6 +48,15 @@ module Twitter
  # A module for including Tweet parsing in a class. This module provides function for the extraction and processing
  # of usernames, lists, URLs and hashtags.
  module Extractor extend self
+
+ # Maximum URL length as defined by Twitter's backend.
+ MAX_URL_LENGTH = 4096
+
+ # The maximum t.co path length that the Twitter backend supports.
+ MAX_TCO_SLUG_LENGTH = 40
+
+ URL_PROTOCOL_LENGTH = "https://".length
+
  # Remove overlapping entities.
  # This returns a new array with no overlapping entities.
  def remove_overlapping_entities(entities)
@@ -201,6 +211,7 @@ module Twitter
  next if !options[:extract_url_without_protocol] || before =~ Twitter::Regex[:invalid_url_without_protocol_preceding_chars]
  last_url = nil
  domain.scan(Twitter::Regex[:valid_ascii_domain]) do |ascii_domain|
+ next unless is_valid_domain(url.length, ascii_domain, protocol)
  last_url = {
  :url => ascii_domain,
  :indices => [start_position + $~.char_begin(0),
@@ -225,9 +236,13 @@ module Twitter
  else
  # In the case of t.co URLs, don't allow additional path characters
  if url =~ Twitter::Regex[:valid_tco_url]
+ next if $1 && $1.length > MAX_TCO_SLUG_LENGTH
  url = $&
  end_position = start_position + url.char_length
  end
+
+ next unless is_valid_domain(url.length, domain, protocol)
+
  urls << {
  :url => url,
  :indices => [start_position, end_position]
@@ -324,5 +339,20 @@ module Twitter
  tags.each{|tag| yield tag[:cashtag], tag[:indices].first, tag[:indices].last} if block_given?
  tags
  end
+
+ def is_valid_domain(url_length, domain, protocol)
+ begin
+ raise ArgumentError.new("invalid empty domain") unless domain
+ original_domain_length = domain.length
+ encoded_domain = IDN::Idna.toASCII(domain)
+ updated_domain_length = encoded_domain.length
+ url_length += (updated_domain_length - original_domain_length) if (updated_domain_length > original_domain_length)
+ url_length += URL_PROTOCOL_LENGTH unless protocol
+ url_length <= MAX_URL_LENGTH
+ rescue Exception
+ # On error don't consider this a valid domain.
+ return false
+ end
+ end
  end
  end
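For orientation, here is a small sketch of the arithmetic that the new `is_valid_domain` check performs. The domain is a made-up example, and the punycode expansion shown in the comment is what the `idn` gem is expected to produce; both are assumptions for illustration.

```ruby
require 'idn'

domain  = "müller.example"           # hypothetical internationalized domain, 14 characters as typed
encoded = IDN::Idna.toASCII(domain)  # expected "xn--mller-kva.example", 21 characters

url_length  = domain.length
url_length += encoded.length - domain.length  # charge the extra punycode characters
url_length += "https://".length               # URL_PROTOCOL_LENGTH, added when no protocol was typed
url_length <= 4096                            # MAX_URL_LENGTH check => true, so the candidate URL is kept
```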
@@ -1,4 +1,4 @@
- # encoding: UTF-8
+ # encoding: utf-8

  module Twitter
  # A collection of regular expressions for parsing Tweet text. The regular expression
@@ -62,10 +62,10 @@ module Twitter

  major, minor, _patch = RUBY_VERSION.split('.')
  if major.to_i >= 2 || major.to_i == 1 && minor.to_i >= 9 || (defined?(RUBY_ENGINE) && ["jruby", "rbx"].include?(RUBY_ENGINE))
- REGEXEN[:list_name] = /[a-zA-Z][a-zA-Z0-9_\-\u0080-\u00ff]{0,24}/
+ REGEXEN[:list_name] = /[a-z][a-z0-9_\-\u0080-\u00ff]{0,24}/i
  else
  # This line barfs at compile time in Ruby 1.9, JRuby, or Rubinius.
- REGEXEN[:list_name] = eval("/[a-zA-Z][a-zA-Z0-9_\\-\x80-\xff]{0,24}/")
+ REGEXEN[:list_name] = eval("/[a-z][a-z0-9_\\-\x80-\xff]{0,24}/i")
  end

  # Latin accented characters
@@ -148,17 +148,17 @@ module Twitter
  # Used in Extractor for final filtering
  REGEXEN[:end_hashtag_match] = /\A(?:[#＃]|:\/\/)/o

- REGEXEN[:valid_mention_preceding_chars] = /(?:[^a-zA-Z0-9_!#\$%&*@＠]|^|(?:^|[^a-zA-Z0-9_+~.-])[rR][tT]:?)/o
+ REGEXEN[:valid_mention_preceding_chars] = /(?:[^a-z0-9_!#\$%&*@＠]|^|(?:^|[^a-z0-9_+~.-])[rR][tT]:?)/io
  REGEXEN[:at_signs] = /[@＠]/
  REGEXEN[:valid_mention_or_list] = /
  (#{REGEXEN[:valid_mention_preceding_chars]}) # $1: Preceeding character
  (#{REGEXEN[:at_signs]}) # $2: At mark
- ([a-zA-Z0-9_]{1,20}) # $3: Screen name
- (\/[a-zA-Z][a-zA-Z0-9_\-]{0,24})? # $4: List (optional)
- /ox
- REGEXEN[:valid_reply] = /^(?:#{REGEXEN[:spaces]})*#{REGEXEN[:at_signs]}([a-zA-Z0-9_]{1,20})/o
+ ([a-z0-9_]{1,20}) # $3: Screen name
+ (\/[a-z][a-zA-Z0-9_\-]{0,24})? # $4: List (optional)
+ /iox
+ REGEXEN[:valid_reply] = /^(?:#{REGEXEN[:spaces]})*#{REGEXEN[:at_signs]}([a-z0-9_]{1,20})/io
  # Used in Extractor for final filtering
- REGEXEN[:end_mention_match] = /\A(?:#{REGEXEN[:at_signs]}|#{REGEXEN[:latin_accents]}|:\/\/)/o
+ REGEXEN[:end_mention_match] = /\A(?:#{REGEXEN[:at_signs]}|#{REGEXEN[:latin_accents]}|:\/\/)/io

  # URL related hash regex collection
  REGEXEN[:valid_url_preceding_chars] = /(?:[^A-Z0-9@＠$#＃#{INVALID_CHARACTERS.join('')}]|^)/io
@@ -196,12 +196,12 @@ module Twitter

  # This is used in Extractor
  REGEXEN[:valid_ascii_domain] = /
- (?:(?:[A-Za-z0-9\-_]|#{REGEXEN[:latin_accents]})+\.)+
+ (?:(?:[a-z0-9\-_]|#{REGEXEN[:latin_accents]})+\.)+
  (?:#{REGEXEN[:valid_gTLD]}|#{REGEXEN[:valid_ccTLD]}|#{REGEXEN[:valid_punycode]})
  /iox

  # This is used in Extractor for stricter t.co URL extraction
- REGEXEN[:valid_tco_url] = /^https?:\/\/t\.co\/[a-z0-9]+/i
+ REGEXEN[:valid_tco_url] = /^https?:\/\/t\.co\/([a-z0-9]+)/i

  # This is used in Extractor to filter out unwanted URLs.
  REGEXEN[:invalid_short_domain] = /\A#{REGEXEN[:valid_domain_name]}#{REGEXEN[:valid_ccTLD]}\Z/io
@@ -209,7 +209,7 @@ module Twitter

  REGEXEN[:valid_port_number] = /[0-9]+/

- REGEXEN[:valid_general_url_path_chars] = /[a-z\p{Cyrillic}0-9!\*';:=\+\,\.\$\/%#\[\]\-_~&\|@#{LATIN_ACCENTS}]/io
+ REGEXEN[:valid_general_url_path_chars] = /[a-z\p{Cyrillic}0-9!\*';:=\+\,\.\$\/%#\[\]\p{Pd}_~&\|@#{LATIN_ACCENTS}]/io
  # Allow URL paths to contain up to two nested levels of balanced parens
  # 1. Used in Wikipedia URLs like /Primer_(film)
  # 2. Used in IIS sessions like /S(dfd346)/
@@ -260,7 +260,7 @@ module Twitter
  REGEXEN[:valid_cashtag] = /(^|#{REGEXEN[:spaces]})(\$)(#{REGEXEN[:cashtag]})(?=$|\s|[#{PUNCTUATION_CHARS}])/i

  # These URL validation pattern strings are based on the ABNF from RFC 3986
- REGEXEN[:validate_url_unreserved] = /[a-z\p{Cyrillic}0-9\-._~]/i
+ REGEXEN[:validate_url_unreserved] = /[a-z\p{Cyrillic}0-9\p{Pd}._~]/i
  REGEXEN[:validate_url_pct_encoded] = /(?:%[0-9a-f]{2})/i
  REGEXEN[:validate_url_sub_delims] = /[!$&'()*+,;=]/i
  REGEXEN[:validate_url_pchar] = /(?:
@@ -2,65 +2,114 @@ require 'unf'

  module Twitter
  module Validation extend self
- MAX_LENGTH = 140
-
  DEFAULT_TCO_URL_LENGTHS = {
  :short_url_length => 23,
- :short_url_length_https => 23,
- :characters_reserved_per_media => 23
- }.freeze
+ }

- # Returns the length of the string as it would be displayed. This is equivilent to the length of the Unicode NFC
- # (See: http://www.unicode.org/reports/tr15). This is needed in order to consistently calculate the length of a
- # string no matter which actual form was transmitted. For example:
- #
- # U+0065 Latin Small Letter E
- # + U+0301 Combining Acute Accent
- # ----------
- # = 2 bytes, 2 characters, displayed as é (1 visual glyph)
- # … The NFC of {U+0065, U+0301} is {U+00E9}, which is a single chracter and a +display_length+ of 1
- #
- # The string could also contain U+00E9 already, in which case the canonicalization will not change the value.
- #
- def tweet_length(text, options = {})
- options = DEFAULT_TCO_URL_LENGTHS.merge(options)
+ # :weighted_length the weighted length of tweet based on weights specified in the config
+ # :valid If tweet is valid
+ # :permillage permillage of the tweet over the max length specified in config
+ # :valid_range_start beginning of valid text
+ # :valid_range_end End index of valid part of the tweet text (inclusive)
+ # :display_range_start beginning index of display text
+ # :display_range_end end index of display text (inclusive)
+ class ParseResults < Hash

- length = text.to_nfc.unpack("U*").length
+ RESULT_PARAMS = [:weighted_length, :valid, :permillage, :valid_range_start, :valid_range_end, :display_range_start, :display_range_end]

- Twitter::Extractor.extract_urls_with_indices(text) do |url, start_position, end_position|
- length += start_position - end_position
- length += url.downcase =~ /^https:\/\// ? options[:short_url_length_https] : options[:short_url_length]
+ def self.empty
+ return ParseResults.new(weighted_length: 0, permillage: 0, valid: true, display_range_start: 0, display_range_end: 0, valid_range_start: 0, valid_range_end: 0)
  end

- length
+ def initialize(params = {})
+ RESULT_PARAMS.each do |key|
+ super[key] = params[key] if params.key?(key)
+ end
+ end
  end

- # Check the <tt>text</tt> for any reason that it may not be valid as a Tweet. This is meant as a pre-validation
- # before posting to api.twitter.com. There are several server-side reasons for Tweets to fail but this pre-validation
- # will allow quicker feedback.
- #
- # Returns <tt>false</tt> if this <tt>text</tt> is valid. Otherwise one of the following Symbols will be returned:
- #
- # <tt>:too_long</tt>:: if the <tt>text</tt> is too long
- # <tt>:empty</tt>:: if the <tt>text</tt> is nil or empty
- # <tt>:invalid_characters</tt>:: if the <tt>text</tt> contains non-Unicode or any of the disallowed Unicode characters
- def tweet_invalid?(text)
- return :empty if !text || text.empty?
+ # Parse input text and return hash with descriptive parameters populated.
+ def parse_tweet(text, options = {})
+ options = DEFAULT_TCO_URL_LENGTHS.merge(options)
+ config = options[:config] || Twitter::Configuration.default_configuration
+ normalized_text = text.to_nfc
+ normalized_text_length = normalized_text.char_length
+ unless (normalized_text_length > 0)
+ ParseResults.empty()
+ end
+
+ scale = config.scale
+ max_weighted_tweet_length = config.max_weighted_tweet_length
+ scaled_max_weighted_tweet_length = max_weighted_tweet_length * scale
+ transformed_url_length = config.transformed_url_length * scale
+ ranges = config.ranges
+
+ url_entities = Twitter::Extractor.extract_urls_with_indices(normalized_text)
+
+ has_invalid_chars = false
+ weighted_count = 0
+ offset = 0
+ display_offset = 0
+ valid_offset = 0
+
+ while offset < normalized_text_length
+ # Reset the default char weight each pass through the loop
+ char_weight = config.default_weight
+ url_entities.each do |url_entity|
+ if url_entity[:indices].first == offset
+ url_length = url_entity[:indices].last - url_entity[:indices].first
+ weighted_count += transformed_url_length
+ offset += url_length
+ display_offset += url_length
+ if weighted_count <= scaled_max_weighted_tweet_length
+ valid_offset += url_length
+ end
+ # Finding a match breaks the loop; order of ranges matters.
+ break
+ end
+ end
+
+ if offset < normalized_text_length
+ code_point = normalized_text[offset]
+
+ ranges.each do |range|
+ if range.contains?(code_point.unpack("U").first)
+ char_weight = range.weight
+ break
+ end
+ end
+
+ weighted_count += char_weight
+
+ has_invalid_chars = contains_invalid?(normalized_text[offset]) unless has_invalid_chars
+ char_count = code_point.char_length
+ offset += char_count
+ display_offset += char_count
+
+ if !has_invalid_chars && (weighted_count <= scaled_max_weighted_tweet_length)
+ valid_offset += char_count
+ end
+ end
+ end
+ normalized_text_offset = text.char_length - normalized_text.char_length
+ scaled_weighted_length = weighted_count / scale
+ is_valid = !has_invalid_chars && (scaled_weighted_length <= max_weighted_tweet_length)
+ permillage = scaled_weighted_length * 1000 / max_weighted_tweet_length
+
+ return ParseResults.new(weighted_length: scaled_weighted_length, permillage: permillage, valid: is_valid, display_range_start: 0, display_range_end: (display_offset + normalized_text_offset - 1), valid_range_start: 0, valid_range_end: (valid_offset + normalized_text_offset - 1))
+ end
+
+ def contains_invalid?(text)
+ return false if !text || text.empty?
  begin
- return :too_long if tweet_length(text) > MAX_LENGTH
- return :invalid_characters if Twitter::Regex::INVALID_CHARACTERS.any?{|invalid_char| text.include?(invalid_char) }
+ return true if Twitter::Regex::INVALID_CHARACTERS.any?{|invalid_char| text.include?(invalid_char) }
  rescue ArgumentError
  # non-Unicode value.
- return :invalid_characters
+ return true
  end
-
  return false
  end

- def valid_tweet_text?(text)
- !tweet_invalid?(text)
- end
-
  def valid_username?(username)
  return false if !username || username.empty?

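To make the weighting arithmetic in the new `parse_tweet` loop easier to follow, here is a small worked example. It assumes the v2 configuration ships with scale 100, default weight 200, weight 100 for the common Latin ranges, and a max weighted length of 280; the numbers are illustrative rather than authoritative.

```ruby
# "café" -> 4 code points, all inside an assumed weight-100 range.
weighted_count  = 4 * 100                       # => 400, accumulated at scaled precision
weighted_length = weighted_count / 100          # => 4 once divided by the scale
permillage      = weighted_length * 1000 / 280  # => 14, about 1.4% of the budget used

# "こんにちは" -> 5 code points outside the weight-100 ranges, so the default weight applies.
weighted_length = (5 * 200) / 100               # => 10, i.e. each character costs two units of budget
```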
@@ -102,6 +151,69 @@ module Twitter
  (!unicode_domains && valid_match?(authority, Twitter::Regex[:validate_url_authority]))
  end

+ # These methods are deprecated, will be removed in future.
+ extend Deprecation
+
+ MAX_LENGTH_LEGACY = 140
+
+ # DEPRECATED: Please use parse_text instead.
+ #
+ # Returns the length of the string as it would be displayed. This is equivilent to the length of the Unicode NFC
+ # (See: http://www.unicode.org/reports/tr15). This is needed in order to consistently calculate the length of a
+ # string no matter which actual form was transmitted. For example:
+ #
+ # U+0065 Latin Small Letter E
+ # + U+0301 Combining Acute Accent
+ # ----------
+ # = 2 bytes, 2 characters, displayed as é (1 visual glyph)
+ # … The NFC of {U+0065, U+0301} is {U+00E9}, which is a single chracter and a +display_length+ of 1
+ #
+ # The string could also contain U+00E9 already, in which case the canonicalization will not change the value.
+ #
+ def tweet_length(text, options = {})
+ options = DEFAULT_TCO_URL_LENGTHS.merge(options)
+
+ length = text.to_nfc.unpack("U*").length
+
+ Twitter::Extractor.extract_urls_with_indices(text) do |url, start_position, end_position|
+ length += start_position - end_position
+ length += options[:short_url_length] if url.length > 0
+ end
+
+ length
+ end
+ deprecate :tweet_length, :parse_tweet
+
+ # DEPRECATED: Please use parse_text instead.
+ #
+ # Check the <tt>text</tt> for any reason that it may not be valid as a Tweet. This is meant as a pre-validation
+ # before posting to api.twitter.com. There are several server-side reasons for Tweets to fail but this pre-validation
+ # will allow quicker feedback.
+ #
+ # Returns <tt>false</tt> if this <tt>text</tt> is valid. Otherwise one of the following Symbols will be returned:
+ #
+ # <tt>:too_long</tt>:: if the <tt>text</tt> is too long
+ # <tt>:empty</tt>:: if the <tt>text</tt> is nil or empty
+ # <tt>:invalid_characters</tt>:: if the <tt>text</tt> contains non-Unicode or any of the disallowed Unicode characters
+ def tweet_invalid?(text)
+ return :empty if !text || text.empty?
+ begin
+ return :too_long if tweet_length(text) > MAX_LENGTH_LEGACY
+ return :invalid_characters if Twitter::Regex::INVALID_CHARACTERS.any?{|invalid_char| text.include?(invalid_char) }
+ rescue ArgumentError
+ # non-Unicode value.
+ return :invalid_characters
+ end
+
+ return false
+ end
+ deprecate :tweet_invalid?, :parse_tweet
+
+ def valid_tweet_text?(text)
+ !tweet_invalid?(text)
+ end
+ deprecate :valid_tweet_text?, :parse_tweet
+
  private

  def valid_match?(string, regex, optional=false)