Class Raingrams::Model
In: lib/raingrams/model.rb
Parent: Object

Included Modules

Helpers::Frequency Helpers::Probability Helpers::Similarity Helpers::Commonality Helpers::Random

Attributes

ignore_case  [R]  Ignore case of parsed text
ignore_phone_numbers  [R]  Ignore phone numbers
ignore_punctuation  [R]  Ignore the punctuation of parsed text
ignore_references  [R]  Ignore RFC-style references such as [1]
ignore_urls  [R]  Ignore URLs
ngram_size  [R]  Size of ngrams to use
prefixes  [R]  Probabilities of all (n-1) grams
starting_ngram  [R]  The sentence starting ngram
stoping_ngram  [R]  The sentence stopping ngram

Public Class methods

Creates a new model object with the given options. If a block is given, it will be passed the newly created model. After the block has been called, the model will be built.

[Source]

# File lib/raingrams/model.rb, line 100
    def self.build(options={},&block)
      self.new(options) do |model|
        model.build(&block)
      end
    end

Creates a new model with the specified options.

options must contain the following keys:

:ngram_size: The size of each ngram.

options may contain the following keys:

:ignore_case: Defaults to false.
:ignore_punctuation: Defaults to true.
:ignore_urls: Defaults to true.
:ignore_phone_numbers: Defaults to false.
:ignore_references: Defaults to false.

[Source]

# File lib/raingrams/model.rb, line 59
    def initialize(options={},&block)
      @ngram_size = options[:ngram_size]
      @starting_ngram = Ngram.new(Tokens.start * @ngram_size)
      @stoping_ngram = Ngram.new(Tokens.stop * @ngram_size)

      @ignore_case = false
      @ignore_punctuation = true
      @ignore_urls = true
      @ignore_phone_numbers = false
      @ignore_references = false

      if options.has_key?(:ignore_case)
        @ignore_case = options[:ignore_case]
      end

      if options.has_key?(:ignore_punctuation)
        @ignore_punctuation = options[:ignore_punctuation]
      end

      if options.has_key?(:ignore_urls)
        @ignore_urls = options[:ignore_urls]
      end

      if options.has_key?(:ignore_phone_numbers)
        @ignore_phone_numbers = options[:ignore_phone_numbers]
      end

      if options.has_key?(:ignore_references)
        @ignore_references = options[:ignore_references]
      end

      @prefixes = {}

      block.call(self) if block
    end
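
The option handling above can be condensed into a small sketch. It is not part of Raingrams: DEFAULTS and resolve_options are hypothetical names, and the default values are copied from the assignments in the source listing. As in the constructor, a key that is present overrides the default even when its value is false or nil.

```ruby
# Hypothetical stand-in for the option handling in Model#initialize.
# The values mirror the instance variable assignments in the listing above.
DEFAULTS = {
  :ignore_case          => false,
  :ignore_punctuation   => true,
  :ignore_urls          => true,
  :ignore_phone_numbers => false,
  :ignore_references    => false
}.freeze

def resolve_options(options = {})
  # Keep only the recognized keys, then let them override the defaults.
  recognized = options.select { |key, _value| DEFAULTS.key?(key) }
  DEFAULTS.merge(recognized)
end

settings = resolve_options(:ignore_case => true, :ignore_urls => false)
puts settings.inspect
```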

Loads a marshaled model from the file at the specified path.

[Source]

# File lib/raingrams/model.rb, line 150
    def self.open(path)
      model = nil

      # read in binary mode, since Marshal data is not text
      File.open(path,'rb') do |file|
        model = Marshal.load(file)
      end

      return model
    end

Creates a new model object with the given options and trains it with the contents of the specified path.

[Source]

# File lib/raingrams/model.rb, line 130
    def self.train_with_file(path,options={})
      self.build(options) do |model|
        model.train_with_file(path)
      end
    end

Creates a new model object with the given options and trains it with the specified paragraph.

[Source]

# File lib/raingrams/model.rb, line 110
    def self.train_with_paragraph(paragraph,options={})
      self.build(options) do |model|
        model.train_with_paragraph(paragraph)
      end
    end

Creates a new model object with the given options and trains it with the specified text.

[Source]

# File lib/raingrams/model.rb, line 120
    def self.train_with_text(text,options={})
      self.build(options) do |model|
        model.train_with_text(text)
      end
    end

Creates a new model object with the given options and trains it with the inner text of the paragraph tags at the specified url.

[Source]

# File lib/raingrams/model.rb, line 140
    def self.train_with_url(url,options={})
      self.build(options) do |model|
        model.train_with_url(url)
      end
    end

Protected Class methods

Defines the default ngram size for the model.

[Source]

# File lib/raingrams/model.rb, line 588
    def self.ngram_size(size)
      class_eval %{
        def initialize(options={},&block)
          super(options.merge(:ngram_size => #{size.to_i}),&block)
        end
      }
    end
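
A self-contained sketch of how such a class-level macro behaves. BaseModel and TrigramSketch are hypothetical stand-ins, not Raingrams classes, and the macro is left public here for brevity. Because the generated initialize merges :ngram_size last, a size passed by the caller is overridden.

```ruby
# Hypothetical base class that only records the ngram size.
class BaseModel
  attr_reader :ngram_size

  def initialize(options = {}, &block)
    @ngram_size = options[:ngram_size]
    block.call(self) if block
  end

  # Class-level macro: defines an initialize in the calling subclass
  # that forces :ngram_size to the given value before calling super.
  def self.ngram_size(size)
    class_eval %{
      def initialize(options={},&block)
        super(options.merge(:ngram_size => #{size.to_i}),&block)
      end
    }
  end
end

class TrigramSketch < BaseModel
  ngram_size 3
end

default_model  = TrigramSketch.new
override_model = TrigramSketch.new(:ngram_size => 7)
puts default_model.ngram_size
puts override_model.ngram_size
```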

Public Instance methods

Clears and rebuilds the model.

[Source]

# File lib/raingrams/model.rb, line 549
    def build(&block)
      refresh do
        clear

        block.call(self) if block
      end
    end

Clears the model of any training data.

[Source]

# File lib/raingrams/model.rb, line 560
    def clear
      @prefixes.clear
      return self
    end

Iterates over the ngrams that compose the model, passing each one to the given block.

[Source]

# File lib/raingrams/model.rb, line 243
    def each_ngram(&block)
      @prefixes.each do |prefix,table|
        table.each_gram do |postfix_gram|
          block.call(prefix + postfix_gram) if block
        end
      end

      return self
    end

Returns all grams within the model.

[Source]

# File lib/raingrams/model.rb, line 433
    def grams
      @prefixes.keys.inject(Set.new) do |all_grams,gram|
        all_grams + gram
      end
    end

Returns all grams which occur directly after the specified gram.

[Source]

# File lib/raingrams/model.rb, line 465
    def grams_following(gram)
      gram_set = Set.new

      ngrams_starting_with(gram).each do |ngram|
        gram_set << ngram[1]
      end

      return gram_set
    end

Returns all grams which precede the specified gram.

[Source]

# File lib/raingrams/model.rb, line 452
    def grams_preceeding(gram)
      gram_set = Set.new

      ngrams_ending_with(gram).each do |ngram|
        gram_set << ngram[-2]
      end

      return gram_set
    end

Returns true if the model contains the specified gram, returns false otherwise.

[Source]

# File lib/raingrams/model.rb, line 443
    def has_gram?(gram)
      @prefixes.keys.any? do |prefix|
        prefix.include?(gram)
      end
    end

Returns true if the model contains the specified ngram, returns false otherwise.

[Source]

# File lib/raingrams/model.rb, line 231
    def has_ngram?(ngram)
      if @prefixes.has_key?(ngram.prefix)
        return @prefixes[ngram.prefix].has_gram?(ngram.last)
      else
        return false
      end
    end

Returns the ngrams that compose the model.

[Source]

# File lib/raingrams/model.rb, line 215
    def ngrams
      ngram_set = NgramSet.new

      @prefixes.each do |prefix,table|
        table.each_gram do |postfix_gram|
          ngram_set << (prefix + postfix_gram)
        end
      end

      return ngram_set
    end

Returns the ngrams which end with the specified gram.

[Source]

# File lib/raingrams/model.rb, line 318
    def ngrams_ending_with(gram)
      ngram_set = NgramSet.new

      @prefixes.each do |prefix,table|
        if table.has_gram?(gram)
          ngram_set << (prefix + gram)
        end
      end

      return ngram_set
    end

Returns all ngrams which occur directly after the specified gram.

[Source]

# File lib/raingrams/model.rb, line 418
    def ngrams_following(gram)
      ngram_set = NgramSet.new

      ngrams_starting_with(gram).each do |starts_with|
        ngrams_prefixed_by(starts_with.postfix).each do |ngram|
          ngram_set << ngram
        end
      end

      return ngram_set
    end

Returns the ngrams extracted from the specified fragment of text.

[Source]

# File lib/raingrams/model.rb, line 378
    def ngrams_from_fragment(fragment)
      ngrams_from_words(parse_sentence(fragment))
    end

ngrams_from_paragraph(text)

Alias for ngrams_from_text

Returns the ngrams extracted from the specified sentence.

[Source]

# File lib/raingrams/model.rb, line 385
    def ngrams_from_sentence(sentence)
      ngrams_from_words(wrap_sentence(parse_sentence(sentence)))
    end

Returns the ngrams extracted from the specified text.

[Source]

# File lib/raingrams/model.rb, line 392
    def ngrams_from_text(text)
      parse_text(text).inject([]) do |ngrams,sentence|
        ngrams + ngrams_from_sentence(sentence)
      end
    end

Returns the ngrams extracted from the specified words.

[Source]

# File lib/raingrams/model.rb, line 369
    def ngrams_from_words(words)
      return (0...(words.length-@ngram_size+1)).map do |index|
        Ngram.new(words[index,@ngram_size])
      end
    end
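
The sliding window used above can be tried with plain Arrays. This is a sketch only: sliding_ngrams is a hypothetical helper and returns Arrays rather than Ngram objects. When there are fewer words than the window size, the range is empty and no ngrams are produced.

```ruby
# Hypothetical helper: the same index arithmetic as ngrams_from_words,
# but with the size passed in and plain Arrays as the result.
def sliding_ngrams(words, size)
  (0...(words.length - size + 1)).map do |index|
    words[index, size]
  end
end

bigrams   = sliding_ngrams(%w[the quick brown fox], 2)
too_short = sliding_ngrams(%w[hi], 3)

puts bigrams.inspect
puts too_short.inspect
```

For inputs at least as long as the window, Enumerable#each_cons yields the same windows.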

Returns the ngrams including all of the specified grams.

[Source]

# File lib/raingrams/model.rb, line 356
    def ngrams_including_all(*grams)
      ngram_set = NgramSet.new

      each_ngram do |ngram|
        ngram_set << ngram if ngram.includes_all?(*grams)
      end

      return ngram_set
    end

Returns the ngrams including any of the specified grams.

[Source]

# File lib/raingrams/model.rb, line 333
    def ngrams_including_any(*grams)
      ngram_set = NgramSet.new

      @prefixes.each do |prefix,table|
        if prefix.includes_any?(*grams)
          table.each_gram do |postfix_gram|
            ngram_set << (prefix + postfix_gram)
          end
        else
          table.each_gram do |postfix_gram|
            if grams.include?(postfix_gram)
              ngram_set << (prefix + postfix_gram)
            end
          end
        end
      end

      return ngram_set
    end

Returns the ngrams postfixed by the specified postfix.

[Source]

# File lib/raingrams/model.rb, line 284
    def ngrams_postfixed_by(postfix)
      ngram_set = NgramSet.new

      @prefixes.each do |prefix,table|
        if prefix[1..-1] == postfix[0..-2]
          if table.has_gram?(postfix.last)
            ngram_set << (prefix + postfix.last)
          end
        end
      end

      return ngram_set
    end
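
A sketch of the matching rule above, using a plain Hash of prefix Arrays to Arrays of following grams in place of the model's probability tables (an assumption made for illustration). An ngram ends with the postfix when its prefix, minus the first gram, equals the postfix minus its last gram, and the last gram of the postfix follows that prefix.

```ruby
# Hypothetical stand-in: prefixes maps an Array prefix to the Array of
# grams seen after it.
def ngrams_postfixed_by(prefixes, postfix)
  prefixes.each_with_object([]) do |(prefix, grams), found|
    # The tail of the prefix must line up with the head of the postfix.
    next unless prefix[1..-1] == postfix[0..-2]

    found << (prefix + [postfix.last]) if grams.include?(postfix.last)
  end
end

prefixes = {
  %w[the cat] => %w[sat ran],
  %w[a cat]   => %w[sat],
  %w[the dog] => %w[sat]
}

matches = ngrams_postfixed_by(prefixes, %w[cat sat])
puts matches.inspect
```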

Returns all ngrams which precede the specified gram.

[Source]

# File lib/raingrams/model.rb, line 403
    def ngrams_preceeding(gram)
      ngram_set = NgramSet.new

      ngrams_ending_with(gram).each do |ends_with|
        ngrams_postfixed_by(ends_with.prefix).each do |ngram|
          ngram_set << ngram
        end
      end

      return ngram_set
    end

Returns the ngrams prefixed by the specified prefix.

[Source]

# File lib/raingrams/model.rb, line 269
    def ngrams_prefixed_by(prefix)
      ngram_set = NgramSet.new

      return ngram_set unless @prefixes.has_key?(prefix)

      ngram_set += @prefixes[prefix].grams.map do |gram|
        prefix + gram
      end

      return ngram_set
    end

Returns the ngrams starting with the specified gram.

[Source]

# File lib/raingrams/model.rb, line 301
    def ngrams_starting_with(gram)
      ngram_set = NgramSet.new

      @prefixes.each do |prefix,table|
        if prefix.first == gram
          table.each_gram do |postfix_gram|
            ngram_set << (prefix + postfix_gram)
          end
        end
      end

      return ngram_set
    end

Selects the ngrams that match the given block.

[Source]

# File lib/raingrams/model.rb, line 256
    def ngrams_with(&block)
      selected_ngrams = NgramSet.new

      each_ngram do |ngram|
        selected_ngrams << ngram if block.call(ngram)
      end

      return selected_ngrams
    end

Parses the specified sentence and returns an Array of tokens.

[Source]

# File lib/raingrams/model.rb, line 163
    def parse_sentence(sentence)
      sentence = sentence.to_s

      if @ignore_punctuation
        # eat trailing punctuation
        sentence.gsub!(/[\.\?!]*$/,'')
      end

      if @ignore_case
        # downcase the sentence
        sentence.downcase!
      end

      if @ignore_urls
        sentence.gsub!(/\s*\w+:\/\/[\w\/\+_\-,:%\d\.\-\?&=]*\s*/,' ')
      end

      if @ignore_phone_numbers
        # remove phone numbers
        sentence.gsub!(/\s*(\d-)?(\d{3}-)?\d{3}-\d{4}\s*/,' ')
      end

      if @ignore_references
        # remove RFC style references
        sentence.gsub!(/\s*[\(\{\[]\d+[\)\}\]]\s*/,' ')
      end

      if @ignore_punctuation
        # split and ignore punctuation characters
        return sentence.scan(/\w+[\-_\.:']\w+|\w+/)
      else
        # split and accept punctuation characters
        return sentence.scan(/[\w\-_,:;\.\?\!'"\\\/]+/)
      end
    end
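
The tokenizing steps above can be exercised outside the model. In this sketch, tokenize and URL_PATTERN are hypothetical names, the regular expressions are copied from the listing, and the phone number and reference filters are left out for brevity. Unlike the listing, which calls gsub! on the result of to_s and so modifies a String argument in place, the sketch works on a copy.

```ruby
# URL pattern copied from the listing above.
URL_PATTERN = /\s*\w+:\/\/[\w\/\+_\-,:%\d\.\-\?&=]*\s*/

# Hypothetical stand-alone version of the parsing steps; the keyword
# arguments stand in for the model's ignore_* attributes.
def tokenize(sentence, ignore_punctuation: true, ignore_case: false, ignore_urls: true)
  text = sentence.to_s.dup

  # Same order as the listing: trailing punctuation, case, then URLs.
  text = text.gsub(/[\.\?!]*$/, '') if ignore_punctuation
  text = text.downcase if ignore_case
  text = text.gsub(URL_PATTERN, ' ') if ignore_urls

  if ignore_punctuation
    # Words, optionally joined by one inner - _ . : or ' character.
    text.scan(/\w+[\-_\.:']\w+|\w+/)
  else
    # Punctuation stays attached to the neighbouring word.
    text.scan(/[\w\-_,:;\.\?\!'"\\\/]+/)
  end
end

plain      = tokenize("Don't panic, it's well-known.")
with_url   = tokenize("Visit http://example.com/docs today!", ignore_case: true)
punctuated = tokenize("Wait, what?", ignore_punctuation: false)

puts plain.inspect
puts with_url.inspect
puts punctuated.inspect
```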

Parses the specified text and returns an Array of sentences.

[Source]

# File lib/raingrams/model.rb, line 202
    def parse_text(text)
      text = text.to_s

      if @ignore_urls
        text.gsub!(/\s*\w+:\/\/[\w\/\+_\-,:%\d\.\-\?&=]*\s*/,' ')
      end

      return text.scan(/[^\s\.\?!][^\.\?!]*[\.\?\!]/)
    end
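
A short sketch of the sentence pattern in isolation. split_sentences and SENTENCE_PATTERN are hypothetical names, and the pattern is copied from the listing. Because every match must end in ., ? or !, trailing text without a terminator is not returned.

```ruby
# Sentence pattern copied from the listing above.
SENTENCE_PATTERN = /[^\s\.\?!][^\.\?!]*[\.\?\!]/

# Hypothetical helper: returns each run of text that ends in . ? or !
def split_sentences(text)
  text.to_s.scan(SENTENCE_PATTERN)
end

sentences    = split_sentences("Hello world. How are you? Fine! trailing words")
unterminated = split_sentences("no terminator here")

puts sentences.inspect
puts unterminated.inspect
```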

Refreshes the probability tables of the model.

[Source]

# File lib/raingrams/model.rb, line 539
    def refresh(&block)
      block.call(self) if block

      @prefixes.each_value { |table| table.build }
      return self
    end

Saves the model to the file at the specified path.

[Source]

# File lib/raingrams/model.rb, line 568
    def save(path)
      # write in binary mode, since Marshal data is not text
      File.open(path,'wb') do |file|
        Marshal.dump(self,file)
      end

      return self
    end
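
A round trip through Marshal, sketched with a stand-in object. SketchModel, save_model and open_model are hypothetical and do not use the Raingrams classes. The files are opened in binary mode because Marshal data is not text. Marshal.load should only be given files from a trusted source, since loading can instantiate arbitrary objects.

```ruby
require 'tempfile'

# Hypothetical stand-in for a trained model.
SketchModel = Struct.new(:ngram_size, :prefixes)

def save_model(model, path)
  File.open(path, 'wb') { |file| Marshal.dump(model, file) }
  model
end

def open_model(path)
  File.open(path, 'rb') { |file| Marshal.load(file) }
end

original = SketchModel.new(2, { %w[the] => { 'cat' => 2 } })

# Tempfile.create removes the file once the block returns.
restored = Tempfile.create('model') do |tmp|
  tmp.close
  save_model(original, tmp.path)
  open_model(tmp.path)
end

puts restored.inspect
```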

Sets the frequency of the specified ngram to the specified value.

[Source]

# File lib/raingrams/model.rb, line 478
    def set_ngram_frequency(ngram,value)
      probability_table(ngram).set_count(ngram.last,value)
    end

Returns a Hash representation of the model.

[Source]

# File lib/raingrams/model.rb, line 579
    def to_hash
      @prefixes
    end

Train the model with the contents of the specified path.

[Source]

# File lib/raingrams/model.rb, line 520
    def train_with_file(path)
      train_with_text(File.read(path))
    end

Train the model with the specified ngram.

[Source]

# File lib/raingrams/model.rb, line 485
    def train_with_ngram(ngram)
      probability_table(ngram).count(ngram.last)
    end

Train the model with the specified ngrams.

[Source]

# File lib/raingrams/model.rb, line 492
    def train_with_ngrams(ngrams)
      ngrams.each { |ngram| train_with_ngram(ngram) }
    end

Train the model with the specified paragraph.

[Source]

# File lib/raingrams/model.rb, line 506
    def train_with_paragraph(paragraph)
      train_with_ngrams(ngrams_from_paragraph(paragraph))
    end

Train the model with the specified sentence.

[Source]

# File lib/raingrams/model.rb, line 499
    def train_with_sentence(sentence)
      train_with_ngrams(ngrams_from_sentence(sentence))
    end

Train the model with the specified text.

[Source]

# File lib/raingrams/model.rb, line 513
    def train_with_text(text)
      train_with_ngrams(ngrams_from_text(text))
    end

Train the model with the inner text of the paragraph tags at the specified url.

[Source]

# File lib/raingrams/model.rb, line 528
    def train_with_url(url)
      doc = Nokogiri::HTML(open(url))

      return doc.search('p').map do |p|
        train_with_paragraph(p.inner_text)
      end
    end

Protected Instance methods

Returns the probability table for the specified ngram.

[Source]

# File lib/raingrams/model.rb, line 607
    def probability_table(ngram)
      @prefixes[ngram.prefix] ||= ProbabilityTable.new
    end
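
The prefix lookup above, together with train_with_ngram, amounts to a table of counts per prefix. The sketch below uses a Hash of Hashes in place of ProbabilityTable, whose implementation is not shown here. The helper signatures are hypothetical, and reading a probability as a relative frequency is an assumption made for illustration.

```ruby
# Hypothetical stand-in: prefixes maps an Array prefix to a Hash of
# counts for the grams seen after it.
def train_with_ngram(prefixes, ngram)
  table = (prefixes[ngram[0...-1]] ||= Hash.new(0))
  table[ngram.last] += 1
end

# Assumed estimate: count of the last gram divided by all counts that
# share the same prefix.
def probability(prefixes, ngram)
  table = prefixes[ngram[0...-1]]
  return 0.0 unless table

  table[ngram.last].to_f / table.values.sum
end

prefixes = {}
[%w[the cat sat], %w[the cat ran], %w[the cat sat], %w[a dog sat]].each do |ngram|
  train_with_ngram(prefixes, ngram)
end

sat_given_the_cat = probability(prefixes, %w[the cat sat])
puts prefixes.inspect
puts sat_given_the_cat
```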

Wraps the specified sentence with StartSentence and StopSentence tokens.

[Source]

# File lib/raingrams/model.rb, line 600
    def wrap_sentence(sentence)
      @starting_ngram + sentence.to_a + @stoping_ngram
    end
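
A sketch of the padding step with plain Arrays. The :start and :stop symbols are hypothetical placeholders for the StartSentence and StopSentence tokens, and wrap is a hypothetical helper that takes the ngram size as an argument. Padding with a full ngram of start and stop tokens means the first and last ngrams of a sentence consist only of those tokens.

```ruby
# Hypothetical placeholders for the sentence boundary tokens.
START_TOKEN = :start
STOP_TOKEN  = :stop

# Pads the words with `size` start tokens and `size` stop tokens.
def wrap(words, size)
  ([START_TOKEN] * size) + words.to_a + ([STOP_TOKEN] * size)
end

wrapped = wrap(%w[hello world], 2)
bigrams = wrapped.each_cons(2).to_a

puts wrapped.inspect
puts bigrams.length
```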
