
Defining text field types in schema.xml


Overview

Solr’s world view consists of documents, where each document consists of searchable fields. The rules for searching each field are defined using field type definitions. A field type definition describes the analyzers, tokenizers and filters which control searching behaviour for all fields of that type.

 

When a document is added or updated, its fields are analyzed and tokenized, and those tokens are stored in Solr's index. When a query is sent, the query is again analyzed and tokenized, and then matched against the tokens in the index. This critical function of tokenization is handled by Tokenizer components.

 

In addition to tokenizers, there are TokenFilter components, whose job is to modify the token stream.

There are also CharFilter components, whose job is to modify individual characters. For example, HTML text can be filtered to convert HTML entities like &amp; to a regular &.

 

Defining text field types in schema.xml

Here’s a typical text field type definition:

<fieldType name="textgen" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

What this type definition specifies is:

  • When indexing a field of this type, use an analyzer composed of
    • a WhitespaceTokenizerFactory object
    • a StopFilterFactory
    • a WordDelimiterFilterFactory
    • a LowerCaseFilterFactory
  • When querying a field of this type, use an analyzer composed of
    • a WhitespaceTokenizerFactory object
    • a SynonymFilterFactory
    • a StopFilterFactory
    • a WordDelimiterFilterFactory
    • a LowerCaseFilterFactory

If there is only one analyzer element, then the same analyzer is used for both indexing and querying.

    It’s important to use the same or similar analyzers that process text in a compatible manner at index and query time. For example, if an indexing analyzer lowercases words, then the query analyzer should do the same to enable finding the indexed words.
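The note above can be made concrete with a toy sketch (illustration only, not Solr code): both the "index" and "query" side lowercase, so a query for RUNNING finds a document that contained Running.

```python
# Toy illustration of index/query analyzer compatibility: both sides
# whitespace-tokenize and lowercase, so query tokens match index tokens.

def analyze(text):
    """Toy analyzer: whitespace tokenize, then lowercase each token."""
    return [tok.lower() for tok in text.split()]

index_tokens = set(analyze("Running Shoes"))   # what gets stored
query_tokens = analyze("RUNNING")              # what the query becomes
print(all(t in index_tokens for t in query_tokens))  # -> True
```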

 

Under the hood

Solr builds a TokenizerChain instance for each of these analyzers. A TokenizerChain is composed of 1 TokenizerFactory instance, 0-n TokenFilterFactory instances, and 0-n CharFilterFactory instances. These factory instances are responsible for creating their respective objects from the Lucene framework. For example, a TokenizerFactory creates a Lucene Tokenizer; its concrete implementation WhitespaceTokenizerFactory creates a Lucene WhitespaceTokenizer.

The class design diagram shows how a TokenizerChain works:

  • Raw input is provided by a Reader instance
  • CharReader (is-a CharStream) wraps the raw Reader
  • Each CharFilterFactory creates a character filter that modifies the input CharStream and outputs a CharStream. So CharFilterFactories can be chained.
  • TokenizerFactory creates a Tokenizer from the CharStream.
  • Tokenizer is-a TokenStream, and can be passed to TokenFilterFactories.
  • Each TokenFilterFactory modifies the token stream and outputs another TokenStream. So these can be chained.
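The data flow described above can be sketched as a small pipeline (an illustration only; the real TokenizerChain, CharFilter, Tokenizer, and TokenFilter are Java classes in Lucene/Solr, and the toy components below are hypothetical stand-ins):

```python
# Illustrative sketch of a TokenizerChain-style pipeline: char filters
# transform the raw character stream, exactly one tokenizer splits it
# into tokens, then token filters transform the token stream.

def run_chain(text, char_filters, tokenizer, token_filters):
    for cf in char_filters:          # 0..n char filters, chained
        text = cf(text)
    tokens = tokenizer(text)         # exactly 1 tokenizer
    for tf in token_filters:         # 0..n token filters, chained
        tokens = tf(tokens)
    return tokens

# Toy stand-ins for CharFilter / Tokenizer / TokenFilter components:
strip_amp = lambda s: s.replace("&amp;", "&")   # char filter
whitespace = lambda s: s.split()                # tokenizer
lowercase = lambda toks: [t.lower() for t in toks]  # token filter

print(run_chain("Fish &amp; Chips", [strip_amp], whitespace, [lowercase]))
# -> ['fish', '&', 'chips']
```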

Commonly used CharFilterFactories

solr.MappingCharFilterFactory Maps a set of characters to another set of characters. The mapping file is specified by the mapping attribute, and should be present under /solr/conf.
Example: <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
The mapping file should have this format:
# Ä => A
"\u00C4" => "A"
# Å => A
"\u00C5" => "A"
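The mapping behaviour can be simulated with a toy sketch (illustration only; the real MappingCharFilter applies its rules in a single pass over the character stream rather than with repeated string replacement):

```python
# Illustration only, not Solr code: apply each mapping rule as a plain
# string replacement before tokenization.
MAPPING = {"\u00C4": "A", "\u00C5": "A"}   # Ä => A, Å => A

def map_chars(text, mapping):
    for src, dst in mapping.items():
        text = text.replace(src, dst)
    return text

print(map_chars("\u00C5ngstr\u00F6m", MAPPING))  # Å is mapped, ö is untouched
```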
solr.HTMLStripCharFilterFactory Strips HTML/XML from the input stream. The input need not be an HTML document, as only constructs that look like HTML will be removed.
Removes HTML/XML tags while keeping the content.
Attributes within tags are also removed, and attribute quoting is optional.
Removes XML processing instructions: <?foo bar?>
Removes XML comments.
Removes XML elements starting with <! and ending with >.
Removes the contents of <script> and <style> elements.
Handles XML comments inside these elements (normal comment processing won't always work).
Replaces numeric character entity references like &#65; or &#x7f;. The terminating ';' is optional if the entity reference is followed by whitespace.
Replaces all named character entity references; here the terminating ';' is mandatory to avoid false matches on something like "Alpha&Omega Corp".
Example: <charFilter class="solr.HTMLStripCharFilterFactory"/>
The text
my <a href="www.foo.bar">link</a>
becomes
my link
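A very rough approximation of this stripping (illustration only; the real filter is a stateful parser that also handles comments, <script>/<style> contents, and entity references, which a regex cannot do in general):

```python
# Regex sketch: remove anything that looks like a simple tag.
import re

def strip_tags(text):
    return re.sub(r"<[^>]+>", "", text)

print(strip_tags('my <a href="www.foo.bar">link</a>'))  # -> my link
```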

 

Commonly used TokenizerFactories

solr.WhitespaceTokenizerFactory A tokenizer that divides text at whitespace, as defined by java.lang.Character.isWhitespace(). Adjacent sequences of non-whitespace characters form tokens.

Example: HELLO\t\t\tWORLD.txt is tokenized into 2 tokens: HELLO and WORLD.txt
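This behaviour can be approximated in one line of Python (illustration only; Solr does this in Java via Lucene's WhitespaceTokenizer):

```python
# str.split() with no argument splits on runs of whitespace
# (spaces, tabs, newlines), much like WhitespaceTokenizer.
tokens = "HELLO\t\t\tWORLD.txt".split()
print(tokens)  # -> ['HELLO', 'WORLD.txt']
```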

solr.KeywordTokenizerFactory Treats the entire field as one token, regardless of its content. This is a lot like the "string" field type, in that no tokenization happens at all. Use it if a text field requires no tokenization, but does require char filters, or token filtering like LowerCaseFilter and TrimFilter. Example: http://example.com/I-am+example?Text=-Hello is retained as http://example.com/I-am+example?Text=-Hello
solr.StandardTokenizerFactory A good general purpose tokenizer.

  • Splits words at punctuation characters, removing punctuation. However, a dot that’s not followed by whitespace is considered part of a token.
  • Not suitable for file names, because the .extension is treated as part of the token.
  • Splits words at hyphens, unless there’s a number in the token, in which case the whole token is interpreted as a product number and is not split.
  • Recognizes email addresses and internet hostnames as one token.

Example: This sentence can't be "tokenized_Correctly" by http://www.google.com or IBM or NATO 10.1.9.5 test@email.org product-number 123-456949 file.txt is tokenized as This sentence can't be tokenized Correctly by http://www.google.com or IBM or NATO 10.1.9.5 test@email.org product number 123-456949 file.txt

solr.PatternTokenizerFactory Uses regex pattern matching to construct distinct tokens for the input stream. It takes two arguments: “pattern” and “group”.

  • “pattern” is the regular expression.
  • “group” says which group to extract into tokens.

group=-1 (the default) is equivalent to "split". In this case, the tokens are equivalent to the output of String.split(java.lang.String), with empty tokens removed. Using group >= 0 selects the matching group as the token. For example, if you have:

  pattern = '([^']+)'
  group = 0
  input = aaa 'bbb' 'ccc'

the output will be two tokens: 'bbb' and 'ccc' (including the ' marks). With the same input but using group=1, the output would be: bbb and ccc (no ' marks).
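The two group modes can be simulated with Python's re module (illustration only; Solr uses java.util.regex, and unlike Java's String.split, Python's re.split also returns capture groups):

```python
# Simulation of PatternTokenizerFactory's modes.
import re

def pattern_tokenize(text, pattern, group=-1):
    if group == -1:
        # "split" mode, dropping empty tokens (note: with a capturing
        # pattern, re.split also returns the groups, unlike Java)
        return [t for t in re.split(pattern, text) if t]
    # group >= 0: emit the selected group of every match as a token
    return [m.group(group) for m in re.finditer(pattern, text)]

text = "aaa 'bbb' 'ccc'"
print(pattern_tokenize(text, r"'([^']+)'", group=0))  # -> ["'bbb'", "'ccc'"]
print(pattern_tokenize(text, r"'([^']+)'", group=1))  # -> ['bbb', 'ccc']
```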

solr.NGramTokenizerFactory Splits the input into 1-sized, then 2-sized, then 3-sized, etc. tokens, up to the configured maximum. Useful for partial (substring) matching, e.g. autocomplete-style searches. It takes "minGram" and "maxGram" arguments that bound the token sizes; the defaults are 1 and 2. Example: email becomes e m a i l em ma ai il
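The n-gram generation order described above (all 1-grams, then all 2-grams, and so on) can be sketched as follows (illustration only, not the Lucene implementation):

```python
# Emit every substring of length n, for n from min_gram to max_gram.
def ngrams(text, min_gram=1, max_gram=2):
    out = []
    for n in range(min_gram, max_gram + 1):
        out.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return out

print(ngrams("email"))  # -> ['e', 'm', 'a', 'i', 'l', 'em', 'ma', 'ai', 'il']
```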

 

Commonly used TokenFilterFactories

solr.WordDelimiterFilterFactory Splits words into subwords and performs optional transformations on subword groups. One use for WordDelimiterFilter is to help match words with different delimiters. One way of doing so is to specify generateWordParts="1" catenateWords="1" in the analyzer used for indexing, and generateWordParts="1" in the analyzer used for querying. Given that the current StandardTokenizer immediately removes many intra-word delimiters, it is recommended that this filter be used after a tokenizer that leaves them in place (such as WhitespaceTokenizer). By default, words are split into subwords with the following rules:

  • split on intra-word delimiters (all non alpha-numeric characters).
    • "Wi-Fi" -> "Wi", "Fi"
  • split on case transitions (can be turned off – see splitOnCaseChange parameter)
    • "PowerShot" -> "Power", "Shot"
  • split on letter-number transitions (can be turned off – see splitOnNumerics parameter)
    • "SD500" -> "SD", "500"
  • leading and trailing intra-word delimiters on each subword are ignored
    • "//hello---there, 'dude'" -> "hello", "there", "dude"
  • trailing “‘s” are removed for each subword (can be turned off – see stemEnglishPossessive parameter)
    • "O'Neil's" -> "O", "Neil"
      • Note: this step isn’t performed in a separate filter because of possible subword combinations.

Splitting is affected by the following parameters:

  • splitOnCaseChange="1" causes lowercase => uppercase transitions to generate a new part [Solr 1.3]:
    • "PowerShot" => "Power" "Shot"
    • "TransAM" => "Trans" "AM"
    • default is true ("1"); set to 0 to turn off
  • splitOnNumerics="1" causes alphabet => number transitions to generate a new part [Solr 1.3]:
    • "j2se" => "j" "2" "se"
    • default is true ("1"); set to 0 to turn off
  • stemEnglishPossessive="1" causes trailing "'s" to be removed for each subword.
    • "Doug's" => "Doug"
    • default is true ("1"); set to 0 to turn off

There are also a number of parameters that affect what tokens are present in the final output and if subwords are combined:

  • generateWordParts="1" causes parts of words to be generated:
    • "PowerShot" => "Power" "Shot" (if splitOnCaseChange=1)
    • "Power-Shot" => "Power" "Shot"
    • default is 0
  • generateNumberParts="1" causes number subwords to be generated:
    • "500-42" => "500" "42"
    • default is 0
  • catenateWords="1" causes maximum runs of word parts to be catenated:
    • "wi-fi" => "wifi"
    • default is 0
  • catenateNumbers="1" causes maximum runs of number parts to be catenated:
    • "500-42" => "50042"
    • default is 0
  • catenateAll="1" causes all subword parts to be catenated:
    • "wi-fi-4000" => "wifi4000"
    • default is 0
  • preserveOriginal="1" causes the original token to be indexed without modifications (in addition to the tokens produced by the other options)
    • default is 0

Example of generateWordParts="1" and catenateWords="1":

  • "PowerShot" -> 0:"Power", 1:"Shot", 1:"PowerShot" (where 0,1,1 are token positions)
  • "A's+B's&C's" -> 0:"A", 1:"B", 2:"C", 2:"ABC"
  • "Super-Duper-XL500-42-AutoCoder!" -> 0:"Super", 1:"Duper", 2:"XL", 2:"SuperDuperXL", 3:"500", 4:"42", 5:"Auto", 6:"Coder", 6:"AutoCoder"
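A deliberately simplified sketch of the splitting rules (illustration only; the real filter has many more rules covering possessives, numerics, position increments, and run catenation):

```python
# Split on non-alphanumerics and lower->upper case changes; optionally
# also emit the catenation of the word parts, as catenateWords="1" would.
import re

def word_delimiter(token, catenate_words=False):
    parts = [p for p in re.split(r"[^0-9A-Za-z]+", token) if p]
    subwords = []
    for p in parts:
        # split on case changes and letter/number boundaries
        subwords.extend(re.findall(r"[A-Z]+[a-z]*|[a-z]+|[0-9]+", p))
    if catenate_words:
        subwords.append("".join(subwords))
    return subwords

print(word_delimiter("PowerShot"))                   # -> ['Power', 'Shot']
print(word_delimiter("wi-fi", catenate_words=True))  # -> ['wi', 'fi', 'wifi']
```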
solr.SynonymFilterFactory Matches strings of tokens and replaces them with other strings of tokens.

  • The synonyms parameter names an external file defining the synonyms.
  • If ignoreCase is true, matching will lowercase before checking equality.
  • If expand is true, a synonym will be expanded to all equivalent synonyms. If it is false, all equivalent synonyms will be reduced to the first in the list.

Synonym file format:
i-pod, i pod => ipod
sea biscuit, sea biscit => seabiscuit
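The expand/reduce behaviour for an equivalence group can be sketched like this (illustration only, using the lowercase forms of a hypothetical group):

```python
# One equivalence group of synonyms (lowercased, since ignoreCase=true).
GROUP = ["television", "televisions", "tv", "tvs"]

def synonyms(token, expand=True):
    if token.lower() in GROUP:
        # expand=true: emit all equivalents; expand=false: reduce to first
        return GROUP[:] if expand else [GROUP[0]]
    return [token]

print(synonyms("TV", expand=True))   # all four equivalent terms
print(synonyms("TV", expand=False))  # just the first term in the list
```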

solr.StopFilterFactory Discards common words. <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/> The stop words file should be in /solr/conf. Format:
#Standard english stop words
a
an
solr.SnowballPorterFilterFactory Uses the Tartarus Snowball stemmer framework, which supports many languages; set the "language" attribute. Unlike PorterStemFilterFactory, which is English-only, Snowball covers multiple languages, and its English stemmer is a revised version of the Porter algorithm. Example: running gives run
solr.HyphenatedWordsFilterFactory Combines words split by hyphens. Use only at indexing time.
solr.KeepWordFilterFactory Retains only words specified in the “words” file.
solr.LengthFilterFactory Retains only tokens whose length falls between “min” and “max”
solr.LowerCaseFilterFactory Changes all text to lower case.
solr.PorterStemFilterFactory Transforms the token stream according to the Porter stemming algorithm. The input token stream should already be lowercased (pass it through a LowerCaseFilter first). Example: running is tokenized to run
solr.ReversedWildcardFilterFactory
solr.ReverseStringFilterFactory
Useful if leading-wildcard queries like "*apache" should be supported. Factory for ReversedWildcardFilters. When this factory is added to an analysis chain, it is used both for filtering the tokens during indexing and for determining how queries against this field are processed during search. This class supports the following init arguments:

  • withOriginal – if true, produce both original and reversed tokens at the same positions. If false, produce only reversed tokens.
  • maxPosAsterisk – maximum position (1-based) of the asterisk wildcard ('*') that triggers reversal of the query term. An asterisk at a position higher than this value will not cause reversal. Defaults to 2, meaning that asterisks at positions 1 and 2 will cause a reversal.
  • maxPosQuestion – maximum position (1-based) of the question mark wildcard ('?') that triggers reversal of the query term. Defaults to 1. Set this to 0, and maxPosAsterisk to 1, to reverse only pure suffix queries (i.e. ones with a single leading asterisk).
  • maxFractionAsterisk – additional parameter that triggers reversal if the asterisk's ('*') position is less than this fraction of the query token length. Defaults to 0.0f (disabled).
  • minTrailing – minimum number of trailing characters in the query token after the last wildcard character. For good performance this should be set to a value larger than 1. Defaults to 2.

Note 1: This filter always reverses input tokens during indexing. Note 2: Query tokens without wildcard characters will never be reversed.
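Why reversal helps can be shown with a tiny sketch (illustration only, not the Lucene implementation): a leading-wildcard query becomes a cheap prefix match against the reversed indexed token.

```python
# Tokens are indexed reversed, so "*shot" turns into a prefix query.

def reverse_index_token(token):
    return token[::-1]

indexed = reverse_index_token("powershot")   # 'tohsrewop'
prefix = "*shot".lstrip("*")[::-1]           # 'tohs'
print(indexed.startswith(prefix))            # -> True
```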

 

Predefined text field types (in v1.4.x schema)

The default deployment contains a set of predefined text field types. The following table gives their tokenization details and examples.

text Indexing behaviour:

<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<!-- Case insensitive stop word removal. Add enablePositionIncrements=true in both the index and query analyzers to leave a 'gap' for more accurate phrase queries. -->
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>

– Tokenizes at whitespaces

– Stop words are removed

– Word delimiters are used to generate word tokens.

generateWordParts=1 => wi-fi will generate wi and fi

generateNumberParts = 1 => 3.5 will generate 3 and 5

catenateWords=1 => wi-fi will generate wi, fi and wifi

catenateNumbers = 1 => 3.5 will generate 3,5 and 35

catenateAll = 1 => wi-fi-35 will generate wi, fi, wifi, 35 and wifi35.

catenateAll = 0 => wi-fi-35 will generate wi, fi, wifi and 35, but not wifi35.

splitOnCaseChange=1 => camelCase will generate camel and case.

– All text is changed to lower case.

– The Snowball Porter stemmer will convert running to "run"

Querying behaviour:

<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>

In querying, only the synonym filter is additional. So something like TV which is in the synonym group “Television, Televisions, TV, TVs” results in this query token stream: televis televis tv tvs (“televis” is because “television” has been stemmed by Snowball Porter).

textgen Very similar to “text” but without stemming.
Indexing behaviour:

<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- Case insensitive stop word removal. Add enablePositionIncrements=true in both the index and query analyzers to leave a 'gap' for more accurate phrase queries. -->
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>

– Tokenizes at whitespaces

– Stop words are removed

– Word delimiters are used to generate word tokens.

generateWordParts=1 => wi-fi will generate wi and fi

generateNumberParts = 1 => 3.5 will generate 3 and 5

catenateWords=1 => wi-fi will generate wi, fi and wifi

catenateNumbers = 1 => 3.5 will generate 3,5 and 35

catenateAll = 1 => wi-fi-35 will generate wi, fi, wifi, 35 and wifi35.

catenateAll = 0 => wi-fi-35 will generate wi, fi, wifi and 35, but not wifi35.

splitOnCaseChange=1 => camelCase will generate camel and case.

– All text is changed to lower case.

Note that there is no stemmer, which is what makes this different from “text” type.

Querying behaviour:

<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>

In querying, only the synonym filter is additional. So something like TV which is in the synonym group “Television, Televisions, TV, TVs” results in this query token stream: television televisions tv tvs

For file paths and filenames, “textgen” seems to give the most appropriate results.

textTight Very similar again to "text", but differs in that WordDelimiterFilter has generateWordParts="0" and generateNumberParts="0". So "wi-fi" will give just "wifi", "HELLO_WORLD" will give just "helloworld", and "d:\filepath\filename.ext" will give just "dfilepathfilenameext".
text_ws Just simple whitespace tokenization.
text_rev Similar to "textgen", this is a general unstemmed text field type that indexes tokens normally and also reversed (via ReversedWildcardFilterFactory), to enable more efficient leading-wildcard queries.

Copied from: http://www.pathbreak.com/blog/solr-text-field-types-analyzers-tokenizers-filters-explained

Posted by on October 7, 2015 in Solr

 
