
Category Archives: Solr

How to install and configure Solr 6 on Ubuntu 16.04

By: www.howtoforge.com

What is Apache Solr? Apache Solr is an open-source, enterprise-class search platform written in Java that enables you to create custom search engines which index databases, files, and websites. It is built on top of Apache Lucene. It can, for example, be used to search across multiple websites and to show recommendations for the searched content. Solr uses an XML (Extensible Markup Language) based query and result language. APIs (application programming interfaces) are available for languages such as Python and Ruby, and results can also be returned as JSON (JavaScript Object Notation).

Some other features that Solr provides are:

  • Full-Text Search.
  • Snippet generation and highlighting.
  • Custom Document ordering/ranking.
  • Spell Suggestions.

This tutorial will show you how to install the latest Solr version on Ubuntu 16.04 LTS. The steps will most likely work with later Ubuntu versions as well.

Update your System

Log in to your Ubuntu server as a non-root user with sudo privileges. You will perform all the following steps, and later use Solr, as this user.

Execute the following command to update your system with the latest patches and updates:

sudo apt-get update && sudo apt-get upgrade -y


Setting up the Java Runtime Environment

Solr is a Java application, so the Java runtime environment needs to be installed first in order to set up Solr.

We need to install the python-software-properties package to be able to add the PPA repository that provides the latest Java 8. Run the following command to install it:

root@server1:~# sudo apt-get install python-software-properties
Reading package lists… Done
Building dependency tree
Reading state information… Done
The following additional packages will be installed:
libpython-stdlib libpython2.7-minimal libpython2.7-stdlib python python-apt
python-minimal python-pycurl python2.7 python2.7-minimal
Suggested packages:
python-doc python-tk python-apt-dbg python-apt-doc libcurl4-gnutls-dev
python-pycurl-dbg python-pycurl-doc python2.7-doc binutils binfmt-support
The following NEW packages will be installed:
libpython-stdlib libpython2.7-minimal libpython2.7-stdlib python python-apt
python-minimal python-pycurl python-software-properties python2.7
python2.7-minimal
0 upgraded, 10 newly installed, 0 to remove and 3 not upgraded.
Need to get 4,070 kB of archives.
After this operation, 17.3 MB of additional disk space will be used.
Do you want to continue? [Y/n]

Press Y to continue.


After executing the command, add the webupd8team Java PPA repository to your system by running:

sudo add-apt-repository ppa:webupd8team/java

Press [ENTER] when requested. Now, you can easily install the latest version of Java 8 with apt.

First, update the package lists to fetch the available packages from the new PPA:

sudo apt-get update


Then install the latest version of Oracle Java 8 with this command:

sudo apt-get install oracle-java8-installer

Alternatively, you can install OpenJDK instead of the Oracle packages. For the full JDK:

sudo apt-get install openjdk-8-jdk

Or for just the JRE:

sudo apt-get install openjdk-8-jre

root@server1:~# sudo apt-get install oracle-java8-installer
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following additional packages will be installed:
 binutils gsfonts gsfonts-x11 java-common libfontenc1 libxfont1 x11-common xfonts-encodings xfonts-utils
Suggested packages:
 binutils-doc binfmt-support visualvm ttf-baekmuk | ttf-unfonts | ttf-unfonts-core ttf-kochi-gothic | ttf-sazanami-gothic ttf-kochi-mincho | ttf-sazanami-mincho ttf-arphic-uming firefox
 | firefox-2 | iceweasel | mozilla-firefox | iceape-browser | mozilla-browser | epiphany-gecko | epiphany-webkit | epiphany-browser | galeon | midbrowser | moblin-web-browser | xulrunner
 | xulrunner-1.9 | konqueror | chromium-browser | midori | google-chrome
The following NEW packages will be installed:
 binutils gsfonts gsfonts-x11 java-common libfontenc1 libxfont1 oracle-java8-installer x11-common xfonts-encodings xfonts-utils
0 upgraded, 10 newly installed, 0 to remove and 3 not upgraded.
Need to get 6,498 kB of archives.
After this operation, 20.5 MB of additional disk space will be used.
Do you want to continue? [Y/n]

Press Y to continue.

You must agree to the Oracle license (available at http://java.com/license) to use the Oracle JDK. Select OK to accept it when prompted.


The package installs a small meta-installer which downloads the Java binaries directly from Oracle. After the installation has finished, check the installed Java version by running the following command:

java -version

java version "1.8.0_91"
Java(TM) SE Runtime Environment (build 1.8.0_91-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.91-b14, mixed mode)

Now that Java 8 is installed, we can move to the next step.

Installing the Solr application

Solr can be installed on Ubuntu in different ways; in this article, I will show you how to install the latest release from the official binary distribution.

We will begin by downloading the Solr distribution. First, find the latest version of the package on the Apache download page, copy its link, and download it using the wget command.

For this setup, we will use http://www.us.apache.org/dist/lucene/solr/6.0.1/

cd /tmp
wget http://www.us.apache.org/dist/lucene/solr/6.0.1/solr-6.0.1.tgz

root@server1:/tmp# wget http://www.us.apache.org/dist/lucene/solr/6.0.1/solr-6.0.1.tgz
--2016-06-03 11:31:54-- http://www.us.apache.org/dist/lucene/solr/6.0.1/solr-6.0.1.tgz
Resolving www.us.apache.org (www.us.apache.org)... 140.211.11.105
Connecting to www.us.apache.org (www.us.apache.org)|140.211.11.105|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 137924507 (132M) [application/x-gzip]
Saving to: ‘solr-6.0.1.tgz’

Now run the command below to extract the service installation script:

tar xzf solr-6.0.1.tgz solr-6.0.1/bin/install_solr_service.sh --strip-components=2

And install Solr as a service using the script:

sudo ./install_solr_service.sh solr-6.0.1.tgz

The output will be similar to this:

 root@server1:/tmp# sudo ./install_solr_service.sh solr-6.0.1.tgz
id: ‘solr’: no such user
Creating new user: solr
Adding system user `solr' (UID 111) ...
Adding new group `solr' (GID 117) ...
Adding new user `solr' (UID 111) with group `solr' ...
Creating home directory `/var/solr' ...

Extracting solr-6.0.1.tgz to /opt


Installing symlink /opt/solr -> /opt/solr-6.0.1 ...


Installing /etc/init.d/solr script ...


Installing /etc/default/solr.in.sh ...

● solr.service - LSB: Controls Apache Solr as a Service
 Loaded: loaded (/etc/init.d/solr; bad; vendor preset: enabled)
 Active: active (exited) since Fri 2016-06-03 11:37:05 CEST; 5s ago
 Docs: man:systemd-sysv-generator(8)
 Process: 20929 ExecStart=/etc/init.d/solr start (code=exited, status=0/SUCCESS)

Jun 03 11:36:43 server1 systemd[1]: Starting LSB: Controls Apache Solr as a Service...
Jun 03 11:36:44 server1 su[20934]: Successful su for solr by root
Jun 03 11:36:44 server1 su[20934]: + ??? root:solr
Jun 03 11:36:44 server1 su[20934]: pam_unix(su:session): session opened for user solr by (uid=0)
Jun 03 11:37:05 server1 solr[20929]: [313B blob data]
Jun 03 11:37:05 server1 solr[20929]: Started Solr server on port 8983 (pid=20989). Happy searching!
Jun 03 11:37:05 server1 solr[20929]: [14B blob data]
Jun 03 11:37:05 server1 systemd[1]: Started LSB: Controls Apache Solr as a Service.
Service solr installed.

Use this command to check the status of the service

service solr status

You should see an output that begins with this:

root@server1:/tmp# service solr status
● solr.service - LSB: Controls Apache Solr as a Service
 Loaded: loaded (/etc/init.d/solr; bad; vendor preset: enabled)
 Active: active (exited) since Fri 2016-06-03 11:37:05 CEST; 39s ago
 Docs: man:systemd-sysv-generator(8)
 Process: 20929 ExecStart=/etc/init.d/solr start (code=exited, status=0/SUCCESS)

Jun 03 11:36:43 server1 systemd[1]: Starting LSB: Controls Apache Solr as a Service...
Jun 03 11:36:44 server1 su[20934]: Successful su for solr by root
Jun 03 11:36:44 server1 su[20934]: + ??? root:solr
Jun 03 11:36:44 server1 su[20934]: pam_unix(su:session): session opened for user solr by (uid=0)
Jun 03 11:37:05 server1 solr[20929]: [313B blob data]
Jun 03 11:37:05 server1 solr[20929]: Started Solr server on port 8983 (pid=20989). Happy searching!
Jun 03 11:37:05 server1 solr[20929]: [14B blob data]
Jun 03 11:37:05 server1 systemd[1]: Started LSB: Controls Apache Solr as a Service.

Creating a Solr search collection

Using Solr, we can create multiple collections. Run the command below, giving the name of the new collection (here gettingstarted) and specifying its configuration set:

sudo su - solr -c "/opt/solr/bin/solr create -c gettingstarted -n data_driven_schema_configs"

root@server1:/tmp# sudo su - solr -c "/opt/solr/bin/solr create -c gettingstarted -n data_driven_schema_configs"

Copying configuration to new core instance directory:
/var/solr/data/gettingstarted

Creating new core 'gettingstarted' using command:
http://localhost:8983/solr/admin/cores?action=CREATE&name=gettingstarted&instanceDir=gettingstarted

{
 "responseHeader":{
 "status":0,
 "QTime":4427},
 "core":"gettingstarted"}

The new core directory for our first collection has been created. To view the default schema files, go to:

/opt/solr/server/solr/configsets/data_driven_schema_configs/conf
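
Solr 6 ships with a handful of example configsets in this directory. To see which ones are available on your installation, you can simply list it:

ls /opt/solr/server/solr/configsets/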

 

Use the Solr Web Interface

Apache Solr is now accessible on its default port, 8983. The admin UI should be reachable at http://your_server_ip:8983/solr. Your firewall must allow traffic on this port for the links below to work.

For example:

http://192.168.1.100:8983/solr/
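
If the admin UI does not load and your server runs UFW (Ubuntu's default firewall frontend), a rule along these lines opens the port. This is a sketch; adapt it to whatever firewall you actually use:

sudo ufw allow 8983/tcp
sudo ufw status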


To see the details of the first collection that we created earlier, select the “gettingstarted” collection in the left menu.


After you have selected the “gettingstarted” collection, select Documents in the left menu. There you can enter real data in JSON format that will be searchable by Solr. To add data, copy and paste the following example JSON into the Document(s) field:

{
 "id": 1,
 "book_title": "My First Book",
 "published": 1985,
 "description": "All about Linux"
}

Click on the Submit Document button after adding the data. After a few moments, you will see:

Status: success
Response:

{
 "responseHeader": {
 "status": 0,
 "QTime": 189
 }
}
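
If you prefer the command line, the same document can be posted to Solr's JSON update endpoint. A sketch, assuming Solr is running locally on the default port:

curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/gettingstarted/update/json/docs?commit=true' \
  -d '{"id": 1, "book_title": "My First Book", "published": 1985, "description": "All about Linux"}'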

Now click on Query in the left menu and then click on Execute Query.


We will see something like this:

{
  "responseHeader":{
    "status":0,
    "QTime":24,
    "params":{
      "q":"*:*",
      "indent":"on",
      "wt":"json",
      "_":"1464947017056"}},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "id":"1",
        "book_title":["My First Book"],
        "published":[1985],
        "description":["All about Linux"],
        "_version_":1536108205792296960}]
  }}
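
The same query can also be run from the command line against the collection's select endpoint, using the parameters shown in the responseHeader above:

curl 'http://localhost:8983/solr/gettingstarted/select?q=*:*&wt=json&indent=on'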

Virtual machine image download of this tutorial

This tutorial is available as a ready-to-use virtual machine image in ovf/ova format for Howtoforge subscribers. The VM format is compatible with VMware and VirtualBox. The virtual machine image uses the following login details:

SSH / Shell Login

Username: administrator
Password: howtoforge

This user has sudo rights.

Please change all the above passwords to secure the virtual machine.

Conclusion

After successfully installing Solr on Ubuntu, you can now insert and query data through the Solr API and web interface.

 

Copy from: https://www.howtoforge.com/tutorial/how-to-install-and-configure-solr-on-ubuntu-1604/


Posted by on June 28, 2016 in Application Server, Linux, Solr, Ubuntu

 

Defining text field types in schema.xml

By: 

Overview

Solr’s world view consists of documents, where each document consists of searchable fields. The rules for searching each field are defined using field type definitions. A field type definition describes the analyzers, tokenizers and filters which control searching behaviour for all fields of that type.

 

When a document is added or updated, its fields are analyzed and tokenized, and those tokens are stored in Solr's index. When a query is sent, it is likewise analyzed and tokenized, and the resulting tokens are matched against the tokens in the index. This critical function of tokenization is handled by Tokenizer components.

 

In addition to tokenizers, there are TokenFilter components, whose job is to modify the token stream.

There are also CharFilter components, whose job is to modify individual characters. For example, HTML text can be filtered to convert entities such as &amp; into a regular & character.

 

Defining text field types in schema.xml

Here’s a typical text field type definition:

<fieldType name="textgen" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

What this type definition specifies is:

  • When indexing a field of this type, use an analyzer composed of
    • a WhitespaceTokenizerFactory object
    • a StopFilterFactory
    • a WordDelimiterFilterFactory
    • a LowerCaseFilterFactory
  • When querying a field of this type, use an analyzer composed of
    • a WhitespaceTokenizerFactory object
    • a SynonymFilterFactory
    • a StopFilterFactory
    • a WordDelimiterFilterFactory
    • a LowerCaseFilterFactory

If there is only one analyzer element, then the same analyzer is used for both indexing and querying.

    It’s important to use the same or similar analyzers that process text in a compatible manner at index and query time. For example, if an indexing analyzer lowercases words, then the query analyzer should do the same to enable finding the indexed words.
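
To check that the two analyzers really do produce compatible tokens, you can run sample text through a field type with Solr's field analysis endpoint. A sketch, assuming a core named mycore in which this textgen type is defined:

curl 'http://localhost:8983/solr/mycore/analysis/field?analysis.fieldtype=textgen&analysis.fieldvalue=Wi-Fi+PowerShot'

The response lists the token stream after each tokenizer and filter stage, for both the index and query analyzers.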

 

Under the hood

Solr builds a TokenizerChain instance for each of these analyzers. A TokenizerChain is composed of 1 TokenizerFactory instance, 0-n TokenFilterFactory instances, and 0-n CharFilterFactory instances. These factory instances are responsible for creating their respective objects from the Lucene framework. For example, a TokenizerFactory creates a Lucene Tokenizer; its concrete implementation WhitespaceTokenizerFactory creates a Lucene WhitespaceTokenizer.

In terms of class design, a TokenizerChain works as follows:

  • Raw input is provided by a Reader instance
  • CharReader (is-a CharStream) wraps the raw Reader
  • Each CharFilterFactory creates a character filter that modifies input CharStream and outputs a CharStream. So CharFilterFactories can be chained.
  • TokenizerFactory creates a Tokenizer from the CharStream.
  • Tokenizer is-a TokenStream, and can be passed to TokenFilterFactories.
  • Each TokenFilterFactory modifies the token stream and outputs another TokenStream. So these can be chained.

Commonly used CharFilterFactories

solr.MappingCharFilterFactory Maps a set of characters to another set of characters. The mapping file is specified by the mapping attribute and should be present under /solr/conf.
Example: <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
The mapping file should have this format:

# Ä => A
"\u00C4" => "A"
# Å => A
"\u00C5" => "A"

solr.HTMLStripCharFilterFactory Strips HTML/XML from the input stream. The input need not be a complete HTML document, as only constructs that look like HTML are removed. It:

  • Removes HTML/XML tags while keeping the content.
  • Removes attributes within tags (attribute quoting is optional).
  • Removes XML processing instructions: <?foo bar?>
  • Removes XML comments.
  • Removes XML elements starting with <! and ending with >.
  • Removes the contents of <script> and <style> elements.
  • Handles XML comments inside these elements (normal comment processing won't always work).
  • Replaces numeric character entity references like &#65;. The terminating ';' is optional if the entity reference is followed by whitespace.
  • Replaces all named character entity references; here the terminating ';' is mandatory to avoid false matches on something like "Alpha&Omega Corp".

Example: <charFilter class="solr.HTMLStripCharFilterFactory"/>
The text
my <a href="www.foo.bar">link</a>
becomes
my link

 

Commonly used TokenizerFactories

solr.WhitespaceTokenizerFactory A tokenizer that divides text at whitespace, as defined by java.lang.Character.isWhitespace(). Adjacent sequences of non-whitespace characters form tokens.

Example: HELLO\t\t\tWORLD.txt (tab characters between the words) is tokenized into 2 tokens: HELLO and WORLD.txt

solr.KeywordTokenizerFactory Treats the entire field as one token, regardless of its content. This is a lot like the "string" field type, in that no tokenization happens at all. Use it if a text field requires no tokenization, but does require char filters or token filtering like LowerCaseFilter and TrimFilter. Example: http://example.com/I-am+example?Text=-Hello is retained as http://example.com/I-am+example?Text=-Hello
solr.StandardTokenizerFactory A good general purpose tokenizer.

  • Splits words at punctuation characters, removing punctuation. However, a dot that’s not followed by whitespace is considered part of a token.
  • Not suitable for file names because the .extension is treated as part of token.
  • Splits words at hyphens, unless there’s a number in the token, in which case the whole token is interpreted as a product number and is not split.
  • Recognizes email addresses and internet hostnames as one token.

Example: the input This sentence can't be "tokenized_Correctly" by http://www.google.com or IBM or NATO 10.1.9.5 test@email.org product-number 123-456949 file.txt is tokenized as: This sentence can't be tokenized Correctly by http://www.google.com or IBM or NATO 10.1.9.5 test@email.org product number 123-456949 file.txt

solr.PatternTokenizerFactory Uses regex pattern matching to construct distinct tokens for the input stream. It takes two arguments: “pattern” and “group”.

  • “pattern” is the regular expression.
  • “group” says which group to extract into tokens.

group=-1 (the default) is equivalent to a "split" operation. In this case, the tokens are equivalent to the output of String.split(java.lang.String), without empty tokens. Using group >= 0 selects the matching group as the token. For example, if you have:

  pattern = '([^']+)'
  group = 0
  input = aaa 'bbb' 'ccc'

the output will be two tokens: ‘bbb’ and ‘ccc’ (including the ‘ marks). With the same input but using group=1, the output would be: bbb and ccc (no ‘ marks)

solr.NGramTokenizerFactory It is not always clear when to use this, but the idea is that the input is split into 1-sized, then 2-sized, then 3-sized, etc. tokens; it can be useful for partial matching. It takes "minGram" and "maxGram" arguments. Example (with minGram=1, maxGram=2): email becomes e m a i l em ma ai il

 

Commonly used TokenFilterFactories

solr.WordDelimiterFilterFactory Splits words into subwords and performs optional transformations on subword groups. One use for WordDelimiterFilter is to help match words with different delimiters. One way of doing so is to specify generateWordParts="1" catenateWords="1" in the analyzer used for indexing, and generateWordParts="1" in the analyzer used for querying. Given that the current StandardTokenizer immediately removes many intra-word delimiters, it is recommended that this filter be used after a tokenizer that leaves them in place (such as WhitespaceTokenizer). By default, words are split into subwords with the following rules:

  • split on intra-word delimiters (all non alpha-numeric characters).
    • "Wi-Fi" -> "Wi", "Fi"
  • split on case transitions (can be turned off – see splitOnCaseChange parameter)
    • "PowerShot" -> "Power", "Shot"
  • split on letter-number transitions (can be turned off – see splitOnNumerics parameter)
    • "SD500" -> "SD", "500"
  • leading and trailing intra-word delimiters on each subword are ignored
    • "//hello---there, 'dude'" -> "hello", "there", "dude"
  • trailing “‘s” are removed for each subword (can be turned off – see stemEnglishPossessive parameter)
    • "O'Neil's" -> "O", "Neil"
      • Note: this step isn’t performed in a separate filter because of possible subword combinations.

Splitting is affected by the following parameters:

  • splitOnCaseChange="1" causes lowercase => uppercase transitions to generate a new part [Solr 1.3]:
    • "PowerShot" => "Power" "Shot"
    • "TransAM" => "Trans" "AM"
    • default is true ("1"); set to 0 to turn off
  • splitOnNumerics="1" causes alphabet => number transitions to generate a new part [Solr 1.3]:
    • "j2se" => "j" "2" "se"
    • default is true ("1"); set to 0 to turn off
  • stemEnglishPossessive="1" causes trailing "'s" to be removed for each subword.
    • "Doug's" => "Doug"
    • default is true ("1"); set to 0 to turn off

There are also a number of parameters that affect what tokens are present in the final output and if subwords are combined:

  • generateWordParts="1" causes parts of words to be generated:
    • "PowerShot" => "Power" "Shot" (if splitOnCaseChange=1)
    • "Power-Shot" => "Power" "Shot"
    • default is 0
  • generateNumberParts="1" causes number subwords to be generated:
    • "500-42" => "500" "42"
    • default is 0
  • catenateWords="1" causes maximum runs of word parts to be catenated:
    • "wi-fi" => "wifi"
    • default is 0
  • catenateNumbers="1" causes maximum runs of number parts to be catenated:
    • "500-42" => "50042"
    • default is 0
  • catenateAll="1" causes all subword parts to be catenated:
    • "wi-fi-4000" => "wifi4000"
    • default is 0
  • preserveOriginal="1" causes the original token to be indexed without modifications (in addition to the tokens produced by the other options)
    • default is 0

Example of generateWordParts="1" and catenateWords="1":

  • "PowerShot" -> 0:"Power", 1:"Shot", 1:"PowerShot" (where 0,1,1 are token positions)
  • "A's+B's&C's" -> 0:"A", 1:"B", 2:"C", 2:"ABC"
  • "Super-Duper-XL500-42-AutoCoder!" -> 0:"Super", 1:"Duper", 2:"XL", 2:"SuperDuperXL", 3:"500", 4:"42", 5:"Auto", 6:"Coder", 6:"AutoCoder"
solr.SynonymFilterFactory Matches strings of tokens and replaces them with other strings of tokens.

  • The synonyms parameter names an external file defining the synonyms.
  • If ignoreCase is true, matching will lowercase before checking equality.
  • If expand is true, a synonym will be expanded to all equivalent synonyms. If it is false, all equivalent synonyms will be reduced to the first in the list.

Synonym file format:

i-pod, i pod => ipod
sea biscuit, sea biscit => seabiscuit

solr.StopFilterFactory Discards common words. Example: <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/> The stop words file should be in /solr/conf, with this format:

# Standard english stop words
a
an
solr.SnowballPorterFilterFactory Uses the Tartarus Snowball stemmer framework for different languages; set the "language" attribute. It is not clear how this differs from PorterStemFilterFactory. Example: running gives run
solr.HyphenatedWordsFilterFactory Combines words split by hyphens. Use only at indexing time.
solr.KeepWordFilterFactory Retains only words specified in the “words” file.
solr.LengthFilterFactory Retains only tokens whose length falls between “min” and “max”
solr.LowerCaseFilterFactory Changes all text to lower case.
solr.PorterStemFilterFactory Transforms the token stream according to the Porter stemming algorithm. The input token stream should already be lowercase (pass it through a LowerCaseFilter first). Example: running is tokenized to run
solr.ReversedWildcardFilterFactory
solr.ReverseStringFilterFactory
Useful if leading wildcard queries like "*pache" should be supported efficiently. This is a factory for ReversedWildcardFilter instances. When this factory is added to an analysis chain, it will be used both for filtering the tokens during indexing and to determine the query processing of this field during search. This class supports the following init arguments:

  • withOriginal – if true, then produce both original and reversed tokens at the same positions. If false, then produce only reversed tokens.
  • maxPosAsterisk – maximum position (1-based) of the asterisk wildcard (‘*’) that triggers the reversal of query term. Asterisk that occurs at positions higher than this value will not cause the reversal of query term. Defaults to 2, meaning that asterisks on positions 1 and 2 will cause a reversal.
  • maxPosQuestion – maximum position (1-based) of the question mark wildcard ('?') that triggers the reversal of the query term. Defaults to 1. Set this to 0, and maxPosAsterisk to 1, to reverse only pure suffix queries (i.e. ones with a single leading asterisk).
  • maxFractionAsterisk – additional parameter that triggers the reversal if asterisk (‘*’) position is less than this fraction of the query token length. Defaults to 0.0f (disabled).
  • minTrailing – minimum number of trailing characters in query token after the last wildcard character. For good performance this should be set to a value larger than 1. Defaults to 2.

Note 1: This filter always reverses input tokens during indexing. Note 2: Query tokens without wildcard characters will never be reversed.

 

Predefined text field types (in v1.4.x schema)

The default deployment contains a set of predefined text field types. The following table gives their tokenization details and examples.

text
Indexing behaviour:

<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<!-- Case insensitive stop word removal. add enablePositionIncrements=true in both the index and query analyzers to leave a ‘gap’ for more accurate phrase queries. -->
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>

– Tokenizes at whitespaces

– Stop words are removed

– Word delimiters are used to generate word tokens.

generateWordParts=1 => wi-fi will generate wi and fi

generateNumberParts = 1 => 3.5 will generate 3 and 5

catenateWords=1 => wi-fi will generate wi, fi and wifi

catenateNumbers = 1 => 3.5 will generate 3, 5 and 35

catenateAll = 1 => wi-fi-35 will generate wi, fi, wifi, 35 and wifi35.

catenateAll = 0 => wi-fi-35 will generate wi, fi, wifi and 35, but not wifi35.

splitOnCaseChange=1 => camelCase will generate camel and case.

– All text is changed to lower case.

– The Snowball porter stemmer will convert running to “run”

Querying behaviour:

<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>

In querying, only the synonym filter is additional. So something like TV which is in the synonym group “Television, Televisions, TV, TVs” results in this query token stream: televis televis tv tvs (“televis” is because “television” has been stemmed by Snowball Porter).

textgen
Very similar to “text” but without stemming.
Indexing behaviour:

<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- Case insensitive stop word removal. add enablePositionIncrements=true in both the index and query analyzers to leave a ‘gap’ for more accurate phrase queries. -->
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>

– Tokenizes at whitespaces

– Stop words are removed

– Word delimiters are used to generate word tokens.

generateWordParts=1 => wi-fi will generate wi and fi

generateNumberParts = 1 => 3.5 will generate 3 and 5

catenateWords=1 => wi-fi will generate wi, fi and wifi

catenateNumbers = 1 => 3.5 will generate 3, 5 and 35

catenateAll = 1 => wi-fi-35 will generate wi, fi, wifi, 35 and wifi35.

catenateAll = 0 => wi-fi-35 will generate wi, fi, wifi and 35, but not wifi35.

splitOnCaseChange=1 => camelCase will generate camel and case.

– All text is changed to lower case.

Note that there is no stemmer, which is what makes this different from “text” type.

Querying behaviour:

<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>

In querying, only the synonym filter is additional. So something like TV which is in the synonym group “Television, Televisions, TV, TVs” results in this query token stream: television televisions tv tvs

For file paths and filenames, “textgen” seems to give the most appropriate results.

textTight
Very similar again to “text”, but differs in that WordDelimiterFilter has generateWordParts="0" and generateNumberParts="0". So "wi-fi" will give just "wifi", "HELLO_WORLD" will give just "helloworld", and "d:\file\path\filename.ext" will give just "dfilepathfilenameext".
text_ws
Just simple whitespace tokenization.
text_rev
Similar to “textgen”, this is a general unstemmed text field type that indexes tokens normally and also reversed (via ReversedWildcardFilterFactory), to enable more efficient leading wildcard queries.

Copy from: http://www.pathbreak.com/blog/solr-text-field-types-analyzers-tokenizers-filters-explained

 

Posted by on October 7, 2015 in Solr

 

SQL Server in Solr using Data Import Handler

By: 

MS SQL Server connector

Download Microsoft JDBC Driver 4.0 for SQL Server from: http://www.microsoft.com/en-us/download/details.aspx?displaylang=en&id=11774

Copy the file 'sqljdbc4.jar' to 'contrib/dataimporthandler/lib'.
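
For example, from the Solr installation root (the exact paths depend on your layout, and the lib directory may need to be created first; the source path is a placeholder):

mkdir -p contrib/dataimporthandler/lib
cp /path/to/sqljdbc4.jar contrib/dataimporthandler/lib/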

solrconfig.xml

Copy solrconfig.xml from an existing collection. (My full version of solrconfig.xml is included in the source gist linked at the end of this post.)

Edit solrconfig.xml by adding:

<lib dir="../../contrib/dataimporthandler/lib" regex=".*\.jar" />
<lib dir="../../dist/" regex="solr-dataimporthandler-.*\.jar" />

Make sure that the 'dist' folder contains the two data import handler files:

  • solr-dataimporthandler-4.10.2.jar
  • solr-dataimporthandler-extras-4.10.2.jar

Add these lines to solrconfig.xml:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
    <str name="config">data-config.xml</str>
    </lst>
</requestHandler>

data-config.xml for SQL Server database

<dataConfig>
  <dataSource type="JdbcDataSource" 
              driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" 
              url="jdbc:sqlserver://servername\instancename;databaseName=mydb"   
              user="sa" 
              password="mypass"/>
  <document>
    <entity name="product"  
      pk="id"
      query="select id,name from products"
      deltaImportQuery="SELECT id,name from products WHERE id='${dih.delta.id}'"
      deltaQuery="SELECT id FROM products  WHERE updated_at > '${dih.last_index_time}'"
      >
       <field column="id" name="id"/>
       <field column="name" name="name"/>       
    </entity>
  </document>
</dataConfig>
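
With the handler registered and data-config.xml in place, imports are triggered over HTTP against the /dataimport endpoint. A sketch, assuming a core named products (substitute your own core name):

curl 'http://localhost:8983/solr/products/dataimport?command=full-import'
curl 'http://localhost:8983/solr/products/dataimport?command=delta-import'
curl 'http://localhost:8983/solr/products/dataimport?command=status'

full-import rebuilds the index from the main query, delta-import picks up rows matched by deltaQuery since the last run, and status reports the progress of a running import.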

Copy from: https://gist.github.com/maxivak/3e3ee1fca32f3949f052

 

 

Posted by on July 21, 2015 in Application Server, Solr

 

How To Install Solr 5.2.1 on Ubuntu 14.04

By: Koen Vlaswinkel

Written in collaboration with Solr

Introduction

Solr is a search engine platform based on Apache Lucene. It is written in Java and uses the Lucene library to implement indexing. It can be accessed using a variety of REST-like APIs, with responses in formats including XML and JSON. This is the feature list from their website:

  • Advanced Full-Text Search Capabilities
  • Optimized for High Volume Web Traffic
  • Standards Based Open Interfaces – XML, JSON and HTTP
  • Comprehensive HTML Administration Interfaces
  • Server statistics exposed over JMX for monitoring
  • Linearly scalable, auto index replication, auto failover and recovery
  • Near Real-time indexing
  • Flexible and Adaptable with XML configuration
  • Extensible Plugin Architecture

In this article, we will install Solr using its binary distribution.

Prerequisites

To follow this tutorial, you will need:

  • One Ubuntu 14.04 server with a sudo non-root user.

Step 1 — Installing Java

Solr requires Java, so in this step, we will install it.

The complete Java installation process is thoroughly described in this article, but we’ll use a slightly different process.

First, use apt-get to install python-software-properties:

  • sudo apt-get install python-software-properties

Instead of using the default-jdk or default-jre packages, we’ll install the latest version of Java 8. To do this, add the unofficial Java installer repository:

  • sudo add-apt-repository ppa:webupd8team/java

You will need to press ENTER to accept adding the repository to your index.

Then, update the source list:

  • sudo apt-get update

Last, install Java 8 using apt-get. You will need to agree to the Oracle Binary Code License Agreement for the Java SE Platform Products and JavaFX.

  • sudo apt-get install oracle-java8-installer

Step 2 — Installing Solr

In this section, we will install Solr 5.2.1. We will begin by downloading the Solr distribution.

First, find a suitable mirror on this page. Then, copy the link of solr-5.2.1.tgz from the mirror. For example, we’ll use http://apache.mirror1.spango.com/lucene/solr/5.2.1/.

Then, download the file into your home directory:

  • wget http://apache.mirror1.spango.com/lucene/solr/5.2.1/solr-5.2.1.tgz

Next, extract the service installation file:

  • tar xzf solr-5.2.1.tgz solr-5.2.1/bin/install_solr_service.sh --strip-components=2

And install Solr as a service using the script:

  • sudo bash ./install_solr_service.sh solr-5.2.1.tgz

Finally, check if the server is running:

  • sudo service solr status

You should see an output that begins with this:

Solr status output
Found 1 Solr nodes: 

Solr process 2750 running on port 8983

. . .

Step 3 — Creating a Collection

In this section, we will create a simple Solr collection.

Solr can have multiple collections, but for this example, we will only use one. To create a new collection, use the following command. We run it as the Solr user in this case to avoid any permissions errors.

  • sudo su - solr -c "/opt/solr/bin/solr create -c gettingstarted -n data_driven_schema_configs"

In this command, gettingstarted is the name of the collection and -n specifies the configset. There are 3 config sets supplied by Solr by default; in this case, we have used one that is schemaless, which means that any field can be supplied, with any name, and the type will be guessed.

You have now added the collection and can start adding data. The default schema has only one required field: id. It has no other default fields, only dynamic fields. If you want to have a look at the schema, where everything is explained clearly, have a look at the file /opt/solr/server/solr/gettingstarted/conf/schema.xml.

Step 4 — Adding and Querying Documents

In this section, we will explore the Solr web interface and add some documents to our collection.

When you visit http://your_server_ip:8983/solr using your web browser, the Solr web interface should appear:

Solr Web Interface

The web interface contains a lot of useful information which can be used to debug any problems you encounter during use.

Collections are divided up into cores, which is why there are a lot of references to cores in the web interface. Right now, the collection gettingstarted only contains one core, named gettingstarted. At the left-hand side, the Core Selector pull-down menu is visible, in which you'll be able to select gettingstarted to view more information.

After you've selected the gettingstarted core, select Documents. Documents store the real data that will be searchable by Solr. Because we have used a schemaless configuration, we can use any field. Let's add a single document with the following example JSON representation by copying it into the Document(s) field:

{
    "number": 1,
    "president": "George Washington",
    "birth_year": 1732,
    "death_year": 1799,
    "took_office": "1789-04-30",
    "left_office": "1797-03-04",
    "party": "No Party"
}

Click Submit document to add the document to the index. After a few moments, you will see the following:

Output after adding Document
Status: success
Response:
{
  "responseHeader": {
    "status": 0,
    "QTime": 509
  }
}

You can add more documents, with a similar or a completely different structure, but you can also continue with just one document.

Now, select Query on the left to query the document we just added. With the default values in this screen, after clicking on Execute Query, you will see 10 documents at most, depending on how many you added:

Query output
{
  "responseHeader": {
    "status": 0,
    "QTime": 58,
    "params": {
      "q": "*:*",
      "indent": "true",
      "wt": "json",
      "_": "1436827539345"
    }
  },
  "response": {
    "numFound": 1,
    "start": 0,
    "docs": [
      {
        "number": [
          1
        ],
        "president": [
          "George Washington"
        ],
        "birth_year": [
          1732
        ],
        "death_year": [
          1799
        ],
        "took_office": [
          "1789-04-30T00:00:00Z"
        ],
        "left_office": [
          "1797-03-04T00:00:00Z"
        ],
        "party": [
          "No Party"
        ],
        "id": "1ce12ed2-add9-4c65-aeb4-a3c6efb1c5d1",
        "_version_": 1506622425947701200
      }
    ]
  }
}

Conclusion

There are many more options available, but you have now successfully installed Solr and can start using it for your own site.

 

Posted by on July 21, 2015 in Application Server, Solr

 

Facet dynamic fields with apache solr

By: Nicholas Piasecki

I was in a similar situation when working on an e-commerce platform. Each item had static fields (Price, Name, Category) that easily mapped to SOLR's schema.xml, but each item could also have a dynamic number of variations.

For example, a t-shirt in the store could have Color (Black, White, Red, etc.) and Size (Small, Medium, etc.) attributes, whereas a candle in the same store could have a Scent (Pumpkin, Vanilla, etc.) variation. Essentially, this is an entity-attribute-value (EAV) relational database design used to describe some features of the product.

Since the schema.xml file in SOLR is flat from the perspective of faceting, I worked around it by munging the variations into a single multi-valued field …

<field
  name="variation"
  type="string"
  indexed="true"
  stored="true"
  required="false"
  multiValued="true" />

… shoving data from the database into these fields as Color|Black, Size|Small, and Scent|Pumpkin …

  <doc>
    <field name="id">ITEM-J-WHITE-M</field>
    <field name="itemgroup.identity">2</field>
    <field name="name">Original Jock</field>
    <field name="type">ITEM</field>
    <field name="variation">Color|White</field>
    <field name="variation">Size|Medium</field>
  </doc>
  <doc>
    <field name="id">ITEM-J-WHITE-L</field>
    <field name="itemgroup.identity">2</field>
    <field name="name">Original Jock</field>
    <field name="type">ITEM</field>
    <field name="variation">Color|White</field>
    <field name="variation">Size|Large</field>
  </doc>
  <doc>
    <field name="id">ITEM-J-WHITE-XL</field>
    <field name="itemgroup.identity">2</field>
    <field name="name">Original Jock</field>
    <field name="type">ITEM</field>
    <field name="variation">Color|White</field>
    <field name="variation">Size|Extra Large</field>
  </doc>

… so that when I tell SOLR to facet, then I get results that look like …

<lst name="facet_counts">
  <lst name="facet_queries"/>
  <lst name="facet_fields">
    <lst name="variation">
      <int name="Color|White">2</int>
      <int name="Size|Extra Large">2</int>
      <int name="Size|Large">2</int>
      <int name="Size|Medium">2</int>
      <int name="Size|Small">2</int>
      <int name="Color|Black">1</int>
    </lst>
  </lst>
  <lst name="facet_dates"/>
  <lst name="facet_ranges"/>
</lst>

… so that my code that parses these results to display to the user can just split on my | delimiter (assuming that neither my keys nor values will have a | in them) and then group by the keys …

Color
    White (2)
    Black (1)
Size
    Extra Large (2)
    Large (2)
    Medium (2)
    Small (2)

… which is good enough for government work.
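
For reference, the facet output above comes from a request along these lines (a sketch; the core name items is made up, and rows=0 suppresses the documents themselves so only the facet counts are returned):

curl 'http://localhost:8983/solr/items/select?q=*:*&rows=0&facet=true&facet.field=variation'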

One disadvantage of doing it this way is that you'll lose the ability to do range facets on this EAV data, but in my case, that didn't apply (the Price field applies to all items and is thus defined in schema.xml so that it can be faceted in the usual way).

Hope this helps someone!

 

Copy from: http://stackoverflow.com/questions/7512392/facet-dynamic-fields-with-apache-solr/14529566#14529566

 

Posted by on January 21, 2014 in Solr

 

Set uniqueKey field out of string or uuid

By: Sochinda

Open the file solrconfig.xml and disable the elevator search component by commenting it out:

<!--
<searchComponent name="elevator" >
    <str name="queryFieldType">string</str>
    <str name="config-file">elevate.xml</str>
</searchComponent>
-->
 

Posted by on December 30, 2013 in Solr

 

Mapping Object Attribute in SolrNet

By: 

Solr fields defined in your schema.xml must be mapped to properties in a .NET class. There are currently three built-in ways to map fields:

Attributes (default)

With this method you decorate the properties you want to map with the SolrField and SolrUniqueKey attributes. The attribute parameter indicates the corresponding Solr field name.

Example:

public class Product {
    [SolrUniqueKey("id")]
    public string Id { get; set; }

    [SolrField("manu_exact")]
    public string Manufacturer { get; set; }

    [SolrField("cat")] // cat is a multiValued field
    public ICollection<string> Categories { get; set; }

    [SolrField("price")]
    public decimal Price { get; set; }

    [SolrField("inStock")]
    public bool InStock { get; set; }

    [SolrField("timestamp")]
    public DateTime Timestamp { get; set; }

    [SolrField("weight")]
    public double? Weight { get; set;} // nullable property, it might not be defined on all documents.
}

This way of mapping is implemented by the AttributesMappingManager class.

Index-time field boosting

You can also use the mapping attribute to apply a boost to a specific field at index-time.

[SolrField("inStock", Boost = 10.5)]
public bool InStock { get; set; }

... this will add a boost of 10.5 to the InStock field each time the document is indexed.

All-properties

This maps each property of the class to a field of the exact same name as the property (note that Solr field names are case-sensitive). It’s implemented by the AllPropertiesMappingManager class. Note that unique keys cannot be inferred, and therefore have to be explicitly mapped. The same mapping as above could be accomplished like this:

public class Product {
    public string id { get; set; }
    public string manu_exact { get; set; }
    public ICollection<string> cat { get; set; }
    public decimal price { get; set; }
    public bool inStock { get; set; }
    public DateTime timestamp { get; set; }
    public double? weight { get; set; }
}

Then to add the unique key:

var mapper = new AllPropertiesMappingManager();
mapper.SetUniqueKey(typeof(Product).GetProperty("id"));

Manual mapping

This allows you to programmatically define the field for each property:

public class Product {
    public string Id { get; set; }
    public string Manufacturer { get; set; }
    public ICollection<string> Categories { get; set; }
    public decimal Price { get; set; }
    public bool InStock { get; set; }
    public DateTime Timestamp { get; set; }
    public double? Weight { get; set; }
}
var mgr = new MappingManager();
var property = typeof (Product).GetProperty("Id");
mgr.Add(property, "id");
mgr.SetUniqueKey(property);
mgr.Add(typeof(Product).GetProperty("Manufacturer"), "manu_exact");
mgr.Add(typeof(Product).GetProperty("Categories"), "cat_exact");
mgr.Add(typeof(Product).GetProperty("Price"), "price");
mgr.Add(typeof(Product).GetProperty("InStock"), "inStock");
mgr.Add(typeof(Product).GetProperty("Timestamp"), "timestamp");
mgr.Add(typeof(Product).GetProperty("Weight"), "weight");

Dictionary mappings and dynamic fields

Solr dynamicFields can be mapped differently depending on the use case. They can be mapped “statically”, e.g., given:

<dynamicField name="price_*"  type="integer"  indexed="true"  stored="true"/>

a particular dynamicField instance can be mapped as:

[SolrField("price_i")]
public decimal? Price {get;set;}

However, it’s often necessary to have more flexibility. You can also map dynamicFields as a dictionary, with a field name prefix:

[SolrField("price_")]
public IDictionary<string, decimal> Price {get;set;}

In this case, price_ is used as a prefix to the actual Solr field name, e.g. with this mapping, Price["regular"] maps to a Solr field named price_regular.

Another, even more flexible mapping:

[SolrField("*")]
public IDictionary<string, object> OtherFields {get;set;}

This acts as a catch-all container for any fields that are otherwise unmapped. E.g. OtherFields["price_i"] maps to a Solr field named price_i.

Fully loose mapping

An even more “dynamic” mapping can be achieved by using a Dictionary<string,object> as document type. In this document type, the dictionary key corresponds to the Solr field name and the value to the Solr field value. Statically typing the fields is obviously lost in this case, though.

When adding documents as Dictionary<string,object> SolrNet will recognize field value types as usual, e.g. you can use strings, int, collections, arrays, etc. Example:

Startup.Init<Dictionary<string, object>>(serverUrl);
var solr = ServiceLocator.Current.GetInstance<ISolrOperations<Dictionary<string, object>>>();
solr.Add(new Dictionary<string, object> {
  {"field1", 1},
  {"field2", "something else"},
  {"field3", new DateTime(2010, 5, 5, 12, 23, 34)},
  {"field4", new[] {1,2,3}},
});

When fetching documents as Dictionary<string,object> SolrNet will automatically map each field value to a .NET type, but it’s up to you to downcast the field value to a properly typed variable. Example:

ISolrOperations<Dictionary<string, object>> solr = ...
ICollection<Dictionary<string, object>> results = solr.Query(SolrQuery.All);
bool inStock = (bool) results[0]["inStock"];

Custom mapping

You can code your own mapping mechanism by implementing the IReadOnlyMappingManager interface.

To override the default mapping mechanism, see this page.

 

Copy from: https://github.com/mausch/SolrNet/blob/master/Documentation/Mapping.md

 

Posted by on December 30, 2013 in Solr