Shulou(Shulou.com)05/31 Report--
This article explains how to configure Solr's schema.xml and solrconfig.xml. It walks through the field types, fields, copy fields, and dynamic fields defined in schema.xml, then covers the index-related settings in solrconfig.xml.
I. Field configuration (schema.xml)
schema.xml is located in the solr/conf/ directory and is analogous to a database table definition file.
It defines the structure of the indexed data, including field types (types), fields, and other default settings.
1. First, look at the types node, which defines fieldType child nodes with parameters such as name, class, and positionIncrementGap.
Name: the name of the fieldType.
Class: points to the corresponding class (for field types, under the org.apache.solr.schema package), which defines the behavior of this type.
<schema name="example" version="1.2">
  <types>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
    <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" omitNorms="true"/>
    <fieldType name="binary" class="solr.BinaryField"/>
    <fieldType name="int" class="solr.TrieIntField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
    <fieldType name="float" class="solr.TrieFloatField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
    <fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
    ...
If necessary, a fieldType also defines the analyzers to be used when indexing and querying data of this type, including tokenizers and filters, as follows:
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
</fieldType>
2. Next, look at the specific fields defined in the fields node (similar to database columns), which support the following attributes:
Name: the field name
Type: one of the fieldType names defined earlier
Indexed: whether the field is indexed
Stored: whether the field value is stored (set to false whenever you do not need to retrieve the value)
MultiValued: whether the field can hold multiple values (set to true for fields that may have multiple values, to avoid errors during indexing)
<fields>
  <field name="id" type="integer" indexed="true" stored="true" required="true"/>
  <field name="name" type="text" indexed="true" stored="true"/>
  <field name="summary" type="text" indexed="true" stored="true"/>
  <field name="author" type="string" indexed="true" stored="true"/>
  <field name="date" type="date" indexed="false" stored="true"/>
  <field name="content" type="text" indexed="true" stored="false"/>
  <field name="keywords" type="keyword_text" indexed="true" stored="false" multiValued="true"/>
  <field name="all" type="text" indexed="true" stored="false" multiValued="true"/>
</fields>
3. It is recommended to create a copy field that gathers all full-text fields into one field for unified retrieval:
The following are the copy settings:
<copyField source="name" dest="all"/>
<copyField source="summary" dest="all"/>
4. Dynamic fields. For fields without fixed names, use dynamicField.
For example, if the name is *_i and its type is defined as int, then any field whose name ends with _i (such as name_i or school_i) is treated as matching this definition.
<dynamicField name="*_i" type="int" indexed="true" stored="true"/>
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
<dynamicField name="*_l" type="long" indexed="true" stored="true"/>
<dynamicField name="*_t" type="text" indexed="true" stored="true"/>
<dynamicField name="*_b" type="boolean" indexed="true" stored="true"/>
<dynamicField name="*_f" type="float" indexed="true" stored="true"/>
<dynamicField name="*_d" type="double" indexed="true" stored="true"/>
<dynamicField name="*_dt" type="date" indexed="true" stored="true"/>
Notes from the comments in the schema.xml file:
1. To improve performance, you can take the following measures:
Set stored to false for all fields that are only used for searching and are not needed in results (especially larger fields).
Set indexed to false for fields that are not searched but are returned as results.
Delete all unnecessary copyField declarations.
To keep the number of indexed fields small while still searching everything, copy all text fields into a single catch-all text field with copyField and search against that field.
For maximum indexing efficiency, use a Java client that communicates with Solr over a streaming connection.
Where possible, run the client in the same JVM as Solr (avoiding network traffic), and use the highest feasible log output level to reduce log volume.
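A minimal sketch applying the first two tips; the field names here are hypothetical:

```xml
<!-- Searched but never displayed in results: index it, don't store it -->
<field name="content" type="text" indexed="true" stored="false"/>
<!-- Displayed but never searched: store it, don't index it -->
<field name="thumbnail_url" type="string" indexed="false" stored="true"/>
```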
2. <schema name="example" version="1.2">
Name: identifies the name of this schema.
Version: the current version is 1.2.
3. fieldType
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
Name: just an identifier.
Class and the other attributes determine the actual behavior of this fieldType. (Classes prefixed with solr. resolve to classes under the org.apache.solr package hierarchy.)
Optional attributes:
The sortMissingLast and sortMissingFirst attributes apply to types that sort as strings internally (including string, boolean, sint, slong, sfloat, sdouble, pdate).
SortMissingLast="true": documents without the field sort after documents that have it, regardless of the requested sort order.
SortMissingFirst="true": the opposite; documents without the field sort first.
Both attributes default to false.
StrField values are not analyzed; they are indexed/stored verbatim.
Both StrField and TextField support an optional compressThreshold attribute, which sets the minimum size (in characters) at which a stored value will be compressed.
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
solr.TextField lets users customize indexing and querying through an analyzer, which consists of a tokenizer and one or more filters.
PositionIncrementGap: an optional attribute that defines a positional gap between multiple values of this field within the same document, to avoid spurious phrase matches.
Name: field type name
Class: Java class name
Indexed: defaults to true. Indicates the field can be searched and sorted; if a field is not indexed, then stored should be true.
Stored: defaults to true. Indicates the field can be included in search results; if a field is not stored, then indexed should be true.
SortMissingLast: documents without this field sort after documents that have it.
SortMissingFirst: documents without this field sort before documents that have it.
OmitNorms: set to true when field length should not affect scoring and no index-time boost is applied; general text fields are usually not set to true.
TermVectors: set to true if the field is used for MoreLikeThis or highlighting.
Compressed: compresses the field. This may slow indexing and searching but saves storage space; only StrField and TextField can be compressed, and it is usually appropriate for fields longer than about 200 characters.
MultiValued: set to true when the field can hold more than one value.
PositionIncrementGap: used together with multiValued to set the number of virtual token positions inserted between multiple values.
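As an illustration (the field name here is hypothetical), this gap keeps phrase queries from matching across value boundaries:

```xml
<!-- With positionIncrementGap="100" on the "text" type, the phrase query
     "john smith" will not match a document whose multi-valued "author"
     field holds ["mary john", "smith jones"]. -->
<field name="author" type="text" indexed="true" stored="true" multiValued="true"/>
```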
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
Tokenizes on whitespace; tokens then match exactly.
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
Handles "-" hyphens, letter/digit boundaries, and non-alphanumeric characters when tokenizing and matching, so that "wifi" or "wi fi" can match "Wi-Fi".
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
Synonym expansion.
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
Removes stop words and increases the position increment across the gap they leave.
Stopword: words ignored when indexing and searching, such as common words like "is" and "this"; maintained in conf/stopwords.txt.
4. fields
<field name="id" type="string" indexed="true" stored="true" required="true"/>
Name: just an identifier.
Type: one of the previously defined types.
Indexed: whether the field is indexed (needed for searching and sorting).
Stored: whether the field value is stored.
Compressed: [false] whether to compress with gzip (only TextField and StrField can be compressed).
MultiValued: whether the field contains multiple values.
OmitNorms: whether to omit norms; this saves memory. Only full-text fields and fields that need an index-time boost need norms. (The details are unclear; the original comments seem contradictory.)
TermVectors: [false] when true, the term vector is stored; a field used as the similarity source for MoreLikeThis should store it.
TermPositions: stores position information in the term vector, at extra storage cost.
TermOffsets: stores term vector offsets, at extra storage cost.
Default: a default value for the field, used when a document does not supply one.
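For example (a hypothetical field), a default supplies a value for documents that omit the field:

```xml
<!-- Documents indexed without "popularity" get the value 0 -->
<field name="popularity" type="int" indexed="true" stored="true" default="0"/>
```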
<field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
A catch-all (somewhat exaggerated) field containing all searchable text fields, populated via copyField.
<copyField source="cat" dest="text"/>
<copyField source="name" dest="text"/>
<copyField source="manu" dest="text"/>
<copyField source="features" dest="text"/>
<copyField source="includes" dest="text"/>
When a document is indexed, the data from each copied source field (such as cat) is copied into the text field.
Purpose:
Search across several fields at once, which also improves speed.
Copy data from one field to another so it can be indexed in two different ways.
<dynamicField name="*_i" type="int" indexed="true" stored="true"/>
If a field name matches no declared field, Solr tries to match it against the defined dynamic field patterns.
"*" may only appear at the beginning or end of a pattern.
Longer patterns are matched first.
If two patterns match equally, the one defined first takes precedence.
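To illustrate the precedence rules with hypothetical patterns:

```xml
<!-- For a field named "price_long_i", the longer pattern "*_long_i"
     wins over "*_i", so the value is indexed as a long, not an int. -->
<dynamicField name="*_long_i" type="long" indexed="true" stored="true"/>
<dynamicField name="*_i" type="int" indexed="true" stored="true"/>
```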
<dynamicField name="*" type="ignored" multiValued="true"/>
If none of the patterns above match, you can define a catch-all like this, giving it a type that simply treats the value as a string or ignores it. (This usually does not happen.)
If it is not defined, however, Solr reports an error whenever no pattern matches.
5. Other tags
<uniqueKey>id</uniqueKey>
The unique identifier of a document. This field must be supplied (unless it is marked required="false"); otherwise Solr reports an error when building the index.
<defaultSearchField>text</defaultSearchField>
If no specific field is given in the search parameters, this is the default search field.
<solrQueryParser defaultOperator="OR"/>
Configures the default logic between query terms; may be "AND" or "OR".
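Taken together, these tags might appear in schema.xml as:

```xml
<uniqueKey>id</uniqueKey>
<defaultSearchField>text</defaultSearchField>
<solrQueryParser defaultOperator="OR"/>
```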
II. Solrconfig.xml
1. Index configuration
The mainIndex section defines some of the factors that control how Solr processes its index.
UseCompoundFile: reduces the number of files in use by consolidating many internal Lucene files into a single file. This can help reduce the number of file handles Solr uses, at some cost in performance. Unless the application runs out of file handles, the default value of false should suffice.
MergeFactor: determines how often Lucene segments are merged. Smaller values (minimum 2) use less memory but make indexing slower; larger values make indexing faster at the cost of more memory (a typical time/space trade-off).
MaxBufferedDocs: the minimum number of documents buffered in memory before they are merged and a new segment is created. Segments are the Lucene files that store index information. Larger values make indexing faster at the cost of more memory.
MaxMergeDocs: controls the maximum number of documents Solr can merge into a segment. A smaller value (
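A sketch of how these settings sit inside solrconfig.xml; the values shown are illustrative, roughly matching the stock defaults:

```xml
<mainIndex>
  <useCompoundFile>false</useCompoundFile>
  <mergeFactor>10</mergeFactor>
  <maxBufferedDocs>1000</maxBufferedDocs>
  <maxMergeDocs>2147483647</maxMergeDocs>
</mainIndex>
```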