DIMDATA Concepts

Features

Role-based Search Results

Different groups across a company need access to different levels of information. DIMDATA enterprise search honors security and access-control privileges, so results are both relevant and permissible for each user.

Search portal

DIMDATA Search includes built-in search portal deployment and hosting, and supports deploying multiple search portals. Each portal can be configured to search a different index, or the same index with a hidden filter applied.
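
As an illustration, a multi-portal setup might be described with a configuration like the one below. This is a minimal Python sketch; the key names (`portal`, `index`, `hidden_filter`) are assumptions for illustration, not DIMDATA's actual configuration schema.

```python
# Hypothetical portal configurations: the key names below are
# illustrative assumptions, not DIMDATA's actual config schema.
portals = [
    {
        "portal": "public-docs",
        "index": "documents",       # this portal searches its own index
        "hidden_filter": None,
    },
    {
        "portal": "hr-internal",
        "index": "documents",       # same index as above...
        "hidden_filter": {"department": "hr"},  # ...restricted by a hidden filter
    },
]
```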

Global Language Support

DIMDATA Search supports more than 32 languages, including Chinese, Japanese, Korean, Thai, and Arabic, with no additional work required.

Faceting

Faceted search, also called faceted navigation or faceted browsing, is a technique for accessing information organized according to a faceted classification system, allowing users to explore a collection of information by applying multiple filters.
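
To make the idea concrete, here is a minimal Python sketch that computes facet counts over a small document collection and then applies two filters at once. It illustrates the concept only; it is not DIMDATA's query API.

```python
from collections import Counter

docs = [
    {"title": "Red shirt",  "color": "red",  "size": "M"},
    {"title": "Blue shirt", "color": "blue", "size": "M"},
    {"title": "Red dress",  "color": "red",  "size": "S"},
]

# Facet counts: how many results fall under each value of a field.
color_facet = Counter(d["color"] for d in docs)   # Counter({'red': 2, 'blue': 1})
size_facet  = Counter(d["size"]  for d in docs)   # Counter({'M': 2, 'S': 1})

# Applying multiple filters narrows the collection step by step.
filtered = [d for d in docs if d["color"] == "red" and d["size"] == "M"]
print(color_facet, size_facet, filtered)
```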

Stemming

A stemming filter provides access to (almost) all of the available stemming token filters through a single unified interface. Standard stemming is supported for more than 32 languages.
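
For a feel of what stemming does, the snippet below uses NLTK's Snowball stemmer as a stand-in (DIMDATA's own stemmer implementation is not shown here): different surface forms of a word reduce to a single stem, so a query for one form can match documents containing another.

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

# Different inflections collapse to a single stem, so a query for
# "connect" can match documents containing "connected" or "connecting".
for word in ["connect", "connected", "connecting", "connection"]:
    print(word, "->", stemmer.stem(word))
# all four words stem to "connect"
```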

Synonyms

The synonym token filter makes it easy to handle synonyms during the analysis process. Examples (a minimal filter sketch follows the list):

  • ipod, i-pod, i pod
  • foozball, foosball
  • universe, cosmos
  • lol, laughing out loud
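
A minimal sketch of how a synonym token filter might expand tokens, using the examples above. The mapping format is an assumption for illustration; DIMDATA's actual synonym configuration may differ.

```python
# Each synonym maps to its canonical token(s); this format is an
# illustrative assumption, not DIMDATA's config syntax.
SYNONYMS = {
    "i-pod": ["ipod"],
    "foozball": ["foosball"],
    "cosmos": ["universe"],
    "lol": ["laughing", "out", "loud"],
}

def synonym_filter(tokens):
    """Replace each token with its canonical synonym(s), if any."""
    out = []
    for tok in tokens:
        out.extend(SYNONYMS.get(tok, [tok]))
    return out

print(synonym_filter(["lol", "cosmos"]))
# ['laughing', 'out', 'loud', 'universe']
```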

Vocabulary

Index

An index is an entity within DIMDATA Search into which you import the data you want to search (indexing) and against which you run queries (searching). An index is like a table in a relational database.

Document

A document is a JSON object stored in DIMDATA Search; it is like a row in a table in a relational database. Each document is stored in an index, has an ID, and contains zero or more fields (key-value pairs).
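
A document might look like the following. The field names here are illustrative; only the shape matters: a JSON object with an ID and key-value fields, stored in an index.

```python
import json

# An illustrative document: a JSON object with fields, stored in an
# index under an ID (field names are examples, not a fixed schema).
document = {
    "id": "42",                    # unique within the index
    "title": "Quick brown fox",
    "price": 1999,                 # scalar values: string, integer, date, ...
    "published": "2023-05-01",
}

print(json.dumps(document, indent=2))
```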

Field

A document contains a list of fields, or key-value pairs. The value can be a simple (scalar) value such as a string, integer, or date. A field is similar to a column in a table in a relational database. The mapping for each field has a field type which indicates the kind of data that can be stored in that field, e.g., integer, string, or object. The mapping also lets you define (among other things) how the value for a field should be analyzed.
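
A field mapping could be sketched like this. The type names and analyzer setting below are assumptions modeled on common search-engine conventions, not a documented DIMDATA schema.

```python
# Illustrative mapping: each field declares a type, and text fields
# can declare how their values are analyzed. Names are assumptions.
mapping = {
    "title":     {"type": "string",  "analyzer": "standard"},
    "price":     {"type": "integer"},
    "published": {"type": "date"},
    "metadata":  {"type": "object"},
}
```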

Id

The ID of a document uniquely identifies it within an index; the index/ID combination of a document must be unique. If no ID is provided, one is auto-generated.

Analyzer

An analyzer converts full text into terms. Depending on which analyzer is used, the phrases FOO BAR, Foo-Bar, and foo,bar will probably all result in the terms foo and bar. These terms are what is actually stored in the index. A full-text query (not a term query) for FoO:bAR will also be analyzed to the terms foo and bar, and will thus match the terms stored in the index.
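
The example in the paragraph above can be reproduced with a toy analyzer: lowercase the input, then split on anything that is not a letter or digit. This is a sketch of the concept, not DIMDATA's implementation.

```python
import re

def toy_analyzer(text):
    """Lowercase, then split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())

for phrase in ["FOO BAR", "Foo-Bar", "foo,bar", "FoO:bAR"]:
    print(phrase, "->", toy_analyzer(phrase))
# every phrase analyzes to ['foo', 'bar'], so they all match each other
```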

Tokenizers

A tokenizer receives a stream of characters, breaks it up into individual tokens (usually individual words), and outputs a stream of tokens. For instance, a whitespace tokenizer breaks text into tokens whenever it sees any whitespace. It would convert the text "Quick brown fox!" into the terms [Quick, brown, fox!].
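
The whitespace tokenizer from the example above is easy to mimic: Python's str.split() breaks on runs of whitespace, so punctuation stays attached to its word.

```python
text = "Quick brown fox!"

# A whitespace tokenizer emits a token at every run of whitespace;
# note the '!' stays attached, exactly as in the example above.
tokens = text.split()
print(tokens)  # ['Quick', 'brown', 'fox!']
```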

Currently available tokenizers:

  • Standard Tokenizer: The standard tokenizer divides text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It removes most punctuation symbols. It is the best choice for most languages.
  • Lowercase Tokenizer: The lowercase tokenizer, like the letter tokenizer, divides text into terms whenever it encounters a character which is not a letter, but it also lowercases all terms.
  • ICU Tokenizer: Tokenizes text into words on word boundaries, as defined in UAX #29: Unicode Text Segmentation. It behaves much like the standard tokenizer, but adds better support for some Asian languages by using a dictionary-based approach to identify words in Thai, Lao, Chinese, Japanese, and Korean, and using custom rules to break Myanmar and Khmer text into syllables.
  • Keyword Tokenizer: The keyword tokenizer is a “noop” tokenizer that accepts whatever text it is given and outputs the exact same text as a single term. It can be combined with token filters like lowercase to normalize the analyzed terms.

Token Filters

Token filters accept a stream of tokens from a tokenizer and can modify tokens (e.g., lowercasing), delete tokens (e.g., removing stopwords), or add tokens (e.g., synonyms). DIMDATA Search has a number of built-in token filters which can be used to build custom analyzers (a pipeline sketch follows the list below):

  • Synonym Token Filter: The synonym token filter makes it easy to handle synonyms during the analysis process.
  • Lowercase Token Filter: A token filter that normalizes token text to lower case.
  • Stemmer Token Filter: A filter that provides access to (almost) all of the available stemming token filters through a single unified interface. The language/name parameter controls the stemmer, with the following available values:

    Arabic, Armenian, Basque, Bengali, Brazilian Portuguese, Bulgarian, Catalan, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Kurdish (Sorani), Latvian, Lithuanian, Norwegian (Bokmål), Norwegian (Nynorsk), Portuguese, Romanian, Russian, Spanish, Swedish, Turkish
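
Putting the pieces together, an analysis pipeline is just a tokenizer followed by token filters applied in order. The sketch below chains toy versions of the components described above; it mirrors the structure of an analyzer, not DIMDATA's internals.

```python
def analyze(text, tokenizer, filters):
    """Run a tokenizer, then apply each token filter in order."""
    tokens = tokenizer(text)
    for f in filters:
        tokens = f(tokens)
    return tokens

# Toy filters standing in for the built-ins described above.
lowercase = lambda tokens: [t.lower() for t in tokens]
synonyms = {"lol": ["laughing", "out", "loud"]}
expand = lambda tokens: [s for t in tokens for s in synonyms.get(t, [t])]

print(analyze("LOL that was great", str.split, [lowercase, expand]))
# ['laughing', 'out', 'loud', 'that', 'was', 'great']
```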

