nedelja, 13. oktober 2013

search sphinx within mysql/mariadb (linux edition)

I have used sphinx search engine ( http://sphinxsearch.com/ ) few times before and back then there was no option to use it inside mysql so the only way to do search was:

1. define search index (via mysql query)
2. do actual index and setup reindex via cron (at least that how i did it - now seems to be some kind of relat time index update option)
3. connect to searchd via php api with query - resposn was object of document ids best matching search criteria
4. do mysql query with returning id's as where in parameter
5. sort results as returned form searchd api call

This thing actually works just fine but its a bit complicated and could be done much simpler. So after doin a bit of research for new project i found this document ( https://mariadb.com/kb/en/about-sphinxse/ ) describing searching for results within mysql.

Let's asume that we have working searchd with our index (tasks 1 and 2)

Latest versions of MariaDB apparently include sphinx plugin that can be easily activate by

MariaDB [(none)]> INSTALL PLUGIN sphinx SONAME 'ha_sphinx.so';
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> show engines;
+------------+---------+--------------------------------------------+--------------+------+------------+
| Engine     | Support | Comment                                    | Transactions | XA   | Savepoints |
+------------+---------+--------------------------------------------+--------------+------+------------+
| SPHINX     | YES     | Sphinx storage engine 0.9.9                | NO           | NO   | NO         |
+------------+---------+--------------------------------------------+--------------+------+------------+


We can see plugin has been successfully running inside MySQL/MariaDB database.
Next thing would be to define local table with connection to searchd engine parameter

MariaDB [(none)]>CREATE TABLE t1
(
    id          INTEGER UNSIGNED NOT NULL,
    weight      INTEGER NOT NULL,
    query       VARCHAR(3072) NOT NULL,
    group_id    INTEGER,
    INDEX(query)
) ENGINE=SPHINX CONNECTION="sphinx://127.0.0.1:9312/test1";


So inside out database we now have two tables -- one that is actually holding data* and the other "fake" table that connects to searchd and pull out results form our search query

MariaDB [nfo]> show tables;
+---------------+
| Tables_in_nfo |
+---------------+
| predb         |
| t1            |
+---------------+

  
* i use 27 Years of Warez Scene Release Info Leaked in Giant Database ( http://torrentfreak.com/27-years-of-warez-scene-release-info-leaked-in-giant-database/ ) as my source - 2.6 mio records

So now we can achieve tasks 3,4 and 5 within mysql with just one query like this

MariaDB [nfo]> SELECT predb.id, rlsname, grp FROM t1 JOIN predb on t1.id = predb.id WHERE query='tie fighter';
+---------+-----------------------------------------------------------+---------+
| id      | rlsname                                                   | grp     |
+---------+-----------------------------------------------------------+---------+
| 2338491 | X-Wing.vs.Tie.Fighter.Balance.of.Power-HBD                | HBD     |
| 2338559 | TIE.Fighter.Collector.Series-iLLoGiC                      | iLLoGiC |
| 2338560 | TIE.Fighter.Collector.Series.Complete.Intro.Addon-iLLoGiC | iLLoGiC |
| 2338562 | TIE.Fighter.Collector.Series.Music.Addon-iLLoGiC          | iLLoGiC |
| 2338565 | TIE.Fighter.Collector.Series.3-D.Fix-iLLoGiC              | iLLoGiC |
| 2340174 | X-Wing.vs.Tie-Fighter-HBD                                 | HBD     |
+---------+-----------------------------------------------------------+---------+
6 rows in set (0.00 sec)


Final words/benchmarks

MariaDB [nfo]> SELECT id,rlsname FROM predb WHERE rlsname LIKE '%zaxxon%';
+---------+---------------------------------------------+
| id      | rlsname                                     |
+---------+---------------------------------------------+
|  804051 | Bassplugger-Zaxxon-ILEC006-Advance-2004-RFL |
| 1847329 | Zaxxon-DPS                                  |
+---------+---------------------------------------------+
2 rows in set (1.79 sec)

MariaDB [nfo]> SELECT * FROM t1 WHERE query='zaxxon';
+---------+--------+--------+----------+
| id      | weight | query  | group_id |
+---------+--------+--------+----------+
|  804051 |      1 | zaxxon |        0 |
| 1847329 |      1 | zaxxon |        0 |
+---------+--------+--------+----------+
2 rows in set (0.01 sec)

MariaDB [nfo]> SELECT predb.id, rlsname, grp FROM t1 JOIN predb on t1.id = predb.id WHERE query='zaxxon';
+---------+---------------------------------------------+-----+
| id      | rlsname                                     | grp |
+---------+---------------------------------------------+-----+
|  804051 | Bassplugger-Zaxxon-ILEC006-Advance-2004-RFL | RFL |
| 1847329 | Zaxxon-DPS                                  | DPS |
+---------+---------------------------------------------+-----+
2 rows in set (0.04 sec)