
Things I Have Learnt in The First 5 Minutes of Using Elastic Search

After hearing all the raves about Elastic Search and how it was awesome and “rad”, or whatever the “hip, young” programmers say these days, I decided I would give it a go.

To get to the point, since this might be a bit tl;dr: I am not overly fond of it, and I am unsure what companies like GitHub see in it.

It has a queue, no need for a river

Exactly that: implement indexing in your active record layer and you don’t need a river.

In fact, I would advise against the river. It reads the oplog, which can be slow. Not only that, but you are adding yet another lock on top of the secondaries, which may increase the chances of replication falling behind; this of course depends on how often the river polls your oplog and how many new ops arrive in that window.
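
To illustrate, here is a minimal sketch of river-free indexing. The ActiveRecord base class, the afterSave() hook, and getAttributes() are hypothetical stand-ins for whatever your framework provides, and it assumes the glue::elasticSearch() wrapper exposes the official client’s index() method:

    // Hypothetical ActiveRecord-style model; the framework is assumed
    // to call afterSave() after every successful save.
    class Video extends ActiveRecord
    {
        public function afterSave()
        {
            // Push the fresh document straight into Elastic Search,
            // so no river ever has to poll the oplog.
            glue::elasticSearch()->index(array(
                'index' => 'main',
                'type'  => 'video',
                'id'    => (string)$this->_id,
                'body'  => $this->getAttributes(),
            ));
        }
    }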

This is a good point in its favour; it is something that genuinely makes ES stand out.

It has terrible documentation

Its documentation is great at explaining the API, no doubt about it, but if you want to find out how something actually works, and why, you have to constantly ask Stack Overflow.

It just describes what parameters to put in and then leaves the rest up to you, assuming you don’t want to bother yourself with those details. We do, though; we are not “band-wagoning” your product. We want to know how sharding and replication work, how indexes work, how to manage the product, and much more.

Even the API documentation itself can sometimes be… unhelpful, mainly due to its huge font size paired with a tiny, centre-aligned layout, its English-language problems, and its disorganisation.

Overall, I came away less than impressed with Elastic Search’s documentation.

I actually Google everything first so I don’t have to navigate that mess.

Lucene is not a good base

Yes, Lucene is one of the originals when it comes to FTS technology, but that is not a good thing. It was built back when people didn’t care about speed or scaling; all they cared about was that it could search like Google.

This means Lucene has many problems, such as no infix matching, and these carry straight over to Elastic Search. At times, to get the effect you want, you must fall back on prefix or wildcard searches, which are so slow they are a nail in the coffin for any database, especially one serving an FTS engine.
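
For example, with no infix matching, finding “log” inside “blogging” forces you into something like the following sketch (the wildcard query type is real; the field name is illustrative). A leading wildcard is exactly the slow case, because Lucene has to walk the whole term dictionary:

    // Emulating infix matching with a wildcard query; the leading '*'
    // is what makes this so slow.
    $cursor = glue::elasticSearch()->search(array('type' => 'help', 'body' => array(
        'query' => array(
            'wildcard' => array('title' => '*log*'),
        ),
    )));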

That is just one of many problems that plague modern Lucene, including a mix-and-match query language accumulated over years of staying backward compatible while trying to keep up with change.

It has some of the most verbose querying in the known universe

When querying a single keyword takes this much writing:

    $cursor = glue::elasticSearch()->search(array('type' => 'help', 'body' => array(
        'query' => array('filtered' => array(
            'query' => array(
                'bool' => array(
                    'should' => array(
                        array('multi_match' => array(
                            'query' => glue::http()->param('query', $keywords),
                            'fields' => array('title', 'blurb', 'tags', 'normalisedTitle', 'path')
                        ))
                    )
                )
            )
        ))
    )));

And querying more than one takes:

    $res = glue::elasticSearch()->search(array('body' => array(
        'query' => array(
            'filtered' => array(
                'query' => array(
                    'bool' => array(
                        'should' => array(
                            array('prefix' => array('username' => 'the')),
                            array('prefix' => array('username' => 'n')),
                            array('prefix' => array('username' => 'm')),
                            array('match' => array('about' => 'the'))
                        )
                    )
                ),
                'filter' => array(
                    'and' => array(
                        array('range' => array('created' => array(
                            'gte' => date('c', time() - 3600),
                            'lte' => date('c', time() + 3600)
                        )))
                    )
                )
            )
        ),
        // 'sort' belongs at the body level, not inside 'filtered'
        'sort' => array()
    )));

You do start to feel yourself slipping away.

You must do your own tokenizing if you wish to prefix on two keywords separately

Elastic Search won’t do this; by default it will actually search for the whole phrase, even when you don’t use the phrase searcher.
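
The workaround is to split the input yourself and build one prefix clause per token; a minimal sketch, assuming whitespace-separated keywords:

    // Split the raw input ourselves, since a prefix query will not
    // tokenize "the n m" into three separate prefixes for us.
    $should = array();
    foreach (preg_split('/\s+/', trim($keywords), -1, PREG_SPLIT_NO_EMPTY) as $token) {
        $should[] = array('prefix' => array('username' => strtolower($token)));
    }
    $res = glue::elasticSearch()->search(array('body' => array(
        'query' => array('bool' => array('should' => $should)),
    )));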

It has some of the most complex querying in the known universe

When you have many ways to represent the same operator, and six different operators for what is essentially the same thing…

I believe this is legacy from Lucene; one of the many downsides of being old and backward compatible with every version.
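
As a sketch of the overlap, here are three ways to ask roughly the same question of one field. All three query types exist; which one is “right” depends entirely on how the field was analyzed:

    $q1 = array('match'        => array('title' => 'help'));        // analyzed full-text match
    $q2 = array('query_string' => array('query' => 'title:help'));  // Lucene query syntax
    $q3 = array('term'         => array('title' => 'help'));        // exact, unanalyzed term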

It has no exact filtering without turning off the analyzer

Yep, you read that right: you want to filter (filter, not query) on deleted? You’re going to have to make sure the field isn’t analyzed, buddy.
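
A sketch of the mapping you are forced into, assuming a hypothetical deleted field you want to term-filter on exactly:

    $mapping = array(
        'help' => array(
            'properties' => array(
                'deleted' => array(
                    'type'  => 'string',
                    // without this, the analyzer mangles the stored value
                    // and an exact term filter will never match it
                    'index' => 'not_analyzed',
                ),
            ),
        ),
    );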

It has no easy way to define indexes server side

I have no idea why Elastic Search does this, but they make you define the indexes client side. In all my life I have never had a reason to do that, and if you want to comment saying you do, think about it carefully: “DO YOU REALLY???”

Only about 1% of cases need client-side index definition; the other 99% just think they do.

Either way, I am now stuck with a roughly 500-line Elastic Search setup script in my application, which has to be run in stages since you can’t run the delete-index command and then the recreate-index command back to back; something about parts of it being synchronous and parts asynchronous.
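
The workaround amounts to polling between the two commands; a sketch, assuming the glue::elasticSearch() wrapper mirrors the official PHP client’s indices() API:

    // Delete, then poll until the old index is really gone before
    // recreating it, since the delete completes asynchronously.
    glue::elasticSearch()->indices()->delete(array('index' => 'main'));
    while (glue::elasticSearch()->indices()->exists(array('index' => 'main'))) {
        usleep(100000);
    }
    glue::elasticSearch()->indices()->create(array('index' => 'main', 'body' => $settings));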

It says not to use delete, yet provides no feasible alternative

Its exact words are:

Note, delete by query bypasses versioning support. Also, it is not recommended to delete “large chunks of the data in an index”, many times, it’s better to simply reindex into a new index.

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-delete-by-query.html

So, if you need to delete one user’s videos from the videos type, think again; and since you have just shy of 200m records, you can’t simply re-index.

Creating mappings is painful

There is no default mapping ability, which means anything the types share has to be duplicated.

EDIT: There is a default mapping; it was hidden under “dynamic mapping”: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-dynamic-mapping.html
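
A sketch of using it, declaring a shared field once under _default_ so every type created in the index inherits it (the field name is illustrative, and indices()->create() assumes the wrapper mirrors the official client):

    $body = array('mappings' => array(
        '_default_' => array(
            'properties' => array(
                // every type created in this index inherits 'created'
                'created' => array('type' => 'date'),
            ),
        ),
    ));
    glue::elasticSearch()->indices()->create(array('index' => 'main', 'body' => $body));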

It is actually quite slow

Sphinx answered the same query in about 1ms or less; Elastic Search takes 6ms just to return no results.

Querying is not standardised

from = skip/offset
size = limit

Seeing the problem yet?
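
So paging through results looks like this sketch, where from is what everyone else calls offset/skip and size is what everyone else calls limit:

    // Page 3 at 20 hits per page: from = (3 - 1) * 20
    $res = glue::elasticSearch()->search(array('type' => 'help', 'body' => array(
        'query' => array('match_all' => new stdClass()),
        'from'  => 40,
        'size'  => 20,
    )));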

Its PHP SDK is larger than my application

It is 5.42MB in size!!! For comparison, Yii is 58MB. That may not sound like a lot, but Yii is a full-stack web framework… Yup.

Its PHP API can only be installed via Composer

NOT EVERYONE USES COMPOSER!!!

The PHP API uses 9MB memory (base) per request

Under Sphinx that was 3.25MB.

Decent AWS usage is an optional extra

Welcome to hell, kid: http://www.elasticsearch.org/tutorials/elasticsearch-on-ec2/

Setting up Elastic Search properly was tedious and soul-breaking

The lack of documentation, and the constant back and forth with Stack Overflow, made me look about 10 years older.

Elastic Search returns no results if no keywords are provided

Super annoying…
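
The workaround is to detect the empty search box yourself and fall back to match_all; a minimal sketch:

    // An empty query string returns nothing, so substitute match_all,
    // which returns every document instead.
    $query = trim($keywords) === ''
        ? array('match_all' => new stdClass())
        : array('multi_match' => array(
              'query'  => $keywords,
              'fields' => array('title', 'blurb', 'tags'),
          ));
    $cursor = glue::elasticSearch()->search(array('type' => 'help', 'body' => array(
        'query' => $query,
    )));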

Schema flexible, but not really

I recently made a mistake in my documents which meant I needed to re-save an object field as a string. Of course, Lucene is not schema flexible, which means that, despite trying to be, Elastic is not either.

So what do I do now? Re-update the index through a specially designed script (WTF?? Why can’t I do this shit in the fucking config???) and then reapply all documents? Or delete the index, reapply all types and mappings, and re-insert all documents again…

No easy document management

It is like shipping MongoDB or MySQL without a console or shell.

No easy way to reindex

Nope.

Keeping the config file safe

Having the configuration done client side means that you must, as I said, have a PHP (or whatever) file in your app structure which can run the setup.

This immediately poses a problem: how do you keep this at hand in a browser-runnable location without writing specific server rules? You can’t…

You have to turn your setup file into a full-blown console command, functionality that SHOULD BE IN ELASTIC SEARCH.
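
The best you can do is a crude guard at the top of the setup file so it refuses to run from the browser; a sketch using PHP’s php_sapi_name():

    // Bail out unless we are running from the command line, so a
    // stray web request can never drop and rebuild the indexes.
    if (php_sapi_name() !== 'cli') {
        header('HTTP/1.1 403 Forbidden');
        exit('Run this from the command line.');
    }
    // ... the ~500 lines of index and mapping definitions follow ...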

The diagnostics are terrible

I recently started getting red status for my nodes all the time, so, as suggested by the user group, I looked into the logs, only to find:

[2013-12-22 12:44:24,257][INFO ][node                     ] [Dominic Fortune] version[0.90.8], pid[1494], build[909b037/2013-12-18T16:08:16Z]
[2013-12-22 12:44:24,257][INFO ][node                     ] [Dominic Fortune] initializing ...
[2013-12-22 12:44:24,265][INFO ][plugins                  ] [Dominic Fortune] loaded [], sites [HQ, head]
[2013-12-22 12:44:26,818][INFO ][node                     ] [Dominic Fortune] initialized
[2013-12-22 12:44:26,818][INFO ][node                     ] [Dominic Fortune] starting ...
[2013-12-22 12:44:26,891][INFO ][transport                ] [Dominic Fortune] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/192.168.183.128:9300]}
[2013-12-22 12:44:29,919][INFO ][cluster.service          ] [Dominic Fortune] new_master [Dominic Fortune][5ddMpkRQTZa3TqQ-ljUabg][inet[/192.168.183.128:9300]], reason: zen-disco-join (elected_as_master)
[2013-12-22 12:44:29,951][INFO ][discovery                ] [Dominic Fortune] elasticsearch/5ddMpkRQTZa3TqQ-ljUabg
[2013-12-22 12:44:29,979][INFO ][http                     ] [Dominic Fortune] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/192.168.183.128:9200]}
[2013-12-22 12:44:29,980][INFO ][node                     ] [Dominic Fortune] started
[2013-12-22 12:44:29,987][INFO ][gateway                  ] [Dominic Fortune] recovered [0] indices into cluster_state
[2013-12-22 12:45:07,323][INFO ][cluster.metadata         ] [Dominic Fortune] [main] creating index, cause [api], shards [5]/[2], mappings []
[2013-12-22 12:45:17,669][INFO ][cluster.metadata         ] [Dominic Fortune] [main] create_mapping [_default_]
[2013-12-22 12:45:17,680][INFO ][cluster.metadata         ] [Dominic Fortune] [main] create_mapping [help]
[2013-12-22 12:47:19,818][INFO ][node                     ] [Dominic Fortune] stopping ...
[2013-12-22 12:47:19,845][INFO ][node                     ] [Dominic Fortune] stopped
[2013-12-22 12:47:19,845][INFO ][node                     ] [Dominic Fortune] closing ...
[2013-12-22 12:47:19,856][INFO ][node                     ] [Dominic Fortune] closed
[2013-12-22 12:47:45,495][INFO ][node                     ] [Stryker, William] version[0.90.8], pid[1695], build[909b037/2013-12-18T16:08:16Z]
[2013-12-22 12:47:45,496][INFO ][node                     ] [Stryker, William] initializing ...
[2013-12-22 12:47:45,502][INFO ][plugins                  ] [Stryker, William] loaded [], sites [HQ, head]
[2013-12-22 12:47:48,068][INFO ][node                     ] [Stryker, William] initialized
[2013-12-22 12:47:48,068][INFO ][node                     ] [Stryker, William] starting ...
[2013-12-22 12:47:48,140][INFO ][transport                ] [Stryker, William] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/192.168.183.128:9300]}
[2013-12-22 12:47:51,170][INFO ][cluster.service          ] [Stryker, William] new_master [Stryker, William][rMklMXasRDS4lURA0wQ7lQ][inet[/192.168.183.128:9300]], reason: zen-disco-join (elected_as_master)
[2013-12-22 12:47:51,198][INFO ][discovery                ] [Stryker, William] elasticsearch/rMklMXasRDS4lURA0wQ7lQ
[2013-12-22 12:47:51,222][INFO ][http                     ] [Stryker, William] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/192.168.183.128:9200]}
[2013-12-22 12:47:51,223][INFO ][node                     ] [Stryker, William] started
[2013-12-22 12:47:51,242][INFO ][gateway                  ] [Stryker, William] recovered [1] indices into cluster_state

… Nothing. Their diagnostics are “terribad”, to say the least. Currently, their advice is to remove the data directory and lose ALL YOUR DATA IN THE PROCESS.

That last sentence brings me on to the next point:

No real journal

No easy recovery if things go terribly wrong.

Elastic Search works by quorum (if you don’t know what democracy is, try searching Wikipedia, if you can…)

Elastic Search works by quorum.

That is why, if you set 5 shards and 2 replicas with no data, you will always get red status on restart…

Auto-join sharding is a really bad idea

Great in theory, but what if you accidentally start two nodes on the same machine because you were tired or something? OH NOES I JUST LOST MY ENTIRE CONFIG!!! When you restart that node without the extra one on it, you are left with nothing short of a disaster, and you have no indication of how to fix it.

I have now had to delete all my data three times due to noob mistakes, ironically making Elastic Search harder to learn than if I had to run the sharding commands myself.

I heard once that MongoDB was going to do something similar; please, for the love of God: DON’T.
