I decided to release an initial version of Pickmybrain.
It is by no means a finalized product, but it should give you some idea of what to expect.
Pickmybrain is licensed under GPLv3, which means you are free to download, modify and redistribute it as you wish
( as long as you remember to honour the other license conditions :) ).
The beta version still has some quirks. Here is a list of the currently known shortcomings:
1. There is no working memory management system in place yet.
2. The temporary table system for token hits and prefixes is flawed in some ways.
3. The memory consumption of PMBApi could still be optimized further.
The whole keyword dictionary is kept in memory during indexing, which speeds up the operation but requires quite a lot of memory.
If multiprocessing is enabled, each process keeps its own copy of the dictionary, so the memory consumption adds up.
Another pitfall is the process (token_compressor.php) in which temporarily stored token matches are fetched from the database, compressed and re-inserted
into the database table PMBTokens as compressed binary strings. The same applies to the prefix compression process (prefix_compressor.php) and the table PMBPrefixes.
The last two should actually be fairly easy to fix with simple memory consumption monitoring and write buffer tuning.
The first one requires some sort of compromise between memory usage and performance.
When keywords are tokenized, they are inserted into a temporary MyISAM table named PMBdatatemp with the columns checksum (INT), token_id (MEDIUMINT), document_id (INT), count (TINYINT), token_id_2 (MEDIUMINT) and field_id (TINYINT).
The checksum column holds a crc32 checksum of the actual token. It is stored because the data needs to be read in ascending order to ensure decent insert performance
for the final InnoDB table PMBTokens, which has the columns checksum (INT), token (VARBINARY(40)), doc_matches (INT) and doc_ids (MEDIUMBLOB), with a composite primary key on (checksum, token).
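Spelled out as DDL, the two tables described above would look roughly like this. This is a sketch reconstructed from the column lists in this post; the UNSIGNED modifiers and other options are my shorthand, not necessarily the exact definitions shipped in the release:

```sql
-- Temporary MyISAM table for raw token hits (sketch)
CREATE TABLE PMBdatatemp (
    checksum    INT UNSIGNED NOT NULL,        -- crc32 of the token
    token_id    MEDIUMINT UNSIGNED NOT NULL,
    document_id INT UNSIGNED NOT NULL,
    `count`     TINYINT UNSIGNED NOT NULL,
    token_id_2  MEDIUMINT UNSIGNED NOT NULL,
    field_id    TINYINT UNSIGNED NOT NULL
) ENGINE=MyISAM;

-- Final InnoDB table of compressed token data (sketch)
CREATE TABLE PMBTokens (
    checksum    INT UNSIGNED NOT NULL,
    token       VARBINARY(40) NOT NULL,
    doc_matches INT UNSIGNED NOT NULL,
    doc_ids     MEDIUMBLOB NOT NULL,          -- compressed binary string
    PRIMARY KEY (checksum, token)
) ENGINE=InnoDB;
```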
MyISAM tables have great insert performance, especially when the indexes are disabled right after the table is created and enabled again once the table has been fully populated.
Initially the data is in random order, but after the covering index is enabled with the proper ALTER TABLE command,
a sorted index is created that satisfies the whole SELECT query, providing good performance. However, this method has three downsides:
1. The inserted temporary data is not compressed, so it takes a lot of space.
2. The resulting index takes a lot of space.
3. Creating the index takes quite a long time, especially for tables with billions of rows.
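The bulk-load pattern described above could be sketched like this. The index is assumed to have been defined at table creation; note that on MyISAM, DISABLE KEYS only affects non-unique indexes:

```sql
-- Skip index maintenance during the bulk inserts
ALTER TABLE PMBdatatemp DISABLE KEYS;

-- ... bulk INSERTs of the tokenized data happen here ...

-- Rebuilds the index in one sorted pass instead of row by row
ALTER TABLE PMBdatatemp ENABLE KEYS;

-- The index covers this query completely, so MySQL can stream the rows
-- in ascending checksum order straight from the sorted index:
SELECT checksum, token_id, document_id, `count`, token_id_2, field_id
FROM PMBdatatemp
ORDER BY checksum;
```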
One solution could be to write the temporary data in compressed form, since it needs to be compressed at some point.
This would use less space and quite possibly the writes would be faster too. The table could, for example, have three columns: checksum, token_id and doc_ids.
However, this would create a new problem: how to read the table in the correct order.
Creating an index on the checksum column would be trivial, but the real problem is that the index would be non-covering.
This would be fine if we only wanted to fetch the checksums in order, but now every entry in the index points to a data row at an essentially random position on the disk,
which is a real dealbreaker for conventional hard disk drives.
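To make the problem concrete, the compressed variant I have in mind would look something like this. This is purely hypothetical and not part of this release; the table name and types are placeholders:

```sql
-- Hypothetical compressed temp table (sketch)
CREATE TABLE PMBdatatemp_packed (
    checksum INT UNSIGNED NOT NULL,      -- crc32 of the token
    token_id MEDIUMINT UNSIGNED NOT NULL,
    doc_ids  MEDIUMBLOB NOT NULL,        -- already-compressed document id list
    KEY checksum_idx (checksum)          -- cannot cover the query below:
                                         -- MySQL only indexes BLOBs by prefix
) ENGINE=MyISAM;

-- Walks checksum_idx in order, but each index entry points to a row at an
-- essentially random file offset -> one random seek per row on a spinning disk:
SELECT checksum, token_id, doc_ids
FROM PMBdatatemp_packed
ORDER BY checksum;
```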
This is pretty much where I got stuck. The current method does work, but I know in my gut that there is a better method, which I haven't discovered yet.
If you're reading this and have some wild ideas, don't hesitate to mail me tips or try to solve it yourself ;)
PMBApi and memory consumption
The PMBApi needs to store matching document ids in arrays. If the user provides multiple keywords when querying a large search index,
the resulting array(s) can end up with a lot of items. And as you probably know, PHP arrays take a lot of space.
However, I don't feel this is a real dealbreaker, and I will pay some attention to it in future releases.
That is about it. I feel I am probably halfway to creating a suitable multi-purpose search engine purely with PHP and MySQL.
As this is the initial release, I would love to have some feedback on the work so far. If you have something to share,
don't be shy :) Cheers!
P.S. Remember to upgrade to PHP 7 (if you haven't already)