Henri Ruutinen 05.05.2021 22:18
- PMB now comes with an easy-to-use demo search page, demo/demo.php
- SetFieldWeights had a bug, which is now fixed
- 32bit version of PMBApi had a bug in SetFilterBy and SetFilterRange functions
- Cleaned up some redundant code + created a new method for commonly used code in PMBApi
Henri Ruutinen 27.04.2021 23:00
- Fixed deprecated curly bracket syntax for array indexes (PHP 7.4+)
- In some cases, an integer needle may have been passed to the strpos() function in Search() and SearchFocuser() methods.
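For background on the strpos() item: before PHP 8.0 a non-string needle was converted to an integer and treated as the ordinal value of a single character, so a numeric keyword could match the wrong thing. A minimal sketch of the usual remedy, casting the needle to a string. The helper name find_keyword_pos is hypothetical and not part of PMBApi:

```php
<?php
// Hypothetical helper, not part of PMBApi: always search for the needle as a string.
function find_keyword_pos(string $haystack, $needle)
{
    // The cast avoids the pre-PHP 8 behaviour where an integer needle
    // was interpreted as the ordinal value of one character.
    return strpos($haystack, (string) $needle);
}

// Returns the offset of "66" in the haystack, or false if it is absent.
$pos = find_keyword_pos("route 66 diner", 66);
```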
Henri Ruutinen 27.02.2021 21:12
To offer better support for shared hosting environments, 32bit PHP runtime environments are now supported.
The control panel automatically detects your PHP environment type and, if required, prompts you to replace the original binaries with the 32bit ones either semi-manually or manually.
If you are using the cli configuration tool, please copy the php files from the bin32 folder to the main folder and replace the existing files.
Even though implementing variable byte encoding and decoding is much trickier in 32bit environments, searching and indexing performance luckily took only a very minor hit (~5-10%).
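If you want to check the environment type yourself before copying files from the bin32 folder, one common way is to look at PHP's integer size. This is only a sketch and not necessarily how the control panel performs its detection. The function name is_32bit_php is hypothetical:

```php
<?php
// Hypothetical helper: PHP_INT_SIZE is 4 on 32bit builds and 8 on 64bit builds.
function is_32bit_php(): bool
{
    return PHP_INT_SIZE === 4;
}

echo is_32bit_php()
    ? "32bit PHP detected: use the files from the bin32 folder\n"
    : "64bit PHP detected: the default files can be used\n";
```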
There are also numerous smallish improvements, such as:
- better support for async curl script execution method with https protocol
- the index-specific search window in control.php will now hide properly. Before, it was hidden but still blocked clicking or selecting any items under it.
- the control.php self tests are now run only once after logging in. This should make the control panel slightly more responsive.
- the exec() script execution method can no longer be chosen if it is detected to be unavailable.
Henri Ruutinen 09.04.2020 00:40
Henri Ruutinen 23.11.2017 14:01
# Sort results by the internal @id attribute
$pickmybrain->SetSortMode(PMB_SORTBY_ATTR, "@id DESC");
# Search these keywords, skip the 40 first results and return 20 matches after that.
$result = $pickmybrain->Search("flat earth society", 40, 20);
# this index is set when the total number of results is only an approximation
# If the offset ( 40 above ) is too high, this flag will be set
Henri Ruutinen 18.07.2017 01:20
# if you want to provide a focused version of your ( long )
# indexed data field with highlighted keywords
$query = "mykeyword"; // your search query
$stem_language = "fi"; // language for stemming ( en | fi )
$chars_per_line = 90; // how many chars before forced linebreak ( <br> )
$max_len = 150; // how many chars the focused result may contain in total
$row["content"] = $pickmybrain->SearchFocuser($row["content"], $query, $chars_per_line, $max_len);
Henri Ruutinen 14.07.2017 20:20
Henri Ruutinen 01.07.2017 20:54
Henri Ruutinen 20.06.2017 18:13
Henri Ruutinen 15.06.2017 18:12
Henri Ruutinen 24.05.2017 18:41
Henri Ruutinen 04.04.2017 23:15
Henri Ruutinen 21.03.2017 16:34
Henri Ruutinen 14.03.2017 20:00
Henri Ruutinen 09.03.2017 23:25
Henri Ruutinen 03.03.2017 21:09
Henri Ruutinen 30.08.2016 12:25
Henri Ruutinen 25.08.2016 12:25
Henri Ruutinen 24.07.2016 01:07
Henri Ruutinen 22.07.2016 17:50
Henri Ruutinen 20.07.2016 18:32
Henri Ruutinen 18.07.2016 00:32
As the sudden jump in the version number may indicate, this is a rather big update. Pickmybrain 0.90 BETA utilizes linux sort ( if it is available ) for sorting temporary match data. Unlike before, the temporary data is not written into the database but into a text file instead. Not only is this faster, it also uses less space than the database table + covering index design. And when it comes to sorting the actual data, linux sort blows MySQL away. For big indexes, linux sort can be ten times faster or even more. This is all good news, but of course your web environment has to support the exec() function and have the linux sort program preinstalled. For the unlucky ones ( and Windows users ) the older MySQL sort method is still supported and will be supported in future versions too.
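The idea of sorting a temporary text file with the external sort program, and falling back when exec() is not usable, can be sketched roughly as follows. This is not Pickmybrain's actual code: the function name sort_match_file, the one-record-per-line file layout and the in-memory PHP fallback ( instead of the MySQL sort method ) are all assumptions made for illustration.

```php
<?php
// Hypothetical helper: sort the lines of a temporary match file in place.
// Uses the external sort program when exec() is usable, otherwise sorts in PHP.
function sort_match_file(string $path): string
{
    $disabled = array_map('trim', explode(',', (string) ini_get('disable_functions')));
    if (function_exists('exec') && !in_array('exec', $disabled, true)) {
        // LC_ALL=C forces plain byte-order comparison; -o writes the result back to the same file.
        $cmd = 'LC_ALL=C sort -o ' . escapeshellarg($path) . ' ' . escapeshellarg($path) . ' 2>&1';
        exec($cmd, $output, $status);
        if ($status === 0) {
            return 'external sort';
        }
    }
    // Fallback: read everything into memory and sort byte-wise ( suitable for small files only ).
    $lines = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) ?: [];
    sort($lines, SORT_STRING);
    file_put_contents($path, $lines ? implode("\n", $lines) . "\n" : '');
    return 'php sort';
}

// Usage: zero-padded keys keep byte order and numeric order identical.
$tmp = tempnam(sys_get_temp_dir(), 'pmb');
file_put_contents($tmp, "0000000300 7\n0000000012 3\n0000000150 9\n");
$used = sort_match_file($tmp);   // 'external sort' or 'php sort'
$sorted = file_get_contents($tmp);
unlink($tmp);
```

The external program sorts on disk with bounded memory, which is why it scales to files far larger than the PHP fallback shown here could handle.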
As a result of the new sorting method, I had to change the compressed inverted index architecture a bit. The bad news is that you have to purge your old search indexes, but everything else is good news. The new design uses less disk space and actually contains more information about the token positions in the indexed documents. This makes it possible to create new, more sophisticated ranking algorithms in the future if the need arises. The program also has to do less work during the actual indexing phase. Token matches are no longer stored as token pairs but as field positions instead, which greatly reduces the need for random I/O during indexing.
This version also has numerous bug fixes. The token_compressor_merger file had a design flaw which led to primary key collision(s) under certain circumstances. Also, the prefix_composer lacked a string multibyte definition and had an incorrect start offset for growing indexes, which led to an inconsistent number of created prefixes if multiprocessing was enabled.
- the indexer.php file has a new parameter: purge ( this removes all previously indexed data from the current index )
- PMB_MATCH_STRICT matching mode now requires that the first provided keyword is the first token of some field for the document to be considered a match. ( before, the requirement was: last provided keyword = last token of some field )
In future versions I will concentrate more on the performance of the application programming interface.
Henri Ruutinen 21.06.2016 03:32
- The problems with memory consumption have now been (mostly) fixed. In earlier versions tokens were kept in memory during indexing, but now they are inserted into a temporary table with a unique index. Match data and tokens are matched later with the help of a 48-bit checksum ( or 32-bit and 16-bit checksums, to be exact ). Performance did not suffer either: initial tests show promising results.
- Performance upgrade: replaced prepared statements with PDO's quote() method. This method turned out to be twice as fast and there should not be any security implications either.
- The prefix data merging method is now better. Earlier the temporary prefix data was kept and all the prefixes were compressed again when the indexer was run. Now the new data is merged directly with the old data, which greatly reduces the disk impact.
- Temporary data for token matches and prefixes is now removed after indexing is done.
- Added index id into the web-based control panel. Earlier you just had to figure it out :)
- Removed some old, redundant code.
Henri Ruutinen 07.06.2016 18:54
- Pickmybrain can now be used from the command-line, see files clisetup.php and clisearch.php
- as a result, configuration files are now text-based and not php files
- Fixed incorrect index definition on the PMBDocinfo table for web-crawler indexes
- The indexed documents counter should now be more accurate on database indexes
- Fixed a bug that prevented dialect processing from working properly when the user made changes to the charset
- When the indexer is launched, the index state is now written to the database more quickly
Henri Ruutinen 28.05.2016 21:55
- MyISAM temporary tables now use compressed indexes ( ~20-25% less space usage + very very small performance gain )
- Fixed some issues with the external read-only database configuration
- Made a small change to the prefix compressor, now there is less incoming traffic from the database server
- Fixed a bug that caused manually defined attributes textarea to have "2" as default value on new indexes
Henri Ruutinen 26.05.2016 20:56
- Fixed a deltadecoding bug, which essentially stopped prefixes from working
Henri Ruutinen 22.05.2016 15:47
I decided to release an initial version of Pickmybrain. It is by no means a finished product, but it should give some idea of what to expect. Pickmybrain is licensed under GPLv3, which means you are free to download, modify and redistribute it as you wish ( as long as you remember to honour the other license conditions :) ).
The beta version still has some quirks. Here is a list of currently known shortcomings:
1. There is no working memory management system in place yet.
2. The temporary table system for token hits and prefixes is flawed in some ways
3. Memory consumption of PMBApi could still be optimized further
The whole keyword dictionary is kept in memory during indexing, which speeds up the operation but requires quite a lot of memory. If multiprocessing is enabled, each process keeps its own dictionary and the memory consumption adds up. Another pitfall is the process (token_compressor.php) in which temporarily stored token matches are fetched from the database, compressed and inserted again into the database table PMBTokens as compressed binary strings. The same applies to the prefix compression process (prefix_compressor.php) and the table PMBPrefixes. These last two should actually be fairly easy to fix with simple memory consumption monitoring and write buffer tuning. The first one requires some sort of compromise between memory usage and performance.
When keywords are tokenized, they are inserted into a temporary MyISAM table named PMBdatatemp (checksum(int), token_id(mediumint), document_id(int), count(tinyint), token_id_2(mediumint), field_id(tinyint)).
The checksum column is a crc32 checksum of the actual token. It is stored because the data needs to be read in ascending order to ensure decent insert performance for the final InnoDB table PMBTokens, which has the columns (checksum(int), token(varbinary(40)), doc_matches(int), doc_ids(mediumblob)) and a composite primary key of (checksum, token).
MyISAM tables have great insert performance, especially when the indexes are disabled right after the table creation and enabled after the table has been fully populated with data.
Initially the data is in random order, but after the covering index is enabled with the proper ALTER TABLE command, a sorted index is created and it satisfies the whole select query, providing good performance. However, this method has three downsides:
1. the inserted temporary data is not compressed, thus it takes a lot of space
2. the resulting index takes a lot of space
3. creating the index takes quite a long time, especially for tables that have billions of rows.
One solution could be to write the temporary data in compressed form, since it needs to be compressed at some point anyway. This would use less space and quite possibly the writes would be faster too. The table could, for example, have three columns: checksum, token_id and doc_ids. However, this would create a new problem: how to read the table in the correct order. Creating an index on the checksum column would be trivial, but the real problem is that the index would be non-covering. This would be ok if we only wanted to fetch the checksums in the correct order, but now every entry in the index points to a data row at a random position on the disk, which is a real dealbreaker for conventional hard disk drives. This is pretty much where I got stuck. The current method works, but I know in my gut that there is a better method which I haven't discovered yet. If you're reading this and have some wild ideas, don't hesitate to mail me tips or try to solve it yourself ;)
The PMBApi needs to store matching document ids in arrays. If the user provides multiple keywords when querying a large search index, the resulting array(s) may contain a lot of items. And as you probably know, PHP arrays take a lot of space. However, I don't feel this is a real dealbreaker and I will personally pay some attention to it in future releases.
Henri Ruutinen 21.05.2016 21:20
It has been almost one year since I started this project. The original goal was to create a simple web crawler and a background system for indexing websites without too much hassle. Well, my ambitions grew along the way. Instead of a simple web crawler it actually started to make sense - for me at least - to expand the abilities of the background system, or a search engine as some might say, to support indexing of various databases. Some people would certainly want a more controlled way of indexing data, and in many situations it is surely simpler to read the data directly from a database than to output it as a group of web pages first. Simplicity in terms of usability was one of my main goals for the project, and I personally feel those searching for a very simple and easy to implement solution won't be disappointed either.
I chose MySQL as the data storage solution for my project. It is widely available on shared web hosting services and it's supported on many other platforms as well. Databases such as MySQL also have built-in methods for data caching, which is really great since I didn't feel like reinventing the wheel. But from the beginning I had no intention of using the existing full-text search feature. It has proved to be notoriously inconsistent across different MySQL versions and it is not very performant either. The tricky part was of course to create something better within the data structures and limitations of MySQL. I started with a traditional relational approach. And for a long time I really struggled to create something that would be versatile ( feature-wise ) and performant at the same time. In the end it just seemed impossible - after many hours of optimization the queries would run reasonably fast when already cached, but the real problem was the caching itself. Not that it would not work, but an inverted index like this, with every token hit recorded as its own row, simply took too much space and filled the precious buffer pool way too early. It was time to think differently.
I was researching how to compress integers efficiently and ran into an article about variable byte codes. Normally when an integer is stored, the nominal size of the integer type determines exactly how many bits are used for storing the actual value. If I would like to store the number two ( 10 in binary ) as a 32-bit integer, it would be stored with 30 leading zeros. Variable byte encoding aims to reduce the number of space-consuming leading zeros by dividing the integer into smaller parts ( into bytes for example ). The first bit of each part ( or byte ) is reserved for indicating whether the number ends at that part. In the case of the number two, we would end up with a bit sequence of 10000010. That is already a 4 to 1 compression ratio. If we want to encode an array of integers, we can apply another trick as well. If the values can be sorted in ascending order, only the first value needs to be stored as a complete number - the following values can be stored as delta values. For example an array with values (1, 100, 150, 230) could be delta encoded into (1, 99, 50, 80). Combined with variable byte encoding, delta encoding provides additional space savings.
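To make the two tricks concrete, here is a small sketch that follows the convention described above: 7 payload bits per byte, with the first bit set on the byte where a number ends. It is an illustration only and not Pickmybrain's actual implementation. The function names and the choice to store the most significant 7-bit group first are assumptions.

```php
<?php
// Encode one non-negative integer: 7 payload bits per byte, high bit set on the final byte.
function vbyte_encode(int $n): string
{
    $bytes = chr(0x80 | ($n & 0x7F));      // last byte carries the end flag
    $n >>= 7;
    while ($n > 0) {
        $bytes = chr($n & 0x7F) . $bytes;  // earlier bytes have the flag cleared
        $n >>= 7;
    }
    return $bytes;
}

// Delta encode a list of ascending integers, then variable byte encode each delta.
function encode_sorted_ids(array $ids): string
{
    $out = '';
    $prev = 0;
    foreach ($ids as $id) {
        $out .= vbyte_encode($id - $prev); // first value is stored whole ( delta from 0 )
        $prev = $id;
    }
    return $out;
}

// Reverse both steps and return the original integers.
function decode_sorted_ids(string $bin): array
{
    $ids = [];
    $value = 0;
    $prev = 0;
    $len = strlen($bin);
    for ($i = 0; $i < $len; $i++) {
        $byte = ord($bin[$i]);
        $value = ($value << 7) | ($byte & 0x7F);
        if ($byte & 0x80) {                // end flag: the delta is complete
            $prev += $value;
            $ids[] = $prev;
            $value = 0;
        }
    }
    return $ids;
}

// The example array from the text: deltas (1, 99, 50, 80) each fit in a single byte.
$packed = encode_sorted_ids([1, 100, 150, 230]);
$restored = decode_sorted_ids($packed);
```

Here the four values take 4 bytes in total, compared to 16 bytes when stored as plain 32-bit integers.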
This would of course mean moving from the common one column one value approach to variable length binary columns. The relational database model would also need to be completely discarded. But hey, anything that works! Many parts of the program code had to be rewritten and the code base became inherently more complex, since compressing, decompressing, writing and reading binary data all need customized functions. At the same time I had a new old idea: for some years I had been developing a library for sentiment analysis and in fact it was already in use in another project. I had pretty much finished the libraries for the English and Finnish languages but the actual analyzer still lacked something. The weakness of my original approach started to rear its head when the analyzed texts got long. Normally the writer expresses his/her feelings on many different subjects, and analyzing the text as a whole does not give enough information on which topics the writer likes or dislikes. But then I realized that combining a search engine and a sentiment analyzer actually makes a lot of sense. A search engine provides a natural way to tokenize the document, and storing the score context of individual tokens is trivial when you've already got the right architecture. So that is what Pickmybrain turned out to be - a combined search engine and sentiment analyzer.