Hadoop: Handling of Binary Data

Posted on June 14, 2009 at 7:42 pm

Several large companies have recently made Hadoop available on their cloud-based computing platforms, such as Amazon Web Services (aws.amazon.com). Small players with large amounts of data to process now have access to thousands of computers to do it.

Binary data processing could include anything from image conversion, scaling and watermarking, to converting video to a web-ready format such as .flv, parsing other custom binary formats, or converting WAV to MP3.

However, there is currently no single easy way to process binary data with Hadoop. To read or write binary data with Hadoop you need to build a custom jar and extend the Hadoop Java classes for reading and writing data.
I’ve implemented binary versions of these classes and have several examples of how to use the classes FileInputFormat, FileOutputFormat, RecordReader and RecordWriter.
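
As a rough illustration of what a binary record reader does, here is a minimal sketch in plain Python. It is not the Hadoop API and not the classes mentioned above; it assumes one record per file, and the name `read_binary_records` is made up for the example.

```python
import os
from typing import Iterator, Tuple


def read_binary_records(directory: str) -> Iterator[Tuple[str, bytes]]:
    """Yield (filename, raw bytes) pairs, one record per file (an assumption)."""
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            # "rb" keeps the bytes untouched: no decoding, no splitting on newlines.
            with open(path, "rb") as handle:
                yield name, handle.read()
```

The point of the sketch is that the key is the file name and the value is the untouched byte string, whereas a line-oriented text reader would cut an image or audio file apart at every newline byte.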

My layman’s explanation of Map/Reduce:

Map/Reduce is a two-step process.

  1. Performing a map function on a key/value pair
  2. Performing a reduce function on a collection of key/value pairs that share the same key.

Each piece of data is referenced as a key => value pair:
Map( key, value ) => Array( (key1, value1), (key2, value2) )
Reduce( ((key1, value1), (key1, value2)) ) => ( key1, value3 )

Typically you would run the Map and Reduce functions on a distributed network of many CPUs.
So you could have many map functions running concurrently, and once all the data has been mapped it gets reduced. Again, the reduce step runs on a distributed network of many CPUs.
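
To make the two steps concrete, here is a small single-process sketch in Python that counts words. It only illustrates the idea and is not Hadoop code; the names `map_fn`, `reduce_fn` and `map_reduce` are made up for the example, and the grouping by key in the middle is done by hand where a real cluster would do it across machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_fn(key: str, value: str) -> List[Tuple[str, int]]:
    """Map(key, value) => list of (key, value) pairs: one (word, 1) per word."""
    return [(word, 1) for word in value.split()]


def reduce_fn(key: str, values: Iterable[int]) -> Tuple[str, int]:
    """Reduce all values that share one key into a single (key, value) pair."""
    return key, sum(values)


def map_reduce(records: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    grouped = defaultdict(list)
    # Map step: every record is independent, so a cluster could run these in parallel.
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            # Collect mapped values per key so each reduce call sees exactly one key.
            grouped[out_key].append(out_value)
    # Reduce step: every key is independent as well.
    return dict(reduce_fn(k, v) for k, v in grouped.items())


counts = map_reduce([("doc1", "a b a"), ("doc2", "b c")])
print(counts)  # {'a': 2, 'b': 2, 'c': 1}
```

Because neither the map calls nor the reduce calls depend on each other, both steps can be spread over as many machines as you have.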

Additional reading:
Get familiar with the map/reduce concept
Read more about the Hadoop implementation of Map/Reduce.


MySQL Insert

Posted on June 14, 2009 at 4:15 pm

OK, this is a short post about something that is really annoying about MySQL and Navicat 8.

Basically, running lots of single-row INSERT statements one after another is slow as heck on MySQL.
Very, very slow.
I honestly think this is a bug or a very poor implementation that should be fixed.

For instance, I was inserting about 22,000 rows using an SQL file produced by a tool called Navicat 8.
It took about 5 minutes. That’s really not an acceptable amount of time when you have users like me waiting for that data to be inserted.

So after a bit of googling I found a page on the MySQL site called
Speed of INSERT Statements.

Specifically, it mentions that you can insert more than one row per INSERT statement.
OK? So let’s try that out:

BOOM. 22,000 records inserted in less than one SECOND.

Now that’s what I’m talking about. I’ve taken 22,000 inserts and made them take about the same amount of time as 1 single insert. It’s incredible.

Here’s an actual example of multiple rows inserted per INSERT statement:
LOCK TABLES table_a WRITE, table_b WRITE;  -- makes the inserts even faster
INSERT INTO table_a VALUES (1,23),(2,34),(4,33);  -- insert 3 rows into table_a
INSERT INTO table_b VALUES (8,26),(6,29);  -- insert 2 rows into table_b
...
UNLOCK TABLES;
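
If you generate the SQL yourself, the same batching can be done in code. Below is a minimal Python sketch; the function name `multi_row_inserts` and the `batch_size` parameter are made up for the example, and it assumes integer columns only (real data should be escaped by your database driver). Splitting into batches keeps each statement below the server’s max_allowed_packet limit.

```python
from typing import Iterable, Iterator, List, Sequence


def multi_row_inserts(table: str, rows: Iterable[Sequence[int]],
                      batch_size: int = 1000) -> Iterator[str]:
    """Yield INSERT statements that carry up to batch_size rows each.

    Assumes integer values only; anything else needs proper escaping.
    """
    batch: List[str] = []
    for row in rows:
        batch.append("(" + ",".join(str(int(value)) for value in row) + ")")
        if len(batch) == batch_size:
            yield f"INSERT INTO {table} VALUES {','.join(batch)};"
            batch = []
    if batch:  # flush the last, possibly shorter, batch
        yield f"INSERT INTO {table} VALUES {','.join(batch)};"


for statement in multi_row_inserts("table_a", [(1, 23), (2, 34), (4, 33)], batch_size=2):
    print(statement)
# INSERT INTO table_a VALUES (1,23),(2,34);
# INSERT INTO table_a VALUES (4,33);
```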

I’m not sure who to blame more here: MySQL 5 or Navicat 8.

For MySQL 5, there shouldn’t be such a big difference in speed between many single-row inserts and one insert with many rows at once.

Navicat 8 has a poorly implemented Export Wizard for SQL files. The alternative to the Navicat 8 export is mysqldump --user=username --password database table, which is what most people use when their GUI tool breaks down like this. By default mysqldump writes multi-row INSERT statements (its --extended-insert option is on by default).

Originally I was going to blame PHP for my problems, but after writing and rewriting a function to load the SQL files from disk I found that the speed issues were really MySQL’s problem, made worse by the SQL queries produced by Navicat 8.

Here’s a simple function I wrote for reading an SQL file from disk and loading it into whatever database you specify in the settings.

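	/**
	 * Simple function for installing SQL files.
	 * http://dev.mysql.com/doc/refman/5.0/en/insert.html
	 *
	 * **WARNING**
	 * A single insert with many rows is far faster than many
	 * single-row inserts (5 minutes vs. under a second in my test).
	 * Use text search and replace to merge multiple inserts into a single insert.
	 * **WARNING**
	 *
	 * @param string $filename An SQL file to read in.
	 */
	function fast_install( $filename )
	{
		$content = file_get_contents( $filename );
		if ( $content === false )
			die( "Could not read $filename" );

		// Raise the server-wide packet limit to 16 MB so large inserts fit.
		// This needs the SUPER privilege and only applies to connections
		// opened afterwards, so reconnect before sending the file.
		$mysqli = new mysqli( DB_SERVER, DB_USERNAME, DB_PASSWORD, DATABASE );
		if ( $mysqli->connect_error )
			die( $mysqli->connect_error );
		$mysqli->query( "SET GLOBAL max_allowed_packet=16*1024*1024" ) or die( $mysqli->error );
		$mysqli->close();

		$mysqli = new mysqli( DB_SERVER, DB_USERNAME, DB_PASSWORD, DATABASE );
		if ( $mysqli->connect_error )
			die( $mysqli->connect_error );

		// Send the whole file, then step through every result so that an
		// error in any statement (not just the first) is reported.
		$mysqli->multi_query( $content ) or die( $mysqli->error );
		do
		{
			if ( $result = $mysqli->store_result() )
				$result->free();
		}
		while ( $mysqli->more_results() && $mysqli->next_result() );

		if ( $mysqli->errno )
			die( $mysqli->error );

		$mysqli->close();
	}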
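
The “text search and replace” step from the warning above can also be scripted. Here is a rough Python sketch that merges consecutive single-row inserts for the same table into multi-row inserts. It assumes the dump has exactly one `INSERT INTO ... VALUES (...);` statement per line, which may not match every export, and `merge_inserts` and `max_rows` are names made up for the example.

```python
import re
from typing import Iterable, Iterator, List, Optional

# Assumed line shape: INSERT INTO <table> [(cols)] VALUES (...);  one statement per line.
INSERT_RE = re.compile(r"^(INSERT INTO .+? VALUES) (\(.*\));\s*$")


def merge_inserts(lines: Iterable[str], max_rows: int = 1000) -> Iterator[str]:
    """Merge runs of single-row INSERT lines that share the same prefix."""
    prefix: Optional[str] = None
    rows: List[str] = []
    for line in lines:
        match = INSERT_RE.match(line)
        if match and match.group(1) == prefix and len(rows) < max_rows:
            rows.append(match.group(2))  # same table: extend the current statement
            continue
        if rows:  # the run ended: emit what has been collected so far
            yield f"{prefix} {','.join(rows)};"
        if match:
            prefix, rows = match.group(1), [match.group(2)]
        else:
            prefix, rows = None, []
            yield line.rstrip("\n")  # pass other lines through unchanged
    if rows:
        yield f"{prefix} {','.join(rows)};"
```

Feeding it the lines of a dump file (for example `merge_inserts(open("dump.sql"))`, where the file name is just a placeholder) gives back the same statements with each run of inserts collapsed, capped at `max_rows` rows per statement so that no single query grows past the packet limit.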
