By Srikanth Ramamurthy on Sep 11, 2013 7:51:24 AM
As programmers, we all know what a CSV file is all about. For novice, a CSV file is a "Comma Separated Values" files having some kind of data. Each line represents a record while each value is separated by a comma ",".
Let's take a simple example:
Say the CSV file contains order data. Each order record contains data about a particular order like order number, product code, quantity, ordered by, date etc. Now lets take a hypothetical requirement which says, the user needs to know how many orders were ordered per product.
Well given this requirement, as a programmer what are our options?
Option 1] At the basic level, read the order file, parse each line and write the logic to get count of orders per product.
Option 2] Load this CSV file into a database like MySQL and write database queries to get the orders per product.
Option 1 is good if the file size is small and when we have to deal with less number of orders. If the file size is big then we will get into other issues like performance, memory etc.
Option 2 is good but it requires database administrative capabilities to load the data into database. Of course we can use in-memory database but still we may need to use database utilities to load the files into the database. This will lead to duplication of data that is data in CSV file as well as same data loaded to database table.
So where does this leave us? Have we run out of options? I am sure we would have tried & used the above two options . But I wanted a different approach. Following questions were lingering in my mind:
1] Why can't I write SQL queries against the CSV file itself. After all it's like any RDBMS table.
2] Why should I load the CSV file into some database before I query it?
3] Why can't I create a database table & attach the CSV file to it?
Well my quest for the answers led me to "HSQL Text tables". HSQL is a database which can run in-memory. No pre-installation, configuration is required to use this database. It has a feature to create a "Text" table and attach the CSV file as a source to that table. No need of manually loading the data into this table.
So for our requirement, we used HSQL Text tables. Below are the steps to use Text tables:
1] Set the system property:
This will enable HSQL to load the CSV file from any path. In our case, we wanted to load the file from network path.
2] Create Text table using the below command:
"CREATE TEXT TABLE tbl_Order ord_no varchar(30), prod_code varchar(30).....";
3] Create any indexes, if required.
4] Attach the CSV file as a source to the table using the below command:
SET TABLE tbl_Order SOURCE "x:\\server\\order.csv";
Voila. That's it. Once the above steps are done, we can execute SQL queries as if its a normal table.
In a nut shell, HSQL Text tables are powerful & easy to use. It eliminated the lower level of file reading & parsing the file as well duplication of data and still met the requirement.