In my previous post I blogged about upgrading to MariaDB 5.2 RC. This is the first blog post where I will look at a new feature in MariaDB 5.2: Virtual Columns. This feature was originally contributed to MySQL by Andrey Zhakov in 2008. MariaDB is now the first to include it and ship it in an upcoming stable release. (I believe it is also included in Drizzle, at least Monty Taylor was working with Andrey on it some time ago.) The MariaDB knowledgebase explains it in detail, but the basic idea is easy to explain:
> create table table1 (
-> a int not null,
-> b varchar(32),
-> c int as (a mod 10) virtual,
-> d varchar(5) as (left(b,5)) persistent);
Query OK, 0 rows affected (0.06 sec)
> insert into table1 values (11, 'some text',default,default);
> select * from table1;
+-----+----------------+------+-------+
| a | b | c | d |
+-----+----------------+------+-------+
| 11 | some text | 1 | some |
The "virtual" flavor of virtual columns is computed on the fly when data is queried (like a VIEW). The "persistent" flavor otoh is computed when data is inserted, and stored in the table (like a materialized view, which MySQL or MariaDB doesn't provide yet). Persistent columns can be used in secondary indexes and partitioning keys. More on that later.
The obvious use case isn't actually interesting (imho)
So we are all excited about new features, especially as it was contributed from the outside of MySQL AB. But what is it good for? The obvious use case for virtual columns would probably be to compute some value out of something else available in the table:
MariaDB [test]> CREATE TABLE product (
productname VARCHAR(25),
price_eur DOUBLE,
xrate DOUBLE,
price_usd DOUBLE AS (price_eur*xrate) VIRTUAL);
Query OK, 0 rows affected (0.28 sec)
MariaDB [test]> INSERT INTO product VALUES ('toothpaste', 1.5, 1.39, default);
Query OK, 1 row affected (0.02 sec)
MariaDB [test]> SELECT * FROM product;
+-------------+-----------+-------+-----------+
| productname | price_eur | xrate | price_usd |
+-------------+-----------+-------+-----------+
| toothpaste | 1.5 | 1.39 | 2.085 |
+-------------+-----------+-------+-----------+
1 row in set (0.00 sec)
Nitpick: If I use INSERT without supplying an explicit list of columns, I cannot leave out the virtual columns from the list. Otoh I obviously cannot supply a value for them either, so I just have to write "default" for that place. Annoying, but I suppose this makes sense, since any column could be virtual, in a large table it would be easy to lose count if they were just omitted. So now, if I'm an American and I'd like to know the USD price for a product, I can just select it and don't need to know that the official price is actually set in a dangerous foreign currency. The only problem is... this isn't the right way to use a database! I've perhaps seen examples like this in SQL books in the 90's, but that was then. You could actually have "SQL for secretaries" type of courses back then. Today, at least I'm one of those that believe that any computation should always happen in the application layer and the database layer is just for storing the data. This is more portable, and usually more efficient. And secretaries don't use SQL to directly access the database!
Using virtual columns in a data warehouse
So suppose we have a table sales
which contains records of items sold, including a column for the date when the sale happened. This would be typical in a data warehousing scenario.
CREATE TABLE sales (product VARCHAR(25), price DECIMAL(6,2), date_sold DATE);
But suppose then that for Business Intelligence purposes, we are interested in investigating sales per weekday. That is a bit tricky with the above table. So we could add a persistent virtual column:
MariaDB [test]> CREATE TABLE sales (
product VARCHAR(25),
price DECIMAL(6,2),
date_sold DATE,
dayofweek INT AS (WEEKDAY(date_sold)) PERSISTENT,
KEY (dayofweek));
Query OK, 0 rows affected (0.04 sec)
MariaDB [test]> INSERT INTO sales VALUES ('toothpaste', 1.5, '2010-10-24', default);
Query OK, 1 row affected (0.01 sec)
MariaDB [test]> select * from sales;
+------------+-------+------------+-----------+
| product | price | date_sold | dayofweek |
+------------+-------+------------+-----------+
| toothpaste | 1.50 | 2010-10-24 | 6 |
+------------+-------+------------+-----------+
1 row in set (0.00 sec)
MariaDB [test]> select * from sales WHERE dayofweek=6;
+------------+-------+------------+-----------+
| product | price | date_sold | dayofweek |
+------------+-------+------------+-----------+
| toothpaste | 1.50 | 2010-10-24 | 6 |
+------------+-------+------------+-----------+
1 row in set (0.01 sec)
MariaDB [test]> EXPLAIN select * from sales WHERE dayofweek=6;
+----+-------------+-------+------+---------------+-----------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+-----------+---------+-------+------+-------------+
| 1 | SIMPLE | sales | ref | dayofweek | dayofweek | 5 | const | 1 | Using where |
+----+-------------+-------+------+---------------+-----------+---------+-------+------+-------------+
1 row in set (0.00 sec)
Note that for a PERSISTENT virtual column I can add an index, so searching per weekday not only becomes easier, but also efficient.
Let's combine it with partitioning
Speaking of data warehousing, in partitioning we often need a key like weekday or month to enable easy deletion of old records on a rolling basis. So let's combine the above with partitioning. Before virtual columns, I would create a partitioned table of past sales like such:
MariaDB [test]> CREATE TABLE sales (
product VARCHAR(25),
price DECIMAL(6,2),
date_sold DATE)
PARTITION BY RANGE (WEEKDAY(date_sold)) (
PARTITION mon VALUES LESS THAN (1),
PARTITION tue VALUES LESS THAN (2),
PARTITION wed VALUES LESS THAN (3),
PARTITION thu VALUES LESS THAN (4),
PARTITION fri VALUES LESS THAN (5),
PARTITION sat VALUES LESS THAN (6),
PARTITION sun VALUES LESS THAN (7));
Query OK, 0 rows affected (0.07 sec)
MariaDB [test]> INSERT INTO sales VALUES ('beer', 5, '2010-10-23');
Query OK, 1 row affected (0.01 sec)
MariaDB [test]> select * from sales;
+------------+-------+------------+
| product | price | date_sold |
+------------+-------+------------+
| beer | 5.00 | 2010-10-23 |
| toothpaste | 1.50 | 2010-10-24 |
+------------+-------+------------+
2 rows in set (0.00 sec)
MariaDB [test]> EXPLAIN PARTITIONS select * from sales WHERE WEEKDAY(date_sold)=6;
+----+-------------+-------+-----------------------------+------+---------------+------+
---------+------+------+-------------+
| id | select_type | table | partitions | type | possible_keys| key |
key_len | ref | rows | Extra |
+----+-------------+-------+-----------------------------+------+---------------+------+
---------+------+------+-------------+
| 1 | SIMPLE | sales | mon,tue,wed,thu,fri,sat,sun | ALL | NULL | NULL |
NULL | NULL | 7 | Using where |
+----+-------------+-------+-----------------------------+------+---------------+------+
---------+------+------+-------------+
1 row in set (0.00 sec)
In the above I've created a table sales
where all records from the same weekday is in the same partition. This allows me to easily drop all records of a given day, in particular I could do a nightly delete of the partition that is 7 days old, so that I keep a history of one week. (And you'd use the same system to store records worth of 12 months, 3 years... whatever the required retention time is.) But in the above there still isn't any index that would make it efficient to do SELECT queries using the WEEKDAY(). For a large table that is simply not possible. Also, even if we have nicely partitioned the records by weekday, you can see that when querying a weekday, MariaDB is still searching all partitions, because it doesn't understand that it could actually just look in the "sun" partition (this is called partition pruning). So using virtual columns I could instead do:
MariaDB [test]> CREATE TABLE sales (
product VARCHAR(25),
price DECIMAL(6,2),
date_sold DATE,
dayofweek INT AS (WEEKDAY(date_sold)) PERSISTENT,
KEY (dayofweek))
PARTITION BY RANGE (dayofweek) (
PARTITION mon VALUES LESS THAN (1),
PARTITION tue VALUES LESS THAN (2),
PARTITION wed VALUES LESS THAN (3),
PARTITION thu VALUES LESS THAN (4),
PARTITION fri VALUES LESS THAN (5),
PARTITION sat VALUES LESS THAN (6),
PARTITION sun VALUES LESS THAN (7));
Query OK, 0 rows affected (0.17 sec)
MariaDB [test]> INSERT INTO sales VALUES (
'toothpaste', 1.5, '2010-10-24', default);
Query OK, 1 row affected (0.00 sec)
MariaDB [test]> INSERT INTO sales VALUES (
'beer', 5, '2010-10-23', default);
Query OK, 1 row affected (0.00 sec)
MariaDB [test]> select * from sales;
+------------+-------+------------+-----------+
| product | price | date_sold | dayofweek |
+------------+-------+------------+-----------+
| beer | 5.00 | 2010-10-23 | 5 |
| toothpaste | 1.50 | 2010-10-24 | 6 |
+------------+-------+------------+-----------+
2 rows in set (0.00 sec)
MariaDB [test]> EXPLAIN PARTITIONS select * from sales WHERE dayofweek=6;
+----+-------------+-------+------------+------+---------------+-----------+
---------+-------+------+-------------+
| id | select_type | table | partitions | type | possible_keys | key |
key_len | ref | rows | Extra |
+----+-------------+-------+------------+------+---------------+-----------+
---------+-------+------+-------------+
| 1 | SIMPLE | sales | sun | ref | dayofweek | dayofweek |
5 | const | 1 | Using where |
+----+-------------+-------+------------+------+---------------+-----------+
---------+-------+------+-------------+
1 row in set (0.00 sec)
So using the virtual column dayofweek
we get both an index search and partition pruning.
There is more than one way to do it
In Perl there is the motto TIMTOADI: There is more than one way to do it. Also here I'm thorn between the two approaches available. The above is helpful and elegant. On the other hand I could also just have a normal integer column and make sure my application or ETL tool inserts the correct week of day. So which is better? Points against virtual columns (doing the calculation outside the database) Simpler SQL is more portable. Simpler SQL is often more efficient/scalable, since it is easier to scale outside the database. In MySQL we usually avoid triggers and stored procedures for the same reason. Points for virtual columns A nice way of getting indexes and partition pruning on a piece of data that is a derivative of the raw data. If you want a normalized database, this is the right way to do it, since date_sold and dayofweek are duplicate data. (But a data warehouse is often de-normalized anyway.) If your application is complex, or if many applications use the same database, then changing all applications to correctly insert a dayofweek column may not be practical. Using virtual columns the column can be added in a centralized way and the DBA can enforce database integrity, without having to trust the application developers.
I like it
Perhaps it is "yet another SQL feature" following views, stored procedures and triggers (from MySQL 5.0), but I like it. For those of us that still want the database to be normalized and responsible for integrity, this is a very useful feature that enables use of indexes and partitions in a way not possible before.
Coming up next: Using virtual columns when MariaDB is used as a key-value document store
In a follow-up blog post I will show how I used virtual columns for a use case where MariaDB is used as a document store, and thanks to virtual columns we can add secondary indexes alongside the normal key-value access pattern, or we can opt to SELECT just a snippet of the document instead of the whole BLOB. Before I write the next blog, I invite you to guess the result of the benchmark. I had two conflicting rules of thumb as hypothesis:
- SELECTING via a virtual column only one word instead of a large XML document saves bandwidth and is therefore faster than the standard key-value fetch.
- SELECTING via a virtual column only one word instead of a large XML document uses more CPU in the database and is therefore slower.
I tested using both VIRTUAL
and PERSISTENT
virtual columns, and on a CPU bound, disk bound and network bound (ie performance dominated by network latency). Place your bets, the results will be published later this week!
Other use cases?
I'd still be interested to hear about more good use cases for virtual columns. Please comment or henrik.ingo [at] avoinelama.fi (email) if you have some. If you've used this feature on Oracle or SQL Server, please tell us what it is used for?
- Log in to post comments
- 31597 views
Drizzle pulled it
"(I believe it is also included in Drizzle, at least Monty Taylor was working with Andrey on it some time ago.)"
Drizzle cut virtual col's a long time ago.
Re: Drizzle pulled it
Ok, thanks for helping out. I thought that might be the case, since I couldn't find more info on it, but otoh I didn't want to be caught "leaving out" Drizzle either, so that's why I mentioned it.
Also migration friendly
As far as I know other RDBMS have an option to specify a column using functions. Almost sure it applies to SQL Server at least. Virtual columns will make migration to MySQL/MariaDB easier.
(but an 'off-topic': what I miss most in MySQL DDL is CHECK CONSTRAINTS. Any chance that MariaDB will add this?)
You can actually effectively
You can actually effectively have check constraints in MySQL using views and the WITH CHECK OPTION part of the CREATE VIEW statement. http://dev.mysql.com/doc/refman/5.1/en/create-view.html (this is in MySQL 5.0 and up, I give the docs for 5.1 because that's the supported version right now).
It's not perfect, and obviously it would be better to actually support CHECK CONSTRAINTS, but for folks who are waiting for check constraints in order to migrate, this might be all they need.
Leaving out columns
If you specified the column names of your inserts, would you still need to put DEFAULT in there?
I generally avoid doing inserts without specifying columns as I can never remember the order of columns and its a bit clearer if the column names are spelled out.
e.g.
INSERT INTO sales(product, price, date_sold) VALUES ('toothpaste', 1.5, '2010-10-24');
Re: Leaving out columns
No, in that case you of course leave out the virtual columns from your explicit list.
In real world situations I always specify the columns too, so the requirement to use 'default' isn't then an issue really. It was just annoying when typing simple examples, that's all.
Virtual column use
I have a table with a list of person and their birth date. When I discovered the MariaDB virtual columns, the first example that came in my mind was to add a column calculating their age. Unfortunately, this was refused by MariaDB, probably because the result is non-deterministic.
I can understand this restriction for permanent column, but what prevents accepting non-deterministic values for virtual columns?
Interesting question. I think
Interesting question. I think the answer probably is that they made the restrictions the same for permanent and virtual columns. But I don't know, you need to ask Philip or Igor at https://launchpad.net/~maria-developers
It could be argued though that the user has the right to expect that the contents of a table are deterministic, and non-deterministic results can be had via a VIEW or PROCEDURE only. I don't know if such a restriction makes sense, but it's something to think about.
Email addresses
It's not terribly efficient, especially for searching, to store complete email addresses in fields by themselves. It's fine if you're only ever looking for an exact match on a complete address, but as soon as you need to find all addresses in a domain it's suboptimal. So you might break it out into local part and domain fields (or even break out domains into a separate table), but then you've lost the easy syntax for finding a single address. Seems like a perfect use for virtual fields to me! So now the question is which way around to split it?
a) a single real field to store the full address in, and two virtuals for local and domain elements
b) local and domain in real fields and a virtual field concating them together with an @ for the full address
searching on domain is the most common scenario for this schema, so is there any difference in indexing a persistent virtual and a real field?
Tried to add Age
I tried to calculate an Age column but I get a 1901 error, Function or expression is not allowed for column 'Age'
ALTER TABLE members ADD COLUMN Age TINYINT AS (DATE_FORMAT(FROM_DAYS(TO_DAYS(NOW())-TO_DAYS(dob)), '%Y')+0) virtual AFTER DOB;
I am using MariaDB 5.5.30
Any ideas what I'm doing wrong?
You cannot use NOW() for a
You cannot use NOW() for a virtual column as it is not deterministic.
an ENUM works better
Except that it is not a scalar (so you can't do things like VALUES LESS THAN), I use an ENUM to put a day name with dates.
ALTER TABLE `s_vehicle_log` ADD `Day` ENUM('Sun','Mon','Tue','Wed','Thu','Fri','Sat') AS (DATE_FORMAT(`Date`, '%a')) PERSISTENT AFTER `Date`;
An ENUM uses less storage (three bits) and should be faster. I like having it there for when I'm "data grooming" by looking for bad records to correct.
By the way, in 10.1.6 (at least), constants are supported, at least in PERSISTENT columns.
ALTER TABLE `z_test` ADD `Const` INT(5) AS (`ID` * 2) PERSISTENT AFTER `ID`;
MariaDB [EcoReality]> SELECT ID, Const from z_test;
+----+-------+
| ID | Const |
+----+-------+
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
| 4 | 5 |
+----+-------+
4 rows in set (0.00 sec)
I'm going to go push the envelope on other restrictions to see if they've been relaxed.
(oops)
(To those who are paying attention, the AS expression is obviously supposed to be "`ID` + 1".)