Welcome to the Treehouse Community
Want to collaborate on code errors? Have bugs you need feedback on? Looking for an extra set of eyes on your latest project? Get support with fellow developers, designers, and programmers of all backgrounds and skill levels here with the Treehouse Community! While you're at it, check out some resources Treehouse students have shared here.
Looking to learn something new?
Treehouse offers a seven day free trial for new students. Get access to thousands of hours of content and join thousands of Treehouse students and alumni in the community today.
Start your free trialAlex Flores
7,864 PointsShould I create many different tables and join or fewer tables?
This may have been answered in one of the videos, but I didn't catch it. It seems like a lot of work to create a bunch of tables that only have one or two data categories. In the example in the video, why couldn't we just create one table that includes the Movie, Year made, and the genre? Why create two different tables and then join them?
Is it strictly for speed or does it provide more flexibility when combining data tables?
Thanks
1 Answer
Eric Butler
33,512 PointsAlex, This is a good question because it's an important step in learning how to build relational database systems. The common wisdom is that you should "normalize" all tables, which means reduce duplication (redundancy) of data as much as possible. The reason to do this is to make your database easier to maintain and less likely to contain errors (it should also keep your overall database size smaller). Normalizing does help with flexibility, but it does not help with speed; in fact, it will make all of your SQL responses (infinitesimally) slower [because it's a more complicated request for SQL to interpret] -- but unless you are handling an enterprise-level amount of requests per second it's a meaningless difference. What's funny is that I've heard that super-huge database-driven sites (like Facebook) actually do what you're suggesting, and go against common wisdom and use the anti-pattern of keeping more data in fewer tables, in order to get that teeny-tiny speed gain which, for a site like Facebook, would expand out to real performance savings. Let me reiterate though, that any database you build, even for a decently successful corporate client, will not even come close to needing to resort to this bad practice, so you should normalize your tables and make your database-administration life a lot easier.
Here are some off-the-cuff examples of why to normalize:
Using the movies database example, let's say you have 150,000 movies in the table, with the genre written out in the same record -- no separate "genres" table. Now you want your website to have a drop-down menu so users can select a genre to filter movies by. How do you create that drop-down? If your genres are all hard-written into the 150,000 records, you'd have to query every record in the entire database and return a set of all the unique genre values. That's a pretty significant performance problem. And what if some instances of a genre are spelled wrong? On the other hand, if you have a table named "genres" (and just use the genre's ID in the "movies" table), you simply have to loop through the dozen or so records in that table, which is a trivial SQL command (SELECT genre_name FROM genres), and you're done. You also eliminate any possibility of stray spelling mistakes.
Another example. Let's say you have a database for a small social-sharing website. One of your tables keeps track of a post's "likes." It is not normalized. Its fields are: the post itself, the username that liked it, and the date/time they liked it. Let's say this website has been running a couple years, it has 20,000 users and you have 6 million records in this table. The website allows users to edit their posts, so whenever a post is edited, every record for that post in this table -- not to mention in every other table where the post is referenced -- has to be updated to reflect the new content. The website also allows users to change their username, so whenever any user changes their username, the database has to update every record in this table -- and every other table where their username is referenced -- to match their new username. Do either of these seem like a good idea? Or do they seem like major performance problems? Not only in speed of execution, but also in their impact on your data's integrity: if something about the query was mis-formed, either of these updates could easily compromise hundreds or thousands of records, damaging posts's like-counts and users's like-histories across your site. If, on the other hand, the "likes" table simply used the post's ID and user's ID, and there was a separate table called "posts" that kept track of the post's ID, content, etc., and a third table called "users" that kept track of the user's ID, username, etc., then whenever a post gets updated, or a user changes their username, only that single record in the "posts" or "users" table needs to be updated, and all relationships between all the data remain intact.
The more you can keep your types of data separate, and only use IDs and "mapping tables" to draw relationships between those types of data, the easier you'll be able to access and maintain all the data. It's easy to think, when you have 10 records in a starter database, that normalizing seems like overkill, but as you get more comfortable making database architecture decisions, always imagine how your plan would work if the table had 10 million records in it. (A whole lot of real-use databases do.)
Hope that helps you a little, let me know if you have any questions or thoughts.
Alex Flores
7,864 PointsAlex Flores
7,864 PointsEric, I really appreciate the detailed response. That was one of the best responses I think I've ever received and it was extremely helpful. I'm putting together my first database schema and it's much more difficult than I thought it was going to be.
I believe it's a pretty basic database. I'm storing customer information, customer product information and a email_queye table. I have about 15 tables right now and I'm tempted to just use one primary key (customer_id) in the main table and then just use that primary key in a reference in 12/15 other tables. To me it makes sense to use just one primary key to associate many of the other tables, but I'm hesitant to do this, because I've been looking at examples of other database models and I noticed that most of them don't do this. I'm wondering why this would be the case?