Transcription

EBOOK: REDSHIFT VS TERADATA, AN IN-DEPTH COMPARISON

Table of Contents

Redshift Vs Teradata
Redshift Architecture & Its Features
Teradata Architecture & Its Features
Redshift Data Model
Teradata Data Model
Pros
Cons
Teradata Pros and Cons
Pros
Cons
Features supported only by Teradata, not Redshift
Redshift Vs Teradata In A Nutshell
Pricing and Effort Comparison
When and How to Migrate data from Teradata to Redshift
Summary
ETL Challenges While Working With Amazon Redshift

Redshift Vs Teradata

Redshift versus Teradata is one of the most debated data warehouse comparisons. In this ebook, we cover a detailed comparison between Redshift and Teradata.

Redshift Architecture & Its Features

Redshift is a fully managed, petabyte-scale data warehouse on the cloud. You can start with a few gigabytes or terabytes of data and scale up to petabytes as your business requires. The Redshift engine is called a cluster, and it is built from one or more nodes. There are two types of nodes: Leader and Compute. Each Compute node contains two or more slices, depending on the node type. The Leader node plays multiple roles, which include communicating with JDBC/ODBC clients and creating the query execution plan that it passes to the Compute node(s). A cluster is incomplete without a Leader node.

You can check out our blog for a detailed article on Redshift Architecture.

Teradata Architecture & Its Features

Teradata is an RDBMS meant for data warehousing with an on-premises setup. It requires installation because it is not natively available on cloud platforms. That said, you can spin up a Teradata instance on a cloud VM. Teradata is designed on an MPP (massively parallel processing) shared-nothing architecture.

Here is a diagrammatic representation of Teradata Architecture.

[Figure: Teradata architecture]

The four major components of Teradata are as follows:

1. Node: The Node is the basic unit of Teradata. Each node has its own OS, CPU, RAM, disk space, etc.
2. Parsing Engine: The Parsing Engine (PE) is responsible for preparing the query execution plan.
3. BYNET: BYNET receives the query execution plan from the PE and transfers it to the AMPs (also known as Virtual Processors), and carries results back the other way. It is also called the Message Passing layer.
4. Access Module Processor (AMP): The AMP is an important component of Teradata. Each AMP manages the processing of its share of the data and stores it on vDisks. The AMP that holds a given row is determined by the hash algorithm.

BYNET is also responsible for communication between the AMPs. In multi-node systems, Teradata has at least two BYNETs to make the system fault tolerant: if the first BYNET fails, the additional BYNET takes over.
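To make the hash-based placement of rows on AMPs more concrete, here is a minimal Python sketch. It is only an illustration: Teradata uses its own row-hash function and hash map, so the MD5 hash, the `amp_for` and `distribute` helpers, and the `store_id` column are all assumptions made for this example.

```python
import hashlib
from collections import defaultdict


def amp_for(primary_index_value, num_amps):
    """Map a primary index value to an AMP number using a stable hash.

    MD5 is only a stand-in here; Teradata's real row hash is different.
    """
    digest = hashlib.md5(str(primary_index_value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_amps


def distribute(rows, key, num_amps):
    """Group rows by the AMP that their key column hashes to."""
    amps = defaultdict(list)
    for row in rows:
        amps[amp_for(row[key], num_amps)].append(row)
    return dict(amps)


# Hypothetical table with 1000 rows, spread over 4 AMPs.
rows = [{"store_id": i, "name": f"store-{i}"} for i in range(1000)]
placement = distribute(rows, "store_id", num_amps=4)

for amp in sorted(placement):
    print(f"AMP {amp}: {len(placement[amp])} rows")
```

Because the AMP is derived purely from the key value, a lookup on that key can go straight to one AMP, while a key with few distinct values would pile rows onto a few AMPs.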

Redshift Data Model

The Redshift data model is designed for data warehousing. The following features make Redshift a smart data warehouse choice.

1. Redshift is a fully managed data warehouse. You don't have to worry about setting up and installing the database. You just spin up your cluster and the database is ready.
2. Redshift's backup and restore are fully automatic. Through automatic snapshots, data in Redshift is backed up to S3 internally at regular intervals.
3. Data is secured by inbound security rules and SSL connections. A cluster in VPC mode is protected by the VPC, and a cluster in classic mode by inbound security rules.
4. Redshift stores data in a columnar format, unlike row-oriented data warehouses. For example, if your query touches a specific column, Redshift reads only that column instead of the entire row. This saves an enormous amount of time in query processing.
5. Data is stored in blocks of 1 MB instead of the typical 8 KB or 64 KB, which lets Redshift store more data in a single block.
6. Redshift does not have the concept of indexes. Instead, it has zone maps. A zone map records the lowest and highest value of a column in each block, so Redshift can tell the cluster which blocks need to be read for a query.
7. Redshift has column compression (encoding). The ANALYZE COMPRESSION command suggests which compression strategy to apply to each column of a table. Redshift provides various encoding techniques. Refer to the AWS documentation for more details on encoding.

8. Redshift caches the results of repeated queries for faster performance. To check whether your query used the cache, look at the source_query column in SVL_QLOG. If your query used the cache, source_query holds the ID of the query whose cached result was reused.

Example:

    SELECT userid, query, elapsed, source_query
    FROM svl_qlog
    WHERE userid IN (600, 601);

In the output below, query ID 853219, run by user ID 601, used the cached result of query ID 123456, which was run by user ID 600. As a result, its elapsed time (in microseconds) dropped drastically.

    userid   query    elapsed   source_query
    ------   ------   -------   ------------
    600      123456   90000     NULL
    600      567890   80000     NULL
    601      853219   30        123456

9. The Redshift data model is similar to a typical data warehouse when it comes to analytical queries. You can create fact tables, dimension tables, and views. It supports the major query constructs, such as inner joins, outer joins, subqueries, and common table expressions (WITH clause).
10. From a storage perspective, a Redshift cluster maintains multiple copies of your data as part of fault tolerance.
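The zone maps described in item 6 above can be sketched in a few lines of Python. This is a simplified model and not Redshift's internal implementation: the `Block` class, the block size of 10 values, and the helper names are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One storage block of a single column, with its min/max summary."""
    values: list

    @property
    def min_value(self):
        return min(self.values)

    @property
    def max_value(self):
        return max(self.values)


def build_blocks(column_values, block_size):
    """Split a column into fixed-size blocks (a stand-in for 1 MB blocks)."""
    return [
        Block(column_values[i:i + block_size])
        for i in range(0, len(column_values), block_size)
    ]


def blocks_to_read(blocks, low, high):
    """Return indexes of blocks whose [min, max] range overlaps [low, high]."""
    return [
        i for i, block in enumerate(blocks)
        if block.max_value >= low and block.min_value <= high
    ]


# A sorted column of 100 values stored in 10 blocks of 10 values each.
sorted_column = list(range(1, 101))
blocks = build_blocks(sorted_column, block_size=10)

# A filter such as "WHERE col BETWEEN 35 AND 42" only needs two blocks.
print(blocks_to_read(blocks, 35, 42))  # [3, 4]
```

The sketch also shows why sorted data helps: when values are sorted, each block covers a narrow range, so most blocks can be skipped without being read.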

Teradata Data Model

1. Teradata is a massively parallel data warehouse with a shared-nothing architecture. However, unlike Redshift, the data is stored in a row-based format.
2. Teradata uses different kinds of indexes for fast data retrieval, including Primary, Secondary, Join, and Hash indexes. Note that a Secondary Index does not affect the distribution of rows across AMPs, although it does add processing overhead.
3. Teradata supports and enforces Primary and Secondary indexes.
4. Teradata has a hybrid storage concept, where frequently used data is stored on SSD while less frequently accessed data is stored on HDD. Teradata has a higher storage capacity than Redshift.
5. Teradata supports table partitioning, unlike Redshift.
6. Teradata uses a hash algorithm to distribute data across its disk storage units.
7. Teradata can scale up to 2048 nodes. Its storage capacity ranges from 10 TB to 94 petabytes, which is higher than Redshift's.
8. Teradata supports the major SQL-related features (Primary Index, Secondary Index, Sequences, Stored Procedures, User Defined Functions, Macros, etc.) that are expected of a data warehouse RDBMS.
9. Teradata's data model is designed to be fault tolerant. It is also designed to be scalable, with redundant network connectivity to ensure continuous data connectivity and availability.
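The difference between the row-based format noted in item 1 and Redshift's columnar format can be illustrated with a small Python sketch. It counts how many field values have to be touched to sum one column under each layout. The table contents and function names are hypothetical, and the count is only a rough proxy for I/O, not a measurement of either product.

```python
def to_columns(rows):
    """Convert a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_row_store(rows, column):
    """Sum one column in a row layout; every field of every row is read."""
    total, fields_touched = 0, 0
    for row in rows:
        fields_touched += len(row)  # the whole row is fetched
        total += row[column]
    return total, fields_touched


def sum_column_store(columns, column):
    """Sum one column in a columnar layout; only that column is read."""
    values = columns[column]
    return sum(values), len(values)


# Hypothetical sales table with three columns.
rows = [
    {"sale_id": 1, "store": "A", "amount": 10},
    {"sale_id": 2, "store": "B", "amount": 20},
    {"sale_id": 3, "store": "A", "amount": 5},
]

print(sum_row_store(rows, "amount"))                 # (35, 9)
print(sum_column_store(to_columns(rows), "amount"))  # (35, 3)
```

Both layouts return the same total, but the row layout touches every field of every row, which is why aggregates over a few columns tend to favor columnar storage.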

Redshift Pros and Cons

Pros

1. Loading and unloading of data is exceptionally fast. You can load data in parallel mode. Even for high data volumes, Redshift supports loading from compressed (zipped) files. AWS recommends loading data with the COPY command for faster performance.
2. You can load data from the NoSQL database service AWS DynamoDB. Refer to the AWS documentation for more detailed information about DynamoDB.
3. You can choose the node type of your cluster (Dense Storage or Dense Compute) depending upon your data needs and business requirements.
4. You can scale your cluster's storage and CPU for better performance at any time without any impact to the cluster.
5. You can migrate your data from various data warehouses into Redshift without much hassle. AWS provides a service for this called Database Migration Service (DMS). Refer to the AWS documentation for more detailed information.
6. You do not have to worry about security, as you can build your cluster inside a VPC and also use SSL encryption for further protection.
7. Redshift's backup and restore feature is simple. Through automatic snapshots, your data is backed up regularly. Snapshots are incremental, so you do not have to worry about missing any changes. You can also copy data to another region if the business needs it. Refer to the AWS documentation for more details on working with snapshots.

8. Redshift has an advanced feature called Redshift Spectrum. Using Redshift Spectrum, you can query huge amounts of data directly from S3, without first loading it through the COPY command or any other method. You can refer to the detailed guide on Redshift Spectrum for more information.
9. Using Sort Keys, data can be pre-sorted on specific columns, which improves query performance on those columns.
10. Using Distribution Keys, data can be distributed evenly across nodes to increase query performance.
11. Redshift provides various pre-built system tables and views to help developers and designers during ETL and other processes.
12. Setup-related commands can be run in various ways, such as the AWS console, the Command Line Interface (CLI), the API, etc.
13. AWS Redshift applies patches and upgrades to the cluster automatically during the maintenance window (a configurable value). Hence you do not have to worry about applying patches.

Cons

1. In Redshift, there is no concept of functions, triggers, or procedures.
2. There is no concept of a sequence column in Redshift. If you need to generate sequence numbers for a column, you have to handle it in your ETL logic.
3. Unlike other common data warehouses, Redshift does not enforce Primary Keys or Foreign Keys, which can create data integrity issues.

4. Only S3, DynamoDB, and EMR support a parallel load into Redshift. To load data from other services, you need to write ETL scripts or use ETL solutions such as Hevo.
5. It requires a good understanding of Sort and Dist keys. There are some basic ground rules for setting them, and keys that are set improperly can hamper performance.
6. A distribution key cannot be changed once the table is created. You need to be extremely careful while designing your tables, because a wrong distribution key could hamper overall performance.
7. In Redshift, there is no concept of DBLink. You cannot directly connect to tables in another database or data warehouse from your queries.
8. In Redshift, VACUUM and ANALYZE are mandatory on key tables. They can hamper performance badly if run during business hours, so they need to be scheduled carefully.
9. A Redshift cluster has limits on the number of nodes, databases, tables, etc. The maximum storage limit is still lower than that of data warehouses like Teradata. Here is the node limitation list:

    Node Type     vCPU   Storage per Node    Node Range
    -----------   ----   -----------------   ----------
    dc1.large     2      160 GB SSD          1-32
    dc1.8xlarge   32     2.56 TB SSD         2-128
    dc2.large     2      160 GB NVMe-SSD     1-32
    dc2.8xlarge   32     2.56 TB NVMe-SSD    2-128
    ds2.xlarge    4      2 TB HDD            1-32
    ds2.8xlarge   36     16 TB HDD           2-128

You can refer to the AWS documentation to know more about the limits in Amazon Redshift.

10. Although Redshift in classic mode is still in use, its cluster performance is relatively modest.
11. Redshift still supports only a single-AZ environment and does not support a multi-AZ environment.
12. Redshift has a query concurrency limit of 15, and you can have a maximum of 8 queues in a cluster. If your queues are not managed well, performance suffers.
13. Your design should make sure that the cluster is not in use during the maintenance window, or else jobs will fail.
14. There is no concept of table partitioning in Redshift.
15. Redshift has no concept of SET and MULTISET tables (SET tables are tables that do not allow duplicate rows). Duplicates need to be handled programmatically, or they could lead to reporting errors.

You can refer to Hevo's blog, which covers the Pros and Cons of Amazon Redshift in complete detail.
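Item 15 above says that SET-table behavior has to be handled programmatically. One way an ETL step might do this is sketched below in Python: it drops exact duplicate rows before loading. The `dedupe_rows` helper and the sample rows are hypothetical, and the sketch assumes every column value is hashable.

```python
def dedupe_rows(rows):
    """Drop exact duplicate rows, keeping the first occurrence in order.

    This mimics what a SET table guarantees: no two identical rows.
    """
    seen = set()
    unique = []
    for row in rows:
        # Column names are unique within a row, so sorting the items
        # by name gives a stable, hashable fingerprint of the row.
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)
    return unique


# Hypothetical batch extracted from a source system.
batch = [
    {"store_id": 1, "store_name": "North"},
    {"store_id": 2, "store_name": "South"},
    {"store_id": 1, "store_name": "North"},  # exact duplicate
]

print(dedupe_rows(batch))
```

This only removes duplicates within one batch. Rows that already exist in the target table would need a separate check, for example by loading into a staging table first.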

Teradata Pros and Cons

Pros

1. Teradata is a massively parallel data warehouse with a shared-nothing architecture.
2. Teradata provides pre-built utilities such as FastLoad, MultiLoad, TPT, BTEQ, etc.
3. Teradata is linearly scalable. If data volume rises, AMPs or Nodes can be added.
4. Teradata has a fallback feature. If one AMP is down, another AMP takes over for data retrieval.
5. Teradata provides an impressive tool called Teradata Visual Explain. It shows the execution plan of queries graphically, which helps developers and designers fine-tune their queries.
6. Teradata provides the Ferret utility to set and display storage space utilization.

Cons

1. One of the biggest cons of Teradata is that it is not cloud-based unless it is scaled to run over the cloud. It requires some initial setup, or you need to integrate with a cloud service provider such as AWS or Azure.
2. It is not a columnar data warehouse.
3. Since Teradata is not a columnar DB, it reads the entire row even if you query a single column. You may end up with performance issues unless your data warehouse is properly designed.
4. A query that runs on a set of different columns over a large dataset could lead to performance issues unless it runs on indexed columns.
5. Teradata supports a maximum of 128 joins in a single query. If you want to perform more joins, you need to break the query into chunks and handle them accordingly.
6. Redshift outperforms Teradata in analytical performance and in the visualization of storage and CPU utilization. In Redshift, everything can be viewed in a single AWS console or through CloudWatch monitoring. Teradata, on the other hand, provides separate visual tools, and for some checks you need to run commands in the Teradata client.
7. Teradata has no default column compression mechanism. Column compression needs to be set up manually, and you can compress up to 256 unique values per column.
8. There are a lot of limitations on the number of columns, table values, and table name length in Teradata. You can refer to the Teradata documentation for more detailed information.

Features supported only by Redshift, not Teradata

1. The most valuable feature of Redshift is that it is cloud-based and fully managed. Teradata does, however, offer Teradata Database Developer (Single Node), a full-featured data warehouse software.
2. There is no need to worry about backup and restore, since manual snapshots and restores can be done in addition to the automatic ones.
3. Backed-up data (snapshots) is automatically stored in S3. There is no need to worry about storing data on tape or in any outside system.
4. Redshift has an excellent feature for loading data through the COPY command in parallel mode, where all nodes/slices participate together to make the load faster.
5. Redshift performs automatic column-level compression, and it suggests compression mechanisms for all table columns (the command is ANALYZE COMPRESSION).
6. Due to the VPC feature in AWS, Redshift security is tight and well controlled.
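The parallel COPY described in item 4 only helps when there is enough input for every slice to work on. The Python sketch below is a deliberately simplified model of that idea: it assigns input files to slices round-robin and reports how many slices stay idle. Redshift's actual scheduling is internal to the service, so the round-robin assignment, the function names, and the file names are all assumptions made for illustration.

```python
def assign_files_to_slices(files, num_slices):
    """Spread input files over slices round-robin (a simplified model)."""
    slices = [[] for _ in range(num_slices)]
    for i, name in enumerate(files):
        slices[i % num_slices].append(name)
    return slices


def load_profile(files, num_slices):
    """Return (files on the busiest slice, number of idle slices)."""
    slices = assign_files_to_slices(files, num_slices)
    busiest = max(len(s) for s in slices)
    idle = sum(1 for s in slices if not s)
    return busiest, idle


# Hypothetical cluster with 4 slices in total.
one_big_file = ["sales.csv.gz"]
split_files = [f"sales_part_{i:02d}.csv.gz" for i in range(8)]

print(load_profile(one_big_file, 4))  # (1, 3): three slices sit idle
print(load_profile(split_files, 4))   # (2, 0): every slice gets equal work
```

Under this model, a single large file keeps one slice busy while the rest wait, whereas splitting the input into a multiple of the slice count gives each slice an equal share.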

Features supported only by Teradata, not Redshift

1. Teradata supports various features including Procedures, Triggers, etc.
2. Teradata has a column sequencing feature while Redshift doesn't.
3. Teradata provides various load and unload utilities, such as TPT, FastLoad, FastExport, MultiLoad, TPump, and BTEQ. You can choose among them depending upon data volume and business logic, and leverage them in your ETL logic.
4. Teradata has a few visual utilities that Redshift lacks, such as Teradata Visual Explain. In Redshift, you need to run a query to view the Explain plan.
5. Teradata supports MULTISET and SET tables while Redshift doesn't.
6. Teradata supports Macros but Redshift doesn't. Macros are a set of predefined SQL statements stored in the database. Macros also reduce LAN traffic.

Example:

    CREATE MACRO Get_Sales AS (
        SELECT SalesId, StoreId, StoreName, StoreAddress
        FROM Stores
        ORDER BY StoreId;
    );

    EXEC Get_Sales;

Executing this macro retrieves all rows from the Stores table.
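Item 2 above notes that column sequencing is not available in Redshift, which is why this ebook suggests generating sequence numbers in ETL logic. A minimal Python sketch of that approach follows. The `assign_surrogate_keys` helper, the `row_id` column name, and the idea of passing in the current maximum key are assumptions for illustration, and the sketch assumes only one load runs at a time.

```python
from itertools import count


def assign_surrogate_keys(rows, start_after=0, key_name="row_id"):
    """Return copies of rows with a sequential key added.

    start_after is the highest key already present in the target table
    (hypothetically looked up before the load begins).
    """
    next_id = count(start_after + 1)
    return [{key_name: next(next_id), **row} for row in rows]


# Hypothetical batch; the target table already holds keys up to 100.
batch = [{"store_name": "North"}, {"store_name": "South"}]
keyed = assign_surrogate_keys(batch, start_after=100)

print(keyed)
# [{'row_id': 101, 'store_name': 'North'}, {'row_id': 102, 'store_name': 'South'}]
```

If several loads can run concurrently, the key range would have to be reserved centrally, otherwise two batches could hand out the same numbers.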

Redshift Vs Teradata In A Nutshell

Item: Cloud perspective
Redshift: Fully managed Data Warehouse over the cloud.
Teradata: Core Data Warehouse is not over the cloud. Initial setup is required by DBAs/Experts. Teradata can be scaled to run over the cloud (AWS/Azure) with a pay-as-you-go model.

Item: Backup and restore strategy
Redshift: Backups are automatically taken care of through the snapshot feature. Snapshots are stored internally in S3, which is highly durable.
Teradata: Teradata backup and restore can be manual or automated (using BAR), but data is stored in an outside system.

Item: Data Load and Unload
Redshift: Redshift leverages data load through the COPY command and unload through the UNLOAD command. Using the COPY command, data is loaded automatically so that all nodes can participate equally for faster performance.
Teradata: In Teradata, we have separate utilities to handle load/unload. Teradata provides TPT, FastExport, FastLoad, etc. They can be leveraged accordingly for your ETL/ELT.

Item: Table Storage
Redshift: Redshift follows a columnar storage format. If the query hits a specific set of columns or only a single column, it provides impressive performance. Hence, aggregates are very fast in Redshift, as it leverages column-level access.
Teradata: Teradata follows row-level storage. Teradata requires proper indexing on columns so that data can be stored properly in AMPs. If indexes are not proper, or a table is queried on a non-indexed column, it could cause performance issues.

Item: Internal Storage
Redshift: In Redshift, data is stored in 1 MB blocks for each column. Each block follows zone mapping. Using zone mapping, Redshift stores the minimum and maximum value of the column for each block.
Teradata: In Teradata, data storage is managed by AMPs under vDisks. Data is distributed based on a hash algorithm (i.e. based on the index defined, etc.) and is retrieved accordingly.

Item: Referential Integrity Model
Redshift: Redshift tables do have Primary Keys and Foreign Keys, but they are not enforced. You need to apply your own logic so that the referential integrity model is applied on Redshift tables.
Teradata: Teradata tables have Primary Keys and Foreign Keys, and they are enforced. Hence, it has the additional overhead of doing reference checks while processing.

Item: Sequence Support
Redshift: There is no concept of column sequencing. If you want to create a sequence on any column, you need to handle it programmatically.
Teradata: You can define a Sequence on a column.

Item: Triggers, Stored Procedures
Redshift: In Redshift, there is no concept of Triggers or Stored Procedures.
Teradata: You can create Triggers or Stored Procedures in Teradata.

Item: Visual Features
Redshift: Redshift is a part of AWS, an integrated service. Entire Redshift performance can be monitored through the AWS console, CloudWatch, and automatic alerts.
Teradata: It has a few visual tools like Teradata Visual Explain, but they are cluttered.

Item: Max Concurrency
Redshift: Maximum 15 concurrent queries. By default its concurrency is 5.
Teradata: Runs more than 15 concurrent queries.

Item: Macros Support
Redshift: No concept of Macros.
Teradata: Supports Macros.

Item: NoSQL to Redshift Feature
Redshift: Redshift cannot load NoSQL data from other vendors, but it can load data from DynamoDB.
Teradata: No such feature supported yet.

Item: Maximum Storage Capacity
Redshift: 2 PB (16 TB * 128 ds2.8xlarge nodes = 2 PB).
Teradata: Storage capacity of much more than 2 PB of data.

Item: Column Compression
Redshift: In Redshift, when a table is created, default compression is automatically applied to all columns. It also provides a command called ANALYZE COMPRESSION to help with column compression.
Teradata: In Teradata, you need to specify column compression on individual columns. You can compress up to 128 unique values per column in a table.

Item: Maximum Columns Per Table
Redshift: Maximum 1600 columns per table.
Teradata: Maximum 258 columns per row.

Item: Maximum Joins
Redshift: No limit as such.
Teradata: 64 joins per query block.

Item: Data Warehouse Maintenance/Updates
Redshift: Redshift applies regular patches and does automatic maintenance inside the maintenance window.
Teradata: In Teradata, DBAs need to take care of all these activities manually or through some tool.

Item: Table Indexes
Redshift: It does not have a table index concept, but its performance is unaffected due to the zone mapping and sort key features.
Teradata: Teradata provides various types of index, i.e. Primary Index, Secondary Index, etc.

Item: Table partitioning
Redshift: Redshift Spectrum has it, but Redshift doesn't.
Teradata: Tables can be partitioned.

Item: Fault Tolerance
Redshift: Redshift is fault tolerant. If there is a node failure, Redshift automatically replaces the failed node with a replacement node. However, multi-AZ is not supported in Redshift.
Teradata: Teradata is also fault tolerant. If an AMP fails, the fallback AMP takes over automatically.
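The 2 PB figure in the Maximum Storage Capacity row follows directly from the node limitation list given earlier in this ebook. The short Python sketch below redoes that arithmetic for every node type. The dictionary simply restates the ebook's table, decimal units (1 TB = 1000 GB) are an assumption, and current AWS limits may differ.

```python
# (storage per node in GB, maximum node count), taken from the node
# limitation list in this ebook; current AWS limits may differ.
NODE_TYPES = {
    "dc1.large": (160, 32),
    "dc1.8xlarge": (2560, 128),
    "dc2.large": (160, 32),
    "dc2.8xlarge": (2560, 128),
    "ds2.xlarge": (2000, 32),
    "ds2.8xlarge": (16000, 128),
}


def max_cluster_storage_tb(node_type):
    """Maximum raw cluster storage in TB for a node type (1 TB = 1000 GB)."""
    storage_gb, max_nodes = NODE_TYPES[node_type]
    return storage_gb * max_nodes / 1000


for name in NODE_TYPES:
    print(f"{name}: {max_cluster_storage_tb(name)} TB")

# ds2.8xlarge gives 16 TB * 128 nodes = 2048 TB, i.e. roughly 2 PB.
```

The largest configuration, 128 ds2.8xlarge nodes, is the one that yields the roughly 2 PB ceiling quoted in the comparison.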

Pricing and Effort Comparison

Redshift leads Teradata in effort and in-house pricing: it is cheaper and easier to run than Teradata. For Redshift, you only need to turn on the cluster, set up the security settings and a few other options (maintenance window, snapshot settings, etc.), and you are ready to go. This reduces DBA effort.

In terms of storage, however, Teradata has the upper hand because a Redshift cluster has limits. In Redshift, this can still be handled through S3, which has no space limitation.

Remember, Teradata and Redshift are data warehouses designed to serve different purposes.

You can refer to the Redshift and Teradata pricing pages to know about pricing.

When and How to Migrate data from Teradata to Redshift

There are various considerations to weigh when deciding whether to migrate from Teradata to AWS/cloud.

1) How stable is your Teradata Warehouse?
2) How much is your Teradata data volume?
3) How complex is your Teradata data model?
4) How much is your current Teradata data latency?
5) How good is your Teradata RDBMS performance?
6) How many BI tools are you using on your Teradata tables/views/cubes?
7) Are you relying on many Teradata features that Redshift does not support?
8) Will migrating your data warehouse from Teradata to Redshift break your system?
9) What is your budget for maintaining Redshift and other key AWS services post-migration?

If all conditions are satisfied, you can migrate your data from Teradata to Redshift with relative ease. AWS provides the Database Migration Service (DMS) and the Schema Conversion Tool (SCT) for this. These handy tools are not fully automated, though, and some minor manual effort is required.

Please refer to the AWS documentation for migrating data from Teradata to Redshift.

Summary

Choosing between Redshift and Teradata is a tough question to answer, as the two serve different purposes. Redshift performs analytics and reporting extremely well. Since Redshift is a columnar data warehouse, its performance is really good for queries that hit specific columns of a table/view and for aggregate functions (sum, avg, count(*), etc.). As Redshift is an AWS service, it is integrated with all vital AWS services. Hence you don't need to store millions of records in Redshift alone, as you can archive old data in S3. If required, you can leverage Redshift Spectrum to build your analytics and reports on top of that archived data. Stored procedures can be handled through the AWS Lambda service. In terms of age, Redshift is a comparatively newer data warehouse, and it is still developing features that other key data warehouses offer.

On the other hand, Teradata is old and mature. Teradata as an RDBMS may not provide performance similar to Redshift unless it has a properly designed data model, its features (FastLoad, MultiLoad, TPT, BTEQ, etc.) are fully leveraged, and its tables/views are properly tuned. Some established customers might be reluctant to migrate from Teradata to Redshift. They can also look at a hybrid model.

In conclusion, this is still an ongoing debate: both Redshift and Teradata have their pros and cons.

ETL Challenges While Working With Amazon Redshift

Data loading is one of the biggest challenges of Redshift. To perform ETL to Redshift, you would need to invest precious engineering resources to extract, clean, and enrich data and to build data pipelines. Writing complex scripts to automate all of this is not easy, and it gets harder if you want to stream your data in real time. Data loss becomes an everyday phenomenon due to issues that crop up with changing sources, unstructured and unclean data, incorrect data mapping at the warehouse, and more.

Using a data integration platform like Hevo can solve your Redshift ETL problems. With Hevo you can move data into Redshift in minutes in a hassle-free fashion. Hevo integrates with a variety of data sources, ranging from SQL and NoSQL databases to SaaS applications, file storage, Webhooks, etc., with the click of a button.

Sign up for a free trial here or view a quick video on how Hevo can help.

About Author:

Ankur Shrivastava is an AWS Solution Designer with hands-on experience in Data Warehousing, ETL, and Data Analytics. He is an AWS Certified Solution Architect Associate. In his free time, he enjoys all outdoor sports and practices.

