Data Science


Extract, Transform, and Load (ETL)

Many data warehousing tools are available on the market. The following is a curated list of the best open-source and commercial ETL tools, with key features and download links.


QuerySurge

QuerySurge is an ETL testing solution developed by RTTS. It is built specifically to automate the testing of data warehouses and big data. It verifies that data extracted from source systems remains intact in the target systems.

Features:

  • Improves data quality and data governance
  • Accelerates data delivery cycles
  • Automates manual testing effort
  • Provides testing across different platforms such as Oracle, Teradata, IBM, Amazon, Cloudera, etc.
  • Speeds up the testing process by up to 1,000x and provides up to 100% data coverage
  • Integrates an out-of-the-box DevOps solution for most build, ETL, and QA management software
  • Delivers shareable, automated email reports and data health dashboards

QuerySurge link: https://goo.gl/exoNtY
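
The idea behind ETL testing, checking that data arriving in the target matches what left the source, can be sketched in a few lines. The following is a minimal illustration only, not QuerySurge's API: it uses Python's built-in sqlite3 module with two in-memory databases standing in for a source and a target system, and the table name, column names, and helper functions (rows_by_key, compare_tables) are all hypothetical.

```python
import sqlite3


def rows_by_key(conn, table, key):
    """Return a {key value: full row} mapping for every row in `table`."""
    cur = conn.execute(f"SELECT * FROM {table}")
    columns = [d[0] for d in cur.description]
    key_index = columns.index(key)
    return {row[key_index]: row for row in cur}


def compare_tables(src_conn, tgt_conn, table, key):
    """Report rows that are missing, unexpected, or different in the target."""
    src = rows_by_key(src_conn, table, key)
    tgt = rows_by_key(tgt_conn, table, key)
    shared = src.keys() & tgt.keys()
    return {
        "missing_in_target": sorted(src.keys() - tgt.keys()),
        "unexpected_in_target": sorted(tgt.keys() - src.keys()),
        "mismatched": sorted(k for k in shared if src[k] != tgt[k]),
    }


# Hypothetical sample data: the target lost row 3 and corrupted row 2.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
for conn in (source, target):
    conn.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, balance REAL)"
    )
source.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ada", 10.0), (2, "Grace", 20.5), (3, "Linus", 7.25)],
)
target.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ada", 10.0), (2, "Grace", 99.0)],
)

report = compare_tables(source, target, "customers", "id")
print(report)
```

A row-by-row comparison like this only works when both tables fit in memory; for large tables, a real tool would more likely compare counts, checksums, or query results in batches.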

MarkLogic

MarkLogic is a data warehousing solution that makes data integration easier and faster using an array of enterprise features. This tool helps to perform very complex search operations. It can query data including documents, relationships, and metadata.

Features:

  • The Optic API can perform joins and aggregates over documents, triples, and rows.
  • Allows specifying complex security rules for individual elements within documents
  • Writing, reading, patching, and deleting documents in JSON, XML, text, or binary formats
  • Database Replication for Disaster Recovery
  • Specify Output Options on the App Server Configuration
  • Importing and Exporting Configuration Information

MarkLogic download link: https://developer.marklogic.com/products

Oracle - Oracle Data Integrator

Oracle data warehouse software manages a collection of data that is treated as a unit. The purpose of this database is to store and retrieve related information. It helps the server reliably manage large amounts of data so that multiple users can access the same data.

Features:

  • Distributes data in the same way across disks to offer uniform performance
  • Works for single-instance and real application clusters
  • Offers real application testing
  • Common architecture between any Private Cloud and Oracle's public cloud
  • High-speed connection for moving large volumes of data
  • Works seamlessly with UNIX/Linux and Windows platforms
  • It provides support for virtualization
  • Allows connecting to the remote database, table, or view

Oracle ODI link: https://www.oracle.com/database/data-warehouse/index.html

Amazon Redshift

Amazon Redshift is a simple, cost-effective, and easy-to-manage data warehouse tool. It can analyze almost every type of data using standard SQL.

Features:

  • No up-front costs for installation
  • It allows automating most of the common administrative tasks to monitor, manage, and scale your data warehouse
  • Possible to change the number or type of nodes
  • Helps to enhance the reliability of the data warehouse cluster
  • Every data center is fully equipped with climate control
  • Continuously monitors the health of the cluster. It automatically re-replicates data from failed drives and replaces nodes when needed

Amazon Redshift link: https://aws.amazon.com/redshift/

Domo

Domo is a cloud-based data warehouse management tool that integrates various types of data sources, including spreadsheets, databases, social media, and almost all cloud-based or on-premise data warehouse solutions.

Features:

  • Helps you build your dream dashboard
  • Stay connected anywhere you go
  • Integrates and connects all of your existing business data
  • Helps you get true insights into your business data
  • Easy communication and messaging platform
  • Provides support for ad-hoc queries using SQL
  • Handles many concurrent users running multiple complex queries

Domo link: https://www.domo.com/product

Teradata

The Teradata Database is a commercially available shared-nothing, Massively Parallel Processing (MPP) data warehousing tool. It is one of the best data warehousing tools for viewing and managing large amounts of data.

Features:

  • Simple and cost-effective solutions
  • Suitable for organizations of any size
  • Quick and insightful analytics
  • Get the same database on multiple deployment options
  • Allows multiple concurrent users to ask complex questions related to data
  • Built entirely on a parallel architecture
  • Offers high performance, diverse queries, and sophisticated workload management

Teradata link: https://downloads.teradata.com/


SAP

SAP is an integrated data management platform that maps all business processes of an organization. It is an enterprise-level application suite for open client/server systems. It has set new standards for providing business information management solutions.

Features:

  • Provides highly flexible and transparent business solutions
  • Applications developed using SAP can integrate with any system
  • Follows a modular concept for easy setup and space utilization
  • You can create a database system that combines analytics and transactions. These next-generation databases can be deployed on any device
  • Provides support for on-premise or cloud deployment
  • Simplified data warehouse architecture
  • Integration with SAP and non-SAP applications

SAP link: https://support.sap.com/en/my-support/software-downloads.html

SAS

SAS is a leading data warehousing tool that allows accessing data across multiple sources. It can perform sophisticated analyses and deliver information across the organization.

Features:

  • Activities are managed from central locations, so users can access applications remotely via the Internet
  • Application delivery is typically closer to a one-to-many model than a one-to-one model
  • Centralized feature updating lets users download patches and upgrades
  • Allows viewing raw data files in external databases
  • Manage data using tools for data entry, formatting, and conversion
  • Display data using reports and statistical graphics

SAS link: https://www.sas.com/en_in/home.html

IBM – DataStage

IBM DataStage is a business intelligence tool for integrating trusted data across various enterprise systems. It leverages a high-performance parallel framework, either in the cloud or on-premise. This data warehousing tool supports extended metadata management and universal business connectivity.

Features:

  • Support for Big Data and Hadoop
  • Additional storage or services can be accessed without the need to install new software and hardware
  • Real-time data integration
  • Provides trusted ETL data anytime, anywhere
  • Solve complex big data challenges
  • Optimize hardware utilization and prioritize mission-critical tasks
  • Deploy on-premises or in the cloud

IBM – DataStage link: http://www-01.ibm.com/support/docview.wss?uid=swg24037518

Informatica

Informatica PowerCenter is a data integration tool developed by Informatica Corporation. The tool offers the capability to connect to and fetch data from different sources.

Features:

  • It has a centralized error logging system which facilitates logging errors and rejecting data into relational tables
  • Built-in intelligence to improve performance
  • Limit the Session Log
  • Ability to Scale up Data Integration
  • Foundation for Data Architecture Modernization
  • Better designs with enforced best practices on code development
  • Code integration with external Software Configuration tools
  • Synchronization amongst geographically distributed team members

Informatica link: https://informatica.com/

MS SSIS

SQL Server Integration Services (SSIS) is a data warehousing tool that is used to perform ETL operations, i.e., to extract, transform, and load data. SQL Server Integration Services also includes a rich set of built-in tasks.

Features:

  • Tightly integrated with Microsoft Visual Studio and SQL Server
  • Easier maintenance and package configuration
  • Allows removing the network as a bottleneck for data insertion
  • Handles data from different data sources in the same package
  • Consumes data from difficult sources such as FTP, HTTP, MSMQ, and Analysis Services
  • Data can be loaded in parallel to many varied destinations

Microsoft SQL Server Integration Services link: https://www.microsoft.com/en-us/download/details.aspx?id=39931
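
The three ETL stages named above (extract, transform, load) can be illustrated with a small, tool-independent sketch. This is not how SSIS packages are built; it is a minimal Python example using only the standard library, and the sample CSV data, the orders table, and the function names are assumptions made for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical raw input: untidy names and one unparseable amount.
RAW_CSV = """order_id,customer,amount
1, alice ,19.99
2,BOB,5.00
3,carol,not-a-number
"""


def extract(text):
    """Extract: parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize names, convert types, and set aside bad rows."""
    clean, rejected = [], []
    for row in rows:
        try:
            clean.append(
                (
                    int(row["order_id"]),
                    row["customer"].strip().title(),
                    float(row["amount"]),
                )
            )
        except ValueError:
            rejected.append(row)
    return clean, rejected


def load(conn, rows):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
clean_rows, rejected_rows = transform(extract(RAW_CSV))
load(conn, clean_rows)

loaded = conn.execute(
    "SELECT order_id, customer, amount FROM orders ORDER BY order_id"
).fetchall()
print(loaded)
print("rejected:", rejected_rows)
```

Keeping rejected rows separate rather than silently dropping them makes it possible to log or reprocess them later.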

Talend Open Studio

Open Studio is an open-source data warehousing tool developed by Talend. It is designed to convert, combine, and update data in various locations. It provides an intuitive set of tools that make dealing with data a lot easier. It also supports big data integration, data quality, and master data management.

Features:

  • It supports extensive data integration transformations and complex process workflows
  • Offers seamless connectivity for more than 900 different databases, files, and applications
  • It can manage the design, creation, testing, and deployment of integration processes
  • Synchronize metadata across database platforms
  • Managing and monitoring tools to deploy and supervise the jobs

Talend Open Studio link: [1]


The Ab Initio software

Ab Initio is a GUI-based parallel processing data warehousing tool for data analysis and batch processing. It is commonly used to extract, transform, and load data.

Features:

  • Metadata management
  • Business and process metadata management
  • Ability to run and debug Ab Initio jobs and trace execution logs
  • Manage and run graphs and control the ETL processes
  • Components can execute simultaneously on various branches of a graph

The Ab Initio software link: https://www.abinitio.com/en/
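
The notion of components executing simultaneously on separate branches of a graph can be shown with a generic sketch. The example below does not use Ab Initio or reflect its interfaces; it is a plain Python illustration with a thread pool, and the component names (read_source, branch_total, branch_max, merge) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def read_source():
    """Upstream component: produce the records both branches consume."""
    return [3, 1, 4, 1, 5, 9, 2, 6]


def branch_total(records):
    """Branch A: aggregate the records into a sum."""
    return sum(records)


def branch_max(records):
    """Branch B: find the largest record."""
    return max(records)


def merge(total, largest):
    """Downstream component: join the outputs of both branches."""
    return {"total": total, "max": largest}


def run_graph():
    records = read_source()
    # The two branches do not depend on each other, so they can be
    # submitted together and run concurrently.
    with ThreadPoolExecutor(max_workers=2) as pool:
        total_future = pool.submit(branch_total, records)
        max_future = pool.submit(branch_max, records)
        # merge() waits for both branches before combining their results.
        return merge(total_future.result(), max_future.result())


result = run_graph()
print(result)
```

Threads are used here only to keep the sketch short; CPU-heavy components in Python would more likely use processes.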

Dundas

Dundas is an enterprise-ready Business Intelligence platform. It is used for building and viewing interactive dashboards, reports, scorecards and more. It is possible to deploy Dundas BI as the central data portal for the organization or integrate it into an existing website as a custom BI solution.

Features:

  • Data warehousing tool for Business Users and IT Professionals
  • Easy access through web browser
  • Allows using sample or Excel data
  • Server application with full product functionality
  • Integrates and accesses all kinds of data sources
  • Ad hoc reporting tools
  • Customizable data visualizations
  • Smart drag and drop tools
  • Visualize data through maps
  • Predictive and advanced data analytics

Dundas link: [2]

Sisense

Sisense is a business intelligence tool that analyzes and visualizes both big and disparate datasets in real time. It is well suited to preparing complex data for dashboards with a wide variety of visualizations.

Features:

  • Unify unrelated data into one centralized place
  • Create a single version of truth with seamless data
  • Allows building interactive dashboards with no technical skills
  • Queries big data at very high speed
  • Dashboards can be accessed on mobile devices
  • Drag-and-drop user interface
  • Eye-grabbing visualization
  • Enables delivery of interactive terabyte-scale analytics
  • Exports data to Excel, CSV, PDF, images, and other formats
  • Ad-hoc analysis of high-volume data
  • Handles data at scale on a single commodity server
  • Identifies critical metrics using filtering and calculations

Sisense link: https://www.sisense.com/get/watch-demo/



Pentaho
