Advanced research computing
high performance computing and storage needs that are too complex to be handled by a standard desktop workstation.
Algorithm
a set of step-by-step instructions, implemented in software, for performing a computation or analysis on a dataset.
Analytics platform
software and/or hardware that provides the tools and computational power needed to build and run various analytical queries.
Artificial Intelligence
an area of computer science concerned with the development of computers that are able to engage in human-like thought processes.
Avro
a language-neutral data serialization system. It is quite helpful, as it handles data formats that can be processed by multiple languages, and it is the preferred tool for serializing data in Hadoop.
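For illustration, a minimal sketch of Avro serialization in Python, assuming the third-party fastavro package is installed (pip install fastavro); the record schema here is hypothetical:

```python
import io
import fastavro

# An Avro schema is itself defined in JSON.
schema = fastavro.parse_schema({
    "name": "User",
    "type": "record",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
})

records = [{"name": "Ada", "age": 36}, {"name": "Grace", "age": 45}]

# Serialize to a compact binary, language-neutral container format...
buffer = io.BytesIO()
fastavro.writer(buffer, schema, records)

# ...and read it back; any Avro-capable language could do the same.
buffer.seek(0)
for record in fastavro.reader(buffer):
    print(record)
```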
Behavioral analytics
using data about people’s behavior to understand how and why people behave the way they do.
Big Data
a term often used to describe the massive amounts of generated structured and unstructured data that is too large, complex and varied for traditional processing software.
Business intelligence
an umbrella term that includes the applications, infrastructure and tools, and best practices that enable access to and analysis of information to improve and optimize decisions and performance.
Clickstream Analytics
analysis of users’ web activity based on the items they click on a page.
Cloud computing
data and computing resources available and accessed on-demand over the internet instead of local computers and devices.
Clustering analysis
a multivariate method which aims to classify a sample of subjects (or objects) on the basis of a set of measured variables into a number of different groups such that similar subjects are placed in the same group.
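As a sketch, k-means clustering groups subjects by their measured variables; this assumes scikit-learn is installed (pip install scikit-learn) and uses a toy dataset:

```python
from sklearn.cluster import KMeans

# Toy dataset: two measured variables per subject.
X = [[1.0, 2.0], [1.2, 1.9], [0.8, 2.1],   # one natural group
     [8.0, 8.5], [8.2, 7.9], [7.9, 8.1]]   # another natural group

# Ask for two clusters; similar subjects receive the same label.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(model.labels_)  # e.g. [0 0 0 1 1 1]
```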
Columnar database/column-oriented database
a database which stores data in columns, rather than rows. The big advantage of this type of database is faster hard disk access.
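A plain-Python sketch of the layout difference (the field names are hypothetical); scanning a single column touches far less data in the columnar layout, which is the source of the speed advantage:

```python
row_store = [
    {"id": 1, "name": "Ada",   "sales": 120},
    {"id": 2, "name": "Grace", "sales": 340},
]

column_store = {
    "id":    [1, 2],
    "name":  ["Ada", "Grace"],
    "sales": [120, 340],
}

# Column sum: the columnar layout reads only the "sales" values,
# while the row layout must walk every full record.
print(sum(column_store["sales"]))          # 460
print(sum(r["sales"] for r in row_store))  # 460
```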
Comparative analysis
a step-by-step form of analysis that compares and calculates across data points, designed to detect patterns within very large data sets.
Cross-channel analytics
analysis that attributes sales across marketing channels, showing metrics such as average order value and customer lifetime value.
Data Aggregation
the process of compiling information from different data sources to prepare for data processing and analysis.
Data Analyst
the main responsibilities of a data analyst are to collect, manipulate and analyze data. Based on this data, they prepare reports, which may include charts, graphs, dashboards and other visualizations.
Data architecture & design
defines how data is stored, managed, and used in a system, establishing common guidelines for data operations that enable organizations to model, gauge, predict, and control the flow of data in the system.
Data cleansing
conducting a review and revision of data to correct misspellings, remove duplicate entries, provide more consistency and add missing data.
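A minimal cleansing sketch in plain Python covering the three tasks above; the field names and defaults are hypothetical:

```python
raw = [
    {"city": "new york", "country": "USA"},
    {"city": "New York", "country": "USA"},   # duplicate after cleanup
    {"city": "London",   "country": None},    # missing value
]

cleaned, seen = [], set()
for row in raw:
    row = {
        "city": row["city"].strip().title(),     # consistent spelling
        "country": row["country"] or "Unknown",  # fill missing data
    }
    key = (row["city"], row["country"])
    if key not in seen:                          # drop duplicate entries
        seen.add(key)
        cleaned.append(row)

print(cleaned)
```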
Data Engineering
the process of designing and building systems for storing, processing and consuming data.
Data Feed
a continuous stream of structured data providing users with updates of current information from multiple sources.
Data Governance
a set of processes or rules that ensure the integrity of the data and that data management best practices are met.
Data Integration
the process of combining data from different sources and presenting it in a single view.
Data Mart
a subdivision of the data warehouse, usually geared towards a specific business team or line.
Data Mining
an analytical process where large data sets are explored or “mined” to search for meaningful patterns, relationships and insights.
Data Modelling
structuring data as a means of communication between technical and business users, showing the data needed for business processes.
Data Profiling
the process of collecting statistics and information about data in an existing source.
Data Quality
measuring data to determine its worthiness for planning, operations and decision making.
Data Science
a discipline that incorporates statistics, data visualization, data mining, machine learning and database engineering to solve complex problems.
Data Visualization
a visual representation of data designed to derive meaning or facilitate the communication of information.
Data Warehouse
a place to store data for the purpose of reporting and analysis.
Data-driven decision making
using data to support the decision-making process.
Database
a digital collection of data and the structure around which the data is organized. The data is usually stored in and accessed through a database management system (DBMS).
Dataset
a collection of data, usually in tabular form.
Document Oriented database
a computer program designed for retrieving, managing and storing document-oriented information, sometimes referred to as semi-structured data.
Exploratory analysis
finding patterns within data without standard procedures or methods; a means of exploring the data and discovering a data set’s main characteristics.
Extract, transform, and load (ETL)
a process used in data warehousing to prepare data for use in reporting or analytics.
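A toy ETL sketch using only the Python standard library; the file name, table and fields are hypothetical:

```python
import csv
import sqlite3

# Extract: read raw records from a CSV source.
with open("orders.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: cast types and derive a field useful for reporting.
for r in rows:
    r["amount"] = float(r["amount"])
    r["is_large"] = r["amount"] > 100

# Load: write the prepared records into a reporting table.
db = sqlite3.connect("warehouse.db")
db.execute("CREATE TABLE IF NOT EXISTS orders (id TEXT, amount REAL, is_large INTEGER)")
db.executemany("INSERT INTO orders VALUES (:id, :amount, :is_large)", rows)
db.commit()
```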
Graph Databases
databases employing graph structures (nodes, edges and properties) for data storage, thus providing index-free adjacency, i.e. every element is directly linked to its neighbour elements.
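A plain-Python sketch of index-free adjacency: each node holds direct references to its neighbours, so traversal follows pointers rather than consulting a central index:

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.edges = []            # direct links to neighbour nodes

    def link(self, other):
        self.edges.append(other)

alice, bob, carol = Node("Alice"), Node("Bob"), Node("Carol")
alice.link(bob)
bob.link(carol)

# Traverse by hopping from node to node.
print([n.name for n in alice.edges])           # ['Bob']
print([n.name for n in alice.edges[0].edges])  # ['Carol']
```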
Hadoop
a framework that allows distributed processing of large datasets across clusters of computers using a simple programming model.
HDFS (Hadoop Distributed File System)
created to reliably store enormous data sets and stream them to user applications at high bandwidth. In a large cluster, thousands of servers both host directly attached storage and execute user application tasks.
In-memory database
any database system that relies primarily on main memory (RAM) for data storage.
JSON
an open-standard file format. It uses human-readable texts to transmit data objects which consist of array data types and attribute–value pairs (or any other serializable value).
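For illustration, the Python standard library’s json module round-trips attribute–value pairs and arrays through this format:

```python
import json

record = {"name": "Ada", "skills": ["math", "computing"], "active": True}

text = json.dumps(record)   # Python object -> human-readable JSON text
print(text)                 # {"name": "Ada", "skills": ["math", "computing"], "active": true}
print(json.loads(text))     # JSON text -> Python object
```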
Key Value Store
a database that stores data as a collection of key–value pairs, in which a unique key identifies each value; lookups go directly by key rather than through a fixed schema.
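A toy key–value store in plain Python, using a dict to stand in for the storage engine; the keys shown are hypothetical:

```python
store = {}

store["user:1"] = {"name": "Ada"}    # put: the key addresses the value
store["user:2"] = {"name": "Grace"}

print(store.get("user:1"))           # get -> {'name': 'Ada'}
del store["user:2"]                  # delete
print(store.get("user:2"))           # missing key -> None
```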
Latency
any delay in a response or delivery of data from one point to another.
Machine learning
the use of algorithms to allow a computer to analyze data for the purpose of “learning” which action to perform when specific events or patterns occur.
MapReduce
a programming model for breaking a problem into pieces that are disseminated across many computers within the same network or cluster: a map step processes the pieces in parallel, and a reduce step combines the partial results.
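A word-count sketch of the model in plain Python; real frameworks run these same steps distributed across a cluster:

```python
from collections import defaultdict

documents = ["big data", "big clusters", "data clusters"]

# Map: emit a (word, 1) pair for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the emitted pairs by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: combine each group into a final count.
counts = {word: sum(vals) for word, vals in groups.items()}
print(counts)  # {'big': 2, 'data': 2, 'clusters': 2}
```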
Metadata
data about data; gives information about what the data is about.
MPP database
a massively parallel processing database, optimized so that multiple operations are performed simultaneously by multiple processing units.
Natural Language Processing
a branch of AI, which deals with making human language (in both spoken and written forms) understandable to computers.
NoSQL
a type of database management system which does not employ the relational model. It was designed to handle large volumes of data that do not follow a fixed schema.
Object Databases
databases which store data in the form of objects, as used by object-oriented programming. They differ from graph or relational databases and most of them offer a query language that allows objects to be found with a declarative programming approach.
Online analytical processing (OLAP)
multidimensional data analysis using three operations: consolidation (the aggregation of available data), drill-down (the ability for users to see the underlying details), and slice and dice (the ability to select subsets and view them from different perspectives).
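A toy sketch of the three operations over a tiny cube of facts in plain Python; the dimensions and values are hypothetical:

```python
facts = [
    {"region": "East", "product": "A", "sales": 10},
    {"region": "East", "product": "B", "sales": 20},
    {"region": "West", "product": "A", "sales": 30},
]

# Consolidation: aggregate sales up to the region level.
totals = {}
for f in facts:
    totals[f["region"]] = totals.get(f["region"], 0) + f["sales"]
print(totals)                                    # {'East': 30, 'West': 30}

# Drill-down: see the detail rows behind the East total.
print([f for f in facts if f["region"] == "East"])

# Slice and dice: select a subset along another dimension.
print([f for f in facts if f["product"] == "A"])
```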
Online transactional processing (OLTP)
the process of providing users with access to large amounts of transactional data in a way that they can derive meaning from it.
Operational data store (ODS)
a location to gather and store data from multiple sources so that more operations can be performed on it before it is sent to the data warehouse for reporting.
Parquet
a columnar storage format open to all projects in the Hadoop ecosystem, regardless of data model, the choice of data processing framework or programming language.
Predictive analytics
the use of statistical functions on one or more datasets to predict future trends and events.
Query
asking for information to answer a certain question.
Real-time Analytics
the analysis of data as it is created, so that it is stored, processed, analyzed and visualized within milliseconds.
Recommendation Engine
an algorithm on an ecommerce site that analyzes customers’ purchases and actions, then uses that data to recommend complementary products.
Reference data
Data that describes an object and its properties. The object may be physical or virtual.
Schema
the structure that defines the organization of data in a database system.
Semi-Structured data
data that is not structured by a formal data model, but provides other means of describing the data and hierarchies.
Sentiment analysis
the application of statistical functions on comments people make on the web and through social networks to determine how they feel about a product or company.
Sharding
a method for distributing data across multiple machines.
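A minimal hash-sharding sketch in plain Python: a key’s hash decides which machine (shard) holds the record, spreading data evenly across the cluster:

```python
import hashlib

NUM_SHARDS = 4

def shard_for(key: str) -> int:
    # A stable hash (unlike the built-in hash()) so every process agrees.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

for user_id in ["user:1", "user:2", "user:3"]:
    print(user_id, "-> shard", shard_for(user_id))
```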
SQL
Structured Query Language, a programming language for retrieving and managing data in a relational database.
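For illustration, a minimal query using Python’s built-in sqlite3 module; the table and rows are hypothetical:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (name TEXT, age INTEGER)")
db.executemany("INSERT INTO users VALUES (?, ?)",
               [("Ada", 36), ("Grace", 45)])

# SQL declares *what* to retrieve; the database decides how.
for row in db.execute("SELECT name FROM users WHERE age > 40"):
    print(row)  # ('Grace',)
```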
Structured data
any data organized into structured fields such as a spreadsheet or database.
Text analytics
the application of statistical, linguistic, and machine learning techniques on text-based sources to derive meaning or insight.
The V’s
Big data – and the business challenges and opportunities associated with it – are often discussed or described in the context of multiple V’s.
Unstructured data
information that either does not have a pre-defined data model or is not organized into a pre-defined manner and is therefore not stored in a structured field.
Value
this is the most important “V” from the business perspective. Big Data’s value usually comes from pattern recognition and insight discovery which leads to stronger customer relations, more effective operations and other clear and quantifiable business benefits.
Variability
data, by its nature, is always changing, and companies must work constantly to capture, analyze and manage it. In text or sentiment analytics, for example, the meaning of phrases and keywords can shift over time.
Variety
the range and diversity of various data types, including semi-structured data, unstructured data, and raw data.
Velocity
the speed at which companies receive, store and manage data – e.g., the specific number of social media posts or search queries received within a day, hour or other unit of time.
Veracity
the accuracy and trustworthiness of information and data assets, which usually determines confidence at the executive level.
Volume
the size and amounts of big data that companies manage and analyze.