One of the new features in Pentaho Data Integration 8.1 is the ability to directly connect to Google Drive. PDI uses the Virtual File System (VFS) which allows you to connect to a variety of file systems in a transparent way.
On May 16th, 2018, Hitachi Vantara released Pentaho 8.1. Although this is a minor follow-up release to 8.0 as far as version numbers go, a lot of exciting new features and improvements have been added.
Analytics projects are often treated as ad-hoc projects. Code and content are often managed in a version control system (git), but often without full release management. Deployment of infrastructure and releases are often done manually. In this post, we'll take a look at why it makes sense to manage your analytics projects as full-blown software development projects.
Although a lot of the components usually are in place, analytics teams are often reluctant to go the extra mile and automate every aspect of the project life cycle.
A first step is to automate infrastructure deployment, or to apply "Infrastructure as Code" (IaC). According to Wikipedia, "Infrastructure as code (IaC) is the process of managing and provisioning computer data centers through machine-readable definition files, rather than physical hardware configuration or interactive configuration tools".
Briefly put, this means that all your installation and (environment- and release-specific) configuration should be run from code (scripts, templates) and managed as a software development project. Taking this beyond just infrastructure, the benefits of treating "Everything as Code", including deployment, testing, etc., significantly outweigh the downsides.
Let's look at a number of these benefits in more detail.
What size is this?
Suppose you want to predict the length or width of a flower petal.
For this we can look for a relation between the two.
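A minimal sketch of that idea is ordinary least squares: fit a straight line relating one measurement to the other. The petal measurements below are hypothetical and purely illustrative, not taken from a real dataset:

```python
# Fit width = a * length + b by ordinary least squares (pure Python).
# The measurement pairs below are hypothetical, for illustration only.

lengths = [1.4, 1.7, 4.5, 4.7, 5.5, 6.0]
widths  = [0.2, 0.4, 1.5, 1.6, 2.1, 2.5]

n = len(lengths)
mean_x = sum(lengths) / n
mean_y = sum(widths) / n

# Closed-form OLS estimates: a = cov(x, y) / var(x), b = mean_y - a * mean_x
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lengths, widths))
var_x  = sum((x - mean_x) ** 2 for x in lengths)

a = cov_xy / var_x
b = mean_y - a * mean_x

# Predict the width of a new petal that is 5.0 units long
predicted = a * 5.0 + b
```

With the relation fitted, predicting the width for an unseen petal length is a single evaluation of the line.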
What's weird about this?
At certain times you might be faced with unexpected patterns or events appearing in your data. Let's take a look at how we can tackle anomalies by detecting them.
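One simple way to detect such unexpected values is the z-score: flag any point that lies more than a chosen number of standard deviations from the mean. The sensor readings and the threshold of 2.0 below are illustrative assumptions:

```python
import statistics

# Flag points whose z-score exceeds a threshold as anomalies.
# The readings below are hypothetical sensor values, for illustration only.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0, 10.1]

mean = statistics.mean(readings)
stdev = statistics.stdev(readings)

THRESHOLD = 2.0  # how many standard deviations counts as "unexpected"
anomalies = [x for x in readings if abs(x - mean) / stdev > THRESHOLD]
```

Here the single reading of 25.0 stands out against the otherwise stable values; more robust methods exist, but the principle is the same.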
How is this related?
In this post, we'll take a look at how we can find out in what way data is structured or related.
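A common technique for uncovering structure in unlabeled data is clustering. The sketch below runs a tiny k-means loop (k = 2) on hypothetical 2-D points, with fixed initial centroids for reproducibility:

```python
# A tiny k-means sketch (k = 2) on hypothetical 2-D points, to illustrate
# how unlabeled data can be grouped by structure alone.

points = [(1.0, 1.1), (1.2, 0.9), (0.8, 1.0), (5.0, 5.2), (5.1, 4.9), (4.9, 5.0)]
centroids = [(0.0, 0.0), (6.0, 6.0)]  # fixed initial guesses for reproducibility

def dist2(p, q):
    # Squared Euclidean distance (no need for the square root when comparing)
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

for _ in range(10):  # a few assignment/update rounds
    # Assign each point to its nearest centroid
    clusters = [[], []]
    for p in points:
        nearest = min(range(2), key=lambda i: dist2(p, centroids[i]))
        clusters[nearest].append(p)
    # Move each centroid to the mean of its assigned points
    centroids = [
        (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
        for c in clusters
    ]
```

On this toy data the algorithm converges immediately, splitting the points into two groups of three around (1, 1) and (5, 5).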
Is this A, or B?
As a follow-up to last week's machine learning tidbit, let's look at an example of how we can solve a classification problem using machine learning (on recreational data).
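One of the simplest classifiers that answers "is this A, or B?" is nearest neighbour: label a new observation like the closest known example. The labelled points below are hypothetical, for illustration only:

```python
# A minimal 1-nearest-neighbour classifier on hypothetical labelled points,
# deciding "A or B" for a new observation.

training = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((4.0, 4.2), "B"), ((4.1, 3.9), "B"), ((3.8, 4.0), "B"),
]

def classify(point):
    # Pick the label of the closest training example (squared distance)
    _, label = min(
        training,
        key=lambda item: (item[0][0] - point[0]) ** 2
                         + (item[0][1] - point[1]) ** 2,
    )
    return label
```

A point near the first group, such as (1.1, 0.9), gets labelled "A"; one near the second group gets "B".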
What is a graph database?
Although graph theory has been around for centuries, graph databases started to appear relatively recently.
‘Traditional’ relational databases store data in tables. These tables have a fixed format (fixed number of columns, each with a fixed data type). Tables are linked through the primary key in one table and the corresponding foreign keys in other tables. When a query is executed, the database engine fetches the primary keys from one table and links them to the corresponding foreign keys in other tables through SQL joins.
Although this works well for inserting, selecting, updating and deleting individual records over one or a limited number of tables (CRUD operations), performing joins during query execution over large schemas and data volumes is expensive and slow.
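The join described above can be sketched as follows. The table contents are hypothetical; a naive nested-loop join is O(n × m), and while real engines use indexes and hashing, the key point is that the linking work happens at query time, every time:

```python
# Query-time join sketch: link orders to customers via a foreign key.
# Table contents are hypothetical, for illustration only.

customers = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": "Bob"},
]
orders = [
    {"id": 10, "customer_id": 2, "total": 25.0},
    {"id": 11, "customer_id": 1, "total": 40.0},
]

# Nested-loop join: the primary key / foreign key match is evaluated
# for every candidate pair, at query time.
joined = [
    (c["name"], o["id"], o["total"])
    for o in orders
    for c in customers
    if o["customer_id"] == c["id"]
]
```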
Property graph databases (like the one offered by market leader Neo4J) use nodes with labels and properties to store data instead of the relational database tables and columns. Instead of fitting all data into a fixed, predefined table structure, each node is a new instance with a structure that can be different from other nodes.
This schema-less structure offers more flexibility than the relational database table. On top of that flexibility, graph databases treat relationships as first-class citizens. Instead of being built during query execution, relationships are persisted in a graph database. Having all relationships stored with the data not only allows graph databases to dramatically outperform relational databases on relationship-heavy queries, it also enables use cases that simply aren't possible in relational databases.
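The idea of relationships as first-class citizens can be sketched with a plain adjacency structure: each node stores its outgoing edges directly, so traversal is a lookup rather than a join. The social graph and the `KNOWS` relationship below are hypothetical:

```python
# Sketch of "relationships as first-class citizens": each node stores its
# outgoing edges directly, so a multi-hop query is a chain of lookups,
# not a chain of joins. The social graph below is hypothetical.

graph = {
    "alice": {"KNOWS": ["bob", "carol"]},
    "bob":   {"KNOWS": ["dave"]},
    "carol": {"KNOWS": ["dave"]},
    "dave":  {"KNOWS": []},
}

def friends_of_friends(person):
    # Two hops along KNOWS edges, excluding direct friends and the person
    direct = graph[person]["KNOWS"]
    two_hops = {fof for f in direct for fof in graph[f]["KNOWS"]}
    return sorted(two_hops - set(direct) - {person})
```

Deeper traversals (three hops, shortest paths, and so on) stay cheap for the same reason: the relationships are already materialised next to the data.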
In this post, we’ll have a look at a couple of analytical use cases with graph databases.