Big Data solution for one of India’s fastest-growing private-sector banks

Problem Statement / Opportunity
  • Set up a big data analytics platform on the cloud to enable the bank to make better data-driven decisions
  • The platform must be able to collect, store, and process myriad data feeds (transactional, semi-structured, unstructured, batch, streaming, etc.) to enable advanced analytics
  • Establish frameworks that ensure safe data practices and faster model implementation and deployment

Oneture’s Role

Oneture has been a development partner in this journey almost from the start; the following are the use cases and features we have partnered with the bank to implement:

  • Building the big data lake for executing various types of big data applications, including data science and automation
  • Customer 360 data for searches and analysis
  • Identification of suspicious transactions in the Aadhaar-enabled payment system (AePS)
  • Automation of quarterly and half-yearly credit card loss reports
  • Customer segmentation for cross-selling based on transaction behavior
  • Identification of transaction and micro-ATM breaches
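To give a flavor of the suspicious-transaction use case, the sketch below flags agents whose daily transaction volume deviates sharply from the population. This is a minimal illustrative rule in Python (the bank's pipelines run on Spark/Scala), and the z-score threshold and data shapes are assumptions, not the bank's actual detection logic.

```python
from statistics import mean, stdev

def flag_suspicious(txn_counts, z_threshold=3.0):
    """Flag agents whose daily transaction count deviates strongly from
    the population mean (a simple z-score rule; threshold is illustrative).

    txn_counts: dict mapping agent id -> number of transactions that day.
    """
    counts = list(txn_counts.values())
    mu, sigma = mean(counts), stdev(counts)
    if sigma == 0:
        return []
    return sorted(agent for agent, c in txn_counts.items()
                  if abs(c - mu) / sigma > z_threshold)
```

In production such a rule would be one of several signals (velocity checks, device/location anomalies, etc.) rather than a standalone detector.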

Proposed Solution & Architecture

The following considerations guided the design of the solution.

Cloud Infrastructure

  • Create the S3 directory structure for dev/prod
  • Set up encryption/decryption on S3
  • Set up S3 lockdown rules
  • Set up EMR
  • Set up encryption/decryption on EMR
  • Set up data purge policies
  • CloudFormation templates/service portal
  • Data Governance
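As a rough illustration of the first and sixth items, the helper below builds partitioned dev/prod S3 keys and an S3 lifecycle rule of the kind passed to put_bucket_lifecycle_configuration. The environment names, zone names, and retention periods are hypothetical, not the bank's actual layout.

```python
from datetime import date

def s3_key(env, zone, source, ds, filename):
    """Build a partitioned S3 key like prod/raw/cards/dt=2024-01-31/part-0.parquet.
    env/zone vocabularies here are illustrative conventions."""
    assert env in ("dev", "prod") and zone in ("raw", "curated", "serving")
    return f"{env}/{zone}/{source}/dt={ds.isoformat()}/{filename}"

def purge_rule(zone, days):
    """An S3 lifecycle rule that expires objects under a zone prefix after
    `days` days; the dict shape matches the S3 lifecycle configuration API."""
    return {
        "ID": f"purge-{zone}",
        "Filter": {"Prefix": f"prod/{zone}/"},
        "Status": "Enabled",
        "Expiration": {"Days": days},
    }
```

Keeping the partition scheme (dt=YYYY-MM-DD) consistent across zones is what lets Glue/Hive discover partitions automatically later.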

Data Ingestion

  • Identify the data sources to be ingested and obtain small data samples
  • Identify the ingest cadence for each data set
  • Decide on file paths (directory structure)
  • Identify, provision, and set up the staging area (provisioning, installation, ports, etc.)
  • Create ETL pipelines to push data to S3 at the specified cadence
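A per-feed cadence is typically captured in a small registry that the scheduler consults. The sketch below is a minimal Python stand-in for that idea; the feed names, source systems, and cadences are hypothetical examples, not the bank's real feeds.

```python
from datetime import datetime, timedelta

# Illustrative ingestion registry: dataset -> source system and cadence.
FEEDS = {
    "card_txns": {"source": "switch",  "cadence_hours": 1},
    "kyc_docs":  {"source": "dms",     "cadence_hours": 24},
    "aeps_txns": {"source": "aeps_gw", "cadence_hours": 1},
}

def next_run(feed, last_run):
    """Return when a feed is next due, based on its registered cadence."""
    return last_run + timedelta(hours=FEEDS[feed]["cadence_hours"])

def due_feeds(now, last_runs):
    """Feeds whose next scheduled run is at or before `now`."""
    return sorted(f for f, t in last_runs.items() if next_run(f, t) <= now)
```

In the actual platform this role is played by the orchestration layer (e.g. Oozie or NiFi schedules), but the registry pattern keeps cadence decisions in one auditable place.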

Data Storage and Discovery

  • S3 storage (covered under Cloud Infrastructure above)
  • Data Discovery using Glue/ES
  • Metadata management
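Registering a dataset in the Glue Data Catalog makes it discoverable and queryable. The function below assembles a TableInput dict of the shape accepted by Glue's create_table API for a Parquet dataset; the table name, columns, and S3 location are illustrative assumptions.

```python
def glue_table_input(name, s3_location, columns, partition_keys=()):
    """Build a Glue TableInput dict (for glue.create_table) registering a
    Parquet dataset in the data catalog. Example values are hypothetical."""
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "PartitionKeys": [{"Name": k, "Type": "string"} for k in partition_keys],
        "StorageDescriptor": {
            "Columns": [{"Name": c, "Type": t} for c, t in columns],
            "Location": s3_location,
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {"SerializationLibrary":
                "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"},
        },
    }
```

Once registered, the same metadata serves Spark, Hive, and ad-hoc discovery tools, which is the point of centralized metadata management.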

Data Processing

  • Create applications for various required use cases
  • Fine-tune performance
  • Schedule jobs

Data Serving

  • Application dashboards
  • Application reports
  • DB or shared S3 bucket for application results

Code Integration and Deployment

  • Code commit
  • Build and test locally
  • Deploy to UAT
  • Deploy to production

User Management

  • Identify user groups (data warehouse, big data developers, analysts/data scientists, business users, ad-hoc users, external) and their access roles
  • Data governance
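Per-group access roles typically translate into scoped IAM policies. The helper below generates an illustrative read-only S3 policy for a user group; the bucket name, group names, and prefixes are hypothetical, not the bank's actual IAM setup.

```python
def readonly_s3_policy(group, prefixes):
    """An illustrative IAM policy granting a user group read-only access
    to specific prefixes of a (hypothetical) data-lake bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": f"ReadOnly{group}",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::datalake/{p}*" for p in prefixes],
        }],
    }
```

Analysts, for example, might get read-only access to the curated zone only, while developers additionally get write access to dev prefixes.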

Tools and Technologies Used

Technology domain         Tools
Front end                 Apache Zeppelin
Build tools               SBT, Apache Maven, Apache Ant/Apache Ivy
Big data tools            Apache Spark, Apache Hadoop, Apache Hive, Apache Oozie
SDKs                      Oracle Java Development Kit
Development environments  Scala IDE for Eclipse, JetBrains IntelliJ IDEA
Others                    Jenkins, JFrog Artifactory, Apache NiFi
Code commit               Apache Subversion
Amazon Web Services       EC2, EMR, Glue, KMS, S3, IAM, CloudWatch, CloudTrail

Value Delivered

  • The solution has given the customer its own development and deployment platform for big data analytics, helping it address many of its existing problems and challenges
  • Faster and cheaper implementation of new analytical models improves customer experience by offering the right product to the right customer at the right time
  • Better visibility of risk/fraud through continuous anomaly detection in the data; in one use case alone, this saved the bank many man-hours previously spent manually identifying suspicious transactions and related details
  • Increased operational efficiency by using templates in the respective data processes

Lessons Learned

  • Focus on data management: determine where the data and applications will reside, on an on-premises system or together in a cloud implementation. Most of the value of big data comes from co-locating it with knowledgeable end users, at the edges of the organization, where they can tinker with and glean insights from their own data.
  • Scale and speed: Hadoop can process a lot of data, but it is a batch system. Using Apache Spark and NoSQL stores drastically improves both scale and speed.
  • Data visualization: front-line professionals and others expected to act on big data insights need an easily digestible delivery mechanism.
  • Big data implementations: data and applications should be accessible via a platform-as-a-service (PaaS) approach. POCs can be built on-premises, but the production environment is best run on the cloud, which offers robustness, flexibility, scalability, availability, durability, and lower maintenance cost.
  • Data and application togetherness: creating a data lake to support the big data platform keeps data and tools close together, avoids network delays, and greatly increases productivity.
  • Integration of compliance and security: secure the channels to and from the data lake, as well as access within it, to strengthen security and simplify governance.
  • On your marks, get set, go: many factors determine the performance of an application, and with big data applications handling massive loads, performance tuning becomes a key concern. Once a POC proves the solution, calculating and planning the number of executors/workers helps handle jobs with varying loads. This also supports memory optimization, letting jobs run smoothly without overloading the system, at lower cost.
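The executor-planning arithmetic from the last point can be sketched as below. The constants (one core and 1 GB reserved per node for OS/Hadoop daemons, about 5 cores per executor, ~10% off-heap overhead) are common community heuristics for Spark sizing, not the bank's specific settings.

```python
def executor_plan(nodes, cores_per_node, mem_per_node_gb,
                  cores_per_executor=5, overhead_frac=0.10):
    """Rule-of-thumb Spark executor sizing.

    Reserves one core and 1 GB per node for the OS/Hadoop daemons, caps
    each executor at `cores_per_executor` cores, leaves one executor slot
    for the driver, and sets aside `overhead_frac` of executor memory for
    off-heap overhead. Heuristic constants only; tune against real jobs.
    """
    usable_cores = cores_per_node - 1
    executors_per_node = usable_cores // cores_per_executor
    total_executors = nodes * executors_per_node - 1   # one slot for the driver
    mem_per_executor = (mem_per_node_gb - 1) / executors_per_node
    heap_gb = int(mem_per_executor * (1 - overhead_frac))
    return {"num_executors": total_executors,
            "executor_cores": cores_per_executor,
            "executor_memory_gb": heap_gb}
```

For a 10-node cluster of 16-core, 64 GB machines this yields 29 executors with 5 cores and 18 GB heap each, which maps onto spark-submit's --num-executors, --executor-cores, and --executor-memory flags.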

Looking for Big Data Solution for Your Company? Let's Talk