Data-driven Analysis and Forecasting on Big Distributed Data File Systems

Project Title

Project Area of Specialization

Computer Science

Project Summary

Data is now one of the most important resources in computing and business. It can support a research study, boost a company’s revenue, power detection systems, and much more. Big Data is the field that deals with methods for analyzing, extracting information from, or otherwise working with data volumes that are too large or complex for traditional data-processing software to handle.

Companies that store very large quantities of data face particular challenges, especially in the banking sector. The data may be structured or unstructured and can amount to several terabytes, or as much as 1 PB, per day. In banking systems, the data on transactions, account holders, credit cards, and so on is so large that running queries to extract the needed records is difficult: a query either takes a long time or brings the system down entirely.

Our idea is to work on a platform that handles big data, analyzes it, and returns optimized results, so that queries do not crash the system and different functions can be processed efficiently. Our main goal is to work with different tools for Big Data predictive analytics and to analyze the problems that arise when working with large data. We will benchmark different tools and methods, set up Hadoop clusters (collections of computers), generate large-scale data, load the data files onto HDFS, and perform predictive analysis and forecasting on sales data to help increase sales, profitability and earnings, the number of accounts, and so on in upcoming years.

Project Objectives

To build a platform for banks that handles large, complex volumes of banking data and performs predictive analysis on big data.
To forecast future outcomes for different parameters of the bank’s data.
To evaluate and compare the performance of big data systems and architectures through big data benchmarking.

Project Implementation Method

With processing latency as a top priority, two methods for big data processing have been proposed and implemented: batch processing of stored data and real-time processing of data streams. The following paragraphs explore the most promising tools for these methods and discuss when to use each.
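The difference between the two methods can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions, not part of the proposed platform: the function names and the transaction amounts are hypothetical. The batch function needs the whole stored dataset before it produces one answer, while the streaming function emits an updated answer as each record arrives.

```python
from typing import Iterable, Iterator, List


def batch_total(stored_amounts: List[float]) -> float:
    """Batch processing: the full stored dataset is read, then one result is produced."""
    return sum(stored_amounts)


def streaming_totals(amounts: Iterable[float]) -> Iterator[float]:
    """Stream processing: the running result is updated and emitted per incoming record."""
    running = 0.0
    for amount in amounts:
        running += amount
        yield running


# Hypothetical transaction amounts, used only for illustration.
transactions = [120.0, 75.5, 300.0, 42.5]

batch_result = batch_total(transactions)
stream_results = list(streaming_totals(iter(transactions)))

print(batch_result)
print(stream_results)
```

Batch processing suits large historical queries where a delay is acceptable; stream processing suits cases where each new transaction should update the result immediately.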

Hadoop is an open-source distributed processing framework for big data applications that manages both data processing and storage. HDFS (Hadoop Distributed File System) is a key part of the Hadoop ecosystem: it stores vast amounts of data efficiently across clusters of hardware, on which applications are then run. The first step will be to cluster computers that are networked together. HDFS splits each file into blocks and distributes these blocks across the nodes of the cluster. HDFS also replicates each block so that data is not lost if a node fails.

After this, Hive LLAP will be used to execute queries on the data files. LLAP (Live Long and Process, also expanded as Low Latency Analytical Processing) is a feature introduced in Hive 2.0, and Hive itself is a SQL-on-Hadoop processing framework. LLAP promises low-latency SQL queries in Hadoop through a hybrid execution model that combines long-lived daemons, which cache data and process query fragments, with standard container-based execution. LLAP is compatible with Hive SQL queries and data formats. It aims to provide interactive, sub-second SQL while keeping all the data in Apache Hadoop.
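The effect of block splitting and replication on storage can be estimated with simple arithmetic. The sketch below assumes the HDFS defaults of a 128 MB block size and a replication factor of 3; the function name and the 1000 MB file are hypothetical, and a real cluster may be configured differently.

```python
import math

BLOCK_SIZE_MB = 128      # assumed: HDFS default block size
REPLICATION_FACTOR = 3   # assumed: HDFS default replication factor


def hdfs_footprint(file_size_mb: float,
                   block_size_mb: int = BLOCK_SIZE_MB,
                   replication: int = REPLICATION_FACTOR) -> dict:
    """Estimate how many blocks a file occupies and how much raw storage it needs."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    return {
        "blocks": blocks,                             # logical blocks of the file
        "block_replicas": blocks * replication,       # block copies stored on the cluster
        # The last block only holds the remaining data, so raw storage
        # is the file size multiplied by the replication factor.
        "raw_storage_mb": file_size_mb * replication,
    }


# Hypothetical 1000 MB data file.
footprint = hdfs_footprint(1000)
print(footprint)
```

Under these assumptions, the 1 PB per day mentioned in the summary would need roughly 3 PB of raw disk per day, which is worth keeping in mind when sizing the cluster.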

We will then move on to Spark and to Impala with Iceberg. Spark, like Hadoop, is a framework that provides a number of interconnected platforms, systems, and standards for Big Data projects. It is open source, which means anyone can use it freely and modify it to produce a custom version for a particular purpose or problem. Spark is designed to process data in chunks in memory. Impala brings scalable parallel database technology to Hadoop, enabling users to issue low-latency SQL queries on data stored in HDFS and Apache HBase without having to move or transform the data. Apache Iceberg is an open table format for storing huge analytic datasets made up of slow-moving tabular data. The function of a table format is to determine how the files that make up a table are managed, organized, and tracked.
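The idea of processing data in chunks in memory can be illustrated without Spark itself. The following sketch is plain Python and does not use the Spark API; the record layout, branch names, and function names are all hypothetical. Each partition is aggregated independently, as a worker would do, and the partial results are then merged.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, float]  # hypothetical schema: (branch, transaction amount)


def split_into_partitions(records: List[Record], num_partitions: int) -> List[List[Record]]:
    """Distribute records round-robin into in-memory chunks."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def aggregate_partition(partition: List[Record]) -> Dict[str, float]:
    """Compute per-branch totals for one chunk, independently of the others."""
    totals: Dict[str, float] = {}
    for branch, amount in partition:
        totals[branch] = totals.get(branch, 0.0) + amount
    return totals


def merge_partials(partials: List[Dict[str, float]]) -> Dict[str, float]:
    """Combine the partial totals from all chunks into the final result."""
    merged: Dict[str, float] = {}
    for partial in partials:
        for branch, amount in partial.items():
            merged[branch] = merged.get(branch, 0.0) + amount
    return merged


# Hypothetical transactions, used only for illustration.
records = [
    ("branch_a", 100.0), ("branch_b", 50.0), ("branch_a", 25.0),
    ("branch_c", 10.0), ("branch_b", 5.0),
]

partitions = split_into_partitions(records, num_partitions=2)
totals = merge_partials([aggregate_partition(p) for p in partitions])
print(totals)
```

In a real cluster the per-partition step would run in parallel on different machines; here it runs sequentially to keep the example self-contained.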

For predictive analysis, features will first be extracted as part of the pattern recognition cycle. Feature extraction determines which features will be used for learning, given that the pattern’s description and properties are known. Based on these features, the distributed forecasting system will then predict parameters such as the increase in sales in upcoming years.
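As a minimal illustration of forecasting from an extracted feature, the sketch below fits a straight-line trend to yearly sales with ordinary least squares and extrapolates one year ahead. This is a stand-in technique chosen for brevity, not necessarily the model the project will use; the year is the only feature, and the sales figures and function names are hypothetical.

```python
from typing import List, Tuple


def fit_linear_trend(years: List[int], sales: List[float]) -> Tuple[float, float]:
    """Fit sales = slope * year + intercept by ordinary least squares."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(sales) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, sales))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast(year: int, slope: float, intercept: float) -> float:
    """Extrapolate the fitted trend to a given year."""
    return slope * year + intercept


# Hypothetical yearly sales figures (arbitrary units), for illustration only.
years = [2019, 2020, 2021, 2022]
sales = [100.0, 110.0, 120.0, 130.0]

slope, intercept = fit_linear_trend(years, sales)
prediction_2023 = forecast(2023, slope, intercept)
print(slope, prediction_2023)
```

Real banking data would call for more features (for example seasonality or account growth) and a model validated on held-out years, but the fit-then-extrapolate structure stays the same.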

Benefits of the Project

The project is beneficial for banks, their customers, and other organizations that deal with big data.
The expected outcome is a platform for banks on which big data can be handled easily and predictive analysis on big data can be performed.

Technical Details of Final Deliverable

The tools used for the project are Hadoop HDFS, Apache Spark, Apache Hive, and Impala with Iceberg.

Different industries face problems in handling big data and in finding the right resources to analyze it. They encounter many barriers in managing large amounts of data and lack the resources to overcome them. The scope of our project is to address these problems. The aim is to build a platform for banks that deal with large amounts of data, capable of predictive analysis on different parameters. Benchmarking will be carried out to obtain optimized results, and predictive analysis on big data will be performed using efficient machine learning tools and methods.
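A benchmarking run can be as simple as timing the same task several times on each system and comparing the statistics. The sketch below is one possible harness under that assumption; the `benchmark` helper and `sample_query` are hypothetical, and `sample_query` is only a placeholder for a real Hive, Spark, or Impala query.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(task: Callable[[], object], repeats: int = 5) -> Dict[str, float]:
    """Run a task several times and report min, median, and max wall-clock time."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        task()
        durations.append(time.perf_counter() - start)
    return {
        "min_s": min(durations),
        "median_s": statistics.median(durations),
        "max_s": max(durations),
    }


def sample_query() -> int:
    """Placeholder workload standing in for a query against the cluster."""
    return sum(i * i for i in range(10_000))


result = benchmark(sample_query, repeats=5)
print(result)
```

Reporting the median alongside the minimum and maximum helps separate a system's typical latency from one-off effects such as cold caches.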

Final Deliverable of the Project

Software System

Core Industry

Finance

Other Industries

Core Technology

Big Data

Other Technologies

Artificial Intelligence (AI)

Sustainable Development Goals

Industry, Innovation and Infrastructure

Required Resources

Item Name	Type	No. of Units	Per Unit Cost (in Rs)	Total (in Rs)
512GB SATA SSD	Equipment	4	12500	50000
12GB RAM	Equipment	4	5000	20000
Big Data Books, Proposal & Research Paper & Report Printing	Miscellaneous	1	10000	10000
			Total (in Rs)	80000
