Category Archives: Data Mining

Talking Points from my pitch at the RPI E-Ship Problem Pitch Competition (Spring 2019)

Photo Credit: RPI

  • Data is everywhere, but not all data is created equal. About 2.5 quintillion bytes of data are generated each day.[1] However, the vast majority of this “Big Data” is unstructured and unlabeled, preventing businesses and institutions from harnessing the full power of Analytics and AI to drive decision making.
  • Because it is unstructured, Big Data requires significant preprocessing. The scale of that work keeps all but the largest organizations from building and refining novel datasets of the kind that save companies like Netflix $1 billion a year in customer retention.
  • It is estimated that “a third of business intelligence professionals spend 50% to 90%” of their time preprocessing data, with the remaining time left for model building and evaluation.[2] Given that a 3TB Big Data project costs $1 million for each month it is active, at least $500,000 per month is out the door before the “real” analysis work begins.[3]
  • This may explain why 60%-85% of big data projects are never completed, despite the enormous 130% ROI such projects yield organizations compared to competitors not using big data.[4][5]
  • The “Big Data” structuring problem is relatively new, which is one of the reasons that it remains open-ended. Right now the solution is to throw more processors and people at it.
  • However, with the rise of IoT and Web 4.0, companies need to structure and mine this data efficiently to remain competitive in the 21st century. The current data engineering techniques and platforms are incapable of doing this efficiently, making the need for a new solution urgent. 
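
The cost estimate in the talking points can be checked with a few lines of arithmetic. This is a rough sketch that assumes project cost scales directly with the share of time spent on preprocessing; the constant and function names are illustrative and not taken from the cited sources.

```python
# Figures quoted in the talking points above.
MONTHLY_PROJECT_COST = 1_000_000  # USD per month for a 3TB project [3]
PREPROCESSING_SHARE_LOW = 0.50    # lower bound of time spent preprocessing [2]
PREPROCESSING_SHARE_HIGH = 0.90   # upper bound of time spent preprocessing [2]


def preprocessing_cost(monthly_cost, share):
    """Monthly spend attributable to preprocessing (assumes cost scales with time)."""
    return monthly_cost * share


low = preprocessing_cost(MONTHLY_PROJECT_COST, PREPROCESSING_SHARE_LOW)
high = preprocessing_cost(MONTHLY_PROJECT_COST, PREPROCESSING_SHARE_HIGH)
print(f"Preprocessing spend per month: ${low:,.0f} to ${high:,.0f}")
```

Under that assumption, the $500,000 figure is the low end of the range; at the 90% bound, the preprocessing share would be closer to $900,000 per month.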

Sources:

  1. https://techcrunch.com/2017/07/21/why-the-future-of-deep-learning-depends-on-finding-good-data/
  2. https://www.zdnet.com/article/big-datas-biggest-problem-its-too-hard-to-get-the-data-in/
  3. https://www.cooladata.com/blog/true-cost
  4. https://www.techrepublic.com/article/85-of-big-data-projects-fail-but-your-developers-can-help-yours-succeed/
  5. https://existek.com/blog/big-data-solutions-development-cost-example-software/

 

Data Visualization

Data visualization is the process (or the process output) of communicating key aspects of a dataset through the use of “visual elements like charts, graphs, and maps,” among others.[1] Given the complexity of contemporary datasets, especially those related to Big Data, the role of data visualization in business intelligence, data analytics, and managerial decision making is becoming increasingly important. While many general-purpose programs, like Microsoft Excel, offer an elementary suite of data visualization tools, flagship visualization tools include Tableau and Qlik.[2] Data-science-friendly scripting languages such as Python and R also have various libraries for creating advanced visualizations.[3][4]

At the most basic level, a data visualization must articulate some trend or component of a dataset in a way that is unambiguous to the users. Great visualizations, however, “are quick to decipher and easy to be remembered as well as being appropriate to the data they represent.”[5] Because data visualization is very user-centric, there is no closed-form solution when it comes to designing visualizations. However, all visualizations should tell a story, provide sufficient context without overwhelming the user, and include a clear call to action.[6] The last principle is especially important because “The more easily understood your call-to-action, the more people will willingly interact with your insight, brainstorm solutions and implement recommendations.”[6] With regard to technical principles, most data visualizations should include a baseline, “sufficient contrast between colors,” a limited number of colors, clear and concise labels, and intuitive ordering.[7]

Businesses use data visualization techniques to create dashboards “that track company performance across key performance indicators and visually interpret the results.”[2] For instance, IT call centers use dashboards to convey information regarding agent handle time, rate of service ticket closure, and average client feedback rating. Using these dashboards, managers can determine which agents have earned a raise and which agents require more training to boost their performance.[8] In a broad sense, dashboards and data visualization techniques also serve to enforce transparency and accountability. In 2016, the White House released an interactive treemap representing the Federal Budget. The goal of this visualization was “to communicate with taxpayers about where their tax dollars would go.”[9]

Data scientists and machine learning architects use data visualization techniques to monitor the performance of Deep Convolutional Neural Networks during training. During each training epoch, a set of metrics is computed, including loss and accuracy on both the training data and the validation data. However, drawing conclusions from any one epoch is meaningless, as the metrics in that epoch could correspond to an overshot model or other random processes in the model development. Instead, it is important to analyze and draw conclusions from the overall trend of the metrics. Multiple line graphs provide machine learning practitioners with a quick, reliable way of doing this; without them, debugging complex models would be difficult, if not impossible.[10]
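
The point about reading the trend rather than a single epoch can also be sketched numerically. The example below smooths a per-epoch validation loss with a trailing moving average before looking for its minimum. The function names and the loss values are made up for illustration; in practice the same lists would typically be passed to a plotting library to draw the line graphs described above.

```python
def moving_average(values, window=3):
    """Smooth a per-epoch metric with a trailing window so one noisy epoch matters less."""
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def best_epoch(val_loss, window=3):
    """Return the 1-based epoch where the smoothed validation loss is lowest."""
    smoothed = moving_average(val_loss, window)
    return smoothed.index(min(smoothed)) + 1


# Made-up validation losses for illustration only (not from a real training run).
val_loss = [0.90, 0.70, 0.75, 0.55, 0.50, 0.52, 0.58, 0.66]
print(moving_average(val_loss))
print("Smoothed validation loss bottoms out at epoch", best_epoch(val_loss))
```

A validation loss that keeps rising after that point while the training loss keeps falling is the kind of pattern that is easy to spot on a line graph and easy to miss in a table of per-epoch numbers.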

Many bioinformatics programs like those offered by Geneious and Dnastar include visualization tools that allow biologists to examine 3D models of DNA, proteins, and other molecules.[11][12] Furthermore, medical students, practitioners, and professors are now using apps like Human Anatomy Atlas by Visible Body to visually examine the complex networks of the human body for training, disease and injury diagnosis, and medical equipment engineering.[13] Among other features, the interactive visualizations found in Human Anatomy Atlas allow users to “see structures from all body systems,” “delve into the microanatomy of tissue and special organs,” and “watch muscle movements demonstrated in rotatable moving 3D models.”[13]

Data visualization is changing the way political scientists and the general public analyze political policy and climate. One day after President Trump’s 2019 State of the Union (SOTU), American news site Axios posted an article summarizing Trump’s SOTU talking points. The article includes a large interactive visualization that overlays his speech with colors, each of which corresponds to a distinct topic.[14] For starters, this visualization allows users to quickly identify the topics they care most about and examine the President’s words verbatim. From a big-picture perspective, it also allows users to gauge the amount of time the President spent on each topic. What’s more, the 2019 SOTU transcript is placed alongside the 2018 and 2017 SOTU transcripts (also color-coded), allowing users to compare the President’s agenda over time.[14] This type of visualization demonstrates how complex data can be converted into a comprehensive and visually appealing graphic that allows users to “drill down” into data as they see fit.

Sources:

  1. https://www.tableau.com/learn/articles/data-visualization
  2. https://searchbusinessanalytics.techtarget.com/definition/data-visualization
  3. https://www.analyticsvidhya.com/blog/2015/07/guide-data-visualization-r/
  4. https://mode.com/blog/python-data-visualization-libraries
  5. https://www.tapclicks.com/innovative-data-visualization-examples/
  6. https://www.gooddata.com/blog/8-ways-turn-good-data-great-visualizations
  7. https://www.columnfivemedia.com/25-tips-to-upgrade-your-data-visualization-design
  8. https://www.klipfolio.com/resources/dashboard-examples/call-center
  9. https://www.tableau.com/learn/articles/best-beautiful-data-visualization-examples
  10. https://medium.com/@kapilvarshney/how-to-plot-the-model-training-in-keras-using-custom-callback-function-and-using-tensorboard-41e4ce3cb401
  11. https://www.geneious.com/about/
  12. https://www.dnastar.com/software/structural-biology/
  13. https://www.visiblebody.com/anatomy-and-physiology-apps/human-anatomy-atlas
  14. https://www.axios.com/what-trump-talked-about-in-the-state-of-the-union-803bc486-d1b9-4817-933d-172735103b46.html

 

Data Mining and Big Data Analytics Applications

Data mining is the process of extracting and analyzing “relationship and patterns in stored transaction data to get information which will help for better business decisions.”[1] It is typically done using regression, clustering, or classification algorithms that operate on small or large relational databases and structured data.[1] Data mining is rooted in statistics. It is used to unmask potentially useful trends underlying a particular dataset. Big data analytics is closely related but differs in that it emphasizes processing and storing very large heterogeneous datasets (those over 1 terabyte) that contain structured, semi-structured, and unstructured data.[1][2] In some regards, big data analytics is data mining on steroids. Companies and governments around the globe are leveraging both data mining and big data analytics to make more informed decisions using the data they have at their disposal.[3] It is estimated that by 2020 there will be “5,200 Gbs of data on every person in the world” leaving no shortage of data to be mined.[1]
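
To make the idea of clustering concrete, here is a minimal sketch of k-means on one-dimensional data. It is a teaching example rather than a production implementation: the function name, the starting centroids, and the transaction amounts are all hypothetical, and real projects would normally rely on an established library.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Minimal 1-D k-means: assign points to the nearest centroid, then recompute centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical transaction amounts (illustrative values, not real data).
amounts = [5, 7, 6, 8, 95, 100, 105, 98]
centroids, clusters = kmeans_1d(amounts, centroids=[0, 50])
print(centroids)
print(clusters)
```

On this toy input the algorithm separates small purchases from large ones, which is the kind of underlying pattern data mining is meant to surface.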

In the healthcare field, data mining and data analytics are used to quickly and effectively discover the unintended side effects of certain medications, specifically when multiple medications are taken together. “By mining tens of thousands of electronic patient records, researchers at Stanford University quickly discovered an unexpected answer: people who take both drugs [Paxil and Pravachol] have higher blood glucose levels.”[4] Clinical trials are expensive, and they take a long time to complete due to patient health and safety regulations. Without data mining, it could have taken years, if not a decade, for researchers to discover that mixing drugs like Paxil and Pravachol is dangerous.

In national defense, data mining and data analytics are used to increase the effectiveness and efficiency of tactical and surveillance systems. Most of the data mining software used by the military is developed by the Defense Advanced Research Projects Agency (DARPA), the same agency that created ARPANET, the precursor to the Internet.[5] One such example is the Chaff Electronic Countermeasure System, which is used to “detect missile near the ship, calculate the missile’s trajectory towards the ship and release chaff rockets to deflect the incoming missile to another calculated trajectory away from the ship.”[5] This data mining-powered anti-missile system can react with greater speed and accuracy than a human can, increasing the overall safety of a combat ship and its crew.[5]

Businesses use data mining and big data analytics to increase operational effectiveness and gain a strategic advantage over rivals. Netflix is one prominent example. Every time a Netflix user scrolls through a row of movie titles, clicks on a trailer, or stops watching a particular television show to find another one, their action is recorded by Netflix. Netflix uses this data to create personalized video rankings, group similar movie and television titles, and decide what content to license or produce in-house.[6] These features have allowed Netflix to save $1 billion in customer retention and surpass Disney as the world’s largest entertainment company with only 5,000 employees compared to Disney’s 200,000 employees.[7] Similarly, Amazon uses big data analytics and one of its subfields, predictive analytics, “for targeted marketing to increase customer satisfaction and build company loyalty.”[8] Specifically, Amazon employs a robust collaborative filtering engine (CFE) that “analyzes what items you purchased previously, what is in your online shopping cart or on your wish list…to recommend additional products that other customers purchased when buying those same items.”[8] Amazon has found that its CFE generates 35% of its annual revenue.[8] Amazon also uses analytics to dynamically update the prices of products sold through its website every 10 minutes in order to increase annual profits.[8]
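
The “customers who bought this also bought” idea behind collaborative filtering can be illustrated with a simple co-occurrence count. This is a heavily simplified stand-in and not a description of Amazon's actual engine; the function name and the order histories are hypothetical.

```python
from collections import Counter


def also_bought(orders, item, top_n=2):
    """Rank items that most often appear in the same order as `item`."""
    counts = Counter()
    for order in orders:
        if item in order:
            counts.update(other for other in set(order) if other != item)
    return counts.most_common(top_n)


# Hypothetical order histories for illustration only.
orders = [
    ["tent", "sleeping bag", "lantern"],
    ["tent", "sleeping bag"],
    ["tent", "lantern"],
    ["stove", "kettle"],
    ["tent", "sleeping bag", "stove"],
]
print(also_bought(orders, "tent"))
```

A real recommender would also weigh browsing history, carts, and wish lists, as the quoted description notes, but the core signal is the same: items that tend to be purchased together.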

Aided by powerful supercomputers, NASA stores, processes, and distributes “12.1TB of data every single day from thousands of sensors and systems dotted across the world and space.”[9] It uses this data to “help analyze the challenging projects, from solar flare and space weather scenarios to detailed space vehicle designs.”[10] At present, NASA, along with several other US government agencies, is working on sophisticated models to monitor and predict the long-term effects of global warming and climate change. The accuracy of these climate change models is only possible because of advances in computational power, big data analytics techniques, and the vast amount of recorded climate data.[11]

Data mining and big data analytics also play a major role in cybersecurity. Traditional cybersecurity defensive measures include firewalls, role-based access controls, and strong passwords.[12] While these measures are still necessary to eliminate low-level threats, they are not enough. Newer cybersecurity defensive measures include analytics software that can detect suspicious user behavior and cut off access to a potential attacker before they can compromise an organization’s network.[13]
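
One simple way to think about detecting suspicious user behavior is to compare today's activity against a user's own history. The sketch below flags a day whose count sits far above the historical mean, measured in standard deviations. The function name, the threshold of 3, and the download counts are assumptions chosen for illustration; commercial security analytics tools use far richer models.

```python
from statistics import mean, stdev


def flag_suspicious(history, today, threshold=3.0):
    """Flag today's count if it is more than `threshold` standard deviations above the mean."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # No variation in the history: anything above the usual level stands out.
        return today > mu
    return (today - mu) / sigma > threshold


# Hypothetical daily file-download counts for one user (illustrative only).
history = [12, 15, 11, 14, 13, 12, 16]
print(flag_suspicious(history, today=14))   # an ordinary day
print(flag_suspicious(history, today=400))  # a sudden spike
```

A flagged account could then have its access suspended pending review, which is the “cut off access” step described above.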

Research shows that companies that use some form of data mining or big data analytics to drive business strategy are “twice as likely to rank in the top quarter for financial performance, five times more likely to make timely decisions, and three times more likely to execute their decisions and plans.”[13] For this reason, the big data analytics market is expected to be worth over $200 billion by 2020.[14]

Sources:

  1. “Big Data vs Data Mining – Find Out The Best 8 Differences.” EDUCBA.com, EDUCBA, 5 Oct. 2018, www.educba.com/big-data-vs-data-mining/.
  2. Labbe, Mark, et al. “What Is Big Data Analytics? – Definition from WhatIs.com.” SearchBusinessAnalytics, Sept. 2018, searchbusinessanalytics.techtarget.com/definition/big-data-analytics.
  3. “5 Data Mining Applications.” ExpertSystem.com, Expert System, 7 July 2016, www.expertsystem.com/5-data-mining-applications/.
  4. Savage, Neil. “Mining Data for Better Medicine.” MIT Technology Review, MIT Technology Review, 30 Dec. 2013, www.technologyreview.com/s/425466/mining-data-for-better-medicine/.
  5. “Data Mining in the Military.” MIS Class Blog, MIS Class Blog, 14 Mar. 2017, misclassblog.com/data-analytics/data-mining-in-the-military/.
  6. “How Netflix Uses Big Data to Drive Success – Predictive Analytics Times – Machine Learning & Data Science News.” Predictive Analytics – The Power to Predict Who Will Click, Buy, Lie, or Die, 18 Sept. 2018, www.predictiveanalyticsworld.com/patimes/how-netflix-uses-big-data-to-drive-success/9693/.
  7. Dans, Enrique. “How Analytics Has Given Netflix The Edge Over Hollywood.” Forbes, Forbes Magazine, 30 May 2018, www.forbes.com/sites/enriquedans/2018/05/27/how-analytics-has-given-netflix-the-edge-over-hollywood/#5017d8546b23.
  8. Wills, Jennifer. “7 Ways Amazon Uses Big Data to Stalk You.” Investopedia, Investopedia, 20 Oct. 2018, www.investopedia.com/articles/insights/090716/7-ways-amazon-uses-big-data-stalk-you-amzn.asp.
  9. Gorey, Colm. “The Volume of Data NASA Has to Manage Is Mind-Boggling.” Silicon Republic, 26 Oct. 2017, www.siliconrepublic.com/enterprise/nasa-data-figures.
  10. Skytland, Nick. “What Is NASA Doing with Big Data Today? | OpenNASA.” NASA, NASA, 4 Oct. 2012, open.nasa.gov/blog/what-is-nasa-doing-with-big-data-today/.
  11. Northon, Karen. “NASA Releases Detailed Global Climate Change Projections.” NASA, NASA, 8 June 2015, www.nasa.gov/press-release/nasa-releases-detailed-global-climate-change-projections.
  12. “10 Basic Cybersecurity Measures: Best Practices to Reduce Exploitable Weaknesses and Attacks.” Ics-Cert.us-Cert.gov, Water Information Sharing & Analysis Center (WaterISAC), June 2015, ics-cert.us-cert.gov/sites/default/files/documents/10_Basic_Cybersecurity_Measures-WaterISAC_June2015_S508C.pdf.
  13. “The Growing Importance of Business Analytics.” Villanova.edu, Villanova School of Business, taxandbusinessonline.villanova.edu/resources-business/article-business/the-growing-importance-of-business-analytics.html.
  14. Press, Gil. “6 Predictions For The $203 Billion Big Data Analytics Market.” Forbes, Forbes Magazine, 20 Jan. 2017, www.forbes.com/sites/gilpress/2017/01/20/6-predictions-for-the-203-billion-big-data-analytics-market/#354935c20838.