Dima Statz

Deep Signal, Part 1: Problem Statement

DeepSignal — an innovative open-source framework designed to redefine real-time video and audio processing on the Apache Spark platform.

Aedo: Real-Time Content-Driven Ad Insertion Framework

ChatPal: English Learning Enhanced with VR

ChatPal is a new VR game for Oculus, blending language learning with interactive fun. Inspired by Tamagotchi, kids engage with a virtual parrot to improve English skills through conversations and activities. ...

AI-Driven Ventures: A Framework for Developer Success

AI is a general-purpose technology, meaning it is useful not just for one thing but can be applied to many different applications. A good way to think about AI is as a collection of tools: Supervised Learning, ...

Semantic Kernel: Chat With Your Data

Although LLMs are amazing at generating text and have a wide range of applications, they're not a substitute for domain-specific knowledge and expertise.

Building Virtual Assistants using LangChain

ChatGPT has impressive general knowledge and can provide decent answers to various questions. However, when it comes to specific domains, its performance may fall short.

Visum — A Cloud Cost Optimization Platform

The worldwide infrastructure as a service (IaaS) market grew 41.4% in 2021, to total $90.9 billion, up from $64.3 billion in 2020. It is expected to be as high as $121.62 billion in 2022.

Monitoring Spark Streaming on K8s with Prometheus and Grafana

Cost efficiency and portability are the main reasons to migrate Apache Spark workloads from managed services like AWS EMR, Azure Databricks, or HDInsight to Kubernetes. You can learn more about the migration process from AWS EMR to K8s in the following article.

Benchmarking Graviton2 processors with Apache Spark workloads

Amazon EC2 provides a broad portfolio of compute instances, including many that are powered by the latest-generation Intel and AMD processors. AWS Graviton2 processors add even more choice. They are custom-built by AWS using 64-bit Arm Neoverse cores to enable the best price-performance for workloads running on Amazon EC2.

Processing costs measurement on multi-tenant EMR clusters

One of the 5 pillars of the Well-Architected Framework is Cost Optimization. The Cost Optimization pillar focuses on avoiding unnecessary costs, selecting the most appropriate resource types, analyzing spend over time, and scaling in and out to meet business needs without overspending.

Migrating Apache Spark workloads from AWS EMR to Kubernetes

ESG research found that 43% of respondents consider the cloud their primary deployment for Apache Spark. That makes a lot of sense, because the cloud provides scalability, reliability, availability, and massive economies of scale.

Monitoring the performance of software teams using GitHub, Jira, and Grafana

There are a bunch of good articles on the web about transitioning to fully remote work; my favorite one is “The Remote Manifesto” by GitLab. In addition, if you are somewhat like us and are trying to build a data-driven team, you will probably need some good metrics to rely on in order to monitor your team’s performance.

Monitoring Distributed Jetty Servers in K8s using Prometheus and Grafana

Monitoring and alerting are mandatory parts of any software system running in a production environment. To keep software systems healthy and to optimize performance and resource utilization, you need a unified operational view, real-time granular data, and historical reference.

No-Code Data Collect API on AWS

This article is all about moving data into Big Data Pipelines running on AWS. Since most data pipelines have 5 steps in common: collection -> storage -> processing -> analysis -> visualization, AWS has a very solid foundation for building all these steps.

Handling Data Skew in Apache Spark

One of the well-known problems in parallel computational systems is data skewness. Usually, in Apache Spark, data skewness is caused by transformations that change data partitioning, such as join, groupBy, and orderBy.

VerticaDB performance test with Locust.io

Locust is all about coding. You can manage all your tests in source control and share them with your team; you can easily add, remove, or fix any test; and you can automatically deploy tests to any environment.

No-Code Data Collect API

Building a data pipeline that handles 1,000,000 or more events per second is not a trivial task. To handle such heavy traffic, every data pipeline component must be designed and implemented properly. Fortunately, not every component has to be built from scratch.

An honest AWS MSK review - July 2019

AWS MSK is a fully managed service that enables you to build and run applications that use Apache Kafka to process streaming data. Amazon MSK makes it easy to ingest and process streaming data in real time ...

A Scala tutorial for Java developers

Scala was first introduced in January 2004 by Martin Odersky. It is a JVM-based, statically typed programming language that supports both object-oriented and functional programming paradigms. The most well-known products written in Scala are Apache Spark, Apache Kafka, Apache Flink ...