DeclareData Fuse

Get Started in ~30 Seconds

DeclareData Fuse is a blazing-fast, drop-in alternative for running Spark code without the overhead and expense of Spark clusters. It runs on your local machine as well as in production, providing the same APIs and functionality as Apache Spark.

Overview

Challenge

  • Spark is slow in production for ≤5 TB on-demand data workloads
  • High overhead and complexity of managing Spark clusters
  • Spark code workflows and test/deploy iterations are too slow
  • High cloud costs for maintaining Spark clusters

Why Fuse Server vs. Apache Spark?

  • 🚀 Process terabyte-scale workflows faster locally and in production
  • 💰 Cut cloud costs by 50%+ without compromising speed
  • 🔥 Speed and simplicity without the overhead of Spark clusters
  • 😄 Accelerate data engineering workflows, increase happiness
  • ✨ Keep your existing Spark codebase

Key Components

  • DeclareData Fuse Server: The blazing fast, simple drop-in Spark alternative
  • Fuse Python: PySpark-compatible Python client library for DeclareData Fuse
  • Fuse Go (experimental): Go (golang) client library for DeclareData Fuse

Server Setup

Run the latest DeclareData Fuse server using Docker:

$ docker run -p 8080:8080 -p 3000:3000 ghcr.io/declaredata/fuse:latest

The server is now running and ready to accept client connections on port 8080, with an optional web interface on port 3000.

Client Installation

Use pip to install the DeclareData Fuse client:

$ pip install declaredata_fuse

Updating the DeclareData Fuse client:

$ pip install --upgrade declaredata_fuse
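To confirm the install, a quick check like the one below can help. This is a minimal sketch, not from the official docs: it assumes the package exposes its version through standard Python packaging metadata and that FuseSession.builder.getOrCreate() reaches a local Fuse server started as shown above using the client's defaults.

# Sketch: verify the installed client and basic connectivity
# (assumes a Fuse server is already running locally via the docker command above)
from importlib.metadata import version

from declaredata_fuse.session import FuseSession

# Report which client version pip installed
print("declaredata_fuse version:", version("declaredata_fuse"))

# Open and immediately close a session to confirm the server is reachable
fs = FuseSession.builder.getOrCreate()
fs.stop()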

Usage by Example

Your PySpark code should work without modifications! There is no need to learn new APIs or refactor your codebase. See the official GitHub repository for more examples.
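In practice, the main change is where the session object comes from. A minimal sketch of that swap, assuming your existing code builds its session with the standard PySpark builder pattern:

# Before: a standard PySpark session
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.getOrCreate()

# After: the same builder pattern against DeclareData Fuse
from declaredata_fuse.session import FuseSession

spark = FuseSession.builder.getOrCreate()

# Downstream DataFrame code is unchanged, for example:
df = spark.read.csv("data.csv")
df.show(5)

spark.stop()

The rest of this section walks through complete examples against sample CSV data.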


# Initialize a new session
from declaredata_fuse.session import FuseSession

# Connect to DeclareData Fuse Server
fs = FuseSession.builder.getOrCreate()

# Read CSV data
df = fs.read.csv("data.csv")

# Show initial data
print("Initial Dataset:")
df.show(5)

# Filter for a specific state
print("\nCalifornia Data:")
ca_data = df.where(df.state_abbr == "CA")
ca_data.show(5)

# Sort by population
print("\nStates Sorted by Population:")
sorted_data = df.sort("population", ascending=False)
sorted_data.show(5)

# Select specific columns
print("\nYear and State Only:")
year_state = df.select("year", "state_abbr")
year_state.show(5)

fs.stop()

A second example reads multiple datasets in the same session:

# Initialize a new session
from declaredata_fuse.session import FuseSession

# Connect to DeclareData Fuse Server
fs = FuseSession.builder.getOrCreate()

# Use multiple datasets and sources
population_df = fs.read.csv("data.csv")
income_df = fs.read.csv("income.csv")

# Population Rankings
print("Top States by Population:")
population_df.sort("population", ascending=False).show(5)

# Year-by-Year Population Analysis
print("\nYear 2000 Population Data:")
pop_2000 = population_df.where(population_df.year == 2000)
pop_2000.sort("population", ascending=False).show(5)

print("\nYear 2001 Population Data:")
pop_2001 = population_df.where(population_df.year == 2001)
pop_2001.sort("population", ascending=False).show(5)

# State-Specific Analysis
print("\nCalifornia Population Trends:")
ca_pop = population_df.where(population_df.state_abbr == "CA")
ca_pop.sort("year").show(5)

print("\nTexas Population Trends:")
tx_pop = population_df.where(population_df.state_abbr == "TX")
tx_pop.sort("year").show(5)

# Major Population Centers
print("\nStates with Population > 20M:")
major_states = population_df.where(population_df.population >= 20000000)
major_states.sort("year").show(5)

fs.stop()
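Aggregations use the familiar groupBy/agg pattern. The sketch below assumes a PySpark-compatible F.sum aggregate is available in declaredata_fuse.functions; only F.first, F.rank, F.col, and F.desc appear in the examples further down, so treat F.sum as an assumption.

from declaredata_fuse.session import FuseSession
import declaredata_fuse.functions as F

fs = FuseSession.builder.getOrCreate()

population_df = fs.read.csv("data.csv")

# Total recorded population per state across all years in the dataset
# (F.sum is assumed here for illustration, mirroring the PySpark API)
print("Total Population per State:")
population_df.groupBy("state_abbr").agg(
    F.sum("population").alias("total_population")
).sort("total_population", ascending=False).show(5)

fs.stop()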

More Examples

Complex operations like distinct selections, filters, and multi-dataset joins work seamlessly.


from declaredata_fuse.session import FuseSession

fs = FuseSession.builder.getOrCreate()

population_df = fs.read.csv("data.csv")
income_df = fs.read.csv("income.csv")

print("Population Data Sample:")
population_df.show(3)

print("\nUnique States in Dataset:")
states = population_df.select("state_abbr").distinct()
states.show(10)

print("\nLarge States (Population > 20M):")
large_states = population_df.where(population_df.population >= 20000000)
large_states.sort("population", ascending=False).show(5)

print("\nYear 2000 Population Rankings:")
y2000 = population_df.where(population_df.year == 2000)
y2000.sort("population", ascending=False).show(5)

print("\nTop States by Income and Population:")
combined = population_df.join(
    income_df,
    ["state_abbr", "year"]
)
combined.sort("total_income", ascending=False).show(5)

print("\nHighest Population States by Year:")
high_pop = population_df.where(population_df.population >= 20000000)
high_pop.sort("year").show(5)

fs.stop()

Other complex operations like grouped aggregations and window functions also work seamlessly.


from declaredata_fuse.session import FuseSession
import declaredata_fuse.functions as F
from declaredata_fuse.window import Window

fs = FuseSession.builder.getOrCreate()

taxi_df = fs.read.csv("taxi.csv")
weather_df = fs.read.csv("weather.csv")

print("Initial Taxi Data:")
taxi_df.show(5)

print("\nWeather Conditions:")
weather_df.show(5)

print("\nHighest Fares by Date:")
taxi_df.groupBy("date").agg(
    F.first("fare_amount").alias("highest_fare_of_day")
).sort(
    "highest_fare_of_day", ascending=False
).show(5)

print("\nTrips During Snow:")
combined = taxi_df.join(
    weather_df, ["date"]
)
snow_trips = combined.where(combined.condition == "Snow")
snow_trips.sort("fare_amount", ascending=False).show(5)

print("\nFare Rankings by Weather Condition:")
window_spec = Window.partitionBy("condition").orderBy(F.desc("fare_amount"))
combined.withColumn(
    "fare_rank", F.rank().over(window_spec)
).where(
    F.col("fare_rank") <= 3
).show(9)

fs.stop()

Deploy

Deploying your code is as simple as running it locally. Deployment instructions are a work in progress and coming soon.