Showing 146 open source projects for "hadoop"

View related business solutions
  • Our Free Plans just got better! | Auth0 Icon
    Our Free Plans just got better! | Auth0

    With up to 25k MAUs and unlimited Okta connections, our Free Plan lets you focus on what you do best—building great apps.

    You asked, we delivered! Auth0 is excited to expand our Free and Paid plans to include more options so you can focus on building, deploying, and scaling applications without having to worry about your security. Auth0 now, thank yourself later.
    Try free now
  • Outgrown Windows Task Scheduler? Icon
    Outgrown Windows Task Scheduler?

    Free diagnostic identifies where your workflow is breaking down—with instant analysis of your scheduling environment.

    Windows Task Scheduler wasn't built for complex, cross-platform automation. Get a free diagnostic that shows exactly where things are failing and provides remediation recommendations. Interactive HTML report delivered in minutes.
    Download Free Tool
  • 1
    Apache Bigtop

    Apache Bigtop

    Bigtop is an Apache Foundation project for Infrastructure Engineers

    Apache Bigtop is a project focused on building and packaging the Hadoop ecosystem and related big data components. It provides a consistent framework for testing, packaging, and deploying Hadoop distributions, including tools like HDFS, YARN, Spark, Hive, HBase, and more. By maintaining cross-platform builds (RPMs, DEBs, Docker images, and Kubernetes support), Bigtop makes it easier for organizations to deploy big data stacks in different environments.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 2
    Apache Phoenix

    Apache Phoenix

    Mirror of Apache Phoenix

    Apache Phoenix is a SQL skin over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Apache Phoenix enables OLTP and operational analytics in Hadoop for low latency applications by combining the best of both worlds. The power of standard SQL and JDBC APIs with full ACID transaction capabilities and the flexibility of late-bound, schema-on-read capabilities from the NoSQL world by leveraging HBase as its backing store. Apache Phoenix is fully integrated with other Hadoop products such as Spark, Hive, Pig, Flume, and Map Reduce. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 3
    Apache Impala

    Apache Impala

    Apache Impala

    Impala provides low latency and high concurrency for BI/analytic queries on the Hadoop ecosystem, including Iceberg, open data formats, and most cloud storage options. Impala also scales linearly, even in multitenant environments. Impala is integrated with native Hadoop security and Kerberos for authentication, and via the Ranger module, you can ensure that the right users and applications are authorized for the right data.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 4
    Apache HBase

    Apache HBase

    Get random, realtime read/write access to your Big Data

    ...A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS. Thrift gateway and a REST-ful Web service that supports XML, Protobuf, and binary data encoding options. Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia; or via JMX. Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
    Downloads: 5 This Week
    Last Update:
    See Project
  • Atera - The depth of a full-stack IT platform, with the power of AI. Icon
    Atera - The depth of a full-stack IT platform, with the power of AI.

    Atera introduces your autonomous AI agent - Ensure operational efficiency at any scale with 24/7 autonomous IT support.

    Atera prioritizes security and compliance through robust protections that align with industry standards. Our AI-driven features were built on responsible AI principles and empower IT teams to work efficiently while maintaining trust and compliance.
    Learn More
  • 5
    SageMaker Spark

    SageMaker Spark

    A Spark library for Amazon SageMaker

    ...With SageMaker Spark, you can train on Amazon SageMaker from Spark DataFrames using Amazon-provided ML algorithms like K-Means clustering or XGBoost, and make predictions on DataFrames against SageMaker endpoints hosting your trained models, and, if you have your own ML algorithms built into SageMaker compatible Docker containers, you can use SageMaker Spark to train and infer on DataFrames with your own algorithms -- all at Spark scale. SageMaker Spark depends on hadoop-aws-2.8.1. To run Spark applications that depend on SageMaker Spark, you need to build Spark with Hadoop 2.8. However, if you are running Spark applications on EMR, you can use Spark built with Hadoop 2.7.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 6
    Apache Drill

    Apache Drill

    Apache Drill is a distributed MPP query layer for self describing data

    ...Drill supports a variety of NoSQL databases and file systems, including HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage, Swift, NAS and local files. A single query can join data from multiple datastores. For example, you can join a user profile collection in MongoDB with a directory of event logs in Hadoop.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 7
    HugeGraph

    HugeGraph

    A graph database that supports more than 100+ billion data

    ...HugeGraph supports fast import performance in the case of more than 10 billion Vertices and Edges Graph, millisecond-level OLTP query capability, and can be integrated into big data platforms like Hadoop or Spark for OLAP analysis. The main scenarios of HugeGraph include correlation search, fraud detection, and knowledge graph. Not only supports Gremlin graph query language and RESTful API but also provides commonly used graph algorithm APIs. To help users easily implement various queries and analyses, HugeGraph has a full range of accessory tools, such as supporting distributed storage, data replication, scaling horizontally, and supports many built-in backends of storage engines.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 8
    Luigi

    Luigi

    Python module that helps you build complex pipelines of batch jobs

    ...The purpose of Luigi is to address all the plumbing typically associated with long-running batch processes. You want to chain many tasks, automate them, and failures will happen. These tasks can be anything, but are typically long running things like Hadoop jobs, dumping data to/from databases, running machine learning algorithms, or anything else. You can build pretty much any task you want, but Luigi also comes with a toolbox of several common task templates that you use. It includes support for running Python mapreduce jobs in Hadoop, as well as Hive, and Pig, jobs. It also comes with file system abstractions for HDFS, and local files that ensures all file system operations are atomic.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 9
    Apache Hudi

    Apache Hudi

    Upserts, Deletes And Incremental Processing on Big Data

    Apache Hudi (pronounced Hoodie) stands for Hadoop Upserts Deletes and Incrementals. Hudi manages the storage of large analytical datasets on DFS (Cloud stores, HDFS or any Hadoop FileSystem compatible storage). Apache Hudi is a transactional data lake platform that brings database and data warehouse capabilities to the data lake. Hudi reimagines slow old-school batch data processing with a powerful new incremental processing framework for low latency minute-level analytics. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • Smart Business Texting that Generates Pipeline Icon
    Smart Business Texting that Generates Pipeline

    Create and convert pipeline at scale through industry leading SMS campaigns, automation, and conversation management.

    TextUs is the leading text messaging service provider for businesses that want to engage in real-time conversations with customers, leads, employees and candidates. Text messaging is one of the most engaging ways to communicate with customers, candidates, employees and leads. 1:1, two-way messaging encourages response and engagement. Text messages help teams get 10x the response rate over phone and email. Business text messaging has become a more viable form of communication than traditional mediums. The TextUs user experience is intentionally designed to resemble the familiar SMS inbox, allowing users to easily manage contacts, conversations, and campaigns. Work right from your desktop with the TextUs web app or use the Chrome extension alongside your ATS or CRM. Leverage the mobile app for on-the-go sending and responding.
    Learn More
  • 10
    Genie

    Genie

    Distributed Big Data Orchestration Service

    Genie is a completely open source distributed job orchestration engine developed by Netflix. Genie provides REST-ful APIs to run a variety of big data jobs like Hadoop, Pig, Hive, Spark, Presto, Sqoop and more. It also provides APIs for managing the metadata of many distributed processing clusters and the commands and applications which run on them.
    Downloads: 1 This Week
    Last Update:
    See Project
  • 11
    IoTDB

    IoTDB

    Apache IoTDB

    Apache IoTDB (Database for Internet of Things) is an IoT native database with high performance for data management and analysis, deployable on the edge and the cloud. Due to its light-weight architecture, high performance and rich feature set together with its deep integration with Apache Hadoop, Spark and Flink, Apache IoTDB can meet the requirements of massive data storage, high-speed data ingestion and complex data analysis in the IoT industrial fields. In the scene of factories, there are tens of devices under LAN network. IoTDB can be installed on a local controller server in the factory to receive data from those devices. ...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 12
    ClickHouse

    ClickHouse

    A fast open-source OLAP database management system

    ClickHouse® is a fast, open-source column-oriented database management system that can generate analytical data reports through SQL queries in real time. According to several independent benchmarks, it far exceeds other comparable column-oriented database management systems, working even up to 1000 times faster. It is able to process hundreds of millions to more than a billion rows and tens of gigabytes of data per single server per second. Apart from its blazing speed, ClickHouse is highly...
    Downloads: 16 This Week
    Last Update:
    See Project
  • 13
    XGBoost

    XGBoost

    Scalable and Flexible Gradient Boosting

    ...It also offers parallel tree boosting (GBDT, GBRT or GBM) that can quickly and accurately solve many data science problems. XGBoost can be used for Python, Java, Scala, R, C++ and more. It can run on a single machine, Hadoop, Spark, Dask, Flink and most other distributed environments, and is capable of solving problems beyond billions of examples.
    Downloads: 2 This Week
    Last Update:
    See Project
  • 14
    ANTLR

    ANTLR

    Parser generator to read, process, or translate structured text

    ...It’s widely used in academia and industry to build all sorts of languages, tools, and frameworks. Twitter search uses ANTLR for query parsing, with over 2 billion queries a day. The languages for Hive and Pig, the data warehouse and analysis systems for Hadoop, both use ANTLR. Lex Machina uses ANTLR for information extraction from legal texts. Oracle uses ANTLR within SQL Developer IDE and their migration tools. NetBeans IDE parses C++ with ANTLR. The HQL language in the Hibernate object-relational mapping framework is built with ANTLR.
    Downloads: 8 This Week
    Last Update:
    See Project
  • 15
    syslog-ng

    syslog-ng

    Log management solution that improves the performance of SIEM

    ...Search billions of logs in seconds using full text queries with Boolean operators to pinpoint critical logs. syslog-ng Store Box provides secure, tamper-proof storage and custom reporting to demonstrate compliance. syslog-ng can deliver data from a wide variety of sources to Hadoop, Elasticsearch, MongoDB, and Kafka as well as many others. syslog-ng flexibly routes log data from X sources to Y destinations. Instead of deploying multiple agents on hosts, organizations can unify their log data collection and management. syslog-ng Store Box provides automated archiving, tamper-proof encrypted storage, granular access controls to protect log data. ...
    Downloads: 13 This Week
    Last Update:
    See Project
  • 16
    SeaweedFS

    SeaweedFS

    Distributed storage system for blobs, objects, files, and data lake

    ...Blob store has O(1) disk seek, local tiering, cloud tiering. Filer supports cross-cluster active-active replication, Kubernetes, POSIX, S3 API, encryption, Erasure Coding for warm storage, FUSE mount, Hadoop, WebDAV. SeaweedFS is an independent Apache-licensed open source project with its ongoing development made possible because of the community. SeaweedFS is a simple and highly scalable distributed file system. There are two objectives, to store billions of files, and to serve the files fast! SeaweedFS started as an Object Store to handle small files efficiently. ...
    Downloads: 2 This Week
    Last Update:
    See Project
  • 17
    Jupyter Enterprise Gateway

    Jupyter Enterprise Gateway

    Enables Jupyter Notebooks to share resources across clusters

    ...By default, the Jupyter framework runs kernels locally - potentially exhausting the server of resources. By leveraging the functionality of the underlying resource management applications like Hadoop YARN, Kubernetes, and others.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 18
    Apache Flink

    Apache Flink

    Stream processing framework with powerful stream

    Apache Flink is a distributed engine for stateful computations over data streams and batches, designed for low-latency processing at scale. Its core runtime executes dataflow graphs with fine-grained backpressure and checkpointing, allowing applications to recover consistently from failures. Flink’s event-time model and watermarks enable accurate out-of-order processing, windowing, and complex time semantics that typical real-time systems struggle with. Developers program against high-level...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 19

    JRecord

    Read Cobol data files in Java

    ...The source is now available at https://github.com/bmTas/JRecord Projects using JRecord include: * https://github.com/thospfuller/rcoboldi - Cobol File in R * https://github.com/tmalaska/CopybookInputFormat - Cobol files in Hadoop * https://github.com/gss2002/copybook_formatter * https://github.com/gss2002/ftp2hdfs has some code that allows ftping RDW files directly from the Mainframe into Hadoop/HDFS as a mapreduce job or standalone client.
    Downloads: 34 This Week
    Last Update:
    See Project
  • 20
    TensorFlowOnSpark

    TensorFlowOnSpark

    TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters

    By combining salient features from the TensorFlow deep learning framework with Apache Spark and Apache Hadoop, TensorFlowOnSpark enables distributed deep learning on a cluster of GPU and CPU servers. It enables both distributed TensorFlow training and inferencing on Spark clusters, with a goal to minimize the amount of code changes required to run existing TensorFlow programs on a shared grid.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 21
    OpenTSDB

    OpenTSDB

    A scalable, distributed time series database

    OpenTSDB is a distributed, scalable Time Series Database (TSDB) written on top of HBase. OpenTSDB was written to address a common need: store, index and serve metrics collected from computer systems (network gear, operating systems, applications) at a large scale, and make this data easily accessible and graphable. Store and serve massive amounts of time series data without losing granularity. Generate graphs from the GUI, pull from the HTTP API, choose an open source front-end. OpenTSDB...
    Downloads: 0 This Week
    Last Update:
    See Project
  • 22
    spatial-framework-for-hadoop

    spatial-framework-for-hadoop

    The Spatial Framework for Hadoop allows developers

    The Spatial Framework for Hadoop allows developers and data scientists to use the Hadoop data processing system for spatial data analysis. For tools, samples, and tutorials that use this framework, head over to GIS Tools for Hadoop. At the root level of this repository, you can build a single jar with everything in the framework using Apache Ant. Alternatively, you can build a jar at the root level of each framework component.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 23
    SZT-bigdata

    SZT-bigdata

    SZT‑bigdata is an open source project

    SZT‑bigdata is an open-source project analyzing real Shenzhen metro (subway) card usage data using big‑data frameworks like Spark, Hadoop, Hive, Kafka, Flink, ClickHouse, HBase, and Elasticsearch. Aimed at exploring transit passenger flow patterns and system optimization using a variety of Scala-based technologies.
    Downloads: 0 This Week
    Last Update:
    See Project
  • 24
    Open Source Data Quality and Profiling

    Open Source Data Quality and Profiling

    World's first open source data quality & data preparation project

    ...This tool is developing high performance integrated data management platform which will seamlessly do Data Integration, Data Profiling, Data Quality, Data Preparation, Dummy Data Creation, Meta Data Discovery, Anomaly Discovery, Data Cleansing, Reporting and Analytic. It also had Hadoop ( Big data ) support to move files to/from Hadoop Grid, Create, Load and Profile Hive Tables. This project is also known as "Aggregate Profiler" Resful API for this project is getting built as (Beta Version) https://sourceforge.net/projects/restful-api-for-osdq/ apache spark based data quality is getting built at https://sourceforge.net/projects/apache-spark-osdq/
    Downloads: 3 This Week
    Last Update:
    See Project
  • 25
    geometry-api-java

    geometry-api-java

    The Esri Geometry API for Java enables developers to write apps

    The Esri Geometry API for Java can be used to enable spatial data processing in 3rd-party data-processing solutions. Developers of custom MapReduce-based applications for Hadoop can use this API for spatial processing of data in the Hadoop system. The API is also used by the Hive UDF’s and could be used by developers building geometry functions for 3rd-party applications such as Cassandra, HBase, Storm and many other Java-based “big data” applications.
    Downloads: 0 This Week
    Last Update:
    See Project