🔧 AI-Powered News AggregationApril 28, 2026✅ Tests passing

News Data Pipeline

A Python library that provides a pipeline for fetching, cleaning, and summarizing news data. It enables developers to integrate AI-powered content curation into their own applications by providing modular components for API access, preprocessing, and summarization.

View on GitHub Download ZIP

Share:X / Twitter LinkedIn Reddit Hacker News

What It Does

Fetch news articles from NewsAPI.
Clean and preprocess fetched news articles.
Summarize news content using a mock summarization pipeline.

Installation

Install the required dependencies using pip:

pip install requests pandas

Usage

Run the script from the command line:

python news_data_pipeline.py --api_key YOUR_NEWSAPI_KEY --topic "Artificial Intelligence" --page_size 5

Replace YOUR_NEWSAPI_KEY with your NewsAPI key and specify the topic you want to fetch news for.

Source Code

import requests
import pandas as pd
from unittest.mock import MagicMock

class NewsPipeline:
    def __init__(self, api_keys):
        """
        Initialize the NewsPipeline with API keys.

        :param api_keys: Dictionary containing API keys for supported news APIs.
        """
        self.api_keys = api_keys
        self.summarizer = MagicMock()

    def fetch_news(self, topic, api="newsapi", page_size=10):
        """
        Fetch news articles based on a topic.

        :param topic: Topic to search for.
        :param api: API to use (default: 'newsapi').
        :param page_size: Number of articles to fetch (default: 10).
        :return: List of news articles.
        """
        if api == "newsapi":
            url = "https://newsapi.org/v2/everything"
            params = {
                "q": topic,
                "pageSize": page_size,
                "apiKey": self.api_keys.get("newsapi")
            }
            response = requests.get(url, params=params)
            if response.status_code == 200:
                return response.json().get("articles", [])
            else:
                response.raise_for_status()
        else:
            raise ValueError(f"Unsupported API: {api}")

    def clean_data(self, articles):
        """
        Clean and preprocess the fetched news articles.

        :param articles: List of raw news articles.
        :return: DataFrame of cleaned articles.
        """
        data = []
        for article in articles:
            data.append({
                "title": article.get("title"),
                "description": article.get("description"),
                "content": article.get("content"),
                "url": article.get("url"),
                "publishedAt": article.get("publishedAt")
            })
        return pd.DataFrame(data)

    def summarize(self, text, max_length=50):
        """
        Summarize a given text using a pre-trained model.

        :param text: Text to summarize.
        :param max_length: Maximum length of the summary.
        :return: Summarized text.
        """
        if not text:
            return ""
        summaries = self.summarizer(text, max_length=max_length, min_length=10, do_sample=False)
        return summaries[0]["summary_text"] if summaries else ""

    def fetch_and_summarize(self, topic, api="newsapi", page_size=10):
        """
        Fetch, clean, and summarize news articles based on a topic.

        :param topic: Topic to search for.
        :param api: API to use (default: 'newsapi').
        :param page_size: Number of articles to fetch (default: 10).
        :return: List of summarized articles.
        """
        articles = self.fetch_news(topic, api, page_size)
        cleaned_data = self.clean_data(articles)
        cleaned_data["summary"] = cleaned_data["content"].apply(self.summarize)
        return cleaned_data.to_dict(orient="records")

if __name__ == "__main__":
    import argparse
    import json

    parser = argparse.ArgumentParser(description="News Data Pipeline")
    parser.add_argument("--api_key", required=True, help="API key for NewsAPI")
    parser.add_argument("--topic", required=True, help="Topic to fetch news for")
    parser.add_argument("--page_size", type=int, default=10, help="Number of articles to fetch")
    args = parser.parse_args()

    pipeline = NewsPipeline(api_keys={"newsapi": args.api_key})
    try:
        result = pipeline.fetch_and_summarize(topic=args.topic, page_size=args.page_size)
        print(json.dumps(result, indent=2))
    except Exception as e:
        print(f"Error: {e}")

Community

Downloads

···

Rate this tool

No ratings yet — be the first!

Details

Tool Name: news_data_pipeline
Category: AI-Powered News Aggregation
Generated: April 28, 2026
Tests: Passing ✅
Fix Loops: 2

Quick Install

Clone just this tool:

git clone --depth 1 --filter=blob:none --sparse \
  https://github.com/ptulin/autoaiforge.git
cd autoaiforge
git sparse-checkout set generated_tools/2026-04-28/news_data_pipeline
cd generated_tools/2026-04-28/news_data_pipeline
pip install -r requirements.txt 2>/dev/null || true
python news_data_pipeline.py

Links

View source on GitHub Raw README.md